# Week 5: Machine Learning 2 – Applications

#### Summaries

• Week 5: Machine Learning 2 - Applications > 5a Machine Learning Application: Probabilistic Modeling 1 > 5a Video
• Week 5: Machine Learning 2 - Applications > 5b Probabilistic Modeling 2 > 5b Video
• Week 5: Machine Learning 2 - Applications > 5c Topic Modeling > 5c Video
• Week 5: Machine Learning 2 - Applications > 5d Probabilistic Inference > 5d Video
• Week 5: Machine Learning 2 - Applications > 5e Machine Learning Application: Prediction of Preterm Birth > 5e Video
• Week 5: Machine Learning 2 - Applications > 5f Data Description and Preparation > 5f Video
• Week 5: Machine Learning 2 - Applications > 5g Methods for Prediction of Preterm Birth > 5g Video
• Week 5: Machine Learning 2 - Applications > 5h Results and Discussion > 5h Video
• Week 5: Machine Learning 2 - Applications > 5i Summary and Conclusion > 5i Video

#### Week 5: Machine Learning 2 – Applications > 5a Machine Learning Application: Probabilistic Modeling 1 > 5a Video

• Today I’m going to talk with you about modern probabilistic models for massive data.
• So the idea behind probabilistic modeling is that it’s an efficient framework for discovering useful patterns in massive data.
• We took 1.8 million articles from The New York Times, and then each block here is what’s called a topic.
• We can take large collections of brain data- this is brain data collected while people are undergoing psychological experiments- and uncover hidden patterns of brain activity.
• The idea behind probabilistic modeling involves what I call the probabilistic modeling pipeline, where we take our knowledge of the domain- that’s right there- use it to make assumptions about the data.
• These assumptions are about what hidden patterns live in the data, even though we can’t see them.
• Then we take our data and square the assumptions with the data to discover the particular instantiation of the patterns that we find in our particular data set. And then, finally, we use those patterns, over here, to predict and explore and do whatever it is we want to do with the data.
• So this is a framework, really, for doing customized data analysis, and it’s become crucial to many fields.
• In neuroscience we make some assumptions about the data.
• In genetics, we make other kinds of assumptions about the data, in social networks, still other assumptions.
• In doing that, in separating those key activities, it facilitates solving modern data science problems.
• So our goal in statistical machine learning is to develop this pipeline into a flexible, powerful, and easy-to-use way to solve real world problems with data.
• The challenges to doing that are one, to develop new ways to build flexible models, in other words, new ways to express many kinds of assumptions that we might have about our data.
• Two, to develop algorithms that work on many problems and with massive data sets.
• I want to have algorithms that can accommodate many, many assumptions and work with large data sets that we frequently encounter.
• These ideas move forward when we use them to solve new kinds of problems and find new challenges in how to make assumptions or compute under those assumptions.
• Probabilistic topic models are models for analyzing large collections of text to uncover their hidden themes and then use those hidden themes in downstream analyses.
• The idea behind topic models is that as more electronic texts have become available to us, we have new needs- needs to organize, visualize, summarize, search, form predictions about and, generally speaking, understand these big collections of documents or other texts that are suddenly available to us in electronic form.
• The key idea at the highest level is that topic modeling first takes that collection and discovers the hidden thematic structure that lives inside it.
• We have topics like “Stars,” “Astronomers,” “Universe,” “Galaxies.” And here is a topic about atmospheric science- “Ozone,” “Atmosphere,” “Measurements,” “Stratosphere,” “Concentrations,” and so on.

#### Week 5: Machine Learning 2 – Applications > 5b Probabilistic Modeling 2 > 5b Video

• So how does topic modeling work? Well, the basic idea is to start with an intuition, the intuition that documents exhibit multiple topics.
• What topic modeling does, and I’m going to describe for you first the simplest topic model, called “Latent Dirichlet Allocation,” is take those assumptions and formalize them in a formal, probabilistic model of text.
• So for Latent Dirichlet Allocation, or LDA, the idea is that there is a collection of topics here that exist across the collection.
• OK? So before I said a topic is a group of words associated under a single theme.
• Here is a topic about neuroscience, words like “Brain,” “Neuron,” “Nerve” with high probability.
• Here is a topic about data analysis, words like “Data,” “Numbering,” “Computer” with high probability.
• Now, each topic is a full distribution over a fixed vocabulary.
• I emphasize here that the word “DNA,” for example, has some probability in the neuroscience topic, but it just has low probability.
• I look up which topic that blue button is associated with and choose the word at random from the corresponding distribution.
• That’s where I choose the word “Analysis,” or “Computer,” or “Numbers.” Sometimes I choose the yellow button, and then I choose words like “Gene.” Sometimes I choose the pink button, and I choose words like “Life,” and “Organism.” And the idea is that I repeatedly draw buttons from this cartoon histogram, and then look up the corresponding topic and draw a word from that topic, and that’s how I read an article for Science.
• What’s important to see from this picture is that the topics are fixed for the whole collection, but each document exhibits those topics to different degrees.
• We’re using it here as well to model text, to model documents as heterogeneous collections of topics, where the topics exist for the whole collection, but each document exhibits them in different proportions.
• OK. But as we talked about, the idea behind topic modeling is that we want to discover all this topical structure.
• The statistics problem- the machine learning problem- is to take our existing stack of documents and fill in all of this structure that we imagine exists in the collection, the topic assignments for each word, the topic proportions for each document, and of course, the topics themselves that can summarize the collection.
• That’s what topic modeling is doing, making assumptions about how topics manifest in large document collections, and then somehow inverting those assumptions to take the large document collection and infer what all the topical structure is that underlies it.
• The idea is that there are k topics, say 100 topics in a collection, and those are represented by beta k, beta 1, beta 2, all the way up to beta 100.
• For each document- that’s the document plate- we first choose a distribution over those topics, that’s theta.
• That’s z, called the topic assignment, and then, choose the corresponding word from the topic that it’s associated with.
• So if z points to the blue topic, then we choose Wdn from the blue distribution over words.
• In this case, that conditional distribution is the distribution of the topic proportions, the topic assignments, and the topics, given a big stack of articles.
• From that collection of documents, we then use that posterior distribution to infer the topic proportions, the topics themselves, or the topic assignments, and then use the posterior expectations of those variables to do whatever it is we want to do- information retrieval, document similarity, document exploration and navigation- whatever it is we’re trying to do.
• This is what I meant in the very beginning of this segment when I said that in topic modeling, we first uncover the topics- that’s computing that posterior distribution of the topics and all of the other structure- and then, we use what we just uncovered, annotate the collection as though it’s the truth, and then use those annotations, i.e., the posterior expectations, to do whatever it is we’re trying to do.
• What we’re going to do is ask for a 100-topic LDA model, using an algorithm called variational inference.
• What are the topic proportions for each document? What are the topics that describe the whole collection? All right, here’s that article I showed you in the beginning.
• Now I’m going to ask, what are the topic proportions for this article? OK, you can see on the side of the figure that there is now a real histogram with 100 topics on the x-axis, and only a handful of the topics are activated.
• OK? So even though the model has 100 topics to play with to describe this article, it only activates a handful of them to describe the words of it.
• I can look at the most frequent terms from the most frequent topics that describe this article.
• It’s a picture of ingesting Science and plotting all of the topics that we find.
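
The generative process described in this segment (topics beta for the whole collection, topic proportions theta per document, a topic assignment z and a word w per position) can be sketched in a few lines. This is only an illustrative sketch using the Python standard library: the function names, the toy vocabulary, and the Dirichlet parameters `alpha` and `eta` are assumptions, not values from the lecture.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, eta=0.5, seed=0):
    """Generate documents following the LDA story told in the lecture."""
    rng = random.Random(seed)
    # Topics beta_1..beta_K: each one is a full distribution over the vocabulary.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # theta_d: how strongly this document exhibits each topic.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)    # topic assignment z_dn
            w = sample_categorical(beta[z], rng)  # word w_dn drawn from topic z
            words.append(vocab[w])
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs
```

Calling `generate_corpus(["brain", "neuron", "data", "computer", "gene", "dna"], num_topics=3, num_docs=4, doc_length=20)` returns the topics, the per-document proportions, and the generated words. Inference runs this story in reverse: only the words are observed, and everything else has to be filled in.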

#### Week 5: Machine Learning 2 – Applications > 5c Topic Modeling > 5c Video

• Discovery of topics happens through the posterior distribution- the conditional distribution of the unobserved quantities given the observed quantities.
• Topic modeling algorithms approximate this posterior.
• To understand how LDA works, how we find these topics, these semantically coherent patterns of words, we only need to think about what the posterior means.
• In one goal, it wants each document to allocate its words to few topics.
• The punchline is that LDA wants each document to only exhibit a few topics.
• LDA pays a price if it uses many topics to describe a particular document.
• The second goal, mirroring this goal, is that in each topic LDA wants to assign high probability to very few terms.
• OK? So LDA does not want a topic where every term has some probability.
• Rather, it wants topics where very few terms have high probability and most terms have low probability.
• If I put a document in a single topic, this makes it hard for the topics to give high probability to few terms, because that one topic needs many terms to cover the words of the document.
• If I put very few terms in each topic, that makes it hard for documents to exhibit few topics, because I need to use many topics to cover the terms that the documents have inside them.
• LDA casts the problem of topic discovery as a posterior inference problem in a probability model.
• Topic modeling means building models of text that use these latent themes in different ways.
• Further, algorithms for topic modeling, which we’ll talk about a little bit in the next section, let us fit these models now to massive data sets.
• If we want to find topics that change over time, we can fit what’s called a dynamic topic model, where the topics no longer just are a static set of distributions over words, but a changing collection of distributions over words changing by time.
• Here, we’re fitting a dynamic topic model to Science from 1880 till 2000.
• One of the topics looks like this, where in 1880 it has words like electric, machine, power, steam, iron, battery, wire, and then it smoothly changes through time.
• So rather than finding a single pattern of words, this topic model has found a smoothly changing pattern of words that captures something like scientific apparatus and devices.
• The output is these complex sequences of topics.
• Here’s another example where we use a topic model to analyze two different types of data simultaneously.
• So this is a topic model that’s fit to both images and captions.
• What you can see in this picture- so we take a bunch of images, a bunch of captions for those images, and fit a topic model.
• Then in this picture, the topic model is trying to predict what words might be used to annotate these images.
• We can use topic models in a geospatial way to capture things like events.
• These are just a few examples of ways to expand on that simple LDA model to build more complicated models of text and topics.
• At an even higher level, topic modeling is a case study in doing text analysis with probability models.
• Topic modeling research seen through the lens of this image either develops new models, making different assumptions about text with different kinds of latent variables, develops new algorithms, new ways of approximating that posterior distribution, which is the central mathematical object that we’re interested in, or develops new applications, visualizations, tools, new ways to use the discovered patterns to do things with text.
• So before I talk a little bit about those algorithms, even at a high level, I wanted to show you a simple example of how to fit a topic model.
• With these five simple lines of R, we read the documents, we set the number of topics, and we set a couple of hyperparameters.
• You could see topic 20, for example, is about neuroscience, and topic 14 is about cells, and topic 13 is about proteins.
• So this is a picture of the topic model visualization engine, which is some work by Allison Chaney.
• You can take the output of a topic model and use it to build a browser of your collection that uses those topics as scaffolding.
• So if you have a completely unorganized collection of documents on your hard drive, you can run that R code, and then put the output through this topic modeling visualizer, and you have an automatically-created navigator of your articles.
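
The fitting steps mentioned in this segment (read the documents, set the number of topics, set a couple of hyperparameters) can be illustrated with a small sketch. Note the assumptions: this is not the R code from the lecture, and it swaps in collapsed Gibbs sampling rather than the variational inference discussed in the course, because Gibbs sampling fits in a few lines of standard-library Python. The function name `fit_lda_gibbs` and the default values of `alpha`, `eta`, and `num_iters` are hypothetical.

```python
import random


def fit_lda_gibbs(docs, num_topics, alpha=0.1, eta=0.01, num_iters=200, seed=0):
    """Fit LDA to tokenized documents with a collapsed Gibbs sampler (sketch)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                     # total words assigned to topic k
    assignments = []

    # Start from random topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample: (how much doc d likes topic k) x (how much topic k likes word v).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + eta) / (topic_total[k] + V * eta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed estimates of the topics and of each document's topic proportions.
    topics = [[(topic_word[k][v] + eta) / (topic_total[k] + V * eta)
               for v in range(V)] for k in range(num_topics)]
    proportions = [[(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
                    for k in range(num_topics)] for d, doc in enumerate(docs)]
    return vocab, topics, proportions
```

The two factors in `weights` mirror the two goals described earlier in this segment: a document prefers topics it already uses, and a topic prefers words it already gives high probability to.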

#### Week 5: Machine Learning 2 – Applications > 5d Probabilistic Inference > 5d Video

• Given a model, how do we use an algorithm to discover the hidden patterns in the data? All right, again, I refer to this chart.
• So this is the probabilistic pipeline, where we make assumptions, discover patterns, and then use those patterns to predict and explore our data.
• How do we take our model, take our data, and infer the hidden variables- like the topics- from the data? What do we need to do inference successfully? We need scalable and generic inference, inference that works in a lot of different kinds of models and that scales to massive data sets.
• So stochastic variational inference takes a massive data set and an estimate of the patterns.
• Here is the massive data set and here is an estimation of some patterns in that data.
• It goes through this loop where we repeatedly subsample our data.
• Then use that inferred local structure to update the global structure, to update our patterns of the data, and repeat.
• In contrast, traditional algorithms have to cycle through all of the data over and over again in order to make progress.
• The idea is that each data point is governed by both its local latent variable and the global latent variable, but not by other local latent variables.
• The goal in inference is to approximate the posterior distribution of beta and Z. And the key problem when trying to scale inference to massive data is the global latent variables, beta, that is, the topics.
• This is what makes inference slow when we have a massive data set or massive document collection.
• So what variational inference does, which is the type of algorithm that stochastic variational inference builds on, is solve the problem of inference: it approximates that conditional distribution by finding a distribution that’s close to it.
• Even in the case of a simple variational family, solving that optimization problem is hard with a massive data set.
• It has really been enabling modern machine learning to scale to massive data sets.
• This is yet another example of stochastic optimization enabling machine learning to do powerful things with large data sets.
• Subsampling the data gives us a noisy gradient of the variational objective function because the true gradient of the variational objective function with respect to the global variational parameters- the distribution of the global hidden variables- that true gradient involves all the data.
• When we subsample the data and scale, we get a noisy gradient.
• I’ve talked about topic modeling, which really sits in the larger field of probabilistic modeling, which sits in the larger field of statistics / machine learning / data science.
• First, we assume our data comes from a model with hidden patterns at work.
• Recall this picture from the topic modeling segment where we assume that our data came from a model that had topic proportions, topic assignments, and topics.
• We use an algorithm to discover those patterns in the data.
• We fill in all of that hidden topical structure by using something like stochastic variational inference, which lets us approximate the posterior distribution, which is a function of our model, at scale, with large data sets.
• Finally, we use those discovered patterns to predict about and explore our data.
• These all posit a model, solve an inference problem using something like stochastic variational inference, and then take the output of that inference, the approximate posterior distribution of the hidden variables given the observations, and use it to visualize downstream the data in new ways.
• So this again is the probabilistic modeling pipeline, where we make assumptions about our data with hidden quantities at work.
• We then discover what those hidden quantities are in our actual data set, using an inference algorithm.
• Then we use those discoveries to do what it is we’re trying to do with our data.
• Our perspective is that customized data analysis is important to many fields.
• In other words, using our knowledge about the world to make assumptions is separate from the computational problem where, once we’ve formalized our assumptions into a mathematical model, we have a statistical or algorithmic problem to solve when we want to uncover patterns from existing data.
• In this way, this pipeline facilitates solving collaborative data science problems.
• Everybody gets together to understand how to perform predictions and explore the data.
• For example, Bayesian nonparametric modeling gives us models where the number of components or topics can expand with the data, and it can be extended to models where we can learn things like latent trees and latent graphs that underlie the data.
• We need inference that scales to massive data sets and that’s generic to many, many, many models.
• Stan is a language where we can write a model down as a program and then compile that model down into an inference algorithm that takes data as input and spits out approximate posteriors.
• This is a new goal for machine learning because it facilitates making this pipeline easy to use for many kinds of problems and data sets.
• This is part of the same point that something like Stan or other probabilistic programming systems allow domain experts who don’t necessarily need to know much about inference to write down models and explore what those models say about their data.
• Finally, I want to point you to an amazing paper by John Tukey called The Future of Data Analysis from 1962.
• This paper really speaks, to me at least, about what it means to use probability models to solve domain-specific problems and to build out probabilistic modeling as a language for solving many kinds of problems with data.
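
The loop described in this segment (subsample the data, scale to get a noisy gradient, update the global estimate, repeat) can be illustrated on a deliberately tiny problem. This is not stochastic variational inference for LDA. It is a toy sketch, under assumed names and an assumed step-size schedule, of the stochastic optimization idea that stochastic variational inference builds on: the "global pattern" here is just the mean `mu` that minimizes the sum of squared distances to all data points.

```python
import random


def stochastic_mean_estimate(data, batch_size=50, num_steps=500, kappa=0.7, seed=0):
    """Estimate the mean of `data` using noisy gradients from subsamples."""
    rng = random.Random(seed)
    n = len(data)
    mu = 0.0  # current estimate of the global quantity
    for t in range(num_steps):
        # 1. Subsample the data instead of touching all of it.
        batch = rng.sample(data, batch_size)
        # 2. Gradient of 0.5 * sum_i (x_i - mu)^2 on the batch, scaled by
        #    n / batch_size so that it is an unbiased estimate of the full gradient.
        noisy_grad = (n / batch_size) * sum(mu - x for x in batch)
        # 3. Follow the noisy gradient with a decreasing step size.
        step = (t + 1) ** (-kappa) / n
        mu -= step * noisy_grad
    return mu
```

Each step looks at only 50 points, yet the estimate settles near the full-data mean, because the noise in the gradients averages out as the step size shrinks. A traditional batch algorithm would instead compute the exact gradient over all the data at every step.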

#### Week 5: Machine Learning 2 – Applications > 5e Machine Learning Application: Prediction of Preterm Birth > 5e Video

• I’m happy to share with you today our last experience with the prediction of preterm birth, an effort of a team of computer scientists and physicians for the last couple of years.
• First of all, I will talk about the problem of preterm birth and why it is such an exciting application for machine learning.
• In part three, I will talk about the methods we used for prediction of preterm birth.
• So what is preterm birth? Preterm birth is the birth of a baby before 37 weeks’ completed gestation.
• A usual pregnancy lasts 40 weeks, but preterm birth is when the baby comes earlier, before 37 weeks’ gestation.
• The statistics of preterm birth worldwide are alarming.
• According to a recent report from the World Health Organization and other entities, it is estimated that about 15 million babies are born preterm every year in the world.
• Preterm birth is not only the leading cause of newborn deaths; it’s also the leading cause of long-term disabilities.
• Advanced obstetrics distinguishes between indicated preterm birth, which accounts for 30% of cases of preterm birth, and spontaneous preterm birth, which accounts for the rest, 70%. Indicated preterm birth happens when the physician decides to stop the pregnancy because its continuation would threaten the life of either the baby or the mother.
• Spontaneous preterm birth accounts for the remaining 70%, which is the largest proportion.
• There has been a lot of research on the risk factors or etiologies of preterm birth.
• For indicated preterm birth, the physician can decide to stop pregnancy because of a fetal indication or an obstetric indication.
• These include history of preterm birth- if the mother has a previous preterm birth, she’s likely to have another one; cervical insufficiency; multiple pregnancies, such as carrying twins or triplets; race, genetic factors; infection; short interval between pregnancy; the age of the mother; psychological health; lifestyle, socioeconomic status, and so on and so forth.
• Despite lots of research, preventing preterm birth is still an unsolved problem.
• Previous research has largely focused on individual risk factors and less on how they combine to lead to preterm birth.
• Among clinicians, the strongest and most widely used risk factors are history of preterm birth and short cervix.
• History of preterm birth is not available for women who are giving birth for the first time, nulliparous women, and these represent 40% of pregnant women in the United States.
• A 2011 study revealed no trials of the use of any risk-scoring system to prevent preterm birth.
• The second group includes past history, such as the number of abortions, short interval between pregnancies, previous preterm delivery, history, and so on.
• If the woman has a history of preterm birth, this would cost 10 points.
• A score between 0 and 5 is classified as low risk; between 6 and 9 as medium risk; and if the mother scores 10 or more points, she is deemed to be at high risk of preterm birth.
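
The thresholds of the risk-scoring system described here can be written down as a small function. This is only a sketch: the function name is hypothetical, integer scores are assumed, and treating a score of exactly 10 as high risk is an assumption about the boundary (a history of preterm birth alone already costs 10 points).

```python
def risk_category(score):
    """Map an integer risk score to the category used by the scoring system."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 5:
        return "low"
    if score <= 9:
        return "medium"
    return "high"  # assumption: 10 points and above
```

For example, a woman whose only risk factor is a history of preterm birth scores 10 points and is placed in the high-risk group.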

#### Week 5: Machine Learning 2 – Applications > 5f Data Description and Preparation > 5f Video

• Remarkably, current preterm treatments were not in use when this data was collected in the early ’90s. As a result, this data set is a compelling benchmark for natural incidence of preterm birth, independent of any treatment.
• We have received a large matrix of about 3,000 women described by 445 features, or variables.
• Overall, you can see, in this snapshot, a mix of binary features, numerical features, categorical features, and you can see a substantial amount of missing values in the data.
• A substantial amount of the feature data, as I said, is missing.
• First of all, we handle the complexity of the data by organizing features into groups.
• The DMG, or Demographic, group includes features such as the mother’s age, marital status, whether she has a home phone, a car, family support, income (all of these being proxies for the woman’s socioeconomic status), her education, race, insurance information, et cetera.
• This group includes features such as the total pregnancies, number of previous preterm deliveries, number of previous spontaneous deliveries, number of induced abortions, number of stillbirths over 20 weeks, number of previous neonatal deaths, number of previous children still living, and number of spontaneous miscarriages.
• At each visit, that is 24, 26, 28, and 30 weeks gestation, a set of feature groups is collected.
• At each visit, there was a group of features that were collected that we described earlier.
• We focus our study on the three major visits at ticks T0, T1, and T3. The T0 data set is composed of 3,002 examples, described by 50 features.
• T1 contains 2,929 examples, described by 205 features.
• Finally, T3 represents data for 2,549 women, described by 316 features.
• By separating the data into multiple data sets, we can focus on specific sub-tasks of preterm birth in the effort to devise a refined model.
• The aim is to ask: can we predict spontaneous preterm birth, the most complicated problem? Can we predict preterm birth for a nulliparous woman without a history of preterm birth? The size and feature count for each preterm versus full-term birth problem are presented here.
• Since features are obtained from various sources- history, measurements, questionnaires, et cetera- they are not always uniform.
• In particular, yes/no features are converted to binary, 1/0 values.
• Categorical features are converted to a set of binary features, while unusual values are replaced with reasonable approximations.
• Features with arbitrary ranges were also normalized to the [0, 1] interval, so that all features are on the same scale and hence have the same importance.
• Consider some of the features from the DMG group.
• The BPPHONE, has home phone, and BPCAR, use of a car, are yes/no features.
• BPMARITL, for marital status, is a categorical feature with four categories.
• The AGEMOM, age of the mother, and SCHOOLYR, total years of schooling, features both have unusual values, and different integer ranges.
• We replace unusual values in the last two features by rounding off from below and above.
• The second challenge we face is the problem of missing, incomplete data, a very common problem in data science.
• Our main objective in this work is to retain as many features and examples as possible.
• We prefer to fill in or complete the missing values, rather than deleting features.
• Since a substantial number of feature values, as you saw, are missing, we follow a simplified approach and treat features equally, whether they are randomly or structurally absent.
• Missing values are filled in with the most common value for categorical features or the mean value for numerical features.
• Some features require nontrivial processing steps, and for those we sometimes include the range, mean, or median, and other features in the computation as well.
• All 11 of these features can be randomly missing due to the patient not undertaking a test, or measurements not being reported after a visit.
• The red column gives the number of missing values for each feature.
• For several of these features, such as 1-4, we prefer to complete the missing values with the mean of the actual responses.
• Other features, such as 5-7, have too many missing values, and so we reluctantly remove those features from the data set.
• Finally, some features, 9-11, have a particular meaning, such as weight or height at different points of the pregnancy.
• If the weight is not measured during visit one, then we set feature 10, WEIGHTV1, to the weight before pregnancy, feature 11, plus the average difference between the weights for all the mothers at visit one, and their weight before pregnancy.
• Whether the missing values are structurally missing or randomly missing, we went through all the features case by case to impute or delete the missing values.
• Some missing values are approximated by the value of another feature.
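
The preparation steps described in this segment can be sketched as small helper functions. Everything here is illustrative: the function names, the category names, and the clipping bounds are hypothetical, missing values are assumed to be represented by `None`, and "rounding off from below and above" is interpreted as clipping to a plausible range, which is an assumption.

```python
from collections import Counter


def yes_no_to_binary(value):
    """Convert a yes/no feature to 1/0."""
    return 1 if value == "yes" else 0


def one_hot(value, categories):
    """Convert a categorical feature to a set of binary features."""
    return [1 if value == c else 0 for c in categories]


def clip(value, low, high):
    """Replace unusually small or large values with the nearest plausible bound."""
    return max(low, min(high, value))


def min_max_normalize(values):
    """Rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def impute_mean(values):
    """Fill missing numerical values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_mode(values):
    """Fill missing categorical values with the most common observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def impute_visit_weight(weight_at_visit, weight_before):
    """Fill a missing visit weight with pre-pregnancy weight plus the average gain."""
    gains = [v - b for v, b in zip(weight_at_visit, weight_before)
             if v is not None and b is not None]
    average_gain = sum(gains) / len(gains)
    return [b + average_gain if v is None and b is not None else v
            for v, b in zip(weight_at_visit, weight_before)]
```

The last helper follows the weight example from the lecture: a mother with a missing visit-one weight gets her pre-pregnancy weight plus the average gain observed among mothers for whom both measurements exist.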

#### Week 5: Machine Learning 2 – Applications > 5g Methods for Prediction of Preterm Birth > 5g Video

• In the previous section, I explained the kind of data we deal with and the steps we took to process it for machine learning.
• Now let’s talk about the methods we use to derive prediction models for preterm birth.
• Each patient is a data point, also called an example in machine learning jargon, and is described by a feature vector x and a discrete label y, which can be minus 1 or plus 1.
• We would like to predict at different time points, and hence we have training data sets for each tick: t0, t1, and t3.
• Before we jumped into model building, as an exploratory data analysis step, we assessed the sparseness of the data.
• We applied the simple nearest neighbor algorithm to the data, a method you know from the previous modules.
• We chose the Euclidean distance and also the cosine distance that’s based on the angles between data points.
• We expected that this phenomenon would lead to challenges in finding a general model that fits the data.
• By this exploration, we wanted to get a sense of how hard the problem will be for us to find a good model.
• The general picture is to use the data at hand as input to train machine learning algorithms to find the classification function f. Once we have f, we can use it to classify a new patient.
• To explain this, here’s an intuitive example drawn from the machine learning literature- how to derive a model that recognizes trees.
• Given a set of trees, can we find a good function f that can accurately classify a new plant as a tree? Let’s suppose that f is just an expression or statement.
• Well, in between these two statements, we can think of many other expressions ranging from the very, very general to the very specific statement.
• We need to find an expression in between with a good generalization ability.
• Remember, we use a training set to find the model, and we test the final model on a test set.
• The axes represent the model complexity versus the error that the function makes.
• If we choose very simple models, we will end up with very high error rates on the training examples.
• If we make the model more complex, we will learn training examples by heart- the very specific expressions in the tree example, and hence, make no mistakes.
• When you include in the big picture the test set error, a model that’s doing poorly on the training set is likely to do poorly on the test set as well.
• A model that is doing wonderfully on the training set by learning it by heart will also do poorly on the test set, because it is unable to generalize.
• Selecting the model should be done in a way that avoids these two extremes, so we should remain somewhere in the middle, where the models are neither too specific nor too general.
• Even better, we could aim for two hyperplanes with the largest possible margin between the points of the two classes.
• Sometimes data is simply not linearly separable, and we need to account for that problem as well.
• With simple calculations using the definition of the distance between a point and the hyperplane, we have that d plus is equal to d minus, which is 1 divided by the norm of the vector w. For an example to be classified correctly, we need to satisfy one of these two constraints for each of h1 and h2, which we could combine together by multiplying by the label yi to make the constraint that we will include in the objective function.
• The second term here involves C, a positive constant, determining the trade-off between maximizing the margin and minimizing the misclassification.
• A margin error occurs if the slack variable is between 0 and 1.
• In order to avoid tuning two cost parameters, C plus and C minus, we set C plus times n plus equal to C minus times n minus, where n plus and n minus are the number of positive and negative examples, respectively.
• Technically, a kernel computes the dot product between the data points in a higher-dimensional space.
• Our concrete aim is to obtain a model with low test errors.
• True positives represent the number of preterm birth cases that were predicted as preterm birth by the model.
• True negatives represent the number of non-preterm birth cases that were predicted as non-preterm birth by the model, while false positives represent the number of non-preterm birth cases that were predicted as preterm birth, and false negatives (these are errors as well) represent the number of preterm birth cases that were predicted as non-preterm birth.
• The metrics we used are the sensitivity, which represents the percent of positive instances that are correctly predicted as positive, and the specificity, which represents the percent of negative instances that are correctly predicted as negative.
• Since the negative class is the majority class, it is not difficult when you build a model to obtain high specificity rates at any tick. To ensure a fair balance between sensitivity and specificity, we use the geometric mean, or g mean, which is the square root of the product of the specificity and sensitivity.
• Each data set is randomly divided into a training set and a test set with an 80 to 20 ratio.
• We don’t use the test set at all during the learning stage.
• This is just meant to evaluate the model or to calculate the out-of-sample error.
• Each class is split proportionally between the sets.
• We then apply a five-fold cross validation to the training set to determine the best model and optimal parameters.
• The best model is then tested on the unseen test set, and confusion matrices for various subsets of the data are recorded along with sensitivity, specificity, and g mean for each of them.
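The evaluation metrics described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names and the toy labels are hypothetical. The `class_costs` helper shows one way to satisfy the constraint C plus times n plus equals C minus times n minus, under the assumption that C minus is the single value being tuned.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # positive case predicted as negative
        elif pred == positive:
            fp += 1  # negative case predicted as positive
        else:
            tn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of positive instances correctly predicted as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Fraction of negative instances correctly predicted as negative."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def g_mean(sens, spec):
    """Geometric mean: square root of sensitivity times specificity."""
    return math.sqrt(sens * spec)

def class_costs(c_minus, n_pos, n_neg):
    """Return (C_plus, C_minus) such that C_plus * n_pos == C_minus * n_neg.

    Assumption for this sketch: C_minus is the one tuned parameter.
    """
    return c_minus * n_neg / n_pos, c_minus

# Toy, made-up labels (1 = positive/minority class, 0 = negative/majority class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print("sensitivity:", sens)
print("specificity:", spec)
print("g mean:", g_mean(sens, spec))
```

In this toy example the specificity is high while the sensitivity is low, and the g mean is pulled down accordingly, which is why it is a more balanced summary than specificity alone when the negative class dominates.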

#### Week 5: Machine Learning 2 – Applications > 5h Results and Discussion > 5h Video

• So we spoke about the data and the preprocessing steps we took.
• As we started diving into the data, we first had some exploration sessions to get familiar with the features and to determine how individual risk factors interrelate.
• Learning association rules from data aims to find strong relations that can be expressed as rules between sets of feature values.
• An association rule is an expression A implies B, where A and B express conditions on features describing the patients in a data set.
• We consider P of A the probability that the event A happens, estimated by the number of occurrences of A in the data, and the same for the event B. The strength of a rule A implies B is evaluated by the support of the rule, which estimates the probability of having both events A and B happening.
• The confidence of the rule estimates the probability of the event B happening, given that the event A has happened.
• In the basic framework, given two thresholds, minimum support and minimum confidence, a rule is said to be strong when its support is greater than minimum support and its confidence is greater than minimum confidence.
• A rule A implies B can be illustrated with a Venn diagram.
• The confidence measure assesses how much the sets A and non A are included in the set B. A rule A implies B is interesting if it has a high support, a high confidence, and is even more interesting if the confidence of the rule non A implies B is low.
• QuantMiner is based on the optimization of the numeric variables or features with genetic algorithms to handle both categorical and numerical data in the rules.
• An exercise we found really useful was to sit with the physicians and explore the data using association rules.
• Another interesting rule we derived using QuantMiner is the following, expressing that 70% of nulliparous patients who gave birth after 36 weeks of gestation had the cervix measured at visit one ranging between 26 and 45 millimeters, and they didn’t have an infection found since visit one.
• This exploration helped us understand better the data we are dealing with, and led to good discussions with the clinicians.
• The number of rules was quite overwhelming and not really meant for prediction.
• We experimented with elastic net models since the data is highly structured, and correlation among groups of features or predictor variables is likely.
• Other categories of features were simply discounted because either they were no longer prevalent, such as DES exposure, or the data set lacked information on that feature, “Head been engaged,” for example.
• We ran the analysis twice, using 7 and 13 points as our boundary or cutting point between low and high risk, based on the distribution of the data.
• As a reminder, this is a secondary analysis on this data set.
• Previous performance on this data with logistic regression without model selection was modest, in the range of 18% to 33% for both sensitivity and specificity for nulliparous and multiparous women.
• We increased the sensitivity and specificity by about 20% to 30%. Specifically, linear support vector machines provide a robust baseline for the quality of performance one can expect from algorithms applied to this data.
• Support vector machines with a nonlinear kernel, RBF, outperform linear support vector machines for the full data set.
• This noticeable improvement gives weight to our supposition that nonlinear methods would work better on this data than their linear counterparts.
• When we consider the entire or full data set, the linear and RBF SVMs perform better with increasing ticks, T0 to T3, that is, as the pregnancy progresses. This reflects our intuition.
• For spontaneous-only data, there is no improvement by using nonlinear support vector machines or increasing the tick.
• We consider the nulliparous data only to be the most difficult of the three data sets.
• Here, for example, for the full data set, the features are ranked according to the overall importance in all the runs.
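The support and confidence measures described earlier in this section for a rule A implies B can be sketched as follows. This is a minimal illustration on made-up toy records, not the QuantMiner implementation. The names `rule_stats` and `is_strong` and the threshold values are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Estimate support and confidence of the rule A implies B from data.

    Returns (support, confidence, confidence_not_a), where confidence_not_a
    is the confidence of the rule "non A implies B".
    """
    n = len(records)
    n_a = sum(1 for r in records if antecedent(r))
    n_ab = sum(1 for r in records if antecedent(r) and consequent(r))
    n_nota_b = sum(1 for r in records if not antecedent(r) and consequent(r))

    support = n_ab / n if n else 0.0                 # estimate of P(A and B)
    confidence = n_ab / n_a if n_a else 0.0          # estimate of P(B given A)
    n_not_a = n - n_a
    confidence_not_a = n_nota_b / n_not_a if n_not_a else 0.0
    return support, confidence, confidence_not_a

def is_strong(support, confidence, min_support, min_confidence):
    """A rule is strong when both measures exceed their thresholds."""
    return support > min_support and confidence > min_confidence

# Toy, made-up records with two boolean conditions A and B.
records = (
    [{"A": True, "B": True}] * 4
    + [{"A": True, "B": False}] * 1
    + [{"A": False, "B": True}] * 1
    + [{"A": False, "B": False}] * 4
)

supp, conf, conf_not_a = rule_stats(
    records, lambda r: r["A"], lambda r: r["B"]
)
print("support:", supp)
print("confidence:", conf)
print("confidence of non A implies B:", conf_not_a)
print("strong:", is_strong(supp, conf, min_support=0.3, min_confidence=0.7))
```

Here the rule has high confidence while the rule non A implies B has low confidence, which is the situation the summary describes as even more interesting.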

#### Week 5: Machine Learning 2 – Applications > 5i Summary and Conclusion > 5i Video

• Our best performing algorithms attained accuracy rates of 60%. This demonstrates that more accurate prediction of preterm birth is not an elusive task.
• Preterm birth is a challenging and complex real-world problem that pushes the boundary of state-of-the-art machine learning methodologies.
• Besides preterm birth being a multifactorial problem, we face the problems of missing data, skewed data distribution, and sparse data, among others.
• Features describing a pregnancy can be either static (for example, history of preterm birth, race, age, parity of the mother) or dynamic.
• The etiologies of preterm birth are believed to be different as pregnancy progresses.
• The reasons for preterm birth vary, depending on whether it is an extreme, severe, moderate, or late preterm birth.
• To add another layer of complexity, preterm birth can happen for different reasons, which means that we face here a case of overlapping classes.
• A patient may experience both preterm labor and preterm premature rupture of the membranes, pPROM.
• What is the proper method for handling such overlapping classes? A last important aspect is that today, we are not even close to understanding why preterm birth happens well enough to develop the right interventions.
• With all these challenges, we bring the preterm birth problem into the machine learning arena as a real-life, challenging, and exciting problem to solve.
• The preterm prediction study data we have used is not very big.
• Recent reports from the Centers for Disease Control and Prevention show a decline in the rate of preterm birth from 2007 to 2014.
• Still, about one in 10 babies is born preterm in the United States, despite immense efforts to address the problem.
• With larger and richer data sets and advanced nonlinear and temporal methods, there is hope of achieving an actionable and clinically useful preterm birth prediction system, and hopefully saving hundreds of thousands of babies.