# Week 4: Machine Learning 1


#### Summaries

• Week 4: Machine Learning 1 > 4a What Is Machine Learning 1 > 4a Video
• Week 4: Machine Learning 1 > 4b What Is Machine Learning 2 > 4b Video
• Week 4: Machine Learning 1 > 4c Classification > 4c Video
• Week 4: Machine Learning 1 > 4d Linear Classifiers > 4d Video
• Week 4: Machine Learning 1 > 4e Ensemble Classifiers > 4e Video
• Week 4: Machine Learning 1 > 4f Model Selection > 4f Video
• Week 4: Machine Learning 1 > 4g Cross Validation > 4g Video
• Week 4: Machine Learning 1 > 4h Machine Learning Summary > 4h Video

#### Week 4: Machine Learning 1 > 4a What Is Machine Learning 1 > 4a Video

• A machine learning method is a computer algorithm that searches for patterns in data.
• If you look at this data just like that, then you probably find it plausible to say that there is a pattern.
• So in this case, if we just look at the data as it is, then the pattern would be that the data is separated into two clouds of points.
• So if I ask you where is the next point going to be, you would probably predict that it’s either going to be here or it’s going to be over here, and it has slightly higher probability to be over here, simply because that cloud is larger.
• If I give you a bit more information and I would tell you, for example, that first all of these points come in one by one, and then the points over here stop and then we start seeing points down here.
• Then at some point, I ask you where the next point is going to be, then you would probably predict that the next point is going to be down here.
• Now, these points in the plane look like something that you might see in a statistics textbook.
• So we start with what I called raw data here at the very top.
• Basically what it means is that we feed our data into the computer and let it extract some kind of measurements on which the actual machine learning method is then going to operate.
• Now, in the second step when we have our working data here, this is the data that we actually use a machine learning method on.
• The next step once we have our own preprocessed working data is to mark patterns.
• So what we do there is we look at our data and we mark a pattern as a pattern.
• Instead, you try to give the computer examples of what you mean by a pattern, and then let it extract that knowledge or some kind of description of what is typical for this pattern, so that can be applied to other data points.
• Then we’re going to split our data, and let me first focus on one part of it.
• So basically we take our entire data set, we split it say into two halves at random, and then we focus on just one part of it.
• This is called the training data, and it consists of data points that are now marked as a certain pattern.
• If we think of the pencil example again, our raw data would be photographs of pencils.
• That is our data, the collection of all these vectors.
• So now the entirety of mark data points gets split into two parts, and we focus on one part of that, and that is our training data.
• Now what it does is it looks at this training data at these marked patterns and tries to extract a description of a typical pattern from this data.
• What that gives us is our trained model or our predictor, which we can then apply to data.
• So once we have the predictor, once we have a trained model, we can get a new data point, a new image that is not marked and predict what its marks should be, what type of pattern it is.
• So that is what this other batch of data is for, for which we did our split.
• That is typically called the test data, again in machine learning nomenclature.
• That is again labeled data, that means data in which the patterns are marked.
• Now we can use our trained model here, apply it to this test data set, and predict for each pattern in there what it is supposed to be.
• So down here we have one part of the data that is blue, and then here we have the other part of the data that is red.
• What we would like to predict now is which of the two colors a data point belongs to. So the task is: extract that pattern from this data, and then for a new data point that somebody gives you, one that is supposed to be generated by the same data source, decide whether it should be colored red or blue, that is, which of the two groups it belongs to. A very simple way that you can do that is by just putting a line between the two.
• This data here is two dimensional, so this is the kind of data that we would see if we only extract two scalar measurements from our data.
• So even though this data is separated by this plane, it might just not generalize well as a predictor.
• So if you think about how we could avoid that problem, then the easiest solution seems that we kind of put the plane in the middle between the two data sets, and that is what I’ve tried to do here.
• So one reason why it is not easy is that placing this line in a good way that generalizes well so that it will still work well for future data points that we have not seen is not an easy task.
• Another problem is that if you have a lot of measurements per object or per data point, so if your dimension gets very high as we typically see in modern applications where instead of two dimensions you might have 1,000 dimensions or even a million dimensions, then these kind of geometrical methods are not that straightforward.
• All you need to represent the plane is a single point contained in the plane, plus one vector.
• So you need one point to fix where in your space the plane is, and then one single vector to determine its direction.
• If you have a vector like this, then it turns out that in addition to representing the plane, it also lets you decide which side of the plane points are on.
• So if you want to decide for this point here on which side of the plane it is, computationally what you do is you project the point onto this vector, and then compute the signed distance to the plane.
• If I take one particular point and it ends up being here, it’s orthogonal projection on the plane is here, then what I would try to compute is the length of this distance here.
• So if you now get a new data point, somebody gives me a new data point and it ends up here then I would again compute my projection onto this vector, it would tell me that I am on this side of the plane, and that would tell me that my prediction is the data point should be colored red.
• If I get a data point down here, then it would be blue, and so on and so forth.
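The projection idea above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's implementation: the function names, the plane, and the test points are all made up, and the convention that the positive side means "red" is an arbitrary choice for this sketch.

```python
# Sketch: deciding which side of a hyperplane a point is on.
# The plane is represented by one point p contained in the plane and a
# normal vector w; names and data here are invented for illustration.

def side_of_plane(x, p, w):
    """Return +1 or -1 depending on which side of the plane x lies.

    The signed projection of (x - p) onto the normal w tells us the
    side; its magnitude (after normalizing w) would be the distance.
    """
    s = sum((xi - pi) * wi for xi, pi, wi in zip(x, p, w))
    return 1 if s >= 0 else -1

def classify(x, p, w):
    # Convention chosen for this sketch: positive side -> "red".
    return "red" if side_of_plane(x, p, w) > 0 else "blue"

# Plane through the origin with normal pointing along the y-axis:
p = (0.0, 0.0)
w = (0.0, 1.0)
print(classify((2.0, 3.0), p, w))   # a point above the plane -> "red"
print(classify((1.0, -4.0), p, w))  # a point below the plane -> "blue"
```

Note that only `p` and `w` are stored; the training data itself is no longer needed once the plane has been chosen.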

#### Week 4: Machine Learning 1 > 4b What Is Machine Learning 2 > 4b Video

• Now, in the procedure that we have just seen, the term learning refers to calibration.
• That is something that I would really like to emphasize, that the learning in machine learning does not simply mean that we’re learning something from data or that we’re extracting knowledge from data.
• What we mean by it is that we learn from data what a pattern is.
• Instead of coding into the computer- into a program- what we mean by a pattern, we let the computer look at examples and let it learn from these examples what a pattern is and what is the typical description of a pattern.
• There are several types of these learning problems.
• That is a type of learning where marked patterns are available, and it is called supervised learning.
• These are collected under the term unsupervised learning.
• Now, I should also mention that this classification into two types of learning methods is not really exhaustive.
• There are methods that are kind of in between supervised and unsupervised learning that have a bit of both.
• So there are other types of learning methods, but these are two very important types.
• Historically, this learning term comes out of the history of machine learning, which is in artificial intelligence and cognitive science.
• In cognitive science in the 1960s, researchers were thinking about how you can formalize the idea that some entity learns - it might be a human, an animal, a computer, or whatever.
• Today’s machine learning really comes out of computer science and artificial intelligence.
• That is how computer science and statistics started to form this field at the intersection that is now called machine learning.
• So a computer is good at performing a lot of simple tasks, like adding things up or sorting a list.
• The humans who programmed the computer might have made the mistake, but the computer itself, once it’s correctly programmed, will not make any mistakes.
• Object recognition is something we’ve gotten better at in machine learning, but it is still pretty much an unsolved problem.
• That gives you some idea of how machine learning is related to statistics.
• I want to show you one more example that is a little bit different, but I think it’s really a very nice illustration of how learning and statistics are related to each other in a very precise manner.
• We have a pendulum that is mounted on a sled and the sled can move sideways on a rail and it is controlled by a computer.
• So all the computer can do is it can move the sled left or right, and it can control the velocity of the sled.
• Now, you can think of the sensor information- the state being fed back into the computer- as a computer’s eyes and ears.
• The system measures the data at 10 Hertz, so you get 10 measurements per second being fed back into the computer.
• You can derive the equations of motion, solve them, and feed the results back into the computer.
• What this system does is it’s supposed to learn without any knowledge of physics.
• It’s just supposed to learn how the world works from data.
• I mean, if you think of a child that’s learning to walk, the child is not learning to walk by sitting down with a physics book and studying classical mechanics.
• Then the computer takes these measurements and we interrupt and do a training step.
• So from this training step, it now tries to learn something about the world.
• What we’re trying to do is learn this function from data, and that is a regression problem.
• So we’re solving this problem of learning how to balance this pendulum upright by solving a very classical statistics problem, and regenerating the data along the way.
• The pendulum actually never touches the optimal point directly over the cart because all the computer does is take some action pretty much at random.
• This is, I think, a very compelling example of the relationship between statistics and learning.
• I think everybody would find it plausible to say that this machine has actually learned something, and it has done so by solving a regression problem.
• To summarize what we’ve seen so far, our definition of a machine learning method is a computer algorithm that searches for patterns in data.
• We have seen that an essential part of that idea is that we learn what pattern means by looking at data examples.
• That is why statistics and machine learning are so interleaved.
• Now, machine learning is not the same thing as statistics, and it originally tries to solve slightly different problems.
• One thing that is maybe culturally typical of the machine learning community is that we’re pretty much happy to use any tool that lets us solve our problems.
• We’re not married to a particular type of tool, but statistics has become so important that it’s sometimes hard to distinguish these days whether a certain method is a machine learning method or a statistics method.
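The "learn how the world works from data" step described above is, at its core, a regression problem. A minimal sketch of that idea, with all numbers invented: we pretend the next state of a system is (unknown to the learner) a linear function of the current state, record transitions, and recover the coefficient by ordinary least squares. The real cart-pendulum system is of course far richer than this one-dimensional toy.

```python
# Sketch: learning simple dynamics as a regression problem.
# Hypothetical linear world: next_state = TRUE_A * state. The learner
# only sees (state, next_state) pairs, never TRUE_A itself.

TRUE_A = 0.9  # unknown to the learner

states = [0.5, 1.0, 1.5, 2.0, 2.5]
next_states = [TRUE_A * s for s in states]  # noise-free for clarity

# Least-squares estimate of a in next_state ≈ a * state:
a_hat = (sum(s * n for s, n in zip(states, next_states))
         / sum(s * s for s in states))
print(a_hat)  # recovers 0.9 (up to floating point) in this noise-free toy
```

With noisy measurements, the same least-squares formula still gives a sensible estimate, which is exactly the sense in which the balancing system is "solving a very classical statistics problem."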

#### Week 4: Machine Learning 1 > 4c Classification > 4c Video

• We’re interested in finding a machine learning object or a mathematical function called a classifier that predicts which class your data point belongs to.
• One consists of blue points, and one consists of these red points here.
• If you remember the processing pipeline that we have discussed before, then you can think of these points as lists of measurements where each point corresponds to one object.
• Then we interpret that list as a vector and plot these lists as points in a vector space.
• Each data point is one point in that feature space.
• A classifier formally is a function that takes as input a point in our feature space.
• We put a point in our space into this function, and it tells us a class label.
• The important point is that the classifier decides which class a point belongs to simply by where it is located in the space.
• We model that data source by distributions, one distribution for each type of data points.
• Here what we see is the distribution of each type of data points plotted as a density over our feature space.
• Down here is feature space, and up here we have the density of the data points.
• In this case, what I’ve done is I’ve separated these data points by a wide line down here.
• Conversely, over here we have all points for which the red class has higher probability than the blue class.
• An assumption that often works well in classification is to simply assume that each mistake is equally bad. So a mistake- what is a mistake? A mistake is if my classifier classifies a point as being a member of a given class, and it actually belongs to another class.
• Your data points here might be measurements that are the outcomes of some form of medical test.
• So you take the measurements recorded for each patient in the test, write them again into a list, and regard them as points in a space.
• So in this case, if you assume that these blue points here correspond to healthy patients and the red points to sick patients, then what you would want to do is shift this classification line further into the blue class.
• Each point in the training data set has a class label.
• All we do to classify it is we find the training data point that is closest to x. And then we assign the class label of that training data point.
• We simply search for the closest point in the training data set and assign the class of that point.
• In the k-nearest neighbor classifier, we find the k closest training points.
• Then we could choose the next simplest thing would be the 3-nearest neighbor classifier, where we just find the three closest points to our input point.
• So to understand what the problem here is, I have plotted a very simple nearest neighbor classifier here with five data points.
• Now, of course you know that you never want to use a sample of five data points to decide anything because there is simply no statistical power in that.
• Assume we have these five data points here.
• So if I get an input data point that is over here, the closest of these five points would be this point here.
• I would assign the class label that is associated with this point.
• These are all points which are closest to this point among my five points.
• Over here are all points- this cell consists of all points for which this point is closer than the other four, and so on and so forth.
• So that separates my feature space here into these cells which belong to one of these points, which means everything in this cell would be assigned the class label of this point here by my nearest neighbor classifier.
• If the training data is very large- so nowadays, it might mean a billion data points- then what would you have to do in order to compute a single classification result? Somebody gives you a new data point.
• What you have to do is you have to run through the entire list of a billion data points.
• For each of those, you have to compute the distance between your point and the training data point.
• You have to sort them, which again is expensive, and then select the shortest three distances or the shortest five or however you have chosen your k. So that works fine if you have five data points.
• It does not work presumably if you have a million or a billion data points.
• So one idea is to pre-compute this division of the space into cells and then classify according to that, instead of searching the data points.
• For 100 data points, it already looks like this.
• Basically what you have to do is you have to represent all the surfaces and decide which side of the surface your data point is on.
• So the drawbacks of these classifiers are that, in a large data set, you have to find the nearest data point, and that is expensive.
• The expense also grows with dimension because you have to compute the distance, the Euclidean distance, between your input data point and each training data point.
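The brute-force cost discussed above is easy to make concrete. A minimal k-nearest-neighbor sketch, with invented toy data: every query scans all n training points and computes a distance to each, which is exactly why this does not scale to a million or a billion points.

```python
# Sketch of a k-nearest-neighbor classifier by brute force.
# Training data and labels are invented for illustration.
from collections import Counter
import math

def knn_classify(x, train_points, train_labels, k=3):
    # O(n) distance computations plus a sort: fine for small n,
    # far too slow when n is a million or a billion points.
    dists = sorted(
        (math.dist(x, p), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["blue", "blue", "blue", "red", "red", "red"]
print(knn_classify((0.5, 0.5), train_points, train_labels))  # "blue"
print(knn_classify((5.5, 5.5), train_points, train_labels))  # "red"
```

Choosing k odd avoids ties in the two-class case; k=1 gives the plain nearest neighbor classifier from the lecture.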

#### Week 4: Machine Learning 1 > 4d Linear Classifiers > 4d Video

• We are now going to discuss a particularly important class of classifiers, so-called linear classifiers.
• A linear classifier is simply a function where that boundary between the regions is linear, which means it’s a straight line in two dimensions or in higher dimensions it’s a plane or a hyperplane.
• So here’s again our picture of a straight line separating these two classes.
• In the nearest neighbor classifier recall that the problem was that we’re using the entire data set as a classifier.
• So if, in this case, the origin for instance would be down here, then we could choose some vector that points from the origin over here.
• In terms of actually computing the output of the classifier it’s also very efficient because all we have to do is subtract one vector from the data point and then take a scalar product with the other.
• Here is again a much too small toy example, which is only for illustration.
• If you look at this point here then if this point had happened to fall just a little bit further up it would already be on the wrong side of this line.
• So if we assume the next data point comes in then it would be conceivable- by just looking at this training data- it would be perfectly conceivable that maybe it’s here or here.
• So for instance we could assume that both of these classes here follow a Gaussian distribution.
• What we’re doing here, effectively, is we’re approximating that classifier under this distribution assumption by a straight line.
• Now, one of the important principles in supervised learning is that we try not to make distribution assumptions on the data source, as we have done here.
• The first thing we note here is that if we just look at these two sets, then in order to determine what happens in the middle - or even whether the two sets are separated by a plane at all - it doesn’t actually matter what happens on the inside of each set.
• So in this region here inside the set, whether this data point is here or here doesn’t make any difference for any given line or plane separating these two sets.
• The only thing that matters in order to decide that are the points here on the outside.
• So what we’ve done here is we’ve drawn the convex hulls of both of these sets.
• It’s a minimization problem because the objective is to make this line here as short as possible.
• The margin in particular is this distance here between the plane and the set.
• So if we have two- our two sets of data points here and find this classifier with a maximum margin hyperplane, that classifier is called support vector machine.
• Empirically we have observed - in the almost 20 years since this classifier was invented - that it is one of the most powerful classifiers that we have.
• What I’m showing you here is the type of data that comes as input in this problem.
• So if you look at this picture here and just write it out as this matrix of numbers in which the computer stores it, that is over here, you can actually see the shape of the two in there.
• Kind of accidentally because- simply because the higher numbers that correspond to black over here use more digits.
• So the part that is black over here looks more white over here.
• Now we would like to use this as a- this entire matrix here as a single data point in our feature space.
• Here I selected a few numbers- I trained a support vector machine and selected a few numbers that ended up being misclassified.
• If you look at these- so here I just trained with two classes.
• These here are two fives that my classifier classified as six.
• If we use a linear classifier on this problem, then what we’re doing effectively is we’re approximating this curved line here by something that’s straight.
• So you can see here is my decision surface and there’s a couple of blue points here that end up on the wrong side of the plane, the part that gets classified as red.
• So if I would now get a new data point- remember that all of these points are thrown away when I actually classify.
• So if I would again see a point like this red point over here I would assume that it’s blue.
• You can see that the line actually comes back here.
• So this cuts up the space into a region here that is red but everything out here is blue.
• Once you are on the far side of this line, down here, again you would again classify as blue.
• You can see that the support vector machine I plotted here actually uses both tricks.
• One of the most severe limitations of linear classifiers is that they just subdivide the space into what’s on one side of the line and what’s on the other side of the line.
• So what do you do about that? It is possible to use linear classifiers on problems with multiple classes, so more classes than two, by stitching them together.
• By using one classifier for each class that just classifies in this class or not in that class.
• Recall how that classifier worked, by simply assigning the class label of the closest data point.
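The one-classifier-per-class stitching described above can be sketched as follows. This is only an illustration of the "one-vs-rest" idea: the weights and biases below are hand-picked rather than trained, and the class names are invented.

```python
# Sketch of one-vs-rest stitching: one linear scorer per class,
# predict the class whose scorer is most confident.

def linear_score(x, w, b):
    # Signed, distance-like score: positive means "in this class".
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# One (w, b) pair per class; values invented for illustration.
scorers = {
    "left":  ((-1.0, 0.0), 0.0),   # fires for points with small x
    "right": (( 1.0, 0.0), -2.0),  # fires for points with large x
}

def predict_class(x):
    return max(scorers, key=lambda c: linear_score(x, *scorers[c]))

print(predict_class((-3.0, 1.0)))  # "left"
print(predict_class(( 5.0, 1.0)))  # "right"
```

In practice each `(w, b)` would come from training a binary linear classifier of "this class" against "everything else", and taking the maximum score resolves cases where several of the binary classifiers claim the point.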

#### Week 4: Machine Learning 1 > 4e Ensemble Classifiers > 4e Video

• So far we have discussed two types of classifiers, nearest neighbor classifiers and linear classifiers.
• Now I want to discuss a different type of classifier that has proven very useful in practice, and the summary term for this type of classifier is ensemble classifiers.
• Ensemble classifiers are based on a very simple idea, which is that we train many classifiers that are weak, meaning they are not particularly good, and then we combine the results of these many different weak classifiers into a majority vote.
• So the error rate is simply the proportion of misclassified points, and that means it is the expected proportion of errors that we would see classifying data points generated by the underlying data source.
• Now, when do we call a classifier weak? So certainly if almost every data point would be an error, so if the error rate would be close to 100%, that would certainly be a weak classifier.
• Actually we don’t have to go that far, because if you think about a two-class problem, and you assume that the two classes are roughly of the same size, so about 50% belong to one class and 50% belong to the other class, then you can always get an error rate of about 50% by just classifying at random.
• So we would try to get the error rate below 50%. The best we can possibly hope for is an error rate of 0%. That doesn’t usually work in practice, so in practice a good classifier might be something with an error rate of 5%, so about 95% of the points would be classified correctly.
• Now, what we mean by a weak classifier is a classifier whose error rate is not 50%, which would be the worst error rate in this context, but instead an error rate that is slightly below 50%. And if we have a total of M classifiers that achieve an error rate that is slightly better than 50%, then the idea is to combine them into a majority vote.
• So we have M classifiers and, again, to avoid ties we choose M as an odd number, and now there are two choices in this vote, which is one class or the other class.
• The question is, can a vote by majority identify the correct choice? In our classification problem, the voters are the classifiers and they vote either for one class or the other class.
• If we assume we have M classifiers here, f1 through fm, so these are our classification functions, just as before.
• We put a new data point, the feature vector of a new data point, into that classifier and it outputs either one or two or a class label in the range of class labels.
• Then we combine these classifiers by just taking the sum of all of those, and since the classes are plus 1 or minus 1, we get the majority by simply taking the sign.
• So if all of these classifiers here add up to something positive, then more of them have voted for the positive class than for the negative class.
• Now, does the majority make the correct choice? Assume for the moment that each classifier makes the correct choice with some probability p between 0.5 and 1, and assume for simplicity that p is the same for all classifiers.
• What I’ve done here is I’ve plotted the outcome of this formula, and down here on the horizontal axis is the number of classifiers that we use in the majority vote.
• You can see here for a single classifier it starts at 0.55, but then you can see that it increases quite quickly, and here for 150 classifiers we are very close to a probability of 1 actually.
• So this is an indication that if, in an actual classification problem, we can somehow get close to these assumptions - that each classifier makes the right choice with a given probability and that they are stochastically independent - then voting together a number of classifiers that is not particularly large, like 150, which is not a lot of numbers for a computer to add up, gives us on aggregate a very good classifier, even though each individual classifier is not particularly good.
• What I’ve done here is I’ve increased the probability to 0.85, so that in itself would be a classifier that’s not a weak classifier anymore, but it’s also not particularly good.
• Now we can see that just adding together a few of these classifiers already achieves almost perfect classification.
• We, in some way, train a lot of bad classifiers or not very strong classifiers on this training dataset.
• So the training data creates dependence between these classifiers, and ensemble classifiers use very different strategies to train these individual weak classifiers, but you can usually interpret these strategies as simulating, in one way or another, independence between the classifiers.
• So there are strategies that try to modify the training data in some way by randomization or by deterministic strategy by weighting data points up and down to get weak learners, weak classifiers that are behaving almost as if they were independent.
• So two important examples of these classifiers are AdaBoost and random forests.
• Random forests, in particular, is one of the methods that you are very likely to see as an off-the-shelf classifier implemented in some software package.
• So if you want to have a state-of-the-art classifier, in most cases I guess you would probably end up with a support vector machine or a random forest nowadays.
• Now, in order to build a random forest, we need to come up first with a weak classifier, and the weak classifier that random forest uses is something called a tree classifier.
• So what we have to do now in order to build a classifier is we have to carve up this square into regions that belong to one class or another class.
• We will see that tree classifiers actually very naturally can deal with any number of classes.
• So in order to build my tree classifier, all I do is I decide for one of the two axes and split it up at some point.
• So then I have to assign a class label, and if I want to train this classifier, all I do is take my training data and, once I have picked the split here, assign the class label by a majority vote among the training data points in each region.
• Now, the reason why that is called a tree classifier is because I can arrange these decisions that I’m making here in a tree.
• Now we’re going to use these decision trees as weak classifiers in an ensemble, and in order to do so, we have to decide on a way that allows us to make these as independent as possible, because if we just train the same decision tree classifier over and over again on the same data, we always get the same tree.
• Now, if you try out this classifier, then you will find that it works well, but it probably doesn’t work as well as maybe a support vector machine or another state-of-the-art classifier.
• In order to compute the next split, to put an additional hyperplane into our classifier, we again randomly select a small subset of axes.
• So we train m of these trees in total, and then we classify by majority vote.
• One thing you notice immediately is that you can actually see that this classifier is built from axis-parallel planes that are first combined in a tree and then averaged together.
• Ensemble classifiers combine weak classifiers by voting, and it is generally beneficial to have a lot of variance between the individual classifiers.
• A random forest is one of the best classifiers that are available.
• In general, of course, what classifier works well may depend on your problem, but if you do not know much about your problem or if you do not want to do a lot of engineering, then probably as an off-the-shelf method, the random forest or an SVM is about the best you can do.
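The majority-vote argument above can be checked directly. A small sketch that computes, for M independent classifiers that are each correct with probability p, the probability that the majority is correct; the specific values p = 0.55 and M = 151 are just illustrative, echoing the kind of curve described in the lecture.

```python
# Sketch of the majority-vote probability: with M independent voters,
# each correct with probability p, the majority is correct when more
# than M/2 of them are. M is taken odd so there are no ties.
from math import comb

def majority_correct(p, M):
    # Sum the binomial probabilities of more than M/2 correct votes.
    return sum(comb(M, k) * p**k * (1 - p)**(M - k)
               for k in range(M // 2 + 1, M + 1))

print(majority_correct(0.55, 1))    # a single weak voter: 0.55
print(majority_correct(0.55, 151))  # grows well above 0.55 with more voters
```

For p above one half the probability keeps climbing toward 1 as M grows, which is exactly why voting together many barely-better-than-chance classifiers can produce a strong ensemble; the hard part in practice is making the classifiers behave as if they were independent.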

#### Week 4: Machine Learning 1 > 4f Model Selection > 4f Video

• Training a classifier involves a problem that I have glossed over so far.
• Now, to understand what model selection is about, it’s useful to first look at the types of errors that we have to consider when we are training a classifier.
• One type of error that we have already encountered is the training error.
• So the training error of a classifier is simply the number of misclassified training points divided by the total number of training points.
• The reason why we divide here, why we normalize is that otherwise this number would depend on the size of the training set and we want this to be a property of the classifier and not of the training set.
• Now, the training error only tells us how the classifier behaves on the training data, but the training data is just an example set of values generated from the actual data source.
• So what we are really interested in is the performance of the classifier on data generated further on from the data source.
• Now, if we knew the true distribution of the data source, in that ideal case we would define the prediction error as the expected value, the expected number of misclassified points, again normalized- so we get an expected proportion of misclassified points when data is generated from the actual data source.
• In the classifiers we have seen so far, so particularly, for example, in the tree classifier, there are two values that we have to think about when we train the classifier.
• Now, if we get a data set like this and we want to minimize the training error- if this is our training data, we train a tree classifier and we want to minimize the training error, then in order to achieve optimal training error, the best thing we can do is just keep splitting until every point is perfectly assigned to a class.
• Then I have achieved perfect training error, because no single point is misclassified.
• So what I’ve done here is I’ve fit too closely to the specific properties of this particular example set, my training set.
• So if I keep improving my classifier on the training data until it classifies every point perfectly and make it more and more complex, then I have to assume that on new data generated from the source, it might not actually do that well.
• When you look at the tree classifier, then you see that the number of splits is a parameter that controls, in a sense, how closely I can fit to the data.
• If I allow my classifier an arbitrary number of splits, it can just keep refining and drawing little lines and corners around every single data point in the training set, which allows it to over-adapt to the training set.
• Then it will not generalize well on more data that does not look exactly like the training set.
• This property is controlled by the number of splits and I can think of the number of splits as an input parameter of my classifier training algorithm, of my tree training algorithm.
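An extreme way to see the over-adaptation described above (an illustration of mine, not an example from the lecture) is a classifier that simply memorizes every training point. Its training error is zero by construction, but the memorized detail says nothing about unseen data.

```python
# Illustrative extreme of fitting the training set perfectly: a classifier
# that memorizes every training point. Training error is zero, but on any
# point it has not seen it can only fall back to a fixed guess.

def train_memorizer(points, labels, default=0):
    table = dict(zip(points, labels))
    # On unseen inputs we can only return the default guess.
    return lambda x: table.get(x, default)

train_x = [0.1, 0.4, 0.6, 0.9]
train_y = [0, 0, 1, 1]
clf = train_memorizer(train_x, train_y)

train_err = sum(clf(x) != y for x, y in zip(train_x, train_y)) / len(train_x)
print(train_err)  # 0.0: perfect on the training set
print(clf(0.8))   # 0: unseen point, answered by the default guess, not by any pattern
```

The underlying pattern here is a threshold at 0.5, so a sensible classifier would label 0.8 as class 1; the memorizer gets it wrong because nothing about the pattern was actually learned.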
• There are similar input parameters for other classifiers that also in some sense quantify how complex a classifier can be.
• In machine learning, it has become fairly common to refer to these complexity parameters of classifiers as hyperparameters.
• So in a tree, if I want to store a tree classifier, if I want to characterize a tree classifier completely, I have to record all the split locations.
• Before I can decide on those split locations, I have to decide on the number of splits.
• Similarly in the tree ensemble, the split locations are the parameters and the hyperparameters are the number of splits and the number of trees.
• For the linear classifier, we used an additional trick that allows data points in the training set to be on the wrong side of the hyperplane.
• So the model would be a family of classifiers, such as all trees with four splits.
• So in the tree example, model, all trees with k splits, parameters, the split locations, and hyperparameter, the number k of splits that we allow in total.
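The parameter/hyperparameter distinction for the tree example can be sketched as a tiny data structure (names here are illustrative, not from the lecture): the hyperparameter k, the number of splits allowed, is fixed before training, while the split locations are the parameters that training fills in.

```python
# Sketch of the model / parameters / hyperparameter distinction for trees:
# k (the number of splits allowed) is a hyperparameter chosen before
# training; the split locations are the parameters found during training.

from dataclasses import dataclass, field

@dataclass
class TreeModel:
    k: int                                                # hyperparameter
    split_locations: list = field(default_factory=list)   # parameters

    def add_split(self, location):
        if len(self.split_locations) >= self.k:
            raise ValueError("model allows only k splits")
        self.split_locations.append(location)

model = TreeModel(k=4)        # the model: "all trees with four splits"
model.add_split(0.5)          # training determines the parameters
print(model.k, model.split_locations)  # 4 [0.5]
```

Choosing k is model selection; choosing the split locations for a fixed k is training. Keeping the two separated is what the following bullets on model selection rely on.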
• Typically, we cannot perform model selection as part of the training, because if we just try on the training data to optimize our training error, then that would lead to the model becoming more and more complex, fit the training data perfectly, and then we overfit.

#### Week 4: Machine Learning 1 > 4g Cross Validation > 4g Video

• So the problem we’re trying to solve is, how do we select an adequate model based on sample data? Again, we do not know the distribution of the underlying data source.
• We substitute labeled sample data, training data, for this data source.
• At each stage, we determine the split point by minimizing the training error on the training data set.
• In the tree case we quite literally minimize the training error on the training data.
• In order to come up with a method for model selection, we first recalled that the basic premise of supervised learning is that we use sample data as a proxy for the actual data source.
• In order to design a method, we think about what would we do if we actually knew the underlying data source, the true distribution associated with the underlying data source? And then, we apply the same principles.
• We still train our classifiers on a training data set.
• If we were able to do that, then the best strategy we could possibly use is to train a number of classifiers with different values of the hyperparameter on the training data, submit all of them to the Oracle, and see what numbers come back.
• We use one part of it as training data, on which we train our classifiers, and use the other part to approximate the Oracle that tells us what the prediction error is.
• So we split our training data into two sets, set 1 and 2.
• So if we have a method that overfits the training data and adapts a classifier too closely to set 1, then it will probably not perform very well on set 2.
• So the prediction error would be the error that we would have to expect when we apply the classifier to data generated from the true distribution of the data source.
• Now, we approximate that using a second data set.
• So we must not perform classifier assessment using the same data set that we have already used for model selection.
• So cross validation splits the data into three sets.
• There’s a third set called the validation set, or hold out set.
• So once we have done so- once we have trained our classifiers with different hyperparameters on the training set- we then use the test set to select the one with the smallest prediction error.
• Once we have done so, we estimate the performance of that final classifier on the validation set.
• So if we would estimate the performance of the final classifier on the test set, which we have already used to select that classifier, then the classifier might have over-adapted to the combination of training and test set.
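The selection procedure above can be sketched concretely: train one classifier per hyperparameter value, pick the one with the smallest error on the test set, and only then assess it on the validation set. The candidate classifiers below are toy thresholds; all names and data are illustrative, not from the lecture.

```python
# Hedged sketch of test-set model selection followed by validation-set
# assessment. One toy classifier per hyperparameter value (here the
# hyperparameter is simply the threshold location).

def error(clf, data):
    return sum(clf(x) != y for x, y in data) / len(data)

candidates = {t: (lambda x, t=t: 1 if x > t else 0) for t in (0.3, 0.5, 0.7)}

test_set       = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]
validation_set = [(0.1, 0), (0.9, 1)]

# Model selection: smallest prediction-error estimate on the test set.
best_t = min(candidates, key=lambda t: error(candidates[t], test_set))
final = candidates[best_t]

print(best_t)                        # 0.5: separates the test set perfectly
print(error(final, validation_set))  # 0.0: final assessment on untouched data
```

The crucial point from the text is that the validation set appears only in the last line: it took no part in selecting `best_t`, so its error estimate is not confounded by the selection step.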
• So these are these contests or challenges where somebody has a data set, and different teams try to predict something on that data using machine-learning methods.
• Definitely, if the data that was used to evaluate the classifier in the end was actually used somewhere in training or testing, the results are not trustworthy.
• To implement a cross-validation method, basically there is one decision that we have to make, one design decision, and that is how do we split the data? So we have three data sets, the training set, the test set, and the validation set.
• We have to decide how large these sets should be- what proportion of the overall training data that we have at our disposal should go into the training set, into the test set, and into the validation set- and we have to decide how we split them.
• So the validation set is a set we set aside in the beginning.
• Then we have to decide what do we do about the training and the test set.
• So we could opt for a proportionally-large training set, and a smaller test set, or the other way around.
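The split decision above can be written down directly: set the validation set aside first, then divide the rest between training and test data. This is a minimal sketch; the proportions used here are just one possible choice, not values from the lecture.

```python
# A minimal three-way split: validation data is set aside first, then the
# remainder is divided between training and test sets. Shuffling guards
# against order effects in the source data.

import random

def three_way_split(data, val_frac=0.2, test_frac=0.25, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)
    n_val = int(len(data) * val_frac)
    validation, rest = data[:n_val], data[n_val:]
    n_test = int(len(rest) * test_frac)
    test, train = rest[:n_test], rest[n_test:]
    return train, test, validation

train, test, val = three_way_split(range(100))
print(len(train), len(test), len(val))  # 60 20 20
```

With these example fractions, most of the data goes into training, which matches the argument below that larger training sets give more accurate classifiers in expectation.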
• So let’s assume we have already taken our validation data out of our initial data set and have set it aside.
• We have to decide, do we want to use most of the remaining data as training data, or do we want to use most of it as test data? So a large training data set has the advantage that a classifier trained on more data is going to be more accurate in expectation.
• So we can expect that the more examples a classifier has available for training, the more accurate it will be on data from that actual source.
• So in order to get a fair assessment of the classifier, it would be a good thing if we use a large training data set.
• Otherwise, if we use a very small training data set, it could happen that a classifier performs well on that small data set only because, again, it over-adapts to the idiosyncrasies of that set.
• What I could do is train a classifier on all of the data except a single point, which I set aside.
• If I do this for every data point in my set and average the results together, I will still get a good estimate of my predictive performance.
• The drawback of that strategy is that if I remove a single point, my training data set each time looks almost exactly the same.
• If I draw multiple training data sets from my underlying data source, there’s going to be some variation in that data.
• So for each value of K- so for each of these capital K blocks- we take one block out of the data and set it aside for testing.
• So we cycle through these K blocks, each gets removed once, and we train our classifiers on the remaining data.
• These estimates are obtained from separate training data sets.
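The K-block scheme above can be sketched as a small generator: partition the indices into K blocks, cycle through them, hold each block out once for testing, and train on the rest. Setting K equal to the number of points recovers the leave-one-out strategy discussed earlier. The function name and block construction here are illustrative.

```python
# Sketch of K-fold cross validation: the data indices are divided into K
# blocks; each block is held out once as a test set while the classifier is
# trained on the remaining blocks.

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the K blocks."""
    blocks = [list(range(i, n, k)) for i in range(k)]  # simple round-robin blocks
    for held_out in range(k):
        test = blocks[held_out]
        train = [i for b, block in enumerate(blocks) if b != held_out
                 for i in block]
        yield train, test

folds = list(k_fold_indices(6, 3))
print(len(folds))   # 3 folds, each block held out exactly once
print(folds[0][1])  # first test block: [0, 3]
```

Averaging the test errors over the K folds gives the prediction-error estimate; because each fold trains on a noticeably different subset, the estimates come from more varied training sets than leave-one-out provides.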
• We use some part of our overall available labeled data and use it to approximate the properties of the data source.
• To interpret machine-learning results, you should always ask yourself, if they are based on any form of cross validation- which in supervised learning, they usually are- did the researchers or engineers who trained the model have access to the validation data, or how much did they know about it? If they knew anything about the validation data that helped them improve their results, then those results are possibly confounded.

#### Week 4: Machine Learning 1 > 4h Machine Learning Summary > 4h Video

• Now it’s a bit difficult to say something about big data that is in any way precise, because nobody really knows what big data is.
• It does get used more and more to refer to the kind of data that gets automatically recorded by all kinds of digital processing.
• This data has a few properties that are really worth thinking about.
• So one is that often, this kind of data is inhomogeneous.
• It gets generated by different sources or in different contexts: if a company logs data about its users, for example, it might get some data about their payment behavior.
• It might get some data about their shopping behavior on an online service, and so on, and so forth.
• So these are different sources that generate data that still pertains to the same user.
• One recent trend in machine learning is that in order to deal with big data, we are looking at methods that are simple and fast, simply because the data sets are getting so large that fancy, sophisticated methods simply are not feasible on these data sets.
• There’s another thing about big data that I think one should keep in mind, and that is that this data is often low-signal, in the sense that it’s automatically logged data from things like web servers or sensors.
• You have a lot of text, or a lot of data that sits on your computer.
• What is contained in this data set is a fairly small amount of information.
• If you have a lot of data that does not contain a lot of information, then typically simple methods work better.
• Neural networks were state of the art machine learning methods back in the ’80s. And then they went a little bit out of fashion.
• One thing that they seem to excel at in particular is feature extraction from visual data and speech data.
• One thing that you might want to keep in mind is, if you see an application that involves running machine learning methods on visual data, like images, or possibly video, or on speech, or audio data, then these days, it will probably be hard to get around some form of deep learning if you want to achieve state of the art results.
• So these are two things that you will hear a lot in the context of machine learning these days.
• So what has happened in the last 10 or 15 years is that industry- in particular, the high-tech industry- has more and more demand for machine learning methods.
• So many machine learning researchers have moved from academia into industry.
• You have to deal with large data sets that are stored in databases, so you have to access these databases in a way that interweaves well with the machine learning methods.
• What I hope you take away is that machine learning is a set of methods, or computer algorithms, that search for patterns in data.
• We always try not to code into the algorithm what a pattern is, but to let the data determine, as much as possible, what a pattern means.
• Maybe the most important thing is that, because our time was brief, we have only seen a tiny, tiny snapshot of what machine learning really is.