Welcome to Evaluating Machine Learning Models. After watching this video, you will be able to: Define train/test split. Evaluate classification models using accuracy, a confusion matrix, precision, and recall. Interpret a confusion matrix. Evaluate a regression model using mean squared error and other error terms. And define R-squared in the context of variance and goodness of fit.

Before we go into evaluating machine learning models, let’s talk about a very important step. You don’t want to feed all of the data in your data set to the model during training. It is important to split it into a training set, which teaches the model with lots of examples, and a test set, which acts as the new data you will use to evaluate how well the model performs.

To calculate accuracy, let’s consider a “Will I pass or fail my biology test?” example. Assume your model has been trained and has made some predictions on the test set, with Pass predicted on the left and Fail on the right. You represent “pass” with green squares and “fail” with red squares. You calculate accuracy by taking the number of observations the model predicted correctly and dividing it by the total number of observations; the misclassified points are highlighted in grey for clarity. That gives you 70%.

Another way of looking at the accuracy of classifiers is a confusion matrix. A confusion matrix is a table of combinations of predicted values compared to actual values, and it measures the performance of a classification model. On the y-axis, you have the True label and on the x-axis, you have the Predicted label. True positive means you predicted pass and it was pass. True negative means you predicted fail and it was fail. False positive means you predicted pass, but it was actually fail. False negative means you predicted fail, but it was actually pass.
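As a quick sketch, the accuracy calculation and the four confusion-matrix cells can be computed in plain Python. The ten pass/fail observations below are hypothetical, chosen so the accuracy comes out to the video’s 70%:

```python
# Hypothetical pass/fail labels: 1 = pass, 0 = fail
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # actual outcomes
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]  # model's predictions on the test set

# Tally the four confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

# Accuracy = correct predictions / total observations
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)  # 4 3 1 2 0.7
```

In practice you would get the train/test split and these counts from a library such as scikit-learn (`train_test_split`, `confusion_matrix`), but the arithmetic is exactly what is shown here.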
A good thing about the confusion matrix is that it shows the model’s ability to correctly predict or separate the classes. In the specific case of a binary classifier, such as this example, we can interpret the numbers in the boxes as the counts of true positives, true negatives, false positives, and false negatives.

When evaluating a classification model, accuracy alone is not always enough; in some situations, a data scientist will look at other metrics. Let’s start with precision, using our “pass or fail” example. Looking only at the pass class, precision is the fraction of true positives among all the examples that were predicted to be positive. Precision is the total correctly predicted “pass” divided by the total observations predicted as “pass,” which is 4/5, or 80%. Mathematically speaking, it is the true positives divided by the sum of true and false positives.

An example where precision may be more important than accuracy is a movie recommendation engine. Precision involves the cost of failure: the more false positives you have, the more cost you incur. Precision may matter more here because it may cost money to promote a certain movie to a user. If the movie was a false positive, meaning the user isn’t interested in the movie that was recommended, then that would be an additional cost with no benefit.

Now, let’s take a look at recall. Recall is the fraction of true positives among all the examples that were actually positive. Looking at all the observations that are actually “pass,” consider the number of “pass” observations the model got right out of the total true “pass” observations. That is 4 out of 6, or approximately 66.7%. Mathematically speaking, it is the number of true positives divided by the sum of true positives and false negatives.
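A minimal sketch of the two formulas above, using the counts from the pass/fail example (4 true positives, 1 false positive, 2 false negatives):

```python
# Confusion-matrix counts from the pass/fail example
tp, fp, fn = 4, 1, 2

# Precision: true positives among everything predicted positive
precision = tp / (tp + fp)   # 4/5

# Recall: true positives among everything actually positive
recall = tp / (tp + fn)      # 4/6

print(round(precision, 3), round(recall, 3))  # 0.8 0.667
```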
When opportunity cost is more important, that is, when a false negative means giving up an opportunity, recall may be the more important metric. An example of this is the medical field, where it’s important to account for false negatives, especially when it comes to patient health. Imagine that you have incorrectly classified sick patients as not having an illness. That is very important because those patients could go untreated.

In cases where precision and recall are equally important, you can’t try to optimize one or the other. The F1-score, which is defined as the harmonic or balanced mean of precision and recall, can be used in this situation. It is calculated as two multiplied by the product of precision and recall, divided by the sum of precision and recall. Rather than manually trying to strike a balance between precision and recall, the F1-score does that for you.

Consider a grades example where you want to know the grade you will get on your final exam, given your midterm scores. You have fit your regression line, and the blue dots are the grades the students received. The differences between the line and the blue dots are called errors: the predicted values minus the actual values.

There are many ways to evaluate the performance of a regression model; the first is the mean squared error. Mean squared error, or MSE, is the average of the squared differences between the predictions and the true outputs. The aim is to minimize the error, in this case the MSE. The lower the MSE, the closer the predicted values are to the actual values, and the more confident you can be in your model’s predictions. Another variation of error measurement is the root mean squared error, or RMSE. It is the square root of the MSE and has the same unit as your target variable, making it easier to interpret than the MSE.
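The F1-score and the squared-error metrics can be sketched the same way. The precision and recall values come from the earlier pass/fail example; the four actual/predicted grade pairs are hypothetical, invented purely to illustrate the arithmetic:

```python
import math

# F1-score from the pass/fail example's precision and recall
precision, recall = 4 / 5, 4 / 6
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical exam grades: actual scores vs. the regression line's predictions
actual = [70.0, 85.0, 90.0, 60.0]
predicted = [72.0, 80.0, 92.0, 65.0]

# Errors are predicted minus actual values
errors = [p - a for p, a in zip(predicted, actual)]

# MSE: average of the squared errors; RMSE: its square root,
# which is back in the same unit as the grades
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(round(f1, 3), mse, round(rmse, 3))  # f1 ≈ 0.727, mse = 14.5, rmse ≈ 3.808
```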
You also have the mean absolute error, or MAE, which is the average of the absolute values of the errors.

R-squared is the amount of variance in the dependent variable that can be explained by the independent variable. It is also called the coefficient of determination and measures the goodness of fit of the model. The values range from zero to one, with zero being a badly fit model and one being a perfect model. Values between zero and one are what you can expect in real-world scenarios, as there is no perfect model in real life.

In this video, you learned that: It is important to divide your data set into a training set and a test set. A confusion matrix measures the performance and accuracy of a classification problem. The mean squared error is helpful for evaluating regression models. And the R-squared value represents the amount of variance in a dependent variable that can be explained by an independent variable.
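Continuing the sketch with the same kind of hypothetical actual/predicted grade pairs, MAE and R-squared can be computed as:

```python
# Hypothetical exam grades: actual scores vs. the regression line's predictions
actual = [70.0, 85.0, 90.0, 60.0]
predicted = [72.0, 80.0, 92.0, 65.0]
n = len(actual)

# MAE: average of the absolute errors
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n

# R-squared: 1 minus (residual sum of squares / total sum of squares),
# i.e. the share of the variance in the actual values explained by the model
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot

print(mae, round(r_squared, 3))  # mae = 3.5, r_squared ≈ 0.898
```

An R-squared near 0.9, as here, means the model explains most of the variance in the grades, which matches the “closer to one is a better fit” interpretation above.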