Modeling structure, accuracy, and bias. Our goals in this lesson are to first summarize the methodology for building accurate predictive models and to figure out what accuracy actually means. Along the way, we're also going to identify what can potentially be sacrificed in obtaining accurate models. If accuracy is the overall goal, what can get lost in that quest?

To start out, let's talk about accuracy. Is your model accurate? In the previous example, we looked at how accuracy and the error rate, the number of correct predictions compared against actual results, drive our model. If we have an error rate that's above a certain threshold, we want to continuously train our model to get that error rate as low as possible. That's the quest for accuracy: the number of correct predictions over the total number of predictions.

But is an accurate model always useful? Let's take a look at an example from a medical machine learning model. This model is trying to identify whether a tumor is malignant, which is a medically positive result, or benign. This can be confusing, but yes, when you get a test result back in medicine, a bad outcome tends to be a positive test.

We start out by looking in the upper left at true positives. This is where the model identified the tumor as malignant, and in reality it was. The predicted result was positive and matched the actual result. The number of TP results for this model is one.

A false positive you can think of as a false alarm. The predicted result was positive, but the actual result was negative: in reality the tumor was benign, and the machine learning model thought it was malignant. That is also one result.

A false negative, on the other hand, is when the model predicts negative but the actual result was positive. In this case, the reality was that these tumors were malignant, and the model thought they were benign. That constituted eight results in our study.

Then a true negative is when the reality was benign and the machine learning model predicted benign. The model predicts negative, the actual result is negative, so it's correct, just in the other direction.

Overall, these are the four quadrants of the predictions this machine learning model has made. When we think about accuracy, the correct results are in green, and out of 100 results, 91 were correct. You would say that is an error rate of just nine percent: 91 percent of the time, the model gets it right.

But let's look a little closer. If accuracy were the only thing guiding us, we would say, okay, great, this model is accurate. It might not be ready for medical diagnosis at 91 percent, but it's at least ready to help the doctors make some good decisions. Then we look at reality: the false positives and false negatives. What we really care about in this case is identifying malignant tumors. In reality, there were nine malignant tumors, and we guessed right on only one of them, the true positive; the other eight were missed as false negatives. This model catches only one out of every nine malignant tumors, which is really not a great model. This is an example of how accuracy as the overall goal can trick you into thinking that a model is useful when it may not be.
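To make that gap concrete, here's a minimal Python sketch that computes both accuracy and recall from the four quadrants above. The lesson doesn't use the term "recall," but it's the standard name for the fraction of actual positives a model catches, which is exactly the one-out-of-nine problem we just saw. The counts come straight from the tumor example:

```python
# Confusion-matrix counts from the tumor example above.
tp = 1    # predicted malignant, actually malignant
fp = 1    # predicted malignant, actually benign (false alarm)
fn = 8    # predicted benign, actually malignant (missed tumor)
tn = 90   # predicted benign, actually benign

total = tp + fp + fn + tn

# Accuracy: correct predictions over total predictions.
accuracy = (tp + tn) / total    # 91 / 100 = 0.91

# Recall: of the truly malignant tumors, how many did the model catch?
recall = tp / (tp + fn)         # 1 / 9 ~= 0.11

print(f"Accuracy: {accuracy:.0%}")   # 91% -- looks great on paper
print(f"Recall:   {recall:.0%}")     # 11% -- misses most malignant tumors
```

Run it and the two numbers tell very different stories: 91 percent accuracy alongside 11 percent recall.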
Something to think about later on. With that in mind, how do we build accurate models? Well, we start with the data that we're feeding the model and picking the right variables, that is, deciding what we should give the model to look at.

The first rule of accurate model building is to make sure the data you're giving the model is not already correlated. A good example of this is that we know people in San Francisco also live in California, but a model does not. If we feed it a spreadsheet that contains San Francisco and California in different columns, the model may get confused.

Second, we often are not going to get a complete dataset. Data will be missing and incomplete. This is part of the art of building a predictive model: we need to recognize where data is missing and decide how to fill the gaps. How do we compensate for, let's say, knowing 80 years of income data but only 20 years of loan repayment data? We'll have to balance those, and we'll talk later on about how to do so in a way that does not bias the data.

Then finally, we think about domain expertise: actually having a complete view of the problem we're trying to solve and the different factors affecting it, so that the values we're feeding to the model paint the whole picture of the problem. This is part of being a social scientist and a researcher all mixed into one when you're building a predictive model.

To put these three practices into reality, let's improve a model for accuracy in the following example. The scenario we're given is that a new navigation app has hired us to help work on their predictive model. They want an app that not only shows traffic but also predicts how long the traffic will last, to help you get to your destination sooner. The key question for them to solve is: how long will a traffic jam last when it occurs? They give us the following datasets to start: traffic jams by location, car speeds during jams, traffic volume, average time spent in traffic, and average speed while in traffic. That's a lot of data to go through, and it can help us build our model.

Let's first think through a basic model that does not apply these accuracy standards; we'll just get the model up and running. The first variables we'd pick would be time spent in traffic, location coordinates, traffic volume, and average time per traffic jam. We then classify these data points to tell the model exactly what to think through: compare traffic volume and average time spent in previous traffic jams at this location, then make a prediction. Compare the historical car speed data and predict the speed increase. Once the speed reaches a set percentage of normal speed, mark the traffic jam done. Look at the different speeds, look at the history, and then make a prediction, again using the datasets provided. Now, this model could reasonably predict, based on the average speed increase of cars in a jam, how traffic patterns unfold and whether the traffic jam is done.

But we can build a much better model just by thinking through the factors we discussed. Let's think like a researcher and build a more accurate model. First, let's look at the correlated data we could potentially remove. Do we really need min-max speed and average speed, or is the information in the min and max data points already captured by average speed? Do we even need the minimum and maximum speed? A quick correlation check, as in the sketch below, can surface these redundancies.
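Here's what that correlation check might look like in Python with pandas. The dataset and column names are hypothetical stand-ins for the app's real data, and the 0.9 threshold is a judgment call, not a fixed rule:

```python
import pandas as pd

# Hypothetical slice of the navigation app's dataset; the column
# names and values are illustrative, not the app's actual data.
df = pd.DataFrame({
    "avg_speed":      [12.0, 8.5, 20.1, 5.2, 15.3],
    "min_speed":      [3.0, 1.5, 9.8, 0.0, 6.1],
    "max_speed":      [25.0, 18.0, 34.5, 12.0, 28.9],
    "traffic_volume": [430, 610, 220, 480, 390],
})

# Pairwise correlation between all numeric columns.
corr = df.corr().abs()

# Flag distinct column pairs that are highly correlated.
THRESHOLD = 0.9  # assumed cutoff; tune for your data
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > THRESHOLD:
            print(f"{a} and {b} are correlated "
                  f"({corr.loc[a, b]:.2f}) -- consider dropping one")
```

On this toy data, the min, max, and average speed columns flag each other, which matches our intuition that average speed already carries most of that information.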
Next, we're going to look for missing and incomplete data. Thinking about a navigation app: do we always have the car's speed, or do we only have it when the app is open? How would that affect the model? The reality is that if cars are stopped in traffic, users might be on their phones; they might exit the navigation app to go text someone. Obviously not a safe practice, but something that happens in reality, so we need to consider all of these factors.

Then domain expertise: how does traffic actually work? Here are some basic principles. When no one is in front of us as a driver, we accelerate; when someone is slowing in front of us, we brake; and if there's a lane open with faster traffic, we switch lanes. All of those are factors to consider. You actually need to go out and get stuck in traffic a few times. Not the most fun, but something you need to do to figure out how traffic actually works.

All that considered, here's a better model we could build with those factors in mind. First, let's take our variables: we just need traffic volume, average speed (dropping min-max while in traffic), and historical traffic locations. Then we classify for the model: plot the average speed over time, and predict acceleration and deceleration based on traffic volume at that location. Overall, because we've taken in some domain expertise, the model just needs to know that when x number of cars begin a decelerating trend, that is a jam, and it can use the accelerating trend to determine when the jam is done. We can predict the length of the jam based on traffic volume, using acceleration and deceleration. The model will be more accurate, and we'll have a better idea of traffic, because we've now included how traffic actually works: it's really just a function of how many cars are decelerating versus accelerating. (A minimal code sketch of this jam-detection logic follows the lesson wrap-up below.)

To sum up: accuracy, while it is the number of correct predictions over the total number of predictions, is much more than that. We have to think through whether an accurate model is the end-all be-all. Is it actually useful? How do we make a more accurate model that ends up solving the problem at hand better? As a researcher, that is the goal we want to aim toward. That is it for this lesson. We'll see you in the next one.
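As promised, here's a minimal Python sketch of the jam-detection logic described above. The car-count threshold, the trend window, and the shape of the speed data are illustrative assumptions, not the app's actual parameters:

```python
# Assumed parameters for illustration only.
JAM_CAR_COUNT = 10   # hypothetical: x cars trending marks a jam start/end
TREND_WINDOW = 3     # number of consecutive readings that define a trend

def count_trending(speed_history, direction):
    """Count cars whose last TREND_WINDOW speed readings move one way.

    speed_history: list of per-car speed readings, newest last,
    e.g. [[30, 22, 15], [28, 27, 29], ...].
    direction: -1 for decelerating, +1 for accelerating.
    """
    count = 0
    for speeds in speed_history:
        recent = speeds[-TREND_WINDOW:]
        diffs = [b - a for a, b in zip(recent, recent[1:])]
        # Every step must move in the given direction to count as a trend.
        if diffs and all(d * direction > 0 for d in diffs):
            count += 1
    return count

def jam_started(speed_history):
    # When x cars begin a decelerating trend, that is a jam.
    return count_trending(speed_history, direction=-1) >= JAM_CAR_COUNT

def jam_ended(speed_history):
    # An accelerating trend across the same number of cars marks it done.
    return count_trending(speed_history, direction=+1) >= JAM_CAR_COUNT
```

The design choice mirrors the lesson: rather than modeling speeds directly, we only track how many cars are decelerating versus accelerating, because that is what a jam really is.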