One of the first steps in pre-processing our data is to identify any missing values and then determine how best to handle them. In a typical relational table, any time we have no data at the intersection of a row and a column, we're going to need to address that. Sometimes we'll find rows that are missing values for multiple features, and in those cases it may be best to drop the entire row. In other cases we may only be missing a single value, and we need to decide what to do about that one value. There's also the possibility that a feature is missing most of its values across the rows. Each of these scenarios needs to be located and identified.

The difficulty is that how easy missing values are to locate depends on how they're represented, because different programming libraries handle missing values differently. For example, if we're using Python and we import a CSV file into a data frame, missing values will normally be entered as something like NaN, or "not a number", and if the library does that, we can use utilities from the library to search for those particular values. There are other ways of representing null values in Python as well, such as NA. The point is that if this has been done consistently, finding those values is fairly straightforward.

But in some cases a missing value is not simply a blank that can be identified this way. Instead, someone may have entered data in a way that is going to cause problems downstream, perhaps a question mark or a dash in a field. In that scenario the value is not represented as NaN; typically it causes the entire column to be read in as strings instead. Now you need to locate the value that caused the column to be cast to a datatype that isn't appropriate. We do have functions that can help with these things, but they typically look for specific flags like the not-a-number value, so if we end up with a column cast to strings, as in this scenario, those functions are unlikely to be much help.

We do have other options, and as you'll see on the next slide, we can use a number of different mechanisms to decide how to replace a missing value. Step 1 is to find it; step 2 is to choose the best way to replace it. You'll see that there are different approaches. One approach is to take the other values that are actually in the table and find their mean, or average, which in this case might be 87, and then insert the value 87 into that record. Doing that preserves the overall mean of the data set. But there are other approaches we can use as well; let's have a look at some of those now.
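Before moving on to those options, here is a minimal sketch in pandas of the finding-and-replacing steps described so far. The table, column names, and values are made up purely for illustration; with a real CSV file you would start from pd.read_csv instead of building the frame by hand.

```python
import numpy as np
import pandas as pd

# A small made-up table standing in for an imported CSV.
df = pd.DataFrame({
    "hours": [10, 12, np.nan, 9, 11],
    "score": ["85", "90", "?", "88", "85"],  # the "?" forces this column to strings
    "grade": ["B", "A", None, "B", "B"],
})

# Step 1: locate missing values that were read in as NaN/None.
print(df.isna().sum())             # missing-value count per column
print(df[df.isna().any(axis=1)])   # rows with at least one missing value

# A "?" or "-" typed by hand is not seen as missing, and it casts the whole
# column to strings. Swap those flags for NaN and restore a numeric dtype.
df["score"] = pd.to_numeric(df["score"].replace(["?", "-"], np.nan))

# Step 2: one simple replacement is mean imputation, which preserves the
# column's overall mean (here the remaining scores average to 87). Rows or
# columns missing too many values could instead be dropped with dropna().
df["score"] = df["score"].fillna(df["score"].mean())
df["hours"] = df["hours"].fillna(df["hours"].mean())
print(df)
```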
Looking at some of the options for handling missing values, you can see that we have a number of different choices for imputing the missing value. The term imputing really just means that we're going to provide some estimated or substitute value to replace that null or missing value.

We looked at mean imputation on the previous slide; that is primarily applicable to continuous variables. We could also apply mode imputation, where instead of the average we use the most common value. That's not very useful for continuous variables, but it is certainly applicable when we're dealing with categorical values.

However, these approaches may not always give us the best result, and in some cases we may be better off preserving more of the natural variation in the data by using one of the other techniques. With substitution, we find a new record that is not actually in the sample, going to another source to obtain it, and we simply substitute its value for the missing one. With hot deck imputation, we search for a similar record drawn from the sample we are already working with: we try to find a record whose other feature values are very close to those of the record with the missing value, and then we copy the value from that similar record. Cold deck imputation is very much like hot deck imputation in that we pull the value from a similar record, but that record comes from a different sample, perhaps another sample set obtained elsewhere, rather than from the sample we're already operating on. Finally, with regression imputation, we make an estimate or prediction using what is essentially a machine learning model: we take the data we have and use the model to predict the value that would best fill the missing location.

All of these techniques can influence your machine learning model's skill and effectiveness. You may find that simply dropping the record, or removing a feature that is missing many values, improves the predictive skill of the model more than any of these techniques. Really you need to test them and see which produces the best results in your particular case. But at least you now know which of these you could attempt to apply, and of course some are simpler than others. Either way, our goal is to end up with a data set that has no missing values and that can train a machine learning model to make skillful predictions and classifications.
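As a rough sketch of a few of these options in Python: the data, column names, and the hot_deck helper below are hypothetical, and scikit-learn's SimpleImputer and IterativeImputer are just one way to carry out mode and regression imputation; they are not the only tools you could use.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Illustrative data only.
df = pd.DataFrame({
    "hours": [10, 12, 8, 9, 11, 10],
    "score": [85, 90, np.nan, 88, 85, 86],
    "grade": ["B", "A", "B", np.nan, "B", "B"],
})

# Mode imputation: fill a categorical column with its most common value.
df[["grade"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["grade"]])

# Regression imputation: model each feature with missing values as a
# function of the other features and predict the replacement.
num_cols = ["hours", "score"]
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])

# Hot deck imputation (sketch): copy the value from the most similar
# complete record in the same sample, here judged by one other feature.
def hot_deck(frame, target, key):
    donors = frame.dropna(subset=[target])
    for i in frame.index[frame[target].isna()]:
        nearest = (donors[key] - frame.at[i, key]).abs().idxmin()
        frame.at[i, target] = donors.at[nearest, target]
    return frame
```

Cold deck imputation would follow the same pattern as the hot deck sketch, except the donor records would come from a separate sample rather than the frame being imputed.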