Data science really has three parts. Machine learning, or modeling, is one of them. But there are also the parts that come before and after the modeling itself. Before we actually build or train a model, we have to prepare the data for input to that model. That includes collection, cleaning, sanity checking, and figuring out how to derive features from the data that can be input to the model. What about after modeling? We actually have to deploy the model in practice, and that involves a number of things: for example, considering various real-world systems constraints, and producing actionable outputs, in other words, things that decision-makers can actually act on and decide upon based on those outputs. There are additional practical considerations, such as model drift: what happens when you train a model and then, over time, the model becomes less accurate for a variety of different reasons?

Machine learning has a couple of different classes of approaches, if you will. The first is supervised learning. In supervised learning, we take data that has labels a priori and train a model so that we can do things like predict future outcomes. Prediction is one such application of supervised learning. In particular, we might take a data set, train a model, and then try to figure out what might happen next, or what might happen if we make a change to the environment via some change to the input features. Regression is a type of supervised learning where the outcome that we're trying to predict is a continuous value. Regression is the first type of supervised learning problem that we're going to learn about. In supervised learning, we also talk about classification, where the outcome is a discrete class: given a particular data point, we try to figure out what class or type that data point belongs to.
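To make the regression idea concrete, here is a minimal sketch in plain numpy, using made-up feature values and labels: a model is fit to labeled data, and the trained model then predicts the continuous outcome for an unseen input.

```python
import numpy as np

# Toy supervised-learning example: labeled data (features X, outcomes y).
# The outcome y is continuous, so this is a regression problem.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per data point
y = np.array([2.1, 3.9, 6.0, 8.1])          # labels known a priori

# Fit y ~ w*x + b by ordinary least squares.
A = np.hstack([X, np.ones((len(X), 1))])    # add an intercept column
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# The trained model can now predict the outcome for an unseen input x = 5.
prediction = w * 5.0 + b
print(w, b, prediction)
```

A classifier would look similar, except the predicted outcome would be mapped to a discrete class rather than left as a continuous value.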
A classic classification problem is spam filtering, where you have a piece of email and you try to decide: is this email spam, or is it legitimate? Unsupervised learning is another type of machine learning problem, where the data does not have labels a priori. Essentially, you just have a bunch of data points in a multidimensional feature space. A common type of unsupervised learning approach is clustering, where we take this data and try to group related data points together. One particular application of clustering is recommendations. For example, you might have a bunch of data points, like things you've purchased in the past, and a recommendation algorithm might cluster other items with that group, recommending things you haven't yet bought but that are similar in some way to the things you have.

In summary, there are many steps in the machine learning or data science pipeline, and modeling or machine learning is really only a tiny part of what's actually going on. There are steps prior to modeling that involve preparing the data: measuring it, ingesting it, linking it with other disparate data sources, cleaning it, performing sanity checks, checking for missing values, zero values, and so on; then taking that data set, creating features, and representing them in ways that are suitable for input to training a model. Only then do we get to the modeling step, where there are some important decisions: which machine learning method should we use? How do we tune the parameters of that method, and so forth? But you can see that's a very small part of the overall process. Once the model is trained, we can talk about deployment: how we deploy this model in practice so that it provides actual outputs in real time, and how we maintain that model over time as our environment changes.
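As a sketch of clustering, here is a tiny hand-rolled k-means in numpy on made-up 2-D points, with fixed initial centroids so the result is deterministic; a real project would use a library implementation.

```python
import numpy as np

# Unlabeled points in a 2-D feature space, with two visible groups (made up).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# A few iterations of k-means with k=2, starting from fixed centroids.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
for _ in range(5):
    # Assign each point to its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of its assigned points.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # the first three points land in one cluster, the last three in the other
```

No labels were provided: the grouping emerges purely from distances in the feature space, which is what makes this unsupervised.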
To reiterate, data preparation often affects the final predictive accuracy more than the modeling step does. Like modeling, it involves many parameters that need to be tuned, and much of the effort in a data project goes into preparing the data. Unfortunately, this part of the data science pipeline is not really treated or addressed in many textbooks, including the ones we're going to use in this class, so it still remains somewhat of a black art. But you'll get some experience with wrangling data, preparing it, representing it, engineering features, and so forth in the labs in this class.

Let's take a closer look at some of the steps in a data science pipeline. The first set of steps is ingestion, fusion, and exploration. Here, we're really just trying to understand what's in the data set that we have. One of the exercises we might do in this sequence of steps is to look for outliers. Are there any zero values in the data? Are there data points far afield from the rest? Data points that are zero or far afield might indicate errors in the collection or recording of the data, and we may need to make decisions about what to do with those outliers. Looking at distributions can also help us understand the nature of the data set. For example, if we're looking for thresholds, or trying to understand whether different features in our data set might serve as useful ways to separate the data into different classes, then looking at distributions can help us figure out whether particular features, or representations of those features, might be good candidates as inputs to our models. We can also apply this analysis to look for correlations.
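As one concrete exploration check, here is a sketch in numpy that flags zero values and far-afield points using a robust median-based rule; the values and the 3-MAD cutoff are made up for illustration, and real thresholds depend on the data.

```python
import numpy as np

# Made-up measurements with one suspicious zero and one far-afield value.
values = np.array([10.2, 9.8, 10.5, 0.0, 10.1, 9.9, 97.0, 10.3])

# Flag zeros, and flag points more than 3 median-absolute-deviations from
# the median (a simple robust outlier rule).
zeros = values == 0
median = np.median(values)
mad = np.median(np.abs(values - median))
outliers = np.abs(values - median) > 3 * mad

print(zeros.sum(), outliers.sum(), np.where(outliers)[0])
```

Whether to drop, correct, or keep the flagged points is a judgment call; the exploration step just surfaces them.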
We can try to determine whether different features are correlated with one another, and whether there are fundamental patterns in our data that we need to account for, for example, temporal patterns. A particular hour of the day, day of the week, week of the year, or season might exhibit characteristics wildly different from those of other hours, days, weeks, or seasons. This is extremely important for understanding whether a model we train on today's data will still be valid at some point in the future.

Data cleaning involves performing a set of sanity checks. We might check for values that are missing, values that are zero, negative values in fields that should never be negative, or data that's simply of the wrong type. For example, if a column is supposed to contain integers and suddenly you see text in it, that might imply some misalignment of data somewhere in the data set, and you may need to go in and correct the data. You may also make decisions about how to process or clean the data: imputing missing values if they exist, or normalizing data points that contain outliers when you still want to keep some aspect of those points.

Once we've cleaned the data set, it's time to create and represent the data as features that can be input to models. There's a wide variety of transformations and processing we can apply to take a raw data set and ultimately represent it as features. Discretization is one such operation, where we take continuous-valued features and map them into discrete buckets. Here you can already see there's a parameter: we have to figure out what our bucket sizes should be.
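A sketch of these cleaning checks in pandas, on a made-up column containing a missing value, a negative value, and a wrong-typed entry; the imputation choice (median) is just one possible decision.

```python
import pandas as pd

# Made-up raw column: a missing value, an impossible negative, a text entry.
raw = pd.Series([12, 15, None, -3, "oops", 14])

# Coerce to numeric: wrong-typed entries become NaN rather than crashing.
col = pd.to_numeric(raw, errors="coerce")

# Sanity checks: count missing values and (impossible) negative values.
n_missing = int(col.isna().sum())
n_negative = int((col < 0).sum())

# One possible cleaning decision: mask out negatives, impute with the median.
col = col.mask(col < 0)
col = col.fillna(col.median())

print(n_missing, n_negative, list(col))
```

The counts tell you how dirty the column is before you decide how to fix it; a different project might drop the bad rows entirely instead of imputing.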
Similarly, if you have a huge data set, you might choose to sample it. You might take every 10th point or every 100th point, and you might do this in different ways: sampling at random, say, or sampling periodically, every second, for example. There are many ways you might sample and many rates at which you might sample; again, we have a set of parameters to consider when engineering and representing our features this way. You might choose to perform more complex transformations, like a Fourier transform: if you have a time series, you might take it into the frequency domain and represent your data as the coefficients of the Fourier transform. Or if you have certain types of categories or set membership that you'd like to represent, you might encode the data in a particular way; one popular way of doing that is through something called a one-hot encoding. Finally, you might aggregate the data. If it's temporal data, you might bin quantitative values into discrete time bins. Again, you'll have to make decisions here about the size of your time bins: do I use one-second bins, five-second bins, one-minute bins, one-hour bins, and so forth? All of the decisions you make in engineering and representing your features ultimately affect how accurate your models are, and they're very important to consider. These are the types of things we will explore further in the labs.

After we prepare the data, we then come to modeling. Models are typically what machine learning courses cover, including various considerations such as how to tune the parameters of the model.
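Two of these transformations, discretization and one-hot encoding, can be sketched in a few lines of numpy; the bucket size and the category values here are made up for illustration.

```python
import numpy as np

# Discretization: map continuous values (made-up ages) into fixed-size buckets.
ages = np.array([3, 17, 25, 42, 68])
bucket_size = 20                # a parameter we must choose
buckets = ages // bucket_size   # bucket index per value

# One-hot encoding: represent a categorical feature as 0/1 indicator columns.
categories = ["red", "green", "red", "blue"]
vocab = sorted(set(categories))  # alphabetical: blue, green, red
one_hot = np.array([[1 if c == v else 0 for v in vocab] for c in categories])

print(buckets.tolist())
print(one_hot)
```

Note how the choice of `bucket_size` changes the feature: a size of 20 merges ages 3 and 17 into one bucket, while a size of 10 would separate them.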
When selecting a model, there are several considerations: the accuracy of the model on a particular data set, the complexity of the model, how expensive it is to train, and the assumptions it makes about the underlying data. For example, a particular model might assume that there's a linear relationship between the features and the outcomes you're trying to predict. If the relationship is not in fact linear in the real world, then the model may not do a very good job of predicting the outcomes.

Let's talk a little bit about the training and testing process that goes into modeling. It's important to split the data set into training and testing portions so that we're not evaluating the model on the same part of the data we used to train it. From our training data, we engineer and extract features, and, in the case of a supervised learning algorithm, we also extract our labels or outcomes. Training produces a learned model, and that model can then take features as inputs and produce outcomes or predictions. To test the model, we take the separate part of our data set that we put aside for testing, engineer the features in the same way, apply the model we just trained, and get a set of predictions. Now, how do we know whether the model is any good? Well, remember that, just like our training data, the test data also has labels. We can compare the predictions the model made against the labels in our data set and evaluate how well the predictions match the labels. There are various metrics that tell us how well we're doing there, and we'll talk in further detail about metrics in a later lecture. A common way to evaluate a model is on historical data, using the training and testing approach I just described.
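The train/test workflow above can be sketched end to end with a deliberately trivial "model", a threshold halfway between the two class means; the data and the model are made up for illustration, not a method from the course.

```python
import numpy as np

# Tiny made-up labeled data set: one feature, binary label.
train_X = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
test_X = np.array([1.5, 4.5, 6.5, 8.5])   # held out; never used for training
test_y = np.array([0, 0, 1, 1])

# "Train" a trivial model: a threshold at the midpoint of the class means.
threshold = (train_X[train_y == 0].mean() + train_X[train_y == 1].mean()) / 2

# Apply the trained model to the held-out test features...
predictions = (test_X > threshold).astype(int)

# ...then compare predictions against the test labels to score the model.
accuracy = (predictions == test_y).mean()
print(threshold, accuracy)
```

The key point is the separation: the threshold is computed only from the training rows, and the accuracy only from the held-out rows.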
We'll have an entire lecture on evaluation, but let me list some of the methods here so that you're familiar with the terms in case you encounter them before then. A common way of testing a model is an approach called cross-validation: from your data set, you pull out one part for training and a smaller part for testing, train the model, test it on that separate part, and then repeat the process, pulling out different subsets of the data to train and test each time. There are different ways to perform cross-validation, and we'll talk about them in a later lecture. There are also various ways to compute how accurate the model is. Some of those are listed here, and again we'll talk about the details in a subsequent lecture.

One of the challenges we face when evaluating a model is trying to figure out whether we've done anything novel or useful. Often we're interested in showing that we've done something better than just re-learning the labeled data set. In other words, a machine learning algorithm is only valuable if it's doing something that humans weren't already doing well through the standard process of labeling. When we evaluate a model, we might try to show not only that it's as accurate as the labels the humans produced, but that it can produce them more quickly, more efficiently, or earlier than a human would have been able to, or at a scale that humans can't match. Another way of evaluating models is by deploying them and evaluating them in the field. While this is certainly possible, it's also a bit more challenging because you need to have a field deployment, and you have real-time constraints on when you need to make decisions about data that's coming in.
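Here is a sketch of k-fold cross-validation by hand, reusing the same toy threshold model; the data is made up, and a real project would use a library's cross-validation utilities rather than this manual loop.

```python
import numpy as np

# Made-up data: one feature, binary label (whether the value exceeds 5).
X = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 1.5, 2.5, 8.5, 7.5, 3.5, 6.5])
y = (X > 5).astype(int)

# 3-fold cross-validation: each fold takes a turn as the test set.
k = 3
folds = np.array_split(np.arange(len(X)), k)
scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Same trivial model as before: midpoint between the two class means.
    t = (X[train_idx][y[train_idx] == 0].mean() +
         X[train_idx][y[train_idx] == 1].mean()) / 2
    preds = (X[test_idx] > t).astype(int)
    scores.append((preds == y[test_idx]).mean())

print(scores)  # one accuracy score per fold
```

Averaging the per-fold scores gives a more stable estimate of accuracy than a single train/test split, because every point gets tested exactly once.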
In this class we'll generally use the first approach: taking a data set, training a model, and evaluating it on historical data. After a model is trained, then comes the hard part of deploying it in the field and maintaining it. Deployment has a number of considerations that we'll talk about in more detail as well, but one of the most important is translating what the model is telling us into actionable outputs, in other words, policy. If we're trying to decide which neighborhoods to send more police to, then the model needs to tell us something that makes it easy for the police force to figure out what they should actually be doing. In other words, the model should provide concrete guidance or advice toward decisions, as well as perhaps a confidence level: how confident is the model in the prediction or advice it just provided? A second aspect of deployment and maintenance is detecting model drift, and in particular determining when it's necessary to retrain a model. As I mentioned, a model trained in the winter, when there are no leaves on the trees and people are in their houses rather than outside, might make completely different predictions than it would in the summer, when people are out in the parks, there's foliage on the trees, and the days are longer.

In summary, although this class is called machine learning for public policy, I really want you to think about machine learning in terms of this broader pipeline. It's not just taking a model off the shelf, throwing some data at it, and getting some prediction or output: there's a whole set of steps before the modeling takes place that involve preparing the data, and there are steps afterward that ensure the model we've trained is useful in practice.
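One simple way to operationalize drift detection, sketched here with entirely hypothetical numbers, is to track the model's accuracy on recent labeled data and flag it for retraining once accuracy falls some tolerance below the level measured at deployment.

```python
# Hypothetical baseline and tolerance; real values depend on the application.
deployment_accuracy = 0.90   # accuracy measured when the model was deployed
tolerance = 0.05             # how much degradation we accept before retraining

def needs_retraining(recent_accuracy):
    """Flag the model when recent performance drifts below the baseline."""
    return recent_accuracy < deployment_accuracy - tolerance

# A winter-trained model scored month by month (made-up accuracies):
monthly_accuracy = {"Jan": 0.89, "Mar": 0.87, "May": 0.82, "Jul": 0.78}
flags = {month: needs_retraining(acc) for month, acc in monthly_accuracy.items()}
print(flags)
```

In this made-up scenario the model drifts past the tolerance by May, which would trigger retraining on more recent data; real drift monitoring also requires obtaining fresh labels, which is often the hard part.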