Hi again. In this lecture we'll talk about machine learning and the different types of machine learning. Machine learning, as I mentioned, is a subfield of artificial intelligence. It is mostly focused on how we get computers to learn from data without being explicitly programmed. Machine learning techniques are often used for prediction tasks. For example, we might have data on past credit card transactions, and we might be interested in predicting whether a new transaction is fraudulent or not, so we might look at past data in order to make this decision. Or we might be interested in determining whether an email is spam or not based on past data. We might be looking at the task of analyzing images for a driverless car and figuring out whether the object in front of the car is another vehicle, a person, a tree, or something else. We might be interested in recognizing and understanding speech, as with Alexa or Siri. In short, there are many kinds of prediction tasks that use machine learning, and these have applications in a variety of industries, ranging from healthcare, to finance, to manufacturing, to human resources, and so on.

Now, it's important to understand that machine learning is not one single technique; it is really a large set of techniques, all of which come under the umbrella of machine learning. One common way to think of machine learning is in terms of supervised techniques, unsupervised techniques, and reinforcement learning techniques. Supervised learning is the idea of building a predictive model based on past data, where these data have clearly labeled inputs and outputs. For example, we might have data on past emails, with clear labels on which of those emails are spam and which are not, and we might then want to learn from them. This is a classification task: using past data with labeled inputs and outputs to learn how to label future data.
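To make the spam example concrete, here is a minimal sketch of a supervised classifier. The training emails and labels are invented for illustration, and the model is a tiny naive Bayes classifier with add-one smoothing, one simple way (among many) to learn which words predict "spam" versus "not spam":

```python
import math
from collections import Counter

# Toy labeled training data (invented purely for illustration):
# each example is (email text, label).
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# "Learning" step: count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Naive Bayes with add-one smoothing: return the more likely label."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Start from the prior probability of the label...
        score = math.log(label_counts[label] / sum(label_counts.values()))
        # ...then add the (smoothed) log-probability of each word.
        for w in text.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))   # → "spam"
print(classify("agenda for lunch meeting"))  # → "ham"
```

With a realistic training set this same structure scales to thousands of labeled emails; the point is simply that the labels in the training data are what make the learning "supervised".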
Unsupervised techniques, in contrast, have a lot of input data but no clear labels on the output, so these techniques find patterns in the input data. For example, there is anomaly detection, which is the idea of finding data points that look like anomalies; in other words, they look different from all the other data. Similarly, we talked previously about clustering, which is the idea of grouping a set of data points into different groups such that data points within a group are as similar to each other as possible, and data points in different groups are as different from each other as possible. This is based on data, but we don't have clearly labeled output guiding us on how best to break up the data into clusters. Lastly, we have reinforcement learning, which is the idea of having a machine learning system acquire new data by taking actions, observing what happens, and using those observations to improve its future actions. We will look at each of these techniques in greater detail.

Let's start with supervised learning. As I mentioned, supervised learning is the idea of learning from data where you have cleanly labeled inputs and outputs. The inputs can be referred to as features or covariates, and the outputs are often called the targets of the model; the target is what we're trying to predict. For example, with email data, the output we're trying to predict is whether an email is spam or not, and the inputs, the features or covariates, are the actual text of the email. With supervised learning, the idea is that we have cleanly labeled past data with the correct answer, meaning that certain data have been labeled as spam and certain other data have been labeled as not spam, and now we need to learn how to classify future emails. Similarly, we might want to predict next week's sales based on historical data.
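The clustering idea described above can be sketched in a few lines. The points below are made up so that two groups are visually obvious, and the algorithm is a minimal k-means: it is never told which point belongs to which group, only how many groups to find.

```python
# Two visually obvious groups of 2-D points (made-up data).
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
          (8.0, 8.1), (8.2, 7.9), (7.9, 8.3)]

def kmeans(points, k, iters=10):
    """Minimal k-means sketch: alternate between assigning each point to
    its nearest centroid and moving each centroid to its cluster's mean.
    For determinism, centroids start at the first k points."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the centroid with the smallest squared distance.
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its assigned points.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

groups = kmeans(points, k=2)  # recovers the two groups of three points
```

Notice there is no labeled "correct" clustering anywhere in the input; the grouping emerges from the data alone, which is exactly what makes this unsupervised.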
We might use data on the season, the month of the year, the weather, and other such patterns to predict future sales. Our training data is past data that contains all these inputs, month, season, weather, along with the actual sales that were realized. Now we're trying to make predictions about the future based on that.

Let's look at another example of supervised learning. In a recent research study, my colleagues and I were interested in analyzing social media posts published by a number of companies on Facebook. We gathered data on over 100,000 posts submitted by large brands on Facebook, and we wanted to identify which posts are associated with the highest engagement. That is, are emotional posts associated with greater engagement, or humorous posts, or posts that show deals and promotions to consumers, or other kinds of posts? Now, it is very expensive to tag 100,000 posts and label each one as being humorous or not, emotional or not, offering a price discount or not, and so on. We wanted to automate this process, and we used a supervised machine learning technique to do that. First, we needed a training dataset with clearly labeled inputs and outputs. The inputs were available to us: the words that companies use in their posts. The output is essentially a label that says whether the post is emotional or humorous or not. So we took a sample of 5,000 posts and had human beings label each one as being humorous, or emotional, or offering a price discount, or sharing a remarkable fact, and so on. These labels were then used as a training dataset for a supervised machine learning algorithm that learned which words are predictive of whether a post is emotional or humorous. That algorithm was then used to make predictions for the remaining nearly 100,000 posts that hadn't been labeled by a human being.
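The shape of that workflow, hand-label a small sample, fit a model on it, then auto-label the rest, can be sketched as follows. The posts and labels here are invented, and the "model" is a trivial keyword rule standing in for a real supervised learning algorithm:

```python
# Step 1: a small sample labeled by hand (all examples invented).
hand_labeled = [
    ("save 20 percent this weekend", "deal"),
    ("we love our amazing fans", "emotional"),
    ("save big on every order", "deal"),
]
# Step 2: the much larger pool of posts nobody has labeled.
unlabeled = ["save 10 percent today only", "so proud and happy today"]

# "Training": remember which words appeared in hand-labeled deal posts.
deal_words = {w for text, lab in hand_labeled if lab == "deal"
              for w in text.split()}

def predict(post):
    # Label a post "deal" if it shares any word with the labeled deal posts;
    # a real model would weigh many word features instead of one rule.
    return "deal" if set(post.split()) & deal_words else "emotional"

# Step 3: apply the trained model to every unlabeled post.
predictions = [predict(p) for p in unlabeled]
```

The economics are the point: paying humans to label 5,000 posts buys you automatic labels for the remaining 100,000.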
This is essentially the idea of supervised machine learning: you need a training dataset, you learn from it, and you apply what you've learned to future data. What we found in our study was that our machine learning algorithm did well, often with accuracy of over 90 to 95 percent, and sometimes even greater than 99 percent, in predicting whether a post is humorous or emotional. In any business application, if you have a good, high-quality training dataset, you can apply these techniques to make predictions about the future. The key is collecting high-quality data; that is the most important activity in supervised machine learning. There are a number of very good, high-quality off-the-shelf algorithms that can be applied to make predictions once you have a high-quality training dataset.

The next set of machine learning techniques are unsupervised learning techniques. Unsupervised learning techniques also take in data, but they don't have clearly labeled output. Consider the clustering algorithms we discussed previously: they cluster data into different groups, but they are not told in advance what the ideal clustering looks like, meaning there is no labeled output for them. Another example is anomaly detection. Anomaly detection algorithms look at a collection of data and identify data points that look dissimilar to most of the others. Here again, there is a lot of input data, but no clearly labeled output. Yet another example is Latent Dirichlet Allocation, or LDA, which is a commonly used technique for topic modeling, meaning identifying which topics a given document covers. Typically, with LDA, you have an input dataset consisting of a large set of documents. The idea behind LDA is that each document likely covers a small set of topics, and each topic itself tends to use the same set of words quite frequently.
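That last idea, topics as distributions over words, can be illustrated with a toy sketch. Note the important caveat: real LDA *learns* these topic-word distributions from the document collection; here they are hand-specified (and entirely made up) just to show how word usage points to a topic.

```python
import math

# Hand-specified topic-word probabilities, invented for illustration.
# Real LDA infers distributions like these from the corpus itself.
topics = {
    "politics": {"obama": 0.3, "trump": 0.3, "speech": 0.2, "vote": 0.2},
    "sports": {"baseball": 0.4, "game": 0.3, "score": 0.3},
}

def likely_topic(document, floor=1e-6):
    """Score the document's words under each topic's word distribution
    and return the topic that explains them best. Words a topic never
    uses get a tiny floor probability instead of zero."""
    scores = {}
    for name, dist in topics.items():
        scores[name] = sum(math.log(dist.get(w, floor))
                           for w in document.split())
    return max(scores, key=scores.get)

print(likely_topic("obama gave a speech"))    # → "politics"
print(likely_topic("the baseball game score"))  # → "sports"
```

A document full of "obama" and "speech" scores far higher under the politics distribution than under sports, which is the same intuition LDA uses when it assigns topics to documents.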
For example, we might take a large dataset of news stories published in all of the major newspapers and online news outlets, and feed that as input to an LDA algorithm. LDA tries to identify the topics these documents cover, but it is not given clearly labeled outputs, meaning the algorithm is not told that here is a document on politics and here is a document on sports, and so on. LDA, as I said, assumes that each document covers very few topics, and that each topic has a few words it uses frequently. Given an input dataset, LDA might identify that a certain topic tends to use certain words quite often. For example, it might find a topic that uses the words Obama, Trump, speech, and a few other such words frequently, but does not tend to use words like pizza or baseball. We can clearly infer that this is the topic of politics, and it is something the algorithm identifies on its own. Now, given any document, LDA looks at the kinds of words used in that document and identifies which topics it covers; it might say that the document covers sports, or politics, and so on. Once LDA has been trained on a large dataset, it can be applied to any new document, automatically classifying the document and identifying the topics in it. In this example, you see a passage that LDA might analyze. It looks at certain words used in the document, and for each of these words it identifies the topics those words are related to, for example arts, education, or children. It then identifies the set of topics the document covers.

Now, in addition to unsupervised learning, we also have the idea of reinforcement learning. Reinforcement learning usually does not take in large training datasets.
Rather, the algorithm learns by trying various actions or strategies, observing what happens, and using those observations to learn. This is a very powerful method and has been used in a number of robotics applications. It is also at the heart of software created by Google called AlphaZero, an advanced version of Google's Go-playing software, AlphaGo. AlphaGo had used a training dataset based on past Go games; AlphaZero had no training dataset. Instead, it learned the game of Go by playing against itself, and once it had played millions of games against itself, those games were in fact the training dataset it used to develop the best strategies for the game. Of course, in many settings experimentation isn't free, and so you have to balance the cost of experimentation against exploiting the knowledge you already have. Let's explore that through a reinforcement learning approach known as the multi-armed bandit.

To illustrate how bandit algorithms work, consider a setting where we have designed two different ad copies that we would like to try with our customers. We do not know which ad copy is more effective at engaging customers and attracting them to click on the ad, and we would ideally like to figure out which ad is the better one to use. Now, one way to figure this out is what is known as A/B testing. That is, we might show ad A to half the users and ad B to the other half. We might do this for some period of time, say a day, then observe which ad has the higher click-through rate and use that ad from then on. In the graph on this slide, we have two ads, ad A and ad B. Ad A has a click-through rate of five percent and ad B has a click-through rate of 10 percent, but we do not know this in advance. What we might end up doing is show ad A to some users and ad B to others.
If we show these ads in a randomized fashion to a large number of users, over time we learn that ad A has a five percent click-through rate and ad B has a 10 percent click-through rate, and from that point onwards we can use ad B. But there is a cost to this learning, because some people were shown ad A and some were shown ad B. During this learning step, the average click-through rate our ads experienced was seven and a half percent, which is lower than we would have obtained if we had chosen the better-performing ad.

Now, a bandit algorithm can do better and improve performance. The way it does this is that it starts off like any A/B test, showing ad A and ad B an equal number of times. But it observes what is happening and learns as it goes. For example, it starts to observe that ad B is doing better than ad A, and as it learns this, it starts to show ad B more frequently than ad A. It will still show ad A occasionally, so that it can correct itself in case ad A actually performs better. But over time it weights ad B more and more, and as a result, if you look at the end of the day, or in this example at the end of 1,000 sessions, the bandit-based allocation strategy ended up with a click-through rate much higher than the seven and a half percent we obtained through A/B testing. It was not quite equal to ad B's 10 percent, but it's close, because the algorithm is able to experiment, learn, and exploit that knowledge to improve outcomes. In short, a reinforcement learning algorithm is essentially an algorithm that takes actions, observes what happens, and improves its performance over time.
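This comparison can be simulated directly. The sketch below uses an epsilon-greedy strategy, one simple bandit algorithm among many (the lecture doesn't specify which one the slide used), with the made-up 5% and 10% click-through rates from the example. Setting epsilon to 1.0 makes every session a coin flip between the two ads, i.e. a plain 50/50 A/B test, so the same function covers both strategies:

```python
import random

def run(sessions, ctrs, epsilon, seed=7):
    """Epsilon-greedy ad allocation: with probability epsilon, explore by
    showing a random ad; otherwise exploit by showing the ad with the best
    observed click rate so far. Returns the overall click-through rate."""
    rng = random.Random(seed)
    shown = [0] * len(ctrs)
    clicks = [0] * len(ctrs)
    for _ in range(sessions):
        if rng.random() < epsilon or 0 in shown:
            ad = rng.randrange(len(ctrs))  # explore (or warm up each ad)
        else:
            # exploit: best observed click rate so far
            ad = max(range(len(ctrs)), key=lambda i: clicks[i] / shown[i])
        shown[ad] += 1
        clicks[ad] += rng.random() < ctrs[ad]  # simulate a click (True adds 1)
    return sum(clicks) / sessions

# 50/50 A/B test: blended rate near 0.5*5% + 0.5*10% = 7.5%.
ab_test = run(100_000, [0.05, 0.10], epsilon=1.0)
# Bandit: explores 10% of the time, so it lands much closer to ad B's 10%.
bandit = run(100_000, [0.05, 0.10], epsilon=0.1)
```

The gap between the two numbers is the cost of over-experimenting: the A/B test keeps paying for information it already has, while the bandit shifts traffic to ad B as soon as the evidence supports it.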