Okay, in this lecture we'll see how to put some of what we've seen about recommender systems into practice, by trying to build out the simplest possible recommender system that recommends products based on just the Jaccard similarity. Okay, so as usual, we'll start by reading in some data. For this example, I'm going to start using a slightly larger dataset than what we've seen before. I'm going to look at Amazon musical instruments dataset. Of course, you could also run on one of the other datasets, like the gift card datasets if running time is an issue for this exercise on your own personal computer. But otherwise, we're just reading the dataset in exactly as before, this larger musical instruments data set has exactly the same format as the others. Okay, so our goal for this task is going be to make recommendations of products based on users budget histories. So to do so, we're ignoring anything like ratings or reviews. We're just looking at this collection of reviewer IDs and product IDs. So which item IDs were consumed by which users or user IDs. So out of our data set, we only care about those two fields, reviewer ID and product ID. Okay, so the first thing we're going to need to do is to go through that dataset and build these data structures, that is represent for each users, for each user, which items did they consume, and for each items, which users consumed that item. Okay, so we build these two sets, users per item and items per user. So users per item says which users purchase each item and items per user says which items were purchased by each user. Okay, and now we go through our dataset, and we extract the user ID and the item ID for each instance. And then we just add the user to the corresponding user set, and add the item to the corresponding item set. Okay, and we're also going to build this dictionary of item names, just so we can actually interpret the results we produce a little bit later. Rather than having item IDs, we'll actually be able to say, what was the item being recommended. Okay, so going back to the previous lecture, the data structure we just built, users per item and items per user is just that set, U subscript i or I subscript u. All right, now what we're trying to compute is the Jaccard similarity. So to take this equation for the Jaccard similarity which I gave in a previous lecture and write that in code, it's fairly straightforward. We first compute the numerator, which is the intersection between one set and another set. And then the denominator, which is the union between that set and the other set. And we return the numerator divided by the denominator. Of course, this is just a generic implementation of the Jaccard similarity. It says nothing about users or items yet. It's just Jaccard similarity between two arbitrary sets. All right, so what does the recommended system we're going to build consist of? So we want to recommend items similar to a candidate item based on the Jaccard similarity. So our strategy is going to look like this. So we're given some input item i, we're going to first find the set of users who purchased i, then we'll iterate through all other items to figure out which ones are most similar. Every other item, we'll compute the Jaccard similarity with i. We'll sort all those Jaccard similarities in decreasing order of similarity and we'll return the most similar one or several of the most similar ones. That will be our recommended system. Okay, so let's go ahead and try to implement that. Okay, so this is our function most similar. We're just going to start by storing this list of items similarities for all other items, given a query item i. So the first thing we do is to take this user set, which is all of the users who purchased item i, and then we're going to iterate through all other items. We'll skip it if it's the same as the current item we're using as a query. Then we will compute the Jaccard similarity between this new item and the query item. We'll append that to our list. We'll sort our list and then we'll return the most similar items. But we'll sort in reverse because we want the most similar items to be first. Okay, and this function just returns the ten most similar items. [BLANK AUDIO] Okay, so we can use this function to make recommendations. So the query to the system is just going to be a product ID. So we can take some element from the dataset. I've just highlighted the name of this element, so this is something like a record cleaning brush, given that it's in the musical instruments category. So we just take the ID of that item and we're going to run our function to compute the most similar items to that query. What does this output? It outputs a ranked list of Jaccard similarities followed by item IDs. So this is the ranked list of the top ten most similar items, according to their Jaccard similarity with the query item. Okay, and just try to interpret it. We can say, what was the query? It was a record cleaning brush. And what are the items that are considered to be most similar to the query in terms of their Jaccard similarity? And maybe for some of them, it's a little unclear, but they seem to be things that are at least related to record players. So it's saying, there's a lot of overlap between people who purchased the query item and these other ten items that were recommended. That's essentially how a similarity based recommender is going to work. Okay, so that's a first stab at implementing this kind of thing. Unfortunately, its implementation is not very efficient. So what's really slow is that every time we make a recommendation, we have to iterate through all of the items in the dataset. So potentially iterating over millions of items, computing Jaccard similarities for all of them, just to make a single recommendation. Okay, so going back to my previous description of what's going on, the slow part here is that we have an iteration over every single item other than the query item. So we can do this more efficiently as the vast majority of items are going to have no overlap or have zero Jaccard similarity with the query, and we can figure that out in advance. Okay, so here's the more efficient way we can go about computing the similarity function. It's going to be sufficient to iterate over those items that were purchased by at least one of the users who purchased item i. That's going to be the complete set of items that could possibly have non-zero Jaccard similarity. This is what our updated strategy will look like. We'll first find the set of users who purchased i. We'll iterate over all of those users, and then we'll build the candidate set of possible items, items j, if you like, by taking the union of all of the items those users consumed. And then for all the items in this set, like before, we'll compute the Jaccard similarity with i. We'll sort by that Jaccard similarity, and we'll return the most similar items. So the point is that the union of the sets of all the items purchased by users who purchased i, could not possibly be larger than the full set of items. And in practice, will usually be much, much smaller. Okay, so here is our more efficient Implementation. All I've really done to change it, is I've added this new data structure which is the candidate set of items. And I build that data structure by iterating in the next line over all of the users who purchased the query item, i, and just take in the union of the items per user for that user with my candidate set. And that just gives me a list of all of the items that could possibly have non-zero Jaccard similarity. Other than that, the code is the same. Instead of iterating through all items, I iterate through this candidate set, compute Jaccard similarities, pin them to a list, sort the list in reverse order and return the ten most similar. So you can confirm for yourself that this returns exactly the same recommendations, but it does so much more quickly than the previous implementation. Okay, so that's about it. This is a very simple similarity base recommender system based on the Jaccard similarity, and we showed some efficiency issues and how we can make our solution more efficient. So on your own, I suggest making some modifications to this code. In particular, we built a system that says, given a query item, recommend other items that are similar to the query. You can do the same thing for users. So perhaps given a user, recommend other users who are similar to this user. Alternately, you might imagine recommender systems that do things like make recommendations of the form, you'll like x because you liked y. So maybe we can take a given user and our query could be an item that they have given a high rating to. Okay, and then you could find recommendations that are similar to that query. So you could adapt a recommender system so that it takes a user as an input rather than an item as an input and makes recommendations of items that are similar to things that user liked. So I'd suggest taking that code and trying to make this kind of modification to it.