Hello, this is a video guide to Assignment 2. This assignment is new in the on-demand version of this course; there wasn't a corresponding assignment in the previous version. It corresponds to the work on content-based recommenders. In our previous course we had an exercise where people used content-based recommenders, but we felt it was more valuable to have an exercise where you actually implement some of the content profiles yourself, which you can do either by hand or, as we would recommend, in a spreadsheet program. In this video, I'll give you a brief introduction to the assignment and show you some of the basics for manipulating the data that we'll give you in a spreadsheet.
Assignment 2 starts with a very basic scenario. You have a set of documents, a set of topic content attributes, and a data set describing which documents express which content attributes. There's a spreadsheet file, which I'm bringing up here, that you can open in your favorite spreadsheet. It will work in Google spreadsheets, and it will also work in Excel or pretty much any other spreadsheet. I'm showing it to you with the data in that file.
So if you take a look down the left, you see the names of these very informatively named documents, 1 through 20. And you'll see ten attributes across the top: whether the documents are about baseball, economics, politics, Europe, Asia, soccer, war, security, shopping, or family. We're using a very limited set of content attributes specifically because we want you to be able to do these exercises by hand.
You have two other pieces of data in the spreadsheet we're providing you. On the right, you have two content profiles. These represent whether user one or user two saw and liked, saw and didn't like, or didn't see each of the 20 documents. I'll come back to that in one second. We also have, near the bottom of the sheet, a very simple summary, DF, that counts the number of documents that each of these concepts or terms was found in.
So by looking at DF we can quickly discover that Europe, for instance, appears in 11 of our documents and is the most common concept, while baseball, which appears in only four documents, is the least common concept in these particular articles. For your purposes, you can think of these as news articles that might have been shown to somebody using a tool like Google News.
Okay, let's explain the numbers. The ones and zeros in this matrix represent whether the concept is found in the document. So if we look at document one, I'll highlight it, we're going to see that this is a document about baseball, politics, Asia, soccer, and family. You can imagine a news story that might tie all of these things together, like perhaps some Japanese prime minister candidate who is talking about how it's important to build more outdoor fields so that families can play sports like baseball and soccer.
If you had read this article and liked it, then in your profile, as user one has over here, you would have a positive one. User one found this to be an interesting article. We don't know why; we just know that user one found it to be interesting. User two also read this article and found it uninteresting. You could imagine that user two might have been presented with the header for the article and swiped it away to say, I don't want that, I want to read something else. On the other hand, user two liked document two, which you can see here reflects a story about economics, politics, Europe, and security. Again, you can come up with something about how Germany is deciding politically what part of its budget to allocate to fighting terrorism across Europe. That seems to be the sort of thing that user two is interested in. Interestingly, user one did not like that article. The whole point of content filtering is to build up a profile of the things that somebody likes or doesn't like, and to use that to help recommend or predict their liking of other items.
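If you would rather see the sheet's layout as code, here is a minimal Python sketch of it. It includes only the two documents described so far (document one and document two) and the two users' reactions to them; the other 18 rows are left out, so the DF values it prints are for this two-row excerpt, not the real sheet. The names `ATTRIBUTES`, `docs`, `to_row`, `matrix`, `ratings`, and `df` are illustrative choices, not anything from the assignment files.

```python
# Attribute order as read across the top of the sheet.
ATTRIBUTES = ["baseball", "economics", "politics", "Europe", "Asia",
              "soccer", "war", "security", "shopping", "family"]

# Two rows taken from the walkthrough; the other 18 documents are omitted.
docs = {
    "doc1": {"baseball", "politics", "Asia", "soccer", "family"},
    "doc2": {"economics", "politics", "Europe", "security"},
}


def to_row(topics):
    """Turn a set of topics into the 0/1 row the sheet would show."""
    return [1 if attribute in topics else 0 for attribute in ATTRIBUTES]


matrix = {name: to_row(topics) for name, topics in docs.items()}

# Ratings: +1 = saw and liked, -1 = saw and disliked, 0 (or missing) = not seen.
ratings = {
    "user1": {"doc1": 1, "doc2": -1},
    "user2": {"doc1": -1, "doc2": 1},
}

# DF: for each attribute, the number of documents that contain it (column sums).
# With only two rows here, these are NOT the DF values on the real sheet.
df = [sum(row[i] for row in matrix.values()) for i in range(len(ATTRIBUTES))]

print(matrix["doc1"])  # [1, 0, 1, 0, 1, 1, 0, 0, 0, 1]
print(dict(zip(ATTRIBUTES, df)))
```

The point of the sketch is only the shape of the data: one 0/1 row per document, one rating column per user, and DF as a column count.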
I want to introduce you to a couple of quick things that those of you not very familiar with spreadsheets will find useful. One of them is the idea of being able to make a copy of an entire spreadsheet and work with it again later. So, if you click here in the top left (this works in Excel, and it works just as well in Google) and copy, which I'm doing by hitting Ctrl+C but you could also go to Edit, Copy, and then go to another sheet, click in there again, and just say paste, you'll have another copy of the spreadsheet to work with. You'll probably want multiple copies because we're going to have you do multiple different analyses. Now, sheet one and sheet two are sort of horrible names, so this might be the sheet that I would label, you know, first problem. And I'll leave my original spreadsheet alone so that I don't do anything to mess it up.
Next, I want to show you how I could do the correlations or dot products among things that are on a spreadsheet. We're going to start with dot products, because one of the first things we're going to want to do in this assignment is to figure out, gee, how much does user one like baseball? Or, why don't we pick something closer on the scale; we'll pick how much does user one like family. So I'm going to create a spot on here, which I'm just going to call user one, and come here and say, well, what I really want to know is how much does user one like family? The way I'm going to do that is this: every time user one liked an article, if it was about family, I'm going to add one to user one's profile. So if there's a one here and a one here, I'm going to add one. Every time user one disliked an article, that's minus one. If there's a minus one here that says he or she disliked this article about family, I'll subtract one. That, simply put, is a dot product. I'm going to multiply the two vectors. If either one is zero, and for this purpose a blank is going to be zero, I'll add zero.
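The multiply-and-add just described can be sketched in a few lines of Python. This is only an illustration: the five-document columns below are made up, not the assignment's data, and the helper name `dot_product` is a hypothetical choice. Blank cells are assumed to have been turned into 0 already.

```python
def dot_product(xs, ys):
    """Multiply two equal-length lists element by element and add the results."""
    if len(xs) != len(ys):
        raise ValueError("both ranges must have the same length")
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical five-document excerpt, NOT the assignment's data.
family_column = [1, 0, 1, 1, 0]    # 1 = the document is about family
user_ratings = [1, -1, 0, -1, 1]   # +1 liked, -1 disliked, 0 = blank / not seen

# liked family article (+1), unseen family article (0), disliked family article (-1)
family_score = dot_product(family_column, user_ratings)
print(family_score)  # 0
```

A like on a family article adds one, a dislike subtracts one, and anything with a zero on either side contributes nothing, which is why the likes and dislikes here cancel to 0.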
And I can do that using a function in this spreadsheet called SUMPRODUCT, which will be useful for you in this assignment, but also in later assignments. Any good spreadsheet will give you a little bit of help: it says, give me a list of arrays, and it will calculate the dot product for you. So I'm going to come here and say, what I really want is this range, multiplied by the corresponding range over here. When I do that and scroll back down to the bottom, you'll see I got a 0, and I got a 0 because it turns out in this case I have 1 minus 1. And I have no expression of family, and 3 of these at 1 plus 1.
There's one other trick in spreadsheets that you're going to find very useful. If you've never used a spreadsheet before, this is a trick that you'll find helpful all the time. This cell is lovely, but it doesn't help you if you have to use this formula over and over and type it in each time. What we'd like to be able to do is copy this cell over here and have it do the same thing, but it's not quite going to work. The reason it's not going to work exactly right is that in a spreadsheet, when you copy a formula left or right or up or down, it automatically adjusts its references left, right, up, or down. And many of the things we're comparing against should not move. So think about what we want this cell to be when we ask the same question but, instead of family, we want to look at shopping. I'm just going to copy it and paste it here, and we'll see the problem. I got a 0 for shopping, but I got a 0 for shopping because it took J2 through J21, and it also multiplied that by M. And M is a blank column that's all 0s. What I really wanted is for that to stay as N. So I insert a dollar sign before the N, giving something like =SUMPRODUCT(K2:K21,$N2:$N21). I don't want a dollar sign before the K, because I want the K to move over to J. But I do want the dollar sign before the N.
It's not going to change anything in this cell, but when I copy it and paste it, suddenly I'm getting a 1 here. And if I check this out, I'm going to see, well, no shopping, positive shopping, no shopping, no shopping, no shopping; all of my shopping adds up to plus 1. Now I'm going to leave it as an exercise for you to do the same thing with user two. But I will tell you that the dollar sign works not only on the letters but also on the numbers, if you wanted to say, look, I'm going to use this formula in other rows, but I always want it to use rows 2 through 21, because that's where my base data is. Then I put dollar signs in there and that will work fine. If I take those dollar signs out, then when I move down, it will use lower rows of data when it computes.
There's one last function I want to show you, though you won't be able to use it until you get through the first part of this assignment. In the first part of this assignment you are going to build these profiles. And as you see, the first version of these profiles is just counting up: how many articles did you like with this attribute? How many did you dislike? Then we're going to ask you to figure out which articles are better or worse, based on that taste profile. One simple way to do that would be to multiply this row of tastes against each document row. Well, you know how to do that. That's a dot product. You can use SUMPRODUCT. But we might ask you to normalize it. In fact, we will in the second step. When we ask you to normalize it, you're going to have two choices. You can normalize it by hand: you can go through each of these rows and say, I've got to find the length of this vector, which is basically the square root of the number of ones in it. So for document eight there are four ones, and then I've got to scale this down so that each of these is one-half, because that would be dividing it by the square root of four.
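Here is a small Python sketch of that by-hand normalization, followed by scoring a document against a taste profile. Both the document row and the profile are made up for illustration; the row simply has four 1s so that each one scales to one-half. The names `unit_normalize` and `dot` are hypothetical, and exactly where the assignment applies normalization is spelled out in its own instructions, so treat this as the scaling step only.

```python
import math


def unit_normalize(row):
    """Divide a row by its vector length; an all-zero row is returned unchanged."""
    length = math.sqrt(sum(value * value for value in row))
    if length == 0:
        return list(row)
    return [value / length for value in row]


def dot(xs, ys):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical document row with four 1s (NOT copied from the sheet).
doc_row = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
normalized_row = unit_normalize(doc_row)  # each 1 becomes 1 / sqrt(4) = 0.5

# Hypothetical taste profile: likes minus dislikes for each of the ten attributes.
profile = [2, -1, 0, 3, 1, 0, -2, 1, 1, 0]

raw_score = dot(profile, doc_row)                 # -1 + 0 - 2 + 0 = -3
normalized_score = dot(profile, normalized_row)   # half of the raw score

print(normalized_row)
print(raw_score, normalized_score)
```

The effect of normalizing is that a document tagged with many attributes does not get a large score just for having many tags; each of its tags counts for less.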
You could do that by hand, you could compute it here, or you could recognize that a correlation is a dot product that's already normalized. And if you wanted to use the correlation, there is a function here, and I'm just going to do something, not for any particular purpose, just to show you. Say, how much are these two documents, doc1 and doc2, the same? Actually, doc1 and doc2 are really different, aren't they? Let's find ones that are a little more similar, like doc1 and doc16. I could say, give me the correlation of this vector and this vector, and I'll see that there's a correlation of 0.21. That function is built in as well. Just so that you see it at the top, that's correlation: C-O-R-R-E-L. You give it the two vectors you want to correlate. They can be vertical, they can be horizontal.
Okay, back to the assignment. You're going to be asked to do three things in this assignment, and to submit your results as individual answers naming the top-ranked articles for each profile. You will have six final profiles. Your first two profiles will be user one and user two, doing nothing but counting up likes and dislikes. Your second two will deal with the question of normalization. And your third two will deal with the inverse document frequency concept from TF-IDF, the idea that perhaps terms that appear infrequently, like baseball, are more important than terms that appear frequently, like Europe; at least more important in that when they show up they should get higher weight because they're rarer. We'll take you through each of these step by step in your instructions. In each case we're going to ask you to return the top two selected documents, and we're looking simply for correctness. Along the way we'll give you some intermediate results to help you make sure that you're doing things the correct way.
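To make the inverse document frequency idea concrete, here is a rough Python sketch. It uses the two DF values quoted earlier (Europe in 11 documents, baseball in 4) and assumes the weight is simply 1/DF; that formula is an assumption for illustration, and the assignment instructions define the weighting you should actually use. The profile and the helper `weighted_score` are also hypothetical.

```python
# DF values quoted in the walkthrough; the other eight attributes are omitted.
df = {"Europe": 11, "baseball": 4}

# ASSUMED weighting: 1 / DF. Follow the assignment instructions for the real formula.
idf = {term: 1.0 / count for term, count in df.items()}


def weighted_score(profile, doc_terms, idf_weights):
    """Add up profile value * IDF weight over the terms a document contains."""
    return sum(profile.get(term, 0) * idf_weights[term] for term in doc_terms)


# Hypothetical profile that likes both topics equally.
profile = {"Europe": 2, "baseball": 2}

europe_only = weighted_score(profile, {"Europe"}, idf)      # 2 * (1/11)
baseball_only = weighted_score(profile, {"baseball"}, idf)  # 2 * (1/4)

print(europe_only, baseball_only)
```

Even though this made-up user likes both topics equally, the baseball document scores higher, because baseball is the rarer term and so carries more weight when it does show up.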
If you have questions about using spreadsheets, or questions about the assignment that perhaps other students or we can help you with, there will be a discussion thread attached to this video, and we invite you to post your questions right in the discussion thread.