Hopefully you've managed to get Anaconda Navigator installed. Once you have, start up Jupyter Notebook by clicking Launch in the Notebook tile, and that should eventually bring you to a screen where you can navigate to specific files. In this one, we're going to look at some examples of how to manipulate a dataset with pandas. You can fire up the notebook by clicking on that link, and here we're going to explore the basics of pandas with an example dataset from the Chicago Data Portal; you can see the link to that here. I encourage you to take some time on your own to explore that portal and find some interesting datasets of your own.

Jupyter notebooks are divided up into areas called cells. A cell typically contains either Markdown text, like you see at the top here, which you can edit by double-clicking on the text and typing, or code. To execute a cell, you typically hit Shift+Enter. If the cell is text, that doesn't do much interesting, but if there's code sitting in the cell, the Python interpreter will actually execute it.

Pandas has a function that allows you to load a comma-separated value (CSV) file. What I've done here is provide a commented-out version that would download the file directly from the website via URL, but I don't want to do that every time I run this notebook, so I've saved the file locally and we can just load the local version. So here, by hitting Shift+Enter, you can see that we've loaded that CSV file into a DataFrame.

What is that particular file we just loaded? It comes from the City of Chicago Data Portal, and you can see a description of it here. You can look at any of the city's datasets, and if you click on Export, you can get a CSV version of that dataset. So all I did was download that CSV locally and load it into the DataFrame.

The first thing we might want to do after loading a CSV file is determine what's actually in the data, so I can again execute a cell here, showing me the first five rows of the DataFrame. We can see that we've got station identifiers, the names of each station, the date to which each record pertains, something called day type with a value like 'W' (what's that? we'll look at it in just a minute), and then the number of rides that occurred at that station on that day. What's day type? If we go back to our website here, we see the day type is described right there: 'W' is a weekday, 'A' is a Saturday, and 'U' is a Sunday or a holiday. So immediately, from looking at just a few rows, we can get an understanding of the type of data we're looking at.

Once we've loaded that data, we might want to learn some things about it, like how big it is and how many rows it has, so we can use the shape attribute to briefly explore that. It tells us that there are almost a million rows in this dataset and five columns; we know what the five columns are, because we just looked at them. You can use the describe function to learn a little more about what's going on in particular columns of the dataset. This is interesting because, among other things, it tells us the max and the min, and in particular we're looking for outliers. There are some stations and dates for which zero rides occurred. That's kind of interesting; we'll have to look into that further.
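In code, the loading step might look like the minimal sketch below. The local filename is just a placeholder, and the commented-out line stands in for the portal's CSV export URL rather than reproducing it.

```python
import pandas as pd

# Option 1: download directly from the Chicago Data Portal every run
# df = pd.read_csv("<CSV export URL from the portal>")

# Option 2 (used in this notebook): load a locally saved copy
# (the filename here is a placeholder)
df = pd.read_csv("cta_daily_station_entries.csv")
```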
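And the quick first look described above could be sketched like this, assuming the export uses column names along the lines of station_id, stationname, date, daytype, and rides; each expression would be its own notebook cell so its output displays.

```python
# First five rows: station identifiers, station names, date, day type, rides
df.head()

# Size of the dataset: (number of rows, number of columns)
df.shape

# Summary statistics for the numeric columns; note the minimum of zero
# rides and a maximum roughly 10x the mean -- both worth a closer look
df.describe()
```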
Everything else here looks roughly as we might expect. You can also get a sense of basic statistics on your dataset: for example, the maximum number of rides at a particular station on a particular date was 36,000, about ten times the mean and the median, so we'll have a look at what those stations are in just a minute. You can also count how many times a particular value appears in your dataset, so you can see, for example, that a bunch of these stations show up a lot. Quick math will tell you that that count is essentially the number of dates in the period of time to which this dataset pertains.

Another thing pandas lets you do is select values based on a conditional, so you could say: give me the station with the most rides. Okay, interesting: the station with 36,000 rides was Belmont North, and that occurred on June 28th, 2015. Interesting. Is that an outlier, or did something real happen? We'll leave that as an exercise for you, but you can do a quick web search on that particular date, try to find out what happened in Chicago, and see if you can explain it.

Let's look at the stations with the fewest rides. Boy, there are more than 12,000 station/date combinations for which there were no rides. We're going to need to dig into that one a little further. A first cut suggests that many of these are weekends or holidays, but there are some weekdays in here too. So what's going on? We're going to have to dig in a little more to understand what's really happening.

We already have a DataFrame for all the station/date combinations where there were zero rides on that particular date. We can group by station name and type of day to figure out which of these combinations are most common, and some interesting things pop out of this dataset. For example, you can see some stations at the top, like Madison/Wabash, for which there are a lot of dates with zero rides. Again, we have no idea what's going on yet; we still need to dig in further, and as we work our way through the dataset, some of these anomalies will become a little clearer. But hopefully you're already seeing the value of exploring the data in a little detail, so that you understand exactly what's going on in a particular dataset before you dive in further. It should be clear, just from a little research on the web, why some of these stations at the top report zero rides on so many dates, so I would encourage you to do a little homework on some of them to find out exactly what's going on.

Now I'm going to dive in a little further and look at some temporal patterns. First, let's figure out what date range we're dealing with by looking at the min and the max of the dates in this dataset. A quick exploration determines that we're going from January 1st, 2001 all the way through December 31st of last year; those dates are in our dataset, so we have all the rides from that period. But if we want to actually do some operations on that time period, what we first need to do is tell pandas that the date column is of type datetime, and make that column the index for our dataset; that allows us to do various kinds of temporal operations on the DataFrame. This particular operation of creating the index does take a little bit of time. Okay, so that took a little while, and now we want to see what it did to our data.
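The counting and conditional-selection steps above might look like this sketch, again under the assumption that the columns are named stationname and rides:

```python
# How often each station name appears; for most stations this is simply
# the number of dates covered by the dataset
df["stationname"].value_counts()

# Conditional selection: the row(s) with the most rides...
df[df["rides"] == df["rides"].max()]

# ...and every station/date combination with zero rides
no_rides = df[df["rides"] == 0]
len(no_rides)   # more than 12,000 of them, per the discussion above
```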
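The group-by over the zero-ride DataFrame could then be sketched as:

```python
# Count zero-ride dates per (station, day type) combination,
# most common combinations first
(no_rides
    .groupby(["stationname", "daytype"])
    .size()
    .sort_values(ascending=False))
```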
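And the datetime conversion and re-indexing, the step that takes a little while on nearly a million rows, might look like:

```python
# Parse the date strings into proper datetimes so comparisons are
# chronological rather than lexicographic
df["date"] = pd.to_datetime(df["date"])

# Check the full date range covered by the dataset
df["date"].min(), df["date"].max()

# Make the date the DataFrame's index to enable temporal operations
df = df.set_index("date")
```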
So lo and behold, you can see that the first column, the index, is now a date, and we can use that index to start doing some temporal analysis. The first thing we might want to do is sort all this data by date, and there, nicely, you can see that all of our January 1st, 2001 data now comes first.

Now we can do some quick sanity checking to see what this data looks like for a couple of different CTA stations. Here are the first two weeks of data from the Garfield Red Line stop. You can see nicely that we've got holidays, weekdays, Saturdays, and Sundays, and things look roughly copacetic: on Sundays and holidays, ridership is lower than it is during the week, exactly what we might expect. Looking at the Green Line Garfield stop, you can see that ridership is significantly lower, so already we've learned something.

Looking at two weeks is all well and good for sanity checking, but if we want a deeper understanding of what's actually going on, we might want to visualize some of this data. For this, we'll quickly use the Seaborn plotting library to have a look at the Green Line rides at the Garfield station. And lo and behold, it looks all right, but what's that? Interesting: there's a huge spike in ridership in 2013. Is that a problem with the data, or did something actually happen? You could certainly go do some more web searching and find out, but first let's have a look at the Red Line ridership during that period. Interesting: it's higher, as we saw before, but then, lo and behold, during that same time period, the Red Line ridership dropped to zero. One might guess from looking at these two plots that maybe the Red Line station shut down during that period, and you can do some homework and figure out that that is in fact what happened; here's some documentation showing exactly what it was: the Red Line South Reconstruction Project.

Now, if you remember, way back at the beginning we were looking at the Madison/Wabash station, and there were a whole bunch of zero values in that data. Temporal analysis can again help us figure out what's going on there, and you can see that there's a period of time when ridership essentially dropped to zero. How do we know whether this is a glitch in the dataset or a real event? Some simple web searching can help you figure out the answer, and in fact this Wikipedia page essentially tells you that the station closed in March of 2015, and there you have it: it's in the data as well.

So hopefully, from this lesson, you've got a good sense of why it's so important to look at your data before you drop it into a model, and how libraries like pandas and Seaborn can help you look at and understand the properties of the data you're dealing with.
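For reference, the sorting, slicing, and plotting steps from this walkthrough might look like the sketch below. The station name strings are assumptions about the portal's naming (for instance, 'Garfield-Dan Ryan' for the Red Line stop and 'Garfield-South Elevated' for the Green Line stop); check them against your own copy of the data.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Sort by the datetime index so date-range slicing behaves as expected
df = df.sort_index()

# Sanity check: the first two weeks of data for the Garfield Red Line stop
# (station names here are assumed; verify against the dataset)
garfield_red = df[df["stationname"] == "Garfield-Dan Ryan"]
print(garfield_red.loc["2001-01-01":"2001-01-14"])

# Daily ridership at the Garfield Green Line stop; the 2013 spike lines up
# with the Red Line South reconstruction closure discussed above
garfield_green = df[df["stationname"] == "Garfield-South Elevated"]
sns.lineplot(x=garfield_green.index, y=garfield_green["rides"])
plt.show()
```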