In this section, we'll move on to the next all-important step in our machine learning workflow, namely Exploratory Data Analysis, or EDA. So what are our learning goals for this section? We're going to discuss different approaches to conducting exploratory data analysis, which I'm just going to call EDA from here on. We'll talk about different EDA techniques, both statistical and visual. We'll use sampling to take a peek into our data, and we'll show how to produce useful visualizations when we're actually doing our EDA.

So what is exploratory data analysis? EDA is the approach of analyzing data sets to summarize their main characteristics, often with visual methods and, as we'll see, with statistical summaries as well. Why is EDA useful? I want you to think of it as your initial conversation with the data before getting started, a getting-to-know-you phase with your data set. This will determine whether the data we're looking at actually makes sense, whether it needs further cleaning, or whether more data is needed. EDA will also help us identify patterns and trends in the data set. Sometimes these can be as important as, if not more important than, the eventual findings from the modeling.

Some summary statistics we typically compute during EDA are the average, the median, the min, the max, and the correlations between different columns. We can also visualize our columns: histograms to see a distribution, scatter plots to see the correlation or relationship between two different columns, box plots again to look at a distribution and to identify outliers, and many others. When we're doing data wrangling or computing those summary statistics, we're generally going to be using the pandas library, and for visualization we will rely on the libraries Matplotlib and Seaborn.

As an example of EDA with summary statistics, suppose we want to examine the characteristics of our job applicants. We can take the average of all interview scores, or maybe break it down by city or job function so that we can compare applicants' scores to the averages of their direct competitors. We can look at maximum values, or here more like the mode, to see which words are most common across all of the applicants' application materials. Finally, we can look at correlations, for example between technical assessment scores and the years of experience of our applicants, perhaps again breaking it down by type of experience so that we're comparing applicants against their direct competitors. This will allow us to see if there's any initial relationship between different columns that may be worth being aware of before doing any modeling.
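To make that concrete, here is a minimal sketch of those summary statistics and plots in pandas and Seaborn. The tiny DataFrame and its column names (city, years_experience, interview_score) are hypothetical stand-ins for the applicant data described above, not the course's actual data set.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical applicant data standing in for the data set described above.
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston", "Boston", "Austin"],
    "years_experience": [1, 4, 2, 7, 3, 5],
    "interview_score": [62, 80, 70, 91, 75, 84],
})

# Summary statistics: average, median, min, max.
print(df["interview_score"].mean())
print(df["interview_score"].median())
print(df.describe())  # count, mean, std, min, quartiles, max in one table

# Break the average down by city to compare applicants to direct competitors.
print(df.groupby("city")["interview_score"].mean())

# Correlations between numeric columns.
print(df[["years_experience", "interview_score"]].corr())

# Histogram: the distribution of a single column.
sns.histplot(data=df, x="interview_score")
plt.show()

# Scatter plot: the relationship between two columns.
sns.scatterplot(data=df, x="years_experience", y="interview_score")
plt.show()

# Box plot: distributions by group, useful for spotting outliers.
sns.boxplot(data=df, x="city", y="interview_score")
plt.show()
```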
So let's discuss sampling from DataFrames for just a second. There are many reasons to consider taking random samples from our DataFrames. For larger data sets, we may want to shrink down to a random sample if computation on the full data set would take too long. We may want to train models on a random sample of the data and hold out another set for testing later on; we'll see more efficient ways of doing that in code later on. We may also want a sample that is indicative of the proportions of the different observations of our outcome variable. Now, what do I mean by this?

Let's say that you are working with a data set that's supposed to determine whether or not a certain person has a terrible disease, and only one percent of the entire population has that disease. If you take a sample, you want to ensure that you keep that one percent proportion in your sample data set. You want a similar proportion; you want to ensure that you don't end up with a sample that has nobody with the disease, which is possible, or one with an overrepresentation of the disease. That's where what's called stratified sampling may come into play.

Now, looking at the code on the left, what is it going to do for us? First, we are assuming that data is a pandas DataFrame, and a pandas DataFrame has the method sample; a method is just the word for a function that's available on a certain Python object. That sample method allows us to take a random number of rows from our DataFrame. Here we're saying n equals five, and we're also passing the argument replace equals False, which is actually the default value. If replace were True, then the same row could show up more than once, so replace equals False ensures that every row shows up only once in our sample. Then we print: we take our sample, which now has only five rows, and use .iloc to grab all of the rows, which is what the colon means, and only the last three columns, since negative three colon means from the third-to-last column through to the end. We get the output we see here on the right, with petal length, petal width, and species, which were the last three columns, for the five random rows in our sample.
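The slide code itself isn't reproduced in the transcript, so here is a reconstruction based on the description above. I'm assuming the data is the classic iris data set, since the output's last three columns are petal length, petal width, and species; it's loaded via Seaborn purely for convenience. The stratified example at the end is a separate sketch using pandas' grouped sampling, not code from the slide.

```python
import seaborn as sns

# Assumed data source: the classic iris data set, whose last three
# columns are petal_length, petal_width, and species.
data = sns.load_dataset("iris")

# Take 5 random rows. replace=False (also the default) guarantees
# each row appears at most once in the sample.
sample = data.sample(n=5, replace=False)

# .iloc[:, -3:] keeps all rows (the lone colon) and only the last
# three columns (-3 through the end).
print(sample.iloc[:, -3:])

# Sketch of stratified sampling (not from the slide): sample the same
# fraction from each species so class proportions are preserved.
stratified = data.groupby("species", group_keys=False).sample(frac=0.2)
print(stratified["species"].value_counts())
```

For a true train/test split that preserves class proportions, scikit-learn's train_test_split with its stratify argument accomplishes the same thing, and we'll see that kind of tooling when we get to modeling.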