So we've looked at the histogram as a way of visually exploring our data. If you wanted to summarize the data numerically, let's first talk about the measures of central tendency that we can use to describe the data. We often look at the mean, or the average, of the data. Another metric you might use is the median: what is the middle value? That is, if you rank-order your data from highest to lowest, what observation falls in the middle, such that 50% of observations are greater than that value and 50% are less than or equal to it? We'll look at both of these, both mathematically and within Excel.

There are also measures of dispersion that we can use to describe the data, the most common being the standard deviation or the variance observed in the data. That gives us a measure of dispersion from the mean, so from the average value. You might also look at the range or the interquartile range to get a sense for how much variation there is in your data.

So these are the formulas: both the mathematical expression for calculating the average, and how we can derive it in Excel with the command =AVERAGE, where you enter the cell range in which your data resides. What Excel is going to do is add up all the values in that range and divide by the number of observations to come up with the average. For the median, you'd enter =MEDIAN and then the highlighted data range. One thing to note, and we'll take a look at this in an example on customer value in a second, is that the median is not sensitive to extreme values. So if you've got a lot of variation in the value of your customers, where some are really high but most are really low, the average is going to be highly sensitive to those extreme observations. The median is not, because it doesn't care what the exact values are.
It just cares about finding the point where 50% of observations are above that value and 50% are below. It doesn't matter how far above or how far below; all it is concerned with is the rank order.

All right, so here is an example where we've done some work looking at the customer base of new customers signing up for telecom service. We calculate their lifetime value, a technique that we'll look at when we examine customer analytics with regard to studying retention and driving customer value. Really, the story that we get depends on the metric that we're using. If we look at the average customer lifetime value based on this histogram, it turns out to be around $1,200. But if we look at the median customer lifetime value, that is, the point at which half my customers are worth less than this amount and half are worth more, it turns out to be only $800. That difference between the median and the mean is indicative of a skewed distribution. In this case the data is right-skewed: if you look at our distribution, as we go from left to right, from low value to high value, we see fewer and fewer observations. The data falls off, and that's indicative of a right-skewed distribution.

So which customers would we want to target? Well, if there were a way for us to identify the customers who have that high lifetime value, it would be great for us to focus our efforts on those customers. But we need some way of identifying those individuals. Maybe it's on the basis of demographic variables, maybe it's the region that they're in, maybe it's the method of acquisition. If we have additional variables that tell us something about the individuals who have that high lifetime value, then we can make an informed decision: do we want to focus on those individuals, who are very valuable but few in number, or do we want to focus on customers who may not be as valuable to us?
But maybe it's easier for us to acquire them, and there are a lot more of those customers. That's what we're going to look at when we talk about customer analytics: how hard is it to acquire these different customer groups? What's the difference in value going to be? And what are the levers that we can pull to try to attract them?

All right, looking at dispersion, again the common measure is going to be the variance. I've given you here the mathematical derivation for the variance, as well as the Excel notation: =VAR, then highlighting the data range. You'll see that in the derivation for the variance, the key piece is the observation minus the average. The reason that it's squared is that we really don't care whether it's a positive or a negative deviation from the average, or from the mean; what we're concerned with is the magnitude of that deviation. So the more dispersion there is around the mean, the higher the variance is going to be. Variance can also be reported in terms of a standard deviation, =STDEV, which is really just the square root of the variance.

Among other formulas you might use to understand how much dispersion is in your data, we can calculate the range: the largest value minus the smallest value. Here, we can use the MAX and MIN commands to calculate the size of that range. You might also look at what's referred to as the IQR, or interquartile range. This is the range that contains the middle 50% of the data. There's no built-in command to say "give me the IQR," so what we can do is use the PERCENTILE command. What we're doing in this equation is saying: let's extract from the data the value at the 75th percentile and subtract from that the value at the 25th percentile, that difference giving us the interquartile range.
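As a quick sketch, here is how all of these summary measures could be computed in Python using the standard library; the data values are made up for illustration, and note that Excel's =VAR uses the sample formula (dividing by n−1), while the population versions below divide by n:

```python
from statistics import mean, median, pvariance, pstdev

# Made-up data: mostly small values plus one extreme observation.
data = [3, 4, 5, 5, 6, 7, 9, 41]

avg = mean(data)        # sensitive to the extreme value
mid = median(data)      # only the rank order matters

var = pvariance(data)   # average squared deviation from the mean
sd = pstdev(data)       # square root of the variance

rng = max(data) - min(data)   # MAX minus MIN in Excel

# IQR: 75th percentile minus 25th percentile. This helper linearly
# interpolates between ranked values, the same convention as Excel's
# PERCENTILE command.
def percentile(values, p):
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

iqr = percentile(data, 75) - percentile(data, 25)
```

Here the mean (10) sits well above the median (5.5), the same right-skew signature we saw in the customer-value example: one extreme observation pulls the average up while leaving the median untouched.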
So these are methods for summarizing data, both in terms of central tendency, the average and the median, and in terms of dispersion. One thing that we always have to contend with are outliers in our data, those extreme observations that can heavily influence our analysis. In some cases, those outliers might be caused by coding errors. Sometimes in data you'll see that the code 9999 is used to refer to data that's missing. Well, we have to remove that before we conduct our analysis, because otherwise it's going to skew things. So when we detect those outliers, it may be a reason to dig in further and understand what's going on. It is possible that they're just mistakes, but any assumptions you're making about how you deal with those outliers should be noted in any report that you're producing.

All right, another aspect of quantitative data might be a time series, or temporal, dimension. These are time series plots that were produced using television tuning data, so data collected from set-top boxes, for two programs from, I believe, the early 2000s when this data was collected: America's Next Top Model and Pussycat Dolls Present Girlicious. The dips in the audience level, or the dips in the ratings, are caused by commercial breaks. The audience stays tuned to the program, a commercial break comes on, people tune away, and they come back when the program resumes. That's a characteristic of this type of data: audience levels decline during commercial breaks. So in any type of analysis, that's something that we're going to need to control for. Another popular type of time series data is stock performance data. We can see, do stocks or indices tend to move together, or do they tend to move apart? We'll get into looking at time series data and building forecasting models in a little bit, but just to foreshadow the direction we're going in, one component that we're going to look for is the trend in the data.
So for a new product that's launched, as time goes on, sales may ramp up. Something else that we might look at is a cyclical pattern. Is it related to seasonality? Is there a pattern here related to the general state of the economy? These are factors that we need to know about if they're influencing, in this case, sales, because if we're going to be making forecasts going forward, those are variables that have to be included in the predictions we're making.

All right, we've looked already at how we examine the relationships that may exist among categorical variables. We want to do the same thing when we work out the relationships that exist among quantitative variables. A scatter plot is a common visual way of doing that. In this particular case we're comparing the daily returns for the S&P 500 with the Verizon stock: on the x-axis is the daily return for the S&P, on the y-axis the daily return for Verizon. You'll see from the trend line that's been layered on here that there is generally a positive relationship: on days when the S&P went up, Verizon also tended to go up. So we can look at the direction of the relationship; in this case we fit a linear trend to it. It may not always be linear, and that's one thing that we're going to keep in mind. We're also going to look at how much dispersion there is. There seems to be a lot of dispersion around this line, from this point all the way down to this point, a lot of dispersion in that scatterplot. That means that if I wanted to know what the Verizon price was going to be on a given day, well, I can't just use the S&P, because it seems that's not explaining everything. When we get into talking about linear regression, one of the key pieces for us is going to be trying to reduce the amount of uncertainty that we have. The more information we have, the better a fit we should see.
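As a sketch of what fitting that linear trend involves, here are the least-squares slope and intercept computed directly; the return figures below are invented for illustration, not real S&P or Verizon data:

```python
# Invented daily returns (%), not real market data.
sp500 = [-1.0, 0.5, 1.2, -0.3, 0.8]    # x: S&P 500 daily return
stock = [-0.8, 0.4, 1.0, -0.5, 0.9]    # y: individual stock daily return

n = len(sp500)
mx = sum(sp500) / n
my = sum(stock) / n

# Least-squares slope: sum of deviation products divided by the sum of
# squared x deviations; the intercept makes the fitted line pass through
# the point (mean x, mean y).
slope = (sum((x - mx) * (y - my) for x, y in zip(sp500, stock))
         / sum((x - mx) ** 2 for x in sp500))
intercept = my - slope * mx
```

A positive slope here reflects the same story as the trend line on the scatterplot: on days when the index return is above its average, the stock's return tends to be above its average too.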
If we're looking at trying to quantify the relationship between two variables, a common technique is to look at the correlation. What the correlation does is assess the linear relationship, and that's going to be key, because not all relationships are linear. As I'll show you in a second, sometimes the correlation is indicative of the nature of the relationship; other times it might not be that informative. So we really have to conduct exploratory analysis to understand the patterns that exist between quantitative variables. Within Excel, =CORREL is the command that we use to calculate correlation: highlight the X range in the data, then highlight the Y range in the data, separated by a comma. All that we get as output is a number between negative one and one, negative one indicating a perfect negative linear relationship, positive one indicating a perfect positive linear relationship.

Mathematically, and this is something that I hope you never have to calculate by hand, this is the derivation behind the correlation coefficient. The piece that I want to draw your attention to, just to give you some intuition for what this number really means, is this set of terms. What we're doing is, for each observation, calculating the deviation from the average X value and the deviation from the average Y value, multiplying them together, and adding those products up. Well, what does a positive correlation mean? A positive correlation means that when the X value is above average, which would be given by this first term, the Y value also tends to be above average; both terms are positive. When the X value is below its average, the first term is negative; when the Y value is below its average, the second term is negative; and the product of the two is again positive.
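This deviation-product intuition can be sketched numerically; the x and y values below are made up, and the computation mirrors what =CORREL returns:

```python
from math import sqrt

# Made-up data with a strong positive linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Numerator: for each observation, multiply the deviation from the
# average x by the deviation from the average y, then add the products.
# Points above (or below) both averages contribute positively.
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Denominator scales the result so r always lies between -1 and 1.
den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

r = num / den
```

With this data every deviation product is non-negative, so r comes out very close to positive one.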
If we have a negative relationship, then when X is above its average value, Y is below its average value, and vice versa. So that's where the correlation concept is coming from.

All right, so again, we can look at this in terms of financial performance. I pulled information, this is data from 2010, on the stock performance of some telecom companies. If we look at Verizon, Comcast, and AT&T, they're positively correlated with each other, and also positively correlated with the S&P 500. So probably no surprise that these stocks and the S&P 500 generally move in the same direction.

All right, this is intended to give you a sense for the extent of dispersion that you might see depending on the level of correlation. The middle plot has an r value of 0.9: a very high correlation, a very tight cluster, looking pretty much like a straight line. Whereas when we have a lower correlation coefficient, the 0.4 on the right or the negative 0.7, you see there's more dispersion. There still is a linear relationship: if I were to superimpose a trend line, we'd capture a negative relationship, but the lower correlation coefficient is telling us that there's going to be more dispersion.

These are all examples where the data actually suggests a linear relationship. But as I said earlier, it's not always going to be a linear relationship, and that's where we need to be careful. One example of how the correlation can be misleading is when we have outliers in our data. We've got a mass of points in this case near the origin, near the lower left, and then we've got a single outlier. Well, if we try to draw a best-fitting line, because that outlier is going to influence the trajectory of that line, we're actually going to be told that, you know what, there is a strong linear relationship here. But in reality, it's that one outlier that's doing it.
If we were to discard that outlier and just look at the mass of points, we would not get nearly as strong a linear relationship. The other piece that we have to be careful about is when we're dealing with non-linearities. In this case, if we're looking at the lower right, we have a correlation of zero. Well, that doesn't mean that there's no relationship between X and Y; what it means is that there's no linear relationship between X and Y. This is a negative quadratic relationship, so Y is related to X-squared. It's a very strong quadratic relationship, but it's not a linear relationship, and that's why we're getting the correlation of zero. So doing this visual inspection, using scatter plots and histograms, is a very important part of the data analysis that we need to do. If we were to just look at the correlation coefficient and see that there's no linear relationship, we might erroneously conclude that there's no relationship at all.

So to recap the techniques that we've looked at: we've talked about techniques for examining individual quantitative variables, and we've also looked at techniques for examining the relationships among quantitative variables. Correlation and scatterplots are good for looking at the relationship between two quantitative variables. We'll come back to examining these relationships more formally when we start getting into prediction and start using linear regression as one of our go-to tools in forecasting.
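Finally, to make those two correlation pitfalls concrete, here is a small sketch with invented values: a single outlier manufactures a strong correlation in otherwise patternless data, and a perfect quadratic relationship yields a correlation of exactly zero:

```python
from math import sqrt

def corr(xs, ys):
    """Pearson correlation from deviation products (same value as =CORREL)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)
               * sum((y - my) ** 2 for y in ys))
    return num / den

# Pitfall 1: a mass of points near the origin with no real pattern...
xs = [0.1, 0.3, 0.2, 0.4, 0.0, 0.3, 0.1]
ys = [0.2, 0.1, 0.4, 0.0, 0.3, 0.3, 0.1]
r_cloud = corr(xs, ys)                      # weak (here slightly negative)

# ...plus one distant outlier, which dominates the deviation products.
r_outlier = corr(xs + [10.0], ys + [10.0])  # near 1, driven by one point

# Pitfall 2: y is exactly determined by x, but quadratically, not linearly.
qx = [-2, -1, 0, 1, 2]
qy = [-x * x for x in qx]   # y = -x^2, a negative quadratic
r_quad = corr(qx, qy)       # 0.0: positive and negative products cancel
```

The quadratic case makes the point from the lecture exactly: y depends perfectly on x, yet the deviation products cancel symmetrically, so the correlation reports no linear relationship at all.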