Why do we need descriptive statistics?

By the end of this video, you should be able to articulate the purpose of descriptive statistics and explain how statistics can help us distinguish meaningful patterns from anomalies. In addition, you should be able to define exactly what is meant by a variable in a statistical context.

Purpose of descriptive statistics.

Take a look at the data displayed on this slide. It contains information about Facebook membership for a whole bunch of countries, including the total number of members, number of male members, and number of female members. Even though this data set is fairly small and organized in a fairly reasonable manner, it is still difficult to identify patterns and draw conclusions. It is hard to compare Facebook membership in, say, India with membership in Japan.

Now, imagine a much, much bigger data set. Government agencies and private sector organizations assemble data sets that contain billions and sometimes trillions of observations. We can learn very little about these data sets by staring at spreadsheets of raw data. Instead, we need to use statistics and visualizations to summarize the information in raw data sets.

At this point, you might be wondering, why can't we simply select interesting observations from an unwieldy data set and examine those cases in more detail? Well, we could do this, but we would not know if we were examining representative or unrepresentative observations. The use of statistics and data visualizations helps us distinguish between interesting anecdotes and meaningful patterns.

Suppose, for example, that you're interested in examining whether college dropouts generally go on to have successful careers. And it occurs to you that Mark Zuckerberg, the co-founder and CEO of Facebook, might be an interesting case study. Well, indeed, Mr. Zuckerberg has a fascinating career story that involves him dropping out of Harvard. But it's doubtful that Mr.
Zuckerberg's experience is representative of the average college dropout's experience. If we were to assume that Mr. Zuckerberg is an average college dropout, our concluding recommendations about whether or not students should, generally speaking, drop out of college would be quite wrong. Likewise, if we were interested in examining the e-commerce industry, it would be inaccurate to assume that Amazon is representative of the typical e-commerce company. If we wanted to understand what makes an e-commerce company successful and what does not, we would need to examine many different companies and learn from a data set that includes substantial variation.

So instead of relying on interesting examples that are unlikely to be representative of the population we care about, we are going to use statistics and data visualizations to discover meaningful patterns, patterns that allow us to draw accurate conclusions and generate actionable recommendations.

Data sets and variables.

Before diving into descriptive statistics, which we'll do in the next video, let's define some key terms. We've used these terms before, but let's spend a minute or two discussing exactly what they mean. A data set is composed of one or more variables. A variable is an empirical measure of a concept we care about studying. In your data set, each variable should have a unique name. It would be extremely confusing if two different variables in a data set shared a name.

Each variable should also have two or more values. This is extremely important. A variable, in other words, must have variation. If a variable consisted of only one value, it would be a constant. Yesterday's average temperature, for example, is not a variable. It has a single value and is therefore a constant. Variables, in contrast, take on two or more values. Consider the survey question, "Do you exercise regularly?" Respondents can answer yes or no.
We could also imagine a variable that takes on three, four, or even an infinite number of values.

Variables assume particular values probabilistically. Consider the variable gender. For any given individual, there is a probability that the individual is a man and a probability that the individual is a woman, and each of those probabilities is approximately 50%. Likewise, for any individual, there is a probability that the individual exercises regularly and a probability that the individual does not. We might not know the probability of each value, but that is okay. What is important is that each of the possible values has some nonzero probability of occurring.

Generally, it is useful to assign numeric codes to non-numeric variable values. For example, in the survey question about exercise, we might assign yes responses a value of one and no responses a value of zero. On this slide, there is a table with two observations. Running Ruby reported that she exercises regularly, so she is assigned the value one. Lazy Larry, in contrast, reported that he does not exercise regularly, so he is assigned the value zero.

In statistics, variables, and the data sets they make up, are what we analyze using a variety of methods and tools.
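
To make the coding step concrete, here is a minimal Python sketch of the exercise example. It uses only the two observations mentioned on the slide. The names `responses`, `CODES`, `encode`, and `is_variable` are made up for illustration and are not part of the lecture. The sketch assigns one to yes and zero to no, and then checks the definition from earlier: a variable takes two or more distinct values, while a constant takes only one.

```python
# The two observations from the slide (respondent -> survey answer).
responses = {"Running Ruby": "yes", "Lazy Larry": "no"}

# Coding scheme described in the lecture: yes -> 1, no -> 0.
CODES = {"yes": 1, "no": 0}


def encode(answers, codes=CODES):
    """Map each respondent's non-numeric answer to its numeric code."""
    return {name: codes[answer] for name, answer in answers.items()}


def is_variable(values):
    """Return True if the values show variation (two or more distinct values)."""
    return len(set(values)) >= 2


exercise = encode(responses)

# With 0/1 coding, the mean of the codes is the share of "yes" answers.
share_yes = sum(exercise.values()) / len(exercise)

print(exercise)
print(is_variable(exercise.values()))
print(share_yes)
```

One convenient side effect of zero/one coding, shown in the last step, is that the average of the codes equals the proportion of respondents who answered yes. With only two observations that number says very little, which is exactly why we want larger, more representative data sets.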