Hi, I'm Senthil Veeraraghavan. It's a great pleasure to welcome you to week three of Modeling Risk and Realities. We often look at historical data to model future realities. In such cases, we often use distributions of random variables to model the future. This week we will look at some approaches to understanding how to choose distributions to model the future. How well does a distribution fit our data? Can we use it for our model? The answers to these questions depend on a mixture of theory and practice. In session one of this week, we will think about examining data and its graphical interpretations. Let's take a look.

Hi, welcome to Modeling Risk and Realities; this is week three of our course. I'm Senthil Veeraraghavan, an associate professor in the Operations, Information and Decisions department of The Wharton School. In this week, week three, we're going to look at choosing distributions that fit your data. For the first step, we're going to look at data and its visualization, then at choosing among families of distributions, which could be discrete or continuous. And then, how well does a certain distribution fit? We'll do hypothesis testing and goodness-of-fit analysis in the last session. So in session one we'll do visualization, in session two we'll look at distributions, and in session three we're going to look at goodness of fit.

In session one, we are looking at data and visualization; we'll look at graphical representations of data. Now we will look at how to fit distributions to your data. Fitting distributions is important because many future events in the world are inherently uncertain. You can ask many questions like: at what value will a firm be acquired? How many units of a new smartphone will be sold when it's introduced? How much dividend will a firm pay its shareholders in the next quarter? Or you could ask, will a firm meet its growth target next year? If not, by how much will it fall short? As long as there is uncertainty in the world and the world is stationary, we can use past data to build models of future realities. What do I mean by stationary? The future looks fundamentally not very different from the past.

So how do we model uncertainty? Real-world uncertainty can be modeled using random variables. Some real-world uncertainties may include profit values from an investment, percentage growth in revenues, customer orders for a new product, or a service time, for example, the time to process a loan application. A random variable is a numerical description of such outcomes, and typically we use a capital letter, such as capital X or capital Y, to denote a random variable. Random variables can be discrete or continuous, but all random variables can be characterized by their probability distribution. What's a discrete probability distribution and what's a continuous probability distribution? We will see soon.

How do you model uncertainty? Let's continue. Before we model random variables using distributions, let's ask the following question: how important is this task, actually? How important is it to use the right distribution? I'll make the case for choosing the right distribution by using two example datasets. These are very simple and straightforward datasets, but quite powerful in explaining the idea. You can find these datasets; they're called Dataset1_Template and Dataset2_Template, and they're both available on the course website.
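If you'd like to follow along outside of Excel, here is a minimal sketch in Python of loading the two template files. The .xlsx extensions and the assumption that each file holds a single column of 250 values are assumptions made for illustration, not details confirmed by the course.

    # Minimal sketch: load the two course datasets (filenames and layout assumed).
    import pandas as pd

    data1 = pd.read_excel("Dataset1_Template.xlsx").iloc[:, 0]  # first column of values
    data2 = pd.read_excel("Dataset2_Template.xlsx").iloc[:, 0]

    print(len(data1), len(data2))  # expecting 250 observations in each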
What we will do in the next video is generate histograms and descriptive parameters for these two Excel files and save them as Dataset1_histogram and Dataset2_histogram. One good way to understand and describe uncertainty is to visualize it. Generating a histogram is a good method to visualize data. A histogram is a graphical representation of the distribution of numerical data. For instance, a histogram helps us understand the shape and the scale of the distribution, and thereby understand the random variable. As we go forward, we will examine families of distributions and their parameters.

Now let's look at the Dataset1_Template file. In the template file, I have a dataset with 250 data points that came from some distribution, and let's try to understand how to fit a distribution. Our first step is visualizing it, and we're going to generate a histogram. As part of this first step, I will do the following: I will calculate the minimum, maximum, mean, and standard deviation. These are easy to calculate. The minimum is given by the MIN function applied to the entire column. To fix a column reference, all you have to do is select it and press F4, which introduces a dollar sign in front of the column letter and a dollar sign in front of the row numbers; that will fix the column. So we have the min, and similarly we can calculate the max, which is the maximum value in this dataset. The AVERAGE function gives the sample mean of the dataset, and the standard deviation is given by the STDEV function. So if you use the STDEV function, you get the standard deviation of the data, and we can summarize everything in four numbers here: min, max, mean, and standard deviation.

However, there is an easier way to calculate the mean, standard deviation, and other descriptive statistics. We do that by clicking on the Data tab and looking for the Data Analysis link. This tool may not be available right away; if it's not available and you don't see it, go to File, go to Options, and look at Add-ins, where you should see it listed as inactive. You can make it active by adding it in. Once you add it in, go to the Data tab and you should be able to see the Data Analysis link. So let's go to the Data Analysis link and click on it. It gives me a bunch of options; I'm going to pick Descriptive Statistics and say OK. I want statistics for the input range right here, and I want it to go up to row 251. I pick the input range, I click on Summary Statistics and the Confidence Interval, and for the Output Range I want the results in the same worksheet at this location. I say OK, and you'll see that the Mean, Standard Deviation, Minimum, and Maximum are all provided here. This is a quick and easy way to calculate what I just calculated using formulas.

Once we have the statistics, we can start visualizing our dataset. One way to visualize it, as I suggested, is to look at the histogram. For the histogram, we just have to do an initial setup. Here's the initial setup: I'm going to divide this data, which goes from 0.09, almost 0, up to 100, into bins, and I'm going to choose bins of ten. You can pick the size of the bins; you don't want them to be too narrow, and you don't want them to be too large, and you can use trial and error to arrive at the bins you want for your visualization. So you have the bins here. Now let's go click on the Data Analysis link, go down, and you see the Histogram here. Click on Histogram and say OK.
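For readers following along in Python rather than Excel, a rough equivalent of this descriptive-statistics step, assuming the data1 series from the earlier sketch, could look like this:

    # Rough Python equivalent of the Descriptive Statistics step for Dataset 1.
    summary = {
        "min":  data1.min(),
        "max":  data1.max(),
        "mean": data1.mean(),
        "std":  data1.std(ddof=1),  # sample standard deviation, like Excel's STDEV
    }
    print(summary)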
The input range is your data; that's your input range. And the bin range, you just created it; that's your bin range. You want the output to be presented at some location, and I clicked on the input range, so let me reset the input range quickly again: the input range should be those 250 data points, the bin range should be the 10 bins we created, and the output range can be here. I also want to click on Cumulative Percentage and Chart Output, for a reason. Let's do that and say OK. So it gives me the frequency histogram and the cumulative chart. The blue columns are the frequency chart; they tell you how many data points are in each bin, between 0 and 10, between 10 and 20, and so on. The red curve, which we will define very soon in the next section, is the cumulative distribution from the data, and it tells you how many data points are below 10, below 20, below 30, and so on. So now we have the entire dataset represented as a histogram. We can see just by inspection that the probability of being in the bin around 20 is very similar to the probability of being in the bin around 40, or the bin around 80, or the bin from 90 to 100, and so on. That tells you a little bit about how the random variable that is generating this data behaves. Let's save this file as Dataset1_histogram.

Using the Excel sheets and histogram analysis, we can now look at histograms from our two datasets. Here is dataset 1 with 250 data points. The blue chart gives us the frequency of the observations at low values and high values. The red line gives us the cumulative distribution. What's a cumulative distribution? We will formally define it in the next session. For now, let's look at the blue chart, which is the frequency chart. Looking at the frequency chart, you can see the low and high outcomes are more or less comparable: the number of outcomes around 20 is comparable to the number of outcomes, let's say, around 80.

Now we open the Dataset2_Template file, and we have a history of past outcomes, again 250 data points. They are slightly different data points, and you'll see why in a bit. Again, we first want to derive some of the statistics for the dataset, and we can do this very quickly as we saw with dataset 1. Click on the Data tab and go to Data Analysis; make sure you add it in if you don't have it. You can add it in very quickly by going to File > Options, looking at Add-ins, and checking whether the Analysis ToolPak is added in or not. Once you add it in, it should appear as Data Analysis right here under the Data tab. Click on the Data tab, go to Data Analysis, pick Descriptive Statistics, and say OK. You want to pick the input range, which is this data here, and you want an output range in the same worksheet, for example. Make sure you're selecting the correct input range, then go to Output Range and make sure you have an output range close by. Click Summary Statistics and maybe Confidence Level and so on, and say OK. That will give you the summary statistics: you have the mean, the standard deviation, the standard error, everything. And if you prefer, we can reduce the number of decimals to two, which makes analyzing the numbers a bit easier. So the standard deviation is 15.7, the mean is 47, the minimum is 3, and the maximum is 95. Let's visualize how this distribution looks.
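Here is a minimal sketch of the same histogram step in Python, again assuming the data1 series from the earlier sketches; the bin width of 10 mirrors the setup described above.

    # Sketch: frequencies per bin of width 10 plus the cumulative percentage,
    # similar in spirit to Excel's Histogram tool output for Dataset 1.
    import numpy as np
    import matplotlib.pyplot as plt

    bins = np.arange(0, 110, 10)                    # bin edges 0, 10, ..., 100
    counts, edges = np.histogram(data1, bins=bins)
    cum_pct = counts.cumsum() / counts.sum() * 100  # % of points below each upper edge

    fig, ax = plt.subplots()
    ax.bar(edges[:-1], counts, width=9, align="edge")   # blue frequency columns
    ax2 = ax.twinx()
    ax2.plot(edges[1:], cum_pct, color="red", marker="o")  # red cumulative curve
    ax.set_xlabel("Bin")
    ax.set_ylabel("Frequency")
    plt.show()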
Again, to visualize this distribution you go to Data, go to Data Analysis, and go down all the way to Histogram and select it. Before we do that, we need to define our bins. To define your bins, you need to think about how your data ranges from a minimum to a maximum. You see the minimum is 3, which is pretty close to 0, and the maximum is 95, close to 100, so to keep it simple I pick bins of 10. I get 0, 10, all the way up to 100, and once I've selected the bins I can generate a histogram. I click on Data Analysis, click Histogram, say OK, and the input range is my dataset, and the bin range is the bins I selected. The selection of bins can be done by trial and error: you don't want them too narrow, and you don't want them too broad. For the output range, you want it to be close by, here. And again, we want Cumulative Percentage and Chart Output to come out. Say OK, and we have the chart output, which we can make bigger to get a better view. We can see it's quite different from the earlier data you saw. Dataset 1 was pretty flat; here it is shaped like a bell curve. We will look at the same figure in our slides, and we understand that these two datasets look very different. So the underlying random variable that is generating this data might be behaving differently, and we need different distributions. Let's do that in the next session, but let's save this file right now as Dataset2_histogram.

Unlike in histogram one, you will see here that there's quite a bit of difference between low, medium, and high outcomes. It looks quite different. We have the same 250 data points, but you can see the low occurrences and very high occurrences are much less frequent than the outcomes in the middle. In fact, the histogram is shaped like a bell curve. So fundamentally speaking, we have 250 data points generated from different uncertainties that look quite different. The idea of choosing a distribution is to find a distribution that best explains the dataset. The datasets we saw are different. How can we model them differently?

Now recall, real-world uncertainty can be modeled using random variables, denoted by capital letters such as X and Y, and we can characterize these random variables by their probability distributions. Using the visualizations that we saw in the last two slides, we can easily argue, and I hope we can strongly argue, that these two datasets have two different distributions. As we saw, random variables can be quite different in their shapes, as seen in the datasets. In fact, random variables can be discrete or they can be continuous. Discrete random variables are countable. Think of the number of M&M candies in a box: we don't know how many there are, but we can count them, and since we do not know the exact number in a box, that's a random variable; it can change from box to box. The number of shares of a firm that were sold in a day: that's a random variable, and it's discrete. The number of investors who attended a meeting. The number of cars that went past a toll booth in a given hour. All of these are countable, and therefore discrete random variables. On the other hand, random variables can be continuous. For example, the exact rainfall that falls in a particular region during the monsoon season, the heights of men and women in a region, the speeds of vehicles passing by on a road, or the length of fabric spun from a ton of cotton. All of these can take any value, and therefore can be continuous random variables.
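A small illustrative sketch of the discrete-versus-continuous distinction in Python follows; the particular families used here (Poisson for a count, normal for heights) are chosen only for illustration, since the course introduces its own families of distributions in the next session.

    # Illustrative only: one discrete and one continuous random variable.
    import numpy as np

    rng = np.random.default_rng(0)

    cars_per_hour = rng.poisson(lam=30, size=5)           # discrete: whole-number counts
    heights_cm    = rng.normal(loc=170, scale=8, size=5)  # continuous: any value in a range

    print(cars_per_hour)  # an array of whole numbers
    print(heights_cm)     # an array of real numbers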
A random variable is typically described by parameters. Some descriptive parameters are as follows. The most common descriptive parameter is the mean: the mean represents the average value of a random variable over a number of repetitions. Standard deviation is another parameter that's often used to describe a random variable. It measures the spread of outcomes, how far the random variable could be from its average. Variance is another measure of spread; variance is nothing but the square of the standard deviation. Other common descriptors are the median and the mode. The median literally gives the midpoint of the distribution: the value of the random variable that separates the upper half from the lower half, that's your median; it's the 50th percentile. What's the mode? The mode is the most likely value of the random variable, the most likely possibility that can happen. All of these descriptors are useful in describing a random variable.

In the next session, what we will do is look at families of distributions that are often used to model realities and try to parametrize them. In session one of week three of our course, we examined how to look at data and examine graphical interpretations of data. This gives us a good springboard to go further and explore several families of distributions used to model realities.
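To close the loop for readers following in Python, here is a brief sketch of these descriptors computed for the second dataset, assuming the data2 series from the first sketch; taking the mode as the midpoint of the tallest width-10 bin is just one convenient proxy for continuous data, not the course's definition.

    # Sketch: mean, standard deviation, variance, median, and a binned mode for Dataset 2.
    import numpy as np

    mean   = data2.mean()
    std    = data2.std(ddof=1)
    var    = std ** 2              # variance is the square of the standard deviation
    median = data2.median()        # 50th percentile

    counts, edges = np.histogram(data2, bins=np.arange(0, 110, 10))
    tallest = counts.argmax()
    mode_proxy = (edges[tallest] + edges[tallest + 1]) / 2  # midpoint of the tallest bin

    print(mean, std, var, median, mode_proxy)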