In this video, we'll explore sampling error. Our learning objectives are to describe the concept of sampling error and to create a vector of normally distributed random numbers. When we take a sample from a population, we have to ask ourselves: are we really interested in the sample, or are we interested in the population? Well, what we're really interested in is the population from which the sample comes. We take a sample to make statements about the population of interest. So there's a difference between a population parameter and a sample statistic. A sample is a subgroup or portion of the population that we choose for evaluation or study, and the population is the collection of all items produced or considered. Of course, given a sample we have sample statistics, and given a population we have the corresponding population parameters, as shown in this table. Let's say we have a population with a mean of 100 and a standard deviation of 10, and we take a sample of size n equal to 30 from this population. Some of the variations we might end up with include the various histograms depicted on the right, and of course, there could be many more. Why don't we get exactly the same answer as the population every time? Well, it's due to something called sampling error. We're going to do a simulation in R; we'll be simulating a normal distribution in RStudio using the rnorm function. However, you can do the same type of simulation with any distribution that's available in R, including the exponential, Poisson, and binomial distributions, and so on. In R, we're going to first create four random samples of size n equals 30, where the population mean equals 100 and the population standard deviation equals 10. I'm going to assign these random normal samples to the variables D1, D2, D3, and D4. You can see here that we have designated the sample size, the mean, and the standard deviation.
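The sampling step described here can be sketched as follows. The set.seed call is my own addition for reproducibility and is not necessarily part of the video's demo:

```r
# Sketch of the sampling step: four random samples of size n = 30
# from a normal population with mean 100 and standard deviation 10.
set.seed(123)  # assumption: added for reproducibility

D1 <- rnorm(n = 30, mean = 100, sd = 10)
D2 <- rnorm(n = 30, mean = 100, sd = 10)
D3 <- rnorm(n = 30, mean = 100, sd = 10)
D4 <- rnorm(n = 30, mean = 100, sd = 10)
```

Each call draws an independent random sample, so D1 through D4 will all differ even though the population parameters are identical.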
First, I'll create D1, D2, D3, and D4. Then I'm going to combine all of these variables into one data frame, which I'll call norm.data. You'll see that I've created a data frame in which I'm combining four different columns of data. When I view this data, here are my four columns with the variable names D1, D2, D3, and D4. Now, I'd like to take a look at the summary statistics, or descriptive statistics, of this data. I'm going to pass the data I'm interested in to summary.all.variables, along with the designation that I'd like to see the standard deviation in addition to the variance. I click "Run", and you'll see that I've created a table with the mean, variance, and standard deviation, our values for skewness and the test for skewness, and the value for kurtosis and the test for kurtosis. Now, I would like to make this output a little easier to read, so I'm going to use a function, nqtr, that removes the quotes and creates a table that is vertically oriented. Then I'll run it on this summary output, and I'd like to keep just three places after the decimal point. It's a little cleaner and easier to read this way. Next, I'd like to create histograms of each variable and put them into one plot. I'm going to set the parameters for the graphical output and create a matrix of plots defined by the number of rows and the number of columns. I want a two by two matrix in which I can put four histograms. To do so, I use the par function with the mfrow argument, matrix format by row, and set up a two by two matrix. Once I've done so, I can create a histogram for the norm.data variables D1, D2, D3, and D4. You can see I have an overall population, I've taken samples of size n equal to 30 from that population, and I've created the different histograms. You'll notice that they're not all the same, and this is due to sampling error.
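A minimal sketch of these steps follows. The name norm.data is my transcription of the spoken "norm data", and since summary.all.variables and nqtr appear to come from a course-specific package rather than base R, base-R equivalents via sapply are sketched in their place:

```r
# Recreate the four samples so this sketch is self-contained
set.seed(123)  # assumption: added for reproducibility
D1 <- rnorm(30, mean = 100, sd = 10)
D2 <- rnorm(30, mean = 100, sd = 10)
D3 <- rnorm(30, mean = 100, sd = 10)
D4 <- rnorm(30, mean = 100, sd = 10)

# Combine the four samples into one data frame and inspect it
norm.data <- data.frame(D1, D2, D3, D4)
head(norm.data)

# Base-R stand-ins for the course package's summary.all.variables:
# descriptive statistics computed column by column
sapply(norm.data, mean)
sapply(norm.data, var)
sapply(norm.data, sd)

# Four histograms in a two-by-two matrix of plots
par(mfrow = c(2, 2))
hist(norm.data$D1)
hist(norm.data$D2)
hist(norm.data$D3)
hist(norm.data$D4)
```

The column means and standard deviations will hover near 100 and 10, but none will match the population parameters exactly; that spread across samples is the sampling error being demonstrated.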
Now, before I leave these histograms, I'd like to set my parameters back to one graph at a time. To do so, I'll use the par function with mfrow again and create a one by one matrix. You've seen with sampling error that repeated samples may not be identical, and the descriptive statistics calculated from this repeated sampling (with replacement, of course) are not going to be exactly the same, even though the population itself is unchanged. This is expected, because we're not measuring all of the subjects or units in the entire population; we're just taking a sample. Fortunately, statistical methods allow us to account for sampling error and make appropriate decisions. In spite of the presence of sampling error, random sampling allows us to use sample statistics as point estimators of population parameters. However, we should note that even when a sample statistic is unbiased, it will probably not exactly equal its associated true population parameter. An observed difference between a true parameter value and its associated sample descriptive statistic is caused by sampling error, as we just demonstrated with our simulation in R. If we define sampling error, it is the expected and quantifiable discrepancy between a population parameter and its associated descriptive statistic, due to the sample size employed and, for some descriptive statistics, the variability of the population. But as mentioned before, fortunately, sampling error is quantifiable when we use random sampling, through sampling distributions. These distributions, like all probability distributions, are based on the principles of classical probability.
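To see how sampling error can be quantified under random sampling, here is a small illustration of my own, not from the video: with n = 30 and a population standard deviation of 10, the spread of repeated sample means should be approximately 10 / sqrt(30), the standard error of the mean.

```r
# Reset the plotting parameters to one graph at a time, as above
par(mfrow = c(1, 1))

# Draw 1000 repeated samples of size 30 and record each sample mean
set.seed(42)  # assumption: added for reproducibility
sample.means <- replicate(1000, mean(rnorm(30, mean = 100, sd = 10)))

mean(sample.means)  # close to the population mean, 100
sd(sample.means)    # close to the standard error below
10 / sqrt(30)       # theoretical standard error of the mean

# The sample means themselves form a distribution centered on 100
hist(sample.means)
```

The histogram of sample means is itself a sampling distribution, the kind of distribution the transcript says lets us quantify sampling error.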