In the previous videos, we've talked about this concept of p-values. When we talked about Chi-square tests, t-tests, and ANOVA, I said in a variety of cases that when the p-value is less than 0.05, you might reject the null hypothesis of no difference. To understand what this means, it's important to think about what the tests are doing: they're testing for the presence of discernible differences. When we talk about rejecting the null hypothesis, we're talking about a way of thinking called null hypothesis significance testing. It's a paradigm for statistical judgment. But understanding what that p-value means is a little complex. A p-value is a probability, somewhere between zero and one, of getting a test result, a test statistic as large or as extreme or as different as what you observed, if the underlying null hypothesis were true. In other words, how likely or unlikely is it that the difference that we've observed has arisen by chance alone? A p-value threshold of 0.05 implies that we would reject that null hypothesis of no difference between the groups in favor of an alternative hypothesis when in fact the null hypothesis was actually true, when there's no real difference between the groups, about one in 20 times, or five percent of the time. Understanding this means thinking a little bit about those probability distributions once again. How do p-values come about? For each of our tests, we're comparing our test statistic back to a different probability distribution: for the Chi-square test it's the Chi-square distribution, for the t-test it's Student's t, and for ANOVA it's the F distribution. What we want to know is, for a distribution with a shape that matches both our test and our sample (this will affect the degrees of freedom and therefore the value of our test statistic), what's the cumulative probability of seeing a statistic that is more extreme, given that our null hypothesis is true? In other words, what's the probability that we'd see something more extreme than what we saw? This gives an idea of how likely it is that what we saw arose by chance alone. You can think about this with the Chi-square distribution by thinking about where a particular Chi-square statistic would divide the distribution. Here there's a thin dotted blue line which represents a Chi-square statistic of seven. The shaded area above it, under the curve, represents the cumulative probability of an outcome above that Chi-square statistic of seven if, in fact, the null hypothesis of no difference were true. In this case, with three degrees of freedom, our p-value with a Chi-square statistic of seven is about 0.072. When we're working with a t-statistic, we look at the t-distribution. The dotted line, in this case, represents a t-statistic of two, so that vertical line sits at a critical value of two, and the shaded area above that line is the cumulative probability above that critical value, which means that about 3.4 percent of the distribution is above an observed t-statistic of two. If our t-statistic were two, we would say that our p-value is 0.034.
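To make those tail areas concrete, here's a minimal sketch of how you could compute them in R with the built-in cumulative distribution functions. The slide doesn't state the degrees of freedom for the t example, so the value used below is only an assumption for illustration.

```r
# Upper-tail probabilities (p-values) from the distributions discussed above
pchisq(7, df = 3, lower.tail = FALSE)   # about 0.072, the Chi-square example
pt(2, df = 12, lower.tail = FALSE)      # area above t = 2; the exact value depends on the
                                        # degrees of freedom (12 is an assumed value here)
```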
With the F-statistic, again, you might think of an F-statistic of 3.25, with three degrees of freedom in the numerator and 122 degrees of freedom in the denominator. Our F-statistic of 3.25 is that vertical dotted line, and the area above it is the cumulative probability of more extreme values given our null hypothesis of no difference. It's also easier to understand this through the intuition we get from simulating data: we create data whose distribution we understand, and we look for where patterns of errors or differences might occur. We can see this false rejection in action and use it as an opportunity to learn a couple more skills in R. So we'll create some simulated data, we'll write some expressions, and we'll encounter a function called replicate, which allows us to iterate over a function many times. We'll look at this two different ways. First, we're going to simulate some samples and visualize the distributions with an indication of means, so we can see how the density estimates might suggest whether groups are different or not. Then we're going to simulate lots of samples and look for extreme group differences between them, and we're going to do this in a world where we assume that the null hypothesis is true. To do this, we're going to draw random numbers from a distribution. Now, the stats package has a lot of functions related to probability distributions. You may see things like dnorm, qnorm, pnorm, or rnorm. D is for densities, q is for quantiles, p is for cumulative probabilities, but r is for random draws, and that looks pretty good to us. That's the thing that we want to work with: we want to take random draws from a normal distribution. There are other distributions available to us, so we could use rf, rchisq, or rt if we wanted to draw from the F distribution, the Chi-squared distribution, or Student's t, and we would need to provide arguments for the shape of those distributions as well as indicating how many draws we want to make. If you swap the first letter, you can get those other features of the distributions as well. One thing to know is that computers don't do things randomly, and that's a good thing. We don't want computers to do things randomly because we want them to compute things correctly. If we want repeatable pseudo-random samples, which is what computers actually generate, we need to set a seed. A seed is a number that's used to start a random number generation process. Sometimes computers get it from the clock or from the Internet; we're going to set a seed explicitly, which will ensure that we get the same results when you try this at home and when I do it here in the studio. If we set the seed and all do the same steps in the same order, we're going to get the same numbers in the end, so don't do any extra steps in between if you want to match what you see here. Let's try to visualize some samples. The first thing we're going to do is draw one sample with a known mean and dispersion, visualize it, and add an indication of the mean of that sample with a vertical dotted line. Let's jump over to the RStudio session and take a look at how this is done. Here in RStudio, I'm going to start by loading some packages that you've seen before. If you haven't installed them, I've left you some code here that you can uncomment to get them installed. We're going to use both ggplot2 and dplyr later on. Let's set that seed of 12345.
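Here's a short sketch of the setup just described; the commented install line mirrors what's mentioned in the video, and the F line is just a callback to the example from the slide above.

```r
# Packages used later on (uncomment the install line if you don't have them yet)
# install.packages(c("ggplot2", "dplyr"))
library(ggplot2)
library(dplyr)

set.seed(12345)   # make the pseudo-random draws below repeatable

# The stats distribution functions share a naming scheme:
dnorm(0)                      # d = density of the standard normal at 0
pnorm(1.96)                   # p = cumulative probability below 1.96
qnorm(0.975)                  # q = quantile for a given cumulative probability
rnorm(5, mean = 0, sd = 1)    # r = five random draws from a standard normal

# Swapping the first letter works for other distributions too, e.g. the F example above:
pf(3.25, df1 = 3, df2 = 122, lower.tail = FALSE)   # area above F = 3.25
```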
Now we'll create an example dataframe: a dataframe object with a single variable, which we'll call randvar, and it is a draw from the normal distribution, 100 observations with a mean of zero and a standard deviation of one. There's our sample dataframe up in our environment as our data object. Let's look at a histogram of randvar. There are our 100 observations as a histogram. Now let's calculate a sample mean, and we'll do this through piping, but I'm going to use the other pipe operator that you may see in R code, which is from a package called magrittr: a percent sign, a greater-than sign, and a percent sign. The sample mean is the sample dataframe passed to the summarize function, and we get a dataframe of one observation of one variable. We can do the same thing again with 1,000 observations, so we'll create a second sample, another dataframe, another histogram, which you can see has a central tendency closer to zero, and another sample mean. We'll also create an example dataframe with five groups of 100 observations. Here, we'll do randvar with 500 observations, still with mean zero and standard deviation one, but I'm going to put in a grouping variable, which we'll call draw. This uses a function called rep. Rep means repeat, or repeat a pattern: repeat the pattern 1, 2, 3, 4, 5. 1:5 is a way of saying the vector of integers from 1 to 5, and we repeat each element of that vector 100 times. We'll get 100 ones, then 100 twos, then 100 threes, then 100 fours, then 100 fives. We've drawn all of our samples from the same random normal distribution; we're just saying they're from five different groups. We'll calculate a sample mean for each of the groups there. If you're uncomfortable with doing it all as one big mass, we could also do it by specifying each of the random normals of 100 observations separately and doing the same draws. Now let's plot our simulated samples. Remember from the slides, we want to see a kernel density plot with a visual marker of where the mean is. To do that, we'll go back to ggplot2 code like what you've seen before, but which includes some additional elements. You can go back and check the code later to see how each of these works. We're going to plot a vertical line for our mean and a density plot, and we'll put in a minimal theme and some labels. When we do this, we get a curve that looks like that: most of our sample is clumped around the middle, but it has a central tendency, and the mean is a little bit above zero. If we do it with our second sample, where we have 1,000 observations, we get a density plot that has a much tighter central tendency around zero. The mean is very close to zero. But if we think about the situation where we took five different draws of 100 observations, you'll see that each of those samples has a slightly different distribution and each of them has a slightly different mean, so even though we know that we created them from the same random normal distribution, they have different sample means. We can do the same thing for our second group of five samples, and we get similar but not identical results, where there is some variation in each of our samples of 100 observations around what the mean is, even though we know for sure that all of them have exactly the same population mean and exactly the same standard deviation. This is the first part of understanding where p-values come into play: even though we know that the true population mean is zero, there is some variation from randomness.
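Here's a compact sketch of the data creation and plotting just described. The object names (sample1, sample2, samples5) and the plot styling are approximations of what's shown on screen rather than the exact course code.

```r
library(dplyr)
library(ggplot2)
library(magrittr)   # provides the %>% pipe mentioned above

set.seed(12345)

# One sample: 100 draws from a standard normal
sample1 <- data.frame(randvar = rnorm(100, mean = 0, sd = 1))
hist(sample1$randvar)
sample1_mean <- sample1 %>% summarize(mean_randvar = mean(randvar))

# A larger sample: 1,000 draws
sample2 <- data.frame(randvar = rnorm(1000, mean = 0, sd = 1))

# Five "groups" of 100 observations, all drawn from the same distribution
samples5 <- data.frame(
  randvar = rnorm(500, mean = 0, sd = 1),
  draw    = rep(1:5, each = 100)   # 100 ones, then 100 twos, and so on
)

# Kernel density plot with a dotted vertical line at the sample mean
ggplot(sample1, aes(x = randvar)) +
  geom_vline(xintercept = mean(sample1$randvar), linetype = "dotted") +
  geom_density() +
  theme_minimal() +
  labs(x = "randvar", y = "density")
```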
We can see this in the p-value context of how many times we would see a more extreme t-statistic if we try out repeated samples. We're going to do one more set of tests, and we're going to examine the difference in sample mean between two samples. We're going to start with an assumption of no difference in mean by design: both of the samples under comparison are going to have a mean of zero. We're going to draw many pairs of samples, samples of 30 observations, of 100 observations, and we're going to calculate the difference in mean between pairs of samples. What we want to know from this simulation is, when the true difference in mean is zero, how many times do we observe a difference that's larger than some value even though we know there's no real difference? In this case, what we're going to do is assume some arbitrary value for the difference, and we'll look at that in the code, but we want to see how many times by pure randomness we get a difference greater than that, just by chance. As we go through this, we're going to change the size of the samples and the dispersion, which is the standard deviation. When we look at the results, we're going to look at a table that has the proportion of samples with a mean difference greater than a specific value. We're going to start with a baseline, then we'll increase the sample size, and then we'll increase the dispersion, and this will give you a sense of how the probability of seeing those more extreme differences changes as we change elements of our test. This is a one-tailed test; we're just going to test for values greater than some threshold, but a two-tailed test does the same thing with more extreme values of an absolute difference. Here's a preview of what things are going to look like. We're going to specify some names for integers to set up our tests so we don't have to write as many numbers each time. We're going to replicate a number of samples a certain number of times and then create a proportions table for whether the mean difference between the samples in each pair is greater than 0.3 or not. Let's go to the code. We'll start with this first set of simulations. We'll say that there are a thousand simulations in each of our tests. Within each of our samples, there are 30 observations. The mean of sample 1 is 0, the mean of sample 2 is 0. The standard deviation of sample 1 is 1, and the standard deviation of sample 2 is 1. Then what we'll look at is this function where we say: replicate a simulation over the number of sims, so 1,000 times, and compute the mean of one normal sample minus the mean of another normal sample. Here, there are 30 observations per sample with a mean of 0 and a standard deviation of 1, and we subtract the mean of a different normal sample, also 30 observations with a mean of 0 and a standard deviation of 1. You could think of this like two different groups of 30 observations where we know the true difference in means is 0. When we do this, we end up with a vector of mean differences, and when we look at what proportion of those mean differences are greater than 0.3, you'll see that about 87 percent of the time they are not, but about 12 or 13 percent of the time they are. If you made this threshold greater, testing for mean differences greater than 0.4 or 0.5, the number of cases where that was true would decline, and so the proportion of cases where that was true would decline.
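Here's a sketch of the simulation loop just described, using replicate() and a proportions table. The object names are illustrative, and the exact proportions you see will depend on the seed you set.

```r
set.seed(12345)

# Simulation settings described above
n_sims <- 1000           # number of simulated pairs of samples
n_obs  <- 30             # observations per sample
mean1  <- 0; mean2 <- 0  # true means: no difference by design
sd1    <- 1; sd2   <- 1  # dispersion of each sample

# For each simulation, draw two samples and take the difference in their means
mean_diffs <- replicate(
  n_sims,
  mean(rnorm(n_obs, mean1, sd1)) - mean(rnorm(n_obs, mean2, sd2))
)

# What proportion of mean differences exceed 0.3, even though the true difference is 0?
prop.table(table(mean_diffs > 0.3))

# To explore the next steps: set n_obs to 100, or sd1 and sd2 to 2, and rerun.
```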
But what happens if we increase the number of observations in each of our samples? We'll go from 30 to 100, run all this code again, re-sample our mean differences, and recreate our proportions table. Now the p-value is much smaller: a much smaller proportion of our 1,000 simulations resulted in a mean difference between the samples of greater than 0.3. Now it's about 0.01. The bigger our samples are, the more information we have about the underlying population mean, and the observed differences sit much closer to the true population mean difference of 0. Finally, if we increase the standard deviation, if we increase the dispersion, here to 2, and change those values and rerun this code, you'll see that the p-value jumps back up again, because the greater dispersion means that there's potentially a greater difference between the means just by random chance in pairs of samples of 100 observations each. You can play around with this code, changing the number of simulations, the number of observations, the mean of each group, the standard deviation, and also that threshold of difference, to get an intuition for how p-values work. P-values are about what proportion of observations are, or might be, more extreme than a threshold that we've computed given the data that we've observed. P-values effectively tell us how unlikely it is that some difference in our data would occur by chance when there's no real difference. The common "reject if the p-value is less than 0.05" advice is a sometimes misused standard. That initial 0.05 threshold was chosen as a general rule; there wasn't an initially strong mathematical justification. It was more that one mistake in 20 seemed okay, and further work on significance testing and power analysis went on to codify the 0.05 threshold. Understand that the p-value is telling you how likely it is that you would have seen that difference even when the null hypothesis was true. When you think about your analysis, think of both the p-value and the magnitude of the difference together. That will give you an idea of how much evidence there is in your data for a difference between groups or among groups, or a difference in independence across factors. No single test alone should be considered conclusive evidence. Even in carefully designed experiments, extreme results can sometimes happen just by randomness, and that's how science works: we accumulate knowledge over multiple tests, over multiple samples, over multiple experiments.