Let's talk about a different kind of data task you might have. If you have a continuous outcome or variable of interest and you want to compare it across two groups to see if there's a discernible difference, you may need to use a different type of test. Let's say you've observed different sample means, different means within two groups. Does that mean there's a real difference there? This comes up in a variety of contexts in the study of public policy: before and after an intervention, treatment versus control groups, but it also comes up in examining differences between two groups, such as mean incomes across two groups or the mean cost of insurance across two states. Imagine you have data that's distributed normally. This is how we tend to think a lot of data are drawn. This is a standard normal distribution: it has a mean of 0 and a standard deviation of 1. On the other hand, you may have two distributions that look like this. In one case, we have the standard normal distribution with a mean of 0 and a standard deviation of 1, but we have data drawn from a second distribution which has a mean of 2 and that same standard deviation of 1. The data drawn from each of these distributions would look a little bit different. In reality, when we're looking at observed data, we might think about things like kernel density estimates. Here's a comparison of two groups of data drawn on the same measure across two different groups of observations. Here, the kernel density plots look similar, but the means do look a little bit different. Our question is: are these means different enough for us to believe that these groups differ in some way? To answer this, we can perform a statistical test. The common tests for differences in means across two groups are Student's test and Welch's test; both are t-tests.
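For intuition, here's a quick simulation sketch (hypothetical data, not the course data): two samples drawn from normal distributions with means 0 and 2 and a common standard deviation of 1, compared with base R kernel density plots.

```r
# Simulate two groups: one from N(0, 1), one from N(2, 1)
set.seed(42)                              # make the draws reproducible
group_a <- rnorm(200, mean = 0, sd = 1)   # standard normal
group_b <- rnorm(200, mean = 2, sd = 1)   # shifted mean, same spread

# The sample means will differ; the question is whether that difference
# is big enough to suggest a real difference in the populations
mean(group_a)
mean(group_b)

# Overlay the two kernel density estimates
plot(density(group_a), xlim = c(-4, 6),
     main = "Kernel density estimates for two groups")
lines(density(group_b), lty = 2)
```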
When we want to know whether the differences in our sample suggest an underlying difference in the populations, we need to know whether the difference in the sample means, given a scaling factor for how much dispersion there is, is big enough. T-tests have a general test statistic form of a difference in sample means divided by a scaling factor that accommodates differences in dispersion. If the resulting difference in means is big enough given the scaling factor, the resulting t-statistic will be far away from 0. If it's far enough from 0 given the size of our underlying samples, then we can reject our baseline expectation, that is, that there's no difference between the groups. This, again, is that null hypothesis that we've talked about. In R, the basic function for performing a t-test is t.test, another function from within the stats package. Now, the t.test function performs a variety of different tests. Today, we're going to focus on two: the two-sample test, or two-group t-test, and the paired t-test, which we use for before-and-after analysis. It's possible to pass two vectors or two variables from a dataframe to this function, or to pass a formula and a dataframe name, and we'll look at both. Let's talk first about how you set up a t-test for two groups in R. Here, we're going to see the formula syntax. The formula has a left-hand side variable and a right-hand side variable. The left-hand side variable is our outcome of interest, so it comes before the tilde. The right-hand side variable is our grouping variable: a factor variable, or other variable, that has exactly two levels. When we do this, we also have to tell R what object holds the data and the variables, so this will be a data argument. In this case, you could say data = df. Let's go to R and take a look at what it looks like. Bear in mind that defaults matter, so I'm going to show you a test with the defaults in place, at least for the first time through.
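As a minimal sketch of that formula syntax, assuming a hypothetical dataframe df with a numeric outcome column and a two-level group column (the names here are illustrative, not the course data):

```r
# Hypothetical data: a numeric outcome and a two-level grouping variable
set.seed(123)
df <- data.frame(
  outcome = c(rnorm(30, mean = 5, sd = 1), rnorm(30, mean = 7, sd = 1)),
  group   = rep(c("control", "treatment"), each = 30)
)

# Left-hand side of the tilde: outcome of interest.
# Right-hand side: the grouping variable with exactly two levels.
# The data argument tells R which object holds these variables.
t.test(outcome ~ group, data = df)
```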
Let's switch over to R and take a look. Here we are in our script, the script for this section of the course. We're going to load some objects into memory. If you notice, I've shared a data workspace with you that contains some data objects, so we'll load those in, but we'll also import some other CSV files. We should have five data objects in memory: Data 1 through 3, plus paired_df and unequal_df. Let's quickly look at a histogram of some of these data just to get a sense of how they're distributed. This is a base R histogram; you've seen histograms in ggplot, but in base R, if you need a quick histogram, you can use hist. When we do that, a histogram is created here. Let's look at how we would do a t-test across two groups. If we look at Data 3, you'll see that there's a grouping variable with two levels, control and treatment, and a continuous outcome variable that's numeric in nature. For this, we'll use a formula. Our outcome variable is on the left-hand side and our grouping variable is on the right-hand side. We tell R that the data comes from Data 3, and then we start specifying options for our test. The first is the alternative hypothesis: we're specifying a two-sided test. Here we're saying we don't know whether the mean of the control group is greater than or less than the treatment group, so we test both ways. We also specify a null-hypothesis difference with mu. Here, our null hypothesis is no difference, which is the default. We tell R that these are not paired samples, and we tell R that the variances are not assumed to be equal. These defaults are safe choices; unless you have a reason to do otherwise, stick with the defaults. In our results, we're going to see a confidence interval around the difference in means, and we specify that using conf.level. Let's execute this code and see what it looks like. When we perform this test, we get some results down here in the console. Let's talk about what we've seen.
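The call being described might look like the following, with the defaults written out explicitly. The data3 dataframe here is hypothetical, simulated to stand in for the course's Data 3, which isn't reproduced in this text.

```r
# Hypothetical stand-in for the course's Data 3: a numeric outcome and a
# two-level grouping variable (control vs. treatment)
set.seed(1)
data3 <- data.frame(
  outcome = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 58, sd = 10)),
  group   = rep(c("control", "treatment"), each = 100)
)

# A two-group t-test with the defaults spelled out.
# (paired = FALSE is also the default: these are independent samples.)
t.test(outcome ~ group,
       data        = data3,
       alternative = "two.sided",  # test both directions
       mu          = 0,            # null hypothesis: no difference in means
       var.equal   = FALSE,        # Welch's test; don't assume equal variances
       conf.level  = 0.95)         # 95% confidence interval on the difference
```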
When we look at the results in the console, we're going to see a variety of different things returned to us. First, we'll see a data summary: what objects are under test? In this case, it is the Data 3 dataframe and the variable outcome. We'll see t, which is our computed t-statistic, the ratio of the difference in means over that scaling factor. We'll see df, degrees of freedom, which relates to our sample size, and we'll see that p-value again. That p-value is our probability value, and it tends to be the thing people look at when they perform these tests. We'll also get an x-percent confidence interval. When we specified 0.95, we asked for a 95 percent confidence interval, which is a range of estimates for the difference in the population means given the data that we've seen. We also see the sample means for those two groups. Here's an example from a different set of results, and we'll jump back over to the R console to look at the ones from the test we just performed. Here in our R session we have our Welch two-sample t-test. Our data is outcome by group, our t-statistic is negative 5.555 with 167.51 degrees of freedom, and our p-value is very, very small: 1.069 times 10^negative 7. If our null hypothesis was that there was no difference, we would reject it in favor of our alternative hypothesis that the true difference in means is not equal to 0, and the 95 percent confidence interval around that difference ranges from about negative 10.5 to about negative 5. Remember, in general, if your p-value is less than 0.05, you're going to reject the null hypothesis, where our null hypothesis is that there's no underlying difference in the means.
You can also take a t-test object and assign it a name so you can use the elements of the test later: putting them in a document, or passing them to another function. Let's take the test we just performed and name it myttest, and then examine its structure. You'll see that, just like before, it's a list. Let's jump over to our R session. Here's our t-test. Once again we have our t.test, outcome by group, and data = Data 3. If you notice, I've omitted the rest of the defaults because the basics work for me in this case. Now we have a new object in our environment called myttest, and if we look at its structure, you'll notice that it's a list. It has the test statistic, the parameters (specifically, the degrees of freedom), the p-value, the confidence interval, all the stuff you might need. Now, let's look at the other way of specifying a t-test in R. If you want to set up a t-test for paired samples, it's often the case that we do this by specifying a t-test with two variables, because we'll have our pre and post variables, our observations before and after a treatment, in the same dataframe for the same observation. However, if you have a single outcome variable with a grouping variable marking pre and post treatment, it's also possible to do this using the formula syntax we just saw. Here, we'll try the other setup. We're going to specify two different variables as vectors. Note that we're going to specify the dataframe not as a data argument, but as part of the variable name, and we're going to add a new argument to our t-test: paired = TRUE. This tells R that each element in each of the vectors matches its pair in the other vector. Each observation in the first vector is compared to the same observation in the second, which means that these vectors must be the same length. Let's take a look at how to do this over in our R session, starting with the paired_df dataframe.
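Before we turn to paired_df, the object-assignment pattern just described can be sketched like this, again with hypothetical data standing in for the course's Data 3:

```r
# Hypothetical stand-in for Data 3
set.seed(1)
data3 <- data.frame(
  outcome = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 58, sd = 10)),
  group   = rep(c("control", "treatment"), each = 100)
)

# Assign the test result a name so its elements can be reused later
myttest <- t.test(outcome ~ group, data = data3)

str(myttest)        # a list of class "htest"
myttest$statistic   # the t-statistic
myttest$parameter   # the degrees of freedom
myttest$p.value     # the p-value
myttest$conf.int    # the confidence interval around the difference
```

Because the result is just a list, any of these elements can be dropped into a report or passed along to another function.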
If you look, we have two variables, pre and post, and they're both continuous. We might set up a t-test like this: paired_df$Pre identifies the pre vector, and paired_df$Post identifies the post vector. We'll stick with the default alternative hypothesis of a two-sided test and a mean difference of 0, but we've changed that flag to paired = TRUE, so we're doing a paired t-test now. When we perform that analysis, you'll see that we get a slightly different set of results. Here, we get a different summary of our data: we have two vectors, paired_df$Pre and paired_df$Post. We still get our t-statistic, our degrees of freedom, and our p-value. Our p-value is very small, much less than 0.05. A couple of final notes about t-tests. The first is that there are arguments to the function that change the test. Checking the help file will clarify when these different arguments come into play. paired = TRUE tells R it's a paired test; paired = FALSE, which is the default, is an unpaired test. There's also the option to assume equal variances or not. In general, not assuming equal variances is your better choice, and this is what distinguishes a Student's t-test from a Welch's t-test. In general, stick with the defaults here. Finally, you do have options for changing your alternative hypothesis. The default, two-sided, is a good standard, but if you have a reason to test for the second group being greater or less, you would change your alternative hypothesis here. Note that you can have your data organized in either form: your samples as vectors or variables, or a single outcome variable and a grouping variable. In the former, you will specify two vectors of data. In the latter, you'll use the formula form and include that data argument.
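To recap the paired setup, here's a minimal sketch with hypothetical pre/post data standing in for the course's paired_df (which isn't reproduced here):

```r
# Hypothetical paired data: the same 50 units measured before and after
# a treatment, so each row of pre matches the same row of post
set.seed(7)
paired_df <- data.frame(pre = rnorm(50, mean = 100, sd = 15))
paired_df$post <- paired_df$pre + rnorm(50, mean = 5, sd = 5)

# Vector form: the dataframe appears in the variable names, not in a
# data argument, and paired = TRUE tells R to match element to element.
# The two vectors must be the same length.
t.test(paired_df$pre, paired_df$post, paired = TRUE)
```

The output summary lists the two vectors rather than an outcome-by-group formula, but the t-statistic, degrees of freedom, p-value, and confidence interval read the same way as before.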