Hi everybody. Let's talk about an analysis task you might encounter. Maybe you're working with categorical data. You have a categorical outcome of interest: somebody's choice of job, what kind of school they send their children to, or their use of a public service. You also have some information about them that's categorical in nature: whether they received a policy intervention or not, their identity group, or what neighborhood they live in. In this case, you might want to examine whether there's a relationship between the two. Did students who had a job training program go into the trades more often? Did issuing vouchers to residents increase the number of parents sending their children to charter schools? When we have these kinds of categorical associations, we want to perform analysis that matches these kinds of categorical data. A gateway to doing this is creating tables. We can examine frequencies of occurrences in our data by creating a contingency table. This is sometimes also known as a cross tab. In a cross tab, rows contain the levels of one variable and columns contain the levels of another. The cells within this table contain counts: how many observations within our data exist for each combination of the two variables. To do this in R, we can create basic tables and then pass these to another function to create proportional tables, so we can see what proportion of each row or each column falls in each category. Let's take a look at how to do this in R. We'll switch over to the R session now. Here's the script for this session. We're going to start by importing some data. The data in this session come from the Current Population Survey for May of 1985. This basic dataset has some information on employment characteristics of a group of individuals. We see things like wage, education, experience, and age. Those are continuous variables. We'll set those aside for now.
But we also see categorical variables like ethnicity, the region where someone lived, their gender, their occupation, the sector of employment, whether they're in a union or not, and whether they're married or not. Those categorical variables we can use to create these contingency tables. The way we create contingency tables in R is with the table command. When we give R a table command, we need to provide it with one or more variables across which the table is created. In our first table, we're going to use the occupation variable and the sector variable. I'll identify the data frame from which each variable comes, and I'll call the variable by name. In this case, chi-square df occupation, chi-square df sector. When we execute this code in our console, we see the results of our table command. Our contingency table shows us in the cells the number of occurrences in our data for each combination of the two factors. For instance, there were two individuals who work office jobs who are in the construction industry. There were 68 workers in the manufacturing industry. There were 81 people who work in services in all other industries. We can do this for any pair of variables. If we did occupation and union, where union is a binary variable, we would see the number of people in each union category for each of our occupations. There is in fact a simpler way to do this if you don't want to write the data frame name each time, and that's to use the with command. With is a helper which evaluates an expression inside a specific data environment, so you could write with, then chi-square df, then table and just the variable names. These results will look exactly the same as the ones you just saw. If you want to try a proportion table, you can check the help file for prop.table. I've left that code for you on line 28.
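Here is a rough sketch of the table, with, and prop.table calls described above. The course data file isn't reproduced here, so this builds a small made-up data frame. The name chisq_df is a guess at how "chi-square df" is spelled in the script, and the rows are illustrative values, not the CPS data.

```r
# Hypothetical stand-in for the course data frame (assumed name: chisq_df).
# These ten rows are invented for illustration and are NOT the CPS 1985 data.
chisq_df <- data.frame(
  occupation = factor(c("worker", "office", "worker", "services", "management",
                        "office", "services", "worker", "management", "services")),
  sector     = factor(c("manufacturing", "other", "construction", "other", "other",
                        "construction", "other", "manufacturing", "manufacturing", "other")),
  union      = factor(c("yes", "no", "yes", "no", "no",
                        "no", "no", "yes", "no", "no"))
)

# Cross tab: rows are occupation levels, columns are sector levels, cells are counts.
tab_dollar <- table(chisq_df$occupation, chisq_df$sector)

# Same counts without repeating the data frame name.
tab_with <- with(chisq_df, table(occupation, sector))
print(tab_with)

# Row proportions (margin = 1): each row sums to 1.
row_props <- prop.table(tab_with, margin = 1)
print(round(row_props, 2))
```

Passing margin = 2 instead would give column proportions, so each column sums to 1.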
If you read the help file for prop.table, you'll see that prop.table takes a table as input, so an object that we've created that is a table is the input to the calculation. With prop.table, we pass R a table object. In this case, we'll reuse the code we wrote with with, where we specify occupation and sector, and we specify which margin we're going to take proportions across. In this case, these are row proportions. When we use prop.table with the first margin, it calculates row totals and row proportions for each of the levels of our row variable. Occupation was in the rows, sector was in the columns. If you look, about 11 percent of our respondents in the management occupation were in manufacturing. About 89 percent were in other sectors. We can do the same thing across columns by switching to the second margin of our table. In this case, the proportions are calculated over columns. Let's switch back over to the slides and we'll talk a little bit about chi-square tests. The chi-square test of association is a test where we examine whether proportions of a categorical outcome are associated with a particular categorical factor. We compute a test statistic, the chi-squared statistic, which is where the test gets its name. It compares our observed frequencies, the frequencies we've seen in our data from the tables we just created, with the frequencies we would expect if the rows and columns were truly unassociated. Our baseline expectation is that our outcome is totally unassociated with our factor of interest. This is our null hypothesis. If the difference between observed frequencies and expected frequencies is big enough, we might reject this null hypothesis in favor of an alternative that there is some association between the columns and the rows. Let's look at these two example tables. On the left we have data that we observed in our data set. Let's assume we have a situation where there were some families that received no voucher.
Within those families, 155 of them sent their children to public schools and 30 sent their children to charter schools. But among the recipients of a voucher, 45 sent their children to public schools and 19 sent their children to charter schools. If we calculated the expected frequencies under the assumption that those two factors were totally independent, we might see a table like the one on the right, where about 148.6 families sent their children to public schools when they received no voucher, and 36.4 sent their children to charter schools. Meanwhile, within voucher families, 51.4 sent their children to public schools and 12.6 to charter schools. Each expected count is the row total times the column total, divided by the grand total, so for no-voucher public school families that's 185 times 200 divided by 249, or about 148.6. If you want to visualize the difference, imagine on the left we have the data that we've observed. Notice how the regions within the no voucher and voucher groups are different heights. But in the expected data, the proportions are identical. This is our assumption of independence. Chisq.test is the function we use in R for performing chi-square tests. It's a function within the stats package, and a lot of the things we're going to look at in these videos come from the stats package. It performs a variety of different tests. We specifically want the contingency table tests, the tests of independence. To do this, we can give the function two different vectors of the same length. These may be variables from a single data set. Or we can give it a prepared contingency table, a matrix of observed frequencies. We're going to try both in the code. Let's switch over to the R session and take a look. Here in our code we'll look at the example from the slides. I'm going to show you how to set up that data frame directly. Here we have our public school observations and our charter school observations, and we'll specify the row names so it's easy to read our table.
I'm going to make sure I rename my columns so that they read public school and charter school. Let's take a quick look at our data frame. I'll add a line to print the data frame, and there's the observed data from before. When we perform the chi-squared test, all we have to do is wrap chisq.test around the name of our observed frequencies matrix or table. The results break down a few pieces of information about the test we just performed. R tells us what the data source was, the information that went into the function. It tells us X-squared, which is our test statistic. It tells us the degrees of freedom, called df, and it tells us the probability value, the p-value. This p-value is generally what you're looking at when you perform one of these tests. In general, if the p-value is less than 0.05, we would reject the null hypothesis. Remember that our null hypothesis is that there's no association between the row factor and the column factor. We'll discuss p-values in a future video. Let's switch back over to the R console, and we'll take a look at what we can do next. We can also assign our chi-squared test to an object name. In this case, let's call it mytest. When you do this, you can also look at a summary of your test object. It tells you all of the different elements within your chi-square test. These are elements that you can use in other analyses or in a markdown document. If we look at the structure of mytest in the R session, you'll notice that it's a list: it has a bunch of named elements, much like a data frame has named columns.
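The steps above might look something like the following sketch. The counts are the ones from the slides. The object name observed and the exact row and column labels are assumptions, and this version builds a matrix rather than the data frame used in the video (chisq.test accepts either).

```r
# Observed counts from the slides: rows are voucher status, columns are school type.
# The object name and labels are assumptions for illustration.
observed <- matrix(
  c(155, 30,
     45, 19),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("No voucher", "Voucher"),
                  c("Public school", "Charter school"))
)
print(observed)

# For a 2 x 2 table, chisq.test applies Yates' continuity correction by default
# (correct = TRUE), which changes the statistic but not the expected counts.
mytest <- chisq.test(observed)
print(mytest)

# The result is a list with named elements.
print(names(mytest))
str(mytest, max.level = 1)

# Expected counts under independence, computed by hand:
# (row total * column total) / grand total.
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected_by_hand, 1))
```

Individual pieces can be pulled out with the dollar sign, for example mytest$p.value or mytest$statistic, which is handy when you want to report them elsewhere.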
If we look at mytest observed, that's the table of frequencies for the row and column variables. If we look at mytest expected, there are the numbers we saw on the slides a few minutes ago. You can pass these objects to other parts of your R Markdown document or your analysis. Let's do one more together. In this case, rather than passing a contingency table to chisq.test, we're going to pass two vectors, and they'll come from the same data frame, so we know that they're the same length. We'll use data from that CPS data frame, the chi-square underscore df data frame. We'll use the occupation variable and the union variable to see if there's independence between those two factors. Here's our chisq.test command again. But what goes inside the parentheses? Well, in this case, we want to pass it two vectors. We could say chi-square underscore df occupation. RStudio will hopefully auto-complete the first one if we press Tab. I'll put my other variable on the second line, so it's a little bit easier to see. We have two vectors rather than a matrix. When we perform the test, we get our chi-square test of independence again. Here, the data line looks a little bit different, but if we look at our p-value, it's very small: 1.091 times 10 to the negative 5, well below that 0.05 level. If we wanted to, we could look at a summary of this test object and see its structure as well, and see our observed and our expected frequencies. Our last option, if you want a different way of going about creating tables and getting chi-square tests, is to use a different function from the stats package called xtabs, which is short for crosstabs. Xtabs creates contingency tables with formula syntax. Formula syntax in R is something we'll see more of in the future. It uses a tilde to indicate that there's a left-hand side and a right-hand side to a formula.
In this case, with xtabs, we would say that the factors which are our cross-classifying factors, the things which are rows and columns, go on the right-hand side. Let's switch over to the R console and take a look at how that works. Here I have xtabs, and my formula is tilde occupation plus union. Those are my two factors. Occupation is one factor, union is the second factor. But now I have to tell xtabs where the data are coming from, the dataset. In this case, the dataset is chi-square underscore df. When we do this, we get a crosstab, just like we did before when we used table. But one of the advantages of using xtabs is we can wrap it in a summary, and when we summarize that xtabs, we get a chi-squared test of independence of all the factors in our contingency table. It's a nice little gateway to the fact that xtabs will let us do chi-square tests for more than two factors. You can also take your xtabs and assign it a name. You could say mytable and summarize your table. But note that the structure of an xtabs object is a little different. You don't get those expected and observed values that you did from a table and chisq.test. We'll wrap it up there. But in a future session, we'll look at other ways of analyzing data that are continuous in nature.
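To make the two-vector and xtabs routes concrete, here is a minimal sketch. It assumes the data frame is spelled chisq_df and uses invented occupation and union values (120 made-up rows), so the numbers it prints will not match the CPS results quoted in the session.

```r
# Hypothetical stand-in for the course data frame (assumed name: chisq_df).
# The counts below are invented for illustration and are NOT the CPS 1985 data.
chisq_df <- data.frame(
  occupation = factor(rep(c("worker", "office", "services", "management"),
                          times = c(40, 30, 30, 20))),
  union = factor(c(rep(c("yes", "no"), times = c(20, 20)),   # worker
                   rep(c("yes", "no"), times = c(5, 25)),    # office
                   rep(c("yes", "no"), times = c(10, 20)),   # services
                   rep(c("yes", "no"), times = c(4, 16))))   # management
)

# Route 1: pass two vectors of the same length to chisq.test.
vector_test <- chisq.test(chisq_df$occupation,
                          chisq_df$union)
print(vector_test)

# Route 2: build the cross tab with a one-sided formula, then summarize it.
mytable <- xtabs(~ occupation + union, data = chisq_df)
print(mytable)

tab_summary <- summary(mytable)   # chi-squared test of independence of all factors
print(tab_summary)

# The summary does not carry observed/expected tables, but the xtabs object
# is still a table, so it can be handed to chisq.test to get them.
table_test <- chisq.test(mytable)
print(round(table_test$expected, 1))
```

For a table larger than 2 by 2 like this one, no continuity correction is applied, so the statistic from summary and the one from chisq.test should agree.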