This video is on observational studies. We're first going to attempt to understand the difference between randomized trials and observational studies, and then we'll look at how matching, which is one method to control for confounding, attempts to make an observational study like a randomized trial. And then we're also going to discuss some of the advantages of matching. So it's more of a big-picture kind of set-up. We can consider a simple DAG like this, which is the classic confounding setting, where you have a set of variables X that affect both A and Y, and therefore controlling for X would be sufficient to control for confounding. But really you can think of X as a set of variables that are sufficient to control for confounding, and this could be based on the disjunctive cause criterion or the backdoor path criterion, so it can come from a much more complicated DAG than this one; here, we're just simplifying. So just imagine, in general, that X is a set of variables that are sufficient to control for confounding. You've already selected those; you've already selected the variables you want to control for. And so our key assumption here is the ignorability assumption, and this is just a reminder that what it says is that potential outcomes are independent of treatment assignment conditional on these covariates. So essentially, we're imagining that treatment is effectively randomized given the set of confounders. So, how do we deal with that in a randomized trial? In the real world, we have this problem where treatment assignment is partially determined by some variables X, and these variables X also affect the outcome. Well, in a randomized trial, the way that's dealt with is by actually randomizing treatment assignment. What that effectively does is erase the arrow from X to A.
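As a reminder, one way to write the ignorability assumption described above, using potential-outcome notation (the symbols Y^0 and Y^1 for the potential outcomes are an assumed notation here, not shown on the slide), is:

```latex
% Ignorability: potential outcomes are independent of
% treatment assignment A, conditional on the covariates X
(Y^0, Y^1) \perp\!\!\!\perp A \mid X
```

That is, within levels of X, treatment is as good as randomized.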
So now, if you randomize treatment, X is not affecting A, because you've randomized, so there are no backdoor paths from A to Y. That's what randomization tries to do: erase that arrow from X to A by manipulating A directly via randomization. We can think of it as a coin flip, for example. So what else happens with a randomized trial? Well, what this means is that the distribution of X will end up being the same in both treatment groups. We have some population of interest, which I'll depict with a circle, and then from that population, we randomly select people to get treatment A=0, which is the grayish circle, and we also randomly select people to get treatment A=1, which is the bluish circle. But because we've randomized, the distribution of these covariates, these Xs, will be the same in the two treatment groups as it was in the original population. Right? There's no reason the distribution of X would be different in the control group and in the treatment group, because we just randomly selected these subjects from the big population. So the distribution of X should be the same in the two treatment groups. And this is one property of a randomized trial, and it's the kind of property we would like to make true, essentially, in an observational study. So in a randomized trial, the distribution of pre-treatment variables (and I want to emphasize pre-treatment variables, because confounders are things that should occur before the treatment decision) should be the same in both treatment groups. So we should have covariate balance in that sense. This is basically what we mean by balance: the distribution of these covariates should be the same in the two treatment groups. So they should be balanced. And if the outcome ends up differing in the two treatment groups, it won't be because the distribution of X differed between them.
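As a quick illustration of this balance property, here's a small simulation with purely hypothetical numbers: age plays the role of the covariate X, and treatment is assigned by a coin flip that ignores age. The mean age then comes out nearly identical in the two groups.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: X is age, drawn from one common distribution.
population = [random.gauss(60, 10) for _ in range(10_000)]

# Randomize treatment with a coin flip, independent of X.
treated, control = [], []
for age in population:
    (treated if random.random() < 0.5 else control).append(age)

# Because assignment ignores X, the age distributions should be balanced:
# both group means land close to the population mean of 60.
print(statistics.mean(treated), statistics.mean(control))
```

Any difference between the two means here is pure sampling noise, which is exactly the situation matching will later try to recreate.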
We've already made sure that that's the same. In a randomized trial, we did that by randomizing, and so we've really isolated the treatment effect. So X, these covariates, is basically dealt with in the design phase. By design, through randomization, we're dealing with these confounders X, getting rid of them through randomization. So then you might be wondering, 'Why not always randomize?', because that seems like a straightforward way to get rid of confounding. Well, one issue is that randomized trials are expensive. For a number of reasons, they're typically much more expensive than observational studies: you have to have the study reviewed for ethics; you have to enroll people in the study; you have to follow them over some period of time. There are lots of protocols, lots of people involved, so it costs a lot of money. And sometimes, randomizing treatment or exposure might be unethical. A classic example is smoking. We want to know if smoking causes lung cancer, but it would be considered unethical to randomize people to smoke. Also, a lot of people will probably refuse to participate in a trial. We might ask people, 'Do you want to be randomized to receive an active treatment or placebo?', and some people won't want to be experimented on in that way, or they just might not want to be bothered; if you're in a trial, there's going to be follow-up, you're going to be contacted, and maybe people don't want that. So the population that we'll end up being able to make inference about is kind of shrinking. Right? It's shrinking in the sense that we can't make inference about people who would refuse to participate in the trial, so it's a less general population. And trials potentially take a lot of time. You randomize people and then you have to wait for outcomes.
Maybe you're interested in the survival rate five years from now, or whatever the outcome is; typically some period of time will have to pass before you can see it, and so we have to wait. In some cases you might have to wait a long time. It could even be the case that by the time you get your outcome data back, the question is no longer relevant. Maybe you were interested in comparing drug A to drug B, you have to wait seven years for outcomes, and within those seven years a new drug becomes available that's even better. So there are many reasons why you might want to do an observational study; these are the counterarguments to the previous slide. You can generally have a broader population, you can get results much faster, and so on. We'll think about two different types of observational studies. The first is the more classical kind of observational study, which is really a planned, prospective observational study where the research investigators are doing active data collection. In this case, you would typically enroll people in the observational study and maybe meet with them periodically, every six months or every year, to collect data from them. So, like trials, you would collect data on a common set of variables at planned times, and you would measure outcomes carefully using set protocols. But unlike trials, regulations would be quite a bit weaker, because you're not actively intervening: you're not manipulating treatment, you're not randomizing people to treatment, you're just observing what happens in reality. And a broader population would typically be eligible for this kind of study. If you're randomizing people to treatment, there are a lot of rules about who you would have to exclude from the study, whereas if you're just observing people, a much larger group is eligible.
So that's one type of study, and those actually have some of the drawbacks of randomized trials, in the sense that they're still expensive and they can be slow to carry out, because you have to wait for outcomes. But they have some advantages over randomized trials: because you're not randomizing, you don't have to deal with the ethical issues and things like that. Another type of observational study, which is becoming more common, has to do with databases, databases that exist for other purposes that we can use for research. These involve retrospective data where there was passive data collection. Passive, meaning the research investigators aren't actively collecting data; data are already being collected for some other reason. One example would be electronic medical records, which are now very common. If you go see your clinician and they record information about you, there's a good chance they're doing that electronically. But the purpose of that is not research; it's for the clinician to have as much information about you as possible to make treatment decisions. However, those kinds of databases can then be used for research. Another example would be administrative data like insurance claims, and there are registries, like cancer registries. Advantages of these kinds of data are that you typically have large sample sizes. With electronic medical records, for example, you might have data from everybody in some health system. They can be quite inexpensive: because the data already exist, it's a matter of pulling the data. And there's potential for very rapid analysis. Because the data already exist, we might already have information on treatments and on outcomes, so it's a matter of pulling the data and analyzing it, which should be much faster. However, the data are typically lower quality; there's no uniform standard for data collection.
For example, if you have laboratory measurements, there might actually be different labs for different people in the study, with different procedures and protocols and so on. The same goes for even simple things like taking blood pressure: there might be variability in how somebody actually goes about getting blood pressure measurements, and that could vary within a clinic, across clinics, and so on. Another aspect of this is that research investigators don't have control over when data are collected. With electronic medical records, for example, some people might have very frequent encounters with the health system; somebody who's sick might see their physician very often, whereas somebody who's in good shape and healthy might not show up very often at all. So some people will have a lot of data and some not as much. These are two general types of observational studies we could consider. But of course we're not randomizing treatment, so that's the big challenge with these kinds of data. With an observational study, typically, the distribution of these confounders, these variables that we're concerned about, will differ between the treatment groups. As an example, imagine that older people are more likely to get treated and younger people are more likely to get the control treatment or no treatment. I have a hypothetical graph below, and you'll see this is what it might look like: the distribution for the treated people, which is the blue curve, has more data to the right, so its peak tends to be maybe around the mid to upper 60s, whereas the curve on the left, which is the untreated or control population, tends to be younger, with a peak maybe in the mid 50s.
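To make the hypothetical graph concrete, here's a small simulation (the numbers are again made up) where the probability of treatment increases with age. Unlike the coin-flip trial, the treated group now ends up noticeably older on average, which is exactly the imbalance the two curves depict.

```python
import random
import statistics

random.seed(1)

# Hypothetical observational setting: same age distribution as before,
# but now the probability of treatment rises with age.
ages = [random.gauss(60, 10) for _ in range(10_000)]

treated, control = [], []
for age in ages:
    # Older people are more likely to be treated; clip to keep
    # probabilities away from 0 and 1.
    p_treat = min(max((age - 40) / 40, 0.05), 0.95)
    (treated if random.random() < p_treat else control).append(age)

# Unlike a randomized trial, the age distributions now differ:
# the treated mean sits well above the control mean.
print(statistics.mean(treated), statistics.mean(control))
```

Note the curves still overlap, as the next part of the lecture emphasizes: some treated people are young and some controls are old, just in different proportions.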
But you also see there's a lot of overlap: there are older people who are treated and there are younger people who aren't treated. In general, though, older people tend to get treatment A=1. So again, this is unlike a randomized trial, where we would expect balance, where we would expect these distributions to look essentially the same between the treated and untreated. So now we'll start to think about matching. With matching, we're going to attempt to make the observational study more like a randomized trial. The main idea is that we'll match individuals in the treatment group, A=1, to people in the control group, A=0, but we'll match them on the covariates X. So for each treated person, we'll try to find a control person who has similar or the same values of X. Imagine our example where older people are more likely to get treated, A=1. In an observational study, at younger ages there are more people with no treatment or the control treatment, A=0, and at older ages there are more people with A=1. In a randomized trial, on the other hand, at any age there should be about the same number of treated and untreated people. By matching, we'll try to accomplish the same thing, where at any particular age, there should be about the same number of treated people and controls. So what are some advantages of matching? Well, here we're basically doing the hard work of dealing with confounding at what you could think of as, roughly, the design phase. And what I mean by that is without looking at the outcome; I'll call that the design phase. So imagine that we have the outcome separately and we're not going to deal with it at all. What we'll do then is, using matching, control for confounding without using the outcome, blinded to the outcome, so the hard work has to do with finding good matches.
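As a minimal sketch of the matching idea, here is greedy 1:1 nearest-neighbor matching on a single covariate, age, with made-up numbers. Real analyses typically match on many covariates at once (via a distance metric or a propensity score) and use dedicated software, so treat this purely as an illustration of the mechanics.

```python
def match_nearest(treated_ages, control_ages):
    """Greedy 1:1 nearest-neighbor matching on age, without replacement."""
    available = list(control_ages)
    pairs = []
    for t_age in treated_ages:
        # Find the closest still-available control for this treated person.
        best = min(available, key=lambda c_age: abs(c_age - t_age))
        pairs.append((t_age, best))
        available.remove(best)  # each control is used at most once
    return pairs

# Hypothetical data: treated people tend to be older, as in the lecture.
treated = [68, 72, 65, 59]
controls = [55, 58, 66, 70, 73, 52]

for t, c in match_nearest(treated, controls):
    print(t, c)
```

After matching, each treated person is paired with a control of similar age, so within the matched sample the age distributions are approximately balanced, mimicking a trial.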
So we're going to match treated people to control people on these covariates X, and we're going to do that blinded to the outcome. This is similar to randomizing in a trial, where we're controlling for X without looking at the outcome. In a randomized trial, the outcome doesn't yet exist; in this case, imagine the outcome exists but we're going to be blinded to it, we're going to ignore it. Another advantage of matching over other methods for confounder control is that it can reveal a lack of overlap in the covariate distributions. There could be some treated people who are unlike any of the control people, which would suggest that those treated people essentially had no chance of getting the control treatment. That's very useful, because we really would like to exclude them from the study: we don't want to make inference about people who had no chance of getting the other treatment, because that would involve extrapolation. We couldn't have any information about the effect of treatment on people who had no chance of getting the other treatment. And if you remember the positivity assumption, it says that for any set of covariates, there should be a nonzero probability of getting either treatment. So revealing this lack of overlap could help you identify a positivity assumption violation; you can then exclude those people from the study and have the positivity assumption met. If you don't do matching, you might not notice this kind of lack of overlap. So now imagine we've already matched. Once the data are matched, we can basically treat it like a randomized trial, since we should have balance in the covariate distribution between treated people and controls. Now we can do an outcome analysis that could be relatively simple, just like in a randomized trial where, because you've randomized, you can typically do something very simple, for example comparing the mean of the outcome between the two groups.
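One simple way matching-style thinking can surface a lack of overlap is a caliper check: for each treated person, ask whether any control is within some distance of their age. The 2-year caliper and the ages below are hypothetical choices for illustration.

```python
def flag_no_overlap(treated_ages, control_ages, caliper=2.0):
    """Flag treated units with no control within `caliper` years of age.

    Such units suggest a positivity problem: treated people unlike any
    control, whom we may prefer to exclude rather than extrapolate about.
    """
    flagged = []
    for t_age in treated_ages:
        if all(abs(t_age - c_age) > caliper for c_age in control_ages):
            flagged.append(t_age)
    return flagged

# Hypothetical data: the 90-year-old treated patient has no comparable
# control, so they get flagged; the others have controls within 2 years.
print(flag_no_overlap([55, 62, 90], [54, 60, 63, 58]))
```

Excluding the flagged people changes the population you make inference about, but it lets the positivity assumption hold for everyone who remains.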
We should be able to do something very similar here. So it feels a lot like a randomized trial.
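Once the matched data are in hand, the outcome analysis really can be as simple as a difference in means across matched pairs, just as the lecture suggests. The outcome values below are hypothetical numbers, only there to show the mechanics.

```python
import statistics

# Hypothetical matched pairs: (treated person's outcome, matched
# control's outcome). After matching on X, a simple comparison of
# means is reasonable, much as in a randomized trial.
matched_pairs = [(7.1, 6.0), (5.4, 5.9), (8.2, 6.5), (6.3, 6.1)]

treated_outcomes = [t for t, _ in matched_pairs]
control_outcomes = [c for _, c in matched_pairs]

effect_estimate = (statistics.mean(treated_outcomes)
                   - statistics.mean(control_outcomes))
print(round(effect_estimate, 3))
```

In practice one would also account for the pairing when computing standard errors (for example with a paired test), but the point estimate itself is just this simple contrast.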