Welcome back. We're going to review some of the jargon and technical terms that we've been talking about in regard to program and policy evaluation, and we'll do this by going through some common research designs that are used for evaluation in the public sector.

Let's review some of our key terms. Research design is the overall plan for evaluating a program or policy intervention. Internal validity is the strength of that research design for measuring intervention effects, that is, for establishing a causal relationship. We hope to have a strong research design that doesn't have any threats to internal validity, and to do that, it also needs a good counterfactual. External validity is the ability to generalize the findings from a program or policy evaluation beyond the setting of the research and the research design.

Now, we have talked about this research design quite a bit so far: the one-group pre-test, post-test design. This is used in program and policy evaluation a lot, for a number of reasons, but as we've also talked about, it is not a particularly strong research design from the point of view of internal validity. Let's talk about both the strengths and the weaknesses of the one-group pre-test, post-test design.

Why is this used a lot? Well, first of all, it's pretty easy to implement and it often doesn't require a lot of data. Two, it can be implemented retrospectively. What happens a lot is that there's a policy change and then someone, usually a stakeholder, says, "Hey, we made that policy change, let's look at some data to see if it worked." Then a data analyst is asked, "Can you go back and get data on these variables for the year before the policy was implemented?" Then we'll look at the year after, we'll just compare pre and post, and we'll see if the program worked. Another strength of this design is that it's pretty easy to communicate about the design and the results, if you're not trying to communicate about threats to internal validity and all the technical stuff. It's easy to create data visualizations that show, hey, the world looked like this before our policy intervention and it looked like this afterwards. What a great success!

However, not so fast, because there are so many threats to the internal validity of this research design. We've talked about them in both technical and non-technical ways: we always have to worry about the history threat, maturation, instrumentation changes, testing effects, etc. It's really easy to overestimate the effects of an intervention, because a lot of the change from time 1 to time 2 could be due to all these other things going on. Another weakness of this design is that it's oftentimes used, I should say misused, in intentional and politically motivated ways. A lot of times people use it to show, hey, since our administration took over, or since we made this policy change, things have gotten better, but it might not be because of that at all. Sometimes people know that they're intentionally overestimating the effects of an intervention, but it's good for political purposes, so they do it anyway.
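To make the mechanics concrete, here is a minimal sketch in Python of the naive pre/post comparison this design supports. All of the outcome values are invented for illustration:

```python
from statistics import mean

# Hypothetical example: outcomes the year before and the year after
# a policy change, for the same group. All values are made up.
pre = [52, 48, 61, 55, 50]   # pre-test observations (time 1)
post = [58, 54, 66, 60, 57]  # post-test observations (time 2)

naive_effect = mean(post) - mean(pre)
print(f"Naive pre/post estimate: {naive_effect:.1f}")

# Caution: this single number bundles the intervention's effect together
# with history, maturation, testing, and instrumentation effects, because
# there is no comparison group to serve as a counterfactual.
```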
Here's another research design that is used an awful lot in program and policy evaluation in the public sector. This is called the two-group post-test design, and I hope you're feeling more comfortable reading the graphical representations of research designs.

Here we have one group, with no pre-test information about them, but the group gets exposed to an intervention and then there's an observation afterwards. Then we have another group that didn't get the intervention. It could be a group in a different city, or county, or state or province, or even a different country; it all depends on what the intervention is. So we have a group that got the intervention and a group that didn't, and we're comparing them at the same point in time, after the one group got the intervention, and we don't have any pre-test information about either group.

What do you think might be some of the concerns we have here? What's the counterfactual? It's the comparison group and their observation point. Is that a good counterfactual? What are some of the threats to the internal validity of this research design?

Well, first, let's talk about some of the strengths. Again, this is a pretty easy design to implement and it often doesn't require a lot of data. This design is often used when the data are already available for one place that got the intervention and another place that didn't, and someone says, hey, let's compare. This place made a policy change, another place didn't, let's just compare them. It can also be implemented retrospectively; you don't have to be thinking about how to design the program or policy analysis from the beginning. You can just say, let's go back in time, we have the data to do it. Also, it's relatively easy to communicate about the design and the results: City X implemented a new policy, City Y didn't, here are the differences.

But I hope you can also see pretty clearly that the counterfactual here is extremely problematic. There are so many threats to internal validity, but the big one here is selection. How do we say that those two groups are alike except for exposure to the intervention? We can't say that at all. These two groups are probably different in many, many other ways. This is not a good research design for establishing causal relationships; there are too many threats to internal validity. This also means that this design is often used, and misused, to make arguments about programs and policies in politically motivated ways, unfortunately.
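Here is a minimal sketch in Python of the comparison this design allows. The city names and outcome values are hypothetical, invented for illustration:

```python
from statistics import mean

# Hypothetical example: City X adopted a policy, City Y did not.
# We only observe outcomes AFTER the intervention; values are invented.
city_x_post = [64, 59, 70, 62]  # intervention group, post-test only
city_y_post = [55, 57, 53, 60]  # comparison group, post-test only

difference = mean(city_x_post) - mean(city_y_post)
print(f"Post-test difference (X minus Y): {difference:.1f}")

# Caution: with no pre-test, this difference cannot separate the policy's
# effect from pre-existing differences between the cities (selection threat).
```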
Well, what about this design? This is called the non-equivalent control group design, where we do have pre-test and post-test information about a group that gets exposed to our X, our policy change, program, or other kind of intervention, compared to a group that's observed at the same points in time, pre and post, but did not have that policy implemented. What do you think about this design? Is that control group a good counterfactual? There's no random assignment to groups here. We're really just looking for, again, another city, another community; sometimes it might be schools. A group of schools made a new curriculum change, some other schools didn't, let's compare the two. But there's no random assignment to the groups. What are some of the threats to internal validity? This design is also sometimes called the difference-in-differences design.

Well, here we do have some strengths. We have a comparison group that could serve as a good counterfactual. The idea, again, is that it's a good counterfactual if we think the comparison group is just like the intervention group except for exposure to the intervention. We have to worry about that, but sometimes we can find good comparison groups. That's a stronger counterfactual than in the other designs we've talked about already. Also, this design can sometimes be implemented retrospectively if the data for the pre-test observations already exist.

Also, we're going to be talking a lot more about randomized controlled trials, because that's the only research design that really controls for all the threats to internal validity. But there are some ethical issues associated with randomization, especially with interventions in the public sector. With the non-equivalent control group design, there aren't ethical issues about randomly determining which group of people gets more resources, or a new policy, or a new program, versus another group that does not get access to those resources, even before we know whether it works or not.

Those are some of the strengths, but there are weaknesses too: there are still threats to internal validity here. Hopefully you already noted to yourself the big one, selection. We really can't say that those two groups are exactly the same, with the same maturation processes going on, the same exposure to historical events, and the same data and instrumentation issues. We can't say they're exactly the same except for exposure to the intervention. Also, if the data don't already exist, it's really expensive to collect both pre-test and post-test data on two groups of people. And here we're getting into a situation where the results are a bit more complicated to communicate to some stakeholders, and also to the public.

We can pick on all of these research designs. They all have threats to internal validity, unless we are operating in a scenario where we're actually doing a true experiment, or what's referred to as a randomized controlled trial, or RCT. In this case, we are randomly assigning exposure, and that's what the R stands for. When you see an R in the graphical display of a research design, it means that the groups were randomly assigned to either get the intervention or not. Here we have pre-test and post-test information on people, and we're comparing the pre and the post in the group that got the intervention and the group that did not.

Randomization theoretically means that these two groups should be exactly the same, except for exposure to that X, to that intervention or policy change. We're not saying that there weren't testing effects or maturation effects, etc. But since we're getting rid of the selection issue, we're assuming that the two groups have the same maturation and testing effects going on, and that if we do observe differences between the two groups, it's because of the intervention. This design has such strong internal validity that we're really able to say that any changes from time 1 to time 2 that differ between the two groups are the result of the intervention. The randomized controlled trial theoretically controls for all the threats to internal validity. It is the strongest research design that we have at our disposal for trying to establish a causal relationship. We will talk about it in much more detail, with some examples, coming up soon.
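To tie the last two designs together, here is a minimal Python sketch of random assignment followed by the difference-in-differences comparison. Everything here is hypothetical: the unit names, the scores, and the built-in "true effect" of 4 points are all invented. Without the random assignment step at the top, the same arithmetic gives the non-equivalent control group (difference-in-differences) estimate:

```python
import random
from statistics import mean

random.seed(42)  # for a reproducible illustration

# Hypothetical example: randomly assign 10 made-up units (say, schools)
# to treatment or control -- this is the "R" in the design diagram.
units = [f"school_{i}" for i in range(10)]
random.shuffle(units)
treated, control = units[:5], units[5:]

# Invented pre-test scores, and post-test scores that add a shared
# over-time trend plus a 4-point effect for treated units only.
pre = {u: random.gauss(50, 5) for u in units}
post = {u: pre[u] + random.gauss(3, 1) + (4 if u in treated else 0)
        for u in units}

# Difference-in-differences: the change in the treated group
# minus the change in the control group.
change_treated = mean(post[u] - pre[u] for u in treated)
change_control = mean(post[u] - pre[u] for u in control)
print(f"Difference-in-differences estimate: {change_treated - change_control:.1f}")
```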