Okay. So, right now, we're going to talk about the importance of defining good research questions for making sound inferences. The overall idea here is that if we want to start applying statistical procedures to data, we have to have a good idea of what question we're answering in the first place. So, we're going to talk about some examples, and then moving forward, we're always going to refer back to a research question when we go through these different examples.
Okay. So, we're going to review the importance of well-formulated research questions for quality statistical inference. We're going to look at examples of different inferential approaches using the NHANES data to address explicit research questions. Like I mentioned, we're always going to look back and say, okay, exactly what research question are we trying to answer? These examples will be supplemented with working Python code, so you can go through, run the code, replicate the analysis results, and follow along with the different inferential examples. But we always want to make sure that we're answering a good, well-formulated research question when we perform a statistical analysis.
So, a little bit about good research questions. Well, we know data are everywhere these days: big data, small data sets, designed studies, process data. It's really easy to find a dataset, an electronic version of a dataset, import it into some statistical software, and run some analysis. We can get data very easily from many different sources, or again, we could design studies, design survey samples, and collect data from populations. But wherever we get the data from, inferences based on those analyses are going to tend to miss the mark if we don't have a well-formulated research question underlying the study that we're trying to perform. So, we have to ask the question: what really defines a good research question?
So, here are some key aspects to think about when we're defining a research question. First of all, what is the target population of interest? It's a really good idea to write down a very clear and concise statement of what the target population of interest is and then make that clear when you're writing your research question.
Second, is the research question descriptive or analytic? Now, what's the difference between those? With a descriptive question, we might be interested in the mean income in a specific population. So, we're interested in estimating a descriptive parameter, the average income for that population. Or we might be interested in more of an analytic question. Analytic questions generally refer to the relationships between different variables. So for example, we might be interested in the relationship between income and quality of life in a specific population. So we're not just estimating a mean or a total or a standard deviation, but rather quantifying the relationship between two variables. Those types of questions are generally referred to as analytic.
Third, has the question been asked before, and will the new study add knowledge that didn't exist before? Many studies build on prior studies that may have been asking similar questions, but we need to make it clear whether the question has ever been asked before and what exactly the new study is going to be adding to the knowledge that we already have about this particular topic.
Then fourth, are the variables readily available, measured appropriately, or feasible to measure using well-established tools? So, we need to make sure that it's going to be possible to actually collect the data that we're interested in, and that we're using appropriate measures for exactly what it is that we're trying to measure. We're going to talk a little bit more about that with different examples.
But we need to make sure that the variables that we're interested in are readily available and straightforward to measure, and we need to make sure that what we're measuring is actually capturing the concepts we have in mind, the ones that we wanted to measure. So for good research questions, we think about these four properties that we just discussed. If we craft the research question following those four essential properties, and we use an appropriate statistical procedure that's well aligned with that research question, we can make very good inferences related to that question. But we need to make sure all five of these things go together: the four key aspects that we just talked about, and choosing an appropriate statistical procedure that will lead to good inferences.
Now think about the absence of a good research question and just blindly running analyses: we bring a dataset into some software and just start running different analyses, generating different results, writing up those results. If we do all this in the absence of a good research question, this could very easily lead to poor insights and incorrect decisions. We need to make sure that the analyses that we're running are well aligned with a carefully crafted research question; that will maximize the quality of the inferences that we make.
So, here's a bad question: what is the relationship between academic performance and summer internship success? It sounds straightforward on the surface. We're interested in an analytic question, what's the relationship between these two variables, one called academic performance, one called summer internship success. But let's break this question down in a little bit more detail. First of all, what's the target population? Well, we have no idea. What population is the author of this question talking about? We really have no idea from the way the question is stated. Second of all, is the question descriptive or analytic?
Well, we see that the author is interested in the relationship between performance and success, so this would be framed as an analytic question. That's good, because it's making clear what type of analysis the author wishes to perform. Third, will answering the question provide new knowledge? Again, we have no idea. The question just states that we want to look at the relationship between performance and success. It doesn't say whether it's adding on to any existing knowledge. Number four, how are performance and success even measured? Well, again, we have no idea. What measure of academic performance does the author have in mind? Is it GPA? Is it final exam performance? Is it class attendance? Is it being able to make adequate progress towards a major? We really have no idea. What about success? What defines the success of a summer internship? Is it finishing the internship? Is it getting a positive evaluation from whoever your supervisor was at that internship? We really have no idea how these different concepts are going to be measured, the way that this question is written. So, pretty much, this question only hits on one of the four key properties of a well-written question. We definitely need to rethink how to write this.
Here's a good question that we're going to build on as we go through different examples using the NHANES data. When considering Hispanic adults age 18 plus in the United States in 2015-2016, what is the difference between males and females in mean systolic blood pressure? So, there are a few extra words, but those words are providing additional detail about what we're interested in. Let's break down this question. First of all, what's the target population? Well, in this case it's clearly defined. We're talking about Hispanic adults age 18 and above in the United States in 2015-2016. So the target population becomes clear: the who, the what, the when, and the where. That's the population that we want to make conclusions about.
Number two, the objectives are clear. This particular question is focused on a descriptive comparison of means. We want to calculate the mean systolic blood pressure for both males and females and then compare those means for this particular target population. Number three, has the question been asked before? We don't know on the surface, the way the question has been stated. Probably, but perhaps for other years. We're making it clear that we're interested in generating new knowledge from 2015 and 2016, maybe based on a recently collected or recently available dataset. We're getting new knowledge for this specific population in these specific years. Then number four, the measures are made clear. So we have gender, or sex: we want to compare groups defined by male and female. And then we also have systolic blood pressure. So it's clear that we're measuring this physiological characteristic, and we want to calculate the averages of systolic blood pressure for each of these two groups and compare them. So the additional detail here makes the objectives of the study clear, and this is a good question that we can build on when choosing an appropriate statistical procedure.
So, good questions make it very easy to choose inferential procedures. Let's suppose we have a data set collected from a sample of Hispanic adults age 18 and above in the United States in 2015-2016. In this case, that sample is going to be the NHANES 2015-2016 sample, and we want to compare means between two groups, males and females, on a continuous variable of interest, systolic blood pressure. Given this information, the inferential procedure that we would likely choose is an independent samples t-test. This type of test allows us to compare means in two independent groups on a continuous variable of interest.
Now, an important caveat moving forward. We're going to be treating the data from the NHANES as if they come from a simple random sample.
So moving forward, when we start to introduce applications of the statistical procedures that we'll be talking about in this particular course, we're going to start simple, and we're going to assume that the NHANES data come from a simple random sample. Now, as we learned in course one when talking about where data come from, remember that complex sample design features for probability samples like the NHANES sample generally need to be accounted for in inferential procedures. We will talk more about complex sample survey analysis later, but as we're introducing these procedures and the basics of applying them, we're going to assume that the NHANES data come from a simple random sample and start simple with examples of these procedures. Later on in this specialization, we're going to revisit the same examples and take the complex sampling features of the NHANES into account in the analysis, both when generating estimates and when making conclusions about the population. So, when we talk about these different examples, we're more or less setting a baseline: we generate our estimates and perform our analyses under the assumption of a simple random sample. Later on, we're going to account for complex sampling features and revisit the conclusions that we make about these target populations.
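To make that baseline concrete, here is a minimal Python sketch of the analysis the research question points to: restrict to the target population, split by sex, and compare mean systolic blood pressure with an independent samples t-test. Everything specific in it is an assumption for illustration only. The records are made up (they are not NHANES values), the field names `sex`, `age`, `hispanic`, and `sbp` are hypothetical, and the pooled-variance form of the test is one common choice that the discussion above does not specify. It also treats the data as a simple random sample, as described above, and ignores the complex sampling features.

```python
import math
from statistics import mean, variance

# Made-up records for illustration only; these are NOT NHANES values,
# and the field names are hypothetical.
records = [
    {"sex": "male", "age": 34, "hispanic": True, "sbp": 120.0},
    {"sex": "male", "age": 51, "hispanic": True, "sbp": 130.0},
    {"sex": "male", "age": 27, "hispanic": True, "sbp": 125.0},
    {"sex": "male", "age": 62, "hispanic": True, "sbp": 135.0},
    {"sex": "female", "age": 45, "hispanic": True, "sbp": 115.0},
    {"sex": "female", "age": 29, "hispanic": True, "sbp": 120.0},
    {"sex": "female", "age": 58, "hispanic": True, "sbp": 118.0},
    {"sex": "female", "age": 33, "hispanic": True, "sbp": 123.0},
    {"sex": "male", "age": 16, "hispanic": True, "sbp": 110.0},     # under 18
    {"sex": "female", "age": 40, "hispanic": False, "sbp": 122.0},  # outside target population
]


def in_target_population(record):
    """Target population from the research question: Hispanic adults age 18+."""
    return record["hispanic"] and record["age"] >= 18


def independent_samples_t(a, b):
    """Pooled-variance t-test for two independent groups.

    Returns (difference in means, t statistic, degrees of freedom).
    Assumes equal variances and a simple random sample.
    """
    n1, n2 = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff, diff / std_err, n1 + n2 - 2


# Step 1: keep only the target population.
adults = [r for r in records if in_target_population(r)]

# Step 2: form the two independent groups on the continuous outcome.
males = [r["sbp"] for r in adults if r["sex"] == "male"]
females = [r["sbp"] for r in adults if r["sex"] == "female"]

# Step 3: compare the means.
diff, t_stat, df = independent_samples_t(males, females)
print(f"difference in means (male - female): {diff:.2f}")
print(f"t statistic: {t_stat:.3f} on {df} degrees of freedom")
```

The t statistic would then be compared against a t distribution with `df` degrees of freedom to get a p-value. In practice a library routine such as `scipy.stats.ttest_ind` handles that step, and it can also relax the equal-variance assumption with `equal_var=False`.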