In this section, we'll look at some of the types of data that we'll be learning how to analyze in this course and its follow-up. So, in this short lecture, we'll give a brief summary of the types of data that frequently occur in research studies and that we'll deal with analytically in this class over both terms. By the end of this lecture section, you should be able to distinguish between continuous, binary, categorical, and time-to-event data types and give examples of each. We'll first talk about continuous data, which is also what we'll look at first in the course. A defining characteristic of continuous measures is that a one-unit change in the value means the same thing across the entire range of data values. Examples might be blood pressure measured in millimeters of mercury; weight measured in pounds, kilograms, or ounces; height measured in feet, centimeters, or inches; age measured in years or months; and salary or income level in dollars per year, euros per year, etc. Binary data, on the other hand, can only take on one of two values, yes or no. Examples of binary, or dichotomous, data could be whether or not a person contracted polio, yes or no; whether or not a cancer patient is in remission, yes or no; the biological sex of a person at the time of study, male or female, or, framed as a yes/no question, is the subject male?; whether or not somebody has quit smoking, yes or no; etc. We can expand this notion to something called categorical data, something that can have more than two categories. Binary is a special case of categorical data with two levels, but we can extend this to have more than two possible values. There are two different types of this. There is something we might call nominal, in which there is no inherent order to the categories: a person's race or ethnicity, their country of birth, their religious affiliation, or perhaps their gender, in contrast to biological sex, which is frequently defined as binary.
Gender can be multi-category as a construct. There are also categorical data types that are ordered (ordinal): the categories are ordered so that as you move up in categories, it indicates an increase or a decrease in whatever's being categorized. So, say, income level categorized into four categories from least to greatest, or something many of you have seen when you've taken a survey before: the degree of agreement on different survey items, usually put into five categories starting with strongly disagree and moving up to strongly agree in that progression: strongly disagree, disagree, neutral, agree, and strongly agree. So, as you move up in the category numbers, you increase the level of agreement. Then we'll also look at two different ways of expressing this idea of time-to-event data. In the first, we'll consider situations where we have count, or event count, data collected over a fixed period of time. This is the way things arise a lot in vital statistics and spatial databases. So, for example, maybe the total lung cancer cases occurring in a given year for a state in the United States or for a country, or we might look at the number of flu diagnoses per week in a given month. What we have here is a fixed period of time over which we're looking at the accrual of cases, but we don't have information on when each of the cases was identified. So, for each person who was diagnosed with lung cancer in the given year, we don't know when they were actually diagnosed within that year; we just know that it happened in that year. Sometimes we have more specific data on the outcomes: not only do we know that the event occurred within a certain period of time, but also exactly when it did. Think of this as an outcome that's a hybrid of continuous data and binary data. The binary part is whether or not an event has occurred in our follow-up period: whether or not the person quit smoking, or whether or not they developed coronary heart disease, for example.
Then the time is when they either had that outcome (they quit smoking or they developed coronary heart disease) or when they were lost to follow-up, in other words, the last time we checked in with them during the study period and they still hadn't had the event. So we have a time for each person, indicating either the time they had the event or the time they were last seen without having had the event. So again, examples of this might be time to relapse after remission: we might follow cancer patients from the time of remission for up to maybe five years after that date to see whether they relapse or not and, if they relapse, when they do in that five-year period. Or the time to quitting smoking after getting some treatment. So, the reason we distinguish these data types is that there are different approaches to summarizing them and to performing analyses comparing them across groups. Thematically, all these approaches have a common core, but the mechanics depend on the type of data we're looking at. So, for example, suppose we wanted to compare blood pressures in a clinical trial evaluating two blood-pressure-lowering medications, perhaps a randomized clinical trial where patients are randomized to receive one of two treatments, treatment one or treatment two. If we wanted to compare the efficacy of the two treatments with regard to decrease in blood pressure, we could estimate the mean blood pressure change for each of the two groups and take the difference. So, for each person, we'd measure the change in blood pressure from before they were on the drug to after. We'd average that across the drug one group, we'd average that across the drug two group, and we'd compute the difference in those mean changes between the two groups.
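As a rough illustration of the calculation just described, here is a minimal Python sketch. The blood pressure readings, the variable names, and the `mean_change` helper are all made up for illustration; none of them come from an actual trial.

```python
from statistics import mean

# Hypothetical systolic blood pressure readings in mmHg, stored as
# (before, after) for each patient. These values are invented.
drug_one = [(150, 138), (162, 150), (145, 140), (158, 147)]
drug_two = [(152, 146), (160, 155), (148, 147), (156, 150)]


def mean_change(pairs):
    """Average within-person change, computed as after minus before."""
    return mean(after - before for before, after in pairs)


# Mean change within each treatment group.
change_one = mean_change(drug_one)
change_two = mean_change(drug_two)

# Difference in mean changes between the two groups.
difference_in_mean_changes = change_one - change_two

print("Mean change, drug one:", change_one)
print("Mean change, drug two:", change_two)
print("Difference in mean changes:", difference_in_mean_changes)
```

With changes defined as after minus before, a negative difference here would mean the drug one group had the larger average drop in blood pressure. This sketch only gives the point estimate; it says nothing yet about uncertainty.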
We can then estimate a 95 percent confidence interval for the true difference in mean change and use a t-test to test for a population-level difference in the mean blood pressure change based on the data from our two samples. Don't worry, we will certainly drill down and define what we mean by confidence interval and t-test as we move on in this course. Suppose instead we wanted to compare the proportion of polio cases in the two arms of the Salk polio vaccine trial, those who were vaccinated compared to the controls. We could take the difference in the proportions of children who contracted polio in the vaccine group compared to the control group. We could also take the ratio. Taking the difference results in something called a risk difference; taking the ratio results in something called the relative risk, or risk ratio. Then we can estimate 95 percent confidence intervals for these quantities and/or use what's called a chi-square test to test for population-level differences based on the information in our two samples. Finally, suppose we wanted to compare time to contracting HIV between HIV-negative IV drug users in a needle exchange program and HIV-negative IV drug users not enrolled in a needle exchange program. We could estimate an incidence rate for contracting HIV for each of the two groups and then compute the incidence rate ratio comparing these quantities. We could construct what's called a Kaplan-Meier curve for each group to provide a graphical description of the time-to-HIV profile for each group. And we can estimate a 95 percent confidence interval for the incidence rate ratio and/or use a log-rank test to test for a population-level difference in time to HIV between the needle exchange group and the control group.
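To make the binary and time-to-event comparisons a bit more concrete, here is a minimal Python sketch of the point estimates mentioned above: the risk difference, the relative risk, and the incidence rate ratio. All counts and follow-up times are invented for illustration (they are not the Salk trial results or real needle exchange data), and the helper names `risk` and `incidence_rate` are assumptions of this sketch. Each time-to-event record is written as a pair, (years of follow-up, whether the event was observed), which mirrors the hybrid of continuous and binary data described earlier.

```python
# All numbers below are hypothetical and chosen only to keep the arithmetic simple.

def risk(cases, total):
    """Proportion of people in a group who had the outcome."""
    return cases / total


# Binary outcome: contracted the disease, yes or no, in two trial arms.
vaccine_cases, vaccine_total = 10, 1000
control_cases, control_total = 40, 1000

risk_vaccine = risk(vaccine_cases, vaccine_total)
risk_control = risk(control_cases, control_total)

risk_difference = risk_vaccine - risk_control  # difference in proportions
relative_risk = risk_vaccine / risk_control    # ratio of proportions

# Time-to-event outcome: (years of follow-up, event observed?).
# False means the person was last seen without having had the event.
exchange = [(2.0, False), (1.5, True), (3.0, False), (3.5, False)]
no_exchange = [(1.0, True), (2.0, True), (0.5, True), (1.5, False)]


def incidence_rate(records):
    """Number of events divided by total follow-up time (person-years)."""
    events = sum(1 for _, had_event in records if had_event)
    person_time = sum(years for years, _ in records)
    return events / person_time


rate_exchange = incidence_rate(exchange)
rate_no_exchange = incidence_rate(no_exchange)
incidence_rate_ratio = rate_exchange / rate_no_exchange

print("Risk difference:", risk_difference)
print("Relative risk:", relative_risk)
print("Incidence rate ratio:", incidence_rate_ratio)
```

These are point estimates only. The confidence intervals, chi-square test, Kaplan-Meier curves, and log-rank test are separate steps that this sketch does not attempt.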
So in summary, the three major types of data that we will deal with in this class, and that are of general interest in public health studies, are continuous; binary, along with its extension into multiple categories; and time-to-event. There are different approaches, thematically similar but mechanically different, for summarizing and analyzing these different data types, and that will be the thrust of what we do in both quarters of this class: tackle different ways to analyze different data types, but emphasize the common connections between these approaches. In the next series of lectures in this module, lectures two through five, we'll talk about ways to summarize the data types we considered in this lecture set. We'll talk about ways of summarizing them numerically and graphically for single samples, and then also when we want to compare results between two or more samples.