So now we understand the prediction target. Next, let's talk about cohort construction. Why do we need cohort construction, which identifies the set of relevant patients for building a model? In general machine learning practice, we often ignore this step. We're often given a data set to study, then we build a model on that data set. However, for healthcare predictive modeling, we need to first create the data set before we build a model, which means we have to define the study cohort and the corresponding features.
So the reasons for cohort construction are these. First, we want to avoid obvious models, which often have low utility. For example, age predicts mortality. If we build a heart failure predictive model on the entire population, we can easily obtain an accurate model, but one that depends heavily on the age feature. Second, we usually have a particular population of interest. For example, we may want to build a heart failure predictive model for African Americans because of the high prevalence of HF in that population. Third, data acquisition cost is another consideration, as each patient sample may cost extra money.
So cohort construction is about defining the study population. Given the prediction target, only a subset of patients in the whole population is relevant, and they are the study population. But be aware that it often may not be possible to obtain data from everyone in the study population. As a result, the data set we study is only a subset of the study population.
To define the study population, there are two different axes to consider. On the vertical axis, we can choose a prospective study or a retrospective study. On the horizontal axis, we can choose a cohort study or a case-control study. Depending on the combination, we have four different options: prospective cohort study, prospective case-control study, retrospective cohort study, and retrospective case-control study. So let's go through them in more detail.
Next, prospective study versus retrospective study. So let's talk about the differences between these two. In a prospective study, we first identify the cohort of patients, then we decide what data elements to collect, then we start collecting the data over time. In contrast, in a retrospective study we first identify the patient cohort from existing data, for example the electronic health records of patients, then retrieve all the data about those patients.
So here is a quiz question about prospective and retrospective studies. Each row represents a particular property. Pick the study design that has the corresponding property. More specifically, which one has more noise in the data? Which one is more expensive? Which one takes longer to conduct? And which one is more commonly done on a large data set?
So here is the answer. A retrospective study often works on data with more noise, as the data were often not created to support this research study, but for some other purpose. So you will hear the notion of secondary use of data, and that often refers to a retrospective study. Prospective studies are often more expensive and take longer, because the data has to be collected from scratch. Because of the cost and time constraints, the size of the data set for a prospective study is often very limited. As a result, it's more common to work with larger data sets in retrospective studies.
So next, let's talk about the cohort study, which is on the other axis. In a cohort study, the goal is to select a group of patients who are exposed to a specific target risk. For example, say we want to build a predictive model for heart failure readmission. Here heart failure readmission means that a heart failure patient, after discharge from the hospital, comes back to the hospital due to heart failure within a short period of time, for example 30 days.
So in this case, if we want to do a cohort study, we should include all the HF patients who were originally discharged from the hospital, because they can potentially be readmitted after discharge. So the key is to define the inclusion/exclusion criteria to construct this cohort. Here is the visual illustration. We start with all patients, then identify the relevant patients for a particular risk, for example heart failure readmission. Then we want to build a model to predict the target. So in this case, the cohort contains both positive and negative examples. Here, for example, positive examples are heart failure patients with readmission events, and negative examples are the ones without readmission.
The case-control study is another very common study design. In this design we're trying to identify two sets of patients, cases and controls. Cases are the ones with the disease, and controls are patients with negative outcomes, for example healthy patients, who are otherwise similar to the cases. So the key here is to define the right matching criteria between cases and controls.
Here is an example of a case-control study, in this particular case a retrospective case-control study. The goal is to predict heart failure cases against controls. The study population consists of about 50,000 patients: 4,644 of them are heart failure cases and 45,981 are healthy controls. The controls are matched to the cases on age, gender, and clinic.
As we just described, the matching criteria for case-control studies are often very important, especially in this retrospective study setting. There are a number of different matching strategies. The first is group matching, or frequency matching. In this case, instead of matching individual patients, we match the two groups at the group level by checking some statistics. For example, we want the controls to have the same proportions on certain features as the cases.
For example, the percentage of cases who are married should be the same as the percentage in the controls.
Then we have the individual matching strategy. Here, for each case, we find a control that matches it on certain features. We first identify the set of matching features. For example, we can match on age, gender, and the clinic they go to. Then we try to match each case to a control on those features. Of course, if you want to match patients on many features, it becomes very difficult, and very easily you won't find any matches.
So propensity matching is another way to get around that. We first fit a logistic regression model on the entire population, where the target we want to predict is, for example, the risk for a patient to develop the disease. Then we can match cases and controls that have the same or similar scores from the logistic regression. So instead of matching on individual features, of which we have many, we're just matching on a one-dimensional, single score, which is the output of this logistic regression. So this logistic regression essentially becomes a facilitator that helps us convert a high-dimensional set of features into a single number, and then we can match on that number. This score is called the propensity score, and this matching strategy is called propensity matching.
There's also a strategy called nested case-control matching. In this case, we take all the cases from a population, then we match each case on certain features with multiple controls, not just one. The benefit of this particular design is that we can have a larger data set. If you just do one-to-one matching, sometimes you won't get a good enough data set, just because the one control you matched to a patient is not a good example. Having each case matched to multiple controls gives us a bigger data set.
Even if there are some bad examples among the controls due to data quality issues, we can still remove them without losing all the controls for any given case.
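To make the propensity matching idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration, not something from the lecture: the toy patient records, the helper names `fit_logistic`, `propensity_scores`, and `match_controls`, the greedy nearest-score matching rule, and the `k` and `caliper` parameters. It fits a small logistic regression by gradient descent, uses the predicted probability as the propensity score, then matches each case to up to `k` unused controls whose scores are close.

```python
import math

def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def propensity_scores(X, w, b):
    """One score per patient: the model's predicted probability of being a case."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def match_controls(scores, is_case, k=2, caliper=0.2):
    """Greedily match each case to at most k unused controls within the caliper."""
    available = {i for i, c in enumerate(is_case) if not c}
    matches = {}
    for i in (i for i, c in enumerate(is_case) if c):
        # Rank remaining controls by score distance (index breaks ties).
        nearest = sorted(available, key=lambda j: (abs(scores[j] - scores[i]), j))
        chosen = [j for j in nearest[:k] if abs(scores[j] - scores[i]) <= caliper]
        matches[i] = chosen
        available -= set(chosen)  # matching without replacement
    return matches

# Made-up toy data: (age in years, female flag, is_case flag).
patients = [
    (72, 1, 1), (68, 0, 1), (75, 0, 1), (70, 1, 1),
    (71, 1, 0), (66, 0, 0), (74, 0, 0), (45, 1, 0),
    (50, 0, 0), (69, 1, 0), (38, 1, 0), (73, 0, 0),
]
# Roughly center and scale age so gradient descent behaves well.
X = [[(age - 60) / 10.0, float(female)] for age, female, _ in patients]
is_case = [bool(c) for _, _, c in patients]

w, b = fit_logistic(X, [1.0 if c else 0.0 for c in is_case])
scores = propensity_scores(X, w, b)
matches = match_controls(scores, is_case, k=2, caliper=0.2)

for case, controls in matches.items():
    print(f"case {case} (score {scores[case]:.2f}) -> controls {controls}")
```

With `k` greater than 1, each case keeps more than one control, so dropping one bad control still leaves the case with a match. Setting `k=1` gives plain one-to-one matching on the score.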