Welcome to course 5 on combining and analyzing complex data. This one will be taught by me, Richard Valliant, and Frauke Kreuter. We've got quite a bit of material to cover in this course. I'd like to give you a brief overview of what's going on. In the first module, basic estimation, we'll talk about things like estimating means, totals, and quartiles and how you do that in complex samples. In the second module, we'll talk about model fitting: how do you estimate things like the parameters of linear regression models and nonlinear models like logistic regression, and we'll look at software for both of these. In Module 3, we're going to look at some basic methods of record linkage, which is becoming a more important topic these days. Associated with that, in Module 4, are some ethical issues in linking data, and these may vary from one country to another. It's worthwhile that you know something about those.

In Module 1, basic estimation, we'll start out with this video and talk about totals and means. In a complex design, a key thing is that you need to account for the weights. If your sample were a little miniature of the population, then everything would have the same weight and a lot of analyses would be simplified. But because of things like varying selection probabilities in the sample design, nonresponse adjustments, or calibration to external control counts, we typically have different weights for different units, and you shouldn't ignore that because the weights mean something. Another thing that you need to account for in complex sample analysis is that things like weighting, stratification, and multiple stages of selection affect the standard errors that should be estimated, so we need to properly account for those. Fortunately for us, there's software out there available for analyzing complex samples. We'll give you a number of examples of how to use that software. Now, totals are the easiest thing to talk about, so let's think about those.
If you've got weights that are scaled in such a way that they take the sample up to the population, projecting this small set that you've got up to the big set of the population, then you estimate totals in the following way. You sum over the sample units (that's what "i, an element of S" means: i indexes the units and S is the set of sample units), and we take the weight for each unit times its data value, and that'll be an estimated total for whatever this Y variable is. It could be income, or, if Y is zero or one, it could be the number of people who have diabetes or the number of people whose water supply is somehow contaminated; it could be all sorts of things. Now, for the mean, all we do is take that estimated total and divide by the sum of the weights. Again, if the weights are scaled to estimate population totals, the sum of the weights is going to be an estimate of the number of units in the population. Also, if we were to sum over just a subset, like the males in your sample if you're sampling people, that will be an estimate of the number of males in the population. It'll be an estimate of the count of units in whatever subgroup you sum over. So that's the very handy thing about the standard way of constructing complex survey weights. Now, model parameter estimates typically depend on estimated totals, so if you can figure out how to estimate totals, you can typically figure out how to estimate model parameters, and there are routines in the software that will do that for you. Quartiles are a little bit different, and the software choices are more limited, but here's how the algorithm goes. What we do is we first identify the variable we want quartiles of. This would be a quantitative variable like income or years of education or something like that. We sort the file from low to high based on that Y variable, and then associated with each unit we've got the weight. So we accumulate the weights until we reach a certain point.
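As a small illustration, here is a sketch in Python of the weighted total and weighted mean just described. The data values and weights are made up for the example, and the weights are assumed to be scaled so that the sample projects up to the population.

```python
# Sketch of a weighted total and weighted mean for a complex sample.
# The weights and y values below are illustrative, not real survey data.

weights = [250.0, 310.0, 180.0, 260.0]    # survey weights w_i
y = [42000.0, 55000.0, 38000.0, 61000.0]  # analysis variable, e.g. income

# Estimated total: sum over the sample of w_i * y_i
total_hat = sum(w * yi for w, yi in zip(weights, y))

# The sum of the weights estimates the population count N
n_hat = sum(weights)

# Estimated mean: estimated total divided by the sum of the weights
mean_hat = total_hat / n_hat
```

Summing the weights over only a subgroup, say the males in the sample, would likewise estimate the count of that subgroup in the population.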
So in the case of estimating the median, we want to accumulate weights until we get to the 50 percent point, where 50 percent of the sum of all weights is reached. Then what you do is look at the Y value for the first unit that's got a cumulative weight of 50 percent or more of the total weight, and that'll be your median value. Sometimes that requires some rounding off because of the discreteness of the sample, but that sort of thing is built into the software also. You could do other things like the first and third quartiles: just accumulate the weights until you get to the appropriate point, 25 percent or 75 percent of the total weight, look at the Y value that goes with that unit, and that will be your estimate of the first or the third quartile. In that way it's fairly straightforward; the hard part with quartiles is estimating a measure of precision, which we will talk about also. In the next video in this module, we'll get into the software that's available for some of these analyses.
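The cumulate-the-weights algorithm above can be sketched as follows. This is an illustrative helper with made-up data, not the routine from any particular survey package, and it omits the rounding and interpolation refinements that real software applies.

```python
# Minimal sketch of the weighted-quantile algorithm: sort the file on y,
# accumulate the weights, and take the y value of the first unit whose
# cumulative weight reaches p times the sum of all the weights.

def weighted_quantile(y, weights, p):
    pairs = sorted(zip(y, weights))   # sort the file low to high on y
    target = p * sum(weights)         # e.g. the 50 percent point for the median
    cum = 0.0
    for yi, wi in pairs:
        cum += wi                     # accumulate the weights
        if cum >= target:             # first unit at or past the p point
            return yi

# Median: accumulate until 50 percent of the total weight is reached
median = weighted_quantile([3, 1, 4, 2], [10, 40, 20, 30], 0.5)
```

Using p = 0.25 or p = 0.75 in the same helper gives the first and third quartiles.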