So, in this next set of lectures, we'll look at multiple logistic regression, and we'll take a similar approach to the one we took with multiple linear regression: first giving an overview of some results and interpreting them, then drilling down on some of the mechanics of how to create confidence intervals for the results, how to test multicategorical predictors in unadjusted and adjusted models, how to do prediction or estimate probabilities from the results of a logistic regression, et cetera. So, the first thing we're going to do, before we get into the mechanics of some of these things, is present some examples of multiple logistic regressions to set the stage and give some context on how to interpret these scientifically. After viewing this section, you should be able to interpret the intercept and slope estimates from multiple logistic regression models in a scientific context, interpret the exponentiated intercept as an estimated odds and the exponentiated slopes as adjusted odds ratios, and compare the results from simple and multiple logistic regression models to assess confounding. So, let's return to something we started to look at in the simple logistic regression context, where we looked at several different predictors individually, and now we can put them together in a multiple model. We're going to use data from the National Health and Nutrition Examination Survey from 2013 and 2014. This was a representative sample of US residents, and we have data on 10,000-plus persons 0 to 80 years old. About 60 percent of them, 5,847, are adults 18 years or older with body mass index measurements, and that is the subset we'll use. As we saw before, obesity in adults is defined as a BMI greater than or equal to 30. So, we looked at some associations in the unadjusted sense, and I'll just review them here.
We looked at obesity and biological sex based on these data and saw that when we relate the log odds of obesity to sex, where sex is coded as 1 for females and 0 for males, we get an intercept of negative 0.74 and a slope for sex of 0.38. So, that slope estimates the unadjusted log odds ratio of obesity for females compared to males. No other characteristics are taken into account in this model. If we exponentiate that, we get an odds ratio of 1.46, indicating that females had 46 percent greater odds of being obese than males, no other characteristics considered. A 95 percent confidence interval for that odds ratio goes from 1.31 to 1.63. We looked at this in the lecture set on simple logistic regression. Similarly, we also looked at the relationship between obesity and HDL cholesterol levels, the unadjusted association. The resulting regression equation was as follows: the log odds of obesity equals an intercept of 1.20 plus a slope of negative 0.034 times x_1, where x_1 is HDL in milligrams per deciliter. In that lecture set on simple logistic regression, we actually looked at a loess plot to see whether the assumption that the log odds of obesity is linearly related to HDL was reasonable, and we confirmed it was. So, what we see is that increases in HDL are associated with lower log odds of obesity. The slope, Beta 1 hat equals negative 0.034, is the estimated log odds ratio of obesity for two groups of persons whose HDL levels differ by one milligram per deciliter. Exponentiating this gave us the unadjusted odds ratio of obesity for a one-unit difference in HDL, 0.967, and the 95 percent confidence interval, whose computation we showed in that previous lecture set, was 0.963 to 0.970. So, for obesity and HDL cholesterol, the unadjusted odds ratio is 0.967, or approximately 0.97.
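As a quick check on the arithmetic above, here is a minimal Python sketch that exponentiates the two unadjusted slopes quoted in the lecture. The helper name `odds_ratio_from_slope` is only illustrative and does not come from the course materials.

```python
import math


def odds_ratio_from_slope(slope: float) -> float:
    """Convert a logistic regression slope (a log odds ratio) to an odds ratio."""
    return math.exp(slope)


# Unadjusted slopes quoted in the lecture (log odds ratio scale).
slope_sex = 0.38    # females (1) vs. males (0)
slope_hdl = -0.034  # per 1 mg/dL difference in HDL

or_sex = odds_ratio_from_slope(slope_sex)
or_hdl = odds_ratio_from_slope(slope_hdl)

# Percent difference in odds implied by each odds ratio.
pct_sex = (or_sex - 1) * 100
pct_hdl = (or_hdl - 1) * 100

print(f"sex: OR = {or_sex:.2f} ({pct_sex:+.0f}% odds)")
print(f"HDL: OR = {or_hdl:.3f} ({pct_hdl:+.1f}% odds per mg/dL)")
```

An odds ratio above one reads as a percent increase in odds, and one below one as a percent decrease, which is how the 46 percent and roughly 3 percent figures arise.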
The odds ratio of being obese for two groups of persons who differ by one milligram per deciliter in HDL levels is 0.97, higher HDL to lower HDL. In other words, the higher-HDL group has three percent lower odds of being obese than the lower-HDL group when the difference is one milligram per deciliter. So, these are just reviews of the unadjusted associations. It's certainly possible that, taken together, sex and HDL cholesterol will explain more about obesity than either one alone. It's also possible that the sex of the person and HDL cholesterol level are related, in which case each predictor taken individually is explaining some shared information about obesity. Multiple logistic regression allows for the expansion of the logistic regression model to include more than one predictor in a single model. So, since I raised the question and said it's possible that HDL and sex are related, let's just look at a visual and see if there's any evidence of that. I was curious, so I did a box plot of HDL cholesterol levels by sex. There's a lot of overlap in the distributions for males and females, but females generally had higher HDL levels than males. So, there is some visual evidence that the two are related. Actually, and I'm not showing this, I looked at the mean difference between the two groups, and HDL was statistically significantly higher for females. So, we've already shown that sex was related to obesity, and now we're showing it's related to HDL, which was itself related to obesity. So, it's very possible that when we include sex and HDL together, their relationships with obesity, adjusted for each other, may look different than those initial unadjusted relationships. So, let's do that. Let's see what happens.
So, let's fit a multiple logistic regression where the log odds of obesity is modeled as a linear function of both sex and HDL. It's of the form: log odds of obesity equals an intercept plus a slope times x_1, where x_1 is sex, 1 for females and 0 for males, plus a slope times x_2, where x_2 is HDL. The estimated intercept is 1.25, the estimated slope for sex is 0.76, and its 95 percent confidence interval goes from 0.64 to 0.89. The estimated slope for HDL is negative 0.043, with its confidence interval as shown. In a vacuum, on the log odds and log odds ratio scales, maybe it's not so easy to interpret these results. So, let's exponentiate them and interpret them in context. The slope estimate for sex is Beta 1 hat equals 0.76. Just as with simple logistic regression, this is still an estimated log odds ratio of obesity for females to males, but with the additional restriction that it's adjusted for HDL level: we're comparing females to males who have the same HDL level. So, this estimated odds ratio is e to the 0.76, or 2.15. This odds ratio estimate could be called the HDL-adjusted association between obesity and sex. This result estimates that the odds of obesity for female adults is 2.15 times the odds of obesity for male adults with the same HDL level. The 95 percent confidence interval for this population-level HDL-adjusted odds ratio goes from 1.90 to 2.43. So, it's statistically significant even after adjustment. Again, and I don't think you'll be surprised, we'll show how to compute this confidence interval and others like it in a subsequent section. Let's look at the slope estimate for HDL: it's Beta 2 hat equals negative 0.043. It's still an estimated log odds ratio of obesity for two groups whose HDL levels differ by one milligram per deciliter, but who are of the same sex. Now, it's been adjusted for sex.
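Written out in symbols, the fitted model described above looks roughly like this. The notation (p-hat for the estimated probability of obesity, the conditioning bar for "adjusted for") is my own shorthand; the coefficients are the estimates quoted in the lecture.

```latex
\[
\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right)
  = 1.25 + 0.76\,x_1 - 0.043\,x_2,
\qquad
x_1 = \begin{cases} 1 & \text{female} \\ 0 & \text{male} \end{cases},
\quad
x_2 = \text{HDL (mg/dL)}
\]
\[
\widehat{OR}_{\text{sex}\,\mid\,\text{HDL}} = e^{\hat{\beta}_1},
\qquad
\widehat{OR}_{\text{HDL}\,\mid\,\text{sex}} = e^{\hat{\beta}_2}
\]
```

One small note on rounding: exponentiating the rounded slope 0.76 gives about 2.14, so the reported 2.15 presumably comes from the unrounded estimate.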
This estimated odds ratio, when we exponentiate that slope estimate, is 0.958. This odds ratio estimate is called the sex-adjusted association between obesity and HDL. So, this odds ratio compares the odds of being obese for two groups of persons who differ by one milligram per deciliter in HDL level, higher HDL to lower, among persons of the same sex. So, the odds ratio is roughly 0.958: each additional milligram per deciliter of HDL is associated with a reduction in the odds of obesity of about 4.2 percent. This adjusted association is statistically significant, as the confidence interval for the population-level adjusted odds ratio does not include one. So, what about this intercept of 1.25? What does it estimate? Well, this would be the estimated log odds of obesity when all x's in the model are zero. For our binary predictor, that's still a group in our sample, males, but for the continuous predictor, HDL, an x_2 value of zero would indicate persons with HDL levels of zero. That group does not exist in our sample, and it could not exist in real life. So, the intercept is still a necessary piece of the story, but not necessarily scientifically interpretable on its own. We could exponentiate it to get 3.49, which would be the odds of obesity for males with no HDL cholesterol, so with HDL equal to zero. Again, that's not a scientifically useful quantity because we can't have persons with HDL of zero. But it is a necessary piece of the resulting equation, either on the log odds scale or in the exponentiated form, and it will be used when estimating the odds and probabilities of obesity for any single group of adults given their sex and HDL levels. We'll look at doing that in a subsequent section as well. So, if we wanted to actually present the unadjusted results and compare them to the adjusted ones, we would do something like this.
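To make the role of the intercept concrete, here is a hedged sketch of how the fitted equation could be used to get the estimated odds for a specific group. The HDL value of 50 mg/dL is an arbitrary illustration, not a number from the lecture, and the function names are hypothetical helpers.

```python
import math

# Estimates quoted in the lecture for the sex + HDL model.
INTERCEPT = 1.25
SLOPE_SEX = 0.76     # 1 = female, 0 = male
SLOPE_HDL = -0.043   # per 1 mg/dL of HDL


def estimated_log_odds(female: int, hdl: float) -> float:
    """Log odds of obesity from the fitted linear equation."""
    return INTERCEPT + SLOPE_SEX * female + SLOPE_HDL * hdl


def estimated_odds(female: int, hdl: float) -> float:
    """Odds of obesity: the exponentiated log odds."""
    return math.exp(estimated_log_odds(female, hdl))


# Reference odds: males with HDL = 0 (not a real group, as noted above).
reference_odds = estimated_odds(female=0, hdl=0.0)

# Illustrative (assumed) HDL level of 50 mg/dL.
odds_male = estimated_odds(female=0, hdl=50.0)
odds_female = estimated_odds(female=1, hdl=50.0)

print(f"reference odds (male, HDL 0): {reference_odds:.2f}")
print(f"male, HDL 50:   odds = {odds_male:.2f}")
print(f"female, HDL 50: odds = {odds_female:.2f}")
print(f"female/male ratio at same HDL: {odds_female / odds_male:.2f}")
```

At any fixed HDL value, the female-to-male ratio of these odds equals the exponentiated sex slope, which is what "adjusted for HDL" means in this model.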
So, I'm actually going to have age quartiles in this table too, and I'll speak to that in a bit. I'll be bringing the age quartiles in, above and beyond sex and HDL, in the next table. But let's focus here on the unadjusted associations between obesity and sex and between obesity and HDL, and then the model we just looked at, where they were adjusted for each other. So, I have a column here that says unadjusted, and I present the unadjusted associations between obesity and sex, HDL, and age. Then I have a column here that says adjusted, but since the only entries in that column are for sex and HDL, and there's no entry for age, the implication, if you saw this in a journal article, would be that the final adjusted model presented included only sex and HDL. So, let's look at what happened to the relationship between obesity and sex. It was positive and statistically significant, in the sense that females had higher odds, before adjusting for HDL. We also saw that females have higher HDL than males, and that higher HDL is associated with lower odds of obesity. So, when we adjust for the differential levels of HDL between females and males, the association actually gets larger for females, because we're no longer mixing into the comparison a disproportionate number of females with higher HDL, which would bring down their odds of obesity. So, when we take that out, the odds ratio of obesity for females to males gets larger. The estimate goes from 1.46 to 2.15, and the confidence interval shifts up. You can see that these two confidence intervals do not overlap, so it appears that there was real confounding going on here between sex and HDL. Let me just diagram what I mean for the unadjusted comparison between females and males. Just for the moment, to make it easier, I'm going to represent HDL as low or high, and I'll indicate that with the letters L and H. We saw that females had higher HDL, so there'll be a higher proportion of females with high HDL.
I'll just draw a few more high than low here, compared to the males, who would have a lower proportion with high HDL because they tend to have lower HDL. So, some of this comparison was being distorted and attenuated because the females were more likely to have high HDL, which was associated with lower odds of obesity, and that's why the unadjusted estimate comparing females to males is lower in value than the adjusted one, where we removed that differential distribution of HDL levels, the disproportionately higher HDL levels among females, from the comparison. Interestingly enough, the relationship between obesity and HDL didn't change so much, but the odds ratio did go down a bit, so the association got slightly stronger, and even though the two estimates are close in value, the confidence intervals do separate and do not overlap. So, we were seeing less of an association before we adjusted. It was a smaller decrease in odds, on the order of 3.3 percent per unit increase in HDL, compared to the adjusted 4.2 percent, because a disproportionate percentage of those with higher HDL levels were female, and being female was associated with higher odds of obesity. So, that was weakening the unadjusted association for HDL slightly. So, it's just interesting to see these things side by side, given what we looked at as a heads up. You wouldn't necessarily do that separate analysis of HDL by sex as part of the analysis, but I thought it would be interesting to look at as a precursor to comparing the unadjusted and adjusted results. Let's go to the next table and bring in age here. Even though it was represented in the previous table in the unadjusted associations, we can see age was categorized into quartiles. The reference was the youngest age group. The relative odds of obesity for the second quartile compared to the first was 1.7, for the third compared to the first it was 1.84, and for the fourth compared to the first it was 1.37.
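The side-by-side comparison for sex can be sketched in a few lines. This uses only estimates and confidence intervals quoted in the lecture; the overlap check is the informal heuristic used in the discussion above, not a formal test of confounding, and the helper names are my own.

```python
def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if two closed intervals (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def percent_change(unadjusted: float, adjusted: float) -> float:
    """Percent change in the estimate after adjustment."""
    return (adjusted - unadjusted) / unadjusted * 100


# Odds ratios of obesity, females vs. males, as quoted in the lecture.
# The unadjusted 1.46 is the value from the simple logistic regression review.
unadj_or, unadj_ci = 1.46, (1.31, 1.63)   # no adjustment
adj_or, adj_ci = 2.15, (1.90, 2.43)       # adjusted for HDL

change = percent_change(unadj_or, adj_or)
overlap = intervals_overlap(unadj_ci, adj_ci)

print(f"OR changed by {change:.0f}% after adjusting for HDL")
print(f"confidence intervals overlap: {overlap}")
```

The same two helpers could be applied to the HDL row of the table once the adjusted confidence interval, which is not read out in the lecture, is filled in.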
So, not a consistent increase, although all three older age quartiles have higher odds than the youngest. So, it's not a strictly linear type of situation on the log odds scale, which would play out as a constant relative increase or decrease on the odds ratio scale. So, it probably is good that we modeled age as categorical, but nevertheless there is an association. Here's the overall p-value for testing whether the overall association between age and obesity is statistically significant. Clearly, we already see evidence that it is in the fact that all of the confidence intervals comparing the odds of obesity for each of the non-reference quartiles to the reference exclude one. But nevertheless, this is the proper p-value to report because, as we noted before, you can have situations where none of the three non-reference groups differs from the reference, but some of them differ from each other, and this p-value will catch that. We now look at model two, where we've included sex, HDL, and age. I'll let you go through this on your own, but what we see, in addition to what we've learned from the unadjusted results and model one, is that things don't change much for sex and HDL between model one, where we've adjusted them for each other, and model two, where we bring age into the equation as well. The adjusted associations are similar to what they were when only sex and HDL were in the model. So, it doesn't look like these were further confounded by age differences in the sex or HDL distributions. As for age, while the estimates vary, the confidence intervals for the adjusted odds ratios overlap with those for the unadjusted ones. So, it doesn't look like the age association changed much in the face of adjusting for both sex and HDL. Age is still, above and beyond sex and HDL, a statistically significant predictor of obesity. So, let's look at one more example: predictors of breastfeeding in Nepalese children. This is a random subset of data on children 12 to 36 months old.
We wanted to look at what is related to breastfeeding. We started by looking at the relationship between breastfeeding and sex and then went on to look at some other factors. The unadjusted association, you may recall, was not that interesting. Modeling the log odds of being breastfed as a function of sex, where sex is 1 for females and 0 for males, resulted in an intercept of 0.85 and a slope for sex of negative 0.02. If we exponentiate that slope, we get an unadjusted odds ratio of being breastfed for females to males of 0.98: slightly lower odds in the females, but not statistically significant, which is no surprise as the odds ratio itself was very close to the null value of one. The unadjusted association with age, however, in this age group from 12 months to 36 months, was, not surprisingly, negative. The log odds of being breastfed decreased relatively sizeably per additional month of age. The slope of negative 0.24 for age is interpreted as the estimated log odds ratio of being breastfed comparing two groups of children who differ by one month in age. If we exponentiate this, we get an odds ratio estimate of 0.79, 21 percent lower odds per additional month of age. It's statistically significant; this was all in the lecture on simple logistic regression, where we computed the confidence interval, which went from 0.73 to 0.84. I'm just going to remind you that we already investigated the nature of this unadjusted association in the lecture on simple logistic regression. This is the LOWESS plot looking at the relationship between the log odds of breastfeeding and age, to see whether the change was consistent, and it is consistently decreasing, and whether it could at least be roughly estimated by a line. While we certainly see some curvature here, we're not doing it a great injustice by fitting a line to it. So, that's what we had done before, and that's why we presented the results we just did, with age as continuous.
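Because the model is linear on the log odds scale, the one-month odds ratio compounds over larger age differences. The sketch below illustrates this with the quoted slope; the 6- and 12-month gaps are my own illustrative choices, and the calculation assumes the straight-line fit is adequate across that range.

```python
import math

# Log odds ratio of being breastfed per additional month of age (quoted above).
SLOPE_AGE = -0.24


def odds_ratio_for_difference(slope: float, difference: float) -> float:
    """Odds ratio comparing two groups whose predictor differs by `difference` units."""
    return math.exp(slope * difference)


for months in (1, 6, 12):
    ratio = odds_ratio_for_difference(SLOPE_AGE, months)
    print(f"{months:>2}-month age difference: OR = {ratio:.3f}")
```

The k-month odds ratio is the one-month odds ratio raised to the power k, which is why a seemingly modest 21 percent reduction per month adds up quickly over a year.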
So, we wanted to look at sex and age, and we can bring in other characteristics of the children and the mothers, so we might bring in the parity category of the mother. I put this into four categories. If the child we're studying is the mother's first child, so she had no previous children, that's the reference. One previous child, two previous children, and more than two are the other three categories. We also brought in the mother's age, and of course that may be related to parity category especially, but nevertheless, we brought all these things in, and we can look at the unadjusted results here. So, we can see that greater parity was associated with lower odds of being breastfed, although it wasn't quite a dose-response relationship; it looks like having more children was associated with lower odds across the board. Even so, this construct was not a statistically significant predictor of whether a child is breastfed, nor was mother's age. There was a slight reduction in the odds that a child is breastfed with increasing years of mother's age, but it's not statistically significant in the unadjusted sense. Let's look at what happens when we look across various multiple regression models. The first one here looks at the relationship between breastfeeding and sex and age taken together. We can see that the differential in odds between females and males gets larger, with an estimated odds ratio of 0.76, an estimated 24 percent lower odds of being breastfed for females compared to males of the same age, because now we're adjusting for age. But it's nowhere near statistically significant: the confidence interval is all over the place and includes one. So, it really doesn't look like there was much change in our understanding of the relationship between breastfeeding and sex after adjusting for age. You'll notice the results for age are absolutely identical to what they were in the unadjusted sense, so there was certainly no confounding there by sex.
If we look at subsequent models, the second model brings in maternal parity on top of sex and age, and if you look carefully at sex and age, they are again very similar to what they were in the model where they were only adjusted for each other and in the unadjusted associations. While the estimated odds ratios vary a bit for the different parity categories, the message is still the same among children of the same sex and age: the odds in the sample of being breastfed went down with increased parity of the mother. Even though these estimates look dramatic, there's a lot of uncertainty in each of them, and the overall construct is not statistically significant. So, the resulting sex- and age-adjusted association between breastfeeding and parity is not statistically significant. I just went ahead and put in maternal age as well in this last model, just to see if that changed anything or changed itself. If you look through this, the results for the other three predictors are pretty much comparable to what they were in the previous adjusted model, and parity was still not statistically significant, nor was mother's age. I also reported the baseline odds here. This would be the exponentiated intercept, and it describes different groups in different models. In the first one, where we only had sex and age in the model, this would be the estimated odds for male children who were newborns. Well, that sounds like it might be relevant, but again, our sample only included 12- to 36-month-olds, so it doesn't quite describe a group in our sample. But the reason these are so high, and we'll talk about this when we get to the section on estimating probabilities of the outcome for different x combinations, is that the starting odds of being breastfed for males at low ages is very high; if we transformed this into an estimated probability, it would be very close to one. Because remember, probability is odds over one plus odds.
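The odds-to-probability conversion mentioned at the end is simple enough to sketch. The odds values below are made-up illustrations (the baseline odds from the table are not read out in the lecture); they only show how large odds map to probabilities near one.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds to a probability: p = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


# Hypothetical odds values, chosen only for illustration.
for odds in (0.5, 1.0, 10.0, 500.0):
    print(f"odds = {odds:>6.1f} -> probability = {odds_to_probability(odds):.3f}")
```

Odds of one correspond to a probability of one half, and as the odds grow the probability approaches, but never reaches, one.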
So, in summary, multiple logistic regression is a tool that relates the log odds of a binary outcome y to multiple predictors x_1 to x_p, generically speaking, via a linear equation of the form that says the log odds that y equals one is a linear combination of our x's plus an intercept. Each slope beta hat i, for i equals one to p, is the estimated log odds ratio of y equals one for two groups who differ by one unit in the predictor x_i, adjusted for all other x's in the model. These slopes, the beta hat i's, can be exponentiated to get adjusted odds ratios. The intercept, beta naught hat, is the estimated log odds for the group with all x's equal to zero. This may not be a relevant quantity, depending on the predictor set x_1 through x_p, but it is still necessary to specify the regression equation fully, and it can also be exponentiated to get what I'd call the starting or reference odds, which we build on to get the odds for other groups given their x's. In subsequent sections, we'll show how to estimate the confidence intervals for the various odds ratios we've presented here and in other situations as well, how to estimate predicted probabilities of outcomes from multiple logistic regression, and we'll talk a bit about what it would mean to have good prediction from a multiple logistic regression model.
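For reference, the generic form summarized above can be written as follows. The symbols (P-hat for the estimated probability, OR-hat for an estimated adjusted odds ratio) are my own notation for what the lecture describes in words.

```latex
\[
\log\!\left(\frac{\hat{P}(y=1)}{1-\hat{P}(y=1)}\right)
  = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\]
\[
\widehat{OR}_i = e^{\hat{\beta}_i} \quad (i = 1, \dots, p),
\qquad
\text{reference odds} = e^{\hat{\beta}_0}
\]
```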