In this lesson, we'll motivate generalized additive models by considering how we might perform nonparametric regression with several predictors. Suppose we have more than one predictor. A general form of the model is something we've seen plenty of times before: we write our response y_i as some function of our predictors plus an error term, where the error terms have the typical assumptions of being zero-mean, independent of one another, and having the same variance. Here f is a function that takes in several predictors, not just one. If the model is additive, we can separate f into a sum of simpler functions of our predictors: we can take our f here and say that it's equal to f_1 of x_{i,1} plus all the way up to f_p of x_{i,p}. Basically, the function has a specific form: we can separate out our different predictors, and we just have a function of each individual predictor. Let's consider some examples of functions that are and are not additive. Consider function one: x_1 plus x_1 squared, plus x_2 plus x_2 squared, plus x_1 times x_2 times the sine squared of x_3. This function is not additive, and it's not additive because of that last term: we can't separate it out. That term is a function of both x_1 and x_2, say an f_{1,2} of x_1 and x_2, as opposed to something like the first term, which is, say, an f_1 of x_1. Because the function contains a term like that, the model is not additive; we can't separate it into functions of the individual predictors. Now let's contrast that with function number two. In this case, we do have an additive model. We could do this in different ways, but maybe we call one piece our f_1 of x_1 and the other piece an f_2 of x_2. The constant term pi you could fold into f_2 if you want, or leave it out as a separate constant term. There's not a unique way to do the decomposition; it's just that there needs to be some way to do it.
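A quick numerical way to see the difference: for a function that is additive in x_1 and x_2, the mixed partial derivative with respect to x_1 and x_2 is identically zero; an interaction term like x_1 times x_2 times sine squared of x_3 makes it nonzero. Here's a minimal sketch (in Python, though the course itself uses R; the helper name and evaluation points are my own choices, not from the lecture):

```python
# Numerical check of additivity: if f(x1, x2) = f1(x1) + f2(x2), then the
# mixed partial d^2 f / (dx1 dx2) is identically zero.
import math

def mixed_partial(f, x1, x2, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx1 dx2)."""
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h * h)

# Function one from the lecture, with x3 frozen at 1.0:
# the x1 * x2 * sin^2(x3) term couples x1 and x2.
f_non_additive = lambda x1, x2: x1 + x1**2 + x2 + x2**2 + x1 * x2 * math.sin(1.0)**2

# Dropping that interaction term leaves an additive function.
f_additive = lambda x1, x2: x1 + x1**2 + x2 + x2**2

print(mixed_partial(f_non_additive, 0.7, -0.3))  # approx sin^2(1), clearly nonzero
print(mixed_partial(f_additive, 0.7, -0.3))      # approx 0
```

The nonzero mixed partial is exactly the obstruction: no relabeling of pieces can split that interaction term into separate functions of x_1 and x_2.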
The third function is also additive. We can define f_1 to be 1 plus the log of 0.5 times x_1 squared, f_2 to be negative x_2, and f_3 to be just x_3. Now that we've defined what it means to be additive, let's step back and think about what it might be like to fit a model with many predictors using nonparametric regression. It turns out that some of the methods we've already studied, say the kernel estimator, generalize relatively nicely. I've written our model in shorthand here: x_i is the i-th covariate vector, which includes x_{i,1}, x_{i,2}, and so on through x_{i,p}. The kernel estimator would be this f-hat function. It looks pretty complicated, but if you go back and study what this looks like for a single predictor, you'll notice some similarities: there's something like a weighted average going on, and the weights are kernels. This K_H of x minus x_i, the whole term, is a kernel, but now it's a multivariate kernel. The H subscript denotes that K_H is a multivariate kernel, which we'll define more clearly in just a moment, and that it depends on H. Here H will actually be a matrix, a positive definite matrix. Because H is positive definite, its inverse exists, and it also has something like a square root. First, H inverse exists. Second, the square root of H exists, meaning there is a matrix H to the one-half such that H to the one-half times H to the one-half gives us back H. We can also use the notation H to the minus one-half, which just means the inverse of the square root of H. So that's notation one and notation two; finally, bars around H mean the determinant of H.
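These three pieces of notation are easy to verify numerically. A small sketch (the specific H below is an arbitrary positive definite example, not from the lecture):

```python
# For a positive definite bandwidth matrix H: build H^{1/2} (a matrix square
# root) and H^{-1/2} via the eigendecomposition, and compute |H| (determinant).
import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite

vals, vecs = np.linalg.eigh(H)      # H = V diag(vals) V^T, vals > 0
H_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T       # H^{1/2}
H_neg_half = vecs @ np.diag(vals ** -0.5) @ vecs.T    # H^{-1/2}

print(np.allclose(H_half @ H_half, H))                # H^{1/2} H^{1/2} = H
print(np.allclose(H_neg_half, np.linalg.inv(H_half))) # inverse of the square root
print(np.linalg.det(H))                               # |H| = 2*1 - 0.5*0.5 = 1.75
```

Positive definiteness is what guarantees the eigenvalues are strictly positive, so both the square root and its inverse are well defined.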
Really what's happening here: if you think back to kernel estimation from the previous module, we had something called a bandwidth that controlled how smooth the fit would be. This H is doing something very similar; it's just that now we have many dimensions on which we can control the smoothing, so H has to be multi-dimensional. All of that goes into defining this K_H, which looks like this: the determinant of H raised to the minus one-half, times a kernel evaluated at H to the minus one-half times a vector u. This is called a scaled kernel, and it's really just shorthand notation. What matters is that K of z, if we call that argument z, is a kernel, which means it's positive and that the integral of K over z is 1. That's a multi-dimensional integral because z is a vector, but it's similar to the definition of a kernel from our study of nonparametric regression. The goal of this slide is not for us to understand all the inner workings of this estimator; it's really just to present the fact that our kernel estimator can be generalized, but also that there are some issues with it. One issue is that the fits can become quite complex in higher dimensions, and visualization becomes difficult or impossible. That means we can't use some of the techniques from previous lessons, where we played around with the bandwidth a little to find a nicer fit; that becomes really difficult in higher dimensions. One idea is to simplify the function f. It will still be a multivariate function that takes many inputs, but if we can simplify it by allowing it to be just an additive function, then fitting a model becomes a bit easier. That's what we have here.
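Before moving on to the additive simplification, the multivariate kernel estimator can be made concrete in a few lines. A minimal sketch, with a standard multivariate normal kernel for K and made-up data and bandwidth matrix (none of these specific choices come from the lecture):

```python
# Multivariate Nadaraya-Watson kernel regression with a matrix bandwidth H,
# using the scaled kernel K_H(u) = |H|^{-1/2} K(H^{-1/2} u).
import numpy as np

def K(z):
    """Standard multivariate normal kernel: positive, integrates to 1."""
    p = z.shape[0]
    return np.exp(-0.5 * z @ z) / (2 * np.pi) ** (p / 2)

def K_H(u, H):
    """Scaled kernel: determinant factor times K evaluated at H^{-1/2} u."""
    vals, vecs = np.linalg.eigh(H)
    H_neg_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return np.linalg.det(H) ** -0.5 * K(H_neg_half @ u)

def f_hat(x, X, y, H):
    """Kernel estimate at x: a kernel-weighted average of the responses y_i."""
    w = np.array([K_H(x - xi, H) for xi in X])
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                 # two predictors
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)
H = 0.05 * np.eye(2)                                  # bandwidth matrix, picked by hand
print(f_hat(np.array([0.5, 0.0]), X, y, H))           # near the truth 0.5**2 + sin(0) = 0.25
```

Even in two dimensions, notice that H was simply picked by hand; with many predictors there are many entries of H to tune and no easy picture to guide the tuning, which is exactly the issue described above.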
We can think about the expected value of the response being not just some arbitrary function of all the inputs, but a function that separates out: an additive function. We now have an f_1 of x_1 up through an f_p of x_p, and we sum these functions together. Each of these f functions is arbitrary but smooth, so it has some nice number of derivatives, and each is a function of an individual predictor. Again, this formulation is called an additive model because the predictors enter the model additively. Notice that the formulation is pretty flexible and allows for a large class of models, but it does seriously restrict the original f we started with. We're making some restrictions, but we still have lots and lots of options. The really nice thing here is that the class of additive functions is much more flexible than the standard linear model we would study in a standard regression course. That's true even if we allow for polynomial terms or transformations, and the reason is that here we're not imposing a particular transformation or polynomial term. For example, we're not requiring that f_1 is the square root of x_{i,1}; we're saying it can be any smooth function. With this restriction, we'll end up using nonparametric techniques from previous lessons to learn each individual f_j. Let's note a few things. First, having an individual f_j for each predictor x_j might be overkill. If we know the relationship between the response y and x_j is linear, then we can refrain from wrapping x_j in a function to be estimated from the data and just allow it to enter linearly. That formulation might look like the one here: specifically, I've taken the first term out of an arbitrary function f_1 and instead just multiplied x_1 by a coefficient.
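The lecture defers the math of fitting to a later lesson, but one standard approach is worth sketching now: backfitting, which cycles through the predictors and smooths the partial residuals against each x_j in turn. A rough sketch (the simple kernel smoother, bandwidth, and simulated data are my own illustrative choices):

```python
# Backfitting sketch for an additive model y = alpha + f_1(x_1) + f_2(x_2) + noise.
import numpy as np

def smooth(x, r, h=0.2):
    """Nadaraya-Watson smooth of residuals r against predictor x (Gaussian kernel)."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (W @ r) / W.sum(axis=1)

def backfit(X, y, n_iter=20):
    n, p = X.shape
    alpha = y.mean()                     # intercept
    f = np.zeros((n, p))                 # each column: f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]   # partial residuals for x_j
            f[:, j] = smooth(X[:, j], r)              # smooth them against x_j
            f[:, j] -= f[:, j].mean()    # center each f_j for identifiability
    return alpha, f

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 300)
alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
print(np.mean((fitted - y) ** 2))        # small: the fit tracks the additive truth
```

Notice the centering step: each f_j is only identified up to a constant (constants can be shuffled between the f_j and the intercept, as with the pi term earlier), so it's conventional to center each component at zero.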
Here we're saying that the relationship between x_1 and our response is linear, and there's no need to try to learn that linear function; we're supposing we already know it. One advantage here is that we can account for categorical predictors in this form: if x_1 is a categorical predictor, we can have it enter the model this way, and then we can think about its effect adjusting for the other predictors that are potentially wrapped in functions f_j. These adjustments are now nonlinear, but the additive nature of the model allows for this interpretation, so in some cases we can still hold onto the nice interpretation of linear models. Another thing to note is that additive models won't work well when strong interactions between the predictors exist. In such cases, we might consider adding interaction terms like the one I've added here. The last term is an interaction term because it's a function that takes in both x_1 and x_2. We can always add terms like this to our additive models; it's just that things get complicated once we do, even in lower dimensions. The third thing worth mentioning about the additive model framework is that it can be easily extended to non-normal responses in just the way we've seen for generalized linear models. For example, if our response is Poisson, we might consider the model given here: we take the log of the rate parameter (the same thing as the mean) and set it equal to an additive predictor. The predictor is linear in some variables, for example x_1, but potentially nonlinear in others; still, we have the additive structure, meaning each predictor enters through its own function: x_2 enters f_2, x_p enters f_p, and so on. Let's consider a quick example with this ozone data. It comes from a 1976 study of the relationship between atmospheric ozone concentration and meteorology in the Los Angeles basin.
Just to note: a number of cases with missing values have been removed for simplicity. Here researchers were interested in modeling ozone concentration, the O3 variable, measured in parts per million. We want to model O3 as a function of temperature, inversion base height, and inversion top temperature (IBT). We've left out the other variables just for simplicity; of course, the full study might take them into account, but let's start simple. Assuming the model is correct, we can think about plots that help us interpret it. These plots give us a sense of the marginal relationship between an individual predictor, for example temperature, and the response, ozone concentration. For example, we see that, adjusting for the other predictors, ozone concentration is roughly linear in temperature up through 60 degrees. Up through 60 degrees, things look somewhat linear, and then the slope changes: it seems like there's roughly another line. There's some curvature, but maybe a second line better explains the relationship between temperature and ozone concentration, adjusting for the other variables, above 60 degrees. So there is some slight curvature, but you might think two lines do it, with a point at which the lines change. Now, for the IBT variable, we see that the estimated fit also includes some curvature, but notice the confidence bands, the dotted lines: they are wide enough that a straight line could pass through them. This is a visual way of assessing whether we actually needed to wrap IBT in a function f. We don't seem to have great evidence that a nonlinear fit was necessary here, and instead we might just include this term in the model in a linear fashion.
So far in this video, we've tried to gain some intuition about what an additive model is and why it's more powerful than some other techniques, like standard linear regression or multi-dimensional nonparametric regression. We also tried to get a sense of how we might interpret the model based on some plots. We'll get a lot more practice with that in future lessons, where we'll also see a basic overview of the math of fitting these additive models and, of course, get practice in R.