In this video, we'll review the main assumptions necessary for using linear regression models, and then we'll give a broad overview of how we can generalize the linear model to work with data that violate the standard linear regression assumptions. Recall from our study of linear regression that certain assumptions must hold in order to use linear regression models. I have the four assumptions listed here.

The first one is that we assume the parameters enter the model in a linear way. Another way of saying that is, if we ignore random error, the relationship between the response and the parameters is linear. We might write that in terms of expected value. We could say the expected value of the response, where I'll write the response as a vector, so this Y is a vector of all of the response measurements. It's a column vector, so I'll write it as a transposed row vector here to save some space. We'll say the expected value of the response is equal to X times Beta, where X is the design matrix of the linear regression, so that's a first column of ones, and then every subsequent column holds the measurements of the jth predictor variable. Then Beta is just the vector of parameters for the model. That includes an intercept and then a slope term associated with every individual predictor.

The second assumption states that we assume the errors are independent. Independence of errors implies that the errors are uncorrelated, and there are a few different ways to write this. For the uncorrelated piece, we could say that the covariance between Epsilon i and Epsilon j is equal to 0, so there's no covariance, and that should be true for all i not equal to j. Of course, we could replace these Epsilon i's and Epsilon j's with Yi's and Yj's, and that would be an equivalent statement. We could also write this in matrix-vector form: the covariance matrix of the vector of error terms is a diagonal matrix, so all of the off-diagonal entries are zero, meaning that there is no covariance between these terms.

Now, the third assumption for linear regression is that the variance of the error term is constant across all measurements. Mathematically, we could write that as the variance of Yi, or the variance of the error, is equal to sigma squared, and there's no subscript i on that sigma squared, which means that it's the same for every value of i.

The last assumption is that the errors are normally distributed, and of course there are different ways to write this. We could say that each Epsilon i is iid normal with mean 0 and variance sigma squared, where the common variance is pulling from the third assumption. We could also write this in terms of the distribution of the response Yi: it would have a normal distribution, centered at the ith linear equation, Beta naught plus Beta 1 times x i1, et cetera. One thing worth noting is that if we meet the normality assumption, then the second assumption about independence is equivalent to being uncorrelated, because if you have a set of normally distributed random variables that are uncorrelated, that implies independence. When you have normality and uncorrelatedness, you get independence for free.

Let's see what it would look like to break these assumptions in a specific way. Suppose that researchers are interested in predicting whether a political candidate will win an upcoming election, and they'll do this by studying past elections.
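Before we dig into that example, here's a minimal simulation sketch of data that does satisfy all four assumptions, with the design matrix written exactly as described above: a leading column of ones, then one column per predictor. The sample size, predictors, coefficient values, and noise level are all made-up choices for illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Design matrix X: a first column of ones for the intercept, then one column per predictor.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])

# Parameter vector beta: an intercept plus a slope for each predictor (made-up values).
beta = np.array([1.0, 2.0, -0.5])

# Assumptions 2-4: errors are independent draws from a normal distribution
# with the same variance sigma^2 for every observation.
sigma = 1.5
eps = rng.normal(0, sigma, size=n)

# Assumption 1: E[Y] = X beta, i.e., the response is linear in the parameters.
y = X @ beta + eps

# Ordinary least squares fit; the estimates should land close to beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```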
They might have a data frame where each row is a political race and the columns would be different variables. You might have one column for the response variable; in this case it would be binary, either the candidate won or they lost. As for important predictors, other columns might be the amount of money spent on the campaign, the amount of time spent campaigning, and whether or not a candidate is an incumbent, meaning whether they are in office seeking re-election or somebody outside of office looking to be elected.

In this example, what assumptions might be violated? Well, hopefully you've noticed that the response is categorical, and thus it's not continuous and certainly not normal, so we violate the normality assumption. In fact, the response in this case is Bernoulli, which is a special case of the binomial distribution. In addition, we'll see that we also violate the constant variance, or homoscedasticity, assumption. But let's quickly review the binomial distribution so we can see how that's the case.

Let's start with a random variable Yi, and let's assume that it's binomially distributed with parameters n and Pi. Notice here that we have a set of random variables Yi, where i ranges from one through n, and each one of these random variables has a different probability of success, Pi. We can interpret Yi as the number of successes among n trials, with a probability of success Pi. One toy example would be that Yi records the number of heads in, say, 10 flips of a fair coin, where "fair coin" tells us that Pi in this case would be equal to 0.5.

Now, here are some important properties of the binomial distribution. First, let's write down its mass function. The mass function tells us the probability that our random variable will take on a particular value. We'll call the particular value lowercase y; it ranges from 0 up to n, the possible values of the random variable. The PMF will be equal to n choose y, times Pi raised to the y, times 1 minus Pi raised to the n minus y. Again, this tells us how to calculate probabilities associated with binomial random variables. Just note that if n is equal to 1, then Yi is Bernoulli and the formula simplifies pretty nicely; for example, the combination term, n choose y, this first term here, would be equal to 1 in that case. Now, the mean of a binomial, which is the expected value of our binomial random variable, and which we could also denote Mu i, is simply equal to n times the probability of success Pi. You could compute this straight from the definition of expected value, which you've learned in your intro to probability course. The variance of our random variable when it's binomially distributed, which we could denote sigma i squared, will be equal to n times Pi times 1 minus Pi. Again, you could compute this from the definition of variance for a discrete random variable that you've learned in a previous course. These quantities will be important when we consider how our data from the election example violate the linear regression assumptions.

Back to our election example. For each row in the DataFrame, the response, whether candidate i won or lost, will have a different probability of success, or a different probability of winning. We think of that probability as related to the predictor variables, for example, whether the candidate was an incumbent or not. That's our Pi. Seeing this, we should note that Yi, the random variable that models wins or losses in the data set, is clearly not continuous or normal.
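To make those binomial properties concrete, here's a small sketch that checks the mass function, mean, and variance formulas numerically, and then shows how the Bernoulli variance changes as the success probability changes from observation to observation. The specific probabilities are made-up values for illustration only.

```python
from scipy.stats import binom

# Toy example from above: Y ~ Binomial(n = 10, Pi = 0.5), the number of heads in 10 fair flips.
n, p = 10, 0.5

# PMF: P(Y = y) = (n choose y) * Pi^y * (1 - Pi)^(n - y)
print(binom.pmf(4, n, p))                 # probability of exactly 4 heads

# Mean and variance match the formulas n*Pi and n*Pi*(1 - Pi).
print(binom.mean(n, p), n * p)            # both 5.0
print(binom.var(n, p), n * p * (1 - p))   # both 2.5

# Bernoulli case (n = 1): each election outcome has its own success probability Pi,
# so the variance Pi*(1 - Pi) can differ from row to row (made-up probabilities).
for p_i in [0.1, 0.3, 0.5, 0.8]:
    print(p_i, binom.var(1, p_i))
```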
So we violate continuity and normality of the response. We also noticed that the variance of Yi is n times Pi times 1 minus Pi. This variance will change and be potentially different for each i, so for each one of our rows in the DataFrame, we could potentially have a different variance. That tells us straightaway that we violate the homoscedasticity assumption; we don't have a constant variance throughout the data set.

What would it look like if we modeled this data with linear regression? Well, consider these two plots. On the left, we see a plot of the residuals versus the fitted values from data that do meet the linear regression assumptions; I simulated the data that produces this residual plot on the left just to be sure that we do meet these assumptions. This plot shows just what you would expect: random scatter of the residuals around zero. But now look at the plot on the right. Here I fit a model to data where the response is Bernoulli, and the residual plot does not look anything like what we would want. This provides further evidence that our regression assumptions are grossly violated.

What's the solution? Well, the solution to this problem, when the response is Bernoulli, or more broadly, when the response is binomial, is to use binomial regression. A special case of binomial regression is logistic regression, which we will study later in this module. Both of these models are specific cases of a generalized linear model. In the next video, we'll briefly study the important properties that constitute a generalized linear model.

So far, we've seen that generalized linear models extend the linear regression framework to allow for non-normal responses, such as counts, for example, the binomial model. Later in this course, we'll study another set of tools that will allow us to relax other assumptions. Nonparametric and semiparametric models are flexible tools that allow us to model nonlinear data. We'll focus on these tools in this course, namely the tools under number one, nonlinear structure: nonparametric models and generalized additive models. But it's also worth mentioning that if correlation structures exist in the data at hand, time series models, spatial models, and mixed effects models might be helpful. Those models, under number two, won't be explicitly covered in this class, but it's helpful to know that they exist, so that if you need them, you can take a course to learn how they work or study them on your own and implement them.
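As a rough sketch of what that fix looks like in practice, the code below simulates Bernoulli win/loss outcomes from a single made-up "spending" predictor, then fits both an ordinary linear regression and a binomial (logistic) regression as a GLM. The predictor, coefficients, and sample size are invented for illustration; this only shows the modeling choice, not the lecture's actual data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

# One made-up predictor (e.g., campaign spending) and a design matrix with an intercept column.
spending = rng.uniform(0, 10, n)
X = sm.add_constant(spending)

# Simulate Bernoulli win/loss outcomes whose success probability depends on the predictor
# through the inverse logit function (coefficients are invented for illustration).
p = 1 / (1 + np.exp(-(-3.0 + 0.8 * spending)))
y = rng.binomial(1, p)

# Ordinary linear regression on a 0/1 response: violates the assumptions discussed above.
ols_fit = sm.OLS(y, X).fit()

# Binomial regression with a logit link (logistic regression), fit as a GLM.
glm_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(ols_fit.params)   # coefficients on the probability scale, hard to interpret here
print(glm_fit.params)   # log-odds scale; should land near (-3.0, 0.8)
```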