In this section, we'll move on to some basic statistical concepts that will be imperative for your journey in machine learning, as well as data-driven decision-making. So what are our learning goals for this section? We're going to cover what to keep in mind when discussing estimation versus inference in statistics, discuss the differences between parametric and non-parametric approaches to modeling, go over common statistical distributions that we see in the real world, and finally introduce the difference between frequentist and Bayesian statistics.

Starting off with estimation versus inference: when we talk about estimation, what we want to keep in mind is that an estimate just gives us a single value for a certain parameter, such as the mean, from our sample data. To calculate the mean, we take the sum of all the values in a certain column and divide by the number of values to get the average.

Estimation is only one part of statistical inference. When performing statistical inference, we're trying to understand the underlying distribution of the population, including our estimate of the mean as well as other quantities, such as the standard error, that describe the underlying properties of the population we're sampling from. To get the standard error, we would use something like the equation we see here: we take the sample standard deviation, which measures how far, on average, each value falls from our estimate of the mean, and divide it by the square root of the sample size.

Now, machine learning and what we've defined as statistical inference are very similar. We'll see throughout this course the degree to which much of what we learn and use is intertwined with, and built from, the foundations of statistics, applied long before we had the computing power that we have today.
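To make those two calculations concrete, here is a minimal sketch in Python. The sample values are made up for illustration: the mean is the sum of the values divided by the count, and the standard error of the mean is the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np

# Hypothetical sample: e.g. monthly purchase amounts for 8 customers
sample = np.array([42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 45.0, 58.0])

# Point estimate of the population mean: sum of values / number of values
mean_estimate = sample.sum() / len(sample)

# Sample standard deviation (ddof=1 gives the unbiased estimator)
std_dev = sample.std(ddof=1)

# Standard error of the mean: s / sqrt(n)
standard_error = std_dev / np.sqrt(len(sample))

print(mean_estimate)              # 49.75
print(round(standard_error, 3))
```

The standard error shrinks as the sample grows, which is what lets larger samples pin down the population mean more precisely.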
In both machine learning and statistical inference, we're using some sample data to infer qualities of the actual underlying population distribution in the real world, and of the model that would have generated that data. When we say data-generating process here, we can think of a linear model as one example: a data-generating process representing the actual joint distribution between our x variables and the y variable. When doing machine learning, we may care either about the entire distribution, or about just some features of it, such as a point estimate of the mean. Machine learning that focuses on understanding the underlying parameters, and the individual effect of each one, requires tools pulled from statistical inference. On the other hand, some machine learning models put little focus on those underlying parameters of the distribution and instead focus only on prediction results.

Now, I want to introduce a business example that we'll use throughout this course to help bring context to what we've learned so far and what we'll learn later: customer churn. What do we mean by that? Data related to churn will include a target variable for whether or not a customer has left the company. Obviously, we don't want customers leaving the company, so we want a lower churn rate. The data will also include features to help us predict whether a future customer will leave, such as the length of time we've had them as a customer, the type and amount of purchases that customer has made, and other customer characteristics such as age, location, and so on.
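As a rough sketch of what such churn data might look like (every record and field name here is hypothetical), each row pairs the target variable with the features we would use to predict it:

```python
# Hypothetical churn records: 'churned' is the target variable
# (1 = the customer left the company); everything else is a feature.
customers = [
    {"tenure_years": 1, "n_purchases": 3,  "age": 24, "city": "Austin",  "churned": 1},
    {"tenure_years": 5, "n_purchases": 42, "age": 41, "city": "Boston",  "churned": 0},
    {"tenure_years": 2, "n_purchases": 9,  "age": 35, "city": "Denver",  "churned": 1},
    {"tenure_years": 7, "n_purchases": 60, "age": 52, "city": "Seattle", "churned": 0},
]

# Overall churn rate: the fraction of customers who left
churn_rate = sum(c["churned"] for c in customers) / len(customers)
print(churn_rate)  # 0.5
```

A real dataset would of course have many more rows and features, but the shape is the same: one labeled outcome per customer plus the characteristics we observe before that outcome.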
Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave. So 0.99 means they're very likely to leave, while 0.01 means we're probably going to hold on to that customer. In terms of estimation versus inference here, estimation means estimating the impact of each feature. Think: for every additional year someone has been a customer (that being the feature), they are 20 percent less likely to churn, so we're giving a point estimate for the value of each additional year. Inference expands this to an interval as well, giving us the statistical significance of the estimate. For example, using what we just said, rather than just "20 percent less likely to churn," we can put a 95 percent confidence interval on that estimate, saying the effect is somewhere between 19 and 21 percent. Then we'd be fairly confident that for each additional year we've had a customer, the effect falls between 19 and 21 percent, meaning 20 percent is a good estimate of how much less likely they are to churn. On the other hand, the 95 percent confidence interval could be between negative 10 and 50 percent, meaning we're very uncertain about 20 percent as a point estimate: for all we know, the effect could actually be negative, or it could be much stronger. We just don't have the statistical significance that we would if the interval were somewhere between 19 and 21 percent.
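Under a normal approximation, a 95 percent confidence interval is just the point estimate plus or minus 1.96 standard errors. Here is a small sketch, with made-up numbers, mirroring the two scenarios above: a tight interval that clearly excludes zero, and a wide one that does not.

```python
def conf_interval_95(estimate, standard_error):
    # Normal-approximation 95% CI: estimate +/- 1.96 standard errors
    margin = 1.96 * standard_error
    return (estimate - margin, estimate + margin)

# Precise case: a -20% effect on churn per additional year, small standard error
lo, hi = conf_interval_95(-0.20, 0.005)
print(round(lo, 3), round(hi, 3))  # -0.21 -0.19: the whole interval is negative

# Noisy case: the same point estimate, but a much larger standard error
lo, hi = conf_interval_95(-0.20, 0.15)
print(round(lo, 3), round(hi, 3))  # -0.494 0.094: the interval crosses zero,
# so we can't even be confident about the direction of the effect
```

When the interval crosses zero, the estimate is not statistically significant at the 95 percent level, which is exactly the uncertain "negative 10 to 50 percent" situation described above.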