Hi, my name is Youngho Park. This week, we're going to talk about how to develop a realistic forecasting model in the context of North American team sports leagues. As you may recall, in the previous week we developed a forecasting model using the salary information of the two teams in a match, and the model worked quite well in terms of its accuracy rate and Brier Score. Even though there are some structural differences between the two leagues, I think the salary information of the two teams will still give us good insight into how to predict game results. We are going to go through the same analytical procedure as we did before with Professor [inaudible], and we're going to see how much better or worse the model is when it comes to the North American team sports leagues. In this notebook, we'll build our forecasting model using the MLB data. As far as the model specification is concerned, the same goes for MLB: we're going to use the salary ratio between the two teams in a match as an independent variable to forecast game results. Then we'll move on to the betting odds model to obtain fitted game results as well. Finally, we're going to compare the performance of the two models in terms of the prediction rate and the Brier Score. Again, the salary information for MLB players was obtained from the website below. If you would like to take a look at the salary data and how it is structured, you can go ahead and click on the link here. Before we proceed to the data analysis, as always, we have to import all the libraries first. Then let's expand the display mode so that we can see the results more clearly, and then we're going to import two datasets: the MLB dataset and the salary dataset. Now you can see the raw data format of the two datasets.
I'm pretty sure that you are very familiar with the way the MLB data is structured, as we've been working on the MLB dataset thoroughly in the previous courses. I'm not going to do the exploratory analysis to make sense of the MLB data, but if you would like to take a look at the dataset, you can go ahead and check out the raw dataset on your own by using some of the handy code that we have used for exploratory data analysis. Now we have a new dataset, the salary dataset obtained from the website above. As you look at the salary data here, you can see the name of each team in the league, and also the total payroll spent by each team. We can also produce very simple descriptive statistics for the dataset here. We have 30 observations in the salary dataset, indicating that we have salary information for all the teams in MLB, and we can also see the mean, minimum, and maximum payroll spent by a team. In order to fit the regression model, we need to organize and clean the data first. First of all, we are going to select the columns to be used for the regression analysis, then take a look at the resulting data frame. As a result, we have five columns, and then we are going to obtain the results of the games. To do that, we have to obtain the run differentials first. We obtain the run differential by subtracting the visitor score from the home score. It should be noted that the differentials are calculated from the home team's perspective. We can then create the binary win variable to see whether or not the home team wins the match. We will assign one when the run differential is greater than zero, indicating the home team winning, and zero otherwise.
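As a rough sketch of the step just described, the run differential and the binary win variable could be built as follows. The toy games and the column names (`home_score`, `visitor_score`, `run_diff`, `win`) are assumptions for illustration, not the notebook's actual names.

```python
import pandas as pd

# Hypothetical game log; teams, scores, and column names are made up.
games = pd.DataFrame({
    "home": ["NYY", "BOS", "LAD"],
    "visitor": ["BOS", "LAD", "NYY"],
    "home_score": [5, 2, 3],
    "visitor_score": [3, 4, 1],
})

# Run differential from the home team's perspective: home minus visitor.
games["run_diff"] = games["home_score"] - games["visitor_score"]

# Binary outcome: 1 if the home team won (positive differential), else 0.
games["win"] = (games["run_diff"] > 0).astype(int)

print(games)
```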
Run this line of the code, and we will obtain the run differentials between the two teams in a match from the home team's perspective, and then we're going to create the binary dependent variable to see whether or not the home team won that game. Here you can take a look at the resulting data frame and you can see the game results. Now, let's move on to the salary data. First of all, we are going to drop the unnecessary column in the salary data. Then we are going to use the team column in the salary data as a matching column when appending the salary information for the home and visitor teams in a match. This means that you need to rename the team column one at a time, for the home and away teams separately. First of all, let's work on the home team. We're going to use the home column as a matching column and merge the datasets. As a result, we have the salary information of the home team here. Now we are going to obtain the salary information for the visitors as well. First, we need to change the name of the column in the salary data so that we can use the visitor column as the matching column when we merge the dataset into the existing data frame. As a result, we have two variables, home team salary and visitor salary. Then we are going to rename the columns properly. Let's take a look at the resulting DataFrame here. Now we are ready to get the salary ratio between the two teams in a match. First of all, we'll take the log of the salary for each team, and then we will generate the log-scale salary ratio between the two teams in a match. If you run this line of the code, then at the very right end of the resulting DataFrame you can see the log-scale salary ratio between the home and away teams in a match. Now we are ready to fit the regression model. The first regression model that we are going to fit is the forecasting model with linear regression.
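The two-step merge and the log salary ratio described above might look something like the sketch below. The payroll figures and the names `payroll`, `home_salary`, `visitor_salary`, and `log_sal_ratio` are hypothetical; the notebook's own names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a game list and one payroll figure per team.
games = pd.DataFrame({"home": ["NYY", "BOS"], "visitor": ["BOS", "NYY"]})
salary = pd.DataFrame({"team": ["NYY", "BOS"], "payroll": [200e6, 100e6]})

# Rename the team column once per side so it can serve as the merge key.
home_sal = salary.rename(columns={"team": "home", "payroll": "home_salary"})
vis_sal = salary.rename(columns={"team": "visitor", "payroll": "visitor_salary"})

games = (
    games.merge(home_sal, on="home", how="left")
         .merge(vis_sal, on="visitor", how="left")
)

# Difference of logs equals the log of the home/visitor salary ratio.
games["log_sal_ratio"] = np.log(games["home_salary"]) - np.log(games["visitor_salary"])

print(games)
```

A positive `log_sal_ratio` means the home team has the larger payroll; swapping home and visitor flips the sign.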
In this case, we are going to use the run differential as the dependent variable. We fit a linear regression model because the scale of measurement of the dependent variable is continuous. We are going to use the log-scale salary ratio between the two teams as the independent variable. As always, when we fit a linear model, it's good to plot the variables first. Let's take a look at a scatterplot where we can see the linear relationship between the two variables in the regression model. We put the log-scale salary ratio along the x-axis, and we put the continuous dependent variable, which is the run differential encoded from the home team's perspective, along the y-axis. As you can see in the resulting plot, even though it's not very impressive, we can still observe a positive linear relationship between the two variables. Now we are ready to fit the regression model for forecasting purposes. As we did previously, let's fit the OLS model using the run differential as the dependent variable. Let's run this line of the code. We can see the result here. First of all, take a look at the regression coefficient attached to the independent variable, the log-scale salary ratio between the two teams, which is positive. We can observe that the p-value is very low, so the positive coefficient is statistically significant. Then we can take a look at the R-squared value here as well. While the R-squared is not very impressive, we can go ahead and check the predictive performance of our model. As we did previously with the NHL data, we can obtain the fitted results by using the regression model above. We can just run this line of the code to obtain the fitted results. The fitted results will appear at the very right end of the data frame. By using the fitted column, we can also obtain the fitted binary winning variable.
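A minimal sketch of the linear forecasting step, assuming `statsmodels` is available and using synthetic data in place of the MLB games. The column names and the simulated slope and noise level are assumptions chosen only so the example runs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the game-level data (not real MLB results).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"log_sal_ratio": rng.normal(0, 0.5, n)})
df["run_diff"] = 3.0 * df["log_sal_ratio"] + rng.normal(0, 4, n)

# OLS with the run differential as the continuous dependent variable.
ols = smf.ols("run_diff ~ log_sal_ratio", data=df).fit()
print(ols.summary())

# Fitted run differential, still from the home team's perspective.
df["fitted_diff"] = ols.fittedvalues

# Classification rule: a positive fitted differential predicts a home win.
df["pred_win"] = (df["fitted_diff"] > 0).astype(int)
```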
Again, this is basically the predicted run differential between the two teams, encoded from the perspective of the home team. If the fitted run differential is greater than zero, the model predicts that the home team won the game. Based on that classification rule, we are going to create the fitted binary variable. Lastly, we can compare the predicted outcomes against the actual outcomes and obtain the rate of correct predictions. Let's run this line of the code. Here you can see the success rate of our regression model. As a result, we can say that our regression model predicted 55.6 percent of the game results correctly. Now let's move on to the logistic regression, where we are going to use the binary dependent variable. We are going to use the logistic regression model to obtain the fitted outcomes, use those fitted outcomes as our predictions, and then compute the success rate of the logistic regression as well. Basically, the same analytical steps apply in the case of logistic regression. First of all, we are going to load all the libraries. Then we are going to specify the model. In this case, we are going to use win as the binary dependent variable, and the log-scale salary ratio between the two teams as the independent variable. We are going to use smf.glm to fit the logistic regression model. Then we can obtain all the results. Here you can see the regression coefficient attached to our independent variable, the log-scale salary ratio between the two teams, which is statistically significant, and the z-statistic is very high as well. Based on this logistic regression model, we can obtain the fitted probabilities of winning for each game by using the logit model. These are the fitted probabilities from the regression model.
Based on the fitted probabilities, we can also create a binary winning variable. One thing I would like to note is that the classification rule here is even simpler than in the previous NHL example, as there are only two possible outcomes in baseball. Here we will use 0.5 as a cutoff point to classify the game results. When the fitted probability is greater than 0.5, we will assign one for the home team winning, and zero otherwise. Now, let's create the confusion matrix. This will create a two-by-two matrix so that we can compare the actual outcomes against the fitted outcomes. After running this line of the code, we will see the number of correct predictions from the model as well. We can see the number of correctly predicted home team losses and home team wins along the diagonal of the two-by-two matrix. Based on the information from the confusion matrix, we can calculate the success rate of our logistic regression model. We can see that our logistic regression model predicted 55 percent of the total game results correctly. We can also create the classification report here. Under the precision column, you can see the success rate for home team losing, which is 53.6 percent. You can also see the success rate for home team winning, which is 55.8 percent, under the precision column as well. We have used very handy calls to obtain all the fitted results above. However, there is another way to obtain the fitted probabilities and the fitted outcomes, by directly applying the regression model. I will leave this as a self-test. Basically, this is the way to obtain the fitted probabilities by applying the regression formula obtained above. This way, you can calculate the fitted probabilities manually as well.
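To make the manual calculation and the 0.5 cutoff concrete, here is a small sketch. The coefficients `b0` and `b1` are hypothetical placeholders, not the estimates from the lecture, and `pd.crosstab` stands in for whichever confusion-matrix helper the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical logit coefficients (intercept, slope) and toy games.
b0, b1 = 0.1, 0.5
df = pd.DataFrame({
    "log_sal_ratio": [0.8, -0.6, 0.2, -1.0],
    "win": [1, 0, 0, 1],
})

# Logistic formula applied by hand: p = 1 / (1 + exp(-(b0 + b1 * x))).
linear = b0 + b1 * df["log_sal_ratio"]
df["p_win"] = 1 / (1 + np.exp(-linear))
df["p_loss"] = 1 - df["p_win"]

# Two outcomes only, so 0.5 is the cutoff for predicting a home win.
df["pred_win"] = (df["p_win"] > 0.5).astype(int)

# 2x2 table: rows are actual outcomes, columns are predicted outcomes.
cm = pd.crosstab(df["win"], df["pred_win"]).reindex(
    index=[0, 1], columns=[0, 1], fill_value=0
)

# Correct predictions sit on the diagonal.
success_rate = np.trace(cm.to_numpy()) / cm.to_numpy().sum()
print(cm)
print(success_rate)
```

Precision for each class, as shown in a classification report, is the diagonal count divided by its column total in this table.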
If we run this line of the code, you can see the fitted probabilities for the home team winning and the home team losing, respectively. Based on the fitted probabilities, we can also obtain the predicted outcomes. Then we can identify the correct predictions from the fitted values and calculate the success rate of our prediction model, which is basically the same as the previous model with the simpler code. But again, I just wanted to show you how to calculate the fitted probabilities manually by applying the logistic regression model. Now, let's move on to the betting odds as a basis for forecasting. Let's import the dataset first and then take a look at it. In the data frame, you can see the betting odds for the two teams in a match, and you should notice that the betting odds are encoded as decimal odds. Based on the betting odds, we can obtain the fitted probabilities, then take a look at the resulting data frame. Here you can see the fitted winning probabilities derived from the betting odds, and we are going to see how accurate this model is. Before we move on to obtain the fitted results from the winning probabilities, there is another way to think about how accurate the betting odds model is. We can generate the fitted probabilities at the team level and compare the average fitted probabilities against the winning percentages of the teams in the league. By doing so, we can check how good the betting odds are as an independent variable for forecasting the results. Let's go ahead and generate the team-level data, using the mean command. This way, we are able to obtain the average winning percentages of the MLB teams, and also the average fitted winning probabilities of the MLB teams in 2019.
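The lecture does not spell out how the decimal odds are turned into probabilities. One common convention, assumed here, is to invert each decimal odd and normalize the two inverses so they sum to one. The odds values and column names below are made up, and the team-level averages are taken over home games only for simplicity.

```python
import pandas as pd

# Hypothetical decimal odds for three games.
odds = pd.DataFrame({
    "home": ["NYY", "BOS", "NYY"],
    "win": [1, 0, 1],  # 1 if the home team won
    "odds_home": [1.80, 2.10, 1.50],
    "odds_visitor": [2.10, 1.80, 2.75],
})

# Inverse decimal odds overstate probabilities by the bookmaker's margin,
# so rescale the pair to sum to one.
inv_home = 1 / odds["odds_home"]
inv_vis = 1 / odds["odds_visitor"]
odds["p_home"] = inv_home / (inv_home + inv_vis)
odds["p_visitor"] = inv_vis / (inv_home + inv_vis)

# Team-level comparison: average actual win rate vs. average implied probability.
team = odds.groupby("home")[["win", "p_home"]].mean()
print(team)
```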
We can take a look at the relationship between the fitted winning probabilities and the actual winning percentages in 2019. Here we have two variables as a result of [inaudible] the code, and we can plot those two variables together. As you can see, there is a very strong correlation between the actual winning percentages and the fitted winning probabilities. So we can expect that the fitted winning probabilities from the betting odds will give us a fairly accurate prediction rate as well. Let's move on to the forecasting model, where we are going to obtain the fitted outcomes from the fitted winning probabilities. The classification rule is the same as the one we used to obtain the fitted outcomes from the logistic regression model above. Then we can create a column to check whether or not we got the correct prediction from the model, and calculate the success rate of the betting odds model as well. The betting odds model predicted 60 percent of the total game results correctly, while the salary ratio model predicted 55 percent of the results correctly. Now we can also compare the performance of the two models in terms of the Brier Score. Let's calculate the Brier Score for the salary ratio model first. First of all, we create the dummy outcome variable, then we apply the formula directly, and we get a Brier Score of 0.59. Now we are going to obtain the Brier Score for the betting odds model as well. The same goes for this example. The Brier Score for the betting odds model is 0.47. Since a lower Brier Score indicates better forecasts, the Brier Score also indicates that the betting odds model is more accurate than the salary ratio model. That's the end of the MLB [inaudible] model; next we have another model using the NBA dataset.
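The lecture does not say which variant of the Brier Score formula the notebook applies, so the sketch below shows both common forms for a two-outcome game: the single-outcome mean squared error, and the version that sums squared errors over both outcomes, which is exactly twice the first. The function names and the toy probabilities are assumptions.

```python
import numpy as np

def brier_binary(p_home, home_win):
    """Mean squared error between the home-win probability and the 0/1 outcome."""
    p = np.asarray(p_home, dtype=float)
    o = np.asarray(home_win, dtype=float)
    return float(np.mean((p - o) ** 2))

def brier_two_outcome(p_home, home_win):
    """Sum of squared errors over both outcomes (home win, home loss), averaged over games."""
    p = np.asarray(p_home, dtype=float)
    o = np.asarray(home_win, dtype=float)
    per_game = (p - o) ** 2 + ((1 - p) - (1 - o)) ** 2
    return float(np.mean(per_game))

# Toy forecasts: lower scores mean better-calibrated, sharper probabilities.
p = [0.7, 0.4, 0.6, 0.5]
y = [1, 0, 0, 1]
print(brier_binary(p, y), brier_two_outcome(p, y))
```

Whichever form is used, the comparison between models only makes sense if both scores are computed with the same formula on the same games.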