
10. Comprehensive Application Exercises of Regression Methods

10.1. Introduction

This challenge combines the regression analysis methods covered earlier into a complete multiple regression task. Along the way, we review how to implement and apply these methods with libraries such as NumPy, scikit-learn, SciPy, and statsmodels.

10.2. Key Points

  • Simple linear regression

  • Multiple linear regression

  • Hypothesis testing

The challenge uses the Advertising example dataset from the book An Introduction to Statistical Learning, a classic text in statistical learning. If you are interested, you can download and read it for free from the authors' website.

First, we load and preview the dataset.

# Download the dataset
!wget -nc https://cdn.aibydoing.com/aibydoing/files/advertising.csv
import pandas as pd

data = pd.read_csv("advertising.csv", index_col=0)
data.head()
      tv  radio  newspaper  sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9

The dataset contains 200 rows and 4 columns in total. Each row records the advertising spend on three channels and the corresponding sales. Taking the first row as an example: to sell an average of 22.1 units of goods, the supermarket spent $230.1, $37.8, and $69.2 on TV, radio, and newspaper advertising, respectively.

Therefore, in this challenge, the first 3 columns are regarded as features and the last column as the target value.

Exercise 10.1

Challenge: Use the ordinary least squares method provided by SciPy to fit a simple linear regression model between each of the three features and the target, and obtain the fitted parameters.

Requirement: Use the scipy.optimize.leastsq function for the calculation, and output the function's return value directly without further processing.

import numpy as np
from scipy.optimize import leastsq

### Code start ### (≈ 10 lines of code)
params_tv = None
params_radio = None
params_newspaper = None
### Code end ###

params_tv[0], params_radio[0], params_newspaper[0]

Expected output

(array([ 7.03259354,  0.04753664]),
 array([ 9.31163807,  0.20249578]),
 array([12.35140711,  0.0546931 ]))
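
One possible way to fill in the block above is sketched here; it is not the only solution. It assumes the column names shown in the preview and defines a hypothetical residuals helper. Note that leastsq returns a tuple whose first element is the fitted parameter array, which is why the cell above indexes params_tv[0].

def residuals(p, x, y):
    # Residuals of the line y = p[0] + p[1] * x
    return y - (p[0] + p[1] * x)

x0 = np.array([0.0, 0.0])  # initial guess for (intercept, slope)
params_tv = leastsq(residuals, x0, args=(data["tv"].values, data["sales"].values))
params_radio = leastsq(residuals, x0, args=(data["radio"].values, data["sales"].values))
params_newspaper = leastsq(residuals, x0, args=(data["newspaper"].values, data["sales"].values))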

Next, using the results obtained by the least squares method, we plot the fitted lines over the original scatter plots.

Exercise 10.2

Challenge: Plot scatter plots of each of the three features against the target in a row of horizontal subplots, and add the linear fitting lines.

Requirement: Each fitted line should span from the minimum to the maximum abscissa value of its scatter plot and be drawn in red.

from matplotlib import pyplot as plt

%matplotlib inline

### Code start ### (≈ 10 lines of code)


### Code end ###

Expected output

[Figure: three horizontal subplots, one per feature (tv, radio, newspaper), each showing the sales scatter plot with its red fitted line]
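
One possible plotting sketch, assuming the params_* tuples from Exercise 10.1 (their first element is the (intercept, slope) array returned by leastsq):

import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, feature, p in zip(
    axes,
    ["tv", "radio", "newspaper"],
    [params_tv, params_radio, params_newspaper],
):
    w0, w1 = p[0]  # leastsq returns (parameter array, status flag)
    x = data[feature]
    x_line = np.array([x.min(), x.max()])  # line spans the data range
    ax.scatter(x, data["sales"])
    ax.plot(x_line, w0 + w1 * x_line, color="r")  # red fitted line
    ax.set_xlabel(feature)
    ax.set_ylabel("sales")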

Next, we try to build a multiple linear regression model that includes all features.

\[ y = \omega_0 + \omega_1 \cdot tv + \omega_2 \cdot radio + \omega_3 \cdot newspaper \]

Exercise 10.3

Challenge: Use the linear regression method provided by scikit-learn to build a multiple linear regression model consisting of 3 features and a target.

Requirement: Only the sklearn.linear_model.LinearRegression class provided by scikit-learn can be used.

from sklearn.linear_model import LinearRegression

### Code start ### (≈ 4 lines of code)
model = None
### Code end ###

model.coef_, model.intercept_  # Return the model's coefficients and intercept

Expected output

(array([ 0.04576465,  0.18853002, -0.00103749]), 2.9388893694594067)
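
A minimal sketch of the fill-in block, using only the LinearRegression class as the requirement states:

x = data[["tv", "radio", "newspaper"]]
y = data["sales"]
model = LinearRegression()
model.fit(x, y)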

Next, we want to test the multiple linear regression model, using the methods provided by the statsmodels library to perform a goodness-of-fit test and variable significance tests.

Exercise 10.4

Challenge: Use the relevant methods provided by the statsmodels library to complete the goodness-of-fit test and variable significance test for the above multiple regression model.

Hint: You can use statsmodels.api.OLS or statsmodels.formula.api.ols (imported below as smf). The latter is not covered in this experiment, so you will need to learn about it on your own.

import statsmodels.formula.api as smf

### Code start ### (≈ 3 lines of code)
results = None
### Code end ###

results.summary2()  # Print the model summary

Expected output

Model:              OLS       Adj. R-squared:      0.896
Dependent Variable: sales     AIC:                 780.3622
Date:                         BIC:                 793.5555
No. Observations:   200       Log-Likelihood:      -386.18
Df Model:           3         F-statistic:         570.3
Df Residuals:       196       Prob (F-statistic):  1.58e-96
R-squared:          0.897     Scale:               2.8409
-----------------------------------------------------------
            Coef.   Std.Err.     t      P>|t|  [0.025 0.975]
-----------------------------------------------------------
Intercept   2.9389   0.3119    9.4223  0.0000  2.3238 3.5540
tv          0.0458   0.0014   32.8086  0.0000  0.0430 0.0485
radio       0.1885   0.0086   21.8935  0.0000  0.1715 0.2055
newspaper  -0.0010   0.0059   -0.1767  0.8599 -0.0126 0.0105
-----------------------------------------------------------
Omnibus:        60.414    Durbin-Watson:      2.084
Prob(Omnibus):   0.000    Jarque-Bera (JB):   151.241
Skew:           -1.327    Prob(JB):           0.000
Kurtosis:        6.332    Condition No.:      454
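
One possible completion of the block above, sketched with the formula API mentioned in the hint (the formula string assumes the column names shown in the preview):

results = smf.ols("sales ~ tv + radio + newspaper", data=data).fit()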

We can see that the regression coefficients obtained here agree with the scikit-learn results above. Meanwhile, the P-values of tv and radio are close to 0, while the P-value of newspaper is relatively large. We can therefore conclude that tv and radio pass the variable significance test while newspaper does not. In fact, you can try recalculating the multiple linear regression after removing the newspaper feature, as sketched below.
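
A sketch of the re-fit, simply dropping newspaper from the formula (results_reduced is an illustrative name, not part of the original exercise):

results_reduced = smf.ols("sales ~ tv + radio", data=data).fit()
results_reduced.summary2()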

Model:              OLS       Adj. R-squared:      0.896
Dependent Variable: sales     AIC:                 778.3941
Date:                         BIC:                 788.2891
No. Observations:   200       Log-Likelihood:      -386.20
Df Model:           2         F-statistic:         859.6
Df Residuals:       197       Prob (F-statistic):  4.83e-98
R-squared:          0.897     Scale:               2.8270
-----------------------------------------------------------
            Coef.   Std.Err.     t      P>|t|  [0.025 0.975]
-----------------------------------------------------------
Intercept   2.9211   0.2945    9.9192  0.0000  2.3403 3.5019
tv          0.0458   0.0014   32.9087  0.0000  0.0430 0.0485
radio       0.1880   0.0080   23.3824  0.0000  0.1721 0.2038
-----------------------------------------------------------
Omnibus:        60.022    Durbin-Watson:      2.081
Prob(Omnibus):   0.000    Jarque-Bera (JB):   148.679
Skew:           -1.323    Prob(JB):           0.000
Kurtosis:        6.292    Condition No.:      425

The above is the result recalculated after removing the newspaper feature. Notice that the adjusted \(R^2\) values of the models with and without newspaper are both 0.896, which confirms that this feature contributes little to explaining the variation in the target. You can try removing one of the other two features instead and observe how the \(R^2\) goodness-of-fit value changes.

