Pro Membership

Pro Membership is a membership plan created by the author to maintain and update this tutorial. You can get more benefits and services, click to view details.

10. Comprehensive Application Exercises of Regression Methods#

10.1. Introduction#

This challenge will combine the relevant regression analysis methods learned previously to complete a multiple regression analysis task. At the same time, we will review the implementation and application of different methods by combining libraries such as NumPy, scikit-learn, SciPy, and statsmodels.

10.2. Key Points#

Simple linear regression
Multiple linear regression
Hypothesis testing

The challenge uses the Advertising example dataset provided in the book An Introduction to Statistical Learning for practice. This book is a classic in statistical learning. If you are interested, you can download and read it for free from the author’s website.

First, we load and preview the dataset.

                          # 下载数据集
!wget -nc https://cdn.aibydoing.com/aibydoing/files/advertising.csv

                          import pandas as pd

data = pd.read_csv("advertising.csv", index_col=0,)
data.head()

	tv	radio	newspaper	sales
1	230.1	37.8	69.2	22.1
2	44.5	39.3	45.1	10.4
3	17.2	45.9	69.3	9.3
4	151.5	41.3	58.5	18.5
5	180.8	10.8	58.4	12.9

The dataset contains 4 columns and 200 rows in total. Each sample represents the advertising expenses required for a supermarket to sell a corresponding unit of goods. Taking the first row as an example, it means that for the supermarket to sell an average of 22.1 units of goods, the advertising expenses for TV, radio, and newspaper are: $230.1, $37.8, and $69.2 respectively.

Therefore, in this challenge, the first 3 columns are regarded as features and the last column as the target value.

Exercise 10.1

Challenge: Use the ordinary least squares method provided by SciPy to calculate the fitting parameters of the univariate linear regression model between the three features and the target respectively.

Requirement: The scipy.optimize.leastsq function should be used to complete the calculation, and the function result should be directly output without processing.

                          import numpy as np
from scipy.optimize import leastsq

## 代码开始 ### (≈ 10 行代码)
params_tv = None
params_radio = None
params_newspaper = None
## 代码结束 ###

params_tv[0], params_radio[0], params_newspaper[0]

                        

Solution to Exercise 10.1

                          import numpy as np
from scipy.optimize import leastsq

### Code start ### (≈ 10 lines of code)
p_init = np.random.randn(2)

def func(p, x):
    w0, w1 = p
    f = w0 + w1*x
    return f

def err_func(p, x, y):
    ret = func(p, x) - y
    return ret

params_tv = leastsq(err_func, p_init, args=(data.tv, data.sales))
params_radio = leastsq(err_func, p_init, args=(data.radio, data.sales))
params_newspaper = leastsq(err_func, p_init, args=(data.newspaper, data.sales))
### Code end ###

params_tv[0], params_radio[0], params_newspaper[0]

                        

Expected output

((array([7.03259354, 0.04753664]),
 (array([9.31163807, 0.20249578]),
 (array([12.35140711, 0.0546931]))

Next, based on the results obtained by the least squares method, we plot the fitted line onto the original scatter plot of the distribution.

Exercise 10.2

Challenge: Plot the scatter plots between the three features and the target respectively in a horizontal subplot manner, and add the linear fitting lines.

Requirement: The linear fitting lines start from the minimum abscissa value in the scatter plot, end at the maximum abscissa value, and are shown in red.

                          from matplotlib import pyplot as plt

%matplotlib inline

## 代码开始 ### (≈ 10 行代码)

## 代码结束 ###

Solution to Exercise 10.2

                          from matplotlib import pyplot as plt
%matplotlib inline

### Code start ### (≈ 10 lines of code)
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

data.plot(kind='scatter', x='tv', y='sales', ax=axes[0])
data.plot(kind='scatter', x='radio', y='sales', ax=axes[1])
data.plot(kind='scatter', x='newspaper', y='sales', ax=axes[2])

x_tv = np.array([data.tv.min(), data.tv.max()])
axes[0].plot(x_tv, params_tv[0][1]*x_tv + params_tv[0][0], 'r')

x_radio = np.array([data.radio.min(), data.radio.max()])
axes[1].plot(x_radio, params_radio[0][1]*x_radio + params_radio[0][0], 'r')

x_newspaper = np.array([data.newspaper.min(), data.newspaper.max()])
axes[2].plot(x_newspaper, params_newspaper[0][1] *
             x_newspaper + params_newspaper[0][0], 'r')
### Code end ###

                        

Expected output

Next, we try to build a multiple linear regression model that includes all features.

\[ y = \omega_0 + \omega_1 * tv + \omega_2 * radio + \omega_3 * newspaper \]

Exercise 10.3

Challenge: Use the linear regression method provided by scikit-learn to build a multiple linear regression model consisting of 3 features and a target.

Requirement: Only the sklearn.linear_model.LinearRegression class provided by scikit-learn can be used.

                          from sklearn.linear_model import LinearRegression

## 代码开始 ### (≈ 4 行代码)
model = None
## 代码结束 ###

model.coef_, model.intercept_  # 返回模型自变量系数和截距项

Solution to Exercise 10.3

                          from sklearn.linear_model import LinearRegression

### Code starts ### (≈ 4 lines of code)
X = data[['tv', 'radio', 'newspaper']]
y = data.sales

model = LinearRegression()
model.fit(X, y)
### Code ends ###

model.coef_, model.intercept_  # Return the coefficients of the independent variables and the intercept term of the model

                        

Expected output

(array([ 0.04576465,  0.18853002, -0.00103749]), 2.9388893694594067)

Next, we hope to test the multiple linear regression model. Use the relevant methods provided by the statsmodels library to complete the goodness-of-fit test and variable significance test.

Exercise 10.4

Challenge: Use the relevant methods provided by the statsmodels library to complete the goodness-of-fit test and variable significance test for the above multiple regression model.

Hint: You can use statsmodels.api.sm.OLS or statsmodels.formula.api.smf. The latter is not covered in this experiment and you need to learn about it on your own.

                          import statsmodels.formula.api as smf

## 代码开始 ### (≈ 3 行代码)
results = None
## 代码结束 ###

results.summary2()  # 输出模型摘要

Solution to Exercise 10.4

                          import statsmodels.formula.api as smf

### Code start ### (≈ 3 lines of code)
results = smf.ols(formula='sales ~ tv + radio + newspaper', data=data).fit()
### Code end ###

results.summary2() # Output the model summary

Expected output

Model:	OLS	Adj. R-squared:	0.896
Dependent Variable:	sales	AIC:	780.3622
Date:		BIC:	793.5555
No. Observations:	200	Log-Likelihood:	-386.18
Df Model:	3	F-statistic:	570.3
Df Residuals:	196	Prob (F-statistic):	1.58e-96
R-squared:	0.897	Scale:	2.8409

	Coef.	Std.Err.	t	P>\|t\|	[0.025	0.975]
Intercept	2.9389	0.3119	9.4223	0.0000	2.3238	3.5540
tv	0.0458	0.0014	32.8086	0.0000	0.0430	0.0485
radio	0.1885	0.0086	21.8935	0.0000	0.1715	0.2055
newspaper	-0.0010	0.0059	-0.1767	0.8599	-0.0126	0.0105

Omnibus:	60.414	Durbin-Watson:	2.084
Prob(Omnibus):	0.000	Jarque-Bera (JB):	151.241
Skew:	-1.327	Prob(JB):	0.000
Kurtosis:	6.332	Condition No.:	454

We can see that the regression fitting coefficients obtained here are consistent with the calculation results of scikit-learn above. At the same time, the P-values of tv and radio are close to 0 [precision], while the P-value of newspaper is relatively large. We can consider that tv and radio have passed the variable significance test, while newspaper has not. In fact, you can try to recalculate the multiple linear regression results after removing the newspaper feature.

Model:	OLS	Adj. R-squared:	0.896
Dependent Variable:	sales	AIC:	778.3941
Date:		BIC:	788.2891
No. Observations:	200	Log-Likelihood:	-386.20
Df Model:	2	F-statistic:	859.6
Df Residuals:	197	Prob (F-statistic):	4.83e-98
R-squared:	0.897	Scale:	2.8270

	Coef.	Std.Err.	t	P>\|t\|	[0.025	0.975]
Intercept	2.9211	0.2945	9.9192	0.0000	2.3403	3.5019
tv	0.0458	0.0014	32.9087	0.0000	0.0430	0.0485
radio	0.1880	0.0080	23.3824	0.0000	0.1721	0.2038

Omnibus:	60.022	Durbin-Watson:	2.081
Prob(Omnibus):	0.000	Jarque-Bera (JB):	148.679
Skew:	-1.323	Prob(JB):	0.000
Kurtosis:	6.292	Condition No.:	425

The above is the result recalculated after removing the newspaper feature. You can find that the $R^2$ values of the multiple linear regression models with and without newspaper are both 0.896. This also verifies that this feature is not very helpful in reflecting the changes in the target value. You can try removing one of the other two features and then check the change in the $R^2$ goodness-of-fit value.

○ Sharethis article link to your social media, blog, forum, etc. More external links will increase the search engine ranking of this site.

If you find this content helpful, you can buy me a coffee

9. Regression Model Evaluation and Testing

11. Logistic Regression Implementation and Application