9. Regression Model Evaluation and Testing#
9.1. Introduction#
So far, we have covered the main methods of regression analysis and prediction and learned several model evaluation criteria. However, the results of a regression analysis usually still need to go through one more step, namely hypothesis testing, which draws on knowledge from mathematical statistics.
9.2. Key Points#
- Goodness-of-fit test
- Variable significance test
For regression prediction results, there are usually three evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Mean Absolute Percentage Error (MAPE). MAE and MSE were introduced in the linear regression and polynomial regression experiments; here we review them and supplement the calculation formula for MAPE.
The Mean Absolute Error (MAE) is the average of the absolute errors, and its calculation formula is as follows:

$$\textrm{MAE}=\frac{1}{n} \sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|$$
The Mean Squared Error (MSE) represents the expected value of the squared error, and its calculation formula is as follows:

$$\textrm{MSE}=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
Mean Absolute Percentage Error (MAPE). MAPE is a percentage variant of MAE, which makes it easier to interpret than the other statistics. For example, a MAPE of 5 means that the predictions deviate from the actual values by 5% on average. The calculation formula for MAPE is as follows:

$$\textrm{MAPE}=\frac{100}{n} \sum_{i=1}^{n}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|$$
In the three formulas above, \(y_{i}\) is the true value, \(\hat y_{i}\) is the predicted value, and \(n\) is the number of samples. For all three metrics, smaller values indicate a more accurate prediction model.
scikit-learn provides built-in functions for computing MAE and MSE:
sklearn.metrics.mean_absolute_error(y_true, y_pred) # MAE calculation
sklearn.metrics.mean_squared_error(y_true, y_pred) # MSE calculation
Regarding the calculation method of MAPE, although scikit-learn does not directly provide it, we have learned how to implement it using the formula.
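For reference, here is a minimal NumPy sketch of MAPE following the formula above. The function name mape and the example arrays are our own illustration, and the implementation assumes no true value is zero; note also that newer scikit-learn releases ship sklearn.metrics.mean_absolute_percentage_error, which returns a fraction rather than a percentage.

import numpy as np


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, returned as a percentage."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    # Assumes no true value is 0, otherwise the division below is undefined
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


y_true = [9, 17, 23]
y_pred = [10, 16, 25]
mape(y_true, y_pred)  # ≈ 8.56, i.e. predictions are off by about 8.6% on average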
In practice, the three metrics above are often only suitable for comparing models with one another, for example, showing that polynomial regression fits a given data set better than linear regression does. If we want to evaluate whether a model truly explains the sample data well, we need to resort to a series of hypothesis-testing methods.
9.3. Goodness-of-Fit Test#
When we use regression analysis to fit data, we are in effect replacing the true population parameters with parameters estimated from a sample. Statistical theory tells us that, over sufficiently many repeated samples, the expectation of the parameter estimator equals the true population parameter; in a single sample, however, the estimate may well differ from the true value.
Therefore, after a single sampling, we need statistical tests to judge whether the difference between the estimated value and the true value of the parameter is significant. In a simple linear regression model, we generally carry out a goodness-of-fit test, a significance test of the variable, and interval estimation of the parameters. In a multiple linear regression model, there are typically a goodness-of-fit test, significance tests of the individual variables (t-test), a significance test of the equation as a whole (F-test), and confidence intervals for the parameters. In simple linear regression, the t-test and the F-test are equivalent.
The goodness-of-fit test for simple linear regression refers to testing the degree of fit between the sample regression fitting line and the sample observations. We generally use the coefficient of determination \(R^2\) as an indicator to measure the goodness of fit.
Suppose we have a set of sample observations \((x_i, y_i)\), and the linear regression fitted line obtained from them is:

$$\hat{y}_{i}=\hat{w} x_{i}+\hat{b}$$
At this time, if the mean of the sample observations (observed mean) is: \({\displaystyle{\bar{y}}={\frac{1}{n}}\sum_{i=1}^{n}y_{i}}\), we define the “deviation” between the \(i\)-th observation and the sample mean as: \(Y_{i}=y_{i}-\overline{y}\).
The “deviation” can then be decomposed into:

$$Y_{i}=y_{i}-\overline{y}=\left(\hat{y}_{i}-\overline{y}\right)+\left(y_{i}-\hat{y}_{i}\right)=\hat{Y}_{i}+e_{i}$$
Among them, \(\hat{Y}_{i}=\left(\hat{y}_{i}-\overline{y}\right)\) is the difference between the sample regression fitted value and the observed mean. We can consider this as the part explained by the regression line.
And \(e_{i}=\left(y_{i}-\hat{y}_{i}\right)\) is the difference between the actual observation and the fitted value, that is, the “residual”; we can regard it as the part that the regression line cannot explain. When \(y_{i}=\hat{y}_{i}\), the actual observation falls exactly on the fitted line, which is the best-fit case: the whole “deviation” is then explained by the regression line and nothing is left to the residual.
Then, we denote:

$$TSS=\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}, \quad ESS=\sum_{i=1}^{n}\left(\hat{y}_{i}-\overline{y}\right)^{2}, \quad RSS=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$

We can now derive the relationship among these three quantities. Expanding the total sum of squares using the decomposition above gives:

$$\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}=\sum_{i=1}^{n}\left(\hat{y}_{i}-\overline{y}\right)^{2}+\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}+2 \sum_{i=1}^{n}\left(\hat{y}_{i}-\overline{y}\right)\left(y_{i}-\hat{y}_{i}\right)$$

For the linear regression model \(y_i = w x_{i}+b\), the least squares method estimates the parameters by minimizing:

$$Q(w, b)=\sum_{i=1}^{n}\left(y_{i}-w x_{i}-b\right)^{2}$$

The necessary conditions for this objective to attain its minimum are that the partial derivatives with respect to \(b\) and \(w\) are zero, which yields:

$$\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)=\sum_{i=1}^{n} e_{i}=0$$

$$\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}=\sum_{i=1}^{n} e_{i} x_{i}=0$$

Now, let's go back to the cross term in the expansion above. Omitting the constant coefficient \(2\) and substituting \(\hat{y}_{i}=\hat{w} x_{i}+\hat{b}\), we have:

$$\sum_{i=1}^{n}\left(\hat{y}_{i}-\overline{y}\right) e_{i}=\hat{b} \sum_{i=1}^{n} e_{i}+\hat{w} \sum_{i=1}^{n} x_{i} e_{i}-\overline{y} \sum_{i=1}^{n} e_{i}$$

The first and third terms are zero because \(\sum e_{i}=0\), and the second term is zero because \(\sum x_{i} e_{i}=0\); hence the cross term vanishes.

Therefore, the expansion of the total sum of squares reduces to:

$$\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}=\sum_{i=1}^{n}\left(\hat{y}_{i}-\overline{y}\right)^{2}+\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$

That is, we have proved:

$$TSS = ESS + RSS$$
Among them, TSS is called the Total Sum of Squares, ESS is called the Explained Sum of Squares, and RSS is the Residual Sum of Squares. Thus, the total deviation of the observed values of \(y\) around their mean can be decomposed into two parts: one part, ESS, is explained by the regression fitted line, and the other part, RSS, is attributable to the random disturbance.
In a given sample, TSS remains fixed. The closer the actual observations lie to the sample regression fitted line, the larger the proportion of ESS in TSS. We therefore define the goodness of fit as the ratio of the explained sum of squares ESS to the total sum of squares TSS:

$$R^{2}=\frac{ESS}{TSS}=1-\frac{RSS}{TSS}$$
In short, when RSS is smaller, \(R^{2}\) approaches 1 more closely, indicating that the explanatory power of the model is stronger. Conversely, the explanatory power of the model is weaker.
Therefore, generally, the valid range of values for \(R^{2}\) is between \([0, 1]\). The larger the value, the better the goodness of fit of the model.
Next, we will calculate the goodness of fit of the linear regression model through an example.
import numpy as np
# Generate two example data sets that share the same x
x = np.array([4, 8, 12, 25, 32, 43, 58, 63, 69, 79]).reshape(-1, 1)
y1 = np.array([9, 17, 23, 51, 62, 83, 115, 125, 137, 159]).reshape(-1, 1)
y2 = np.array([20, 33, 50, 56, 42, 31, 33, 46, 65, 75]).reshape(-1, 1)
from matplotlib import pyplot as plt
%matplotlib inline
# Plot the two data sets in separate subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(x, y1)
axes[1].scatter(x, y2)
from sklearn.linear_model import LinearRegression
model1 = LinearRegression()  # linear model 1
model1.fit(x, y1)
model2 = LinearRegression()  # linear model 2
model2.fit(x, y2)
LinearRegression()
# Draw the fitted lines on the original scatter plots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# model 1
axes[0].scatter(x, y1)
axes[0].plot(
[0, 90], [model1.intercept_, model1.coef_[0] * 90 + model1.intercept_], "r"
)
# model 2
axes[1].scatter(x, y2)
axes[1].plot(
[0, 90], [model2.intercept_, model2.coef_[0] * 90 + model2.intercept_], "r"
)
Next, we can use the \(R^{2}\) calculation method provided by scikit-learn to compute the \(R^{2}\) results of the two models.
from sklearn.metrics import r2_score
# Pass in the true observations and the model predictions respectively
r2_score(y1, model1.predict(x)), r2_score(y2, model2.predict(x))
(0.9993315719475073, 0.3740615194558855)
We can use the formula above to verify scikit-learn's calculation result.
# Verify model1's result using the formula
rss = np.sum(pow((y1 - model1.predict(x)), 2))
tss = np.sum(pow((y1 - np.mean(y1)), 2))
1 - rss / tss
0.9993315719475073
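As an additional check of the derivation above (our own addition, reusing the rss and tss variables from the previous cell), we can also compute ESS and confirm numerically that TSS = ESS + RSS and that ESS / TSS reproduces the same \(R^{2}\) value.

# Verify the decomposition TSS = ESS + RSS for model 1
ess = np.sum((model1.predict(x) - np.mean(y1)) ** 2)  # explained sum of squares
np.allclose(tss, ess + rss), ess / tss  # expect (True, 0.9993...)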
It can be seen that the \(R^{2}\) result of Model 1 approaches 1, while the value of \(R^{2}\) of Model 2 is much smaller. This indicates that the linear regression model has a stronger explanatory power for the first set of data, but insufficient explanatory power for the second set of data. In fact, it can also be visually observed that the second set of data is actually more suitable for a polynomial regression model rather than a linear regression model. This also corroborates the result of the \(R^{2}\) value.
Therefore, although the ordinary least squares method already guarantees the best-fitting parameters for the chosen model, we still need a “goodness-of-fit test” to judge whether the model explains the data distribution well.
One point worth noting: although we said above that the valid range of \(R^{2}\) is \([0, 1]\), the API provided by scikit-learn may return a negative \(R^{2}\). For example, if we deliberately choose a fitted line whose trend is opposite to that of the sample, the computed \(R^{2}\) will be negative. In that case, the decomposition \(TSS = ESS + RSS\) no longer holds either.
plt.scatter(x, y1)
plt.plot([0, 90], [200, -2 * 90 + 200], "r")  # the linear function is y = -2x + 200
plt.title("r2_score: {}".format(r2_score(y1, -2 * x + 200)))
You may wonder what value of \(R^{2}\) means that a model passes the goodness-of-fit test. In fact, there is no fixed cutoff; different problems have different standards. You can read further discussions on how to interpret \(R^{2}\) values.
9.4. Variable Significance Test#
The variable significance test determines whether the explanatory variable \(x\) is a significant influencing factor of the explained variable \(y\) in regression analysis, and it draws on the theory of hypothesis testing in mathematical statistics. The principle of hypothesis testing is to first state an assumption about the population parameters or the form of the population distribution (the null hypothesis), and then use the sample information to judge whether that assumption is reasonable. In other words, we decide whether to accept or reject the null hypothesis according to whether there is a significant discrepancy between the sample information and the null hypothesis.
Since it involves a lot of knowledge of mathematical statistics, the relevant principles of hypothesis testing will not be discussed here. For details, you can refer to the relevant materials.
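That said, here is a minimal sketch of the idea as it applies to a regression coefficient (standard background rather than part of this experiment's derivation): the null hypothesis is that the coefficient is zero, i.e. the explanatory variable has no linear effect on \(y\), and the hypothesis is assessed with a t statistic.

$$H_{0}: w=0 \quad \text{vs.} \quad H_{1}: w \neq 0, \qquad t=\frac{\hat{w}}{\operatorname{se}(\hat{w})} \sim t(n-2) \ \text{under } H_{0}$$

If the corresponding p-value falls below the chosen significance level (commonly 0.05), we reject \(H_{0}\) and consider the variable significant.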
Next, let's see directly how to perform the variable significance test in Python and explain the meaning of the outputs. Instead of using scikit-learn's linear regression here, we will model with statsmodels, a library commonly used for statistical analysis.
First, use the second set of data above to re-establish an ordinary least squares linear regression model.
import statsmodels.api as sm
x = sm.add_constant(x)  # add a constant column of 1s for the intercept term
model = sm.OLS(endog=y2, exog=x)  # ordinary least squares; endog is the dependent variable, exog the independent variables
results = model.fit()
results.params  # output the fitted coefficients
array([30.12973186, 0.38092285])
x
array([[ 1., 4.],
[ 1., 8.],
[ 1., 12.],
[ 1., 25.],
[ 1., 32.],
[ 1., 43.],
[ 1., 58.],
[ 1., 63.],
[ 1., 69.],
[ 1., 79.]])
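As a quick sanity check (our own addition), these parameters should match the intercept and slope that the scikit-learn model fitted on the same data earlier, since both are ordinary least squares estimates.

# Compare statsmodels parameters with the scikit-learn model fitted earlier
print(results.params)                   # [intercept, slope] from statsmodels
print(model2.intercept_, model2.coef_)  # intercept and slope from scikit-learn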
Next, we can directly output the summary information of the model, which is also a major feature of statsmodels.
import warnings
warnings.filterwarnings("ignore")
results.summary2(alpha=0.05)  # output the model summary at the 95% confidence level
|  |  |  |  |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.296 |
| Dependent Variable: | y | AIC: | 83.1646 |
| Date: | 2023-11-09 14:47 | BIC: | 83.7698 |
| No. Observations: | 10 | Log-Likelihood: | -39.582 |
| Df Model: | 1 | F-statistic: | 4.781 |
| Df Residuals: | 8 | Prob (F-statistic): | 0.0602 |
| R-squared: | 0.374 | Scale: | 200.68 |

|  | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30.1297 | 8.1820 | 3.6824 | 0.0062 | 11.2620 | 48.9975 |
| x1 | 0.3809 | 0.1742 | 2.1865 | 0.0602 | -0.0208 | 0.7827 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 2.455 | Durbin-Watson: | 0.832 |
| Prob(Omnibus): | 0.293 | Jarque-Bera (JB): | 0.895 |
| Skew: | -0.056 | Prob(JB): | 0.639 |
| Kurtosis: | 1.538 | Condition No.: | 86 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Focus on the second table, which gives, in order, the fitted coefficients, standard errors, t values, p-values, and confidence intervals.
|  | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30.1297 | 8.1820 | 3.6824 | 0.0062 | 11.2620 | 48.9975 |
| x1 | 0.3809 | 0.1742 | 2.1865 | 0.0602 | -0.0208 | 0.7827 |
Among them, the p-value is what we mainly look at. If the p-value is very small (commonly below 0.05), the variable passes the significance test and should be kept in the model. Here the intercept term is significant (p = 0.0062), while the coefficient of x1 (p = 0.0602) narrowly misses the 0.05 threshold, which is consistent with the weak linear fit on this data set. In multiple regression modeling, we can likewise use the p-values to weed out insignificant variables.
In fact, the table above also gives the interval estimates of the parameters. The last two columns show that the 95% confidence intervals for the intercept term and the coefficient of the independent variable are \((11.2620, 48.9975)\) and \((-0.0208, 0.7827)\) respectively (the bounds correspond to the 2.5% and 97.5% quantiles). Note that the interval for x1 contains 0, again indicating that this coefficient is not significantly different from zero at the 5% level.
Since the confidence interval reflects how close the sample estimate is likely to be to the true population parameter, the narrower the confidence interval, the better. Generally, we can narrow the confidence interval by increasing the sample size and improving the model's goodness of fit.
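To make the link between the table and the underlying test explicit, the t value, p-value, and 95% confidence interval for x1 can be reproduced from the coefficient and its standard error. The sketch below uses scipy.stats; results.bse and results.df_resid are standard statsmodels result attributes holding the standard errors and the residual degrees of freedom.

import numpy as np
from scipy import stats

coef, se = results.params[1], results.bse[1]  # slope estimate and its standard error
df = results.df_resid                         # residual degrees of freedom, n - 2 = 8

t_value = coef / se                                          # ≈ 2.1865
p_value = 2 * (1 - stats.t.cdf(abs(t_value), df))            # two-sided p-value ≈ 0.0602
ci = coef + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se  # 95% CI ≈ (-0.0208, 0.7827)

t_value, p_value, ci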
9.5. Summary#
In this experiment, we reviewed the regression model evaluation metrics used previously. However, these metrics are often only suitable for comparing models with one another. To further evaluate whether a model explains the data well, we need to test the model. We therefore studied two kinds of methods: the goodness-of-fit test and the variable significance test for regression models. We hope you now have a basic understanding of both; if you have time, you can use the related materials mentioned in the text to study the underlying hypothesis-testing principles and methods in mathematical statistics in full.
Related Links