7. Ridge Regression and LASSO Regression Implementation#
7.1. Introduction#
In regression prediction, besides the linear regression and polynomial regression introduced earlier, there are many other regression analysis methods, such as ridge regression, LASSO regression, and the regression variants derived from the classification methods introduced in the coming weeks. In this experiment, we first introduce the ridge regression and LASSO regression methods. Both apply regularization techniques and can handle some problems that ordinary linear regression cannot.
7.2. Key Points#
Limitations of Ordinary Least Squares
OLS Linear Fitting of Hilbert Matrix
Ridge Regression
LASSO Regression
7.3. Limitations of Ordinary Least Squares#
In the previous experiments, we learned to use the least squares method to solve linear regression and polynomial regression problems. In the chapter on polynomial regression, we also saw that polynomial regression can be regarded as a special form of linear regression; in other words, linear regression lies at the core of these regression methods.
7.4. Matrix Representation of Ordinary Least Squares#
First, let’s review the content of linear regression. In linear regression, we need to solve for the parameters \(w\) of the linear model:

$$
f(x) = w^{T}x \tag{1}
$$
When we use Ordinary Least Squares (OLS) for learning (i.e., with the squared loss function), this becomes minimizing the objective function:

$$
F_{OLS} = \sum_{i=1}^{n}\left ( y_{i}-w^{T}x_{i} \right )^{2} \tag{2}
$$
We can rewrite equation \((2)\) in vector form:

$$
F_{OLS} = \left \| y-Xw \right \|_{2}^{2} \tag{3}
$$
\(\left \| y-Xw \right \|_{2}\) represents the 2-norm, which is the square root of the sum of the squares of the absolute values of the vector elements.
In equation \((3)\), \(X\) is the matrix composed of the data \((x_{1},x_{2},\cdots,x_{n})^{T}\), and \(y\) is the column vector \((y_{1},y_{2},\cdots,y_{n})^{T}\). The analytical solution that minimizes formula \((3)\) is:

$$
\hat{w} = \left ( X^{T}X \right )^{-1}X^{T}y \tag{4}
$$
The condition for formula \((4)\) to hold is that \(\left | X^{T}X \right |\) cannot be 0. When the correlation between variables is strong (multicollinearity), or when the number of features \(m\) is greater than the number of samples \(n\), \(X\) in formula \((4)\) is not a full column rank matrix (\(rank(X)\neq dim(x)\)). As a result, the value of \(\left | X^{T}X \right |\) approaches 0, which makes the fitted parameters numerically unstable. This is the limitation of the ordinary least squares method.
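To make formula \((4)\) concrete, here is a minimal NumPy sketch (not part of the original experiment; the toy data and variable names are illustrative) that solves a small, well-conditioned problem with the normal equation:

```python
import numpy as np

# A small, well-conditioned toy problem: y depends linearly on two features
rng = np.random.RandomState(0)
X = rng.rand(50, 2)                                  # 50 samples, 2 features
y = X @ np.array([3.0, 5.0]) + 0.01 * rng.randn(50)  # true coefficients [3, 5] plus small noise

# Normal-equation solution from formula (4): w = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)  # close to [3, 5], because X^T X is well conditioned here
```

The rest of this section shows what happens when this same formula is applied to data where \(X^{T}X\) is nearly singular.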
7.5. Hilbert Matrix OLS Linear Fitting#
In linear algebra, a Hilbert matrix is a square matrix whose entries are all unit fractions. Specifically, the entry in the \(i\)-th row and \(j\)-th column of a Hilbert matrix \(H\) is:

$$
H_{ij} = \frac{1}{i+j-1} \tag{5}
$$
A \(5 \times 5\) Hilbert matrix is shown in formula \((6)\):

$$
H_{5} = \begin{bmatrix}
1 & \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \\
\frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6} \\
\frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7} \\
\frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8} \\
\frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8} & \frac{1}{9}
\end{bmatrix} \tag{6}
$$
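As a quick sanity check of formula \((5)\) (a supplementary sketch, not part of the original text), we can build a \(5 \times 5\) Hilbert matrix directly from the definition and compare it with SciPy's built-in constructor:

```python
import numpy as np
from scipy.linalg import hilbert

# Build a 5x5 Hilbert matrix directly from formula (5): H[i, j] = 1 / (i + j - 1)
n = 5
H = np.array([[1.0 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])

print(np.allclose(H, hilbert(n)))  # True: matches SciPy's built-in constructor
```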
The reason for introducing the Hilbert matrix here is that we will use its values to conduct a linear fitting experiment with the least squares method. The columns of a Hilbert matrix are strongly linearly correlated, which is exactly the condition, mentioned above, under which \(\left | X^{T}X \right |\) approaches 0.
Here, we use the `hilbert()` method provided by SciPy to directly create a \(10 \times 10\) Hilbert matrix:
from scipy.linalg import hilbert
x = hilbert(10)
x
array([[1. , 0.5 , 0.33333333, 0.25 , 0.2 ,
0.16666667, 0.14285714, 0.125 , 0.11111111, 0.1 ],
[0.5 , 0.33333333, 0.25 , 0.2 , 0.16666667,
0.14285714, 0.125 , 0.11111111, 0.1 , 0.09090909],
[0.33333333, 0.25 , 0.2 , 0.16666667, 0.14285714,
0.125 , 0.11111111, 0.1 , 0.09090909, 0.08333333],
[0.25 , 0.2 , 0.16666667, 0.14285714, 0.125 ,
0.11111111, 0.1 , 0.09090909, 0.08333333, 0.07692308],
[0.2 , 0.16666667, 0.14285714, 0.125 , 0.11111111,
0.1 , 0.09090909, 0.08333333, 0.07692308, 0.07142857],
[0.16666667, 0.14285714, 0.125 , 0.11111111, 0.1 ,
0.09090909, 0.08333333, 0.07692308, 0.07142857, 0.06666667],
[0.14285714, 0.125 , 0.11111111, 0.1 , 0.09090909,
0.08333333, 0.07692308, 0.07142857, 0.06666667, 0.0625 ],
[0.125 , 0.11111111, 0.1 , 0.09090909, 0.08333333,
0.07692308, 0.07142857, 0.06666667, 0.0625 , 0.05882353],
[0.11111111, 0.1 , 0.09090909, 0.08333333, 0.07692308,
0.07142857, 0.06666667, 0.0625 , 0.05882353, 0.05555556],
[0.1 , 0.09090909, 0.08333333, 0.07692308, 0.07142857,
0.06666667, 0.0625 , 0.05882353, 0.05555556, 0.05263158]])
import numpy as np
np.linalg.det(np.matrix(x).T * np.matrix(x))  # determinant of X^T X
3.882744733261777e-92
It can be seen that the value of \(\left | X^{T}X \right |\) here is indeed extremely close to 0.
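Another way to see the same problem (a supplementary check, not part of the original experiment, reusing the `x` array defined above) is the condition number of \(X^{T}X\):

```python
# Condition number of X^T X: the larger it is, the more numerically
# unreliable the inversion in formula (4) becomes.
print(np.linalg.cond(x.T @ x))
```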
7.6. Pearson Correlation Coefficient#
The Pearson correlation coefficient is usually used to measure the degree of linear correlation between two variables \(X\) and \(Y\); its value ranges from -1 to 1. The closer the value is to 1, the stronger the positive correlation; the closer it is to 0, the weaker the linear correlation; and the closer it is to -1, the stronger the negative correlation.
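For reference, here is a hand-rolled check of the definition (a sketch, not part of the original text, reusing the Hilbert matrix `x` from above): the Pearson coefficient of two vectors is their covariance divided by the product of their standard deviations.

```python
# Pearson correlation of the first two Hilbert matrix columns, computed from the definition:
# covariance divided by the product of the standard deviations.
a, b = x[:, 0], x[:, 1]
r = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
print(r)  # matches np.corrcoef(a, b)[0, 1] and the pandas value below (0.985344)
```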
Pandas provides a method for computing correlation coefficients directly, which we use to calculate the pairwise correlations between the columns of the \(10 \times 10\) Hilbert matrix above.
import pandas as pd
pd.DataFrame(x, columns=["x%d" % i for i in range(1, 11)]).corr()
|  | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x1 | 1.000000 | 0.985344 | 0.965392 | 0.948277 | 0.934230 | 0.922665 | 0.913025 | 0.904883 | 0.897921 | 0.891902 |
| x2 | 0.985344 | 1.000000 | 0.995632 | 0.988183 | 0.980720 | 0.973927 | 0.967905 | 0.962598 | 0.957918 | 0.953774 |
| x3 | 0.965392 | 0.995632 | 1.000000 | 0.998160 | 0.994616 | 0.990719 | 0.986928 | 0.983393 | 0.980155 | 0.977207 |
| x4 | 0.948277 | 0.988183 | 0.998160 | 1.000000 | 0.999065 | 0.997120 | 0.994845 | 0.992525 | 0.990281 | 0.988163 |
| x5 | 0.934230 | 0.980720 | 0.994616 | 0.999065 | 1.000000 | 0.999465 | 0.998294 | 0.996860 | 0.995346 | 0.993839 |
| x6 | 0.922665 | 0.973927 | 0.990719 | 0.997120 | 0.999465 | 1.000000 | 0.999669 | 0.998914 | 0.997959 | 0.996922 |
| x7 | 0.913025 | 0.967905 | 0.986928 | 0.994845 | 0.998294 | 0.999669 | 1.000000 | 0.999782 | 0.999271 | 0.998608 |
| x8 | 0.904883 | 0.962598 | 0.983393 | 0.992525 | 0.996860 | 0.998914 | 0.999782 | 1.000000 | 0.999850 | 0.999491 |
| x9 | 0.897921 | 0.957918 | 0.980155 | 0.990281 | 0.995346 | 0.997959 | 0.999271 | 0.999850 | 1.000000 | 0.999893 |
| x10 | 0.891902 | 0.953774 | 0.977207 | 0.988163 | 0.993839 | 0.996922 | 0.998608 | 0.999491 | 0.999893 | 1.000000 |
As shown above, there is a high numerical correlation between each column of data in the \(10 \times 10\) Hilbert matrix.
7.7. Least Squares Linear Fitting#
For the example dataset corresponding to the Hilbert matrix above, assume that \(y\) is a linear combination of the known features \(x_{1}\) to \(x_{10}\):

$$
y = w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{10}x_{10} \tag{7}
$$
Next, we first use NumPy to randomly generate the \(w\) parameters and obtain the actual corresponding \(y\) values.
from scipy.optimize import leastsq
x = hilbert(10)  # generate a 10x10 Hilbert matrix
np.random.seed(10)  # a fixed random seed guarantees the same random numbers every run
w = np.random.randint(2, 10, 10)  # randomly generate the w coefficients
y_temp = np.matrix(x) * np.matrix(w).T  # compute the y values
y = np.array(y_temp.T)[0]  # convert y into a 1-D row vector
print("Actual parameters w: ", w)
print("Actual function values y: ", y)
Actual parameters w:  [3 7 6 9 2 3 5 6 3 7]
Actual function values y:  [14.14761905 10.1232684 8.12233045 6.8529637 5.95634643 5.28188478
 4.75274309 4.32480306 3.97061256 3.67205737]
Next, we will use the least squares method learned in the previous courses to perform linear fitting on the dataset and find the fitted parameters.
func = lambda p, x: np.dot(x, p)  # model function
err_func = lambda p, x, y: func(p, x) - y  # residual function
p_init = np.random.randint(1, 2, 10)  # initialize all parameters to 1
parameters = leastsq(err_func, p_init, args=(x, y))  # solve with least squares
print("Fitted parameters w: ", parameters[0])
Fitted parameters w:  [ 3.00002433 6.9974538 6.04179367 8.81766306 1.7719309 6.74596479
 -6.17014658 21.44118582 -7.40951369 9.76387767]
You will find that the gap between the actual parameters \(w\) and the fitted parameters \(w\) is very large. This verifies the limitation of the ordinary least squares method discussed above.
To sum up, the limitations of the ordinary least squares method mean that it cannot be used directly for linear regression fitting in many cases, especially in the following two situations:

- The number of columns (features) in the dataset is greater than the number of rows (samples), i.e., \(X\) does not have full column rank.
- There is a strong linear correlation between the columns (features) of the dataset, i.e., the model is prone to overfitting.
7.8. Ridge Regression Derivation#
To solve the problems arising in the above two situations, ridge regression was introduced. Ridge regression can be regarded as a modified least squares estimator. By adding an \(L_{2}\) regularization term (2-norm) to the loss function, it effectively prevents the model from overfitting and helps solve the inversion problem when \(X\) is not of full rank, thereby improving the interpretability of the model. That is, the loss function corresponding to formula \((2)\) above becomes:

$$
F_{Ridge} = \sum_{i=1}^{n}\left ( y_{i}-w^{T}x_{i} \right )^{2} + \lambda \sum_{j=1}^{m}w_{j}^{2} \tag{8}
$$
We can rewrite equation \((8)\) in vector form:

$$
F_{Ridge} = \left \| y-Xw \right \|_{2}^{2} + \lambda \left \| w \right \|_{2}^{2} \tag{9}
$$
The analytical solution for the regression coefficients \(w\) that minimize formula \((9)\) is:

$$
\hat{w}_{Ridge} = \left ( X^{T}X + \lambda I \right )^{-1}X^{T}y \tag{10}
$$
Comparing formula \((4)\) with formula \((10)\), we can see that adding \(\lambda I\) (a scaled identity matrix) to \(X^{T}X\) makes the matrix full rank and therefore invertible, which remedies the deficiency of ordinary least squares.
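Before turning to scikit-learn, here is a minimal sketch (not part of the original text) of formula \((10)\) applied to the Hilbert dataset generated above; it assumes the variables `x` and `y` from the previous cells are still in scope, and `lam` is an illustrative choice of \(\lambda\).

```python
import numpy as np

# Ridge closed-form solution from formula (10): w = (X^T X + lambda * I)^{-1} X^T y
lam = 1.0  # regularization strength, i.e. lambda
w_ridge = np.linalg.inv(x.T @ x + lam * np.eye(x.shape[1])) @ x.T @ y
print(w_ridge)  # should agree with Ridge(alpha=1.0, fit_intercept=False) used below
```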
7.9. Ridge Regression Fitting#
Next, we use the ridge regression class `Ridge()` provided by scikit-learn to fit the Hilbert matrix dataset prepared in the least squares experiment above.
sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
- `alpha`: Regularization strength, default is 1.0, corresponding to \(\lambda\) in formula \((8)\).
- `fit_intercept`: Default is True, calculates the intercept term.
- `normalize`: Default is False, does not standardize the data.
- `copy_X`: Default is True, i.e., operates on a copy of the data to avoid modifying the original data.
- `max_iter`: Maximum number of iterations, default is None.
- `tol`: Precision (tolerance) of the solution.
- `solver`: Solver to use; 'auto' selects one automatically according to the data type.
- `random_state`: Seed for the random number generator.
from sklearn.linear_model import Ridge
ridge_model = Ridge(fit_intercept=False)  # do not add an intercept term
ridge_model.fit(x, y)
Ridge(fit_intercept=False)
ridge_model.coef_  # print the model coefficients
array([6.3497497 , 4.32792068, 3.40228975, 2.83692059, 2.44632895,
2.15683095, 1.93213214, 1.75189668, 1.60369747, 1.47944808])
It can be seen that although there is still a gap from the true model parameters [3, 7, 6, 9, 2, 3, 5, 6, 3, 7], the result is already much better than the parameters fitted by ordinary least squares.
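To put a number on "much better", here is a supplementary check (not part of the original text) that reuses the `w`, `parameters`, and `ridge_model` variables defined above:

```python
import numpy as np

# Euclidean distance between the true coefficients w and each set of fitted coefficients
print("leastsq error:", np.linalg.norm(parameters[0] - w))
print("ridge error:  ", np.linalg.norm(ridge_model.coef_ - w))
```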
Note
If you are careful enough, you will find that when using `sklearn.linear_model.LinearRegression()` to build a linear regression model and solve for the parameters here, the obtained parameters are close to those fitted by ridge regression, but quite different from those obtained by the least squares method above. Why is that?
In fact, this shows that the linear regression class `LinearRegression()` provided by scikit-learn is not a naive implementation of the normal equation; internally it relies on a more numerically stable least-squares solver, which is why its results differ from those of the `scipy.optimize.leastsq` fit above. Scikit-learn is a very mature library, and each class supports a large number of parameters and options. You can learn about them in detail through the official documentation.
The `alpha` parameter of the `Ridge()` model represents the regularization strength. Next, we will try adjusting this parameter to obtain different fitting results.
"""不同 alpha 参数拟合
"""
alphas = np.linspace(1, 10, 20)
coefs = []
for a in alphas:
ridge = Ridge(alpha=a, fit_intercept=False)
ridge.fit(x, y)
coefs.append(ridge.coef_)
"""绘制不同 alpha 参数结果
"""
from matplotlib import pyplot as plt
%matplotlib inline
plt.plot(alphas, coefs) # 绘制不同 alpha 参数下的 w 拟合值
plt.scatter(np.linspace(1, 0, 10), parameters[0]) # 普通最小二乘法拟合的 w 值放入图中
plt.xlabel("alpha")
plt.ylabel("w")
plt.title("Ridge Regression")
As can be seen from the figure, when `alpha` is larger, the regularization term dominates the optimization and every \(w\) coefficient shrinks towards 0; when `alpha` is very small, the \(w\) coefficients fluctuate over a much wider range.
7.10. LASSO Regression#
When we use ordinary least squares for regression fitting, a strong correlation among the feature variables may cause some \(w\) coefficients to become very large while others become very small negative numbers. That is why, above, we added an \(L_{2}\) regularization term via ridge regression to alleviate this problem.
Similar to ridge regression, LASSO regression also improves ordinary least squares by adding a regularization term; here, however, the added term is the \(L_{1}\) regularization term. That is:

$$
F_{LASSO} = \left \| y-Xw \right \|_{2}^{2} + \lambda \left \| w \right \|_{1} \tag{11}
$$
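Unlike ridge regression, the \(L_{1}\) term is not differentiable at zero, so LASSO has no closed-form solution and is usually solved iteratively, for example by coordinate descent with soft-thresholding (scikit-learn's `Lasso` also uses coordinate descent). The following is a minimal, illustrative sketch of that idea, not the exact algorithm used by scikit-learn; the function names are our own, and the usage line reuses the `x` and `y` arrays generated above.

```python
import numpy as np

def soft_threshold(rho, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    if rho > t:
        return rho - t
    elif rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by cyclic coordinate descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ w + w[j] * X[:, j]   # residual excluding feature j's contribution
            rho_j = X[:, j] @ r_j              # correlation of feature j with that residual
            z_j = X[:, j] @ X[:, j]            # squared norm of feature j
            w[j] = soft_threshold(rho_j, lam / 2.0) / z_j
    return w

# Reusing the Hilbert data from above; with a moderate lam, the L1 penalty
# typically drives several coefficients exactly to 0.
print(lasso_cd(x, y, lam=1.0))
```

Note that scikit-learn's `Lasso` scales the squared-error term by \(1/(2n)\), so its `alpha` is not directly comparable to the \(\lambda\) used in this sketch.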
7.11. LASSO Regression Fitting#
Similarly, we use the `sklearn.linear_model.Lasso()` regression class provided by scikit-learn to fit the Hilbert matrix data from the earlier section. This time, instead of writing the steps out one by one as in the ridge regression part, everything is placed in a single cell.
sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
- `alpha`: Regularization strength, default is 1.0.
- `fit_intercept`: Default is True, calculates the intercept term.
- `normalize`: Default is False, does not standardize the data.
- `precompute`: Whether to use a precomputed Gram matrix to speed up calculations.
- `copy_X`: Default is True, i.e., operates on a copy of the data to avoid modifying the original data.
- `max_iter`: Maximum number of iterations, default is 1000.
- `tol`: Precision (tolerance) of the solution.
- `warm_start`: Reuses the solution of the previous call as initialization.
- `positive`: Forces the coefficients to be positive.
- `random_state`: Seed for the random number generator.
- `selection`: If set to 'random', a random coefficient is updated at each iteration instead of cycling through the features in order; default is 'cyclic'.
"""使用 LASSO 回归拟合并绘图
"""
from sklearn.linear_model import Lasso
alphas = np.linspace(1, 10, 10)
lasso_coefs = []
for a in alphas:
lasso = Lasso(alpha=a, fit_intercept=False)
lasso.fit(x, y)
lasso_coefs.append(lasso.coef_)
plt.plot(alphas, lasso_coefs) # 绘制不同 alpha 参数下的 w 拟合值
plt.scatter(np.linspace(1, 0, 10), parameters[0]) # 普通最小二乘法拟合的 w 值放入图中
plt.xlabel("alpha")
plt.ylabel("w")
plt.title("Lasso Regression")
As can be seen from the figure, when `alpha` is larger, the regularization term dominates the optimization and every \(w\) coefficient shrinks towards 0; when `alpha` is very small, the \(w\) coefficients fluctuate over a much wider range.
7.12. Summary#
In this experiment, we explored two regression methods that introduce regularization: ridge regression and LASSO regression. The two share many similarities but also differ in important ways. This experiment involves a fair number of mathematical formulas and derivations, so some knowledge of linear algebra is required.