
7. Ridge Regression and LASSO Regression Implementation#

7.1. Introduction#

In regression prediction, besides the linear regression and polynomial regression introduced earlier, there are many other regression analysis methods, such as ridge regression, LASSO regression, and various regression methods derived from the classification methods covered later in the course. In this experiment, we first introduce ridge regression and LASSO regression. Both methods apply regularization techniques and can solve some problems that plain linear regression cannot handle.

7.2. Key Points#

  • Limitations of Ordinary Least Squares

  • OLS Linear Fitting of Hilbert Matrix

  • Ridge Regression

  • LASSO Regression

7.3. Limitations of Ordinary Least Squares#

In the previous experiments, we learned to use the least squares method to solve linear regression and polynomial regression problems. At the same time, in the chapter on polynomial regression, we learned that polynomial regression can actually be regarded as a special form of linear regression, that is to say, the core of the regression method is linear regression.

7.4. Matrix Representation of Ordinary Least Squares#

First, let’s review the content of linear regression. In linear regression, we need to solve:

\[ f(x)=\sum_{i=1}^{m}w_i x_i =w^Tx \tag{1}\]

When we use Ordinary Least Squares (OLS) for learning (with the squared loss function), it becomes minimizing the objective function:

\[F=\sum_{i=1}^{n}(y_{i}-w^Tx_{i})^2 \tag{2}\]

We can rewrite equation \((2)\) in vector form:

\[F={\left \| y-Xw \right \|_{2}}^{2} \tag{3}\]

\(\left \| y-Xw \right \|_{2}\) represents the 2-norm, which is the square root of the sum of the squares of the absolute values of the vector elements.

In equation \((3)\), \(X\) is the matrix formed by stacking the data points \((x_{1},x_{2},\dots,x_{n})^{T}\), and \(y\) is the column vector \((y_{1},y_{2},\dots,y_{n})^{T}\). The analytical solution of formula \((3)\) is:

\[\hat{w} = (X^{T}X)^{-1}X^{T}y \tag{4}\]

The condition for formula \((4)\) to hold is that \(X^{T}X\) must be invertible, that is, \(\left | X^{T}X \right |\) cannot be 0. When the features are strongly correlated (multicollinearity), or when the number of features \(m\) is greater than the number of samples \(n\), \(X\) is not of full column rank (\(\operatorname{rank}(X) < m\)). As a result, \(\left | X^{T}X \right |\) equals or approaches 0, and the fitted parameters become numerically unstable. This is the limitation of the ordinary least squares method.
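As a quick illustration (a minimal sketch, not part of the original experiment; the array names below are hypothetical), formula \((4)\) can be evaluated directly with NumPy on a small, well-conditioned design matrix:

import numpy as np

# Hypothetical toy data: 4 samples, 2 linearly independent features
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
w_true = np.array([2.0, -1.0])
y_demo = X_demo @ w_true

# Closed-form OLS solution from formula (4): w = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo
print(w_hat)  # recovers [2., -1.] because X^T X is well conditioned here

On ill-conditioned data, such as the Hilbert matrix introduced next, this inversion becomes numerically unreliable.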

7.5. Hilbert Matrix OLS Linear Fitting#

In linear algebra, a Hilbert matrix is a square matrix whose entries are all unit fractions. Specifically, the entry in the \(i\)-th row and \(j\)-th column of a Hilbert matrix \(H\) is:

\[H_{{ij}}={\frac {1}{i+j-1}} \tag{5}\]

A \(5 \times 5\) Hilbert matrix is shown in the following formula \((6)\):

\[\begin{split} H_{5}=\left[ \begin{array}{ccccc}{1} & {\frac{1}{2}} & {\frac{1}{3}} & {\frac{1}{4}} & {\frac{1}{5}} \\ {\frac{1}{2}} & {\frac{1}{3}} & {\frac{1}{4}} & {\frac{1}{5}} & {\frac{1}{6}} \\ {\frac{1}{3}} & {\frac{1}{4}} & {\frac{1}{5}} & {\frac{1}{6}} & {\frac{1}{7}} \\ {\frac{1}{4}} & {\frac{1}{5}} & {\frac{1}{6}} & {\frac{1}{7}} & {\frac{1}{8}} \\ {\frac{1}{5}} & {\frac{1}{6}} & {\frac{1}{7}} & {\frac{1}{8}} & {\frac{1}{9}}\end{array}\right] \end{split}\]
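As a quick sanity check (an illustrative sketch, not part of the original text), formula \((5)\) can be written directly in NumPy and compared against the constructor that SciPy provides:

import numpy as np
from scipy.linalg import hilbert

n = 5
# Build H from formula (5): H[i, j] = 1 / (i + j - 1) with 1-based indices
i, j = np.indices((n, n)) + 1
H = 1.0 / (i + j - 1)

print(np.allclose(H, hilbert(n)))  # True: matches SciPy's built-in hilbert()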

The reason for introducing the Hilbert matrix here is that we will use its values as the data for a least squares linear fitting experiment. The columns of a Hilbert matrix are strongly linearly correlated, which is exactly the situation described above where \(\left | X^{T}X \right |\) approaches 0.

Here, we use the scipy.linalg.hilbert() function provided by SciPy to directly create a \(10 \times 10\) Hilbert matrix:

from scipy.linalg import hilbert

x = hilbert(10)
x
array([[1.        , 0.5       , 0.33333333, 0.25      , 0.2       ,
        0.16666667, 0.14285714, 0.125     , 0.11111111, 0.1       ],
       [0.5       , 0.33333333, 0.25      , 0.2       , 0.16666667,
        0.14285714, 0.125     , 0.11111111, 0.1       , 0.09090909],
       [0.33333333, 0.25      , 0.2       , 0.16666667, 0.14285714,
        0.125     , 0.11111111, 0.1       , 0.09090909, 0.08333333],
       [0.25      , 0.2       , 0.16666667, 0.14285714, 0.125     ,
        0.11111111, 0.1       , 0.09090909, 0.08333333, 0.07692308],
       [0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111,
        0.1       , 0.09090909, 0.08333333, 0.07692308, 0.07142857],
       [0.16666667, 0.14285714, 0.125     , 0.11111111, 0.1       ,
        0.09090909, 0.08333333, 0.07692308, 0.07142857, 0.06666667],
       [0.14285714, 0.125     , 0.11111111, 0.1       , 0.09090909,
        0.08333333, 0.07692308, 0.07142857, 0.06666667, 0.0625    ],
       [0.125     , 0.11111111, 0.1       , 0.09090909, 0.08333333,
        0.07692308, 0.07142857, 0.06666667, 0.0625    , 0.05882353],
       [0.11111111, 0.1       , 0.09090909, 0.08333333, 0.07692308,
        0.07142857, 0.06666667, 0.0625    , 0.05882353, 0.05555556],
       [0.1       , 0.09090909, 0.08333333, 0.07692308, 0.07142857,
        0.06666667, 0.0625    , 0.05882353, 0.05555556, 0.05263158]])
import numpy as np

np.linalg.det(np.matrix(x).T * np.matrix(x))
3.882744733261777e-92

It can be seen that \(\left | X^{T}X \right |\) here is indeed extremely close to 0.
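A complementary way to quantify this ill-conditioning (an extra check, not in the original text) is the condition number, which NumPy exposes as np.linalg.cond:

# A very large condition number means the normal equations in formula (4)
# amplify small numerical errors enormously.
print(np.linalg.cond(x))        # condition number of the 10x10 Hilbert matrix (huge)
print(np.linalg.cond(x.T @ x))  # roughly the square of the above, even worse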

7.6. Pearson Correlation Coefficient#

The Pearson Correlation Coefficient is usually used to measure the degree of linear correlation between two variables \(X\) and \(Y\); its value ranges from -1 to 1. The closer the value is to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; and the closer to 0, the weaker the linear correlation.
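To make the definition concrete (an illustrative sketch with made-up vectors), the coefficient can be computed from the covariance and the standard deviations, and checked against NumPy's np.corrcoef:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson correlation = covariance / (std of a * std of b)
r = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
print(r)                        # manual computation from the definition
print(np.corrcoef(a, b)[0, 1])  # NumPy built-in gives the same value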

Pandas provides the DataFrame.corr() method for computing correlation coefficients directly, so we can calculate the pairwise correlations between the columns of the \(10 \times 10\) Hilbert matrix above.

import pandas as pd

pd.DataFrame(x, columns=["x%d" % i for i in range(1, 11)]).corr()
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
x1 1.000000 0.985344 0.965392 0.948277 0.934230 0.922665 0.913025 0.904883 0.897921 0.891902
x2 0.985344 1.000000 0.995632 0.988183 0.980720 0.973927 0.967905 0.962598 0.957918 0.953774
x3 0.965392 0.995632 1.000000 0.998160 0.994616 0.990719 0.986928 0.983393 0.980155 0.977207
x4 0.948277 0.988183 0.998160 1.000000 0.999065 0.997120 0.994845 0.992525 0.990281 0.988163
x5 0.934230 0.980720 0.994616 0.999065 1.000000 0.999465 0.998294 0.996860 0.995346 0.993839
x6 0.922665 0.973927 0.990719 0.997120 0.999465 1.000000 0.999669 0.998914 0.997959 0.996922
x7 0.913025 0.967905 0.986928 0.994845 0.998294 0.999669 1.000000 0.999782 0.999271 0.998608
x8 0.904883 0.962598 0.983393 0.992525 0.996860 0.998914 0.999782 1.000000 0.999850 0.999491
x9 0.897921 0.957918 0.980155 0.990281 0.995346 0.997959 0.999271 0.999850 1.000000 0.999893
x10 0.891902 0.953774 0.977207 0.988163 0.993839 0.996922 0.998608 0.999491 0.999893 1.000000

As shown above, there is a high numerical correlation between each column of data in the \(10 \times 10\) Hilbert matrix.

7.7. Least Squares Linear Fitting#

For the example dataset built from the Hilbert matrix above, assume that the target \(y\) is a linear combination of the known features \(x_{1}\) to \(x_{10}\):

\[ y = w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{10}x_{10}\tag{7}\]

Next, we first use NumPy to randomly generate the \(w\) parameters and obtain the actual corresponding \(y\) values.

from scipy.optimize import leastsq

x = hilbert(10)  # generate the 10x10 Hilbert matrix
np.random.seed(10)  # fix the random seed so the same random numbers are generated every run
w = np.random.randint(2, 10, 10)  # randomly generate the w coefficients
y_temp = np.matrix(x) * np.matrix(w).T  # compute the y values
y = np.array(y_temp.T)[0]  # convert y into a 1-D row vector

print("True parameters w: ", w)
print("True function values y: ", y)
True parameters w:  [3 7 6 9 2 3 5 6 3 7]
True function values y:  [14.14761905 10.1232684   8.12233045  6.8529637   5.95634643  5.28188478
  4.75274309  4.32480306  3.97061256  3.67205737]

Next, we will use the least squares method learned in the previous courses to perform linear fitting on the dataset and find the fitted parameters.

func = lambda p, x: np.dot(x, p)  # model function
err_func = lambda p, x, y: func(p, x) - y  # residual function
p_init = np.random.randint(1, 2, 10)  # initialize all parameters to 1

parameters = leastsq(err_func, p_init, args=(x, y))  # solve with ordinary least squares
print("Fitted parameters w: ", parameters[0])
Fitted parameters w:  [ 3.00002433  6.9974538   6.04179367  8.81766306  1.7719309   6.74596479
 -6.17014658 21.44118582 -7.40951369  9.76387767]

You will find that the gap between the true parameters \(w\) and the fitted parameters is very large. This confirms the limitations of the ordinary least squares method discussed above.

To sum up, these limitations mean that ordinary least squares often cannot be used directly for linear regression fitting, especially in the following two situations (a small demonstration follows the list):

  • The number of columns (features) in the dataset is greater than the number of rows (samples), so \(X\) is not of full column rank.

  • There is strong linear correlation (multicollinearity) between the feature columns of the dataset, which makes the fitted coefficients unstable and the model prone to overfitting.
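The first situation is easy to demonstrate numerically (a minimal sketch with randomly generated data, not part of the original experiment):

import numpy as np

rng = np.random.default_rng(0)
X_wide = rng.normal(size=(5, 8))  # 5 samples but 8 features: more columns than rows

gram = X_wide.T @ X_wide            # the 8x8 matrix X^T X from formula (4)
print(np.linalg.matrix_rank(gram))  # at most 5 < 8, so X^T X is singular
print(np.linalg.det(gram))          # determinant is (numerically) zero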

7.8. Ridge Regression Derivation#

To solve the problems arising in these two situations, ridge regression was introduced. Ridge regression can be regarded as a modified least squares estimator. By adding an \(L_{2}\) regularization term (the squared 2-norm of \(w\)) to the loss function, it effectively prevents the model from overfitting, makes the matrix inversion well defined even when \(X\) is not of full column rank, and thereby improves the interpretability of the model. The loss function corresponding to formula (2) above becomes:

\[F_{Ridge}=\sum_{i=1}^{n}(y_{i}-w^Tx_{i})^2 + \lambda \sum_{j=1}^{m}w_{j}^2 \tag{8}\]

We can rewrite equation (8) in vector form:

\[ F_{\text {Ridge}}=\|y-X w\|_{2}^{2}+\lambda\|w\|_{2}^{2} \tag{9} \]

The analytical solution of the regression coefficient \(w\) in formula (9) is:

\[\hat w_{Ridge} = (X^TX + \lambda I)^{-1} X^Ty \tag{10}\]

Comparing formula \((4)\) with formula \((10)\), ridge regression adds \(\lambda I\), a scaled identity matrix, to \(X^TX\). This makes the matrix invertible (full rank), thereby remedying the deficiency of ordinary least squares.
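Formula \((10)\) can be applied directly to the Hilbert data prepared earlier; below is a minimal sketch (assuming x, y, and w from the previous cells are still in scope, and choosing \(\lambda = 1\) arbitrarily):

# Ridge closed-form solution from formula (10): w = (X^T X + lambda*I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.inv(x.T @ x + lam * np.eye(10)) @ x.T @ y

print("true w: ", w)
print("ridge w:", w_ridge)  # noticeably more stable than the leastsq fit above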

7.9. Ridge Regression Fitting#

Next, we use the ridge regression class Ridge() provided by scikit-learn to fit the Hilbert-matrix data prepared above.

sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
  • alpha: Regularization strength, default is 1.0, corresponding to \(\lambda\) in formula (8).

  • fit_intercept: Default is True, calculates the intercept term.

  • normalize: Default is False, does not standardize the data.

  • copy_X: Default is True, that is, operates on a copy of the data to prevent affecting the original data.

  • max_iter: Maximum number of iterations, default is None.

  • tol: Precision (tolerance) of the solution.

  • solver: Solver to use; the default 'auto' chooses one automatically based on the data.

  • random_state: Seed of the random number generator (used by some solvers).

from sklearn.linear_model import Ridge

ridge_model = Ridge(fit_intercept=False)  # do not fit an intercept term
ridge_model.fit(x, y)
Ridge(fit_intercept=False)
ridge_model.coef_  # print the model coefficients
array([6.3497497 , 4.32792068, 3.40228975, 2.83692059, 2.44632895,
       2.15683095, 1.93213214, 1.75189668, 1.60369747, 1.47944808])

It can be seen that although there is a gap from the true model parameters [3, 7, 6, 9, 2, 3, 5, 6, 3, 7], it is already much better than the parameters fitted by ordinary least squares.

Note

If you are careful enough to fit the same data with sklearn.linear_model.LinearRegression(), you will find that the resulting parameters are close to those fitted by ridge regression, yet quite different from the leastsq result above. Why is that?

In fact, this shows that the LinearRegression() class provided by scikit-learn does not naively solve the normal equations of ordinary least squares. Internally it relies on a more numerically stable least-squares solver (an SVD-based routine), which behaves far better on ill-conditioned data, so its results differ from the plain leastsq fit above. Scikit-learn is a very mature module, and each class supports a large number of parameters and options; you can learn about them in detail through the official documentation.
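You can verify this yourself with a short cell (a sketch, assuming x and y from the earlier cells are still in scope):

from sklearn.linear_model import LinearRegression

lr_model = LinearRegression(fit_intercept=False)  # same setup as the ridge fit: no intercept
lr_model.fit(x, y)
print(lr_model.coef_)  # compare with the true w, the leastsq fit, and the ridge fit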

The alpha parameter in the Ridge() model represents the regularization strength. Next, we will try to adjust this parameter to obtain different fitting results.

"""不同 alpha 参数拟合
"""
alphas = np.linspace(1, 10, 20)

coefs = []
for a in alphas:
    ridge = Ridge(alpha=a, fit_intercept=False)
    ridge.fit(x, y)
    coefs.append(ridge.coef_)
"""绘制不同 alpha 参数结果
"""
from matplotlib import pyplot as plt

%matplotlib inline

plt.plot(alphas, coefs)  # plot the fitted w values for each alpha
plt.scatter(np.linspace(1, 0, 10), parameters[0])  # add the w values fitted by ordinary least squares to the plot
plt.xlabel("alpha")
plt.ylabel("w")
plt.title("Ridge Regression")
Text(0.5, 1.0, 'Ridge Regression')
[Figure: Ridge Regression coefficient paths, fitted w values plotted against alpha, with the OLS-fitted w values shown as scatter points]

As can be seen from the figure, when alpha is larger, the regularization term dominates the fitting process and every \(w\) coefficient shrinks toward 0; when alpha is very small, the coefficients fluctuate over a much wider range.
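In practice, alpha is usually not picked by eye but selected via cross-validation. A hedged sketch using scikit-learn's RidgeCV (the candidate grid below is arbitrary):

from sklearn.linear_model import RidgeCV

# RidgeCV evaluates each candidate alpha with (generalized) cross-validation
# and keeps the best one.
ridge_cv = RidgeCV(alphas=np.logspace(-4, 1, 20), fit_intercept=False)
ridge_cv.fit(x, y)
print(ridge_cv.alpha_)  # selected regularization strength
print(ridge_cv.coef_)   # coefficients fitted with that alpha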

7.10. LASSO Regression#

When we use ordinary least squares for regression fitting, strong correlation among the feature variables may cause some \(w\) coefficients to become very large while others take large negative values. In the sections above we added an \(L_{2}\) regularization term via ridge regression to mitigate this problem.

Similar to ridge regression, LASSO regression also improves ordinary least squares by adding a regularization term. However, the regularization term added here is the \(L_{1}\) regularization term. That is:

\[ F_{LASSO}=\|y-X w\|_{2}^{2}+\lambda\|w\|_{1} \tag{11} \]
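The practical consequence of the \(L_{1}\) term is sparsity: coefficients can be shrunk exactly to zero rather than merely made small. In the simplest setting (a single coefficient, or an orthonormal design), the LASSO solution is the least squares estimate passed through a soft-thresholding operator whose threshold is proportional to \(\lambda\). A minimal illustrative sketch (not scikit-learn code, just the operator itself):

import numpy as np

def soft_threshold(z, lam):
    # Shrink z toward 0 and clip it to exactly 0 once |z| <= lam;
    # this is what makes LASSO coefficients sparse.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.5]), lam=1.0))
# small values are set exactly to zero, large values are shrunk by lam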

7.11. LASSO Regression Fitting#

Similarly, we use the sklearn.linear_model.Lasso() class provided by scikit-learn to fit the same Hilbert-matrix data. This time, instead of going step by step as in the ridge regression section, everything is written in a single cell.

sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
  • alpha: Regularization strength, default is 1.0.

  • fit_intercept: Default is True, calculates the intercept term.

  • normalize: Default is False, does not standardize the data.

  • precompute: Whether to use precomputed Gram matrix to accelerate calculations.

  • copy_X: Default is True, i.e., operates on a copy of the data to prevent affecting the original data.

  • max_iter: Maximum number of iterations, default is 1000.

  • tol: Precision (tolerance) of the solution.

  • warm_start: Reuses the solution from the previous call for initialization.

  • positive: Forces coefficients to be positive.

  • random_state: Random number generator.

  • selection: Order of coefficient updates; the default 'cyclic' loops over the features in order, while 'random' updates a random coefficient at each iteration.

"""使用 LASSO 回归拟合并绘图
"""
from sklearn.linear_model import Lasso

alphas = np.linspace(1, 10, 10)
lasso_coefs = []

for a in alphas:
    lasso = Lasso(alpha=a, fit_intercept=False)
    lasso.fit(x, y)
    lasso_coefs.append(lasso.coef_)

plt.plot(alphas, lasso_coefs)  # plot the fitted w values for each alpha
plt.scatter(np.linspace(1, 0, 10), parameters[0])  # add the w values fitted by ordinary least squares to the plot
plt.xlabel("alpha")
plt.ylabel("w")
plt.title("Lasso Regression")
Text(0.5, 1.0, 'Lasso Regression')
[Figure: Lasso Regression coefficient paths, fitted w values plotted against alpha, with the OLS-fitted w values shown as scatter points]

As can be seen from the figure, larger alpha values again let the regularization term dominate, pushing the \(w\) coefficients toward 0. Because LASSO uses an \(L_{1}\) penalty, many coefficients are driven exactly to zero rather than merely shrunk, which is what makes LASSO useful for feature selection. When alpha is very small, the coefficients again fluctuate over a wider range.
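To see the sparsity numerically (a small extra check, assuming alphas and lasso_coefs from the cell above), you can count how many coefficients are exactly zero at each alpha:

# Count the coefficients that LASSO drove exactly to zero for each alpha
for a, coef in zip(alphas, lasso_coefs):
    print("alpha = %.1f, zero coefficients: %d" % (a, np.sum(coef == 0)))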

7.12. Summary#

In this experiment, we worked with two regression methods that introduce regularization: ridge regression and LASSO regression. The two have much in common, but there are also important differences. This experiment involves a fair number of mathematical formulas and derivations, so some background in linear algebra is required.
