Pro Membership

Pro Membership is a membership plan created by the author to maintain and update this tutorial. You can get more benefits and services, click to view details.

2. Linear Regression Implementation and Application#

2.1. Introduction#

Linear regression is a relatively simple but extremely important machine learning method. Mastering the principles and solution methods of linear regression is a basic requirement for in-depth understanding of linear regression. In addition, linear regression is also the cornerstone of the supervised learning regression part, and I hope you can eventually master some of the important ideas of machine learning.

2.2. Key Points#

Linear Regression
Squared Loss Function
Least Squares Method and Algebraic Solution
Matrix Solution of Least Squares Method
Comprehensive Case of Linear Regression

2.3. Introduction to Linear Regression#

Previously, we understood the difference between classification and regression problems. That is, regression problems aim to achieve prediction of continuous values, such as stock prices, trends in housing prices, and so on. For example, the figure below shows the corresponding relationship between house area and price.

https://cdn.aibydoing.com/aibydoing/images/1625990611676.svg

As shown in the figure above, different house areas correspond to different prices. Now, let’s assume that I have a house that I want to sell, and I need to appraise the house in advance before selling it. So, I want to use the information in the figure, which is the selling prices of other houses, to judge the value of my property. How should I do this?

The method I used is as follows. As shown in the figure below, I first drew a red straight line to roughly verify the trend of the distribution of the data points. Then I mapped the known house sizes to the red line, which corresponds to the location of the blue points. Finally, I found the price corresponding to the blue points as the final estimated value of the house.

https://cdn.aibydoing.com/aibydoing/images/1625990738227.svg

In the process presented in the above figure, the process of finding a straight line to fit the distribution trend of the data points is the process of linear regression. And the “linear” in linear regression refers to the linear relationship, which is the red straight line plotted in the figure.

At this moment, you may have a question in your mind. How was the red solid line in the above image drawn? Why was it not possible to choose the other two green dashed lines like in the image below, but instead a red solid line had to be selected?

https://cdn.aibydoing.com/aibydoing/images/1625990779549.svg

The green dashed line can also reflect the distribution trend of data points. So, finding the most suitable red straight line becomes the target problem to be solved in linear regression.

Through the above small example, I believe you have a little bit of an impression of linear regression, at least a rough understanding of what it can do. In the following content, we will learn the mathematical principles behind linear regression, as well as its implementation using Python code.

2.4. Linear Regression with One Variable#

In the introduction to linear regression above, we presented an example of the relationship between house area and house price. In this, the house area is the independent variable and the house price is the dependent variable. Additionally, the linear fitting process with only 1 independent variable is called simple linear regression.

Below, we will generate a set of example data on the changes in house area and house prices. \(x\) represents the house area, with the unit being square meters; \(y\) represents the house price, with the unit being 10,000 yuan.

                          import warnings

# Reduce unnecessary reminders during code execution
warnings.filterwarnings("ignore")

                          import numpy as np

x = np.array([56, 72, 69, 88, 102, 86, 76, 79, 94, 74])
y = np.array([92, 102, 86, 110, 130, 99, 96, 102, 105, 92])

The sample data consists of 10 sets of corresponding housing area and price. Next, use Matplotlib to plot the data points, where \(x\) and \(y\) correspond to the x-coordinate and y-coordinate respectively.

                          from matplotlib import pyplot as plt

%matplotlib inline

plt.scatter(x, y)
plt.xlabel("Area")
plt.ylabel("Price")

Text(0, 0.5, 'Price')

../_images/bdedba4c98e347420f93304b12a02a48802b7c1f4823add7f9287274fa782bc8.png

As mentioned above, linear regression is to fit data points through a linear equation. Then, we can let the expression of this 1st degree function be:

\[ y(x, w) = w_0 + w_1x \tag{1} \]

Formula \((1)\) is a typical expression of a linear function with one variable, and we can obtain different fitted lines by combining different values of \(w_0\) and \(w_1\).

Next, implement the code for formula \((1)\):

                          def f(x: list, w0: float, w1: float):
    """Linear function expression"""
    y = w0 + w1 * x
    return y

                        

Which line best reflects the changing trend of the data?

To find the best-fitting straight line for a dataset, let’s explain further using the figure from the previous section. As shown in the image, when we use the function \(y(x, w) = w_0 + w_1x\) to fit the data, we can obtain the overall fitting error, which is represented by the total length of the blue line segments. If the error for a particular straight line is the smallest, does that mean this line best reflects the distribution trend of the data points?

https://cdn.aibydoing.com/aibydoing/images/1625990796196.svg

2.5. Square Loss Function#

The square loss function, also known as the least squares loss function, is a commonly used loss function in machine learning. It is defined as:

\(L(y, \hat{y}) = (y - \hat{y})^2\)

where \(y\) is the true label or target value, and \(\hat{y}\) is the predicted value. The goal is to minimize the average square of the difference between the true and predicted values.

The square loss function has several desirable properties:

It is a convex function, which means that it is easy to optimize using gradient-based methods.
It is differentiable, which allows for efficient gradient computation.
It penalizes large errors more heavily than small errors, which can be useful in many applications.

The square loss function is widely used in regression problems, where the goal is to predict a continuous target variable. It is also used in some classification problems, such as logistic regression, where the predicted output is a probability.

As mentioned above, if a data point is (\(x_{i}\), \(y_{i}\)), then its corresponding error would be:

\[y_{i}-(w_0 + w_1x_{i}) \tag{2}\]

The error above is often also called the “residual”. However, in machine learning, we prefer to call it the “loss”, which is the degree of deviation between the true value and the predicted value. Then, for \(n\) entire data points, the corresponding total residual loss is:

\[ \sum\limits_{i = 1}^n {{{(y_{i}-(w_0 + w_1x_{i}))}}} \tag{3} \]

Further, in linear regression, we generally use the sum of squared residuals to represent the error of all sample points. The formula is as follows:

\[ \sum\limits_{i = 1}^n {{{(y_{i}-(w_0 + w_1x_{i}))}}^2} \tag{4} \]

The benefit of using the sum of squared residuals is that it can ensure that the loss is always a positive accumulative number, without the problem of positive and negative residuals canceling each other out. For formula \((4)\), there is a specific term in machine learning, which is the “squared loss function”. In order to obtain the optimal values for the fitting parameters \(w_0\) and \(w_1\), our goal is to minimize the squared loss function corresponding to formula \((4)\).

Similarly, we can implement the formula \((4)\) in code:

                          def square_loss(x: np.ndarray, y: np.ndarray, w0: float, w1: float):
    """Square loss function"""
    loss = sum(np.square(y - (w0 + w1 * x)))
    return loss

                        

If the total loss of the samples fitted by a certain straight line is minimized, then this straight line is the final result to be obtained. And the process of solving the minimum loss value must use the following mathematical methods.

2.6. Least Squares Method Algebraic Solution#

The method of least squares is a commonly used approach for solving the parameters \(w\) of linear regression fitting. The “squares” in the method of least squares refers to the square, and the least squares means the minimum square. And here the square refers to the square loss function mentioned above.

To put it simply, the least squares method is a method to find the minimum value of the square loss function. So, how exactly do we solve this? This requires knowledge from advanced mathematics. The derivation is as follows:

First, the squared loss function is:

\[ f = \sum\limits_{i = 1}^n {{{(y_{i}-(w_0 + w_1x_{i}))}}^2} \tag{5} \]

Our goal is to find the \(w\) that corresponds to the minimum value of the squared loss function \(min(f)\). First, let’s find the first-order partial derivative of \(f\):

\[ \frac{\partial f}{\partial w_{0}}=-2(\sum_{i=1}^{n}{y_i}-nw_{0}-w_{1}\sum_{i=1}^{n}{x_i}) \tag{6a} \]

\[ \frac{\partial f}{\partial w_{1}}=-2(\sum_{i=1}^{n}{x_iy_i}-w_{0}\sum_{i=1}^{n}{x_i}-w_{1}\sum_{i=1}^{n}{x_i}^2) \tag{6b} \]

Then, we set \(\frac{\partial f}{\partial w_{0}}=0\) and \(\frac{\partial f}{\partial w_{1}}=0\), and solve to obtain:

\[ w_{1}=\frac {n\sum_{}^{}{x_iy_i}-\sum_{}^{}{x_i}\sum_{}^{}{y_i}} {n\sum_{}^{}{x_i}^2-(\sum_{}^{}{x_i})^2} \tag{7b} \]

\[ w_{0}=\frac {\sum_{}^{}{x_i}^2\sum_{}^{}{y_i}-\sum_{}^{}{x_i}\sum_{}^{}{x_iy_i}} {n\sum_{}^{}{x_i}^2-(\sum_{}^{}{x_i})^2} \tag{7b} \]

So far, the \(w\) parameter value corresponding to the minimum of the squared loss function has been obtained, which is also the best-fit straight line.

We will implement the code to solve the formula \((7)\) and obtain \(w\):

                          def least_squares_algebraic(x: np.ndarray, y: np.ndarray):
    """Algebraic solution for the least squares method"""
    n = x.shape[0]
    w1 = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x * x) - sum(x) * sum(x))
    w0 = (sum(x * x) * sum(y) - sum(x) * sum(x * y)) / (
        n * sum(x * x) - sum(x) * sum(x)
    )
    return w0, w1

                        

So, you can pass \(x\) and \(y\) into the function least_squares_algebraic(x, y) to get the values of \(w_0\) and \(w_1\).

                          least_squares_algebraic(x, y)

                        

(41.33509168550616, 0.7545842753077117)

Of course, we can also obtain the corresponding value of the squared loss at this time:

                          w0 = least_squares_algebraic(x, y)[0]
w1 = least_squares_algebraic(x, y)[1]

square_loss(x, y, w0, w1)

447.69153479025357

Next, we will try to plot the fitted line onto the original image:

                          x_temp = np.linspace(50, 120, 100)  # Generate temporary points for plotting the line

plt.scatter(x, y)
plt.plot(x_temp, x_temp * w1 + w0, "r")

[<matplotlib.lines.Line2D at 0x11a01a320>]

../_images/38a7a33c2c93d0b0b137be37bb1feba240f73dee2b65c7e3f8408d95e86492dc.png

From the above figure, we can see that the fitting effect is still quite good. So, if you have a 150 square meter property you want to sell, you can just plug the values into the equation to get an estimated quote:

                          f(150, w0, w1)

                        

154.5227329816629

The estimated selling price obtained here is about 1.54 million yuan. This is the complete process of solving a linear regression problem using the least squares method.

2.7. Least Squares Method Matrix Solution#

The least squares method is a standard approach for solving overdetermined systems of linear equations (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the residuals made in the results of every single equation.

Let’s consider a linear system of equations:

Ax = b

Where:

A is the design matrix
x is the vector of unknown parameters
b is the vector of observed values

The goal is to find the vector x that minimizes the sum of squares of the residuals:

||Ax - b||^2

The solution to this optimization problem is given by the normal equations:

x = (A^T A)^-1 A^T b

Where:

A^T is the transpose of A
(A^T A)^-1 is the inverse of the matrix (A^T A)

This solution provides the least squares estimate of the unknown parameters x.

After learning the above content, I believe you have already understood what the least squares method is, and how to use the least squares method for linear regression fitting. Above, we used the method of taking partial derivatives, and solved the best fitting parameters \(w\) through algebraic calculations. Here we’ll try another method, which is to calculate the parameters \(w\) through matrix transformations.

First, the expression of the univariate linear function is \(y(x, w) = w_0 + w_1x\), which can be expressed in matrix form as:

\[\begin{split} \left[ \begin{array}{c}{1, x_{1}} \\ {1, x_{2}} \\ {\cdots} \\ {1, x_{9}} \\ {1, x_{10}}\end{array}\right] \left[ \begin{array}{c}{w_{0}} \\ {w_{1}}\end{array}\right] = \left[ \begin{array}{c}{y_{1}} \\ {y_{2}} \\ {\cdots} \\ {y_{9}} \\ {y_{10}}\end{array}\right] \Rightarrow \left[ \begin{array}{c}{1,56} \\ {1,72} \\ {\cdots} \\ {1,94} \\ {1,74}\end{array}\right] \left[ \begin{array}{c}{w_{0}} \\ {w_{1}}\end{array}\right]=\left[ \begin{array}{c}{92} \\ {102} \\ {\cdots} \\ {105} \\ {92}\end{array}\right] \end{split}\]

Here is the translation of the Chinese text into English, preserving the markdown format:

\[ y(x, w) = XW \tag{8b} \]

In Equation (8), \(W\) is a \(\begin{bmatrix}w_{0} \\ w_{1}\end{bmatrix}\) vector, and \(X\) is a \(\begin{bmatrix}1, x_{1} \\ 1, x_{2} \\ \cdots \\ 1, x_{9} \\ 1, x_{10}\end{bmatrix}\) matrix. Then, the squared loss function is:

\[ f = \sum\limits_{i = 1}^n {{{(y_{i}-(w_0 + w_1x_{i}))}}}^2 =(y-XW)^T(y-XW)\tag{9} \]

Through the implementation of matrix computation multiplication distributive law applied to formula \((9)\), we obtain:

\[ f = y^{T}y - y^{T}(XW) - (XW)^{T}y + (XW)^{T}(XW) \tag{10} \]

In this formula, \(y\) and \(XW\) are both \((m,1)\) matrices of the same form, so their multiplication is a linear relationship, and the equivalent transformation is as follows:

\[\begin{split} f = y^{T}y - (XW)^{T}y - (XW)^{T}y + (XW)^{T}(XW)\\ = y^{T}y - 2 (XW)^{T}y + (XW)^{T}(XW) \end{split}\]

At this point, taking the partial derivative of the matrix gives:

\[ \frac{\partial f}{\partial W}=2X^TXW-2X^Ty=0 \tag{12} \]

When the matrix \(X^TX\) is full rank, \((X^TX)^{-1}X^TX=E\), and \(EW=W\). Therefore, we have \((X^TX)^{-1}X^TXW=(X^TX)^{-1}X^Ty\), and finally obtain:

\[ W=(X^TX)^{-1}X^Ty \tag{13} \]

3. Supplementary Formula 12 → 13 Explanation#

We start from the following partial differential equation:

\[\frac{\partial f}{\partial W} = 2X^T XW - 2X^T y = 0\]

The goal is to solve for the expression of \(W\) when \(f\) is minimized.

First, we multiply both sides by \((X^T X)^{-1}\) on the left:

\[(X^T X)^{-1} \frac{\partial f}{\partial W} = (X^T X)^{-1} (2X^T XW - 2X^T y) = 0\]

Since \((X^T X)^{-1}\) and \(X^T X\) are inverse matrices, their product is the identity matrix \(I\), therefore:

\[\begin{split}\begin{aligned} (X^T X)^{-1} \frac{\partial f}{\partial W} &= (X^T X)^{-1} (2X^T XW - 2X^T y) \\ &= 2(X^T X)^{-1}(X^T X)W - 2(X^T X)^{-1}X^T y \\ &= 2IW - 2(X^T X)^{-1}X^T y \\ &= 2W - 2(X^T X)^{-1}X^T y = 0 \end{aligned}\end{split}\]

Rearranging, we get:

\[2W = 2(X^T X)^{-1} X^T y\]

Dividing both sides by 2, we obtain the final expression for \(W\):

\[W = (X^T X)^{-1} X^T y\]

This is the analytical solution for the model parameter \(W\) when minimizing the error function \(f\). The derivation process uses the chain rule of matrix differentiation and the invertibility of the \((X^T X)\) matrix. When \(X^T X\) is a full-rank matrix, its inverse \((X^T X)^{-1}\) exists and satisfies:

\[(X^T X) (X^T X)^{-1} = (X^T X)^{-1} (X^T X) = I\]

This property allows us to cancel out \((X^T X)^{-1}\) and \(X^T X\) in the derivation process, leading to the simplified expression.

As a professional translator specializing in machine learning, you excel at accurately translating markdown text into English while ensuring precision. Please translate the following Chinese text into English while keeping the markdown format:

我们可以针对公式 \((13)\) 进行代码实现：

                        def least_squares_matrix(x: np.matrix, y: np.matrix):
    """Least squares method matrix solution"""
    w = (x.T * x).I * x.T * y
    return w

                      

When computing, you need to refer to the calculation formula above and add an intercept term coefficient of 1 to the original \(x\) data. Here, the np.hstack method is used.

                        x_matrix = np.matrix(np.hstack((np.ones((x.shape[0], 1)), x.reshape(x.shape[0], 1))))
y_matrix = np.matrix(y.reshape(y.shape[0], 1))
x_matrix, y_matrix

                      

                        (matrix([[  1.,  56.],
         [  1.,  72.],
         [  1.,  69.],
         [  1.,  88.],
         [  1., 102.],
         [  1.,  86.],
         [  1.,  76.],
         [  1.,  79.],
         [  1.,  94.],
         [  1.,  74.]]),
 matrix([[ 92],
         [102],
         [ 86],
         [110],
         [130],
         [ 99],
         [ 96],
         [102],
         [105],
         [ 92]]))

                      

                        least_squares_matrix(x_matrix, y_matrix)

                      

                        matrix([[41.33509169],
        [ 0.75458428]])

As can be seen, the matrix computation results are consistent with the previous algebraic computation results. You may wonder, why use the matrix transformation method for the calculation? Is the algebraic computation method learned earlier not good?

In fact, it’s not that algebraic calculation methods are not good. For small datasets, the computation efficiency of the two is similar. However, when we face data on the scale of hundreds of thousands or millions, the efficiency of matrix computation will be much higher. This is why we need to learn matrix computation.

3.1. Linear Regression Implementation in scikit-learn#

Machine learning algorithms are powerful tools for solving complex problems. One of the fundamental machine learning algorithms is linear regression, which is used for predicting continuous target variables based on one or more input features.

In this article, we will explore how to implement linear regression using the scikit-learn library, a popular open-source machine learning library in Python. Scikit-learn provides a simple and efficient implementation of the linear regression algorithm that can handle both single-variable and multi-variable regression tasks.

Here’s the step-by-step process:

Import the necessary libraries: We’ll start by importing the LinearRegression class from the sklearn.linear_model module, as well as the numpy and matplotlib.pyplot libraries for data manipulation and visualization.

                      import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

                    

Prepare the data: Let’s generate some sample data for a single-variable linear regression problem. We’ll create an array of input features (X) and a corresponding array of target variables (y).

                      # Generate sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

                    

Create and train the linear regression model: We’ll create an instance of the LinearRegression class and fit the model to the data using the fit() method.

                      # Create and train the linear regression model
model = LinearRegression()
model.fit(X, y)

                    

Evaluate the model: We can evaluate the model’s performance by examining the slope (model.coef_) and intercept (model.intercept_) of the linear regression line, as well as the coefficient of determination (model.score(X, y)), which tells us how well the model fits the data.

                      # Evaluate the model
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("R-squared:", model.score(X, y))

                    

Visualize the results: Finally, we’ll plot the input data points and the linear regression line to visualize the model’s fit.

                      # Visualize the results
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel('Input feature')
plt.ylabel('Target variable')
plt.title('Linear Regression Example')
plt.show()

                    

This example demonstrates the ease of implementing linear regression using the scikit-learn library. The library provides a simple and efficient way to train, evaluate, and visualize linear regression models, making it a valuable tool for machine learning practitioners.

In the content above, we learned what is the least squares method, and also implemented complete linear regression using Python. So, how can we use the machine learning open-source module scikit-learn to implement the least squares linear regression method?

The process of implementing linear regression using scikit-learn will be much simpler, where we need to use the LinearRegression() class. Let’s take a look at its parameters:

sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

fit_intercept: Default is True, calculate the intercept term.
normalize: Default is False, do not perform standardization on the data.
copy_X: Default is True, use a copy of the data to operate on, preventing impact on the original data.
n_jobs: Number of jobs to use for the computation. Default is 1, if set to -1 all CPUs are used.

                          from sklearn.linear_model import LinearRegression

# Define the linear regression model
model = LinearRegression()
model.fit(x.reshape(x.shape[0], 1), y)  # Train the model, reshape the data to the shape accepted by fit

# Get the model's fitted parameters
model.intercept_, model.coef_

                        

(41.33509168550617, array([0.75458428]))

Here, through model.intercept_ you can get the fitted intercept term, which is the \(w_{0}\) above, and through model.coef_ you can get the coefficient of \(x\), which is the \(w_{1}\) above. By comparison, the results are completely consistent.

Similarly, we can predict the price of a 150 square meter property:

                          model.predict([[150]])

                        

array([154.52273298])

As can be seen, the results obtained here are consistent with the results of self-implemented calculations.

Comprehensive Case Study on Linear Regression

Currently, you have learned how to perform linear regression fitting using the least squares method, and how to calculate the fitting coefficients \(w\) through algebraic calculation and matrix transformation. This has met the requirements for mastering the linear regression method. Next, we will try to load a real dataset and use scikit-learn to build a predictive model to achieve regression prediction.

Since the previous 2 sections both used example data related to housing prices, here we will use a real housing price dataset, which is the “Boston Housing Price Dataset”.

3.1.1. Introduction and Splitting of the Dataset#

The dataset consists of a collection of image-text pairs. Each image is paired with a corresponding text description. The dataset is split into training, validation, and test sets to evaluate the performance of machine learning models.

The training set is used to train the models, the validation set is used to tune the hyperparameters, and the test set is used for final evaluation of the model’s performance. This splitting ensures that the model is evaluated on data it has not seen during training, providing a more accurate assessment of its generalization capabilities.

The splitting of the dataset is typically done using a random sampling approach, where a certain percentage of the total data is allocated to each of the three sets. Common split ratios are 80/10/10 (training/validation/test) or 70/15/15. The exact split ratio may depend on the size of the overall dataset and the specific requirements of the project.

Proper dataset splitting is a crucial step in the machine learning pipeline, as it helps prevent overfitting and ensures the reliability of the model’s performance evaluation.

The Boston housing price dataset is a very classic dataset in machine learning, which has been used in numerous academic papers on regression algorithms. The dataset consists of 506 records, containing 13 features related to housing prices and 1 target value (housing price).

First, we use Pandas to load and preview the dataset, while also checking the first 5 rows of the DataFrame.

                            import pandas as pd

df = pd.read_csv(
    "https://cdn.aibydoing.com/aibydoing/files/course-5-boston.csv"
)
df.head()

                          

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

This dataset contains statistics on the median housing prices in towns in the Boston area, as well as associated features. The column names and their explanations are as follows:

CRIM: Crime rate of the town.
ZN: Proportion of residential land areas over 25,000 square feet.
INDUS: Proportion of non-retail business areas in the town.
CHAS: Whether the Charles River runs through the area (=1 if it does, =0 if it doesn’t).
NOX: Concentration of nitric oxides (parts per 10 million).
RM: Average number of rooms per dwelling.
AGE: Age of the properties.
DIS: Distance to employment centers.
RAD: Index of accessibility to radial highways.
TAX: Property tax rate.
PTRATIO: Pupil-teacher ratio by town.
BLACK: Percentage of black population in the town.
LSTAT: Percentage of lower status of the population.
MEDV: Median value of owner-occupied homes.

We will not use all the data features. Here, we will select only the CRIM, RM, and LSTAT features for training the linear regression model. We will extract the data for these three features separately and use the describe() method to check their descriptive information. describe() calculates the count, max, min, mean, and other statistics for each column of data.

                            features = df[["crim", "rm", "lstat"]]
features.describe()

	crim	rm	lstat
count	506.000000	506.000000	506.000000
mean	3.593761	6.284634	12.653063
std	8.596783	0.702617	7.141062
min	0.006320	3.561000	1.730000
25%	0.082045	5.885500	6.950000
50%	0.256510	6.208500	11.360000
75%	3.647423	6.623500	16.955000
max	88.976200	8.780000	37.970000

Similarly, we will take the target value separately. When training a machine learning prediction model, we usually divide the dataset into two parts, 70% and 30%.

Among them, 70% of the part is called the training set, which is used for model training. For example, here the linear regression is to find the best fitting parameter \(w\) from the training set. Another 30% is called the test set. For the test set, we first know its true target value, and then we can input the features in the test set into the learned model to get the predicted target value. Finally, by comparing the difference between the predicted target value and the true target value, the prediction performance of the model is evaluated.

1625992504364

The image shows a simple machine learning model training process. Next, we split the dataset into features and targets, obtaining 70% as the training set and 30% as the test set. Specifically, the training set features, training set targets, test set features, and test set targets are defined as: X_train, y_train, X_test, y_test.

                            target = df["medv"]  # target value data

split_num = int(len(features) * 0.7)  # get 70% position 

X_train = features[:split_num]  # training set features
y_train = target[:split_num]  # training set target

X_test = features[split_num:]  # test set features
y_test = target[split_num:]  # test set target

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((354, 3), (354,), (152, 3), (152,))

3.1.2. Building and Training Models#

Constructing and training machine learning models is a crucial step in the development of intelligent systems. This process involves the following key components:

Data Preparation: Collecting, cleaning, and preprocessing the data that will be used to train the model. This includes handling missing values, encoding categorical features, and scaling numerical features as needed.
Model Selection: Choosing an appropriate machine learning algorithm or model architecture based on the problem domain and the characteristics of the data. Common models include linear regression, decision trees, neural networks, and support vector machines.
Hyperparameter Tuning: Adjusting the hyperparameters of the chosen model, such as learning rate, regularization strength, or network architecture, to optimize its performance on the training data.
Model Training: Using the prepared data to fit the parameters of the selected model, typically by minimizing a loss function through an optimization algorithm like gradient descent.
Model Evaluation: Assessing the performance of the trained model on held-out test data to ensure it generalizes well and can make accurate predictions on new, unseen data.
Model Deployment: Integrating the trained model into a production system or application, where it can be used to make predictions or decisions in real-world scenarios.

The entire process of building and training machine learning models is an iterative one, often requiring multiple cycles of experimentation, evaluation, and refinement to achieve the desired level of performance.

After dividing the dataset, you can then build and train the model. Similarly, here we will use the LinearRegression() class. The parameters for this class have already been introduced previously.

                            model = LinearRegression()  # Establish the model
model.fit(X_train, y_train)  # Train the model
model.coef_, model.intercept_  # Output the trained model parameters and intercept

                          

(array([ 0.69979497, 10.13564218, -0.20532653]), -38.00096988969018)

In the cell above, we output the fitting parameters of the linear regression model. That is, the final fitted linear function approximation is:

\[ f = 0.6997 * x_{1} + 10.1356 * x_{2} - 0.2053 * x_{3} - 38 \tag{14} \]

Here, \(x_{1}\), \(x_{2}\), \(x_{3}\) correspond to the CRIM, RM, and LSTAT columns in the dataset, respectively. Next, input the test set features into the trained model to obtain the predicted values.

                            preds = model.predict(X_test)  # Predict on the test set features
preds  # The prediction results

                            array([17.77439141, 21.09512448, 27.63412265, 26.78577951, 25.38313368,
       24.3286313 , 28.4257879 , 25.12834727, 16.82806601, 20.76498858,
       52.3350748 , -0.18169806, 12.01475786,  7.87878077, 15.13155699,
       32.93748235, 37.07872049, 29.50613719, 25.50800832, 12.35867972,
        9.08901644, 47.08374238, 35.31759193, 33.3738765 , 38.34913316,
       33.10414639, 91.3556125 , 35.11735022, 19.69326952, 18.49805269,
       14.03767555, 20.9235166 , 20.41406182, 21.92218226, 15.20451678,
       18.05362998, 21.26289453, 23.18192502, 15.87149504, 27.70381826,
       27.65958772, 30.17151829, 27.04987446, 21.52730227, 37.82614512,
       22.09872387, 34.71166346, 32.07959454, 29.45253042, 29.51137956,
       41.49935191, 62.4121152 , 13.64508882, 24.71242033, 18.69151684,
       37.4909413 , 54.05864658, 34.94758034, 15.01355249, 30.17849355,
       32.22191275, 33.90252834, 33.02530285, 28.4416789 , 69.60201087,
       34.7617152 , 31.65353442, 24.5644437 , 24.78130285, 24.00864792,
       21.03315696, 27.84982052, 26.50972924, 48.2345499 , 25.50590175,
       28.25547265, 28.66087656, 34.2545407 , 29.15996676, 27.8072316 ,
       31.54282066, 32.22024557, 33.8708737 , 29.54354233, 24.7407235 ,
       20.90593331, 31.85967562, 29.72491232, 25.59151894, 30.83279914,
       25.40734645, 23.01153504, 27.01673798, 28.92672135, 27.49385728,
       28.34125465, 31.52461119, 29.61897187, 25.83925513, 39.26663855,
       33.00756176, 27.73720999, 21.93359421, 24.42469533, 27.95623349,
       25.37499479, 29.91401113, 26.20027081, 27.81044317, 29.97326914,
       27.7027324 , 19.68280094, 21.44673441, 21.56041782, 29.24007222,
       26.02322353, 24.20402765, 25.31745183, 26.79101418, 33.60357546,
       18.91793831, 23.98036109, 27.29202266, 21.15709214, 28.14694161,
       32.47276562, 27.13611459, 32.81994315, 36.13809753, 20.23338607,
       20.43084078, 26.37366467, 24.87561302, 22.88298598, 13.67619651,
       12.08004137,  7.6709438 , 19.00432321, 19.97736929, 17.49844989,
       19.46809982, 15.97963066, 12.49219926, 18.01764782, 20.51997661,
       15.46843536, 20.30123637, 26.88163963, 22.19647509, 31.58974789,
       29.60675772, 21.5321567 ])

                          

For regression prediction results, there are usually multiple metrics used for evaluation, such as mean absolute error, mean absolute percentage error, and mean squared error. Here, we will introduce two of them:

The mean absolute error (MAE) is the average of the absolute errors, and its calculation formula is as follows:

\[ \textrm{MAE}(y, \hat{y} ) = \frac{1}{n}\sum_{i=1}^{n}{|y_{i}-\hat y_{i}|}\tag{15} \]

Where, \(y_{i}\) represents the true value, \(\hat y_{i}\) represents the predicted value, and \(n\) represents the number of values. The smaller the value of MAE, the better the fitting degree of the model. We can try to use Python to implement the MAE calculation function:

                            def mae_solver(y_true: np.ndarray, y_pred: np.ndarray):
    """Mean Absolute Error (MAE) Solver"""
    n = len(y_true)
    mae = sum(np.abs(y_true - y_pred)) / n
    return mae

                          

Mean Squared Error (MSE) represents the expected value of the squared error, and its calculation formula is as follows:

\[ \textrm{MSE}(y, \hat{y} ) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^{2}\tag{16} \]

Here, \(y_{i}\) represents the true value, \(\hat y_{i}\) represents the predicted value, and \(n\) represents the number of values. The smaller the MSE value, the better the precision of the prediction model. Similarly, we can try to implement the MSE calculation function using Python:

                            def mse_solver(y_true: np.ndarray, y_pred: np.ndarray):
    """Mean Squared Error (MSE) Solver"""
    n = len(y_true)
    mse = sum(np.square(y_true - y_pred)) / n
    return mse

                          

Therefore, we can calculate the average metrics of the above model, that is, the MSE and MAE values of the prediction results:

                            mae = mae_solver(y_test.values, preds)
mse = mse_solver(y_test.values, preds)

print("MAE: ", mae)
print("MSE: ", mse)

                            MAE:  13.022063072780178
MSE:  303.833124722358

We can also call the pre-existing MAE and MSE solving methods in scikit-learn to give it a try. If all goes well, the results should be consistent with what we have implemented ourselves.

                            from sklearn.metrics import mean_absolute_error, mean_squared_error

mae_ = mean_absolute_error(y_test, preds)
mse_ = mean_squared_error(y_test, preds)

print("scikit-learn MAE: ", mae_)
print("scikit-learn MSE: ", mse_)

                            scikit-learn MAE:  13.02206307278018
scikit-learn MSE:  303.8331247223582

As can be seen, the average absolute error of the model’s predictions here is about 13.02. If you calculate the average of all the target values (which is around 22), you’ll find that an average absolute error of 13.02 is quite large. This indicates that the model’s performance is not good. What could be the reason for this?

The main reasons are two-fold. First, the data - we did not preprocess the data, and randomly selected 3 features without properly utilizing the other features provided in the dataset. Additionally, we did not filter out or normalize the outlier data. Another reason is the algorithm itself - linear regression reflects the data patterns through linear relationships, but in reality, housing prices do not simply follow a linear relationship, which is one of the reasons for the poor prediction performance.

Of course, the knowledge of data preprocessing involved in the process of using machine learning to train models will be gradually learned in the following courses. Mastering the principles and implementation methods of linear regression is the focus of this learning.

3.2. Summary#

We start with the principle of linear regression, learn the two solution methods of the least squares method, and implement the linear regression algorithm in detail. In this process, you have understood the training and prediction process of machine learning, as well as the mathematical ideas behind it.

In summary, a machine learning process often consists of two parts: training and prediction. The trained model can be used to predict unknown data. The process of training the model is actually the process of applying machine learning algorithms to solve problems. Typically, we define a loss function (such as the squared loss function) and use a mathematical optimization method (such as the least squares method) to find the optimal solution to the loss function. This idea will always be at the core of machine learning.

Relevant links

○ Sharethis article link to your social media, blog, forum, etc. More external links will increase the search engine ranking of this site.

If you find this content helpful, you can buy me a coffee

1. Overview and Examples of Machine Learning

4. Beijing Housing Price Prediction