4. Beijing Housing Price Prediction#
4.1. Introduction#
In the linear regression experiment, we took housing price prediction in the Boston area as an example and walked through its implementation in detail. In this challenge, you will apply what you learned about linear regression to predict housing prices in Beijing.
4.2. Key Points#
Dataset Reading and Partitioning
Model Training and Prediction
Model Evaluation
4.3. Dataset Reading and Partitioning#
This challenge requires downloading a housing price dataset covering a number of Beijing communities. The dataset file is named challenge-1-beijing.csv.
# Dataset download link
https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv
The housing price dataset of Beijing is sourced from the open-source project: PENGZhaoqing/scrapy-HousePricing
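If wget is not available, the file can also be fetched directly from Python. Below is a minimal sketch using only the standard library; it saves the file under the same name used in the solutions:
import urllib.request

# Download the dataset into the current working directory
url = "https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv"
urllib.request.urlretrieve(url, "challenge-1-beijing.csv")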
Exercise 4.1
Challenge: Use Pandas to load the dataset's CSV file and preview its first 5 rows.
Requirement: Use pd.read_csv() to read the CSV file.
import pandas as pd
### Start of code ### (≈ 2 lines of code)
df = None
### End of code ###
Solution to Exercise 4.1
!wget -nc https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv
import pandas as pd
### Start of code ### (≈ 2 lines of code)
df = pd.read_csv("challenge-1-beijing.csv")
df.head()
### End of code ###
Expected output
| | Bus | Office Building | Hospital | Mall | Subway | School | Community Name | Construction Year | Room Type | Floor | Price per Square Meter | Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 18 | 10 | 0 | 2 | 49 | Yuanyang Landscape | 2006 | 2 Bedrooms, 1 Living Room | 26 | 60937 | 96 |
| 1 | 17 | 42 | 10 | 0 | 4 | 37 | Chunshu Garden | 1998 | 3 Bedrooms, 1 Living Room | 14 | 88686 | 130 |
| 2 | 18 | 36 | 9 | 0 | 1 | 24 | Yongle Community | 1989 | 3 Bedrooms, 1 Living Room | 18 | 46621 | 74 |
| 3 | 15 | 49 | 13 | 0 | 2 | 45 | Subject Home | 2007 | 4 Bedrooms, 3 Living Rooms | 2 | 86147 | 462 |
| 4 | 6 | 0 | 0 | 0 | 0 | 0 | Tianlun Jincheng | 2007 | 1 Bedroom, 1 Living Room | 13 | 42500 | 64 |
As can be seen, the dataset contains 12 columns in total. Since linear regression requires numerical input data, we select 9 columns as features: “Bus”, “Office Building”, “Hospital”, “Mall”, “Subway”, “School”, “Construction Year”, “Floor”, and “Area”. The “Price per Square Meter” column is the target value to be predicted.
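Before selecting columns, it can help to verify which ones are numeric and which are text. Below is a small sketch (not part of the exercise), assuming df has been loaded as above:
# Column dtypes: object columns such as Community Name and Room Type
# are text and cannot be fed to linear regression directly
print(df.dtypes)

# Alternative: keep only the numeric columns automatically
numeric_df = df.select_dtypes(include="number")
print(numeric_df.columns.tolist())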
Exercise 4.2
Challenge: Split the loaded dataset into a feature dataset features and a target dataset target.
Requirement: Use Pandas to select data columns.
Requirement: Use Pandas to select data columns.
### Start of code ### (≈ 1 line of code)
features = None
target = df['Price per Square Meter']
### End of code ###
Solution to Exercise 4.2
### Start of code ### (≈ 1 line of code)
features = df[['Bus', 'Office Building', 'Hospital', 'Mall', 'Subway', 'School', 'Construction Year', 'Floor', 'Area']]
target = df['Price per Square Meter']
### End of code ###

# Method 2

### Start of code ### (≈ 1 line of code)
features = df[df.columns.drop(['Community Name', 'Room Type', 'Price per Square Meter'])]
target = df['Price per Square Meter']
### End of code ###
Run the tests
pd.concat([features, target], axis=1).head()
Expected output
| | Bus | Office Building | Hospital | Mall | Subway | School | Construction Year | Floor | Area | Price per Square Meter |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 18 | 10 | 0 | 2 | 49 | 2006 | 26 | 96 | 60937 |
| 1 | 17 | 42 | 10 | 0 | 4 | 37 | 1998 | 14 | 130 | 88686 |
| 2 | 18 | 36 | 9 | 0 | 1 | 24 | 1989 | 18 | 74 | 46621 |
| 3 | 15 | 49 | 13 | 0 | 2 | 45 | 2007 | 2 | 462 | 86147 |
| 4 | 6 | 0 | 0 | 0 | 0 | 0 | 2007 | 13 | 64 | 42500 |
After splitting the original DataFrame into the feature values features and the target values target, we also need to divide each of these two DataFrames into a 70% training set and a 30% test set. The training-set features, training-set target, test-set features, and test-set target are named X_train, y_train, X_test, and y_test respectively.
Exercise 4.3
Challenge: Divide each of the two DataFrames features and target into a 70% training set and a 30% test set.
Requirement: Split the data by slicing the DataFrames at the 70% index with Pandas.
split_num = int(len(df)*0.7)  # index of the 70% split point
### Start of code ### (≈ 4 lines of code)
X_train = None
y_train = None
X_test = None
y_test = None
### End of code ###
Solution to Exercise 4.3
split_num = int(len(df)*0.7)  # index of the 70% split point
### Start of code ### (≈ 4 lines of code)
X_train = features[:split_num]
y_train = target[:split_num]
X_test = features[split_num:]
y_test = target[split_num:]
### End of code ###
Run the tests
len(X_train), len(y_train), len(X_test), len(y_test)
Expected output
(2100, 2100, 900, 900)
As can be seen, the training set contains 2100 data entries, while the test set contains 900 data entries.
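Sequential slicing works here, but note that it assumes the row order carries no systematic pattern. For comparison (not required by the exercise), scikit-learn's train_test_split produces the same 70/30 split when shuffling is disabled; a minimal sketch:
from sklearn.model_selection import train_test_split

# shuffle=False reproduces the sequential slicing above;
# the default shuffle=True would randomize the rows first
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    features, target, test_size=0.3, shuffle=False)
len(X_train2), len(X_test2)  # (2100, 900)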
4.4. Model Training and Prediction#
With the training and test data available, we can now start building a machine learning model. Here, we directly use the linear regression method in scikit-learn to build the model.
Exercise 4.4
Challenge: Build and train a linear regression model using scikit-learn.
Requirement: Use the model.fit() method to train the model.
from sklearn.linear_model import LinearRegression
### Start of code ### (≈ 2 lines of code)
model = None
### End of code ###
Solution to Exercise 4.4
from sklearn.linear_model import LinearRegression
### Start of code ### (≈ 2 lines of code)
model = LinearRegression()
model.fit(X_train, y_train)
### End of code ###
Run the tests
model.coef_[:3], len(model.coef_)
Expected output
(array([  6.59398431, 299.47265083, 371.60427218]), 9)
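The test above prints the first three of the nine learned coefficients, one per feature column. To see which weight belongs to which feature, the coefficients can be paired with the column names; a short sketch, assuming the model was fitted as in the solution:
# Pair each feature name with its learned weight;
# model.intercept_ holds the bias term
for name, coef in zip(features.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print("Intercept:", model.intercept_)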
4.5. Model Evaluation#
In the linear regression experiment, we introduced two metrics for evaluating regression predictions: the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). In this challenge, we introduce a third metric, the Mean Absolute Percentage Error (MAPE).
MAPE is expressed as a percentage and is therefore easier to interpret than the other statistics. For example, a MAPE of 5 means that the predictions deviate from the true values by 5% on average. MAPE is calculated as follows:
\[ \text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_{i}-\hat y_{i}}{y_{i}}\right| \]
Here, \(y_{i}\) is the true value, \(\hat y_{i}\) is the predicted value, and \(n\) is the number of samples. The smaller the MAPE, the more accurate the prediction model.
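As a quick sanity check of the formula: with true values \(y = (100, 200)\) and predictions \(\hat y = (110, 180)\), the absolute percentage errors are \(\left|100-110\right|/100 = 0.1\) and \(\left|200-180\right|/200 = 0.1\), so \(\text{MAPE} = 100\% \times (0.1 + 0.1)/2 = 10\%\).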
Exercise 4.5
Challenge: Implement the MAPE evaluation metric defined above as a Python function.
Requirement: Use np.abs() to compute the absolute value.
import numpy as np
def mape(y_true, y_pred):
    """
    Parameters:
    y_true -- true target values of the test set
    y_pred -- predicted target values of the test set

    Returns:
    mape -- MAPE evaluation metric
    """
    ### Start of code ### (≈ 2 lines of code)
    mape = None
    ### End of code ###
    return mape
Solution to Exercise 4.5
import numpy as np
def mape(y_true, y_pred):
    """
    Parameters:
    y_true -- true target values of the test set
    y_pred -- predicted target values of the test set

    Returns:
    mape -- MAPE evaluation metric
    """
    ### Start of code ### (≈ 2 lines of code)
    n = len(y_true)
    mape = 100 * np.sum(np.abs((y_true - y_pred) / y_true)) / n
    ### End of code ###
    return mape
Run the tests
y_true = y_test.values
y_pred = model.predict(X_test)
mape(y_true, y_pred)
Expected output
45.5
As can be seen, the final MAPE value is quite large, indicating that the predictions deviate substantially from the true prices. In the linear regression experiment, we mentioned that one reason for poor predictions may be that the data was not preprocessed. Moreover, linear regression itself is a very basic prediction method; for a multi-feature problem like housing price prediction, more sophisticated regression methods are usually needed to obtain better results. This is what we will learn later.
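As an illustration of the preprocessing point (a sketch only, not part of the challenge), a common first step is to standardize the features before fitting. Note that for ordinary least squares this rescaling alone does not change the predictions (only the coefficients), so the MAPE stays the same; it starts to matter once regularized or distance-based models enter the picture:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid information leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)
mape(y_test.values, model_scaled.predict(X_test_scaled))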