4. Beijing Housing Price Prediction#
4.1. Introduction#
In the linear regression experiment, we took housing price prediction in the Boston area as an example and walked through its implementation in detail. In this challenge, you will apply what you learned about linear regression to predict housing prices in Beijing.
4.2. Key Points#
Dataset Reading and Partitioning
Model Training and Prediction
Model Evaluation
4.3. Dataset Reading and Partitioning#
This challenge requires downloading a housing price dataset covering a number of Beijing communities. The dataset file is named challenge-1-beijing.csv.
# Dataset download link
https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv
The housing price dataset of Beijing is sourced from the open-source project: PENGZhaoqing/scrapy-HousePricing
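If wget is not available, the file can also be fetched directly from Python. Below is a minimal sketch using only the standard library; it saves the file under the same name used in the solutions:
import urllib.request

# Download the dataset into the current working directory
url = "https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv"
urllib.request.urlretrieve(url, "challenge-1-beijing.csv")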
Exercise 4.1
Challenge: Use Pandas to load the dataset's CSV file and preview its first 5 rows.
Requirement: Use pd.read_csv() to read the CSV file.
import pandas as pd
### Start of code ### (≈ 2 lines of code)
df = None
### End of code ###
Solution to Exercise 4.1
!wget -nc https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv
import pandas as pd
### Start of code ### (≈ 2 lines of code)
df = pd.read_csv("challenge-1-beijing.csv")
df.head()
### End of code ###
Expected output
| | Bus | Office Building | Hospital | Mall | Subway | School | Community Name | Construction Year | Room Type | Floor | Price per Square Meter | Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 18 | 10 | 0 | 2 | 49 | Yuanyang Landscape | 2006 | 2 Bedrooms, 1 Living Room | 26 | 60937 | 96 |
| 1 | 17 | 42 | 10 | 0 | 4 | 37 | Chunshu Garden | 1998 | 3 Bedrooms, 1 Living Room | 14 | 88686 | 130 |
| 2 | 18 | 36 | 9 | 0 | 1 | 24 | Yongle Community | 1989 | 3 Bedrooms, 1 Living Room | 18 | 46621 | 74 |
| 3 | 15 | 49 | 13 | 0 | 2 | 45 | Subject Home | 2007 | 4 Bedrooms, 3 Living Rooms | 2 | 86147 | 462 |
| 4 | 6 | 0 | 0 | 0 | 0 | 0 | Tianlun Jincheng | 2007 | 1 Bedroom, 1 Living Room | 13 | 42500 | 64 |
As can be seen, the dataset contains 12 columns in total. Since linear regression requires numerical input data, we select 9 columns as features: “Bus”, “Office Building”, “Hospital”, “Mall”, “Subway”, “School”, “Construction Year”, “Floor”, and “Area”. The “Price per Square Meter” column is the target value to be predicted.
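Before selecting columns, it can help to verify which ones are numeric and which are text. Below is a small sketch (not part of the exercise), assuming df has been loaded as above:
# Column dtypes: object columns such as Community Name and Room Type
# are text and cannot be fed to linear regression directly
print(df.dtypes)

# Alternative: keep only the numeric columns automatically
numeric_df = df.select_dtypes(include="number")
print(numeric_df.columns.tolist())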
Exercise 4.2
Challenge: Split the loaded dataset into a feature dataset features and a target dataset target.
Requirement: Use Pandas to select data columns.
Requirement: Use Pandas to select data columns.
### Start of code ### (≈ 1 line of code)
features = None
target = df['Price per Square Meter']
### End of code ###
Solution to Exercise 4.2
### Start of code ### (≈ 1 line of code)
features = df[['Bus', 'Office Building', 'Hospital', 'Mall', 'Subway', 'School', 'Construction Year', 'Floor', 'Area']]
target = df['Price per Square Meter']
### End of code ###

# Method 2

### Start of code ### (≈ 1 line of code)
features = df[df.columns.drop(['Community Name', 'Room Type', 'Price per Square Meter'])]
target = df['Price per Square Meter']
### End of code ###
Run the tests
pd.concat([features, target], axis=1).head()
Expected output
| | Bus | Office Building | Hospital | Mall | Subway | School | Construction Year | Floor | Area | Price per Square Meter |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 18 | 10 | 0 | 2 | 49 | 2006 | 26 | 96 | 60937 |
| 1 | 17 | 42 | 10 | 0 | 4 | 37 | 1998 | 14 | 130 | 88686 |
| 2 | 18 | 36 | 9 | 0 | 1 | 24 | 1989 | 18 | 74 | 46621 |
| 3 | 15 | 49 | 13 | 0 | 2 | 45 | 2007 | 2 | 462 | 86147 |
| 4 | 6 | 0 | 0 | 0 | 0 | 0 | 2007 | 13 | 64 | 42500 |
After splitting the original DataFrame into the feature values features and the target values target, we also need to divide each of these two DataFrames into a 70% training set and a 30% test set. The training-set features, training-set target, test-set features, and test-set target are named X_train, y_train, X_test, and y_test respectively.
Exercise 4.3
Challenge: Divide each of the two DataFrames features and target into a 70% training set and a 30% test set.
Requirement: Split the data by slicing the DataFrames at the 70% index with Pandas.
split_num = int(len(df)*0.7)  # index of the 70% split point
### Start of code ### (≈ 4 lines of code)
X_train = None
y_train = None
X_test = None
y_test = None
### End of code ###
Solution to Exercise 4.3
split_num = int(len(df)*0.7)  # index of the 70% split point
### Start of code ### (≈ 4 lines of code)
X_train = features[:split_num]
y_train = target[:split_num]
X_test = features[split_num:]
y_test = target[split_num:]
### End of code ###
Run the tests
len(X_train), len(y_train), len(X_test), len(y_test)
Expected output
(2100, 2100, 900, 900)
As can be seen, the training set contains 2100 data entries, while the test set contains 900 data entries.
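Sequential slicing works here, but note that it assumes the row order carries no systematic pattern. For comparison (not required by the exercise), scikit-learn's train_test_split produces the same 70/30 split when shuffling is disabled; a minimal sketch:
from sklearn.model_selection import train_test_split

# shuffle=False reproduces the sequential slicing above;
# the default shuffle=True would randomize the rows first
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    features, target, test_size=0.3, shuffle=False)
len(X_train2), len(X_test2)  # (2100, 900)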
4.4. Model Training and Prediction#
With the training and test data available, we can now start building a machine learning model. Here, we directly use the linear regression method in scikit-learn to build the model.
Exercise 4.4
Challenge: Build and train a linear regression model using scikit-learn.
Requirement: Use the model.fit() method to train the model.
from sklearn.linear_model import LinearRegression
### Start of code ### (≈ 2 lines of code)
model = None
### End of code ###
Solution to Exercise 4.4
from sklearn.linear_model import LinearRegression
### Start of code ### (≈ 2 lines of code)
model = LinearRegression()
model.fit(X_train, y_train)
### End of code ###
Run the tests
model.coef_[:3], len(model.coef_)
Expected output
(array([  6.59398431, 299.47265083, 371.60427218]), 9)
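The test above prints the first three of the nine learned coefficients, one per feature column. To see which weight belongs to which feature, the coefficients can be paired with the column names; a short sketch, assuming the model was fitted as in the solution:
# Pair each feature name with its learned weight;
# model.intercept_ holds the bias term
for name, coef in zip(features.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print("Intercept:", model.intercept_)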
4.5. Model Evaluation#
In the linear regression experiment, we introduced two metrics for evaluating regression predictions: the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). In this challenge, we introduce a third metric, the Mean Absolute Percentage Error (MAPE).
MAPE is expressed as a percentage and is therefore easier to interpret than the other statistics. For example, a MAPE of 5 means that the predictions deviate from the true values by 5% on average. MAPE is calculated as follows:
\[ \text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_{i}-\hat y_{i}}{y_{i}}\right| \]
Here, \(y_{i}\) is the true value, \(\hat y_{i}\) is the predicted value, and \(n\) is the number of samples. The smaller the MAPE, the more accurate the prediction model.
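As a quick sanity check of the formula: with true values \(y = (100, 200)\) and predictions \(\hat y = (110, 180)\), the absolute percentage errors are \(\left|100-110\right|/100 = 0.1\) and \(\left|200-180\right|/200 = 0.1\), so \(\text{MAPE} = 100\% \times (0.1 + 0.1)/2 = 10\%\).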
Exercise 4.5
Challenge: Implement the MAPE evaluation metric defined above as a Python function.
Requirement: Use np.abs() to compute the absolute value.
import numpy as np
def mape(y_true, y_pred):
    """
    Parameters:
    y_true -- true target values of the test set
    y_pred -- predicted target values of the test set

    Returns:
    mape -- MAPE evaluation metric
    """
    ### Start of code ### (≈ 2 lines of code)
    mape = None
    ### End of code ###
    return mape
Solution to Exercise 4.5
import numpy as np
def mape(y_true, y_pred):
    """
    Parameters:
    y_true -- true target values of the test set
    y_pred -- predicted target values of the test set

    Returns:
    mape -- MAPE evaluation metric
    """
    ### Start of code ### (≈ 2 lines of code)
    n = len(y_true)
    mape = 100 * np.sum(np.abs((y_true - y_pred) / y_true)) / n
    ### End of code ###
    return mape
Run the tests
y_true = y_test.values
y_pred = model.predict(X_test)
mape(y_true, y_pred)
Expected output
45.5
As can be seen, the final MAPE value is quite large, indicating that the predictions deviate substantially from the true prices. In the linear regression experiment, we mentioned that one reason for poor predictions may be that the data was not preprocessed. Moreover, linear regression itself is a very basic prediction method; for a multi-feature problem like housing price prediction, more sophisticated regression methods are usually needed to obtain better results. This is what we will learn later.
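As an illustration of the preprocessing point (a sketch only, not part of the challenge), a common first step is to standardize the features before fitting. Note that for ordinary least squares this rescaling alone does not change the predictions (only the coefficients), so the MAPE stays the same; it starts to matter once regularized or distance-based models enter the picture:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid information leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)
mape(y_test.values, model_scaled.predict(X_test_scaled))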