
4. Beijing Housing Price Prediction

4.1. Introduction

In the linear regression experiment, we used Boston housing price prediction as an example and walked through its implementation in detail. In this challenge, you will apply what you learned about linear regression to predict housing prices in Beijing.

4.2. Key Points

  • Dataset Reading and Partitioning

  • Model Training and Prediction

  • Model Evaluation

4.3. Dataset Reading and Partitioning

This challenge uses a housing price dataset covering some communities in Beijing. Download the file, named challenge-1-beijing.csv, before you start.

# Dataset download link
https://cdn.aibydoing.com/aibydoing/files/challenge-1-beijing.csv

The Beijing housing price dataset comes from the open-source project PENGZhaoqing/scrapy-HousePricing.

Exercise 4.1

Challenge: Use Pandas to load the CSV file of the dataset and preview the first 5 rows of data.

Requirement: Use pd.read_csv() to read the CSV file.

import pandas as pd

### Code starts ### (≈ 2 lines of code)
df = None
### Code ends ###

Expected output

   Bus  Office Building  Hospital  Mall  Subway  School  Community Name      Construction Year  Room Type                   Floor  Price per Square Meter  Area
0  18   18               10        0     2       49      Yuanyang Landscape  2006               2 Bedrooms, 1 Living Room   26     60937                   96
1  17   42               10        0     4       37      Chunshu Garden      1998               3 Bedrooms, 1 Living Room   14     88686                   130
2  18   36               9         0     1       24      Yongle Community    1989               3 Bedrooms, 1 Living Room   18     46621                   74
3  15   49               13        0     2       45      Subject Home        2007               4 Bedrooms, 3 Living Rooms  2      86147                   462
4  6    0                0         0     0       0       Tianlun Jincheng    2007               1 Bedroom, 1 Living Room    13     42500                   64
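
One possible way to complete Exercise 4.1 is sketched below. To stay runnable offline, it reads an inline copy of the five preview rows through io.StringIO rather than the real file. The English column names are an assumption taken from the expected output; the actual CSV appears to use Chinese headers (for example '每平米价格').

```python
import io

import pandas as pd

# Inline stand-in for challenge-1-beijing.csv (assumption: only the five
# preview rows, with translated column names).
sample_csv = io.StringIO(
    "Bus,Office Building,Hospital,Mall,Subway,School,Community Name,"
    "Construction Year,Room Type,Floor,Price per Square Meter,Area\n"
    '18,18,10,0,2,49,Yuanyang Landscape,2006,"2 Bedrooms, 1 Living Room",26,60937,96\n'
    '17,42,10,0,4,37,Chunshu Garden,1998,"3 Bedrooms, 1 Living Room",14,88686,130\n'
    '18,36,9,0,1,24,Yongle Community,1989,"3 Bedrooms, 1 Living Room",18,46621,74\n'
    '15,49,13,0,2,45,Subject Home,2007,"4 Bedrooms, 3 Living Rooms",2,86147,462\n'
    '6,0,0,0,0,0,Tianlun Jincheng,2007,"1 Bedroom, 1 Living Room",13,42500,64\n'
)

# pd.read_csv accepts a file path, a URL, or a file-like object such as this one.
df = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default.
print(df.head())
```

With the downloaded file in the working directory, the same call would presumably be pd.read_csv("challenge-1-beijing.csv").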

As can be seen, the dataset has 12 columns in total. Because linear regression requires numerical input, we select the 9 numerical columns “Bus”, “Office Building”, “Hospital”, “Mall”, “Subway”, “School”, “Construction Year”, “Floor”, and “Area” as features, and “Price per Square Meter” is the target to predict.

Exercise 4.2

Challenge: Split the loaded dataset into a feature dataset, features, and a target dataset, target.

Requirement: Use Pandas to select data columns.

### Code starts ### (≈ 1 line of code)
features = None
target = df['每平米价格']  # '每平米价格' is the "Price per Square Meter" column
### Code ends ###

Run the tests

pd.concat([features, target], axis=1).head()

Expected output

   Bus  Office Building  Hospital  Mall  Subway  School  Construction Year  Floor  Area  Price per Square Meter
0  18   18               10        0     2       49      2006               26     96    60937
1  17   42               10        0     4       37      1998               14     130   88686
2  18   36               9         0     1       24      1989               18     74    46621
3  15   49               13        0     2       45      2007               2      462   86147
4  6    0                0         0     0       0       2007               13     64    42500
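
As a rough illustration of Exercise 4.2, the sketch below selects the nine numerical feature columns with a list of column names. The small DataFrame is a stand-in built from the five preview rows, and the English column names are an assumption; the real file appears to use Chinese headers, so the names would need to be adapted.

```python
import pandas as pd

# Stand-in for the loaded dataset (assumption: five preview rows only,
# translated column names, one text column kept to show it is excluded).
df = pd.DataFrame({
    "Bus": [18, 17, 18, 15, 6],
    "Office Building": [18, 42, 36, 49, 0],
    "Hospital": [10, 10, 9, 13, 0],
    "Mall": [0, 0, 0, 0, 0],
    "Subway": [2, 4, 1, 2, 0],
    "School": [49, 37, 24, 45, 0],
    "Community Name": ["Yuanyang Landscape", "Chunshu Garden",
                       "Yongle Community", "Subject Home", "Tianlun Jincheng"],
    "Construction Year": [2006, 1998, 1989, 2007, 2007],
    "Floor": [26, 14, 18, 2, 13],
    "Price per Square Meter": [60937, 88686, 46621, 86147, 42500],
    "Area": [96, 130, 74, 462, 64],
})

# Indexing with a list of names returns a DataFrame with just those columns.
feature_columns = ["Bus", "Office Building", "Hospital", "Mall", "Subway",
                   "School", "Construction Year", "Floor", "Area"]
features = df[feature_columns]

# Indexing with a single name returns a Series.
target = df["Price per Square Meter"]

preview = pd.concat([features, target], axis=1)
print(preview.head())
```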

After splitting the original DataFrame into features and target, we also need to split each of them into a 70% training set and a 30% test set. The training features, training target, test features, and test target are named X_train, y_train, X_test, and y_test, respectively.

Exercise 4.3

Challenge: Split each of features and target into a 70% training set and a 30% test set.

Requirement: Split the data by slicing in Pandas at the 70% mark.

split_num = int(len(df)*0.7) # row index of the 70% split point

### Code starts ### (≈ 4 lines of code)
X_train = None
y_train = None
X_test = None
y_test = None
### Code ends ###

Run the tests

len(X_train), len(y_train), len(X_test), len(y_test)

Expected output

(2100, 2100, 900, 900)

As can be seen, the training set contains 2100 data entries, while the test set contains 900 data entries.
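
The slicing approach can be sketched on a small synthetic dataset as follows. The 10-row features and target objects here are made up for illustration, so the split is 7 and 3 rows rather than 2100 and 900.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for features and target (assumption: 10 rows).
features = pd.DataFrame({"Area": np.arange(10), "Floor": np.arange(10, 20)})
target = pd.Series(np.arange(100, 110), name="Price per Square Meter")

# Index of the first test row: the first 70% of rows go to training.
split_num = int(len(features) * 0.7)

# Positional slicing keeps the original row order; nothing is shuffled.
X_train = features.iloc[:split_num]
y_train = target.iloc[:split_num]
X_test = features.iloc[split_num:]
y_test = target.iloc[split_num:]

print(len(X_train), len(y_train), len(X_test), len(y_test))
```

Because slicing preserves file order, the test set is simply the last 30% of the rows. If the file were sorted in some meaningful way, a shuffled split, for example with scikit-learn's train_test_split, might be preferable; that is a different technique from the one this exercise asks for.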

4.4. Model Training and Prediction

With the training and test data available, we can now start building a machine learning model. Here, we directly use the linear regression method in scikit-learn to build the model.

Exercise 4.4

Challenge: Build and train a linear regression model using scikit-learn.

Requirement: Use the model.fit() method to train the model.

from sklearn.linear_model import LinearRegression

### Code starts ### (≈ 2 lines of code)
model = None

### Code ends ###

Run the tests

model.coef_[:3], len(model.coef_)

Expected output

(array([6.59398431, 299.47265083, 371.60427218]), 9)
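
To show the fit and predict workflow in isolation, here is a minimal sketch on made-up data. The two-feature training set is an assumption constructed so that the target is exactly 3·x1 + 2·x2 + 5, which lets the fitted coefficients be checked by eye. It is not the Beijing data and will not reproduce the coefficients shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: 5 samples, 2 features.
X_train = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [3.0, 5.0],
                    [4.0, 3.0],
                    [5.0, 8.0]])
# Target built from a known linear rule, without noise.
y_train = 3.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + 5.0

model = LinearRegression()   # ordinary least squares with an intercept
model.fit(X_train, y_train)  # learn coef_ and intercept_

# coef_ holds one weight per feature; here it should be close to [3, 2].
print(model.coef_, model.intercept_)

# Predict for an unseen sample; the rule gives 3*6 + 2*4 + 5 = 31.
y_pred = model.predict(np.array([[6.0, 4.0]]))
print(y_pred)
```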

4.5. Model Evaluation

In the linear regression experiment, we introduced two evaluation metrics for regression predictions: the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). In this challenge, we introduce another evaluation metric, the Mean Absolute Percentage Error (MAPE).

MAPE is a percentage value and is thus easier to understand than other statistics. For example, if the MAPE is 5, it means that the prediction results deviate from the true results by an average of 5%. The calculation formula for MAPE is as follows:

\[ \textrm{MAPE}(y, \hat{y} ) = \frac{\sum_{i=1}^{n}{|\frac{y_{i}-\hat y_{i}}{y_{i}}|}}{n} \times 100 \tag{1} \]

Here, \(y_{i}\) is the true value, \(\hat y_{i}\) is the predicted value, and \(n\) is the number of samples. The smaller the MAPE, the more accurate the prediction model.
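
A small worked example may help make the formula concrete. The sketch below applies formula (1) to three made-up value pairs; the function name mape_sketch and the sample numbers are illustrative assumptions, not part of the challenge code.

```python
import numpy as np


def mape_sketch(y_true, y_pred):
    """Mean Absolute Percentage Error, following formula (1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Relative error per sample, averaged over all samples, as a percentage.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# Relative errors are 10/100 = 0.1, 20/200 = 0.1, and 0/400 = 0,
# so the mean is 0.2 / 3 and the MAPE is about 6.67.
value = mape_sketch([100, 200, 400], [110, 180, 400])
print(round(value, 2))
```

Note that the formula divides by each true value, so it is undefined when any \(y_{i}\) is zero.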

Exercise 4.5

Challenge: Implement a Python function that computes MAPE according to the formula above.

Requirement: The np.abs() function can be used to calculate the absolute value.

import numpy as np

def mape(y_true, y_pred):
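    """
    Arguments:
    y_true -- true target values of the test set
    y_pred -- predicted target values of the test set
    
    Returns:
    mape -- the MAPE evaluation metric
    """
    
    ### Code starts ### (≈ 2 lines of code)
    
    mape = None
    ### Code ends ###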
    
    return mape

Run the tests

y_true = y_test.values
y_pred = model.predict(X_test)
mape(y_true, y_pred)

Expected output

45.5%

As can be seen, the final MAPE is relatively large, which indicates a large prediction error. In the linear regression experiment, we mentioned that one possible reason for poor predictions is that the data was not preprocessed. In addition, linear regression is a very basic and simple prediction method. For a problem like housing prices that involves multiple features, we usually need more complex regression methods to obtain better results. We will learn about these later.

