
24. Quickly Select Models Using Cross-Validation#

24.1. Introduction#

In previous courses, we studied a number of different classification methods. In real engineering practice, how do you quickly find the most suitable one among them? This is a question of model selection. In this challenge, we will learn the K-fold cross-validation method and use it to complete a model selection task.

24.2. Key Points#

  • K-fold Cross-Validation

  • Equal Division into K-fold Subsets

  • Abalone Age Classification

We have learned several different classification methods in supervised learning, and by now you should have a solid grasp of their principles and implementations. A natural question then arises: how do we quickly select the method best suited to the current application scenario?

To answer this question, let's first review the K-nearest neighbors experiment. At the end of that course, the experiment covered selecting the value of K: we simply tried every K from 2 to 10 and kept the one that performed best. Although this approach looks a bit "brute-force", it works well.

Model selection works in much the same way. To quickly pick a suitable model from several candidates, we can feed the data into each model in turn and look for the one with the stronger generalization ability. Although this is a "brute-force" method, there are some clever techniques involved, one of which is the K-fold cross-validation introduced today.

Regarding cross-validation, we have actually been exposed to it before, so we won’t repeat the concept here.

24.3. Data Set Preprocessing#

Now let's officially start this challenge. It uses the abalone age dataset challenge-6-abalone.csv, which you first need to download from the following link:

# Dataset Download Link
https://cdn.aibydoing.com/aibydoing/files/challenge-6-abalone.csv

Exercise 24.1

Challenge: Load and preview the first 5 rows of the dataset.

import pandas as pd

### Code starts ### (≈ 2 lines of code)
df = None
### Code ends ###

Expected output

M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
0 M 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 7
1 F 0.53 0.42 0.135 0.677 0.2565 0.1415 0.21 9
2 M 0.44 0.365 0.125 0.516 0.2155 0.114 0.155 10
3 I 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 7
4 I 0.425 0.3 0.095 0.3515 0.141 0.0775 0.12 8

At this point, you will notice something odd about the column names of this dataset: they look like data values. This is because the dataset is non-standard, and its column names are actually located in the last row. We can preview the last 5 rows to check.
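One way to avoid losing a data row to the header in the first place is to read the file with header=None. A minimal sketch on a toy CSV with the same quirk as the abalone file (the real column names sit in the last row; the values here are made up for illustration):

```python
import io

import pandas as pd

# Toy CSV reproducing the quirk: the real column names are in the
# LAST row, so a naive read would treat the first data row as the header.
raw = "M,0.455,15\nF,0.530,9\nSex,Length,Rings"

# header=None keeps every row as data and assigns integer column
# labels, so no row is silently consumed as the header.
df = pd.read_csv(io.StringIO(raw), header=None)
print(df.shape)              # all 3 rows preserved
print(df.iloc[-1].tolist())  # the misplaced header row
```

From here, the last row can be promoted to the column names and dropped from the data.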

Exercise 24.2

Challenge: Preview the last 5 rows of the dataset.

### Code starts ### (≈ 1 line of code)

### Code ends ###

Expected output

M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
4172 M 0.59 0.44 0.135 0.966 0.439 0.2145 0.2605 10
4173 M 0.6 0.475 0.205 1.176 0.5255 0.2875 0.308 9
4174 F 0.625 0.485 0.15 1.0945 0.531 0.261 0.296 10
4175 M 0.71 0.555 0.195 1.9485 0.9455 0.3765 0.495 12
4176 Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings

As shown in the above table, the row with index 4176 is actually the column name. Therefore, we need to put the column names in the correct position, that is, we need to regenerate the DataFrame composed of the dataset.
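One possible repair, sketched on a small toy frame rather than the real dataset (the toy values are invented; only the column names mirror the abalone data):

```python
import pandas as pd

# Toy frame reproducing the problem: the last row holds the real
# column names, and a data row was consumed as the header.
df = pd.DataFrame(
    [["M", "0.350", "7"],
     ["F", "0.530", "9"],
     ["Sex", "Length", "Rings"]],
    columns=["M", "0.455", "15"],  # a data row masquerading as the header
)

# 1. Read the real column names out of the last row.
columns = df.iloc[-1].tolist()
# 2. Put the old (wrong) header back in as the first data row.
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)
df = pd.concat([header_row, df], ignore_index=True)
# 3. Drop the misplaced name row and assign the correct names.
df = df.iloc[:-1].set_axis(columns, axis=1)
print(df)
```

After these steps, the frame has the proper header, the formerly lost data row is back at index 0, and the name row is gone.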

Exercise 24.3

Challenge: Set the correct column names for the dataset, keep the data row that was mistakenly used as the header, and drop the last row, which actually contains the column names.

### Code starts ### (≈ 3~7 lines of code)
df = None
### Code ends ###

Run the tests

pd.concat([df.head(2), df.tail(2)])

Expected output

Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
1 M 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 7
4175 F 0.625 0.485 0.15 1.0945 0.531 0.261 0.296 10
4176 M 0.71 0.555 0.195 1.9485 0.9455 0.3765 0.495 12

So far, we have restored the correct dataset format. The first 8 columns record physiological characteristics of the abalones, such as sex, length, and weight. The last column is the target, the number of rings (Rings), which ranges from 1 to 30; the larger the value, the older the abalone.

You may realize that predicting abalone age directly would be a regression problem. To turn it into a classification problem, this experiment defines abalones with 1 - 10 rings as small, those with 11 - 20 rings as middle, and those with 21 - 30 rings as large.

Meanwhile, for the three gender categories M, F, and I that appear in the Sex column, we replace them with the numerical values 0, 1, and 2 respectively.

Exercise 24.4

Challenge: Replace the target values (Rings) of the dataset with 3 categories according to the intervals, and replace the Sex column as required above.

Hint: To replace according to intervals, you can use pd.cut(), refer to the official documentation for details.

### Code starts ### (≈ 3~5 lines of code)

### Code ends ###

Run the tests

print(df.iloc[[3, 6, 12, 83]]["Rings"].values)
df.head()

Expected output

[small, middle, middle, large]
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 0 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 middle
1 0 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 small
2 1 0.53 0.42 0.135 0.677 0.2565 0.1415 0.21 small
3 0 0.44 0.365 0.125 0.516 0.2155 0.114 0.155 small
4 2 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 small

24.4. Equal Division into K-fold Subsets#

Next, we evenly divide the preprocessed dataset above into K subsets. Here, we use the K-fold splitting method provided by scikit-learn as follows:

sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

Among them, the parameters:

- n_splits : The value of K for K-fold splitting; must be at least 2. The default is 3 in older scikit-learn versions (5 since scikit-learn 0.22).
- shuffle : The default is False; when True, the data are shuffled before splitting.
- random_state : The default is None; the random seed, which only takes effect when shuffle=True.

Exercise 24.5

Challenge: Use KFold() to divide the dataset into 10 folds, specifying the parameters: shuffle=False, random_state=50.

### Code starts ### (≈ 2 lines of code)
kf = None
### Code ends ###

Run the tests

kf

Expected output

KFold(n_splits=10, random_state=50, shuffle=False)

If you want to obtain the splitting results, it can be achieved through a for loop:

# Run directly to view the results
for train_index, test_index in kf.split(df):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

Here, only the length of train_index is printed with len(). You can also try printing the contents of train_index yourself.
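On a tiny array the contents of the index arrays are easy to inspect. A minimal sketch with 6 samples and 3 folds (no shuffling, so the folds are consecutive blocks):

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 samples split into 3 folds: each fold serves as the test set
# exactly once while the remaining samples form the training set.
X = np.arange(6)
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Every sample appears in exactly one test fold, which is what makes the K validation scores cover the whole dataset.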

24.5. K-Fold Cross-Validation#

Above, we can conveniently complete K-fold splitting using KFold(), and you can use the splitting results to train and validate the model. For convenience, scikit-learn also provides a method for direct cross-validation, which can greatly reduce the amount of code:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

Main parameters:

- estimator : The model.
- X : The array of features.
- y : The array of target values.
- cv : The number of K-folds.
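A minimal sketch of cross_val_score, using the built-in iris dataset instead of the abalone data so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation of a default K-nearest neighbors model:
# the function fits and scores the model once per fold and returns
# one accuracy score for each.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores)        # 5 scores, one per fold
print(scores.mean()) # a single summary number for model comparison
```

The mean of the returned scores is the usual single-number summary used to compare models.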

Exercise 24.6

Challenge: Build a machine learning classification model using the K-nearest neighbors method and perform 10-fold cross-validation.

### Code starts ### (≈ 4~7 lines of code)
cross_val_score()
### Code ends ###

Expected output

array([0.75417661, 0.72009569, 0.77990431, 0.72966507, 0.73205742,
       0.74401914, 0.76498801, 0.74580336, 0.75779376, 0.73860911])

Since 10-fold cross-validation runs 10 experiments, 10 classification accuracy scores are output.

To train a classification model on the abalone dataset and quickly screen out the better-performing candidates, we can use K-fold cross-validation. Next, we obtain the 10-fold cross-validation results, under default parameters, of the models learned in this chapter.
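The screening loop can be sketched as follows. This sketch uses the built-in iris dataset and only three of the seven models so it stays self-contained; the exercise below applies the same pattern to all seven models on the abalone data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Run the same 10-fold cross-validation for every candidate model
# and keep the mean accuracy, so the models are directly comparable.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(),
}
scores = {name: cross_val_score(model, X, y, cv=10).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Because every model is evaluated on the same folds of the same data, the mean scores are a fair basis for a first-pass ranking.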

Exercise 24.7

Challenge: Use the 10-fold cross-validation method to test the performance of the abalone dataset under the default parameters of logistic regression, K-nearest neighbors, support vector machine, artificial neural network, decision tree, random forest, and Adaboost, and take the average of the 10-fold cross-validation results.

Hint: The method for calling the artificial neural network model is: sklearn.neural_network.MLPClassifier.

"""Load the classifier modules
"""
### Code starts ### (≈ 7 lines of code)

### Code ends ###

"""10-fold cross-validation function for each classification model
"""


def classifiers():
    """
    Parameters: none

    Returns:
    scores -- mean 10-fold cross-validation result for each model (list)
    """
    ### Code starts ### (> 10 lines of code)
    scores = []
    ### Code ends ###

    return scores

Run the test (it takes a long time to execute)

classifiers()

Reference output:

[0.7543513318292162, 0.7467112482377095, 0.7443091754229052, 0.7648902636235668,
 0.6856593267369182, 0.7455041701414079, 0.7318638234403398]

* The above output is for reference only; results may vary between runs.

You will find that even with default parameters, all 7 classification models perform reasonably well, with the support vector machine, artificial neural network, and random forest standing out. We can shortlist these methods and then obtain better results by tuning their parameters.
