
24. Quickly Select Models Using Cross-Validation#

24.1. Introduction#

In previous courses, we studied a number of different classification methods. In real engineering practice, how do you quickly find the most suitable one among them? This is a question of model selection. In this challenge, we will learn the K-fold cross-validation method and use it to complete a model selection task.

24.2. Key Points#

  • K-fold Cross-Validation

  • Equal Division into K-fold Subsets

  • Abalone Age Classification

We have learned several different classification methods in supervised learning, and by now you should have a solid grasp of their principles and implementations. A natural question then arises: how do we quickly select the method best suited to the current application scenario?

To answer this question, let's first review the K-nearest neighbors experiment. At the end of that course, the experiment covered selecting the value of K: we simply tried every K from 2 to 10 and kept the one that performed best. Although this approach looks a bit "brute-force", it works well.

Model selection works in much the same way. To quickly pick a suitable model from several candidates, we can feed the data into each model in turn and look for the one with the stronger generalization ability. Although this is a "brute-force" method, there are some clever techniques involved, one of which is the K-fold cross-validation introduced today.

Regarding cross-validation, we have actually been exposed to it before, so we won’t repeat the concept here.

24.3. Data Set Preprocessing#

Now let's officially start this challenge. It uses the abalone age dataset challenge-6-abalone.csv, which you first need to download from the following link:

# Dataset Download Link
https://cdn.aibydoing.com/aibydoing/files/challenge-6-abalone.csv

Exercise 24.1

Challenge: Load and preview the first 5 rows of the dataset.

import pandas as pd

### Code starts ### (≈ 2 lines of code)
df = None
### Code ends ###

Expected output

M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
0 M 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 7
1 F 0.53 0.42 0.135 0.677 0.2565 0.1415 0.21 9
2 M 0.44 0.365 0.125 0.516 0.2155 0.114 0.155 10
3 I 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 7
4 I 0.425 0.3 0.095 0.3515 0.141 0.0775 0.12 8

At this point, you will notice something odd about the column names of this dataset: they look like data values. This is because the dataset is non-standard, and its column names are actually located in the last row. We can preview the last 5 rows to check.
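One way to avoid losing a data row to the header in the first place is to read the file with header=None. A minimal sketch on a toy CSV with the same quirk as the abalone file (the real column names sit in the last row; the values here are made up for illustration):

```python
import io

import pandas as pd

# Toy CSV reproducing the quirk: the real column names are in the
# LAST row, so a naive read would treat the first data row as the header.
raw = "M,0.455,15\nF,0.530,9\nSex,Length,Rings"

# header=None keeps every row as data and assigns integer column
# labels, so no row is silently consumed as the header.
df = pd.read_csv(io.StringIO(raw), header=None)
print(df.shape)              # all 3 rows preserved
print(df.iloc[-1].tolist())  # the misplaced header row
```

From here, the last row can be promoted to the column names and dropped from the data.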

Exercise 24.2

Challenge: Preview the last 5 rows of the dataset.

### Code starts ### (≈ 1 line of code)

### Code ends ###

Expected output

M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
4172 M 0.59 0.44 0.135 0.966 0.439 0.2145 0.2605 10
4173 M 0.6 0.475 0.205 1.176 0.5255 0.2875 0.308 9
4174 F 0.625 0.485 0.15 1.0945 0.531 0.261 0.296 10
4175 M 0.71 0.555 0.195 1.9485 0.9455 0.3765 0.495 12
4176 Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings

As shown in the above table, the row with index 4176 is actually the column name. Therefore, we need to put the column names in the correct position, that is, we need to regenerate the DataFrame composed of the dataset.
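One possible repair, sketched on a small toy frame rather than the real dataset (the toy values are invented; only the column names mirror the abalone data):

```python
import pandas as pd

# Toy frame reproducing the problem: the last row holds the real
# column names, and a data row was consumed as the header.
df = pd.DataFrame(
    [["M", "0.350", "7"],
     ["F", "0.530", "9"],
     ["Sex", "Length", "Rings"]],
    columns=["M", "0.455", "15"],  # a data row masquerading as the header
)

# 1. Read the real column names out of the last row.
columns = df.iloc[-1].tolist()
# 2. Put the old (wrong) header back in as the first data row.
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)
df = pd.concat([header_row, df], ignore_index=True)
# 3. Drop the misplaced name row and assign the correct names.
df = df.iloc[:-1].set_axis(columns, axis=1)
print(df)
```

After these steps, the frame has the proper header, the formerly lost data row is back at index 0, and the name row is gone.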

Exercise 24.3

Challenge: Set the correct column names for the dataset, keep the data row that was mistakenly used as the header, and drop the last row, which actually contains the column names.

### Code starts ### (≈ 3~7 lines of code)
df = None
### Code ends ###

Run the tests

pd.concat([df.head(2), df.tail(2)])

Expected output

Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
1 M 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 7
4175 F 0.625 0.485 0.15 1.0945 0.531 0.261 0.296 10
4176 M 0.71 0.555 0.195 1.9485 0.9455 0.3765 0.495 12

So far, we have restored the correct dataset format. The first 8 columns record physiological characteristics of the abalones, such as sex, length, and weight. The last column is the target, the number of rings (Rings), which ranges from 1 to 30; the larger the value, the older the abalone.

You may realize that predicting abalone age directly would be a regression problem. To turn it into a classification problem, this experiment defines abalones with 1 - 10 rings as small, those with 11 - 20 rings as middle, and those with 21 - 30 rings as large.

Meanwhile, for the three gender categories M, F, and I that appear in the Sex column, we replace them with the numerical values 0, 1, and 2 respectively.

Exercise 24.4

Challenge: Replace the target values (Rings) of the dataset with 3 categories according to the intervals, and replace the Sex column as required above.

Hint: To replace according to intervals, you can use pd.cut(), refer to the official documentation for details.

### Code starts ### (≈ 3~5 lines of code)

### Code ends ###

Run the tests

print(df.iloc[[3, 6, 12, 83]]["Rings"].values)
df.head()

Expected output

[small, middle, middle, large]
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 0 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 middle
1 0 0.35 0.265 0.09 0.2255 0.0995 0.0485 0.07 small
2 1 0.53 0.42 0.135 0.677 0.2565 0.1415 0.21 small
3 0 0.44 0.365 0.125 0.516 0.2155 0.114 0.155 small
4 2 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 small

24.4. Equal Division into K-fold Subsets#

Next, we evenly divide the preprocessed dataset above into K subsets. Here, we use the K-fold splitting method provided by scikit-learn as follows:

sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

Among them, the parameters:

- n_splits : The value of K for K-fold splitting; must be at least 2. The default is 3 in older scikit-learn versions (5 since scikit-learn 0.22).
- shuffle : The default is False; when True, the data are shuffled before splitting.
- random_state : The default is None; the random seed, which only takes effect when shuffle=True.

Exercise 24.5

Challenge: Use KFold() to divide the dataset into 10 folds, specifying the parameters: shuffle=False, random_state=50.

### Code starts ### (≈ 2 lines of code)
kf = None
### Code ends ###

Run the tests

kf

Expected output

KFold(n_splits=10, random_state=50, shuffle=False)

If you want to obtain the splitting results, it can be achieved through a for loop:

# Run directly to view the results
for train_index, test_index in kf.split(df):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

Here, only the length of train_index is printed with len(). You can also try printing the contents of train_index yourself.
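On a tiny array the contents of the index arrays are easy to inspect. A minimal sketch with 6 samples and 3 folds (no shuffling, so the folds are consecutive blocks):

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 samples split into 3 folds: each fold serves as the test set
# exactly once while the remaining samples form the training set.
X = np.arange(6)
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Every sample appears in exactly one test fold, which is what makes the K validation scores cover the whole dataset.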

24.5. K-Fold Cross-Validation#

Above, we can conveniently complete K-fold splitting using KFold(), and you can use the splitting results to train and validate the model. For convenience, scikit-learn also provides a method for direct cross-validation, which can greatly reduce the amount of code:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

Main parameters:

- estimator : The model.
- X : The array of features.
- y : The array of target values.
- cv : The number of K-folds.
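A minimal sketch of cross_val_score, using the built-in iris dataset instead of the abalone data so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation of a default K-nearest neighbors model:
# the function fits and scores the model once per fold and returns
# one accuracy score for each.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores)        # 5 scores, one per fold
print(scores.mean()) # a single summary number for model comparison
```

The mean of the returned scores is the usual single-number summary used to compare models.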

Exercise 24.6

Challenge: Build a machine learning classification model using the K-nearest neighbors method and perform 10-fold cross-validation.

### Code starts ### (≈ 4~7 lines of code)
cross_val_score()
### Code ends ###

Expected output

array([0.75417661, 0.72009569, 0.77990431, 0.72966507, 0.73205742,
       0.74401914, 0.76498801, 0.74580336, 0.75779376, 0.73860911])

Since 10-fold cross-validation runs 10 experiments, 10 classification accuracy scores are output.

To train a classification model on the abalone dataset and quickly screen out the better-performing candidates, we can use K-fold cross-validation. Next, we obtain the 10-fold cross-validation results, under default parameters, of the models learned in this chapter.
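The screening loop can be sketched as follows. This sketch uses the built-in iris dataset and only three of the seven models so it stays self-contained; the exercise below applies the same pattern to all seven models on the abalone data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Run the same 10-fold cross-validation for every candidate model
# and keep the mean accuracy, so the models are directly comparable.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(),
}
scores = {name: cross_val_score(model, X, y, cv=10).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Because every model is evaluated on the same folds of the same data, the mean scores are a fair basis for a first-pass ranking.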

Exercise 24.7

Challenge: Use the 10-fold cross-validation method to test the performance of the abalone dataset under the default parameters of logistic regression, K-nearest neighbors, support vector machine, artificial neural network, decision tree, random forest, and Adaboost, and take the average of the 10-fold cross-validation results.

Hint: The method for calling the artificial neural network model is: sklearn.neural_network.MLPClassifier.

"""Load the classifier modules
"""
### Code starts ### (≈ 7 lines of code)

### Code ends ###

"""10-fold cross-validation function for each classification model
"""


def classifiers():
    """
    Parameters: none

    Returns:
    scores -- mean 10-fold cross-validation result for each model (list)
    """
    ### Code starts ### (> 10 lines of code)
    scores = []
    ### Code ends ###

    return scores

Run the test (it takes a long time to execute)

classifiers()

Reference output:

[0.7543513318292162, 0.7467112482377095, 0.7443091754229052, 0.7648902636235668,
 0.6856593267369182, 0.7455041701414079, 0.7318638234403398]

* The above output is for reference only; results may vary between runs.

You will find that even with default parameters, all 7 classification models perform reasonably well, with the support vector machine, artificial neural network, and random forest standing out. We can shortlist these methods and then obtain better results by tuning their parameters.
