
23. Application of the Heterogeneous Ensemble Voting Method#

23.1. Introduction#

In the previous experiments on ensemble learning algorithms, we mainly introduced two categories: Bagging and Boosting. The voting method is mainly used in Bagging. However, both Bagging Trees and Random Forests build a homogeneous ensemble from the decision tree algorithm. In this challenge, we will learn to combine different algorithms for heterogeneous ensemble learning.

23.2. Key Points#

  • CART Decision Tree Classification

  • Grid Search Parameter Selection

In this challenge, we still use the student performance dataset from the ensemble learning experiment. First, we load the dataset and split it into training and test sets.

wget -nc https://cdn.aibydoing.com/aibydoing/files/course-14-student.csv

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the student performance dataset
stu_data = pd.read_csv("course-14-student.csv", index_col=0)

# Use the last column G3 as the target and hold out 30% of samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    stu_data.iloc[:, :-1], stu_data["G3"], test_size=0.3, random_state=35
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((276, 26), (119, 26), (276,), (119,))

23.3. Voting Classifier#

Next, we introduce the VotingClassifier class provided by scikit-learn.

sklearn.ensemble.VotingClassifier(estimators, voting='hard')

Where:

  • estimators: Multiple different classifiers can be passed in as a list of tuples [('name1', model1), ('name2', model2)].

  • voting: Can be either hard or soft.

When voting='hard', it is equivalent to the majority voting method mentioned earlier. For example, suppose three classifiers make the following predictions for a given sample:

  • Classifier 1 → Class 1

  • Classifier 2 → Class 1

  • Classifier 3 → Class 2

According to the majority vote, the sample is predicted to belong to Class 1.
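As a quick illustration, here is a minimal sketch of hard voting; the predictions below are simply the hypothetical labels from the example above:

from collections import Counter

# Hypothetical predictions from the three classifiers for one sample
predictions = ["Class 1", "Class 1", "Class 2"]

# Hard voting: the most frequent predicted label wins
majority_label, votes = Counter(predictions).most_common(1)[0]
print(majority_label, votes)  # Class 1 2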

When voting='soft', it is equivalent to the weighted voting method mentioned earlier. For example, for a certain sample, suppose we preset the weights of the three classifiers as \(w_1 = 1\), \(w_2 = 1\), \(w_3 = 1\). Then, using the class probabilities returned by each classifier, we can compute the weighted average probability of each class. An example calculation is shown in the table below:

| Classifier | Class 1 | Class 2 | Class 3 |
| --- | --- | --- | --- |
| Classifier 1 | \(w_1 \times 0.2\) | \(w_1 \times 0.5\) | \(w_1 \times 0.3\) |
| Classifier 2 | \(w_2 \times 0.6\) | \(w_2 \times 0.3\) | \(w_2 \times 0.1\) |
| Classifier 3 | \(w_3 \times 0.3\) | \(w_3 \times 0.4\) | \(w_3 \times 0.3\) |
| Average probability | 0.37 | 0.40 | 0.23 |

In the table, taking Classifier 1 as an example, the three class probabilities it returns are 0.2, 0.5, and 0.3, and each of them is multiplied by that classifier's preset weight \(w_1\). Averaging the weighted probabilities over the three classifiers column by column, Class 2 has the largest average weighted probability (0.40), so the sample is predicted to belong to Class 2.
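The averages in the table can be checked with a few lines of NumPy; the probabilities and weights below are simply the example values above:

import numpy as np

# Class probabilities returned by the three classifiers (one row per classifier)
probas = np.array([
    [0.2, 0.5, 0.3],  # Classifier 1
    [0.6, 0.3, 0.1],  # Classifier 2
    [0.3, 0.4, 0.3],  # Classifier 3
])
weights = np.array([1, 1, 1])  # preset classifier weights w1, w2, w3

# Soft voting: weighted average probability per class, then take the largest
avg_proba = np.average(probas, axis=0, weights=weights)
print(avg_proba)           # [0.36666667 0.4        0.23333333]
print(avg_proba.argmax())  # 1 -> Class 2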

Exercise 23.1

Challenge: Learn and use VotingClassifier to complete heterogeneous ensemble voting classification.

Requirement: Compare the test results of three “individual classifiers”, namely Logistic Regression, Decision Tree, and Gaussian Naive Bayes, with those of the VotingClassifier composed of them. You are free to choose the parameters.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

## Code starts ### (> 5 lines of code)

## Code ends ###

Finally, you should obtain the classification accuracy of each of the three classifiers and of the VotingClassifier on the test set.

Expected output

LogisticRegression: 0.76
DecisionTreeClassifier: 0.87
GaussianNB: 0.53
VotingClassifier: 0.78

The results given above are for reference only, and the values may change with parameters.

While completing the challenge, you may find that the VotingClassifier does not necessarily outperform every individual classifier, just like the reference output above. This is in fact normal. Individual learners tend to overfit the training data, whereas the VotingClassifier, by aggregating predictions through majority voting, can mitigate this. In addition, the result with voting='soft' may be worse than with voting='hard', because it is not always possible to set the classifier weights accurately.

Therefore, the VotingClassifier may not be the best-performing classifier on a particular dataset, but it is generally better at avoiding overfitting.
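For reference, a VotingClassifier can be assembled as in the minimal sketch below. It uses scikit-learn's iris dataset and two arbitrarily chosen base learners purely as stand-ins, so it is not the solution to the challenge above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Heterogeneous base classifiers passed as (name, model) tuples
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
]

# Hard voting: majority vote over predicted labels
hard_vote = VotingClassifier(estimators=estimators, voting="hard").fit(X_tr, y_tr)
# Soft voting: average of predicted class probabilities
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X_tr, y_tr)

print(hard_vote.score(X_te, y_te), soft_vote.score(X_te, y_te))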

