23. Application of the Heterogeneous Ensemble Voting Method#
23.1. Introduction#
In the previous experiments on ensemble learning algorithms, we mainly introduced two categories: Bagging and Boosting. Among them, the voting method is mainly applied in Bagging. However, whether it is Bagging Trees or Random Forests, both use the decision tree algorithm as the base learner, which is homogeneous ensembling. In this challenge, we will learn to apply different algorithms for heterogeneous ensemble learning.
23.2. Key Points#
CART Decision Tree Classification
Grid Search Parameter Selection
In this challenge, we still use the student performance dataset from the ensemble learning experiment. First, we load the dataset and split it into training and test sets.
wget -nc https://cdn.aibydoing.com/aibydoing/files/course-14-student.csv
import pandas as pd
from sklearn.model_selection import train_test_split
stu_data = pd.read_csv("course-14-student.csv", index_col=0)
X_train, X_test, y_train, y_test = train_test_split(
stu_data.iloc[:, :-1], stu_data["G3"], test_size=0.3, random_state=35
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((276, 26), (119, 26), (276,), (119,))
23.3. Voting Classifier#
Next, we introduce the `VotingClassifier` provided by scikit-learn (see the official documentation):
sklearn.ensemble.VotingClassifier(estimators, voting='hard')
Among them:

- `estimators`: multiple different classifiers can be passed in as a list of tuples, e.g. `[('name1', model1), ('name2', model2)]`.
- `voting`: can be either `hard` or `soft`.
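For example, here is a minimal sketch of how a `VotingClassifier` can be constructed from two different kinds of models; the estimator names `'lr'` and `'dt'` are arbitrary labels chosen for this illustration:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Each tuple pairs an arbitrary name with an estimator instance
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
    ],
    voting="hard",  # majority voting on the predicted class labels
)
# voting_clf.fit(X_train, y_train) and voting_clf.predict(X_test)
# then work exactly like any other scikit-learn estimator.
```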
When `voting='hard'`, it is equivalent to the majority voting method mentioned earlier. For example, for a certain sample:

- Classifier 1 → Class 1
- Classifier 2 → Class 1
- Classifier 3 → Class 2

Finally, the sample is predicted to belong to Class 1.
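As a quick illustration of the majority vote above (this is only a sketch of the idea, not how `VotingClassifier` is implemented internally):

```python
from collections import Counter

# Predictions from the three classifiers for one sample
predictions = ["Class 1", "Class 1", "Class 2"]

# Hard voting keeps the most frequent label
majority_label, votes = Counter(predictions).most_common(1)[0]
print(majority_label, votes)  # Class 1 2
```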
When `voting='soft'`, it is equivalent to the weighted voting method mentioned earlier. For example, for a certain sample, we preset the weights of the three classifiers as \(w_1 = 1\), \(w_2 = 1\), \(w_3 = 1\). Then, from the class probabilities returned by each classifier, we can obtain the weighted average probability for each of the three classes. An example calculation table is as follows:
| Classifier | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| Classifier 1 | w1 * 0.2 | w1 * 0.5 | w1 * 0.3 |
| Classifier 2 | w2 * 0.6 | w2 * 0.3 | w2 * 0.1 |
| Classifier 3 | w3 * 0.3 | w3 * 0.4 | w3 * 0.3 |
| Average Probability | 0.37 | 0.4 | 0.23 |
In the table, taking Classifier 1 as an example, the three class probabilities it returns are 0.2, 0.5, and 0.3, each of which is multiplied by its preset weight \(w_1\) (likewise, the probabilities of Classifiers 2 and 3 are multiplied by \(w_2\) and \(w_3\)). Since Class 2 has the largest weighted average probability (0.4), the sample is determined to belong to Class 2.
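The averaged probabilities in the table can be verified with a short NumPy sketch; the probabilities and the equal weights are simply the example values from the table above:

```python
import numpy as np

# Class probabilities returned by the three classifiers (one row per classifier)
probas = np.array([
    [0.2, 0.5, 0.3],   # Classifier 1
    [0.6, 0.3, 0.1],   # Classifier 2
    [0.3, 0.4, 0.3],   # Classifier 3
])
weights = np.array([1, 1, 1])  # preset classifier weights w1, w2, w3

# Weighted average probability for each class
avg_proba = np.average(probas, axis=0, weights=weights)
print(avg_proba)           # ≈ [0.367, 0.4, 0.233]
print(avg_proba.argmax())  # 1 -> Class 2 has the largest average probability
```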
Exercise 23.1
Challenge: Learn and use VotingClassifier to complete heterogeneous ensemble voting classification.
Requirement: Compare the test-set results of three individual classifiers (Logistic Regression, Decision Tree, and Gaussian Naive Bayes) with the VotingClassifier composed of them. You may choose the parameters freely.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
### Code starts ### (> 5 lines of code)
### Code ends ###
Solution to Exercise 23.1
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

### Code starts ### (> 5 lines of code)
# Three heterogeneous individual classifiers
clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)
clf3 = GaussianNB()

# Heterogeneous ensemble that combines the three classifiers by majority voting
eclf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('gnb', clf3)], voting='hard')

# Train each classifier and report its accuracy on the test set
for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['LogisticRegression:', 'DecisionTreeClassifier:',
                       'GaussianNB:', 'VotingClassifier:']):
    clf.fit(X_train, y_train)
    scores = clf.score(X_test, y_test)
    print(label, round(scores, 2))
### Code ends ###
Finally, you should obtain the classification accuracies of the three individual classifiers and the VotingClassifier on the test set.
Expected output
LogisticRegression: 0.76
DecisionTreeClassifier: 0.87
GaussianNB: 0.53
VotingClassifier: 0.78
The results given above are for reference only, and the values may change with parameters.
While completing the challenge, you may find that the `VotingClassifier` is not necessarily better than every individual classifier, just as in the reference output above. In fact, this is normal. Individual learners tend to overfit the training data, whereas the `VotingClassifier` can mitigate this because it uses majority voting. In addition, the result of `voting='soft'` may be worse than that of `voting='hard'`, because it is not always possible to set the classifier weights accurately.
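If you would like to experiment with weighted soft voting yourself, `VotingClassifier` also accepts a `weights` parameter. Below is a minimal sketch that reuses the data split from earlier; the weights `[1, 2, 1]` are arbitrary example values, not tuned choices:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages the predict_proba outputs of the base classifiers,
# optionally weighted per classifier via the `weights` parameter.
eclf_soft = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1)),
        ('dt', DecisionTreeClassifier(random_state=1)),
        ('gnb', GaussianNB()),
    ],
    voting='soft',
    weights=[1, 2, 1],  # arbitrary example weights for the three classifiers
)
eclf_soft.fit(X_train, y_train)
print('Soft VotingClassifier:', round(eclf_soft.score(X_test, y_test), 2))
```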
Therefore, the `VotingClassifier` may not be the best-performing classifier on a given dataset, but it should be a classifier that is better at avoiding overfitting.