23. Application of the Heterogeneous Ensemble Voting Method#
23.1. Introduction#
In the previous experiments on ensemble learning algorithms, we mainly introduced two categories: Bagging and Boosting. Among them, the voting method is mainly applied in Bagging. However, whether it is Bagging Trees or Random Forests, both use the decision tree algorithm as the base learner, which is homogeneous ensembling. In this challenge, we will learn to apply different algorithms for heterogeneous ensemble learning.
23.2. Key Points#
CART Decision Tree Classification
Grid Search Parameter Selection
In this challenge, we still use the student performance dataset from the ensemble learning experiment. First, we load the dataset and split it into training and test sets.
wget -nc https://cdn.aibydoing.com/aibydoing/files/course-14-student.csv
import pandas as pd
from sklearn.model_selection import train_test_split
stu_data = pd.read_csv("course-14-student.csv", index_col=0)
X_train, X_test, y_train, y_test = train_test_split(
stu_data.iloc[:, :-1], stu_data["G3"], test_size=0.3, random_state=35
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((276, 26), (119, 26), (276,), (119,))
23.3. Voting Classifier#
Next, we introduce the `VotingClassifier` provided by scikit-learn (see the official documentation):
sklearn.ensemble.VotingClassifier(estimators, voting='hard')
Among them:

- `estimators`: multiple different classifiers can be passed in as a list of tuples, e.g. `[('name1', model1), ('name2', model2)]`.
- `voting`: can be either `hard` or `soft`.
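For example, here is a minimal sketch of how a `VotingClassifier` can be constructed from two different kinds of models; the estimator names `'lr'` and `'dt'` are arbitrary labels chosen for this illustration:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Each tuple pairs an arbitrary name with an estimator instance
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
    ],
    voting="hard",  # majority voting on the predicted class labels
)
# voting_clf.fit(X_train, y_train) and voting_clf.predict(X_test)
# then work exactly like any other scikit-learn estimator.
```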
When `voting='hard'`, it is equivalent to the majority voting method mentioned earlier. For example, for a certain sample:

- Classifier 1 → Class 1
- Classifier 2 → Class 1
- Classifier 3 → Class 2

Finally, the sample is predicted to belong to Class 1.
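As a quick illustration of the majority vote above (this is only a sketch of the idea, not how `VotingClassifier` is implemented internally):

```python
from collections import Counter

# Predictions from the three classifiers for one sample
predictions = ["Class 1", "Class 1", "Class 2"]

# Hard voting keeps the most frequent label
majority_label, votes = Counter(predictions).most_common(1)[0]
print(majority_label, votes)  # Class 1 2
```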
When `voting='soft'`, it is equivalent to the weighted voting method mentioned earlier. For example, for a certain sample, we preset the weights of the three classifiers as \(w_1 = 1\), \(w_2 = 1\), \(w_3 = 1\). Then, from the class probabilities returned by each classifier, we can obtain the weighted average probability for each of the three classes. An example calculation table is as follows:
| Classifier | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| Classifier 1 | w1 * 0.2 | w1 * 0.5 | w1 * 0.3 |
| Classifier 2 | w2 * 0.6 | w2 * 0.3 | w2 * 0.1 |
| Classifier 3 | w3 * 0.3 | w3 * 0.4 | w3 * 0.3 |
| Average Probability | 0.37 | 0.4 | 0.23 |
In the table, taking Classifier 1 as an example, the three class probabilities it returns are 0.2, 0.5, and 0.3, each of which is multiplied by its preset weight \(w_1\) (likewise, the probabilities of Classifiers 2 and 3 are multiplied by \(w_2\) and \(w_3\)). Since Class 2 has the largest weighted average probability (0.4), the sample is determined to belong to Class 2.
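The averaged probabilities in the table can be verified with a short NumPy sketch; the probabilities and the equal weights are simply the example values from the table above:

```python
import numpy as np

# Class probabilities returned by the three classifiers (one row per classifier)
probas = np.array([
    [0.2, 0.5, 0.3],   # Classifier 1
    [0.6, 0.3, 0.1],   # Classifier 2
    [0.3, 0.4, 0.3],   # Classifier 3
])
weights = np.array([1, 1, 1])  # preset classifier weights w1, w2, w3

# Weighted average probability for each class
avg_proba = np.average(probas, axis=0, weights=weights)
print(avg_proba)           # ≈ [0.367, 0.4, 0.233]
print(avg_proba.argmax())  # 1 -> Class 2 has the largest average probability
```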
Exercise 23.1
Challenge: Learn and use VotingClassifier to complete heterogeneous ensemble voting classification.
Requirement: Compare the test-set results of three individual classifiers (Logistic Regression, Decision Tree, and Gaussian Naive Bayes) with the VotingClassifier composed of them. You may choose the parameters freely.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
### Code starts ### (> 5 lines of code)
### Code ends ###
Solution to Exercise 23.1
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

### Code starts ### (> 5 lines of code)
# Three heterogeneous individual classifiers
clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)
clf3 = GaussianNB()

# Heterogeneous ensemble that combines the three classifiers by majority voting
eclf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('gnb', clf3)], voting='hard')

# Train each classifier and report its accuracy on the test set
for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['LogisticRegression:', 'DecisionTreeClassifier:',
                       'GaussianNB:', 'VotingClassifier:']):
    clf.fit(X_train, y_train)
    scores = clf.score(X_test, y_test)
    print(label, round(scores, 2))
### Code ends ###
Finally, you should obtain the classification accuracies of the three individual classifiers and the VotingClassifier on the test set.
Expected output
LogisticRegression: 0.76
DecisionTreeClassifier: 0.87
GaussianNB: 0.53
VotingClassifier: 0.78
The results given above are for reference only, and the values may change with parameters.
While completing the challenge, you may find that the `VotingClassifier` is not necessarily better than every individual classifier, just as in the reference output above. In fact, this is normal. Individual learners tend to overfit the training data, whereas the `VotingClassifier` can mitigate this because it uses majority voting. In addition, the result of `voting='soft'` may be worse than that of `voting='hard'`, because it is not always possible to set the classifier weights accurately.
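If you would like to experiment with weighted soft voting yourself, `VotingClassifier` also accepts a `weights` parameter. Below is a minimal sketch that reuses the data split from earlier; the weights `[1, 2, 1]` are arbitrary example values, not tuned choices:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages the predict_proba outputs of the base classifiers,
# optionally weighted per classifier via the `weights` parameter.
eclf_soft = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1)),
        ('dt', DecisionTreeClassifier(random_state=1)),
        ('gnb', GaussianNB()),
    ],
    voting='soft',
    weights=[1, 2, 1],  # arbitrary example weights for the three classifiers
)
eclf_soft.fit(X_train, y_train)
print('Soft VotingClassifier:', round(eclf_soft.score(X_test, y_test), 2))
```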
Therefore, the `VotingClassifier` may not be the best-performing classifier on a given dataset, but it should be a classifier that is better at avoiding overfitting.