17. Classification Model Evaluation Methods#
17.1. Introduction#
In the previous classification experiments, we used accuracy as the only method to evaluate the model. In fact, there are many other evaluation methods for classification models. In this experiment, we will learn about other commonly used methods to have a more comprehensive understanding of the evaluation of classification models.
17.2. Key Points#
Accuracy
Precision
Recall
F1 Score
ROC Curve
In the previous experiment, only accuracy was introduced as an evaluation method for classification models. Next, we will comprehensively understand the commonly used evaluation metrics for classification models. For better understanding, a logistic regression will be used here to build a risk classification prediction model for credit card holders.
17.3. Dataset Introduction#
The dataset is in CSV format. You can use Pandas to read and preview the dataset.
wget -nc https://cdn.aibydoing.com/aibydoing/files/credit_risk_train.csv
import pandas as pd
df = pd.read_csv("credit_risk_train.csv")  # read the data file
df.head()
|   | BILL_1 | BILL_2 | BILL_3 | BILL_4 | BILL_5 | BILL_6 | AGE | SEX | EDUCATION | MARRIAGE | RISK |
|---|--------|--------|--------|--------|--------|--------|-----|-----|-----------|----------|------|
| 0 | 0      | 0      | 0      | 0      | 0      | 0      | 37  | Female | Graduate School | Married | LOW  |
| 1 | 8525   | 5141   | 5239   | 7911   | 17890  | 10000  | 25  | Male   | High School     | Single  | HIGH |
| 2 | 628    | 662    | 596    | 630    | 664    | 598    | 39  | Male   | Graduate School | Married | HIGH |
| 3 | 4649   | 3964   | 3281   | 934    | 467    | 12871  | 41  | Female | Graduate School | Single  | HIGH |
| 4 | 46300  | 10849  | 8857   | 9658   | 9359   | 9554   | 55  | Female | High School     | Married | HIGH |
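Before preprocessing, it is worth checking the size of the dataset and how balanced the LOW and HIGH classes are. A quick inspection sketch (reusing the `df` loaded above; outputs are not shown here):

```python
# Overall size of the dataset and the number of samples in each risk class
df.shape, df["RISK"].value_counts()
```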
This dataset contains 10 columns of features and one column of class labels. Among them:
- Columns 1 to 6 are the customers' recent historical bill information. (Features)
- Column 7 is the age of the customer. (Feature)
- Column 8 is the gender of the customer. (Feature)
- Column 9 is the education level of the customer. (Feature)
- Column 10 is the marital status of the customer. (Feature)
- Column 11 is the risk status of the customer's credit card holding. (Classification label: LOW, HIGH)
Our goal is to use this dataset to train a credit card holder risk prediction model and evaluate the model. First, following the process of machine learning modeling, we need to divide the dataset into a training set and a test set.
Although the above dataset looks tidy, note that the features in columns 8, 9, and 10 (gender, education level, and marital status) are categorical (for example, Female / Male). Therefore, before splitting the dataset, we will apply the one-hot encoding introduced in data preprocessing to convert these categorical features into numerical features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
df.RISK = df.RISK.replace({"LOW": 0, "HIGH": 1})  # replace the class labels with numeric values for later calculations
train_data = df.iloc[:, :-1]  # feature columns
train_data = pd.get_dummies(train_data)  # one-hot encode the categorical features
train_data = scale(train_data)  # standardize the features
train_target = df["RISK"]  # target column
# Split the dataset: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
train_data, train_target, test_size=0.3, random_state=0
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((14000, 16), (6000, 16), (14000,), (6000,))
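The 16 feature columns come from the 7 numerical columns (BILL_1 to BILL_6 and AGE) plus the indicator columns produced by one-hot encoding SEX, EDUCATION, and MARRIAGE. If you want to see exactly which columns the encoding creates, one way (reusing `df` from above) is:

```python
# The bill columns and AGE stay numeric, while each category of SEX,
# EDUCATION and MARRIAGE becomes its own indicator column.
pd.get_dummies(df.iloc[:, :-1]).columns
```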
Next, we use scikit-learn to build a logistic regression classification model, train it on the training data, and evaluate it on the test data. Training a model with scikit-learn is very simple: after instantiating the corresponding model class, training is completed with `fit()`.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="lbfgs")  # define the logistic regression model
model.fit(X_train, y_train)  # train the model on the training data
LogisticRegression()
For classification models, we generally use accuracy to evaluate the model. Specifically, we make predictions on the test set using the already trained model, and then calculate the prediction accuracy.
17.4. Accuracy#
In the credit card risk prediction model, the target values correspond to two categories, which is often referred to as a binary classification problem. In a binary classification problem, we often define the positive class and the negative class. For example, here we define HIGH as the positive class and LOW as the negative class (you can also define it the other way around). Then, we can present the confusion matrix of the actual classes (row names) and the predicted classes (column names) as follows.
| Credit Risk | HIGH | LOW |
|---|---|---|
| HIGH | True Positive (TP) | False Negative (FN) |
| LOW | False Positive (FP) | True Negative (TN) |
Specifically, for the above table:

- TP: the number of positive samples predicted as positive → correct prediction
- TN: the number of negative samples predicted as negative → correct prediction
- FP: the number of negative samples predicted as positive → incorrect prediction
- FN: the number of positive samples predicted as negative → missed prediction
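As a concrete illustration of these four counts, here is a minimal sketch using made-up toy labels (not the credit card data). scikit-learn's `confusion_matrix` returns the counts directly:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = HIGH (positive), 0 = LOW (negative)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# With the default label order [0, 1], ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
tn, fp, fn, tp  # (3, 1, 1, 3) for these toy labels
```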
Based on this confusion matrix, we can present the calculation methods for the common evaluation metrics of the classification model.
Accuracy is the proportion of correctly classified samples among all test samples. In terms of the confusion matrix, the formula for calculating accuracy is:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Of course, for the convenience of the implementation below, we can also write the classification accuracy in the following form:

$$
\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I(\bar{y_{i}} = y_{i})
$$

Here \(N\) is the total number of test samples, \(\bar{y_{i}}\) is the predicted class of the \(i\)-th sample, \(y_{i}\) is its true class, and \(I\) is the indicator function, which equals 1 when \(\bar{y_{i}}\) and \(y_{i}\) are the same and 0 otherwise, so the sum counts the number of correct predictions.
First, we need to obtain the prediction results of the model. This step is also very simple: we only need to call the `predict` method.
y_pred = model.predict(X_test)  # predict on the test set features
y_pred
array([1, 1, 1, ..., 1, 1, 1])
import numpy as np

def get_accuracy(test_labels, pred_labels):
    # Accuracy computed from the indicator-function form of the formula above
    correct = np.sum(test_labels == pred_labels)  # number of correctly predicted samples
    n = len(test_labels)  # total number of test samples
    acc = correct / n
    return acc
Now, we only need to pass in the true labels `y_test` and the model's predicted labels `y_pred`.
get_accuracy(y_test, y_pred)  # compute the model's prediction accuracy
0.7678333333333334
You can also directly use the accuracy function provided by scikit-learn: `sklearn.metrics.accuracy_score(y_true, y_pred)`.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # pass in the true and predicted classes
0.7678333333333334
In fact, when building a model with scikit-learn, you can also call `model.score()` directly to obtain the classification accuracy:
model.score(X_test, y_test)  # pass in the test features and true classes
0.7678333333333334
Above, we covered three common ways to compute classification accuracy. Classification is the most frequently encountered type of modeling problem in machine learning, and besides accuracy there are several other commonly used performance metrics for evaluating classification models, which are introduced below.
17.5. Precision#
Precision, also known as positive predictive value, is the ratio of the number of correctly classified positive examples to the number of examples classified as positive.
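In terms of the confusion matrix entries defined above, precision is:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$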
The scikit-learn function for precision: `sklearn.metrics.precision_score(y_true, y_pred)`.
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)
0.7678333333333334
17.6. Recall#
Recall, also known as sensitivity, is the ratio of the number of correctly classified positive examples to the number of actual positive examples.
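In terms of the confusion matrix entries, recall is:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$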
The scikit-learn function for recall: `sklearn.metrics.recall_score(y_true, y_pred)`.
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)
1.0
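A recall of exactly 1.0 together with a precision equal to the accuracy suggests that the model may be predicting the positive class (HIGH) for nearly every test sample, something accuracy alone does not reveal. A quick sanity check of the prediction distribution:

```python
import numpy as np

# Count how many test samples are assigned to each class;
# if only class 1 appears, the model labels every sample as HIGH.
np.unique(y_pred, return_counts=True)
```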
17.7. F1 Score#
The F1 score is the harmonic mean of precision and recall.

It serves as a comprehensive evaluation index that balances precision and recall, and it is widely used in practice.
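Written out, the F1 score is:

$$
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$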
The scikit-learn function for the F1 score: `sklearn.metrics.f1_score(y_true, y_pred)`.
from sklearn.metrics import f1_score
f1_score(y_test, y_pred)
0.8686716319411709
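Although not part of the original walkthrough, `classification_report` conveniently summarizes precision, recall, and the F1 score for both classes in one call:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1 score and support in a single table
print(classification_report(y_test, y_pred))
```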
17.8. ROC Curve#
In some classification models (such as logistic regression), a threshold is usually set: samples whose predicted score is above the threshold are classified as positive, and those below it as negative. Therefore, when we lower the threshold, more samples are classified as positive. This increases the recognition rate of the positive class, but at the same time more negative samples are misclassified as positive.
Therefore, the purpose of the ROC curve is to visualize this change process and thus evaluate the quality of a classifier.
The ROC curve is built from two metrics, TPR and FPR, computed as follows:

$$
TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}
$$

Here, TPR is the probability of correctly classifying a positive sample (i.e., the recall), while FPR is the probability of misclassifying a negative sample as positive.
In the ROC curve, we set the horizontal axis as FPR and the vertical axis as TPR, so that we can intuitively see the relationship between FPR and TPR.
Then:

- When FPR = 0 and TPR = 0, every sample is predicted as negative.
- When FPR = 1 and TPR = 1, every sample is predicted as positive.
- When FPR = 0 and TPR = 1, we have the optimal classifier point.
Then, the ROC curve corresponding to an excellent classifier should be as close to the upper left corner as possible. The closer the curve is to the 45-degree diagonal line, the worse the classifier performance.
The scikit-learn function for the ROC curve: `sklearn.metrics.roc_curve(y_true, y_score)`.
Although the ROC curve gives an intuitive picture of a classifier's quality, people often prefer a single numerical value for evaluation. This is where AUC comes in. AUC stands for Area Under Curve, i.e., the area under the ROC curve.
- \(AUC = 1\): perfect classifier.
- \(0.5 < AUC < 1\): the classifier is better than random guessing.
- \(AUC = 0.5\): the classifier's result is close to random guessing.
- \(AUC < 0.5\): the classifier performs worse than random guessing.
The scikit-learn function for AUC: `sklearn.metrics.auc(fpr, tpr)`.
Next, we plot the ROC curve for the predictions above:
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

y_score = model.decision_function(X_test)  # decision scores of the test samples
fpr, tpr, _ = roc_curve(y_test, y_score)  # FPR and TPR at each threshold
roc_auc = auc(fpr, tpr)  # area under the ROC curve

plt.plot(fpr, tpr, label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")  # random-guess diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
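If only the AUC value is needed, it can also be computed directly from the labels and decision scores with `roc_auc_score`; this is an alternative to `auc(fpr, tpr)` above, not a step from the original walkthrough:

```python
from sklearn.metrics import roc_auc_score

# AUC computed directly from the true labels and the decision scores
roc_auc_score(y_test, y_score)
```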
17.9. Summary#
This experiment introduced several commonly used evaluation metrics for classification models: accuracy, precision, recall, F1 score, and the ROC curve with AUC. Practicing with these metrics on more datasets will help you master them.