
17. Classification Model Evaluation Methods#

17.1. Introduction#

In the previous classification experiments, we used accuracy as the only way to evaluate a model. In fact, there are many other evaluation methods for classification models. In this experiment, we will learn the other commonly used methods and gain a more comprehensive understanding of how classification models are evaluated.

17.2. Key Points#

  • Accuracy

  • Precision

  • Recall

  • F1 Score

  • ROC Curve

In the previous experiment, only accuracy was introduced as an evaluation metric for classification models. Next, we will take a comprehensive look at the commonly used evaluation metrics for classification models. To make the discussion concrete, we will use logistic regression to build a risk classification model for credit card holders.

17.3. Dataset Introduction#

The dataset is in CSV format. You can use Pandas to read and preview the dataset.

wget -nc https://cdn.aibydoing.com/aibydoing/files/credit_risk_train.csv
import pandas as pd

df = pd.read_csv("credit_risk_train.csv")  # Read the data file
df.head()
|   | BILL_1 | BILL_2 | BILL_3 | BILL_4 | BILL_5 | BILL_6 | AGE | SEX | EDUCATION | MARRIAGE | RISK |
|---|--------|--------|--------|--------|--------|--------|-----|-----|-----------|----------|------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37 | Female | Graduate School | Married | LOW |
| 1 | 8525 | 5141 | 5239 | 7911 | 17890 | 10000 | 25 | Male | High School | Single | HIGH |
| 2 | 628 | 662 | 596 | 630 | 664 | 598 | 39 | Male | Graduate School | Married | HIGH |
| 3 | 4649 | 3964 | 3281 | 934 | 467 | 12871 | 41 | Female | Graduate School | Single | HIGH |
| 4 | 46300 | 10849 | 8857 | 9658 | 9359 | 9554 | 55 | Female | High School | Married | HIGH |

This dataset contains 10 columns of features and one column of class labels. Among them:

  • Columns 1 to 6 are the recent historical bill information of the customers. (Features)

  • Column 7 is the age of the customer. (Feature)

  • Column 8 is the gender of the customer. (Feature)

  • Column 9 is the education level of the customer. (Feature)

  • Column 10 is the marital status of the customer. (Feature)

  • Column 11 is the risk status of the customer’s credit card holding. (Classification label: LOW, HIGH)

Our goal is to use this dataset to train a credit card holder risk prediction model and evaluate the model. First, following the process of machine learning modeling, we need to divide the dataset into a training set and a test set.

Although the above dataset looks very tidy, you will find that the feature data in columns 8, 9, and 10 are categorical (for example, Female / Male for gender). Therefore, when splitting the dataset here, we will use the one-hot encoding introduced in data preprocessing to convert the categorical features into numerical features.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

df.RISK = df.RISK.replace({"LOW": 0, "HIGH": 1})  # Replace the class labels with numeric values for later calculations

train_data = df.iloc[:, :-1]  # Feature columns
train_data = pd.get_dummies(train_data)  # One-hot encode the categorical features
train_data = scale(train_data)  # Standardize the features

train_target = df["RISK"]  # Target column

# Split the dataset: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=0.3, random_state=0
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((14000, 16), (6000, 16), (14000,), (6000,))
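If you are curious where the 16 feature columns come from, you can inspect the DataFrame produced by one-hot encoding before it is scaled. This is a small exploratory sketch; the exact dummy column names depend on the category values present in the data.

encoded = pd.get_dummies(df.iloc[:, :-1])  # One-hot encode the feature columns only
print(encoded.shape)  # Number of samples and expanded feature columns
print(list(encoded.columns))  # Numeric columns plus one dummy column per category value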

Next, we use scikit-learn to build a logistic regression classification model. We use the training data to complete the model training and use the test data to evaluate the model. The process of training a model using scikit-learn is very simple. After instantiating the class of the corresponding model, we can complete the training using fit().

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver="lbfgs")  # Define the logistic regression model
model.fit(X_train, y_train)  # Train the model on the training data
LogisticRegression()

For classification models, we generally use accuracy to evaluate the model. Specifically, we make predictions on the test set using the already trained model, and then calculate the prediction accuracy.

17.4. Accuracy#

In the credit card risk prediction model, the target values correspond to two categories, which is often referred to as a binary classification problem. In a binary classification problem, we often define the positive class and the negative class. For example, here we define HIGH as the positive class and LOW as the negative class (you can also define it the other way around). Then, we can present the confusion matrix of the actual classes (row names) and the predicted classes (column names) as follows.

| Credit Risk | HIGH | LOW |
|-------------|------|-----|
| HIGH | True Positive (TP) | False Negative (FN) |
| LOW | False Positive (FP) | True Negative (TN) |

Specifically for the above table:

  • TP: the number of positive samples predicted as positive → correct prediction

  • TN: the number of negative samples predicted as negative → correct prediction

  • FP: the number of negative samples predicted as positive → incorrect prediction (false alarm)

  • FN: the number of positive samples predicted as negative → incorrect prediction (miss)

Based on this confusion matrix, we can present the calculation methods for the common evaluation metrics of the classification model.

Accuracy is the proportion of the number of correctly classified test samples to the total number of test instances. Then, the formula for calculating accuracy is:

\[ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \tag{1} \]

Of course, for the convenience of implementation below, we can also write the classification accuracy in the following form:

\[ acc=\frac{\sum_{i=1}^{N}I(\bar{y_{i}}=y_{i})}{N} \tag{2} \]

Where \(N\) is the total number of samples, \(\bar{y_{i}}\) is the predicted class of the \(i\)-th sample, \(y_{i}\) is the true class of the \(i\)-th sample, and \(I\) is the indicator function, which equals 1 when \(\bar{y_{i}} = y_{i}\) and 0 otherwise, so the sum counts the correctly predicted samples.

First, we need to obtain the prediction results of the model. This step is also very simple, and we only need to use the predict method.

y_pred = model.predict(X_test)  # Predict on the test set features
y_pred
array([1, 1, 1, ..., 1, 1, 1])
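With the predictions in hand, it can help to see how they populate the confusion matrix described above. Below is a minimal sketch using sklearn.metrics.confusion_matrix, where labels=[1, 0] puts the positive class (HIGH = 1) first so the layout matches the table:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes, both ordered [1, 0]
tp, fn, fp, tn = confusion_matrix(y_test, y_pred, labels=[1, 0]).ravel()
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)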
import numpy as np


def get_accuracy(test_labels, pred_labels):
    # Accuracy calculation implemented from formula (2)
    correct = np.sum(test_labels == pred_labels)  # Number of correct predictions
    n = len(test_labels)  # Total number of test samples
    acc = correct / n
    return acc

Now, we only need to input the true labels y_test and the predicted labels of the model y_pred.

get_accuracy(y_test, y_pred)  # Compute the model's prediction accuracy
0.7678333333333334

You can also directly use the scikit-learn calculation method for accuracy: sklearn.metrics.accuracy_score(y_true, y_pred).

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)  # Pass in the true and predicted classes
0.7678333333333334

In fact, when building a model with scikit-learn, you can also directly use model.score() to obtain the classification accuracy:

model.score(X_test, y_test)  # Pass in the test features and labels
0.7678333333333334

Above, we covered three common ways to compute classification accuracy. Classification is the most frequently encountered modeling problem in machine learning, and besides accuracy there are several other commonly used performance metrics for classification models, which are introduced below.

17.5. Precision#

Precision, also known as positive predictive value, is the ratio of the number of correctly classified positive examples to the number of examples classified as positive.

\[Precision = \frac{TP}{TP+FP} \tag{3}\]

The scikit-learn calculation method for precision: sklearn.metrics.precision_score(y_true, y_pred).

from sklearn.metrics import precision_score

precision_score(y_test, y_pred)
0.7678333333333334

17.6. Recall#

Recall, also known as sensitivity, is the ratio of the number of correctly classified positive examples to the number of actual positive examples.

\[Recall = \frac{TP}{TP+FN} \tag{4}\]

The scikit-learn calculation method for recall: sklearn.metrics.recall_score(y_true, y_pred).

from sklearn.metrics import recall_score

recall_score(y_test, y_pred)
1.0
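As a quick sanity check of formulas (3) and (4), you can recompute both values directly from the confusion matrix counts. This sketch assumes the y_test and y_pred variables from above:

from sklearn.metrics import confusion_matrix

tp, fn, fp, tn = confusion_matrix(y_test, y_pred, labels=[1, 0]).ravel()
print("precision:", tp / (tp + fp))  # Should match precision_score
print("recall:", tp / (tp + fn))  # Should match recall_score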

17.7. F1 Score#

The F1 score is the harmonic mean of precision and recall.

\[F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{5}\]

F1 combines precision and recall into a single, comprehensive metric, which makes it a convenient overall measure and one that is widely used.

The scikit-learn calculation method for the F1 score: sklearn.metrics.f1_score(y_true, y_pred).

from sklearn.metrics import f1_score

f1_score(y_test, y_pred)
0.8686716319411709
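To see how formula (5) ties the two previous metrics together, you can recompute the F1 score by hand from precision_score and recall_score (imported in the previous sections); the result should match f1_score:

p = precision_score(y_test, y_pred)  # Precision from section 17.5
r = recall_score(y_test, y_pred)  # Recall from section 17.6
print(2 * p * r / (p + r))  # Harmonic mean, formula (5)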

17.8. ROC Curve#

In some classification models (such as logistic regression), a threshold is usually set: samples whose predicted scores are greater than the threshold are classified as positive, and those below it as negative. Therefore, when we lower the threshold, more samples are classified as positive. This increases the recognition rate of the positive class, but at the same time more negative samples are misclassified as positive.
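To make this concrete, here is a minimal sketch (using the model and X_test from above) that applies a few arbitrary thresholds to the model's decision scores and counts how many samples end up classified as positive; lowering the threshold increases that count:

scores = model.decision_function(X_test)  # Signed distance from the decision boundary
for threshold in (1.0, 0.0, -1.0):
    y_thresh = (scores >= threshold).astype(int)  # 1 = positive class, 0 = negative class
    print(threshold, y_thresh.sum(), "samples classified as positive")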

Therefore, the purpose of the ROC curve is to visualize this change process and thus evaluate the quality of a classifier.

There are two metrics in the ROC curve, namely TPR and FPR, and the formulas are as follows:

\[TPR = \frac{TP}{TP+FN} \tag{6a}\]
\[FPR = \frac{FP}{FP+TN} \tag{6b}\]

Among them, TPR represents the probability of correctly classifying positive examples (recall rate), while FPR represents the probability of misclassifying negative examples as positive examples.

In the ROC curve, we set the horizontal axis as FPR and the vertical axis as TPR, so that we can intuitively see the relationship between FPR and TPR.

[Figure: a typical ROC curve, with FPR on the horizontal axis and TPR on the vertical axis]

Then:

  • When FPR = 0 and TPR = 0, it means that every instance is predicted as a negative example.

  • When FPR = 1 and TPR = 1, it means that every instance is predicted as a positive example.

  • When FPR = 0 and TPR = 1, it means the optimal classifier point.

Then, the ROC curve corresponding to an excellent classifier should be as close to the upper left corner as possible. The closer the curve is to the 45-degree diagonal line, the worse the classifier performance.

The scikit-learn calculation method for the ROC curve: sklearn.metrics.roc_curve(y_true, y_score)

Although the ROC curve is an intuitive way to show the quality of a classifier, people often prefer a single numerical value for evaluation. This is where AUC comes in: AUC stands for Area Under Curve, meaning the area under the ROC curve.

  • \(AUC = 1\): Perfect classifier.

  • \(0.5 < AUC < 1\): The classifier is better than random guessing.

  • \(AUC = 0.5\): The classifier’s result is close to random guessing.

  • \(AUC < 0.5\): The classifier performs worse than random guessing.

The scikit-learn calculation method for AUC: sklearn.metrics.auc(fpr, tpr)

Next, we will plot the ROC curve for the prediction results of this time:

from matplotlib import pyplot as plt

%matplotlib inline

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
[Output: ROC curve plot, with the AUC value shown in the legend and a dashed diagonal reference line]
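As a cross-check, scikit-learn's sklearn.metrics.roc_auc_score computes the AUC directly from the true labels and the scores, without plotting; it should agree with the auc(fpr, tpr) value shown in the legend above:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_score)  # AUC computed directly from labels and decision scores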

17.9. Summary#

This experiment introduced several commonly used metrics for evaluating classification predictions: accuracy, precision, recall, the F1 score, and the ROC curve. More practice will be needed to use these methods proficiently.

