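Scikit-learn Model Evaluation: Accuracy, Precision, Recall, and F1 Score

Introduction

After training a machine learning model, it is essential to evaluate its performance to understand how well it generalizes to new data. Scikit-learn, a powerful Python machine learning library, provides a comprehensive set of model evaluation tools in its sklearn.metrics module.

In this lab, you will learn how to evaluate a classification model using some of the most common metrics. We will use predefined sets of true labels and predicted labels so that we can focus solely on the evaluation process. You will learn to compute:

Accuracy Score
Confusion Matrix
Precision Score
Recall Score
F1 Score

By the end of this lab, you will be comfortable using these basic scikit-learn functions to evaluate the performance of your classification models.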

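Computing the Accuracy Score with accuracy_score from sklearn.metrics

In this step, we will compute the accuracy of the model's predictions. Accuracy is one of the most intuitive classification metrics: it measures the proportion of correctly predicted instances out of the total number of instances.

The accuracy_score function in sklearn.metrics computes this value. It takes the true labels and the predicted labels as arguments.

First, open the evaluate.py file in the file explorer on the left. The file already contains the y_true and y_pred lists. Now add the following code to the end of the file to import the accuracy_score function, compute the accuracy, and print the result.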

from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

print(f"Accuracy: {accuracy}")

Your complete evaluate.py file should now look like this:

# In this lab, we will use a predefined set of true labels and predicted labels
# to understand different evaluation metrics.

# y_true represents the actual, ground truth labels for our data points.
# For a binary classification, 0 could mean 'negative' and 1 could mean 'positive'.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

# y_pred represents the labels predicted by our hypothetical classification model.
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

print("Setup complete. True and predicted labels are defined in evaluate.py.")
print(f"True labels:    {y_true}")
print(f"Predicted labels: {y_pred}")

from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

print(f"Accuracy: {accuracy}")

Now let's run the script. Open a terminal in your IDE and run the following command:

python3 evaluate.py

You should see the following output, which includes the accuracy score. An accuracy of 0.8 means that 80% of the predictions were correct.

Setup complete. True and predicted labels are defined in evaluate.py.
True labels:    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
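
As a sanity check, the same value can be reproduced without scikit-learn by counting matching positions. This is only a sketch of the definition given above; the names correct and manual_accuracy are illustrative and are not part of the lab's evaluate.py.

```python
# Same labels as in evaluate.py, repeated so this snippet runs on its own.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Count the positions where the prediction matches the true label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

# Accuracy is the fraction of positions that match.
manual_accuracy = correct / len(y_true)

print(f"Correct predictions: {correct} of {len(y_true)}")
print(f"Manual accuracy: {manual_accuracy}")
```

The two mismatches are at positions 2 and 5 (counting from 0), which leaves 8 correct predictions out of 10.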

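Generating a Confusion Matrix with confusion_matrix from sklearn.metrics

In this step, we will generate a confusion matrix. Although accuracy gives a quick summary of performance, it can be misleading, especially when the dataset is imbalanced. A confusion matrix gives a more detailed breakdown of a classifier's performance by showing the number of correct and incorrect predictions for each class.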
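
To make the point about imbalanced data concrete, here is a small sketch that is separate from evaluate.py. The data is hypothetical (95 negatives, 5 positives), and the "model" is assumed to always predict the negative class; the variable names are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical imbalanced labels: 95 negatives followed by 5 positives.
imbalanced_true = [0] * 95 + [1] * 5

# A trivial "model" that predicts the negative class for every instance.
always_negative = [0] * 100

imbalanced_accuracy = accuracy_score(imbalanced_true, always_negative)
imbalanced_cm = confusion_matrix(imbalanced_true, always_negative)

print(f"Accuracy: {imbalanced_accuracy}")
print("Confusion Matrix:")
print(imbalanced_cm)
```

The accuracy looks high because most instances are negative, yet the second row of the matrix shows that none of the 5 positive instances were found.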

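For binary classification, the confusion matrix is a table of the four possible combinations of predicted and actual values:

True Negatives (TN): The model correctly predicted the negative class.
False Positives (FP): The model predicted the positive class, but the actual class is negative.
False Negatives (FN): The model predicted the negative class, but the actual class is positive.
True Positives (TP): The model correctly predicted the positive class.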
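
The four definitions above can be sketched as plain counting over (true, predicted) pairs. This assumes the lab's convention that 1 is the positive class and 0 is the negative class; the names pairs, tn, fp, fn, and tp are illustrative.

```python
# Same labels as in evaluate.py, repeated so this snippet runs on its own.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# Each count matches one (actual, predicted) combination.
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actual negative, predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actual negative, predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual positive, predicted negative
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actual positive, predicted positive

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```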

We will use the confusion_matrix function from sklearn.metrics. Add the following code to the end of your evaluate.py file.

from sklearn.metrics import confusion_matrix

# Generate confusion matrix
cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(cm)

Now run the script again from the terminal:

python3 evaluate.py

The output now includes the confusion matrix.

Setup complete. True and predicted labels are defined in evaluate.py.
True labels:    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]

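In scikit-learn's layout, rows correspond to the true labels and columns to the predicted labels, so this matrix tells us:

TN = 4 (top left)
FP = 1 (top right)
FN = 1 (bottom left)
TP = 4 (bottom right)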
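
For the binary case, the four cells can also be unpacked into named variables by flattening the matrix. The sketch below relies on the row-major order of the 2x2 array (TN, FP, FN, TP); the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Same labels as in evaluate.py, repeated so this snippet runs on its own.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# ravel() flattens the 2x2 matrix row by row: [TN, FP, FN, TP].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

This unpacking only works when there are exactly two classes; with more classes the flattened array has more than four entries.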

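Computing the Precision Score with precision_score from sklearn.metrics

In this step, we will compute the precision score. Precision answers the question: "Of all the instances the model predicted as positive, what proportion are actually positive?" It is a measure of the classifier's exactness.

Precision is calculated as: Precision = True Positives / (True Positives + False Positives)

A low precision indicates a large number of false positives. We will use the precision_score function from sklearn.metrics.

Add the following code to the end of your evaluate.py file to compute and print the precision.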

from sklearn.metrics import precision_score

# Calculate precision
precision = precision_score(y_true, y_pred)

print(f"Precision: {precision}")

Run the script from the terminal:

python3 evaluate.py

You will see the precision score added to the output. Based on our confusion matrix (TP=4, FP=1), the precision is 4 / (4 + 1) = 0.8.

Setup complete. True and predicted labels are defined in evaluate.py.
True labels:    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
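
The question that precision answers can be mirrored directly in plain Python: keep only the instances predicted as positive, then check how many of them are truly positive. This is a sketch assuming 1 is the positive class; the names predicted_positive and manual_precision are illustrative.

```python
# Same labels as in evaluate.py, repeated so this snippet runs on its own.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# True labels of every instance the model predicted as positive (TP + FP).
predicted_positive = [t for t, p in zip(y_true, y_pred) if p == 1]

# The fraction of those that are actually positive is TP / (TP + FP).
manual_precision = sum(predicted_positive) / len(predicted_positive)

print(f"Predicted positive: {len(predicted_positive)}")
print(f"Manual precision: {manual_precision}")
```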

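Computing the Recall Score with recall_score from sklearn.metrics

In this step, we will compute the recall score. Recall, also known as sensitivity or the true positive rate, answers the question: "Of all the instances that are actually positive, what proportion did the model correctly identify?" It is a measure of the classifier's completeness.

Recall is calculated as: Recall = True Positives / (True Positives + False Negatives)

A low recall indicates a large number of false negatives. We will use the recall_score function from sklearn.metrics.

Add the following code to the end of your evaluate.py file.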

from sklearn.metrics import recall_score

# Calculate recall
recall = recall_score(y_true, y_pred)

print(f"Recall: {recall}")

Run the script from the terminal:

python3 evaluate.py

The output now includes the recall score. Based on our confusion matrix (TP=4, FN=1), the recall is 4 / (4 + 1) = 0.8.

Setup complete. True and predicted labels are defined in evaluate.py.
True labels:    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
Recall: 0.8
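
Recall can be checked in the same spirit by starting from the actual positives instead of the predicted ones. This is a sketch assuming 1 is the positive class; the names actual_positive and manual_recall are illustrative.

```python
# Same labels as in evaluate.py, repeated so this snippet runs on its own.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Predictions for every instance that is actually positive (TP + FN).
actual_positive = [p for t, p in zip(y_true, y_pred) if t == 1]

# The fraction of those the model predicted as positive is TP / (TP + FN).
manual_recall = sum(actual_positive) / len(actual_positive)

print(f"Actual positive: {len(actual_positive)}")
print(f"Manual recall: {manual_recall}")
```

Precision and recall happen to be equal here only because this dataset has one false positive and one false negative; in general the two values differ.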

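Computing the F1 Score with f1_score from sklearn.metrics

In this final step, we will compute the F1 score. The F1 score is the harmonic mean of precision and recall, and it seeks a balance between the two. Precision focuses on minimizing false positives, recall focuses on minimizing false negatives, and the F1 score provides a single metric that accounts for both.

The F1 score is calculated as: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

It is particularly useful when you need to balance precision and recall and the class distribution is uneven. We will use the f1_score function from sklearn.metrics.

Add the final piece of code to your evaluate.py file.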

from sklearn.metrics import f1_score

# Calculate F1 score
f1 = f1_score(y_true, y_pred)

print(f"F1 Score: {f1}")

Run the script from the terminal one last time:

python3 evaluate.py

The final output shows all the metrics we have computed. Because precision and recall are both 0.8, the F1 score is also 0.8.

Setup complete. True and predicted labels are defined in evaluate.py.
True labels:    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
Recall: 0.8
F1 Score: 0.8
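
To see why the harmonic mean is used, the formula can be applied by hand. The helper f1_from below is a hypothetical function written for this illustration (it is not part of scikit-learn), and the second pair of values (precision 1.0, recall 0.2) is made up to show an unbalanced case. Returning 0.0 when both inputs are zero is a choice made for this sketch.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        # Avoid division by zero; treating this case as 0.0 is a sketch-level choice.
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# Values from this lab: precision and recall are both 0.8.
balanced_f1 = f1_from(0.8, 0.8)

# Made-up unbalanced case: perfect precision but poor recall.
unbalanced_f1 = f1_from(1.0, 0.2)
unbalanced_mean = (1.0 + 0.2) / 2  # arithmetic mean, for comparison

print(f"Balanced F1: {balanced_f1:.3f}")
print(f"Unbalanced F1: {unbalanced_f1:.3f}")
print(f"Unbalanced arithmetic mean: {unbalanced_mean:.3f}")
```

In the unbalanced case the harmonic mean is pulled toward the lower of the two values, so a model cannot earn a high F1 score by doing well on only one of precision or recall.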

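Summary

Congratulations on completing this lab! You have learned how to evaluate a classification model using several key metrics from the scikit-learn library.

In this lab, you practiced:

Computing accuracy with accuracy_score to get an overall impression of model performance.
Generating a confusion matrix with confusion_matrix to get a detailed view of true/false positives and true/false negatives.
Computing precision with precision_score to measure the exactness of the model's positive predictions.
Computing recall with recall_score to measure how completely the model finds the actual positives.
Computing the F1 score with f1_score to balance precision and recall.

These metrics are basic tools for any data scientist or machine learning engineer. Understanding them helps you diagnose a model's strengths and weaknesses and choose an appropriate model for your specific problem.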