Precision-Recall Metrics in Imbalanced Classification


Introduction

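This tutorial is a step-by-step guide to evaluating the quality of a classifier's output with precision-recall metrics. The precision-recall curve is a useful measure of prediction success when the classes are highly imbalanced. In information retrieval, precision measures how relevant the returned results are (the fraction of returned results that are relevant), while recall measures how many of the truly relevant results are returned (the fraction of all relevant results that were retrieved).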
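To make the two definitions concrete, here is a minimal sketch that counts true positives, false positives, and false negatives by hand and compares the result with scikit-learn's `precision_score` and `recall_score`. The labels `y_true` and `y_pred` are made-up toy values, not data from this tutorial.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels: 1 is the rare positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

# Count the outcomes that involve the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found

print(f"precision={precision:.3f}, recall={recall:.3f}")
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Neither quantity depends on the number of true negatives, which is why both stay informative when the negative class dominates.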




Dataset and Model

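We will use the iris dataset and a linear support vector classifier (LinearSVC) to distinguish between two types of iris. First, we import the required libraries and load the dataset.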

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

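Next, we add noisy features to the dataset and split it into training and test sets. Only the first two classes (y < 2) are kept, so the problem is binary.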

random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X[y < 2], y[y < 2], test_size=0.5, random_state=random_state
)
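As a quick sanity check, the following sketch repeats the steps above in isolation and prints the resulting array shapes. The `_demo` variable names are illustrative only; they are chosen so that the tutorial's own `X`, `y`, and split variables are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
n_samples, n_features = X_demo.shape  # iris has 150 samples and 4 features

# Append 200 * 4 = 800 columns of Gaussian noise to the 4 real features.
X_demo = np.concatenate(
    [X_demo, rng.randn(n_samples, 200 * n_features)], axis=1
)

# Keep only classes 0 and 1 (100 samples) and split them in half.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo[y_demo < 2], y_demo[y_demo < 2], test_size=0.5, random_state=rng
)

print(X_demo.shape)          # all samples, real plus noise features
print(X_tr.shape, X_te.shape)
```

With 800 noise columns against 4 informative ones, the classifier faces a deliberately harder problem, which keeps the precision-recall curve from being trivially perfect.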

Finally, we scale the data with StandardScaler and fit the linear support vector classifier to the training data.

classifier = make_pipeline(
    StandardScaler(), LinearSVC(random_state=random_state, dual="auto")
)
classifier.fit(X_train, y_train)

Plotting the Precision-Recall Curve

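To plot the precision-recall curve, we use the PrecisionRecallDisplay class from sklearn.metrics. The curve can be computed with either the from_estimator or the from_predictions method. from_estimator computes the predictions for us before plotting the curve, whereas from_predictions requires us to supply the prediction scores ourselves.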

from sklearn.metrics import PrecisionRecallDisplay

# Use the from_estimator method
display = PrecisionRecallDisplay.from_estimator(
    classifier, X_test, y_test, name="LinearSVC", plot_chance_level=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")

# Use the from_predictions method
y_score = classifier.decision_function(X_test)

display = PrecisionRecallDisplay.from_predictions(
    y_test, y_score, name="LinearSVC", plot_chance_level=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")
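The numbers behind such a plot can also be obtained without plotting. The sketch below is an illustration under stated assumptions: it uses a synthetic imbalanced dataset from `make_classification` (not the iris data above) and `_demo`-style names that are hypothetical. It calls `precision_recall_curve` on decision scores and checks that the step-wise sum of precision weighted by the change in recall matches `average_precision_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data with roughly 10% positives (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

clf = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
clf.fit(X_tr, y_tr)

# Continuous scores, as from_predictions would expect.
scores = clf.decision_function(X_te)
precision, recall, thresholds = precision_recall_curve(y_te, scores)

# recall is returned in decreasing order, hence the leading minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_te, scores)
print(ap_manual, ap_sklearn)
```

`precision` and `recall` have one more element than `thresholds`, because the curve ends at the point precision = 1, recall = 0, which has no associated threshold.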

Plotting Precision-Recall Curves for Multi-Label Classification

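The precision-recall curve does not support the multi-label setting directly. However, we can decide how to handle this case. We will create a multi-label dataset, fit and predict with a one-vs-rest classifier (OneVsRestClassifier), and then plot the precision-recall curves.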
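Turning class labels into a multi-label indicator matrix is done with `label_binarize`. As a small illustration with made-up labels, each class becomes one column and each row has a 1 in the column of its class:

```python
from sklearn.preprocessing import label_binarize

# Hypothetical labels drawn from three classes.
labels = [0, 2, 1, 2]
Y_demo = label_binarize(labels, classes=[0, 1, 2])
print(Y_demo)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]
```

Each column can then be treated as its own binary problem, which is what a one-vs-rest classifier does.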

from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

# Create multi-label data
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=random_state
)

# Fit and predict with OneVsRestClassifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state, dual="auto"))
)
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)

# Compute precision and recall for each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# Compute micro-averaged precision and recall
precision["micro"], recall["micro"], _ = precision_recall_curve(Y_test.ravel(), y_score.ravel())
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")

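# Plot the micro-averaged precision-recall curve
from collections import Counter

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
    # Fraction of positive entries in the flattened indicator matrix
    prevalence_pos_label=Counter(Y_test.ravel())[1] / Y_test.size,
)
display.plot(plot_chance_level=True)
_ = display.ax_.set_title("Micro-averaged over all classes")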
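Micro-averaging pools every (sample, class) pair into one large binary problem. The following sketch, using a made-up indicator matrix and random scores rather than the tutorial's data, suggests that `average="micro"` gives the same value as flattening both arrays and scoring them as a single binary task:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical indicator matrix (6 samples, 3 classes) and random scores.
Y_true = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
)
rng = np.random.RandomState(0)
Y_scores = rng.rand(6, 3)

ap_micro = average_precision_score(Y_true, Y_scores, average="micro")
ap_flat = average_precision_score(Y_true.ravel(), Y_scores.ravel())
print(ap_micro, ap_flat)
```

This is also why the micro-averaged curve is computed from `Y_test.ravel()` and `y_score.ravel()`.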

# Plot the precision-recall curve for each class together with iso-F1 curves
from itertools import cycle

import matplotlib.pyplot as plt

colors = cycle(["navy", "turquoise", "darkorange", "cornflowerblue", "teal"])
_, ax = plt.subplots(figsize=(7, 8))
f_scores = np.linspace(0.2, 0.8, num=4)
lines, labels = [], []
for f_score in f_scores:
    x = np.linspace(0.01, 1)
    # Precision on the iso-F1 curve at recall x (named y_iso so the labels y are not overwritten)
    y_iso = f_score * x / (2 * x - f_score)
    (l,) = plt.plot(x[y_iso >= 0], y_iso[y_iso >= 0], color="gray", alpha=0.2)
    plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y_iso[45] + 0.02))

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot(ax=ax, name="Micro-average precision-recall", color="gold")

for i, color in zip(range(n_classes), colors):
    display = PrecisionRecallDisplay(
        recall=recall[i],
        precision=precision[i],
        average_precision=average_precision[i],
    )
    display.plot(ax=ax, name=f"Precision-recall for class {i}", color=color)

# Add the iso-F1 curves to the legend
handles, labels = display.ax_.get_legend_handles_labels()
handles.extend([l])
labels.extend(["iso-f1 curves"])
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.legend(handles=handles, labels=labels, loc="best")
ax.set_title("Extension of Precision-Recall curve to multi-class")
plt.show()
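The gray curves in the plot are iso-F1 curves: every point on one of them has the same F1 score. Solving F1 = 2PR / (P + R) for precision gives P = F1 * R / (2R - F1), which is the expression used in the plotting loop. A minimal numerical check, with an arbitrarily chosen F1 value of 0.6:

```python
import numpy as np

f_score = 0.6                      # arbitrary F1 level for illustration
recall = np.linspace(0.01, 1)      # 50 recall values, as in the plotting code
precision = f_score * recall / (2 * recall - f_score)

# Keep only the part of the curve with non-negative precision.
valid = precision >= 0
f1 = 2 * precision[valid] * recall[valid] / (precision[valid] + recall[valid])
print(np.allclose(f1, f_score))
```

Points with recall below F1 / 2 yield a negative precision, which is why the plotting code masks them out.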

Summary

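This tutorial walked step by step through using precision-recall metrics to evaluate the quality of a classifier's output. We learned how to plot the precision-recall curve for binary classification with the PrecisionRecallDisplay class from sklearn.metrics. We also learned how to plot precision-recall curves for multi-label classification with a one-vs-rest classifier (OneVsRestClassifier), and how to compute the precision and recall of each class as well as their micro-average.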