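Machine Learning | PCA | LinearSVC | Digits Dataset

Introduction

In this lab, we will learn how to balance model complexity against cross-validated score. The idea is to accept any accuracy that lies within 1 standard deviation of the best accuracy score, and among those candidates to pick the one with the fewest principal component analysis (PCA) components. We will use the digits dataset from scikit-learn and a pipeline made up of PCA and a linear support vector classifier (LinearSVC).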

Import Libraries

We start by importing the libraries needed for this lab.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

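Define Functions

We define two helper functions that are used later in the lab. The first computes the score threshold, and the second picks the simplest model that reaches it.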

def lower_bound(cv_results):
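    """
    Calculate the lower bound within 1 standard deviation
    of the best `mean_test_score`.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`.

    Returns
    -------
    float
        Lower bound within 1 standard deviation of the
        best `mean_test_score`.
    """
    # Index of the candidate with the highest mean cross-validated score.
    best_score_idx = np.argmax(cv_results["mean_test_score"])

    # Subtract that candidate's own standard deviation across the folds.
    return (
        cv_results["mean_test_score"][best_score_idx]
        - cv_results["std_test_score"][best_score_idx]
    )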


def best_low_complexity(cv_results):
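    """
    Balance model complexity with cross-validated score.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`.

    Returns
    -------
    int
        Index of the model that has the fewest PCA components
        while its test score is within 1 standard deviation of the
        best `mean_test_score`.
    """
    threshold = lower_bound(cv_results)
    # Indices of all candidates whose mean score reaches the threshold.
    candidate_idx = np.flatnonzero(cv_results["mean_test_score"] >= threshold)
    # Among those, pick the candidate with the fewest PCA components.
    best_idx = candidate_idx[
        cv_results["param_reduce_dim__n_components"][candidate_idx].argmin()
    ]
    return best_idx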
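
To make the selection rule concrete, here is a minimal sketch that applies the same steps to a hand-made results dictionary. The scores and standard deviations are made-up numbers chosen for illustration, not results from the digits dataset, and `toy_results` is a hypothetical name.

```python
import numpy as np

# Hypothetical cross-validation results for five candidates (made-up numbers).
toy_results = {
    "param_reduce_dim__n_components": np.ma.masked_array([6, 8, 10, 12, 14]),
    "mean_test_score": np.array([0.80, 0.86, 0.89, 0.90, 0.91]),
    "std_test_score": np.array([0.03, 0.03, 0.03, 0.03, 0.03]),
}

# Step 1: threshold = best mean score minus that candidate's standard deviation.
best = np.argmax(toy_results["mean_test_score"])
threshold = (
    toy_results["mean_test_score"][best] - toy_results["std_test_score"][best]
)

# Step 2: keep every candidate whose mean score reaches the threshold.
candidates = np.flatnonzero(toy_results["mean_test_score"] >= threshold)

# Step 3: among those, take the one with the fewest components.
sizes = toy_results["param_reduce_dim__n_components"][candidates]
chosen = candidates[sizes.argmin()]

print(int(chosen), int(toy_results["param_reduce_dim__n_components"][chosen]))
# prints: 2 10
```

In this toy case the threshold is about 0.88, so the candidates with 10, 12, and 14 components qualify, and the rule settles on 10 components even though 14 components score slightly higher.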

Load Data and Define Pipeline

We load the digits dataset from scikit-learn and define a pipeline that first reduces dimensionality with PCA and then classifies with LinearSVC.

pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LinearSVC(random_state=42, C=0.01, dual="auto")),
    ]
)

X, y = load_digits(return_X_y=True)
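
For orientation, the following sketch shows what the `reduce_dim` step does to the shape of the data. The choice of 10 components is arbitrary and only for illustration, and `X_reduced` is a hypothetical name that the lab does not use elsewhere.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Each sample is an 8x8 image flattened into 64 pixel features.
print(X.shape)  # (1797, 64)

# PCA projects the 64 features onto a smaller number of components.
X_reduced = PCA(n_components=10, random_state=42).fit_transform(X)
print(X_reduced.shape)  # (1797, 10)
```

Fewer components mean a simpler model for the classifier to fit, which is the notion of complexity this lab trades off against accuracy.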

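Define Parameters for GridSearchCV

We define the parameter grid for GridSearchCV. The only parameter searched is the number of PCA components in the `reduce_dim` step.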

param_grid = {"reduce_dim__n_components": [6, 8, 10, 12, 14]}
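
The grid key follows scikit-learn's `<step name>__<parameter name>` convention for addressing parameters inside a pipeline. As a rough sketch, we can check that the key is a valid pipeline parameter and list the candidates the search will try. `ParameterGrid` is used here only for inspection and is not part of the lab's own code.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LinearSVC(random_state=42, C=0.01, dual="auto")),
    ]
)
param_grid = {"reduce_dim__n_components": [6, 8, 10, 12, 14]}

# The grid key must match a name reported by the pipeline's get_params().
print("reduce_dim__n_components" in pipe.get_params())  # True

# Expand the grid into the individual candidates that will be evaluated.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 5
print(candidates[0])  # {'reduce_dim__n_components': 6}
```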

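Define the GridSearchCV Object

We define the GridSearchCV object and fit it. Passing the `best_low_complexity` function as `refit` makes the search call it on `cv_results_` to choose `best_index_`, and then refit the pipeline on the full dataset with the selected parameters.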

grid = GridSearchCV(
    pipe,
    cv=10,
    n_jobs=1,
    param_grid=param_grid,
    scoring="accuracy",
    refit=best_low_complexity,
)

grid.fit(X, y)

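Visualize the Results

We visualize the results by plotting accuracy against the number of PCA components. Two horizontal lines mark the best score and the threshold of 1 standard deviation below it.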

n_components = grid.cv_results_["param_reduce_dim__n_components"]
test_scores = grid.cv_results_["mean_test_score"]

plt.figure()
plt.bar(n_components, test_scores, width=1.3, color="b")

lower = lower_bound(grid.cv_results_)
plt.axhline(np.max(test_scores), linestyle="--", color="y", label="Best score")
plt.axhline(lower, linestyle="--", color=".5", label="Best score - 1 std")

plt.title("Balance model complexity and cross-validated score")
plt.xlabel("Number of PCA components used")
plt.ylabel("Digit classification accuracy")
plt.xticks(n_components.tolist())
plt.ylim((0, 1.0))
plt.legend(loc="upper left")

best_index_ = grid.best_index_

print("The best_index_ is %d" % best_index_)
print("The n_components selected is %d" % n_components[best_index_])
print(
    "The corresponding accuracy score is %.2f"
    % grid.cv_results_["mean_test_score"][best_index_]
)
plt.show()

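Summary

In this lab, we learned how to balance model complexity against cross-validated score using PCA and LinearSVC. We used GridSearchCV with a custom `refit` function to select the smallest number of principal components whose accuracy stays within 1 standard deviation of the best score. We also visualized the results to better understand the trade-off between model complexity and accuracy.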