Optimizing Machine Learning Models with Pipeline and GridSearchCV

Introduction

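This lab shows how to use Pipeline and GridSearchCV in scikit-learn to optimize over different classes of estimators in a single cross-validation run. We will use a linear support vector classifier to predict handwritten digits from the digits dataset bundled with scikit-learn (8×8 images loaded with load_digits, a much smaller dataset than MNIST).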

Import the Required Libraries and Load the Data

We start by importing the required libraries and loading the digits dataset from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
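
As a quick sanity check on what was just loaded, the following sketch inspects the arrays. The variable names n_samples, n_features, and classes are illustrative and not part of the lab itself.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Each row of X is one flattened 8x8 grayscale image.
n_samples, n_features = X.shape
classes = np.unique(y)

print(f"{n_samples} samples with {n_features} features each")
print(f"class labels: {classes.tolist()}")
# NMF requires non-negative input, so the minimum pixel value matters later on.
print(f"pixel value range: {X.min()} to {X.max()}")
```

The non-negative pixel values are what allow NMF to be used as one of the reduction steps later in the lab.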

Create a Pipeline and Define the Parameter Grid

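We will create a pipeline that scales the features, reduces their dimensionality, and then predicts with a linear support vector classifier. During the grid search, two unsupervised reduction methods, principal component analysis (PCA) and non-negative matrix factorization (NMF), are compared against univariate feature selection.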

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]
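
To see what the grid search will actually try, the grid can be expanded with ParameterGrid from sklearn.model_selection. This is a minimal sketch: the helper describe and the names candidates and summary are hypothetical and only serve the illustration.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import ParameterGrid

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

# Expand both dictionaries into the full list of parameter combinations.
candidates = list(ParameterGrid(param_grid))


def describe(params):
    """Hypothetical helper: summarize a candidate as (reducer, n_features, C)."""
    n_kept = params.get("reduce_dim__n_components", params.get("reduce_dim__k"))
    return type(params["reduce_dim"]).__name__, n_kept, params["classify__C"]


summary = [describe(p) for p in candidates]
print(f"{len(candidates)} candidates in total")
for row in summary[:4]:
    print(row)
```

The first dictionary contributes 4 × 2 × 3 = 24 combinations and the second 4 × 1 × 3 = 12. All candidates of the first dictionary are listed before those of the second, which matters when the flat score array is reshaped for plotting.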

Create a GridSearchCV Object and Fit the Data

We create a GridSearchCV object from the pipeline and parameter grid defined in the previous step, and then fit it to the data.

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)
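
Once fitted, a GridSearchCV object exposes the winning configuration through best_params_ and best_score_. The sketch below shows this on a deliberately small, PCA-only grid so that it runs quickly; the grid values, cv=3, and the names small_grid and search are assumptions made for this illustration, not the lab's settings.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", PCA()),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

# Deliberately small grid (an assumption for this sketch) to keep it fast.
small_grid = {
    "reduce_dim__n_components": [4, 8],
    "classify__C": [1, 10],
}

search = GridSearchCV(pipe, param_grid=small_grid, cv=3, n_jobs=1)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean cross-validated accuracy: {search.best_score_:.3f}")
# With the default refit=True, the search object can predict directly
# using the best pipeline refitted on all of the data.
print("predictions for the first five samples:", search.predict(X[:5]))
```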

Plot the Results

We will plot the GridSearchCV results as a bar chart, which lets us compare the accuracy of the different feature reduction techniques.

import pandas as pd

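mean_scores = np.array(grid.cv_results_["mean_test_score"])
# Candidates are listed one grid at a time (all PCA/NMF entries, then all
# SelectKBest entries). Within a grid, C varies slowest and the number of
# features fastest, so each grid is reshaped separately before joining.
n_c, n_f = len(C_OPTIONS), len(N_FEATURES_OPTIONS)
n_first = n_c * 2 * n_f  # the first grid holds two reducers (PCA and NMF)
mean_scores = np.concatenate(
    [
        mean_scores[:n_first].reshape(n_c, 2, n_f),
        mean_scores[n_first:].reshape(n_c, 1, n_f),
    ],
    axis=1,
)
# Select the score for the best C
mean_scores = mean_scores.max(axis=0)
# Create a dataframe to ease plotting
mean_scores = pd.DataFrame(
    mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
)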

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("Digit classification accuracy")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

plt.show()
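
An alternative that does not depend on the order of the flat score array is to build the table from cv_results_["params"], where every score is paired with the parameters that produced it. The following is a sketch under assumptions: it uses a reduced grid, cv=3, and refit=False to keep the run short, and the names small_grid, search, rows, and table are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

# Reduced grid (an assumption for this sketch) with two kinds of reducers.
small_grid = [
    {
        "reduce_dim": [PCA()],
        "reduce_dim__n_components": [2, 4],
        "classify__C": [1, 10],
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": [2, 4],
        "classify__C": [1, 10],
    },
]

search = GridSearchCV(pipe, param_grid=small_grid, cv=3, n_jobs=1, refit=False)
search.fit(X, y)

# Pair every mean score with the parameters that produced it.
rows = []
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    n_kept = params.get("reduce_dim__n_components", params.get("reduce_dim__k"))
    rows.append(
        {
            "reducer": type(params["reduce_dim"]).__name__,
            "n_features": n_kept,
            "C": params["classify__C"],
            "score": score,
        }
    )

# Keep the best score over C for each (n_features, reducer) pair.
table = pd.DataFrame(rows).pivot_table(
    index="n_features", columns="reducer", values="score", aggfunc="max"
)
print(table)
```

Calling table.plot.bar() on the result would give the same kind of grouped bar chart, with the column labels taken from the estimator class names rather than from a hand-written list.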

Caching Transformers within a Pipeline

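We now show how to store the state of a fitted transformer, since it may be needed again. Using a pipeline inside GridSearchCV triggers exactly this situation: the same transformer configuration is refitted for every value of the classifier's parameters. We therefore use the memory argument to enable caching.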

from joblib import Memory
from shutil import rmtree

# Create a temporary folder to store the transformers of the pipeline
location = "cachedir"
memory = Memory(location=location, verbose=10)
cached_pipe = Pipeline(
    [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
    memory=memory,
)

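# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)

# Delete the temporary cache before exiting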
memory.clear(warn=False)
rmtree(location)
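
The effect of the memory argument can also be observed outside a grid search. The sketch below fits a cached pipeline twice with different classifier settings and counts the files joblib wrote to the cache. The temporary directory from mkdtemp, n_components=8, and the names cache_dir and n_cached_files are assumptions for this illustration.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

from joblib import Memory
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# A throwaway cache directory (assumption: any writable location works).
cache_dir = mkdtemp()
memory = Memory(location=cache_dir, verbose=0)

cached_pipe = Pipeline(
    [
        ("reduce_dim", PCA(n_components=8)),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ],
    memory=memory,
)

try:
    # First fit: PCA is fitted and its result is written to the cache.
    cached_pipe.fit(X, y)
    # Second fit: only the classifier changed, so the PCA step sees the same
    # inputs and can be served from the cache instead of being refitted.
    cached_pipe.set_params(classify__C=10).fit(X, y)

    n_cached_files = sum(len(files) for _, _, files in os.walk(cache_dir))
    train_accuracy = cached_pipe.score(X, y)
    print(f"files in cache: {n_cached_files}")
    print(f"training accuracy: {train_accuracy:.3f}")
finally:
    # Always remove the temporary cache, even if fitting fails.
    rmtree(cache_dir)
```

Note that enabling caching makes the pipeline clone its transformers before fitting, so the fitted PCA should be inspected through cached_pipe.named_steps rather than through the instance originally passed in.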

Summary

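In this lab, we used Pipeline and GridSearchCV in scikit-learn to optimize over different classes of estimators in a single cross-validation run. We also showed how the memory argument enables caching by storing the state of fitted transformers, which is particularly useful when fitting a transformer is costly.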

Dimensionality Reduction with Pipeline and GridSearchCV

Introduction