Optimizing Machine Learning Models with Pipeline and GridSearchCV

Introduction

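This lab shows how to use scikit-learn's Pipeline and GridSearchCV to compare and tune different kinds of estimators in a single cross-validation run. A linear support vector classifier predicts handwritten digits from scikit-learn's built-in digits dataset (load_digits), a small set of 8x8-pixel images that is distinct from the well-known MNIST dataset.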

Import the required libraries and load the data

Start by importing the required libraries and loading the digits dataset from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
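
To sanity-check what was loaded, a quick look at the array shapes can help. The following is a minimal sketch; the variable names n_samples, n_features and classes are illustrative and not part of the lab code.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Each row of X is one flattened 8x8 grayscale image (64 pixel features)
n_samples, n_features = X.shape
# y holds the digit label (0 to 9) of each image
classes = np.unique(y)

print(f"{n_samples} samples, {n_features} features, classes: {classes}")
# Pixel intensities are non-negative, which NMF requires of its input
print("min pixel:", X.min(), "max pixel:", X.max())
```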

Create the pipeline and define the parameter grid

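Create a pipeline that scales the features, reduces their dimensionality, and then classifies with a linear support vector classifier. The grid search compares two unsupervised dimensionality reduction methods, PCA and NMF, with univariate feature selection (SelectKBest using mutual_info_classif).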

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]
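
Before running the search, it can help to check how many candidates this grid produces and in which order they are enumerated. The sketch below uses scikit-learn's ParameterGrid, which GridSearchCV uses to enumerate candidates; the estimators are only listed here, not fitted. The name candidates is illustrative.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import ParameterGrid

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

candidates = list(ParameterGrid(param_grid))
# 2 reducers * 3 sizes * 4 C values + 1 reducer * 3 sizes * 4 C values
print(len(candidates))  # 36

# The grids are enumerated one after the other. Within a grid the keys are
# iterated in sorted order, so classify__C changes slowest and the number
# of components changes fastest.
for params in candidates[:4]:
    print(
        params["classify__C"],
        type(params["reduce_dim"]).__name__,
        params["reduce_dim__n_components"],
    )
```

Each candidate is cross-validated separately, so the total number of fits is the number of candidates times the number of folds, plus one final refit.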

Create the GridSearchCV object and fit it to the data

Create a GridSearchCV object from the pipeline and parameter grid defined in the previous step, then fit it to the data.

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)
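
Once fitted, the search object exposes the winning configuration. The following sketch uses a deliberately smaller grid than the one in this lab (PCA only, two sizes, two values of C, 3 folds) so that it runs in seconds; these reduced settings and the names small_pipe, small_grid and search are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

small_pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", PCA()),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
small_grid = {"reduce_dim__n_components": [4, 8], "classify__C": [1, 10]}

search = GridSearchCV(small_pipe, param_grid=small_grid, cv=3, n_jobs=1)
search.fit(X, y)

# Parameter combination with the highest mean cross-validated score
print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of X and y using best_params_
predictions = search.best_estimator_.predict(X[:5])
print(predictions, y[:5])
```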

Plot the results

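Plot the GridSearchCV results as a bar chart to compare the accuracy of the feature reduction techniques. For each reducer and number of features, the chart shows the mean cross-validated score at the best value of C.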

import pandas as pd

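mean_scores = np.array(grid.cv_results_["mean_test_score"])
# cv_results_ lists every candidate of the first grid (PCA, NMF), then every
# candidate of the second grid (KBest); within a grid, keys are iterated in
# alphabetical order: C varies slowest, the number of features fastest
n_first = len(C_OPTIONS) * 2 * len(N_FEATURES_OPTIONS)
first = mean_scores[:n_first].reshape(len(C_OPTIONS), 2, len(N_FEATURES_OPTIONS))
second = mean_scores[n_first:].reshape(len(C_OPTIONS), 1, len(N_FEATURES_OPTIONS))
mean_scores = np.concatenate([first, second], axis=1)
# select score for best C
mean_scores = mean_scores.max(axis=0)
# create a dataframe to ease plotting
mean_scores = pd.DataFrame(
    mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
)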

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("Digit classification accuracy")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

plt.show()
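
An alternative to arranging the score array by position is to tabulate cv_results_ by parameter value with pandas, which does not depend on the order of the candidates. The sketch below assumes a reduced grid (PCA and SelectKBest with chi2, two sizes, two values of C, 3 folds) chosen only to keep the run short; chi2 stands in for mutual_info_classif because it is faster, and the column names reducer and n_features are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
sizes, c_values = [4, 8], [1, 10]
param_grid = [
    {"reduce_dim": [PCA()], "reduce_dim__n_components": sizes, "classify__C": c_values},
    {"reduce_dim": [SelectKBest(chi2)], "reduce_dim__k": sizes, "classify__C": c_values},
]
search = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1).fit(X, y)

# One row per candidate; n_components (PCA) and k (SelectKBest) both mean
# "number of features kept", so they share the n_features column
rows = [
    {
        "reducer": type(params["reduce_dim"]).__name__,
        "n_features": params.get("reduce_dim__n_components", params.get("reduce_dim__k")),
        "C": params["classify__C"],
        "score": score,
    }
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
]
table = pd.DataFrame(rows)

# Best score over C for every (number of features, reducer) pair
best = table.groupby(["n_features", "reducer"])["score"].max().unstack("reducer")
print(best)
```

The resulting frame has one row per number of features and one column per reducer, so calling best.plot.bar() would draw the same kind of grouped bar chart.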

Cache transformers within a pipeline

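It can be worthwhile to store the fitted state of a transformer, because the same fit may be needed again. This happens when a pipeline is used inside GridSearchCV: candidates that differ only in a later step refit the same transformer on the same data. Passing the memory argument to Pipeline enables this caching.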

from joblib import Memory
from shutil import rmtree

# Create a temporary folder to store the transformers of the pipeline
location = "cachedir"
memory = Memory(location=location, verbose=10)
cached_pipe = Pipeline(
    [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
    memory=memory,
)

# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)

# Delete the temporary cache before exiting
memory.clear(warn=False)
rmtree(location)
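
To see the cache at work outside a grid search, the following minimal sketch fits a cached pipeline twice and counts the files joblib wrote to disk. It assumes a temporary directory created with tempfile.mkdtemp as the cache location, passed to memory as a path string; the names cache_dir, n_cached_files and train_accuracy are illustrative.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

cache_dir = mkdtemp()
try:
    cached_pipe = Pipeline(
        [
            ("reduce_dim", PCA(n_components=8)),
            ("classify", LinearSVC(dual=False, max_iter=10000)),
        ],
        memory=cache_dir,
    )
    # First fit: PCA is fitted and the result is written to the cache
    cached_pipe.fit(X, y)
    # Only the final step changes here, so the PCA inputs are identical and
    # the fitted PCA can be loaded from the cache instead of being recomputed
    cached_pipe.set_params(classify__C=10).fit(X, y)

    n_cached_files = sum(len(files) for _, _, files in os.walk(cache_dir))
    print("files in cache:", n_cached_files)
    train_accuracy = cached_pipe.score(X, y)
    print(f"training accuracy: {train_accuracy:.3f}")
finally:
    # Delete the temporary cache before exiting
    rmtree(cache_dir)
```

Only the transformer steps are cached; the final estimator is always refitted, which is why changing C still triggers a new LinearSVC fit.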

Summary

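This lab used scikit-learn's Pipeline and GridSearchCV to tune different kinds of estimators in a single cross-validation run. It also showed how the memory argument enables caching, so that the fitted state of a transformer is stored and reused. This is especially useful when fitting a transformer is expensive.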
