Mastering Digit Classification with Grid Search

Introduction

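This lab shows how to tune hyperparameters with cross-validation using the scikit-learn library. The goal is to classify images of handwritten digits, simplified to a binary task: deciding whether a digit is an 8 or not. The data comes from the digits dataset. The selected hyperparameters and the trained model are then evaluated on a dedicated evaluation set that was not used during model selection.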

Loading the Data

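Load the digits dataset and flatten the images into vectors. Each 8×8-pixel image must be converted into a vector of 64 pixel values, which gives a final data array of shape (n_images, n_pixels). The data is also split into training and test sets of equal size.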

from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
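
As a quick sanity check on the split above, the following sketch rebuilds the data and prints the array shapes and the share of positive samples. The name `positive_rate` is illustrative and not part of the original lab. Only about one digit in ten is an 8, so the classes are imbalanced, which is one reason to look at precision and recall rather than accuracy alone.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Flatten each 8x8 image into a 64-element feature vector.
X = digits.images.reshape((len(digits.images), -1))
# Binary target: True where the digit is an 8.
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Illustrative check (not in the original lab): shapes and class balance.
positive_rate = y.mean()
print(X.shape, X_train.shape, X_test.shape)
print(f"share of 8s: {positive_rate:.3f}")
```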

Defining the Grid Search Strategy

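Define a function to pass to the refit parameter of the GridSearchCV instance. This function implements a custom strategy for selecting the best candidate from the cv_results_ attribute of GridSearchCV. Once a candidate is selected, the GridSearchCV instance automatically refits it.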

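The strategy here is to shortlist the models that are best in terms of precision and recall. From that shortlist, the model that is fastest at prediction time is selected as the final model. Note that these custom choices are entirely arbitrary.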

import pandas as pd
from sklearn.metrics import classification_report

def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision threshold
    of 0.98, rank the remaining by recall and keep all models within one standard
    deviation of the best by recall. Once these models are selected, we can select the
    fastest model to predict.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by the `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    # Print the info about the grid-search for the different scores
    precision_threshold = 0.98

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # Select the most performant models in terms of recall
    # (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all the\n"
        "models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    # From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    return fastest_top_recall_high_precision_index

Defining the Hyperparameters

Define the hyperparameter grid and create the GridSearchCV instance.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=["precision", "recall"], refit=refit_strategy
)
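
To get a feel for how much work this grid implies, the sketch below expands it with `ParameterGrid` and counts the candidates. The value `n_folds = 5` is an assumption: it matches the default `cv` in recent scikit-learn versions, so adjust it if a different `cv` is passed.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

# Each dict is expanded independently: 1*2*4 rbf candidates plus
# 1*4 linear candidates.
candidates = list(ParameterGrid(tuned_parameters))
n_folds = 5  # assumption: the default `cv` in recent scikit-learn versions
print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # cross-validation fits before the refit
```

Listing the grid as two dicts keeps `gamma` out of the linear-kernel candidates, where it would have no effect.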

Fitting the Model and Making Predictions

Fit the model and make predictions on the evaluation set.

grid_search.fit(X_train, y_train)

# The parameters selected by the grid search with our custom strategy are:
print(grid_search.best_params_)

# Finally, evaluate the tuned model on the held-out evaluation set.
# The `grid_search` object **has automatically been refit** on the full training
# set with the parameters selected by our custom refit strategy.
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
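
The sketch below illustrates which attributes a fitted search exposes when `refit` is a callable. To stay self-contained and quick to run, it uses a much smaller grid and a simpler, hypothetical rule (`pick_best_recall`) in place of the lab's strategy, and it fits on the full dataset without a train/test split.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target == 8


def pick_best_recall(cv_results):
    """Hypothetical, simpler refit rule: highest mean recall wins."""
    return int(np.argmax(cv_results["mean_test_recall"]))


search = GridSearchCV(
    SVC(kernel="linear"),
    {"C": [1, 10]},
    scoring=["precision", "recall"],
    refit=pick_best_recall,
)
search.fit(X, y)

print(search.best_index_, search.best_params_)
# With a callable `refit`, no single "best score" is defined.
print(hasattr(search, "best_score_"))
# The refitted model is exposed as `best_estimator_`.
print(type(search.best_estimator_).__name__)
```

Since `best_score_` is not set in this configuration, the per-metric values for the chosen candidate can be read from `cv_results_` at position `best_index_`.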

Summary

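In this lab, we learned how to tune hyperparameters with cross-validation using the scikit-learn library. We used the digits dataset and defined a custom refit strategy that selects the best candidate from the cv_results_ attribute of a GridSearchCV instance. Finally, we evaluated the tuned model on the held-out evaluation set.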
