Mastering Digit Classification with Grid Search

Introduction

This lab shows how to tune hyperparameters with cross-validation using the scikit-learn library. The goal is to classify images of handwritten digits, reduced to a binary task so it is easier to follow: deciding whether a digit is an 8 or not. The data comes from the digits dataset. The selected hyperparameters and the trained model are then evaluated on a dedicated evaluation set that was not used during model selection.


Load the Data

We will load the digits dataset and flatten the images into vectors. Each 8-by-8 pixel image is converted into a vector of 64 pixels, so the final data array has shape (n_images, n_pixels). We will also split the data into a training set and a test set of equal size.

from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
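
As a quick sanity check on the loading step above, the following sketch repeats it and prints the resulting shapes and the share of positive labels. It only uses calls already shown in this lab; the printed summary lines are illustrative additions, not part of the original lab.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Flatten each 8x8 image into a 64-element row vector
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))

# Binary target: True where the digit is an 8, False otherwise
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

print("original images:", digits.images.shape)
print("flattened data:", X.shape)
print("train / test rows:", X_train.shape[0], "/", X_test.shape[0])
print("fraction of 8s:", round(float(y.mean()), 3))
```

Because only about one image in ten is an 8, the classes are imbalanced, which is one reason to look at precision and recall rather than plain accuracy.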

Define the Grid-Search Strategy

We will define a function to pass to the refit parameter of the GridSearchCV instance. It implements a custom strategy for selecting the best candidate from the cv_results_ attribute of GridSearchCV. Once the candidate is selected, the GridSearchCV instance refits it automatically.

Here, the strategy is to shortlist the models that do best on precision and recall: we keep only candidates whose mean precision exceeds 0.98, then those whose recall lies within one standard deviation of the best recall. From that shortlist, we finally pick the model that is fastest at prediction time. Note that these custom choices are completely arbitrary.

import pandas as pd
from sklearn.metrics import classification_report

def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision threshold
    of 0.98, rank the remaining by recall and keep all models within one standard
    deviation of the best by recall. Once these models are selected, we can select the
    fastest model to predict.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by the `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    # Print the info about the grid-search for the different scores
    precision_threshold = 0.98

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # Select the most performant models in terms of recall
    # (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all the\n"
        "models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    # From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    return fastest_top_recall_high_precision_index
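
To make the three selection steps easier to trace, here is a stripped-down sketch of the same logic applied to a tiny hand-made results table. The numbers in `toy_cv_results` and the name `toy_refit_strategy` are hypothetical and exist only for illustration; real values come from `cv_results_` after fitting.

```python
import pandas as pd

# Hypothetical cross-validation results for four candidates (made-up numbers)
toy_cv_results = {
    "mean_test_precision": [0.99, 0.97, 0.995, 0.99],
    "mean_test_recall": [0.90, 0.95, 0.91, 0.80],
    "mean_score_time": [0.004, 0.001, 0.002, 0.001],
}


def toy_refit_strategy(cv_results, precision_threshold=0.98):
    """Return the row index chosen by the precision/recall/speed strategy."""
    results = pd.DataFrame(cv_results)

    # Step 1: keep candidates above the precision threshold
    high_precision = results[results["mean_test_precision"] > precision_threshold]

    # Step 2: keep candidates within one standard deviation of the best recall
    recall = high_precision["mean_test_recall"]
    recall_cutoff = recall.max() - recall.std()
    high_recall = high_precision[recall > recall_cutoff]

    # Step 3: among the survivors, pick the fastest one at scoring time
    return int(high_recall["mean_score_time"].idxmin())


best_index = toy_refit_strategy(toy_cv_results)
print("selected candidate index:", best_index)
```

In this toy table, candidate 1 has the best recall and the shortest scoring time, yet it is dropped in the first step because its precision is below the threshold. The order of the filters therefore matters as much as the thresholds themselves.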

Define the Hyperparameters

We will define the hyperparameter grid and create the GridSearchCV instance.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=["precision", "recall"], refit=refit_strategy
)
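
To get a feel for the size of the search, the grid can be expanded with `ParameterGrid` from scikit-learn. The sketch below assumes the default 5-fold cross-validation that recent scikit-learn versions use when `cv` is not set; the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

# Expand the list of grids into individual parameter combinations
candidates = list(ParameterGrid(tuned_parameters))
n_candidates = len(candidates)

# Assumption: cv is left at its default of 5 folds
n_folds = 5
n_fits = n_candidates * n_folds

print("number of candidates:", n_candidates)
print("cross-validation fits:", n_fits, "(plus one final refit)")
for params in candidates[:3]:
    print(params)
```

Splitting the grid into two dictionaries avoids pairing `gamma` with the linear kernel, where it has no effect, and so keeps the number of fits down.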

Train the Model and Make Predictions

We will fit the grid search on the training set and make predictions on the evaluation set.

grid_search.fit(X_train, y_train)

# The parameters selected by the grid-search with our custom strategy are:
print(grid_search.best_params_)

# Finally, we evaluate the fine-tuned model on the left-out evaluation set: the
# `grid_search` object **has automatically been refit** on the full training
# set with the parameters selected by our custom refit strategy.
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
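
To help read the precision and recall columns of the report, the sketch below derives both numbers by hand from a confusion matrix and compares them with scikit-learn's own metric functions. The label arrays `y_true_demo` and `y_pred_demo` are small made-up examples, not output from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: True means "the digit is an 8"
y_true_demo = np.array([True, True, True, False, False, False, False, False])
y_pred_demo = np.array([True, True, False, True, False, False, False, False])

# For binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of everything predicted as an 8, how much really is an 8
precision = tp / (tp + fp)
# Recall: of all real 8s, how many were found
recall = tp / (tp + fn)

print(f"by hand: precision={precision:.3f}, recall={recall:.3f}")
print(
    "sklearn: "
    f"precision={precision_score(y_true_demo, y_pred_demo):.3f}, "
    f"recall={recall_score(y_true_demo, y_pred_demo):.3f}"
)
```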

Summary

In this lab, we learned how to tune hyperparameters with cross-validation using the scikit-learn library. We used the digits dataset and defined a custom refit strategy to select the best candidate from the cv_results_ attribute of the GridSearchCV instance. Finally, we evaluated the tuned model on the held-out test set.
