Plot Grid Search Digits


This tutorial is from the open-source community.

Introduction

This lab shows how to perform hyperparameter tuning with cross-validation using the scikit-learn library. To keep things simple, the task is framed as a binary classification problem on the digits dataset: identifying whether a handwritten digit image shows an 8 or not. The performance of the selected hyperparameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step.

VM Tips

After the VM starts up, click the Notebook tab in the top left corner to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

The skills covered in this lab are: sklearn/metrics (Metrics), sklearn/model_selection (Model Selection), sklearn/svm (Support Vector Machines), and ml/sklearn (scikit-learn).

Load Data

We will load the digits dataset and flatten the images into vectors. Each 8x8 pixel image is reshaped into a vector of 64 values, giving a data array of shape (n_images, n_pixels). We will also split the data into training and testing sets of equal size.

from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

## Flatten each 8x8 image into a vector of 64 values
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
## Binary target: True when the digit is an 8, False otherwise
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
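
As an optional sanity check (not part of the original lab), you can confirm the flattened shape and the class balance; since only one digit class out of ten is an 8, the positive class is a small minority:

## Optional check of the flattened data and the class balance
print("Data shape:", X.shape)            ## (n_images, 64)
print("Positive class ratio:", y.mean()) ## roughly 0.1, so the classes are imbalanced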

Define the Refit Strategy

We will define a function to pass to the refit parameter of the GridSearchCV instance. It implements a custom strategy to select the best candidate from the cv_results_ attribute of the GridSearchCV. Once the candidate is selected, it is automatically refit by the GridSearchCV instance.

Here, the strategy is to short-list the models that perform best in terms of precision and recall, and then, from the short-listed models, pick the one that is fastest at predicting. Note that these custom choices are completely arbitrary.

import pandas as pd
from sklearn.metrics import classification_report

def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision
    threshold of 0.98, rank the remaining models by recall, and keep all models
    within one standard deviation of the best recall. From these models, we then
    select the fastest one at predicting.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by the `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    ## print the info about the grid-search for the different scores
    precision_threshold = 0.98

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    ## Filter-out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    ## Select the most performant models in terms of recall
    ## (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all\n"
        "the models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    ## From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    return fastest_top_recall_high_precision_index
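
If you want to see this selection logic in isolation before running the full grid search, the optional sketch below calls refit_strategy on a small, hand-built dictionary that mimics the structure of cv_results_. All numbers and parameter combinations here are hypothetical and chosen only for illustration:

import numpy as np

## Hypothetical cv_results_-like dict with three candidate models
toy_cv_results = {
    "mean_test_precision": np.array([0.990, 0.970, 0.985]),
    "std_test_precision": np.array([0.010, 0.020, 0.010]),
    "mean_test_recall": np.array([0.90, 0.95, 0.92]),
    "std_test_recall": np.array([0.03, 0.02, 0.02]),
    "rank_test_precision": np.array([1, 3, 2]),
    "rank_test_recall": np.array([3, 1, 2]),
    "mean_score_time": np.array([0.020, 0.010, 0.005]),
    "params": [
        {"kernel": "rbf", "C": 1, "gamma": 1e-3},
        {"kernel": "rbf", "C": 10, "gamma": 1e-3},
        {"kernel": "linear", "C": 1},
    ],
}

## Prints the intermediate tables and returns the index of the fastest model
## among the high-precision, high-recall candidates (index 2 in this toy case)
print(refit_strategy(toy_cv_results))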

Define Hyperparameters

We will define the hyperparameter grid and create the GridSearchCV instance with two scoring metrics, precision and recall.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=["precision", "recall"], refit=refit_strategy
)
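
Because two scoring metrics are given, refit must tell GridSearchCV how to pick the final model. As a reference (not used in this lab), a simpler alternative is to pass the name of a single metric instead of a callable:

## Alternative sketch: always refit on the candidate with the best precision
simple_grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=["precision", "recall"], refit="precision"
)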

Fit the Model and Make Predictions

We will fit the grid search on the training set and make predictions on the held-out evaluation set.

grid_search.fit(X_train, y_train)

## The parameters selected by the grid-search with our custom strategy are:
grid_search.best_params_

## Finally, we evaluate the fine-tuned model on the left-out evaluation set: the
## `grid_search` object **has automatically been refit** on the full training
## set with the parameters selected by our custom refit strategy.
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
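
Optionally (not shown in the original lab), you can inspect the full grid-search table to see how each candidate traded off precision, recall, and scoring time:

import pandas as pd

results_df = pd.DataFrame(grid_search.cv_results_)
print(
    results_df[
        ["param_kernel", "param_C", "mean_test_precision",
         "mean_test_recall", "mean_score_time"]
    ]
)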

Summary

In this lab, we learned how to perform hyperparameter tuning with cross-validation using the scikit-learn library. We used the digits dataset and defined a custom refit strategy to select the best candidate from the cv_results_ attribute of the GridSearchCV instance. Finally, we evaluated the fine-tuned model on the left-out evaluation set.
