Dimensionality Reduction With Pipeline and GridSearchCV

This tutorial is adapted from an open-source community example.

Introduction

This lab demonstrates how to use Pipeline and GridSearchCV in scikit-learn to optimize over different classes of estimators in a single cross-validation run. We will use a support vector classifier to predict hand-written digits from scikit-learn's digits dataset.

VM Tips

After the VM starts up, click the Notebook tab in the top left corner to open Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Import necessary libraries and load data

We will begin by importing the necessary libraries and loading the digits dataset from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
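
As a quick sanity check, we can inspect the loaded data. This is a minimal sketch; the digits dataset bundled with scikit-learn contains 1,797 samples of 8x8 pixel images, flattened to 64 features.

## Inspect the loaded data
print(X.shape)       ## (1797, 64): 1797 images, 64 pixel features each
print(np.unique(y))  ## the ten digit classes 0 through 9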

Create a pipeline and define parameter grid

We will create a pipeline that performs dimensionality reduction followed by classification with a support vector classifier. During the grid search, we will compare unsupervised PCA and NMF dimensionality reduction with univariate feature selection.

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ## the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
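
The "passthrough" placeholder means the reduce_dim step initially applies no transformation; GridSearchCV will swap candidate estimators into it during the search. As an illustrative sketch, the same substitution can be done by hand with set_params:

## Manually fill the placeholder step (GridSearchCV does this for each candidate)
pipe.set_params(reduce_dim=PCA(n_components=8))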

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]
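
This grid expands to 36 candidate configurations: the first dictionary contributes 2 reducers x 3 component counts x 4 values of C (24 candidates), and the second contributes 1 selector x 3 values of k x 4 values of C (12 candidates). We can confirm the count with scikit-learn's ParameterGrid helper:

from sklearn.model_selection import ParameterGrid

## Count the candidate settings the grid search will evaluate
print(len(ParameterGrid(param_grid)))  ## 36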

Create a GridSearchCV object and fit data

We will create a GridSearchCV object using the pipeline and parameter grid defined in the previous steps, and then fit it to the data.

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)
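
After fitting, we can inspect the best combination the search found. The exact values depend on the cross-validation splits, so treat this as illustrative rather than expected output:

## Report the best parameter combination and its cross-validated accuracy
print(grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")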

Plot results

We will plot the results of the grid search as a bar chart, which lets us compare the accuracy of the different feature reduction techniques.

import pandas as pd

mean_scores = np.array(grid.cv_results_["mean_test_score"])
## scores are in the order of param_grid iteration, which is alphabetical
mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
## select score for best C
mean_scores = mean_scores.max(axis=0)
## create a dataframe to ease plotting
mean_scores = pd.DataFrame(
    mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
)

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("Digit classification accuracy")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

plt.show()

Caching transformers within a Pipeline

Fitting a transformer can be costly, and running a pipeline inside GridSearchCV means the same transformer may be fitted repeatedly with identical parameters. We will now demonstrate how to store the state of a specific transformer so it can be reused, by passing the memory argument to Pipeline to enable caching.

from joblib import Memory
from shutil import rmtree

## Create a temporary folder to store the transformers of the pipeline
location = "cachedir"
memory = Memory(location=location, verbose=10)
cached_pipe = Pipeline(
    [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
    memory=memory,
)

## This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)

## Delete the temporary cache before exiting
memory.clear(warn=False)
rmtree(location)
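
To see what the memory argument does under the hood, here is a minimal standalone sketch of joblib's Memory caching an expensive function. The names demo_location and expensive_transform are illustrative; the second call with the same argument is served from the on-disk cache instead of being recomputed.

from joblib import Memory
from shutil import rmtree

demo_location = "demo_cache"  ## illustrative cache folder name
demo_memory = Memory(location=demo_location, verbose=0)

@demo_memory.cache
def expensive_transform(n):
    ## stand-in for a costly fit; the body only runs on a cache miss
    print(f"computing for n={n}")
    return n**2

expensive_transform(4)  ## prints "computing for n=4" and returns 16
expensive_transform(4)  ## cache hit: returns 16 without re-running the body

## Clean up the demo cache
demo_memory.clear(warn=False)
rmtree(demo_location)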

Summary

In this lab, we used Pipeline and GridSearchCV in scikit-learn to optimize over different classes of estimators in a single CV run. We also demonstrated how to store the state of a specific transformer using the memory argument to enable caching. This can be particularly useful when fitting a transformer is costly.
