Precomputing Nearest Neighbors for Efficient KNeighborsClassifier

Introduction

This lab demonstrates how to precompute the k nearest neighbors before passing them to KNeighborsClassifier. KNeighborsClassifier can compute the nearest neighbors internally, but precomputing them has several benefits: finer control over parameters, caching for repeated use, and support for custom implementations. Here we use the caching feature of Pipeline to reuse the nearest neighbors graph across multiple fits of KNeighborsClassifier.

Import Libraries

In this step, we will import all the necessary libraries.

from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline

Load Data

In this step, we will load the digits dataset from scikit-learn and define the candidate values of n_neighbors for the grid search.

X, y = load_digits(return_X_y=True)
n_neighbors_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
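As a quick sanity check on what was loaded, the sketch below prints the array shape and the number of classes. It reloads the data under its own illustrative names (digits_X, digits_y), which are not part of the lab, so that it can run on its own.

```python
import numpy as np
from sklearn.datasets import load_digits

# Illustrative names; the lab itself uses X and y.
digits_X, digits_y = load_digits(return_X_y=True)

n_samples, n_features = digits_X.shape
n_classes = len(np.unique(digits_y))
print(f"{n_samples} samples, {n_features} features, {n_classes} classes")
```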

Compute Nearest Neighbors Graph

In this step, we will compute the nearest neighbors graph using KNeighborsTransformer.

# The transformer computes the nearest neighbors graph using the maximum number
# of neighbors necessary in the grid search. The classifier model filters the
# nearest neighbors graph as required by its own n_neighbors parameter.
graph_model = KNeighborsTransformer(n_neighbors=max(n_neighbors_list), mode="distance")
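To make the shape of this graph concrete, here is a minimal sketch that fits a transformer on a small slice of the digits data and inspects the result. The names X_small, demo_transformer, and demo_graph and the 200-sample slice are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsTransformer

# Use a small slice of the digits data to keep the demonstration fast.
X_small, _ = load_digits(return_X_y=True)
X_small = X_small[:200]

demo_transformer = KNeighborsTransformer(n_neighbors=9, mode="distance")
demo_graph = demo_transformer.fit_transform(X_small)

# The result is a sparse (n_samples, n_samples) matrix of distances.
print(demo_graph.shape)

# In "distance" mode each sample also counts as its own neighbor, so each
# row stores n_neighbors + 1 entries.
entries_per_row = demo_graph.nnz // demo_graph.shape[0]
print(entries_per_row)
```

Because only a few entries per row are stored, the graph stays much smaller than a dense pairwise distance matrix.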

Define Classifier Model

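In this step, we will define the KNeighborsClassifier model. Setting metric="precomputed" tells the classifier to treat its input as a precomputed distance graph rather than as raw features.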

classifier_model = KNeighborsClassifier(metric="precomputed")
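The sketch below shows, outside of any pipeline, how a precomputed graph might be handed to the classifier by hand. The train/test split, the choice of n_neighbors=3, and all variable names are assumptions made for illustration; the lab wires these steps together with a Pipeline instead.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer

X_demo, y_demo = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# Build the neighbors graph once, keeping up to 9 neighbors per sample.
transformer = KNeighborsTransformer(n_neighbors=9, mode="distance")
train_graph = transformer.fit_transform(X_train)  # shape (n_train, n_train)
test_graph = transformer.transform(X_test)        # shape (n_test, n_train)

# The classifier uses only the 3 nearest neighbors stored in each row.
clf = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
clf.fit(train_graph, y_train)
demo_accuracy = clf.score(test_graph, y_test)
print(f"accuracy: {demo_accuracy:.3f}")
```

Note that the classifier's n_neighbors must not exceed the number of neighbors stored in the graph, which is why the transformer is built with the largest value that will be needed.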

Cache Nearest Neighbors Graph

In this step, we will cache the nearest neighbors graph between multiple fits of KNeighborsClassifier using the caching property of pipelines.

# Note that we give `memory` a directory to cache the graph computation
# that will be used several times when tuning the hyperparameters of the
# classifier.
with TemporaryDirectory(prefix="sklearn_graph_cache_") as tmpdir:
    full_model = Pipeline(
        steps=[("graph", graph_model), ("classifier", classifier_model)], memory=tmpdir
    )

Tune Hyperparameters

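In this step, we will tune the n_neighbors hyperparameter of the classifier using GridSearchCV. The code below continues inside the with block from the previous step, so the cache directory still exists while the grid search runs.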

    param_grid = {"classifier__n_neighbors": n_neighbors_list}
    grid_model = GridSearchCV(full_model, param_grid)
    grid_model.fit(X, y)
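Since the lab's code is split across steps, it can help to see the cached pipeline and the grid search together in one runnable piece. The following is a reduced sketch under stated assumptions: it uses only the first 500 samples and three candidate values so that it runs quickly, and the names X_sub, candidates, demo_pipeline, and demo_search are illustrative rather than part of the lab.

```python
from tempfile import TemporaryDirectory

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import Pipeline

# Reduced problem so the sketch runs quickly: 500 samples, three candidates.
X_sub, y_sub = load_digits(return_X_y=True)
X_sub, y_sub = X_sub[:500], y_sub[:500]
candidates = [1, 3, 5]

with TemporaryDirectory(prefix="sklearn_graph_cache_") as cache_dir:
    demo_pipeline = Pipeline(
        steps=[
            (
                "graph",
                KNeighborsTransformer(n_neighbors=max(candidates), mode="distance"),
            ),
            ("classifier", KNeighborsClassifier(metric="precomputed")),
        ],
        memory=cache_dir,
    )
    demo_search = GridSearchCV(demo_pipeline, {"classifier__n_neighbors": candidates})
    demo_search.fit(X_sub, y_sub)

# The search results remain available after the cache directory is removed.
best_k = demo_search.best_params_["classifier__n_neighbors"]
print(best_k, round(demo_search.best_score_, 3))
```

Within each cross-validation split, the graph step receives the same data and parameters for every candidate, so the pipeline can load the cached graph instead of recomputing it.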

Visualize Results

In this step, we will visualize the results of the grid search.

# Plot the results of the grid search.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].errorbar(
    x=n_neighbors_list,
    y=grid_model.cv_results_["mean_test_score"],
    yerr=grid_model.cv_results_["std_test_score"],
)
axes[0].set(xlabel="n_neighbors", title="Classification accuracy")
axes[1].errorbar(
    x=n_neighbors_list,
    y=grid_model.cv_results_["mean_fit_time"],
    yerr=grid_model.cv_results_["std_fit_time"],
    color="r",
)
axes[1].set(xlabel="n_neighbors", title="Fit time (with caching)")
fig.tight_layout()
plt.show()

Summary

In this lab, we precomputed the k nearest neighbors graph with KNeighborsTransformer, passed it to KNeighborsClassifier, and used the caching feature of pipelines to reuse the graph across fits. We also tuned the classifier's n_neighbors hyperparameter using GridSearchCV and visualized the results.
