Polynomial Kernel Approximation With Scikit-Learn

Introduction

This lab demonstrates how to use polynomial kernel approximation in scikit-learn to efficiently generate approximate polynomial kernel feature maps, which are used to train linear classifiers that approach the accuracy of kernelized ones. We will be using the Covtype dataset, which contains 581,012 samples with 54 features each, distributed among 7 classes. The goal of this dataset is to predict forest cover type from cartographic variables only (no remotely sensed data). After loading, we transform it into a binary classification problem to match the version of the dataset on the LIBSVM webpage, which was the one used in the original TensorSketch paper.
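
For context, the degree-d polynomial kernel is k(x, y) = (gamma * <x, y> + coef0) ** d. PolynomialCountSketch (scikit-learn's implementation of TensorSketch) builds an explicit feature map whose inner products approximate this kernel, so a linear model trained on the mapped features behaves much like a kernelized one. The following is a minimal, self-contained illustration of the API; the random data and parameter values here are arbitrary and not part of the lab.

import numpy as np

from sklearn.kernel_approximation import PolynomialCountSketch

## Arbitrary demo data: 8 samples with 54 features, mimicking Covtype rows
X_demo = np.random.RandomState(0).rand(8, 54)

## Map the inputs to a 500-dimensional space whose inner products
## approximate the degree-4 polynomial kernel
ps = PolynomialCountSketch(degree=4, n_components=500, random_state=0)
Z = ps.fit_transform(X_demo)
print(Z.shape)  ## (8, 500)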

VM Tips

After the VM starts up, click the top-left corner to switch to the Notebook tab and open Jupyter Notebook for practice.

Sometimes you may need to wait a few seconds for Jupyter Notebook to finish loading. Operation validation cannot be automated because of limitations in Jupyter Notebook.

If you run into issues while learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

This lab practices the following skills: sklearn/datasets (Datasets), sklearn/kernel_approximation (Kernel Approximation), sklearn/model_selection (Model Selection), sklearn/preprocessing (Preprocessing and Normalization), and sklearn/svm (Support Vector Machines), along with ml/sklearn (scikit-learn).

Load and Prepare the Data

We will first load the Covtype dataset and turn it into a binary classification problem by singling out one class (class 2) against the rest. Then we will partition the data into a training set and a testing set, and normalize the features.

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer

## Load the Covtype dataset and binarize the labels:
## class 2 becomes the positive class, all other classes the negative one
X, y = fetch_covtype(return_X_y=True)
y[y != 2] = 0
y[y == 2] = 1

## Partition the data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, test_size=10000, random_state=42
)

## Scale features to [0, 1], then normalize each sample to unit norm
mm = make_pipeline(MinMaxScaler(), Normalizer())
X_train = mm.fit_transform(X_train)
X_test = mm.transform(X_test)
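
As a quick, optional sanity check, you can confirm the split sizes and the binarized class balance before training; the expected shapes follow directly from the train_size and test_size arguments above.

import numpy as np

## Expected shapes given the split above: (5000, 54) and (10000, 54)
print(X_train.shape, X_test.shape)

## Class balance of the binarized labels (0 = other cover types, 1 = class 2)
print(np.bincount(y_train))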

Establish a Baseline Model

We will train a linear SVM on the original features to establish a baseline, recording its training time and accuracy for the comparison at the end.

from time import time

from sklearn.svm import LinearSVC

## Train a linear SVM on the original features, timing the fit so we can
## compare training times later
lsvm = LinearSVC(dual="auto")
start = time()
lsvm.fit(X_train, y_train)
lsvm_time = time() - start
lsvm_score = 100 * lsvm.score(X_test, y_test)

## Print the accuracy of the baseline model
print(f"Linear SVM score on raw features: {lsvm_score:.2f}%")

Establish the Kernel Approximation Model

We will now train linear SVMs on features generated by PolynomialCountSketch with different values of n_components, recording each model's training time and accuracy for the final comparison.

from sklearn.kernel_approximation import PolynomialCountSketch

n_runs = 1
N_COMPONENTS = [250, 500, 1000, 2000]
results = {}

for n_components in N_COMPONENTS:
    ps_lsvm_time = 0
    ps_lsvm_score = 0
    for _ in range(n_runs):
        ## Train a linear SVM on features generated by PolynomialCountSketch
        pipeline = make_pipeline(
            PolynomialCountSketch(n_components=n_components, degree=4),
            LinearSVC(dual="auto"),
        )
        start = time()
        pipeline.fit(X_train, y_train)
        ps_lsvm_time += time() - start
        ps_lsvm_score += 100 * pipeline.score(X_test, y_test)

    ## Average over the runs and store the results for the comparison plot
    ps_lsvm_time /= n_runs
    ps_lsvm_score /= n_runs
    results[f"LSVM + PS({n_components})"] = {
        "time": ps_lsvm_time,
        "score": ps_lsvm_score,
    }

    ## Print the accuracy of the model
    print(
        f"Linear SVM score on {n_components} PolynomialCountSketch features: "
        f"{ps_lsvm_score:.2f}%"
    )

Establish the Kernelized SVM Model

We will train a kernelized SVM to see how closely PolynomialCountSketch approximates the performance of the exact polynomial kernel, again timing the fit.

from sklearn.svm import SVC

## Train a kernelized SVM, timing the fit for the comparison plot
ksvm = SVC(C=500.0, kernel="poly", degree=4, coef0=0, gamma=1.0)
start = time()
ksvm.fit(X_train, y_train)
ksvm_time = time() - start
ksvm_score = 100 * ksvm.score(X_test, y_test)

## Print the accuracy of the kernelized SVM
print(f"Kernel-SVM score on raw features: {ksvm_score:.2f}%")

Compare the Results

We will plot the results of the different methods against their training times to compare their performance.

import matplotlib.pyplot as plt

## Plot accuracy against training time for each method
fig, ax = plt.subplots(figsize=(7, 7))

## Linear SVM baseline
ax.scatter([lsvm_time], [lsvm_score], label="Linear SVM", c="green", marker="^")

## Linear SVMs on PolynomialCountSketch features, one point per n_components
for n_components in N_COMPONENTS:
    ax.scatter(
        [results[f"LSVM + PS({n_components})"]["time"]],
        [results[f"LSVM + PS({n_components})"]["score"]],
        c="blue",
    )
    ax.annotate(
        f"n_comp.={n_components}",
        (
            results[f"LSVM + PS({n_components})"]["time"],
            results[f"LSVM + PS({n_components})"]["score"],
        ),
        xytext=(-30, 10),
        textcoords="offset pixels",
    )

## Kernelized SVM
ax.scatter([ksvm_time], [ksvm_score], label="Kernel SVM", c="red", marker="x")

ax.set_xlabel("Training time (s)")
ax.set_ylabel("Accuracy (%)")
ax.legend()
plt.show()

Summary

This lab demonstrated how to use polynomial kernel approximation in scikit-learn to efficiently generate polynomial kernel feature-space approximations. We applied this technique to the Covtype dataset, transforming it into a binary classification problem and training linear classifiers that approximate the accuracy of kernelized ones. We also compared the performance of the different methods and plotted the results against their training times.
