Exploring Discretization Strategies with KBinsDiscretizer

Introduction

In this lab, we will explore the binning strategies available in the KBinsDiscretizer class of the scikit-learn library. KBinsDiscretizer is a transformer that discretizes continuous features by splitting each feature's range into a fixed number of bins and replacing every value with the bin it falls into. We will visualize the three strategies (uniform, quantile, and kmeans) by applying each of them to three different datasets.
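To make the idea of binning concrete before the full lab, here is a minimal sketch on a single feature. The six input values are made up for illustration and are not part of the lab's datasets.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Six continuous values in a single feature column (illustrative data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Three equal-width bins; encode="ordinal" returns the bin index of each value.
enc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = enc.fit_transform(X)

print(X_binned.ravel())   # bin index per value
print(enc.bin_edges_[0])  # the four edges that delimit the three bins
```

With ordinal encoding, each continuous value is replaced by the integer index of its bin, which is the representation used for the plots later in this lab.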

Import Libraries

We begin by importing the necessary libraries: numpy, matplotlib.pyplot, KBinsDiscretizer from the sklearn.preprocessing module, and make_blobs from the sklearn.datasets module.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.datasets import make_blobs

Create Datasets

We will create three two-dimensional datasets of 200 samples each. The first is drawn from a uniform distribution between -3 and 3 in both dimensions. The second is generated with the make_blobs function from sklearn.datasets and contains four clusters of unequal size (20, 80, 20, and 80 samples). The third is also generated with make_blobs and contains two clusters of 40 and 160 samples. All clusters use a standard deviation of 0.5, and a fixed random seed keeps the data reproducible.

n_samples = 200
centers_0 = np.array([[0, 0], [0, 5], [2, 4], [8, 8]])
centers_1 = np.array([[0, 0], [3, 1]])

X_list = [
    np.random.RandomState(42).uniform(-3, 3, size=(n_samples, 2)),
    make_blobs(
        n_samples=[n_samples // 10, n_samples * 4 // 10, n_samples // 10, n_samples * 4 // 10],
        cluster_std=0.5,
        centers=centers_0,
        random_state=42,
    )[0],
    make_blobs(
        n_samples=[n_samples // 5, n_samples * 4 // 5],
        cluster_std=0.5,
        centers=centers_1,
        random_state=42,
    )[0],
]
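The `[0]` after each make_blobs call above keeps only the sample coordinates. As a small sketch of why, the snippet below regenerates the third dataset on its own and also keeps the labels; the variable names `X` and `y` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 200
centers = np.array([[0, 0], [3, 1]])

# make_blobs returns a tuple (X, y): X holds the coordinates and
# y holds the index of the center each sample was drawn around.
X, y = make_blobs(
    n_samples=[n_samples // 5, n_samples * 4 // 5],
    cluster_std=0.5,
    centers=centers,
    random_state=42,
)

print(X.shape)         # (200, 2)
print(np.bincount(y))  # 40 samples around [0, 0], 160 around [3, 1]
```

Passing a list as n_samples sets the number of samples per center, which is how the lab produces clusters of unequal size.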

Apply Different Strategies

We will now apply the three different strategies available in KBinsDiscretizer to each of the datasets. The strategies are:

'uniform': All bins of a feature have the same width, so the bin edges are evenly spaced between the feature's minimum and maximum.
'quantile': The bin edges are placed at quantiles of the feature, so each bin holds approximately the same number of samples.
'kmeans': The bin edges are derived from a one-dimensional k-means clustering of each feature, so the values in a bin share the same nearest cluster center.
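The difference between the three strategies is easiest to see on a single skewed feature. The following is a minimal sketch on made-up one-dimensional data (160 values near 0 and 40 values near 6); the data and the `results` dictionary are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative skewed feature: many values near 0 and a few near 6.
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 0.5, 160), rng.normal(6, 0.5, 40)]).reshape(-1, 1)

results = {}
for strategy in ["uniform", "quantile", "kmeans"]:
    enc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    codes = enc.fit_transform(x).ravel().astype(int)
    edges = enc.bin_edges_[0]                 # edges for the single feature
    counts = np.bincount(codes, minlength=4)  # number of samples per bin
    results[strategy] = (edges, counts)
    print(f"{strategy:>8}: edges={np.round(edges, 2)}, counts={counts}")
```

In this sketch the uniform edges are evenly spaced regardless of where the data lies, the quantile bins each hold about a quarter of the samples, and the kmeans edges follow the clusters found in the data. Recent scikit-learn releases may emit a FutureWarning for the quantile strategy; it does not affect the comparison.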

strategies = ["uniform", "quantile", "kmeans"]

figure = plt.figure(figsize=(14, 9))
i = 1
for ds_cnt, X in enumerate(X_list):
    ax = plt.subplot(len(X_list), len(strategies) + 1, i)
    ax.scatter(X[:, 0], X[:, 1], edgecolors="k")
    if ds_cnt == 0:
        ax.set_title("Input data", size=14)

    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min(), X[:, 0].max(), 300),
        np.linspace(X[:, 1].min(), X[:, 1].max(), 300),
    )
    grid = np.c_[xx.ravel(), yy.ravel()]

    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())

    i += 1
    # fit KBinsDiscretizer on the dataset with each strategy, then encode the grid points
    for strategy in strategies:
        enc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
        enc.fit(X)
        grid_encoded = enc.transform(grid)

        ax = plt.subplot(len(X_list), len(strategies) + 1, i)

        # bins of feature 0: bands laid out side by side along the x-axis
        horizontal = grid_encoded[:, 0].reshape(xx.shape)
        ax.contourf(xx, yy, horizontal, alpha=0.5)
        # bins of feature 1: bands stacked along the y-axis
        vertical = grid_encoded[:, 1].reshape(xx.shape)
        ax.contourf(xx, yy, vertical, alpha=0.5)

        ax.scatter(X[:, 0], X[:, 1], edgecolors="k")
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title("strategy='%s'" % (strategy,), size=14)

        i += 1

plt.tight_layout()
plt.show()

Visualize Results

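The figure produced in the previous step has one row per dataset and four columns: the input data on the left, followed by the result of the uniform, quantile, and kmeans strategies. Each shaded region marks an area in which the discretized encoding is constant, so the boundaries between regions are the bin edges that the strategy chose for each feature.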
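The band boundaries in the plots can also be read directly from a fitted discretizer. The sketch below rebuilds only the first (uniform) dataset and the uniform strategy; the names `cells` and `counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Same uniform dataset as the first row of the figure.
X = np.random.RandomState(42).uniform(-3, 3, size=(200, 2))

enc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform").fit(X)

# One array of 5 edges (4 bins) per feature; the band boundaries
# in the plot sit at these values.
for j, edges in enumerate(enc.bin_edges_):
    print(f"feature {j}: {np.round(edges, 2)}")

# Each sample maps to a pair of bin indices, i.e. one cell of the 4 x 4 grid
# formed by overlaying the two sets of bands.
codes = enc.transform(X).astype(int)
cells, counts = np.unique(codes, axis=0, return_counts=True)
print(len(cells), "occupied cells; samples per cell:", counts)
```

Swapping the strategy argument for "quantile" or "kmeans" shows how the edges, and therefore the cell populations, change for the same data.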

Summary

In this lab, we explored the different strategies available in the KBinsDiscretizer class of the Python scikit-learn library. We created three datasets and applied the three different strategies to each of them. We then visualized the results of the discretization.
