Bisecting K-Means vs Regular K-Means Performance

Introduction

This step-by-step tutorial compares the clusterings produced by the regular K-Means algorithm and by Bisecting K-Means. It shows how the results of the two algorithms differ as n_clusters increases.

Import Libraries

In this step, we will import the libraries used in this tutorial: matplotlib for plotting, and scikit-learn for data generation and the two clustering algorithms.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import BisectingKMeans, KMeans

Generate Sample Data

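In this step, we will generate sample data using the make_blobs() function from scikit-learn. We will generate 10000 two-dimensional samples grouped around 2 centers. The true labels returned by make_blobs() are discarded because clustering is unsupervised.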

n_samples = 10000
random_state = 0
X, _ = make_blobs(n_samples=n_samples, centers=2, random_state=random_state)
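To confirm what make_blobs() returns, the short sketch below regenerates the same data but keeps the labels, under the illustrative name y_true (the tutorial itself discards them), and inspects the shapes.

```python
from sklearn.datasets import make_blobs

n_samples = 10000
random_state = 0
# y_true is kept here only for inspection; the tutorial discards it.
X, y_true = make_blobs(n_samples=n_samples, centers=2, random_state=random_state)

print(X.shape)                 # one row per sample, one column per feature
print(sorted(set(y_true.tolist())))  # the blob index each sample was drawn from
```

make_blobs() defaults to two features, which is why the data can be drawn directly on a 2D scatter plot later on.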

Define Number of Clusters and Algorithms

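In this step, we will define the numbers of clusters to try with KMeans and BisectingKMeans. We will also collect the algorithms to be compared in a dictionary that maps a display name to the estimator class; the classes are instantiated later, once per value of n_clusters.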

n_clusters_list = [4, 8, 16]
clustering_algorithms = {
    "Bisecting K-Means": BisectingKMeans,
    "K-Means": KMeans,
}

Visualize Results

In this step, we will visualize the results in a grid of subplots, with one row per algorithm and one column per value of n_clusters. Each subplot is a scatter plot of the data points colored by cluster label, with the cluster centroids drawn in red. We will iterate over each algorithm and each number of clusters, fit the model, and plot the result.

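# One row per algorithm, one column per value of n_clusters.
fig, axs = plt.subplots(len(clustering_algorithms), len(n_clusters_list), figsize=(12, 5))
# Transpose so that axs[j, i] is the subplot for n_clusters_list[j] and algorithm i.
axs = axs.T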

for i, (algorithm_name, Algorithm) in enumerate(clustering_algorithms.items()):
    for j, n_clusters in enumerate(n_clusters_list):
        algo = Algorithm(n_clusters=n_clusters, random_state=random_state, n_init=3)
        algo.fit(X)
        centers = algo.cluster_centers_

        axs[j, i].scatter(X[:, 0], X[:, 1], s=10, c=algo.labels_)
        axs[j, i].scatter(centers[:, 0], centers[:, 1], c="r", s=20)

        axs[j, i].set_title(f"{algorithm_name} : {n_clusters} clusters")

for ax in axs.flat:
    ax.label_outer()
    ax.set_xticks([])
    ax.set_yticks([])

plt.show()

Summary

This tutorial compared the clusterings produced by the regular K-Means algorithm and by Bisecting K-Means on sample data generated with scikit-learn. We visualized the results using subplots, with scatter plots representing the data points and the cluster centroids. We found that Bisecting K-Means tends to create clusters with a more regular large-scale structure, whereas regular K-Means produces clusterings that change substantially from one value of n_clusters to the next.
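The regular structure comes from how bisecting works: it starts from a single cluster and repeatedly splits one existing cluster in two, so every new cluster is nested inside an earlier one. The function below is a toy sketch of that idea, not scikit-learn's implementation. It assumes the split target is the cluster with the largest sum of squared errors, and the name bisecting_kmeans_sketch is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def bisecting_kmeans_sketch(X, n_clusters, random_state=0):
    """Toy bisecting k-means: repeatedly split the cluster with the largest SSE."""
    labels = np.zeros(len(X), dtype=int)  # start with every point in cluster 0
    for new_label in range(1, n_clusters):
        # Sum of squared distances to the cluster mean, for each current cluster.
        sse = [
            ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
            for k in range(new_label)
        ]
        target = int(np.argmax(sse))
        mask = labels == target
        # Split the chosen cluster in two with an ordinary 2-means run.
        splitter = KMeans(n_clusters=2, n_init=3, random_state=random_state)
        halves = splitter.fit_predict(X[mask])
        # One half keeps the old label, the other half gets the new label.
        idx = np.flatnonzero(mask)
        labels[idx[halves == 1]] = new_label
    return labels


X, _ = make_blobs(n_samples=1000, centers=2, random_state=0)
labels = bisecting_kmeans_sketch(X, n_clusters=4)
print(np.bincount(labels))  # number of points in each of the 4 clusters
```

Regular K-Means, by contrast, places all centroids at once for each value of n_clusters, so nothing ties the solution for 8 clusters to the solution for 4.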
