Gaussian Mixture Model Initialization Methods

This tutorial is from the open-source community. You can access the full source code.

Introduction

In this lab, we will learn about different initialization methods for Gaussian Mixture Models (GMM). We will use the scikit-learn library to generate sample data and visualize the clustering results. There are four choices for the init_params parameter: kmeans (the default), random, random_from_data, and k-means++.
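
As a quick, self-contained preview, the sketch below fits a GMM with each strategy on toy data and reports how many EM iterations it needed. Note that the random_from_data and k-means++ options require scikit-learn 1.1 or newer.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

## Toy data: four well-separated blobs in 2D
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)

## Fit a GMM with each initialization strategy and report EM iterations
for init in ("kmeans", "random", "random_from_data", "k-means++"):
    gmm = GaussianMixture(n_components=4, init_params=init, random_state=0).fit(X_demo)
    print(f"{init}: converged in {gmm.n_iter_} iterations")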

VM Tips

Once the VM startup is complete, click the top left corner to switch to the Notebook tab and access Jupyter Notebook for practice.

You may sometimes need to wait a few seconds for Jupyter Notebook to finish loading. Because of limitations in Jupyter Notebook, the validation of operations cannot be automated.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Import libraries and generate sample data

We will start by importing the necessary libraries and generating some sample data using the make_blobs function from scikit-learn.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from timeit import default_timer as timer

## Generate some data: four blobs in 2D
X, y_true = make_blobs(n_samples=4000, centers=4, cluster_std=0.60, random_state=0)
X = X[:, ::-1]  ## Swap the feature columns for nicer plotting

n_samples = 4000
n_components = 4

Define a function to get initial means

Next, we will define a function get_initial_means that takes the sample data, initialization method, and random state as inputs and returns the initialization means.

def get_initial_means(X, init_params, r):
    ## Run a GaussianMixture with max_iter=0 to output the initialization means
    ## (fitting with zero EM iterations emits a ConvergenceWarning, which is expected here)
    gmm = GaussianMixture(
        n_components=4, init_params=init_params, tol=1e-9, max_iter=0, random_state=r
    ).fit(X)
    return gmm.means_
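
For example, an illustrative call on the X generated earlier retrieves the k-means++ initialization means directly:

## Example usage: get the k-means++ initialization means for our data
rng = np.random.RandomState(seed=1234)
print(get_initial_means(X, "k-means++", rng))  ## shape (4, 2): one mean per component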

Plot the sample data and initialization centers

We will now plot the sample data and the initialization centers for each initialization method, using a different color per cluster. We will also record the time each initialization takes and the number of EM iterations the GMM needs to converge.

methods = ["kmeans", "random_from_data", "k-means++", "random"]
colors = ["navy", "turquoise", "cornflowerblue", "darkorange"]
times_init = {}
relative_times = {}

plt.figure(figsize=(4 * len(methods) // 2, 6))
plt.subplots_adjust(
    bottom=0.1, top=0.9, hspace=0.15, wspace=0.05, left=0.05, right=0.95
)

for n, method in enumerate(methods):
    r = np.random.RandomState(seed=1234)  ## Same seed for each method, for a fair comparison
    plt.subplot(2, len(methods) // 2, n + 1)

    ## Time the initialization step alone
    start = timer()
    ini = get_initial_means(X, method, r)
    end = timer()
    init_time = end - start

    ## Fit the full model, starting EM from the pre-computed initial means
    gmm = GaussianMixture(
        n_components=4, means_init=ini, tol=1e-9, max_iter=2000, random_state=r
    ).fit(X)

    times_init[method] = init_time
    ## Plot each predicted cluster in its own color
    for i, color in enumerate(colors):
        data = X[gmm.predict(X) == i]
        plt.scatter(data[:, 0], data[:, 1], color=color, marker="x")

    ## Mark the initialization centers as orange diamonds
    plt.scatter(
        ini[:, 0], ini[:, 1], s=75, marker="D", c="orange", lw=1.5, edgecolors="black"
    )
    ## Time relative to the default kmeans method (the first entry in `methods`)
    relative_times[method] = times_init[method] / times_init[methods[0]]

    plt.xticks(())
    plt.yticks(())
    plt.title(method, loc="left", fontsize=12)
    ## A second, right-aligned title shows EM iterations and relative init time
    plt.title(
        "Iter %i | Init Time %.2fx" % (gmm.n_iter_, relative_times[method]),
        loc="right",
        fontsize=10,
    )
plt.suptitle("GMM iterations and relative time taken to initialize")
plt.show()

Interpret the results

We can see from the plot that k-means++ achieves both a low initialization time and a low number of GaussianMixture iterations to converge. When initialized with random_from_data or random, the model takes more iterations to converge. All three alternative methods take less time to initialize than kmeans.
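
To read the timing numbers as text rather than from the plot annotations, a short snippet like this prints the relative_times dictionary computed in the previous step:

## Print each method's initialization time relative to the default kmeans
for method, rel in sorted(relative_times.items(), key=lambda kv: kv[1]):
    print(f"{method:>18s}: {rel:.2f}x")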

Summary

In this lab, we learned about different initialization methods for Gaussian Mixture Models (GMM) and how to use them to cluster sample data with the scikit-learn library. We plotted the sample data and the initialization centers for each method and interpreted the results.
