Feature Agglomeration for High-Dimensional Data

This tutorial is from the open-source community.

Introduction

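This tutorial demonstrates how to use feature agglomeration to merge similar features in a dataset. Feature agglomeration applies hierarchical clustering to the features (columns) rather than to the samples, then replaces each cluster of features with a single pooled value. This is useful for high-dimensional datasets because it reduces the number of features while retaining much of the structure in the data.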
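
To make the idea concrete before turning to images, here is a minimal sketch on a tiny array. The data and the names X_toy and toy_agglo are made up for illustration; features 0 and 1 are nearly identical, as are features 2 and 3, so two clusters should pair them up.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Made-up toy data (an assumption for illustration): 3 samples, 4 features.
# Columns 0 and 1 are almost the same, and so are columns 2 and 3.
X_toy = np.array([
    [1.0, 1.1, 5.0, 5.2],
    [2.0, 2.1, 3.0, 3.1],
    [3.0, 2.9, 1.0, 0.9],
])

# Cluster the 4 features into 2 groups and pool each group.
toy_agglo = FeatureAgglomeration(n_clusters=2)
X_toy_reduced = toy_agglo.fit_transform(X_toy)

print(toy_agglo.labels_)    # one cluster label per original feature
print(X_toy_reduced.shape)  # 3 samples, 2 pooled features
```

Each column of the reduced array is the mean of the original columns that share a cluster label.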

Import Libraries

In this step, we will import the necessary libraries to perform feature agglomeration.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

Load Dataset

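In this step, we will load the digits dataset from scikit-learn. This dataset contains 1,797 grayscale images of handwritten digits from 0 to 9, each 8x8 pixels. We flatten every image into a row of 64 values, so each pixel becomes one feature.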

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
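
As a quick sanity check, the sketch below repeats the loading step and prints the array shapes. It also confirms that the flattening is row-major, so the pixel at row r, column c ends up in column r * 8 + c of X.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
images = digits.images                     # one 8x8 array per sample
X = np.reshape(images, (len(images), -1))  # one row of 64 pixels per sample

print(images.shape)
print(X.shape)

# Row-major flattening: pixel (row 2, column 3) lands in column 2 * 8 + 3.
print(X[0, 2 * 8 + 3] == images[0, 2, 3])
```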

Define Connectivity Matrix

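In this step, we will define the connectivity matrix using the grid_to_graph function from scikit-learn. This function builds a sparse graph of the image's pixel grid in which each pixel is connected to its horizontal and vertical neighbors. Passing this graph to the clustering step restricts merges to pixels that are adjacent in the image.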

connectivity = grid_to_graph(*images[0].shape)
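
To see what the connectivity matrix encodes, the following sketch builds it for an 8x8 grid and lists the neighbors of a corner pixel and an interior pixel. The neighbors helper is a hypothetical function written for this illustration; the diagonal is zeroed first so that any self-connections are ignored.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph

connectivity = grid_to_graph(8, 8)   # sparse matrix, one row and column per pixel
adjacency = connectivity.toarray()
np.fill_diagonal(adjacency, 0)       # ignore any self-connections


def neighbors(pixel):
    """Return the flat indices of the pixels connected to `pixel` (hypothetical helper)."""
    return sorted(int(j) for j in np.flatnonzero(adjacency[pixel]))


print(connectivity.shape)
print(neighbors(0))   # corner pixel (row 0, column 0)
print(neighbors(9))   # interior pixel (row 1, column 1)
```

A corner pixel has two neighbors and an interior pixel has four, which matches the horizontal and vertical adjacency of the image grid.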

Perform Feature Agglomeration

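In this step, we will perform feature agglomeration using the FeatureAgglomeration class from scikit-learn. We will set the number of clusters to 32, which reduces the 64 pixel features to 32. By default, transform replaces each cluster with the mean of its features.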

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
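
The sketch below reruns the steps so far and checks the result against a hand-computed version. It assumes the default pooling function (the mean); the name manual is introduced here only for the comparison.

```python
import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)

# Recompute the reduced features by hand: for each cluster label,
# average the pixel columns that carry that label.
manual = np.column_stack(
    [X[:, agglo.labels_ == k].mean(axis=1) for k in np.unique(agglo.labels_)]
)

print(X_reduced.shape)
print(np.allclose(X_reduced, manual))
```

The labels_ attribute holds one cluster label per pixel, so it has 64 entries with 32 distinct values.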

Inverse Transform

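In this step, we will apply the inverse transform to the reduced dataset to map it back to the original number of features. The inverse transform does not recover the original pixel values: every pixel in a cluster receives that cluster's pooled value, so the restored images are piecewise-constant approximations of the originals.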

X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)
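
To illustrate what the inverse transform does, the sketch below compares it with plain indexing by cluster label and measures how far the restored pixels are from the originals. The mean squared error is just one possible measure, chosen here for illustration.

```python
import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_restored = agglo.inverse_transform(X_reduced)

# Each pixel is filled with the pooled value of the cluster it belongs to.
same_as_indexing = np.allclose(X_restored, X_reduced[:, agglo.labels_])

# Information lost by pooling, as a mean squared error over all pixels.
mse = np.mean((X - X_restored) ** 2)

print(X_restored.shape)
print(same_as_indexing)
print(mse)
```

Because each restored image contains at most 32 distinct values, it looks like a blockier version of the original.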

Visualize Results

In this step, we will plot the first four original images, their agglomerated (restored) counterparts, and a map showing the cluster label assigned to each pixel.

plt.figure(1, figsize=(4, 3.5))
plt.clf()
plt.subplots_adjust(left=0.01, right=0.99, bottom=0.01, top=0.91)
for i in range(4):
    # Top row: original images
    plt.subplot(3, 4, i + 1)
    plt.imshow(images[i], cmap=plt.cm.gray, vmax=16, interpolation="nearest")
    plt.xticks(())
    plt.yticks(())
    if i == 1:
        plt.title("Original data")
    # Middle row: images restored from the 32 pooled features
    plt.subplot(3, 4, 4 + i + 1)
    plt.imshow(images_restored[i], cmap=plt.cm.gray, vmax=16, interpolation="nearest")
    if i == 1:
        plt.title("Agglomerated data")
    plt.xticks(())
    plt.yticks(())

# Bottom row: cluster label of each pixel, shown on the 8x8 grid
plt.subplot(3, 4, 10)
plt.imshow(
    np.reshape(agglo.labels_, images[0].shape),
    interpolation="nearest",
    cmap=plt.cm.nipy_spectral,
)
plt.xticks(())
plt.yticks(())
plt.title("Labels")
plt.show()

Summary

In this tutorial, we have learned how to use feature agglomeration to merge similar features in a dataset. Reducing the number of features lowers the cost of downstream processing and can help some machine learning algorithms, while the pooled features retain much of the information in the original data.
