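Hierarchical Clustering with Python | Scikit-Learn Tutorial

Introduction

In this lab, we will use Python's scikit-learn library to perform hierarchical clustering on several toy datasets. Hierarchical clustering builds a hierarchy of clusters either top-down (divisive) or bottom-up (agglomerative); scikit-learn's AgglomerativeClustering, which this lab uses throughout, takes the bottom-up approach. The goal is to find clusters whose points are similar to one another and dissimilar from the points in other clusters.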
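To make the bottom-up idea concrete before the full lab, here is a minimal sketch on six made-up points (the `points` array is an illustrative assumption, not one of the lab's datasets). Every point starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number remains.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six hypothetical 2-D points forming two tight, well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Bottom-up merging: start with one cluster per point and keep merging
# the two closest clusters until only n_clusters are left.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(points)

# children_ records the merge order: row i lists the two nodes merged at step i.
print("labels:", labels)
print("merge order:\n", model.children_)
```

The label values themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.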

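VM Tips

Once the VM has finished starting up, click the top-left corner to switch to the "Notebook" tab and open Jupyter Notebook for the exercises.

You may sometimes need to wait a few seconds for Jupyter Notebook to finish loading. Because of limitations in Jupyter Notebook, verification of your operations cannot be automated.

If you run into problems while learning, feel free to ask Labby. Leave feedback after the session and we will promptly resolve the problem for you.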

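Import Libraries and Load Data

We start by importing the required libraries and generating the toy datasets used in the hierarchical clustering examples.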

import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice

np.random.seed(0)

## %%
## Generate the datasets. We choose a size large enough to see how the
## algorithms scale, but not so large that running times become too long

n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=0.5, noise=0.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

## Anisotropically distributed data
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

## Blobs with varied variances
varied = datasets.make_blobs(
    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state
)
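As a quick sanity check on the anisotropic dataset, the sketch below regenerates the blobs at a smaller size and confirms that the linear map only reshapes the point cloud. The alias `sk_datasets` and the reduced `n = 300` are assumptions made here for brevity; the transformation matrix and `random_state` are the ones used above.

```python
import numpy as np
from sklearn import datasets as sk_datasets

n = 300  # smaller than the lab's 1500, just to keep the check fast
X_blobs, y_blobs = sk_datasets.make_blobs(n_samples=n, random_state=170)

# Same linear map as in the lab: it stretches and shears the round blobs.
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X_blobs @ transformation

print("shapes:", X_blobs.shape, X_aniso.shape)
print("cluster labels:", np.unique(y_blobs))

# A linear map A turns the covariance C into A^T C A, which is why the
# transformed clusters are elongated rather than round.
cov_before = np.cov(X_blobs.T)
cov_after = np.cov(X_aniso.T)
print("covariance before:\n", cov_before.round(2))
print("covariance after:\n", cov_after.round(2))
```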

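Perform Hierarchical Clustering

We will now run hierarchical clustering on the toy datasets generated in the previous step. We build the clusters with four linkage methods: single, average, complete, and Ward.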
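The four linkage methods differ only in how they measure the distance between two clusters. The following is a minimal NumPy sketch of those definitions; the helper names `pairwise_distances` and `linkage_distances` and the two small clusters are hypothetical and written for illustration. It is not scikit-learn's internal implementation, and libraries may report a rescaled version of the Ward quantity.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def linkage_distances(a, b):
    """Distance between clusters a and b under each linkage criterion."""
    d = pairwise_distances(a, b)
    n_a, n_b = len(a), len(b)
    # Ward: increase in within-cluster sum of squares caused by the merge.
    centroid_gap = np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
    return {
        "single": d.min(),     # closest pair of points
        "complete": d.max(),   # farthest pair of points
        "average": d.mean(),   # mean over all pairs
        "ward": n_a * n_b / (n_a + n_b) * centroid_gap,
    }

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])
result = linkage_distances(cluster_a, cluster_b)
for name, value in result.items():
    print(f"{name:>8}: {value:.4f}")
```

At each step, agglomerative clustering merges the pair of clusters with the smallest value of the chosen criterion, which is why the choice of linkage changes the shape of the clusters that emerge.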

## Set up the figure and the clustering parameters
plt.figure(figsize=(9 * 1.3 + 2, 14.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1

default_base = {"n_neighbors": 10, "n_clusters": 3}

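## Note: this rebinds the name `datasets`, shadowing the sklearn `datasets`
## module imported earlier; re-run the import cell before regenerating data.
datasets = [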
    (noisy_circles, {"n_clusters": 2}),
    (noisy_moons, {"n_clusters": 2}),
    (varied, {"n_neighbors": 2}),
    (aniso, {"n_neighbors": 2}),
    (blobs, {}),
    (no_structure, {}),
]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    ## Update the parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    ## Standardize the dataset (zero mean, unit variance) for easier parameter selection
    X = StandardScaler().fit_transform(X)

    ## ============
    ## Create the cluster objects
    ## ============
    ward = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], linkage="ward"
    )
    complete = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], linkage="complete"
    )
    average = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], linkage="average"
    )
    single = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], linkage="single"
    )

    clustering_algorithms = (
        ("Single Linkage", single),
        ("Average Linkage", average),
        ("Complete Linkage", complete),
        ("Ward Linkage", ward),
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        ## Suppress the connectivity-matrix warning related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the "
                + "connectivity matrix is [0-9]{1,2}"
                + " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning,
            )
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, "labels_"):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(
            list(
                islice(
                    cycle(
                        [
                            "#377eb8",
                            "#ff7f00",
                            "#4daf4a",
                            "#f781bf",
                            "#a65628",
                            "#984ea3",
                            "#999999",
                            "#e41a1c",
                            "#dede00",
                        ]
                    ),
                    int(max(y_pred) + 1),
                )
            )
        )
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()

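Analyze the Results

We can now analyze the results of the hierarchical clustering. Based on the toy datasets we used, we can make the following observations:

Single linkage is fast and can perform well on non-globular data, but it performs poorly in the presence of noise.
Average and complete linkage perform well on cleanly separated globular clusters, but give mixed results otherwise.
Ward is the most effective method for noisy data.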
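One way to go beyond eyeballing the plots is to score each linkage against the known generating labels. The sketch below does this with the adjusted Rand index on the two-circles data, once without noise and once with the lab's noise level. The helper `score_linkages`, the reduced sample size, and the choice of `adjusted_rand_score` as the metric are assumptions made here for illustration; the lab itself only compares the plots.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

def score_linkages(X, y_true, n_clusters):
    """Adjusted Rand index of each linkage against the true labels."""
    scores = {}
    for linkage in ("single", "average", "complete", "ward"):
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
        scores[linkage] = adjusted_rand_score(y_true, model.fit_predict(X))
    return scores

# Two concentric circles, first noise-free and then with the lab's noise level.
X_clean, y_clean = make_circles(n_samples=400, factor=0.5, noise=0.0, random_state=0)
X_noisy, y_noisy = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

clean_scores = score_linkages(X_clean, y_clean, n_clusters=2)
noisy_scores = score_linkages(X_noisy, y_noisy, n_clusters=2)

# A score of 1.0 means the clustering matches the true rings exactly;
# values near 0 mean it is no better than a random assignment.
for name in clean_scores:
    print(f"{name:>8}: clean={clean_scores[name]:.3f}  noisy={noisy_scores[name]:.3f}")
```

On the noise-free circles, single linkage recovers both rings exactly, because neighbouring points on the same ring are far closer together than the gap between the rings. Raising the `noise` value is a simple way to see when that advantage breaks down.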

Note that while these observations give some intuition about the algorithms, that intuition may not carry over to very high-dimensional data.

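Summary

In this lab, we learned how to perform hierarchical clustering with Python's scikit-learn library. We built clusters with the single, average, complete, and Ward linkage methods and analyzed the results on several toy datasets. Hierarchical clustering is a powerful technique for identifying clusters of similar data points, and it can be useful in fields such as biology, marketing, and finance.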
