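Exploring Agglomerative Clustering with Python

Introduction

Agglomerative clustering is a hierarchical clustering algorithm that groups similar data points together. It starts with each data point as its own cluster, then iteratively merges the most similar clusters until all points belong to a single cluster or a requested number of clusters remains. In this lab, we explore the effect of imposing a connectivity graph to capture local structure in the data.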
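To make the merge loop concrete, here is a minimal sketch of the idea using single linkage on a toy dataset. The function name `naive_agglomerative` and the `toy_points` array are illustrative assumptions, not part of the lab or of scikit-learn, and this brute-force version is far slower than the library implementation.

```python
import numpy as np


def naive_agglomerative(points, n_clusters):
    """Hypothetical helper: repeatedly merge the two closest clusters
    (single linkage) until only n_clusters remain."""
    # Start with every point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert the list of index groups into a label array.
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Two well-separated groups of points (made-up values for illustration).
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
toy_labels = naive_agglomerative(toy_points, n_clusters=2)
print(toy_labels)
```

The first three points should end up sharing one label and the last two another. In the rest of the lab, `AgglomerativeClustering` plays this role and additionally supports other linkage criteria and a connectivity constraint.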

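Import the Required Libraries

We start by importing the required libraries: numpy, matplotlib, and sklearn, along with the standard-library time module used to time each fit.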

import time
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

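Generate Sample Data

We generate sample data by drawing 1500 points along a spiral (x = t cos t, y = t sin t) and adding Gaussian noise.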

n_samples = 1500
np.random.seed(0)
t = 1.5 * np.pi * (1 + 3 * np.random.rand(1, n_samples))
x = t * np.cos(t)
y = t * np.sin(t)

X = np.concatenate((x, y))
X += 0.7 * np.random.randn(2, n_samples)
X = X.T

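Create a Graph

Create a graph that captures local connectivity; here each point is linked to its 30 nearest neighbors, excluding itself. A larger number of neighbors gives more homogeneous clusters at the cost of computation time. A very large number of neighbors gives more evenly distributed cluster sizes, but may fail to impose the local manifold structure of the data.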

knn_graph = kneighbors_graph(X, 30, include_self=False)
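As a quick sanity check, the sketch below rebuilds the same data and inspects the graph for a few neighbor counts. The values 5 and 100 are arbitrary choices for illustration; only 30 is used in the lab. With the default `mode="connectivity"`, `kneighbors_graph` returns a sparse matrix with one non-zero entry per (point, neighbor) pair, so the number of stored edges grows linearly with the neighbor count.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Rebuild the same spiral data as in the lab so this block runs on its own.
n_samples = 1500
np.random.seed(0)
t = 1.5 * np.pi * (1 + 3 * np.random.rand(1, n_samples))
X = np.concatenate((t * np.cos(t), t * np.sin(t)))
X += 0.7 * np.random.randn(2, n_samples)
X = X.T

# Neighbor counts 5 and 100 are illustrative assumptions, not lab settings.
for n_neighbors in (5, 30, 100):
    graph = kneighbors_graph(X, n_neighbors, include_self=False)
    # Fraction of all possible point pairs that are stored as edges.
    density = graph.nnz / (n_samples * n_samples)
    print(f"k={n_neighbors}: shape={graph.shape}, edges={graph.nnz}, density={density:.4f}")

# The graph used in the rest of the lab.
knn_graph = kneighbors_graph(X, 30, include_self=False)
```

Note that a k-nearest-neighbors graph is generally not symmetric: point A can be among the nearest neighbors of B without the reverse being true.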

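Visualize Clustering without Connectivity

Fit each linkage type without a connectivity constraint, for 30 and for 3 clusters, and plot the data points colored by cluster label. Each subplot title reports the fit time.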

for n_clusters in (30, 3):
    plt.figure(figsize=(10, 4))
    for index, linkage in enumerate(("average", "complete", "ward", "single")):
        plt.subplot(1, 4, index + 1)
        model = AgglomerativeClustering(
            linkage=linkage, connectivity=None, n_clusters=n_clusters
        )
        t0 = time.time()
        model.fit(X)
        elapsed_time = time.time() - t0
        plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.nipy_spectral)
        plt.title(
            "linkage=%s\n(time %.2fs)" % (linkage, elapsed_time),
            fontdict=dict(verticalalignment="top"),
        )
        plt.axis("equal")
        plt.axis("off")

    # Layout and figure title only need to be set once per figure.
    plt.subplots_adjust(bottom=0, top=0.83, wspace=0, left=0, right=1)
    plt.suptitle(
        "n_cluster=%i, connectivity=%r"
        % (n_clusters, False),
        size=17,
    )

plt.show()

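Visualize Clustering with Connectivity

Repeat the same fits, this time passing the k-nearest-neighbors graph as the connectivity constraint, and plot the data points colored by cluster label.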

for n_clusters in (30, 3):
    plt.figure(figsize=(10, 4))
    for index, linkage in enumerate(("average", "complete", "ward", "single")):
        plt.subplot(1, 4, index + 1)
        model = AgglomerativeClustering(
            linkage=linkage, connectivity=knn_graph, n_clusters=n_clusters
        )
        t0 = time.time()
        model.fit(X)
        elapsed_time = time.time() - t0
        plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.nipy_spectral)
        plt.title(
            "linkage=%s\n(time %.2fs)" % (linkage, elapsed_time),
            fontdict=dict(verticalalignment="top"),
        )
        plt.axis("equal")
        plt.axis("off")

    # Layout and figure title only need to be set once per figure.
    plt.subplots_adjust(bottom=0, top=0.83, wspace=0, left=0, right=1)
    plt.suptitle(
        "n_cluster=%i, connectivity=%r"
        % (n_clusters, True),
        size=17,
    )

plt.show()
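The plots show the difference visually; the sketch below offers one way to compare the two settings numerically by counting how many points land in each cluster. The choice of average linkage with 3 clusters and the dictionary keys `"none"` and `"knn"` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Rebuild the same spiral data and graph so this block runs on its own.
n_samples = 1500
np.random.seed(0)
t = 1.5 * np.pi * (1 + 3 * np.random.rand(1, n_samples))
X = np.concatenate((t * np.cos(t), t * np.sin(t)))
X += 0.7 * np.random.randn(2, n_samples)
X = X.T
knn_graph = kneighbors_graph(X, 30, include_self=False)

cluster_sizes = {}
for name, connectivity in (("none", None), ("knn", knn_graph)):
    model = AgglomerativeClustering(
        linkage="average", connectivity=connectivity, n_clusters=3
    )
    labels = model.fit_predict(X)
    # Number of points per cluster, largest first.
    cluster_sizes[name] = np.sort(np.bincount(labels, minlength=3))[::-1]
    print(f"connectivity={name}: cluster sizes = {cluster_sizes[name]}")
```

Changing `linkage` to `"complete"`, `"ward"`, or `"single"` lets you check whether the effect of the connectivity constraint on cluster sizes depends on the linkage criterion.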

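Summary

In this lab, we explored the effect of imposing a connectivity graph on agglomerative clustering in order to capture local structure in the data. We visualized the clusters with and without connectivity and saw that a connectivity graph can yield more stable and more meaningful clusters. As noted when building the graph, a larger number of neighbors tends to give more homogeneous clusters at the cost of computation time.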

Plotting Agglomerative Clustering

Introduction