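Machine Learning | Clustering Algorithms | K-Means vs MiniBatchKMeans

Introduction

This lab compares two clustering algorithms, K-Means and MiniBatchKMeans. K-Means is one of the most widely used clustering algorithms in machine learning. MiniBatchKMeans is a variant of K-Means that updates the cluster centers from small random batches of the data, which makes it faster at the cost of slightly different results. We cluster a dataset with both algorithms and plot the results, then plot the points that the two algorithms label differently.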
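To make the idea of a mini-batch update concrete, the following is a minimal NumPy sketch, not scikit-learn's implementation. Each step draws a small random batch, assigns its points to the nearest center, and nudges each center toward its assigned points with a step size of 1/count. The function name `mini_batch_kmeans_sketch`, its parameters, the random initialization, and the demo data are all assumptions made for illustration.

```python
import numpy as np


def mini_batch_kmeans_sketch(X, n_clusters, batch_size=45, n_steps=200, seed=0):
    """Illustrative mini-batch k-means; scikit-learn's version differs in detail."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centers from randomly chosen data points.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)

    for _ in range(n_steps):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest center.
        sq_dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dists.argmin(axis=1)
        # Move each center toward its points; the step shrinks as the
        # center accumulates more assigned points.
        for x, k in zip(batch, nearest):
            counts[k] += 1
            centers[k] += (x - centers[k]) / counts[k]
    return centers


# Hypothetical demo data: three well-separated blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(loc, 0.3, size=(200, 2)) for loc in ([2, 2], [-2, -2], [2, -2])]
)
centers_demo = mini_batch_kmeans_sketch(X_demo, n_clusters=3)
print(centers_demo)
```

Because only `batch_size` points are touched per step, the cost of a step does not grow with the size of the dataset, which is where the speed advantage comes from. With a purely random initialization like this one, the sketch can settle on a poor solution; that is one reason the lab code uses `init="k-means++"`.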

Generating the Data

We start by generating the blobs of data to cluster.

import numpy as np
from sklearn.datasets import make_blobs

np.random.seed(0)

batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
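As a quick sanity check, it can help to inspect what `make_blobs` returned before clustering. The sketch below repeats the generation step so that it runs on its own; since `random_state` is not passed, `make_blobs` draws from NumPy's global random state, which is why the `np.random.seed(0)` call makes the data reproducible.

```python
import numpy as np
from sklearn.datasets import make_blobs

np.random.seed(0)
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

print(X.shape)                   # (3000, 2): 3000 points with 2 features each
print(np.bincount(labels_true))  # number of points drawn around each center
```

The 3000 samples are split evenly across the three centers, so each blob contributes 1000 points.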

Clustering with KMeans

We cluster the data with KMeans and record how long fitting takes.

import time
from sklearn.cluster import KMeans

k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0
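The plots later in this lab report `inertia_`, which is the sum of squared distances from each sample to its closest cluster center. The following sketch recomputes that value by hand to show what the number means. It regenerates the data with a fixed `random_state` (an assumption added here so the block is self-contained) rather than reusing the variables above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(
    n_samples=3000,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7,
    random_state=0,
)
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every point to every center: shape (n_samples, n_clusters).
sq_dists = ((X[:, None, :] - k_means.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
# Keep only the distance to the closest center, then sum over all points.
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, k_means.inertia_)
```

The two printed values should agree up to floating-point rounding. Lower inertia means tighter clusters, which makes it a convenient number for comparing the two algorithms on the same data.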

Clustering with MiniBatchKMeans

We cluster the same data with MiniBatchKMeans, using the batch size defined earlier, and again record the fitting time.

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(
    init="k-means++",
    n_clusters=3,
    batch_size=batch_size,
    n_init=10,
    max_no_improvement=10,
    verbose=0,
)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
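One way to put a number on how far the two results diverge is the relative gap in inertia between the two fitted models. This is a sketch under the same settings as the lab code, with a fixed `random_state` added as an assumption so that it runs on its own; the variable `rel_gap` is introduced here for illustration.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(
    n_samples=3000,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7,
    random_state=0,
)
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(
    init="k-means++",
    n_clusters=3,
    batch_size=45,
    n_init=10,
    max_no_improvement=10,
    random_state=0,
).fit(X)

# Relative difference in inertia, measured against the full KMeans fit.
rel_gap = (mbk.inertia_ - k_means.inertia_) / k_means.inertia_
print(f"KMeans inertia:          {k_means.inertia_:.2f}")
print(f"MiniBatchKMeans inertia: {mbk.inertia_:.2f}")
print(f"relative gap:            {rel_gap:.2%}")
```

On well-separated blobs like these, the gap is expected to be small. Timings on a dataset of only 3000 points are noisy, so inertia is usually the steadier quantity to compare.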

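Establishing Parity Between Clusters

We want the same color for the same cluster in both the MiniBatchKMeans and KMeans plots, so we pair the clusters by matching each KMeans center to its closest MiniBatchKMeans center.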

from sklearn.metrics.pairwise import pairwise_distances_argmin

k_means_cluster_centers = k_means.cluster_centers_
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
mbk_means_cluster_centers = mbk.cluster_centers_[order]

k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)
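The pairing step is easier to follow on a tiny example. `pairwise_distances_argmin(A, B)` returns, for each row of `A`, the index of the closest row of `B`. The two center arrays below are hypothetical values chosen for illustration: the same three locations, slightly perturbed and listed in a different order.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Hypothetical centers from two models, in different orders.
centers_a = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
centers_b = np.array([[-1.1, -0.9], [0.9, -1.0], [1.0, 1.1]])

# order[i] is the row of centers_b that lies closest to centers_a[i].
order = pairwise_distances_argmin(centers_a, centers_b)
print(order)             # [2 0 1]
print(centers_b[order])  # centers_b rearranged to line up with centers_a

# Guard: the pairing is only meaningful if it is one-to-one.
is_one_to_one = len(set(order.tolist())) == len(order)
print(is_one_to_one)     # True
```

Nearest-center matching does not by itself guarantee a one-to-one pairing. If two centers of one model were closest to the same center of the other, `order` would contain a repeated index, so a check like the one above is a cheap safeguard.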

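Plotting the Results

We plot the KMeans clusters, the MiniBatchKMeans clusters, and the points on which the two disagree, side by side.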

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ["#4EACC5", "#FF9C34", "#4E9A06"]

## KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], "w", markerfacecolor=col, marker=".")
    ax.plot(
        cluster_center[0],
        cluster_center[1],
        "o",
        markerfacecolor=col,
        markeredgecolor="k",
        markersize=6,
    )
ax.set_title("KMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, "train time: %.2fs\ninertia: %f" % (t_batch, k_means.inertia_))

## MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == k
    cluster_center = mbk_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], "w", markerfacecolor=col, marker=".")
    ax.plot(
        cluster_center[0],
        cluster_center[1],
        "o",
        markerfacecolor=col,
        markeredgecolor="k",
        markersize=6,
    )
ax.set_title("MiniBatchKMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, "train time: %.2fs\ninertia: %f" % (t_mini_batch, mbk.inertia_))

## Mark the points that the two models assign to different clusters
different = k_means_labels != mbk_means_labels
ax = fig.add_subplot(1, 3, 3)

identical = np.logical_not(different)
ax.plot(X[identical, 0], X[identical, 1], "w", markerfacecolor="#bbbbbb", marker=".")
ax.plot(X[different, 0], X[different, 1], "w", markerfacecolor="m", marker=".")
ax.set_title("Difference")
ax.set_xticks(())
ax.set_yticks(())

plt.show()
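The third panel shows where the two labelings disagree; counting those points gives a single number to go with the picture. The sketch below repeats the fitting and pairing steps with a fixed `random_state` (an assumption added so the block is self-contained); `n_different` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances_argmin

X, _ = make_blobs(
    n_samples=3000,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7,
    random_state=0,
)
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=45, n_init=10, random_state=0).fit(X)

# Reorder the MiniBatchKMeans centers so index k means the same cluster in both.
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
k_means_labels = pairwise_distances_argmin(X, k_means.cluster_centers_)
mbk_means_labels = pairwise_distances_argmin(X, mbk.cluster_centers_[order])

n_different = int(np.sum(k_means_labels != mbk_means_labels))
print(f"{n_different} of {len(X)} points ({n_different / len(X):.1%}) differ")
```

The points that differ tend to lie near the boundaries between clusters, where a small shift in a center is enough to change the nearest-center assignment.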

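Summary

In this lab, we compared two clustering algorithms, K-Means and MiniBatchKMeans. We clustered a dataset with both, visualized the results, and plotted the points that the two algorithms label differently. Comparing them this way shows where the two algorithms diverge and helps in choosing the one that best fits a given need.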
