Gaussian Mixture Model Covariances Tutorial

Introduction

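This tutorial demonstrates how to use the different covariance types of a Gaussian mixture model (GMM). GMMs are often used for clustering, and the resulting clusters can be compared with the true classes of the dataset. To make this comparison valid, we initialize the means of the Gaussians with the class means computed from the training set. Using the iris dataset, we plot the labels predicted by GMMs with several covariance types on both the training data and the held-out test data. We compare GMMs with spherical, diagonal, full, and tied covariance matrices in increasing order of performance.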

Although full covariance would generally be expected to perform best, it is prone to overfitting on small datasets and tends to generalize poorly to held-out test data.
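
One way to see why full covariance overfits more easily is to count the free covariance parameters each type has to estimate. The sketch below is illustrative only: the helper name `covariance_parameter_count` is hypothetical and not part of scikit-learn, and it counts covariance entries only (means and mixture weights are the same for every type).

```python
def covariance_parameter_count(cov_type, n_components, n_features):
    """Number of free covariance parameters for a given covariance type."""
    k, d = n_components, n_features
    per_full_matrix = d * (d + 1) // 2  # entries of one symmetric d x d matrix
    counts = {
        "spherical": k,               # one variance per component
        "diag": k * d,                # one variance per feature per component
        "tied": per_full_matrix,      # a single matrix shared by all components
        "full": k * per_full_matrix,  # one matrix per component
    }
    return counts[cov_type]


# Iris setting used in this tutorial: 3 components, 4 features.
for cov_type in ["spherical", "diag", "tied", "full"]:
    print(cov_type, covariance_parameter_count(cov_type, 3, 4))
```

With 3 components and 4 features, the full type estimates 30 covariance parameters against 3 for the spherical type, all from a little over a hundred training samples.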

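In the plots, training data are shown as dots and test data as crosses. The iris dataset is four-dimensional. Only the first two dimensions are shown here, so some points that appear to overlap are in fact separated along the other dimensions.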

Import Libraries

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

Load the Iris Dataset

iris = datasets.load_iris()

Prepare the Training and Test Data

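# Split the dataset into non-overlapping training (75%) and test (25%) sets.
skf = StratifiedKFold(n_splits=4)
# Use only the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))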

X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]
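
As a quick sanity check on the split, the following sketch repeats the same steps in isolation and counts the samples per class on each side. It assumes the default, unshuffled `StratifiedKFold` behavior; the variable names `train_counts` and `test_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
skf = StratifiedKFold(n_splits=4)
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))

# The two index sets should not overlap and should cover every sample.
no_overlap = set(train_index).isdisjoint(test_index)
total = len(train_index) + len(test_index)

# Stratification keeps the class proportions roughly equal in both sets.
train_counts = np.bincount(iris.target[train_index])
test_counts = np.bincount(iris.target[test_index])

print("disjoint:", no_overlap, "total samples:", total)
print("train per class:", train_counts)
print("test per class:", test_counts)
```

Because each iris class has 50 samples, a four-way stratified split should leave roughly a quarter of each class in the test fold.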

Set Up GMM Estimators for the Different Covariance Types

colors = ["navy", "turquoise", "darkorange"]
n_classes = len(np.unique(y_train))

estimators = {
    cov_type: GaussianMixture(
        n_components=n_classes, covariance_type=cov_type, max_iter=20, random_state=0
    )
    for cov_type in ["spherical", "diag", "tied", "full"]
}

n_estimators = len(estimators)
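
Once fitted, each covariance type stores its `covariances_` attribute in a different shape, which is why any code that reads the covariances has to branch on `covariance_type`. The sketch below fits one model per type on the full iris data purely to inspect those shapes; the `shapes` dictionary is introduced here for illustration.

```python
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data  # 150 samples, 4 features

shapes = {}
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    gmm.fit(X)
    # spherical: (n_components,)            diag: (n_components, n_features)
    # tied: (n_features, n_features)        full: (n_components, n_features, n_features)
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```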

Define a Function to Draw Ellipses for a GMM

def make_ellipses(gmm, ax):
    for n, color in enumerate(colors):
        if gmm.covariance_type == "full":
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == "tied":
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == "diag":
            covariances = np.diag(gmm.covariances_[n][:2])
        elif gmm.covariance_type == "spherical":
            covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
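        v, w = np.linalg.eigh(covariances)
        # eigh returns eigenvectors as columns, so take column 0, not row 0.
        u = w[:, 0] / np.linalg.norm(w[:, 0])
        angle = np.arctan2(u[1], u[0])
        angle = 180 * angle / np.pi  # convert radians to degrees
        # Scale the eigenvalues (variances) to axis lengths.
        v = 2.0 * np.sqrt(2.0) * np.sqrt(v)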
        ell = mpl.patches.Ellipse(
            gmm.means_[n, :2], v[0], v[1], angle=180 + angle, color=color
        )
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)
        ax.set_aspect("equal", "datalim")
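
The geometry behind `make_ellipses` can be checked in isolation, without any plotting. The sketch below is a simplified stand-in, not the function above: the helper `ellipse_params` and the example covariance matrix are hypothetical, and the eigenvector is read from a column of the matrix returned by `np.linalg.eigh`, as the NumPy documentation describes.

```python
import numpy as np


def ellipse_params(cov_2d):
    """Return (width, height, angle_deg) for a 2x2 covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov_2d)  # eigenvalues in ascending order
    first_axis = eigvecs[:, 0]  # eigenvector for eigvals[0] (a column)
    # Direction of the first axis, folded into [0, 180) since an ellipse
    # looks the same after a half turn.
    angle_deg = np.degrees(np.arctan2(first_axis[1], first_axis[0])) % 180.0
    # Same scaling as in make_ellipses: 2 * sqrt(2) * standard deviation.
    width, height = 2.0 * np.sqrt(2.0) * np.sqrt(eigvals)
    return width, height, angle_deg


# Hypothetical covariance: variance 0.5 along x, 2.0 along y, no correlation.
width, height, angle_deg = ellipse_params(np.array([[0.5, 0.0], [0.0, 2.0]]))
print(width, height, angle_deg)
```

For an uncorrelated covariance like this one, the ellipse axes line up with the coordinate axes, and the axis lengths grow with the square root of the variances.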

Plot the GMMs for the Different Covariance Types

plt.figure(figsize=(3 * n_estimators // 2, 6))
plt.subplots_adjust(
    bottom=0.01, top=0.95, hspace=0.15, wspace=0.05, left=0.01, right=0.99
)

for index, (name, estimator) in enumerate(estimators.items()):
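    # Class labels are available for the training data, so initialize the
    # component means in a supervised manner.
    estimator.means_init = np.array(
        [X_train[y_train == i].mean(axis=0) for i in range(n_classes)]
    )

    # Fit the remaining parameters with the EM algorithm.
    estimator.fit(X_train)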

    h = plt.subplot(2, n_estimators // 2, index + 1)
    make_ellipses(estimator, h)

    # Plot every sample as a small dot, colored by its true class.
    for n, color in enumerate(colors):
        data = iris.data[iris.target == n]
        plt.scatter(
            data[:, 0], data[:, 1], s=0.8, color=color, label=iris.target_names[n]
        )

    # Overdraw the test samples with crosses.
    for n, color in enumerate(colors):
        data = X_test[y_test == n]
        plt.scatter(data[:, 0], data[:, 1], marker="x", color=color)

    y_train_pred = estimator.predict(X_train)
    train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
    plt.text(0.05, 0.9, "Train accuracy: %.1f" % train_accuracy, transform=h.transAxes)

    y_test_pred = estimator.predict(X_test)
    test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
    plt.text(0.05, 0.8, "Test accuracy: %.1f" % test_accuracy, transform=h.transAxes)

    plt.xticks(())
    plt.yticks(())
    plt.title(name)

plt.legend(scatterpoints=1, loc="lower right", prop=dict(size=12))
plt.show()
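
The accuracy figures in the plots compare the predicted component index directly with the class label. That only makes sense because component i is initialized at the mean of class i. The sketch below isolates that idea without plotting. It is a simplification: the helper `gmm_accuracy` is hypothetical, it passes the means through the `means_init` constructor argument, and it fits and scores on the full dataset rather than on the train/test split. EM does not strictly guarantee that the component order stays aligned with the classes, so treat the numbers as indicative.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture


def gmm_accuracy(X, y, cov_type):
    """Fit a GMM whose means start at the class means; return accuracy in %."""
    n_classes = len(np.unique(y))
    class_means = np.array([X[y == i].mean(axis=0) for i in range(n_classes)])
    gmm = GaussianMixture(
        n_components=n_classes,
        covariance_type=cov_type,
        means_init=class_means,  # component i starts at the mean of class i
        max_iter=20,
        random_state=0,
    )
    gmm.fit(X)
    # Component index i is read as class i because of the initialization above.
    return np.mean(gmm.predict(X) == y) * 100


iris = datasets.load_iris()
for cov_type in ["spherical", "diag", "tied", "full"]:
    print(cov_type, "%.1f" % gmm_accuracy(iris.data, iris.target, cov_type))
```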

Summary

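This tutorial demonstrated how to use different covariance types for Gaussian mixture models (GMMs) in Python. Using the iris dataset as an example, we compared GMMs with spherical, diagonal, full, and tied covariance matrices in increasing order of performance. We plotted the predicted labels on both the training data and the held-out test data, and showed that full covariance is prone to overfitting on small datasets and does not generalize well to held-out test data.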