Manifold Learning on Spherical Data

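Introduction

Manifold learning is a subfield of machine learning concerned with learning the underlying structure of high-dimensional data. In this lab, we apply several manifold learning techniques to a sphere dataset and use dimensionality reduction to build intuition for how the methods behave.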

Import Libraries

We start by importing the required libraries: scikit-learn, matplotlib, and numpy.

from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
from sklearn import manifold
from sklearn.utils import check_random_state

# Unused directly, but registers the "3d" projection on older matplotlib versions.
import mpl_toolkits.mplot3d  # noqa: F401

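Create the Sphere Dataset

Next, we create the sphere dataset. We sample points on a sphere, cut off the poles, and remove a thin slice down its side. These cuts let the manifold learning techniques 'spread open' the sphere as they project it onto two dimensions.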

n_neighbors = 10
n_samples = 1000

random_state = check_random_state(0)
p = random_state.rand(n_samples) * (2 * np.pi - 0.55)
t = random_state.rand(n_samples) * np.pi

indices = (t < (np.pi - (np.pi / 8))) & (t > ((np.pi / 8)))
colors = p[indices]
x, y, z = (
    np.sin(t[indices]) * np.cos(p[indices]),
    np.sin(t[indices]) * np.sin(p[indices]),
    np.cos(t[indices]),
)

fig = plt.figure(figsize=(15, 8))
plt.suptitle(
    "Manifold Learning with %i points, %i neighbors" % (n_samples, n_neighbors), fontsize=14
)

ax = fig.add_subplot(251, projection="3d")
ax.scatter(x, y, z, c=p[indices], cmap=plt.cm.rainbow)
ax.view_init(40, -10)

sphere_data = np.array([x, y, z]).T
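
As a quick sanity check on the construction above, the following sketch rebuilds the sampling step with NumPy alone and confirms that the surviving points lie on the unit sphere and stay away from the poles. The names `rng`, `keep`, `points`, and `radii` are introduced here for illustration and are not part of the lab code; `np.random.RandomState(0)` is used because `check_random_state(0)` returns an equivalent generator.

```python
import numpy as np

# Same sampling as the lab code, using a plain RandomState so only NumPy is needed.
rng = np.random.RandomState(0)
n_samples = 1000
p = rng.rand(n_samples) * (2 * np.pi - 0.55)  # azimuth; the missing 0.55 rad is the side slice
t = rng.rand(n_samples) * np.pi               # polar angle

# Keep only points at least pi/8 away from either pole.
keep = (t > np.pi / 8) & (t < np.pi - np.pi / 8)
points = np.column_stack(
    [
        np.sin(t[keep]) * np.cos(p[keep]),
        np.sin(t[keep]) * np.sin(p[keep]),
        np.cos(t[keep]),
    ]
)

radii = np.linalg.norm(points, axis=1)
print("points kept:", points.shape[0], "of", n_samples)
print("max |radius - 1|:", np.abs(radii - 1).max())
print("max |z|:", np.abs(points[:, 2]).max(), "bound cos(pi/8):", np.cos(np.pi / 8))
```

Because the polar angle is drawn uniformly and one quarter of its range is discarded, roughly three quarters of the samples should remain, so the embeddings that follow operate on fewer than `n_samples` points.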

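Perform Locally Linear Embedding (LLE) Manifold Learning

We now perform Locally Linear Embedding (LLE). LLE reconstructs each point as a weighted combination of its nearest neighbors and then looks for a low-dimensional embedding that preserves those local reconstruction weights, which lets it unfold complicated manifolds from relatively few samples. We run four LLE variants (standard, LTSA, Hessian, and modified) and compare the results.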

methods = ["standard", "ltsa", "hessian", "modified"]
labels = ["LLE", "LTSA", "Hessian LLE", "Modified LLE"]

for i, method in enumerate(methods):
    t0 = time()
    trans_data = (
        manifold.LocallyLinearEmbedding(
            n_neighbors=n_neighbors, n_components=2, method=method
        )
        .fit_transform(sphere_data)
        .T
    )
    t1 = time()
    print("%s: %.2g sec" % (methods[i], t1 - t0))

    ax = fig.add_subplot(252 + i)
    plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
    plt.title("%s (%.2g sec)" % (labels[i], t1 - t0))
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
    plt.axis("tight")
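
To make the local reconstruction idea behind LLE more concrete, here is a minimal NumPy sketch of the weight computation for a single point. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper `lle_weights`, the regularization value `reg=1e-3`, and the random demo points are all hypothetical choices made for this example.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights (summing to 1) that best reconstruct x from its neighbors."""
    diffs = neighbors - x                    # (k, d): neighbors centered on x
    gram = diffs @ diffs.T                   # (k, k) local Gram matrix
    # With k > d the Gram matrix is singular, so add a small ridge term.
    gram = gram + reg * np.trace(gram) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()

# Hypothetical demo data: random points on the unit sphere.
rng = np.random.RandomState(0)
pts = rng.normal(size=(500, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

k = 10
x = pts[0]
dists = np.linalg.norm(pts - x, axis=1)
nbr_idx = np.argsort(dists)[1 : k + 1]       # skip the closest entry, the point itself
weights = lle_weights(x, pts[nbr_idx])

recon_error = np.linalg.norm(weights @ pts[nbr_idx] - x)
print("weights sum to:", weights.sum())
print("reconstruction error:", recon_error)
print("distance to farthest neighbor:", dists[nbr_idx].max())
```

The reconstruction error should come out small compared with the neighborhood radius. The second stage of LLE, not shown here, searches for 2D coordinates that the same weights reconstruct equally well.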

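Perform Isomap Manifold Learning

Next, we perform Isomap. Isomap is a nonlinear dimensionality reduction technique that seeks a low-dimensional embedding that preserves the geodesic distances between all pairs of points.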

t0 = time()
trans_data = (
    manifold.Isomap(n_neighbors=n_neighbors, n_components=2)
    .fit_transform(sphere_data)
    .T
)
t1 = time()
print("%s: %.2g sec" % ("ISO", t1 - t0))

ax = fig.add_subplot(257)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("%s (%.2g sec)" % ("Isomap", t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis("tight")
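
The difference between geodesic and straight-line distance is easy to see on a sphere. The sketch below compares the two for points on the equator of a full unit sphere. The helper names are hypothetical, and the great-circle formula is a simplification: in the lab dataset the removed slice and poles block some paths, and Isomap approximates geodesics by shortest paths through a nearest-neighbor graph rather than using a closed-form formula.

```python
import numpy as np

def unit_vector(theta, phi):
    """Point on the unit sphere at polar angle theta and azimuth phi."""
    return np.array(
        [np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)]
    )

def chord_distance(a, b):
    """Straight-line (Euclidean) distance through the sphere's interior."""
    return float(np.linalg.norm(a - b))

def geodesic_distance(a, b):
    """Great-circle distance along the surface of the unit sphere."""
    return float(np.arccos(np.clip(a @ b, -1.0, 1.0)))

base = unit_vector(np.pi / 2, 0.0)
for separation in (0.1, 1.0, 3.0):           # azimuthal separation in radians
    other = unit_vector(np.pi / 2, separation)
    print(
        "separation %.1f rad: chord %.4f, geodesic %.4f"
        % (separation, chord_distance(base, other), geodesic_distance(base, other))
    )
```

For nearby points the two distances nearly coincide, which is why a neighbor graph built from Euclidean distances is a reasonable starting point. For distant points the chord badly underestimates the path along the surface.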

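Perform Multi-dimensional Scaling (MDS)

We now perform Multi-dimensional Scaling (MDS). MDS seeks a low-dimensional representation of the data in which the distances between points reflect the distances in the original high-dimensional space.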

t0 = time()
mds = manifold.MDS(2, max_iter=100, n_init=1, normalized_stress="auto")
trans_data = mds.fit_transform(sphere_data).T
t1 = time()
print("MDS: %.2g sec" % (t1 - t0))

ax = fig.add_subplot(258)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("MDS (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis("tight")
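
One way to quantify how well distances are reflected is a raw stress value: the sum over point pairs of the squared difference between the original and embedded distances. The sketch below computes it with NumPy for two naive 2D embeddings. The helpers `pairwise_distances` and `raw_stress` and the demo data are hypothetical and written for this example; they only illustrate the kind of objective MDS works to reduce.

```python
import numpy as np

def pairwise_distances(points):
    """Full matrix of Euclidean distances between rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(high_dim, low_dim):
    """Sum over point pairs of squared gaps between the two distance matrices."""
    gap = pairwise_distances(high_dim) - pairwise_distances(low_dim)
    return float((gap ** 2).sum() / 2)       # each pair appears twice in the full matrix

# Hypothetical demo data: random points on the unit sphere.
rng = np.random.RandomState(0)
pts = rng.normal(size=(200, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

flattened = pts[:, :2]                       # naive embedding: drop the z coordinate
squashed = 0.1 * pts[:, :2]                  # same layout, shrunk by a factor of 10

print("stress of identical copy:", raw_stress(pts, pts))
print("stress of dropping z:", raw_stress(pts, flattened))
print("stress of shrunken layout:", raw_stress(pts, squashed))
```

Shrinking the layout increases the stress even though the picture looks the same, which shows that this objective cares about absolute distances and not only about the shape of the point cloud.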

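Perform Spectral Embedding

Next, we perform Spectral Embedding. Spectral Embedding builds a nearest-neighbor graph over the data and uses a spectral decomposition of the graph Laplacian (scikit-learn implements Laplacian Eigenmaps) to find a low-dimensional representation in which points that are close on the manifold stay close together.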

t0 = time()
se = manifold.SpectralEmbedding(n_components=2, n_neighbors=n_neighbors)
trans_data = se.fit_transform(sphere_data).T
t1 = time()
print("Spectral Embedding: %.2g sec" % (t1 - t0))

ax = fig.add_subplot(259)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("Spectral Embedding (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis("tight")
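
The graph Laplacian idea can be sketched in a few lines of NumPy. This is a simplified illustration, not what `SpectralEmbedding` does internally: it assumes a binary k-nearest-neighbor graph, an unnormalized Laplacian, and a dense eigendecomposition, and the function name `laplacian_eigenmap` is hypothetical.

```python
import numpy as np

def laplacian_eigenmap(points, n_neighbors=10, n_components=2):
    """Embed points with eigenvectors of an unnormalized k-NN graph Laplacian."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))

    # Binary k-nearest-neighbor adjacency, symmetrized so the graph is undirected.
    nearest = np.argsort(dists, axis=1)[:, 1 : n_neighbors + 1]
    adjacency = np.zeros((n, n))
    adjacency[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)

    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    # Skip the first eigenvector (constant, eigenvalue near 0) and keep the next ones.
    return eigvals, eigvecs[:, 1 : n_components + 1]

# Hypothetical demo data: random points on the unit sphere.
rng = np.random.RandomState(0)
pts = rng.normal(size=(200, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

eigvals, embedding = laplacian_eigenmap(pts)
print("smallest eigenvalue:", eigvals[0])
print("embedding shape:", embedding.shape)
```

The sketch assumes the neighbor graph is connected. If it were split into several components, more than one eigenvalue would be close to zero and the skipped-first-eigenvector rule would need adjusting.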

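Perform t-distributed Stochastic Neighbor Embedding (t-SNE)

Finally, we perform t-distributed Stochastic Neighbor Embedding (t-SNE). t-SNE converts similarities between points into probabilities and looks for a low-dimensional representation that keeps neighboring points close, so it preserves local structure rather than global distances.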

t0 = time()
tsne = manifold.TSNE(n_components=2, random_state=0)
trans_data = tsne.fit_transform(sphere_data).T
t1 = time()
print("t-SNE: %.2g sec" % (t1 - t0))

ax = fig.add_subplot(2, 5, 10)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("t-SNE (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis("tight")

plt.show()
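
The cost that t-SNE minimizes is a Kullback-Leibler divergence between similarities in the original space and similarities in the embedding. The NumPy sketch below evaluates that cost for two fixed 2D layouts and performs no optimization. It rests on one loud simplification: a single fixed Gaussian bandwidth `sigma=0.3` for every point, whereas t-SNE tunes a per-point bandwidth to match the requested perplexity. All helper names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(a):
    """Matrix of squared Euclidean distances between rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return (diff ** 2).sum(axis=-1)

def gaussian_joint(x, sigma=0.3):
    """High-dimensional similarities P from a Gaussian kernel (fixed sigma: an assumption)."""
    p = np.exp(-pairwise_sq_dists(x) / (2 * sigma ** 2))
    np.fill_diagonal(p, 0.0)
    p /= p.sum(axis=1, keepdims=True)        # conditional p(j | i)
    return (p + p.T) / (2 * len(x))          # symmetric joint p_ij, sums to 1

def student_t_joint(y):
    """Low-dimensional similarities Q from a heavy-tailed Student-t kernel."""
    q = 1.0 / (1.0 + pairwise_sq_dists(y))
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def kl_divergence(p, q):
    """KL(P || Q) summed over the entries where P is nonzero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical demo data: random points on the unit sphere.
rng = np.random.RandomState(0)
pts = rng.normal(size=(200, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

P = gaussian_joint(pts)
kl_projection = kl_divergence(P, student_t_joint(pts[:, :2]))   # layout: drop z
kl_random = kl_divergence(P, student_t_joint(rng.normal(size=(200, 2))))
print("KL for the drop-z layout:", kl_projection)
print("KL for a random layout:", kl_random)
```

Because P concentrates its mass on nearby pairs, the cost is dominated by how well each point's close neighbors are kept together, which is the sense in which t-SNE favors local structure.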

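Summary

In this lab, we applied several manifold learning techniques to a sphere dataset. We used Locally Linear Embedding (LLE), Isomap, Multi-dimensional Scaling (MDS), Spectral Embedding, and t-distributed Stochastic Neighbor Embedding (t-SNE) to build an intuitive understanding of how manifold learning methods behave. These techniques are useful for analyzing and visualizing high-dimensional data.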
