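Feature Agglomeration | Machine Learning Tutorial

Introduction

This tutorial shows how to use feature agglomeration, a technique that merges similar features in a dataset. Feature agglomeration is useful with high-dimensional datasets because it reduces the number of features while retaining most of the important information.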
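
To make the idea of "merging similar features" concrete before turning to images, here is a minimal sketch on synthetic data. The toy matrix `X_toy` and the name `agglo_toy` are illustrative assumptions and are not part of the tutorial's dataset. Columns 0 and 1 are near-duplicates of each other, as are columns 2 and 3, so asking for two clusters should pair them up.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Hypothetical toy data: two independent signals, each observed twice with tiny noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_toy = np.column_stack([
    a,
    a + 0.01 * rng.normal(size=200),  # near-duplicate of column 0
    b,
    b + 0.01 * rng.normal(size=200),  # near-duplicate of column 2
])

# Merge the four features into two groups of similar features.
agglo_toy = FeatureAgglomeration(n_clusters=2)
X_toy_reduced = agglo_toy.fit_transform(X_toy)

print(agglo_toy.labels_)    # one cluster label per original feature
print(X_toy_reduced.shape)  # (200, 2)
```

Each output column is the pooled value (the mean, by default) of the original columns assigned to that cluster.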

Import Libraries

In this step, we import the libraries needed to perform feature agglomeration.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

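Load the Dataset

In this step, we load the digits dataset from scikit-learn. It contains 8x8 images of handwritten digits from 0 to 9. Each image is flattened into a row vector so that every pixel becomes one feature.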

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
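
As a quick check of what the reshape does, the following sketch prints the array shapes and confirms that flattening is row-major, so pixel (r, c) of an image ends up in column r * 8 + c of `X`. The pixel coordinates `r` and `c` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))

print(images.shape)  # (1797, 8, 8): 1797 images of 8x8 pixels
print(X.shape)       # (1797, 64): one row per image, one column per pixel

# Row-major flattening: pixel (r, c) maps to feature index r * 8 + c.
r, c = 2, 5  # arbitrary pixel chosen for illustration
print(X[0, r * 8 + c] == images[0, r, c])  # True
```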

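Define the Connectivity Matrix

In this step, we define the connectivity matrix with scikit-learn's grid_to_graph function. The function builds a connectivity graph from the image's pixel grid, so that only neighboring pixels can be merged into the same cluster.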

connectivity = grid_to_graph(*images[0].shape)
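
To see what this graph encodes, the sketch below builds it for an 8x8 grid and lists the pixels adjacent to the top-left corner. It assumes the same row-major pixel numbering used when flattening the images. The variable names `dense` and `neighbors_of_corner` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph

# One node per pixel of an 8x8 image; edges link horizontally and vertically adjacent pixels.
connectivity = grid_to_graph(8, 8)
print(connectivity.shape)  # (64, 64)

# Convert the sparse matrix to a dense array for easy inspection.
dense = connectivity.toarray()

# Pixels connected to pixel 0 (the top-left corner), ignoring any self-connection.
neighbors_of_corner = [int(j) for j in np.flatnonzero(dense[0]) if j != 0]
print(neighbors_of_corner)  # [1, 8]: the pixel to the right and the pixel below

# The graph is undirected, so the matrix is symmetric.
print(np.array_equal(dense, dense.T))  # True
```

Diagonal neighbors, such as pixel 9 for the corner pixel, are not connected.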

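Perform Feature Agglomeration

In this step, we perform feature agglomeration with scikit-learn's FeatureAgglomeration class. The number of clusters is set to 32, which reduces the 64 pixel features to 32.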

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
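
The following sketch checks what the transform produces, assuming the default pooling function (the mean). It rebuilds the reduced matrix by hand, averaging the pixels that share a cluster label, and compares the result with the output of the estimator. The name `manual` is an illustrative assumption.

```python
import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)
connectivity = grid_to_graph(8, 8)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
X_reduced = agglo.fit_transform(X)  # same as fit(X) followed by transform(X)

print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 32)
print(len(np.unique(agglo.labels_)))   # 32 distinct cluster labels over 64 pixels

# Manual pooling: column k is the mean of the pixels assigned to cluster k.
manual = np.column_stack(
    [X[:, agglo.labels_ == k].mean(axis=1) for k in range(32)]
)
print(np.allclose(manual, X_reduced))  # True
```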

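Inverse Transform

In this step, we apply the inverse transform to the reduced dataset to restore the original number of features. Each pixel receives the pooled value of its cluster, so the restored images are an approximation of the originals rather than an exact copy.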

X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)
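
A short sketch of what the inverse transform does under the hood: it copies each cluster's value back to every pixel in that cluster. The mean squared error at the end is one possible way to quantify the information lost. It is an illustrative choice, not something the tutorial prescribes.

```python
import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = images.reshape(len(images), -1)

agglo = cluster.FeatureAgglomeration(
    connectivity=grid_to_graph(8, 8), n_clusters=32
)
X_reduced = agglo.fit_transform(X)
X_restored = agglo.inverse_transform(X_reduced)

print(X_restored.shape)  # (1797, 64): back to one column per pixel

# Each pixel j takes the value of the cluster it belongs to.
print(np.array_equal(X_restored, X_reduced[:, agglo.labels_]))  # True

# Illustrative reconstruction error between original and restored pixels.
mse = np.mean((X - X_restored) ** 2)
print(f"mean squared reconstruction error: {mse:.3f}")
```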

Visualize the Results

In this step, we visualize the original images, the agglomerated images, and the cluster label assigned to each pixel.

plt.figure(1, figsize=(4, 3.5))
plt.clf()
plt.subplots_adjust(left=0.01, right=0.99, bottom=0.01, top=0.91)
for i in range(4):
    # Top row: the first four original digit images.
    plt.subplot(3, 4, i + 1)
    plt.imshow(images[i], cmap=plt.cm.gray, vmax=16, interpolation="nearest")
    plt.xticks(())
    plt.yticks(())
    if i == 1:
        plt.title("Original data")
    # Middle row: the same digits after agglomeration and inverse transform.
    plt.subplot(3, 4, 4 + i + 1)
    plt.imshow(images_restored[i], cmap=plt.cm.gray, vmax=16, interpolation="nearest")
    if i == 1:
        plt.title("Agglomerated data")
    plt.xticks(())
    plt.yticks(())

# Bottom row: the cluster label of each pixel, drawn on the 8x8 grid.
plt.subplot(3, 4, 10)
plt.imshow(
    np.reshape(agglo.labels_, images[0].shape),
    interpolation="nearest",
    cmap=plt.cm.nipy_spectral,
)
plt.xticks(())
plt.yticks(())
plt.title("Labels")
plt.show()

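Summary

In this tutorial, you learned how to use feature agglomeration to merge similar features in a dataset. Reducing the number of features while retaining most of the important information lowers the cost of downstream learning and can, in some cases, improve the performance of machine learning algorithms.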

Feature Agglomeration for High-Dimensional Data

Introduction