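Unsupervised Matrix Factorization on the Olivetti Faces Dataset

Introduction

In this lab, we apply several unsupervised matrix factorization (dimensionality reduction) methods from the sklearn.decomposition module to the Olivetti faces dataset. The dataset consists of 400 face images of 64x64 pixels from 40 individuals, taken under varying facial expressions and lighting conditions.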

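Data Preparation

First, we load and preprocess the Olivetti faces dataset. We center the data to zero mean both globally (for each feature, across all samples) and locally (for each sample, across all features). We also define a helper function for plotting a gallery of faces.

## Load and preprocess the Olivetti faces dataset.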

import logging

from numpy.random import RandomState
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn import cluster
from sklearn import decomposition

rng = RandomState(0)

## Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=rng)
n_samples, n_features = faces.shape

## Global centering (focus on one feature, centering all samples)
faces_centered = faces - faces.mean(axis=0)

## Local centering (focus on one sample, centering all features)
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)

print("Dataset consists of %d faces" % n_samples)

## Define a base function to plot the gallery of faces.

n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)


def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
    fig, axs = plt.subplots(
        nrows=n_row,
        ncols=n_col,
        figsize=(2.0 * n_col, 2.3 * n_row),
        facecolor="white",
        constrained_layout=True,
    )
    fig.set_constrained_layout_pads(w_pad=0.01, h_pad=0.02, hspace=0, wspace=0)
    fig.set_edgecolor("black")
    fig.suptitle(title, size=16)
    for ax, vec in zip(axs.flat, images):
        vmax = max(vec.max(), -vec.min())
        im = ax.imshow(
            vec.reshape(image_shape),
            cmap=cmap,
            interpolation="nearest",
            vmin=-vmax,
            vmax=vmax,
        )
        ax.axis("off")

    fig.colorbar(im, ax=axs, orientation="horizontal", shrink=0.99, aspect=40, pad=0.01)
    plt.show()


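## Take a look at the data. With the symmetric color scale used here, dark pixels
## are negative, mid-gray is zero, and light pixels are positive.

plot_gallery("Faces from dataset", faces_centered[:n_components])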
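
The two centering steps can be sanity-checked on a small synthetic array. The sketch below is not part of the original lab; the array `toy` and its shape are made up for illustration. It suggests that after subtracting the per-feature mean and then the per-sample mean, both sets of means are numerically zero.

```python
import numpy as np

# Hypothetical stand-in for the faces matrix: 5 samples, 4 features.
rng = np.random.default_rng(0)
toy = rng.random((5, 4))

# Global centering: subtract each feature's mean over all samples.
toy_centered = toy - toy.mean(axis=0)

# Local centering: subtract each sample's mean over all features.
toy_centered -= toy_centered.mean(axis=1, keepdims=True)

col_means = toy_centered.mean(axis=0)  # one mean per feature
row_means = toy_centered.mean(axis=1)  # one mean per sample
print("max |feature mean|:", np.abs(col_means).max())
print("max |sample mean|:", np.abs(row_means).max())
```

The second step does not undo the first: the per-sample means themselves average to zero once the features are centered, so subtracting them leaves the feature means at zero.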

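Eigenfaces - PCA using randomized SVD

The first method we apply is PCA (principal component analysis), a linear dimensionality reduction technique that uses the singular value decomposition (SVD) of the data to project it onto a lower-dimensional space. We use randomized SVD, a fast approximation of the standard SVD algorithm, and plot the first six principal components. These components are called eigenfaces.

## Eigenfaces - PCA using randomized SVD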
pca_estimator = decomposition.PCA(
    n_components=n_components, svd_solver="randomized", whiten=True
)
pca_estimator.fit(faces_centered)
plot_gallery(
    "Eigenfaces - PCA using randomized SVD", pca_estimator.components_[:n_components]
)
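
To make the link between PCA and the SVD concrete, here is a minimal NumPy sketch on made-up data. It uses the exact SVD from `numpy.linalg.svd` rather than the randomized solver used in the lab, and it skips whitening, so it should be read as an illustration of the idea and not as what scikit-learn does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 8 correlated features plus a little noise.
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 8))
X += 0.1 * rng.normal(size=X.shape)
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt. The rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
components = Vt[:k]                                   # shape (k, n_features)
explained_variance = S[:k] ** 2 / (X.shape[0] - 1)    # variance along each axis
projected = X_centered @ components.T                 # shape (n_samples, k)

# The principal axes are orthonormal, so this is close to the identity matrix.
print(np.round(components @ components.T, 6))
```

In the face example, each row of `components` would be one eigenface once reshaped to 64x64.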

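Non-negative components - NMF

Next, we apply non-negative matrix factorization (NMF). NMF factorizes the data matrix into two non-negative matrices, one holding the basis vectors and the other the coefficients. This leads to a parts-based representation of the data. Because NMF requires non-negative input, it is fit on the original, uncentered faces.

## Non-negative components - NMF
nmf_estimator = decomposition.NMF(n_components=n_components, tol=5e-3)
nmf_estimator.fit(faces)  ## original non-negative dataset
plot_gallery("Non-negative components - NMF", nmf_estimator.components_[:n_components])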
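
One classical way to compute such a factorization is the multiplicative update rule of Lee and Seung. The sketch below applies it to a small random matrix. It is shown for intuition only: it is not the solver scikit-learn's NMF uses by default, and the matrix sizes, iteration count, and the small constant `eps` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 12))        # hypothetical non-negative data
k = 3                           # number of components

W = rng.random((20, k))         # coefficients, one row per sample
H = rng.random((k, 12))         # basis vectors, one row per component
eps = 1e-10                     # guards against division by zero

err0 = np.linalg.norm(X - W @ H)
for _ in range(200):
    # Multiplicative updates: every factor is non-negative, so W and H stay non-negative.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(X - W @ H)

print("reconstruction error before:", err0)
print("reconstruction error after: ", err)
```

Here `H` plays the role of `components_` in the lab code, and `W` the role of the transformed data.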

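Independent components - FastICA

Independent component analysis (ICA) separates a multivariate signal into additive subcomponents that are maximally independent. We apply FastICA, a fast and robust ICA algorithm.

## Independent components - FastICA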
ica_estimator = decomposition.FastICA(
    n_components=n_components, max_iter=400, whiten="arbitrary-variance", tol=15e-5
)
ica_estimator.fit(faces_centered)
plot_gallery(
    "Independent components - FastICA", ica_estimator.components_[:n_components]
)
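
The idea of separating a mixed signal is easier to see on one-dimensional signals than on faces. The following sketch mixes two made-up sources and asks FastICA to unmix them. The signals, the mixing matrix `A`, and the choice of `whiten="unit-variance"` are assumptions for illustration and are not taken from the lab.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two made-up source signals: a sine wave and a square wave.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Mix them with an arbitrary (hypothetical) mixing matrix.
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = S @ A.T

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
S_est = ica.fit_transform(X)  # estimated sources, up to order, sign, and scale

# With as many components as features, the estimated mixing matrix
# maps the estimated sources back to the observations.
X_back = S_est @ ica.mixing_.T + ica.mean_
print("observations reconstructed:", np.allclose(X, X_back))
```

ICA cannot recover the order, sign, or scale of the sources, which is why independent components of faces may appear with inverted contrast.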

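Sparse components - MiniBatchSparsePCA

Sparse PCA is a variant of PCA that encourages sparsity in the loading vectors, which yields a more interpretable decomposition. We use MiniBatchSparsePCA, a faster version of SparsePCA that is better suited to large datasets.

## Sparse components - MiniBatchSparsePCA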
batch_pca_estimator = decomposition.MiniBatchSparsePCA(
    n_components=n_components, alpha=0.1, max_iter=100, batch_size=3, random_state=rng
)
batch_pca_estimator.fit(faces_centered)
plot_gallery(
    "Sparse components - MiniBatchSparsePCA",
    batch_pca_estimator.components_[:n_components],
)
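
Sparsity of this kind usually comes from an L1 penalty, whose strength is set here by `alpha`. A common building block for L1-penalized problems is the soft-thresholding operator, sketched below on a made-up loading vector. This is an intuition aid, not scikit-learn's internal code; the function name `soft_threshold` and the example values are hypothetical.

```python
import numpy as np

def soft_threshold(v, alpha):
    """Shrink every entry of v toward zero by alpha; entries within alpha of zero become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

# Hypothetical dense loading vector with a few small entries.
loadings = np.array([0.9, -0.05, 0.3, 0.02, -0.6])
sparse_loadings = soft_threshold(loadings, alpha=0.1)

print(sparse_loadings)
print("non-zero entries:", np.count_nonzero(sparse_loadings))
```

Small entries are set exactly to zero while large ones are only shrunk, which is why a larger `alpha` tends to produce components that light up fewer pixels.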

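Dictionary learning

Dictionary learning finds a sparse representation of the input data as a combination of simple elements. These simple elements form the dictionary. We apply MiniBatchDictionaryLearning, a faster version of DictionaryLearning that is better suited to large datasets.

## Dictionary learning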
batch_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components, alpha=0.1, max_iter=50, batch_size=3, random_state=rng
)
batch_dict_estimator.fit(faces_centered)
plot_gallery("Dictionary learning", batch_dict_estimator.components_[:n_components])

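Cluster centers - MiniBatchKMeans

K-means clustering partitions a dataset into clusters by minimizing the sum of squared distances between each point and the center of its assigned cluster. We apply MiniBatchKMeans, a faster version of KMeans that is better suited to large datasets.

## Cluster centers - MiniBatchKMeans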
kmeans_estimator = cluster.MiniBatchKMeans(
    n_clusters=n_components,
    tol=1e-3,
    batch_size=20,
    max_iter=50,
    random_state=rng,
    n_init="auto",
)
kmeans_estimator.fit(faces_centered)
plot_gallery(
    "Cluster centers - MiniBatchKMeans",
    kmeans_estimator.cluster_centers_[:n_components],
)
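
The objective being minimized can be tracked directly. The sketch below runs plain full-batch Lloyd iterations in NumPy on two made-up blobs and records the sum of squared distances at each step. It is not the mini-batch variant used in the lab, and the helper names `assign` and `inertia` are hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Return the index of the nearest center for every row of X."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def inertia(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return ((X - centers[labels]) ** 2).sum()

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated 2-D blobs.
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])

centers = X[:2].copy()  # deliberately poor start: both centers in one blob
history = []
for _ in range(10):
    labels = assign(X, centers)
    history.append(inertia(X, centers, labels))
    for k in range(len(centers)):
        if np.any(labels == k):  # leave a center in place if its cluster is empty
            centers[k] = X[labels == k].mean(axis=0)

print("objective at first and last iteration:", history[0], history[-1])
```

Each center ends up as the mean of its assigned points, which is why the cluster centers plotted for the faces look like blurred average faces.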

Factor Analysis components - FA

Factor analysis is similar to PCA, but has the advantage of modeling the variance in every direction of the input space independently (heteroscedastic noise). We apply FactorAnalysis, the scikit-learn implementation of factor analysis.

## Factor Analysis components - FA
fa_estimator = decomposition.FactorAnalysis(n_components=n_components, max_iter=20)
fa_estimator.fit(faces_centered)
plot_gallery("Factor Analysis (FA)", fa_estimator.components_[:n_components])

## --- Pixelwise variance
plt.figure(figsize=(3.2, 3.6), facecolor="white", tight_layout=True)
vec = fa_estimator.noise_variance_
vmax = max(vec.max(), -vec.min())
plt.imshow(
    vec.reshape(image_shape),
    cmap=plt.cm.gray,
    interpolation="nearest",
    vmin=-vmax,
    vmax=vmax,
)
plt.axis("off")
plt.title("Pixelwise variance from Factor Analysis (FA)", size=16, wrap=True)
plt.colorbar(orientation="horizontal", shrink=0.8, pad=0.03)
plt.show()
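
The per-feature noise term is what separates factor analysis from PCA. The sketch below fits FactorAnalysis to made-up data in which each feature has a different noise level, then rebuilds the model covariance from the loadings and the estimated noise variances. The data-generating values (`W`, `noise_std`, the sample count) are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
n_samples = 2000
W = rng.randn(1, 4)                           # one hypothetical latent factor
noise_std = np.array([0.1, 0.5, 1.0, 2.0])    # a different noise level per feature
X = rng.randn(n_samples, 1) @ W + rng.randn(n_samples, 4) * noise_std

fa = FactorAnalysis(n_components=1, random_state=0).fit(X)
print("estimated noise variance per feature:", np.round(fa.noise_variance_, 2))

# Model covariance: loadings^T @ loadings plus a diagonal noise term.
model_cov = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
print("matches get_covariance():", np.allclose(model_cov, fa.get_covariance()))
```

In the face example, `noise_variance_` has one entry per pixel, which is why it can be reshaped and displayed as an image.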

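Decomposition: Dictionary learning

We now apply MiniBatchDictionaryLearning again, this time enforcing positivity when finding the dictionary and/or the coding coefficients.

Dictionary learning - positive dictionary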

dict_pos_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    random_state=rng,
    positive_dict=True,
)
dict_pos_dict_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary",
    dict_pos_dict_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary learning - positive code

dict_pos_code_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_code=True,
)
dict_pos_code_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive code",
    dict_pos_code_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary learning - positive dictionary & code

dict_pos_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_dict=True,
    positive_code=True,
)
dict_pos_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary & code",
    dict_pos_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

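Summary

In this lab, we applied several unsupervised matrix factorization methods to the Olivetti faces dataset. We used PCA, NMF, ICA, sparse PCA, dictionary learning, K-means clustering, and factor analysis to extract different kinds of features from the data. We also enforced positivity constraints on the dictionary and/or the coding coefficients in dictionary learning. Overall, these methods can be useful for reducing the dimensionality of high-dimensional data and for extracting meaningful features for downstream tasks such as classification and clustering.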
