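Unsupervised Matrix Decomposition on the Olivetti Faces Dataset

Introduction

In this lab, we apply several unsupervised matrix decomposition (dimensionality reduction) methods from the sklearn.decomposition module to the Olivetti faces dataset. The dataset consists of 400 face images of 64x64 pixels from 40 individuals, each captured with different facial expressions and lighting conditions.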

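Dataset preparation

First, we load and preprocess the Olivetti faces dataset. To make the data zero-mean, we center it both globally (per feature, across all samples) and locally (per sample, across all features). We also define a helper function for plotting a gallery of faces.

## Load and preprocess the Olivetti faces dataset.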

import logging

from numpy.random import RandomState
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn import cluster
from sklearn import decomposition

rng = RandomState(0)

## Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=rng)
n_samples, n_features = faces.shape

## Global centering (focus on one feature, centering all samples)
faces_centered = faces - faces.mean(axis=0)

## Local centering (focus on one sample, centering all features)
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)

print("Dataset consists of %d faces" % n_samples)

## Define a helper function to plot a gallery of faces.

n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)


def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
    fig, axs = plt.subplots(
        nrows=n_row,
        ncols=n_col,
        figsize=(2.0 * n_col, 2.3 * n_row),
        facecolor="white",
        constrained_layout=True,
    )
    fig.set_constrained_layout_pads(w_pad=0.01, h_pad=0.02, hspace=0, wspace=0)
    fig.set_edgecolor("black")
    fig.suptitle(title, size=16)
    for ax, vec in zip(axs.flat, images):
        vmax = max(vec.max(), -vec.min())
        im = ax.imshow(
            vec.reshape(image_shape),
            cmap=cmap,
            interpolation="nearest",
            vmin=-vmax,
            vmax=vmax,
        )
        ax.axis("off")

    fig.colorbar(im, ax=axs, orientation="horizontal", shrink=0.99, aspect=40, pad=0.01)
    plt.show()


## Let's take a look at the data. Gray indicates negative values,
## white indicates positive values.

plot_gallery("Faces from dataset", faces_centered[:n_components])
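
The two centering steps can be checked on a small synthetic array. The sketch below is illustrative only: the random array `X` stands in for `faces`, and the names `col_means` and `row_means` are introduced here, not taken from the original code. It shows that after per-feature and then per-sample centering, both sets of means are numerically zero.

```python
import numpy as np

# Hypothetical stand-in for the faces matrix: 5 samples, 8 features.
rng = np.random.RandomState(0)
X = rng.rand(5, 8)

# Global centering: subtract each feature's mean over all samples.
X_centered = X - X.mean(axis=0)

# Local centering: subtract each sample's mean over all features.
X_centered -= X_centered.mean(axis=1).reshape(X.shape[0], -1)

col_means = X_centered.mean(axis=0)  # per-feature means
row_means = X_centered.mean(axis=1)  # per-sample means
print(np.allclose(col_means, 0.0), np.allclose(row_means, 0.0))
```

The second step does not undo the first: the per-sample means of a globally centered matrix average to zero, so subtracting them leaves every column mean at zero.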

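Eigenfaces - PCA using randomized SVD

The first method we apply is principal component analysis (PCA), a linear dimensionality reduction technique that uses the singular value decomposition (SVD) of the data to project it onto a lower-dimensional space. We use randomized SVD, a fast approximation of the standard SVD algorithm, and plot the first 6 principal components (called eigenfaces).

## Eigenfaces - PCA using randomized SVD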
pca_estimator = decomposition.PCA(
    n_components=n_components, svd_solver="randomized", whiten=True
)
pca_estimator.fit(faces_centered)
plot_gallery(
    "Eigenfaces - PCA using randomized SVD", pca_estimator.components_[:n_components]
)
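
To make the link between PCA and the SVD concrete, here is a minimal NumPy sketch on synthetic data. It uses the exact SVD from `np.linalg.svd` rather than the randomized solver used above, and all variable names are illustrative. It shows that the leading right singular vectors form an orthonormal basis and that the squared reconstruction error of the rank-`k` projection equals the sum of the discarded squared singular values.

```python
import numpy as np

# Synthetic data standing in for the face matrix (an assumption).
rng = np.random.RandomState(0)
X = rng.rand(50, 10)
Xc = X - X.mean(axis=0)  # PCA operates on feature-centered data

# Exact thin SVD: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
components = Vt[:k]             # principal axes, one per row
scores = Xc @ components.T      # low-dimensional coordinates
X_approx = scores @ components  # projection back to the input space

recon_error = np.sum((Xc - X_approx) ** 2)
discarded = np.sum(S[k:] ** 2)
print(recon_error, discarded)
```

A randomized solver approximates the same leading singular vectors at lower cost, which matters more as the data matrix grows.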

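Non-negative components - NMF

Next, we apply non-negative matrix factorization (NMF), which factorizes the data matrix into two non-negative matrices: one holding the basis vectors and the other holding the coefficients. This yields a parts-based representation of the data.

## Non-negative components - NMF
nmf_estimator = decomposition.NMF(n_components=n_components, tol=5e-3)
nmf_estimator.fit(faces)  ## original non-negative dataset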
plot_gallery("Non-negative components - NMF", nmf_estimator.components_[:n_components])
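
The factorization idea can be sketched with the classic multiplicative update rules for the Frobenius loss. This is not the solver that `decomposition.NMF` uses by default (coordinate descent); it is swapped in here only because it fits in a few lines. The matrices `V`, `W`, and `H` and the constant `eps` are illustrative names, with `H` playing the role of `components_`.

```python
import numpy as np

rng = np.random.RandomState(0)
V = rng.rand(20, 12)  # non-negative data: 20 samples, 12 features
k = 3                 # number of components

W = rng.rand(20, k)   # non-negative coefficients
H = rng.rand(k, 12)   # non-negative basis vectors (rows)
eps = 1e-10           # guards against division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative,
    # because each factor is multiplied by a non-negative ratio.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(initial_error, final_error)
```

Because neither factor can go negative, the basis vectors can only add up to form a sample, which is what gives NMF its parts-based character.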

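Independent components - FastICA

Independent component analysis (ICA) separates a multivariate signal into additive subcomponents that are maximally independent. We apply FastICA, a fast and robust algorithm for ICA.

## Independent components - FastICA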
ica_estimator = decomposition.FastICA(
    n_components=n_components, max_iter=400, whiten="arbitrary-variance", tol=15e-5
)
ica_estimator.fit(faces_centered)
plot_gallery(
    "Independent components - FastICA", ica_estimator.components_[:n_components]
)

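Sparse components - MiniBatchSparsePCA

Sparse principal component analysis (Sparse PCA) is a variant of PCA that encourages sparsity in the loading vectors, which yields a more interpretable decomposition. We use MiniBatchSparsePCA, a faster variant of SparsePCA that is better suited to large datasets.

## Sparse components - MiniBatchSparsePCA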
batch_pca_estimator = decomposition.MiniBatchSparsePCA(
    n_components=n_components, alpha=0.1, max_iter=100, batch_size=3, random_state=rng
)
batch_pca_estimator.fit(faces_centered)
plot_gallery(
    "Sparse components - MiniBatchSparsePCA",
    batch_pca_estimator.components_[:n_components],
)

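Dictionary learning

Dictionary learning finds a sparse representation of the input data as a combination of simple elements, and these elements form the dictionary. We apply MiniBatchDictionaryLearning, a faster variant of DictionaryLearning that is better suited to large datasets.

## Dictionary learning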
batch_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components, alpha=0.1, max_iter=50, batch_size=3, random_state=rng
)
batch_dict_estimator.fit(faces_centered)
plot_gallery("Dictionary learning", batch_dict_estimator.components_[:n_components])

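Cluster centers - MiniBatchKMeans

K-means clustering partitions a dataset into clusters by minimizing the sum of squared distances between each point and the centroid of its assigned cluster. We apply MiniBatchKMeans, a faster variant of KMeans that is better suited to large datasets.

## Cluster centers - MiniBatchKMeans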
kmeans_estimator = cluster.MiniBatchKMeans(
    n_clusters=n_components,
    tol=1e-3,
    batch_size=20,
    max_iter=50,
    random_state=rng,
    n_init="auto",
)
kmeans_estimator.fit(faces_centered)
plot_gallery(
    "Cluster centers - MiniBatchKMeans",
    kmeans_estimator.cluster_centers_[:n_components],
)
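
The objective described above can be illustrated with a plain, full-batch Lloyd iteration in NumPy. This is a simplified sketch, not the mini-batch algorithm that `MiniBatchKMeans` implements, and the helper names `assign`, `update`, and `inertia` are hypothetical. It records the sum of squared distances after each iteration to show that it never increases.

```python
import numpy as np


def assign(X, centers):
    """Label each sample with the index of its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def update(X, labels, centers):
    """Move each center to the mean of its members (empty clusters stay put)."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return new_centers


def inertia(X, centers, labels):
    """Sum of squared distances between samples and their assigned centers."""
    return float(np.sum((X - centers[labels]) ** 2))


rng = np.random.RandomState(0)
X = rng.rand(60, 4)      # synthetic data (an assumption)
centers = X[:3].copy()   # naive initialization from the first 3 samples
labels = assign(X, centers)

history = [inertia(X, centers, labels)]
for _ in range(10):
    centers = update(X, labels, centers)
    labels = assign(X, centers)
    history.append(inertia(X, centers, labels))

print(history[0], history[-1])
```

The mini-batch variant trades this strict monotone decrease for speed by updating the centers from small random subsets of the data.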

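Factor analysis components - FA

Factor analysis is similar to PCA, but has the advantage of modelling the variance in every direction of the input space independently (heteroscedastic noise). We apply FactorAnalysis, the scikit-learn implementation of factor analysis.

## Factor analysis components - FA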
fa_estimator = decomposition.FactorAnalysis(n_components=n_components, max_iter=20)
fa_estimator.fit(faces_centered)
plot_gallery("Factor Analysis (FA)", fa_estimator.components_[:n_components])

## --- Pixelwise variance
plt.figure(figsize=(3.2, 3.6), facecolor="white", tight_layout=True)
vec = fa_estimator.noise_variance_
vmax = max(vec.max(), -vec.min())
plt.imshow(
    vec.reshape(image_shape),
    cmap=plt.cm.gray,
    interpolation="nearest",
    vmin=-vmax,
    vmax=vmax,
)
plt.axis("off")
plt.title("Pixelwise variance from \n Factor Analysis (FA)", size=16, wrap=True)
plt.colorbar(orientation="horizontal", shrink=0.8, pad=0.03)
plt.show()

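Decomposition: Dictionary learning

Here we apply MiniBatchDictionaryLearning again, this time enforcing positivity when finding the dictionary and/or the coding coefficients.

Dictionary learning - positive dictionary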

dict_pos_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    random_state=rng,
    positive_dict=True,
)
dict_pos_dict_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary",
    dict_pos_dict_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary learning - positive code

dict_pos_code_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_code=True,
)
dict_pos_code_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive code",
    dict_pos_code_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary learning - positive dictionary and code

dict_pos_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_dict=True,
    positive_code=True,
)
dict_pos_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary and code",
    dict_pos_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

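Summary

In this lab, we applied several unsupervised matrix decomposition methods to the Olivetti faces dataset. We used PCA, NMF, ICA, sparse PCA, dictionary learning, k-means clustering, and factor analysis to extract different kinds of features from the data. We also enforced positivity on the dictionary and the coding coefficients in dictionary learning. Overall, these methods can help reduce the dimensionality of high-dimensional datasets and extract features that are meaningful for downstream tasks such as classification and clustering.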
