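Unsupervised Matrix Decomposition on the Olivetti Faces Dataset

Introduction

This lab applies different unsupervised matrix decomposition (dimensionality reduction) methods from the sklearn.decomposition module to the Olivetti faces dataset. The dataset contains 400 grayscale face images of 64x64 pixels from 40 people (10 images per person), taken with varying facial expressions and lighting conditions.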

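Dataset Preparation

First, we load and preprocess the Olivetti faces dataset. We center the data to zero mean in two steps: global centering subtracts each pixel's mean across all images (one feature at a time, over all samples), and local centering subtracts each image's own mean across its pixels (one sample at a time, over all features). We also define a base function that plots a gallery of faces.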

# Load and preprocess the Olivetti faces dataset.

import logging

from numpy.random import RandomState
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn import cluster
from sklearn import decomposition

rng = RandomState(0)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=rng)
n_samples, n_features = faces.shape

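# Global centering (focus on one feature, centering all samples)
faces_centered = faces - faces.mean(axis=0)

# Local centering (focus on one sample, centering all features)
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)

print("Dataset consists of %d faces" % n_samples)

# Define a base function to plot the gallery of faces.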

n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)


def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
    fig, axs = plt.subplots(
        nrows=n_row,
        ncols=n_col,
        figsize=(2.0 * n_col, 2.3 * n_row),
        facecolor="white",
        constrained_layout=True,
    )
    fig.set_constrained_layout_pads(w_pad=0.01, h_pad=0.02, hspace=0, wspace=0)
    fig.set_edgecolor("black")
    fig.suptitle(title, size=16)
    for ax, vec in zip(axs.flat, images):
        vmax = max(vec.max(), -vec.min())
        im = ax.imshow(
            vec.reshape(image_shape),
            cmap=cmap,
            interpolation="nearest",
            vmin=-vmax,
            vmax=vmax,
        )
        ax.axis("off")

    fig.colorbar(im, ax=axs, orientation="horizontal", shrink=0.99, aspect=40, pad=0.01)
    plt.show()


# Let's take a look at our data. Gray indicates negative values,
# white indicates positive values.

plot_gallery("Faces from dataset", faces_centered[:n_components])
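
To make the two centering steps concrete, here is a minimal sketch on a toy matrix. The values are made up and only stand in for the faces array (rows are images, columns are pixels); the variable names are illustrative.

```python
import numpy as np

# Toy stand-in for the faces matrix: 4 "images" (rows) with 3 "pixels" (columns).
X = np.array(
    [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],
        [0.0, 1.0, 5.0],
        [5.0, 1.0, 0.0],
    ]
)

# Global centering: subtract each pixel's mean over all images (column means).
X_centered = X - X.mean(axis=0)

# Local centering: subtract each image's own mean over its pixels (row means).
X_centered -= X_centered.mean(axis=1, keepdims=True)

# Both the per-pixel and the per-image means are now (numerically) zero.
print(np.allclose(X_centered.mean(axis=0), 0.0))
print(np.allclose(X_centered.mean(axis=1), 0.0))
```

The second step does not undo the first: after global centering the overall mean is zero, so the row means that get subtracted average to zero across rows and the column means stay at zero.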

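Eigenfaces - PCA Using Randomized SVD

The first method we apply is principal component analysis (PCA), a linear dimensionality reduction technique that uses the singular value decomposition (SVD) of the data to project it onto a lower-dimensional space. We use randomized SVD, a faster approximation of the standard SVD algorithm. We plot the first six principal components, which are known as eigenfaces.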

# Eigenfaces - PCA using randomized SVD
pca_estimator = decomposition.PCA(
    n_components=n_components, svd_solver="randomized", whiten=True
)
pca_estimator.fit(faces_centered)
plot_gallery(
    "Eigenfaces - PCA using randomized SVD", pca_estimator.components_[:n_components]
)
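
As a rough check of the claim that randomized SVD approximates the exact decomposition, the sketch below fits PCA with both solvers on synthetic low-rank data. The data, sizes, and variable names are assumptions for illustration only, not part of the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic stand-in: 200 samples, 50 features, driven by 3 latent directions
# of decreasing strength plus a small amount of noise.
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

pca_full = PCA(n_components=3, svd_solver="full").fit(X)
pca_rand = PCA(n_components=3, svd_solver="randomized", random_state=0).fit(X)

# The share of variance captured by each component should nearly coincide.
print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)

# The components are orthonormal rows of shape (n_components, n_features).
gram = pca_rand.components_ @ pca_rand.components_.T
print(np.allclose(gram, np.eye(3), atol=1e-6))
```

On data with a clear low-rank structure like this, the two solvers are expected to agree closely; on real images the approximation error depends on how quickly the singular values decay.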

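Non-negative Components - NMF

Next, we apply non-negative matrix factorization (NMF), which approximates the data matrix as the product of two non-negative matrices, one holding the basis vectors and the other the coefficients. Because the parts can only be added and never subtracted, this yields a parts-based representation of the data.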

# Non-negative components - NMF
nmf_estimator = decomposition.NMF(n_components=n_components, tol=5e-3)
nmf_estimator.fit(faces)  # original non-negative dataset
plot_gallery("Non-negative components - NMF", nmf_estimator.components_[:n_components])
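
The two-factor structure described above can be sketched on a small synthetic matrix. The data here are random non-negative values with an exact rank-2 structure (an assumption made for the demo), and `W` and `H` are just conventional names for the coefficient and basis matrices.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)

# Build a non-negative matrix that is exactly the product of two
# non-negative factors (100 samples, 20 features, rank 2).
W_true = rng.uniform(size=(100, 2))
H_true = rng.uniform(size=(2, 20))
X = W_true @ H_true

nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)  # coefficients, shape (n_samples, n_components)
H = nmf.components_       # basis vectors, shape (n_components, n_features)

# Both factors are non-negative, and their product approximates X.
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.min() >= 0, H.min() >= 0)
print("relative reconstruction error:", rel_err)
```

Note that NMF is fitted on the raw `faces` array in the lab rather than on `faces_centered`, because centering introduces negative values that NMF does not accept.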

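Independent Components - FastICA

Independent component analysis (ICA) separates a multivariate signal into additive subcomponents that are maximally independent of each other. We apply FastICA, a fast and robust algorithm for ICA.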

# Independent components - FastICA
ica_estimator = decomposition.FastICA(
    n_components=n_components, max_iter=400, whiten="arbitrary-variance", tol=15e-5
)
ica_estimator.fit(faces_centered)
plot_gallery(
    "Independent components - FastICA", ica_estimator.components_[:n_components]
)
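
The idea of separating additive, independent sources is easier to see on one-dimensional signals than on faces. The sketch below mixes a sine wave and a square wave with a made-up mixing matrix and asks FastICA to unmix them; the signals and the matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two source signals sampled on a common time axis.
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)           # sinusoid
s2 = np.sign(np.sin(3 * t))  # square wave
S = np.c_[s1, s2]

# Observations are linear mixtures of the sources (hypothetical mixing matrix).
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, whiten="unit-variance", max_iter=1000, random_state=0)
S_est = ica.fit_transform(X)

# ICA recovers sources only up to order, sign, and scale, so compare
# absolute correlations between true and estimated sources.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(np.round(corr, 3))
```

Each row of `corr` should contain one value close to 1, which marks the estimated component that matches that true source.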

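Sparse Components - MiniBatchSparsePCA

Sparse PCA is a variant of PCA that encourages sparsity in the loading vectors, which gives a more interpretable decomposition. We use MiniBatchSparsePCA, a variant of SparsePCA that is faster, at some cost in accuracy, and better suited to large datasets.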

# Sparse components - MiniBatchSparsePCA
batch_pca_estimator = decomposition.MiniBatchSparsePCA(
    n_components=n_components, alpha=0.1, max_iter=100, batch_size=3, random_state=rng
)
batch_pca_estimator.fit(faces_centered)
plot_gallery(
    "Sparse components - MiniBatchSparsePCA",
    batch_pca_estimator.components_[:n_components],
)

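Dictionary Learning

Dictionary learning finds a sparse representation of the input data as a combination of simple elements; together these elements form a dictionary. We apply MiniBatchDictionaryLearning, a faster variant of DictionaryLearning that is better suited to large datasets.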

# Dictionary learning
batch_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components, alpha=0.1, max_iter=50, batch_size=3, random_state=rng
)
batch_dict_estimator.fit(faces_centered)
plot_gallery("Dictionary learning", batch_dict_estimator.components_[:n_components])

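Cluster Centers - MiniBatchKMeans

K-means clustering partitions a dataset into clusters by minimizing the sum of squared distances between each point and the centroid of the cluster it is assigned to. We apply MiniBatchKMeans, a faster variant of KMeans that is better suited to large datasets.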

# Cluster centers - MiniBatchKMeans
kmeans_estimator = cluster.MiniBatchKMeans(
    n_clusters=n_components,
    tol=1e-3,
    batch_size=20,
    max_iter=50,
    random_state=rng,
    n_init="auto",
)
kmeans_estimator.fit(faces_centered)
plot_gallery(
    "Cluster centers - MiniBatchKMeans",
    kmeans_estimator.cluster_centers_[:n_components],
)

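Factor Analysis Components - FA

Factor analysis is similar to PCA, but has the advantage of modeling the variance in every direction of the input space independently (heteroscedastic noise). We apply FactorAnalysis, the implementation of factor analysis in scikit-learn.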

# Factor Analysis components - FA
fa_estimator = decomposition.FactorAnalysis(n_components=n_components, max_iter=20)
fa_estimator.fit(faces_centered)
plot_gallery("Factor Analysis (FA)", fa_estimator.components_[:n_components])

# --- Pixelwise variance
plt.figure(figsize=(3.2, 3.6), facecolor="white", tight_layout=True)
vec = fa_estimator.noise_variance_
vmax = max(vec.max(), -vec.min())
plt.imshow(
    vec.reshape(image_shape),
    cmap=plt.cm.gray,
    interpolation="nearest",
    vmin=-vmax,
    vmax=vmax,
)
plt.axis("off")
plt.title("Pixelwise variance from Factor Analysis (FA)", size=16, wrap=True)
plt.colorbar(orientation="horizontal", shrink=0.8, pad=0.03)
plt.show()
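
The per-feature noise model is what `noise_variance_` exposes in the plot above. As a small sanity check, the sketch below generates data from one latent factor with deliberately unequal noise levels (an assumed setup) and looks at what FactorAnalysis estimates.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
n_samples = 2000

# One latent factor drives all six features equally...
z = rng.normal(size=(n_samples, 1))
loadings = np.ones((1, 6))

# ...but the noise level differs per feature (heteroscedastic noise):
# the first three features are quiet, the last three are noisy.
noise_std = np.array([0.1, 0.1, 0.1, 2.0, 2.0, 2.0])
X = z @ loadings + noise_std * rng.normal(size=(n_samples, 6))

fa = FactorAnalysis(n_components=1, random_state=0).fit(X)

# One estimated noise variance per feature.
print(np.round(fa.noise_variance_, 2))
```

The estimates for the noisy features come out clearly larger than those for the quiet ones, which PCA cannot express because it assumes the same noise level in every direction.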

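Decomposition: Dictionary Learning

We apply MiniBatchDictionaryLearning again, this time enforcing positivity when finding the dictionary and/or the encoding coefficients.

Dictionary Learning - Positive Dictionary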

dict_pos_dict_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    random_state=rng,
    positive_dict=True,
)
dict_pos_dict_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary",
    dict_pos_dict_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary Learning - Positive Code

dict_pos_code_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_code=True,
)
dict_pos_code_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive code",
    dict_pos_code_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

Dictionary Learning - Positive Dictionary and Code

dict_pos_estimator = decomposition.MiniBatchDictionaryLearning(
    n_components=n_components,
    alpha=0.1,
    max_iter=50,
    batch_size=3,
    fit_algorithm="cd",
    random_state=rng,
    positive_dict=True,
    positive_code=True,
)
dict_pos_estimator.fit(faces_centered)
plot_gallery(
    "Dictionary learning - positive dictionary & code",
    dict_pos_estimator.components_[:n_components],
    cmap=plt.cm.RdBu,
)

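Summary

In this lab, we applied several unsupervised matrix decomposition methods to the Olivetti faces dataset. We used PCA, NMF, ICA, Sparse PCA, dictionary learning, K-means clustering, and factor analysis to extract different types of features from the data. We also enforced positivity on the dictionary and/or the encoding coefficients in dictionary learning. Overall, these methods can be useful for reducing the dimensionality of high-dimensional datasets and for extracting meaningful features for downstream tasks such as classification and clustering.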