Exploring Probabilistic PCA and Factor Analysis for Model Selection

Introduction

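In this lab, we explore two probabilistic models, Probabilistic PCA and Factor Analysis, and compare how well they perform at model selection and covariance estimation. We run cross-validation on low-rank data corrupted by either homoscedastic noise (the same variance for every feature) or heteroscedastic noise (a different variance for each feature). We also compare the model likelihoods with the likelihoods obtained from shrinkage covariance estimators.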
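The difference between the two models can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the lab itself: x is an observed sample with p features, z is a latent vector with k components, W is the p-by-k loading matrix, and the noise term is the only place where the two models differ.

```latex
\begin{aligned}
x &= W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I_k) \\
\text{Probabilistic PCA:}\quad \varepsilon &\sim \mathcal{N}(0, \sigma^2 I_p)
  &&\Rightarrow\ \operatorname{Cov}(x) = W W^\top + \sigma^2 I_p \\
\text{Factor Analysis:}\quad \varepsilon &\sim \mathcal{N}(0, \Psi),\ \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
  &&\Rightarrow\ \operatorname{Cov}(x) = W W^\top + \Psi
\end{aligned}
```

Probabilistic PCA therefore assumes one shared noise variance (the homoscedastic case), while Factor Analysis allows a separate noise variance per feature (the heteroscedastic case).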

Create the Data

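We create a simulated dataset with 500 samples and 25 features whose underlying signal has rank 5. We then add homoscedastic noise to one copy of the data and heteroscedastic noise to another.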

import numpy as np
from scipy import linalg

n_samples, n_features, rank = 500, 25, 5
sigma = 1.0
rng = np.random.RandomState(42)
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

## Add homoscedastic noise (same standard deviation for every feature)
X_homo = X + sigma * rng.randn(n_samples, n_features)

## Add heteroscedastic noise (a different standard deviation per feature)
sigmas = sigma * rng.rand(n_features) + sigma / 2.0
X_hetero = X + rng.randn(n_samples, n_features) * sigmas
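As a quick sanity check on the construction above, the following sketch rebuilds the noiseless signal and confirms that it has rank 5, and that adding Gaussian noise makes the observed matrix full rank. The names signal_rank and noisy_rank are hypothetical helpers introduced only for this check.

```python
import numpy as np
from scipy import linalg

n_samples, n_features, rank = 500, 25, 5
rng = np.random.RandomState(42)

# Same construction as in the lab: project rank-5 latent data onto
# the first 5 columns of a random orthogonal basis U.
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

# The noiseless signal should span exactly `rank` directions.
signal_rank = np.linalg.matrix_rank(X)

# Dense Gaussian noise fills all 25 directions.
X_homo = X + 1.0 * rng.randn(n_samples, n_features)
noisy_rank = np.linalg.matrix_rank(X_homo)

print("rank of noiseless signal:", signal_rank)
print("rank after adding noise:", noisy_rank)
```

This is why the rank cannot simply be read off the noisy data: model selection has to separate the 5 signal directions from the noise.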

Fit the Models

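We fit Probabilistic PCA and Factor Analysis models to each dataset and evaluate them with cross-validation over a grid of n_components values. We also compute cross-validated scores for two shrinkage covariance estimators (ShrunkCovariance with a grid-searched shrinkage, and LedoitWolf) and compare the results.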

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.covariance import ShrunkCovariance, LedoitWolf
from sklearn.model_selection import cross_val_score, GridSearchCV

n_components = np.arange(0, n_features, 5)  ## candidate values for n_components

def compute_scores(X):
    pca = PCA(svd_solver="full")
    fa = FactorAnalysis()

    pca_scores, fa_scores = [], []
    for n in n_components:
        pca.n_components = n
        fa.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))

    return pca_scores, fa_scores

def shrunk_cov_score(X):
    shrinkages = np.logspace(-2, 0, 30)
    cv = GridSearchCV(ShrunkCovariance(), {"shrinkage": shrinkages})
    return np.mean(cross_val_score(cv.fit(X).best_estimator_, X))

def lw_score(X):
    return np.mean(cross_val_score(LedoitWolf(), X))

for X, title in [(X_homo, "Homoscedastic Noise"), (X_hetero, "Heteroscedastic Noise")]:
    pca_scores, fa_scores = compute_scores(X)
    n_components_pca = n_components[np.argmax(pca_scores)]
    n_components_fa = n_components[np.argmax(fa_scores)]

    pca = PCA(svd_solver="full", n_components="mle")
    pca.fit(X)
    n_components_pca_mle = pca.n_components_

    print("best n_components by PCA CV = %d" % n_components_pca)
    print("best n_components by FactorAnalysis CV = %d" % n_components_fa)
    print("best n_components by PCA MLE = %d" % n_components_pca_mle)

    plt.figure()
    plt.plot(n_components, pca_scores, "b", label="PCA scores")
    plt.plot(n_components, fa_scores, "r", label="FA scores")
    plt.axvline(rank, color="g", label="TRUTH: %d" % rank, linestyle="-")
    plt.axvline(
        n_components_pca,
        color="b",
        label="PCA CV: %d" % n_components_pca,
        linestyle="--"
    )
    plt.axvline(
        n_components_fa,
        color="r",
        label="FactorAnalysis CV: %d" % n_components_fa,
        linestyle="--"
    )
    plt.axvline(
        n_components_pca_mle,
        color="k",
        label="PCA MLE: %d" % n_components_pca_mle,
        linestyle="--"
    )

    ## Compare with other covariance estimators
    plt.axhline(
        shrunk_cov_score(X),
        color="violet",
        label="Shrunk Covariance MLE",
        linestyle="-."
    )
    plt.axhline(
        lw_score(X),
        color="orange",
        label="LedoitWolf MLE",
        linestyle="-."
    )

    plt.xlabel("number of components")
    plt.ylabel("CV scores")
    plt.legend(loc="lower right")
    plt.title(title)

plt.show()
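The calls to cross_val_score above pass no scoring argument, so each estimator's own score method is used. For PCA and FactorAnalysis, score returns the average log-likelihood of the samples, which is why higher values on the plot mean a better fit to held-out data. The sketch below illustrates this on hypothetical toy data (X_demo is an assumption made for the example, not part of the lab).

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10)  # hypothetical toy data, for illustration only

# score() is the mean of the per-sample log-likelihoods.
pca = PCA(svd_solver="full", n_components=3).fit(X_demo)
pca_per_sample = pca.score_samples(X_demo)
pca_avg = pca.score(X_demo)

fa = FactorAnalysis(n_components=3).fit(X_demo)
fa_per_sample = fa.score_samples(X_demo)
fa_avg = fa.score(X_demo)

# With no `scoring` argument, cross_val_score calls estimator.score
# on each held-out fold (5 folds by default).
cv_scores = cross_val_score(PCA(svd_solver="full", n_components=3), X_demo)

print("PCA average log-likelihood:", pca_avg)
print("FA average log-likelihood:", fa_avg)
print("PCA held-out log-likelihood per fold:", cv_scores)
```

Because both models and the covariance estimators are scored on the same log-likelihood scale, their cross-validated scores can be drawn on a single axis.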

Summary

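In this lab, we examined how effective Probabilistic PCA and Factor Analysis are at model selection and covariance estimation. We created a simulated low-rank dataset with homoscedastic and heteroscedastic noise, compared the models with cross-validation, and compared the model likelihoods with the likelihoods obtained from shrinkage covariance estimators. The results show that with homoscedastic noise, both PCA and Factor Analysis recover the dimension of the low-rank subspace. With heteroscedastic noise, however, PCA fails and overestimates the rank. Under appropriate circumstances, held-out data is more likely under low-rank models than under shrinkage models.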
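One way to see why the heteroscedastic case favors Factor Analysis is to inspect the fitted noise estimates. The sketch below is illustrative only: it regenerates heteroscedastic data with the same recipe as the lab (the random draws differ from the lab's X_hetero because no homoscedastic noise is drawn first), fits both models with the true rank, and compares their noise_variance_ attributes.

```python
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis

n_samples, n_features, rank = 500, 25, 5
sigma = 1.0
rng = np.random.RandomState(42)
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

# Per-feature noise standard deviations, as in the lab.
sigmas = sigma * rng.rand(n_features) + sigma / 2.0
X_hetero = X + rng.randn(n_samples, n_features) * sigmas

fa = FactorAnalysis(n_components=rank).fit(X_hetero)
pca = PCA(svd_solver="full", n_components=rank).fit(X_hetero)

# FactorAnalysis estimates one noise variance per feature ...
print("FA noise variances:", fa.noise_variance_.shape)
# ... whereas PCA estimates a single shared noise variance.
print("PCA noise variance:", pca.noise_variance_)

# How closely the FA estimates track the true per-feature variances.
corr = np.corrcoef(fa.noise_variance_, sigmas ** 2)[0, 1]
print("correlation between FA estimates and true variances:", corr)
```

A single shared variance cannot describe noise that differs across features, so PCA tends to spend extra components on the noisiest features, which is consistent with the overestimated rank reported above.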
