Non-negative Matrix Factorization | Latent Dirichlet Allocation | Topic Modeling

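Introduction

In this lab, we apply Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) to a corpus of documents to extract additive models of the corpus's topic structure. The output is a plot of topics, each shown as a bar chart of its top few words ranked by weight.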

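Load the Dataset

We load the 20 newsgroups dataset here and vectorize it in the next step. A few heuristics filter out useless terms early: the posts are stripped of headers, footers, and quoted replies, and the vectorizers drop common English words, words that occur in only one document, and words that occur in more than 95% of the documents.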

from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000

print("Loading dataset...")
data, _ = fetch_20newsgroups(
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True,
)
data_samples = data[:n_samples]

Extract Features

We extract two kinds of features from the dataset: term frequency-inverse document frequency (tf-idf) features for NMF, and raw term counts (tf) for LDA.
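To make the difference between the two feature types concrete, the sketch below turns a small, made-up count matrix into tf-idf weights with NumPy. It assumes the smoothed idf formula and the L2 row normalization that `TfidfVectorizer` applies by default; the matrix `counts` and all variable names are illustrative and are not part of the lab code.

```python
import numpy as np

# Hypothetical term-count matrix: 3 documents (rows) x 3 terms (columns).
# This is the kind of raw "tf" input that LDA receives.
counts = np.array(
    [
        [2.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [1.0, 0.0, 0.0],
    ]
)

n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default of TfidfVectorizer).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale every row to unit L2 norm.
tfidf_toy = counts * idf
tfidf_toy /= np.linalg.norm(tfidf_toy, axis=1, keepdims=True)

print(idf)
print(tfidf_toy)
```

The term in column 0 occurs in every document, so it receives the smallest idf and contributes relatively less to each row than its raw count would suggest.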

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
tfidf = tfidf_vectorizer.fit_transform(data_samples)

# Use raw term-count (tf) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
tf = tf_vectorizer.fit_transform(data_samples)
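The `min_df=2` and `max_df=0.95` arguments above prune the vocabulary by document frequency. The pure-Python sketch below imitates that pruning on a made-up four-document corpus. It is only an approximation of what the vectorizers do: the real classes also lowercase and tokenize with a regular expression, remove stop words, and cap the vocabulary at `max_features`. The corpus and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-tokenizable.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird flew away",
]

min_df = 2     # absolute count: keep terms found in at least 2 documents
max_df = 0.95  # proportion: drop terms found in more than 95% of documents

# Count, for every term, the number of documents that contain it.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

n_docs = len(docs)
vocabulary = sorted(
    term
    for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)

print(vocabulary)  # ['cat', 'dog', 'on', 'sat']
```

Here "the" is dropped because it appears in all four documents, and words such as "mat" or "bird" are dropped because they appear in only one.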

Apply Non-negative Matrix Factorization (NMF)

We apply NMF with two different objective functions: the Frobenius norm and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing (PLSI).
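As a rough illustration of the two objectives, the NumPy sketch below evaluates both on a tiny, made-up matrix `X` and a rank-1 approximation `W @ H`. The helper names `frobenius_loss` and `generalized_kl` are hypothetical, and regularization terms such as `alpha_W` and `alpha_H` are left out, so this shows only the data-fitting part of each objective.

```python
import numpy as np


def frobenius_loss(X, WH):
    """Half the squared Frobenius norm of the reconstruction error."""
    return 0.5 * np.sum((X - WH) ** 2)


def generalized_kl(X, WH, eps=1e-12):
    """Generalized KL divergence: sum(X * log(X / WH) - X + WH).

    Entries where X == 0 contribute nothing to the log term (0 * log 0 = 0).
    """
    WH = np.maximum(WH, eps)  # guard against division by zero
    mask = X > 0
    log_term = np.sum(X[mask] * np.log(X[mask] / WH[mask]))
    return log_term - X.sum() + WH.sum()


# Hypothetical non-negative data and a rank-1 factorization.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0], [2.0]])
H = np.array([[1.0, 2.0]])
WH = W @ H  # [[1, 2], [2, 4]], differs from X in one entry

frob = frobenius_loss(X, WH)
kl = generalized_kl(X, WH)
print(frob)
print(kl)
```

Both quantities are zero when `W @ H` reproduces `X` exactly and grow as the approximation gets worse, but they penalize errors differently, which is why the two fits can produce different topics.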

from sklearn.decomposition import NMF

n_components = 10
n_top_words = 20
init = "nndsvda"

# Fit the NMF model (Frobenius norm)
print(
    "Fitting the NMF model (Frobenius norm) with tf-idf features, "
    "n_samples=%d and n_features=%d..." % (n_samples, n_features)
)
nmf = NMF(
    n_components=n_components,
    random_state=1,
    init=init,
    beta_loss="frobenius",
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=1,
).fit(tfidf)

import matplotlib.pyplot as plt


# Plot the top words of each topic as a grid of horizontal bar charts
def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx + 1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf, tfidf_feature_names, n_top_words, "Topics in NMF model (Frobenius norm)"
)
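The slice `topic.argsort()[: -n_top_words - 1 : -1]` inside `plot_top_words` can be hard to read at first. The short sketch below shows what it returns for a made-up weight vector; the feature names and weights are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights over it.
feature_names = np.array(["space", "car", "god", "game", "windows"])
topic = np.array([0.1, 0.7, 0.3, 0.9, 0.05])
n_top_words = 2

# argsort() orders indices from smallest to largest weight; the slice walks
# that order backwards from the end and stops after n_top_words entries.
top_idx = topic.argsort()[: -n_top_words - 1 : -1]

print(top_idx)                  # [3 1]
print(feature_names[top_idx])   # ['game' 'car']

# An equivalent, more explicit spelling: reverse, then take the first n.
same_idx = topic.argsort()[::-1][:n_top_words]
```

Both spellings select the indices of the largest weights in descending order, which is the order the bar charts are drawn in.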

# Fit the NMF model with the generalized Kullback-Leibler divergence
print(
    "\n" * 2,
    "Fitting the NMF model (generalized Kullback-Leibler divergence) "
    "with tf-idf features, n_samples=%d and n_features=%d..."
    % (n_samples, n_features),
)
nmf = NMF(
    n_components=n_components,
    random_state=1,
    init=init,
    beta_loss="kullback-leibler",
    solver="mu",
    max_iter=1000,
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)

# Plot the top words of the NMF model fitted with the generalized Kullback-Leibler divergence
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf,
    tfidf_feature_names,
    n_top_words,
    "Topics in NMF model (generalized Kullback-Leibler divergence)",
)

# Fit the MiniBatchNMF model
from sklearn.decomposition import MiniBatchNMF

batch_size = 128

print(
    "\n" * 2,
    "Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf "
    "features, n_samples=%d and n_features=%d, batch_size=%d..."
    % (n_samples, n_features, batch_size),
)
mbnmf = MiniBatchNMF(
    n_components=n_components,
    random_state=1,
    batch_size=batch_size,
    init=init,
    beta_loss="frobenius",
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)

# Plot the top words of the MiniBatchNMF model fitted with the Frobenius norm
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    mbnmf,
    tfidf_feature_names,
    n_top_words,
    "Topics in MiniBatchNMF model (Frobenius norm)",
)

# Fit the MiniBatchNMF model with the generalized Kullback-Leibler divergence
print(
    "\n" * 2,
    "Fitting the MiniBatchNMF model (generalized Kullback-Leibler "
    "divergence) with tf-idf features, n_samples=%d and n_features=%d, "
    "batch_size=%d..."
    % (n_samples, n_features, batch_size),
)
mbnmf = MiniBatchNMF(
    n_components=n_components,
    random_state=1,
    batch_size=batch_size,
    init=init,
    beta_loss="kullback-leibler",
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)

# Plot the top words of the MiniBatchNMF model fitted with the generalized Kullback-Leibler divergence
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    mbnmf,
    tfidf_feature_names,
    n_top_words,
    "Topics in MiniBatchNMF model (generalized Kullback-Leibler divergence)",
)
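The MiniBatchNMF fits above update the model from chunks of `batch_size` rows rather than from the full matrix at once. The sketch below only illustrates how 2000 samples split into chunks of 128; it is an assumption-laden picture of mini-batching in general, and the actual batching and update schedule inside `MiniBatchNMF` may differ.

```python
# Same values as used in this lab.
n_samples = 2000
batch_size = 128

# Consecutive row slices; the last one is shorter when batch_size does not
# divide n_samples evenly.
batches = [
    slice(start, min(start + batch_size, n_samples))
    for start in range(0, n_samples, batch_size)
]
sizes = [b.stop - b.start for b in batches]

print(len(batches), sizes[0], sizes[-1])  # 16 128 80
```

With these settings one pass over the data corresponds to 16 mini-batch updates, the last of which sees only 80 documents.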

Apply Latent Dirichlet Allocation (LDA)

We fit an LDA model to the dataset using the raw term-count (tf) features.

from time import time

from sklearn.decomposition import LatentDirichletAllocation

print(
    "\n" * 2,
    "Fitting the LDA model with tf features, n_samples=%d and n_features=%d..."
    % (n_samples, n_features),
)
lda = LatentDirichletAllocation(
    n_components=n_components,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

tf_feature_names = tf_vectorizer.get_feature_names_out()
plot_top_words(lda, tf_feature_names, n_top_words, "Topics in LDA model")
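The bars plotted for LDA are the raw entries of `lda.components_`, which the scikit-learn documentation describes as pseudocounts rather than probabilities. If per-topic word distributions are wanted, each row can be normalized to sum to one. The sketch below does this on a small, made-up array standing in for `components_`; the numbers are invented and the variable names are illustrative.

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics x 4 words.
components = np.array(
    [
        [4.0, 1.0, 1.0, 2.0],
        [0.5, 3.0, 6.0, 0.5],
    ]
)

# Divide every row by its sum so that each topic becomes a distribution
# over the vocabulary.
topic_word = components / components.sum(axis=1, keepdims=True)

print(topic_word)
```

Normalizing rescales each row by a positive constant, so the ranking of the top words within a topic, and therefore the shape of each bar chart, is unchanged.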

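Summary

In this lab, we applied Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) to a corpus of documents to extract additive models of the corpus's topic structure. We also plotted the resulting topics, each shown as a bar chart of its top few words ranked by weight.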