Non-negative Matrix Factorization (NMF) | Latent Dirichlet Allocation (LDA) | Topic Modeling

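Introduction

In this lab, we apply Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) to a corpus of documents to extract additive models of its topic structure. The output is a plot of topics, each shown as a bar chart of its top few words ranked by weight.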

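VM Tips

Once the VM has started, click the top-left corner to switch to the Notebook tab and open the Jupyter Notebook for this exercise.

Jupyter Notebook may take a few seconds to finish loading. Because of limitations in Jupyter Notebook, validation of your work cannot be automated.

If you run into problems while learning, ask Labby. Leave feedback after the session and we will resolve the issue promptly.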

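Load the Dataset

Load the 20 newsgroups dataset and vectorize it. A few heuristics filter out useless terms early: headers, footers, and quoted replies are stripped from the posts, and common English words, words that occur in only one document, and words that occur in more than 95% of the documents are removed. The stripping happens while loading; the word-level filters are applied by the vectorizers in the next step.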

from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000

print("Loading dataset...")
data, _ = fetch_20newsgroups(
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True,
)
data_samples = data[:n_samples]
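
The document-frequency heuristic described above can be sketched in plain Python. This is only an illustration of the rule that `min_df=2` and `max_df=0.95` express, not scikit-learn's implementation; the toy corpus and the helper name `prune_vocabulary` are made up for this example.

```python
from collections import Counter

# Toy corpus; these documents are invented for illustration only.
docs = [
    "the engine of the car",
    "the car needs new engine",
    "the bike has flat tire",
    "the moon orbits the earth",
]


def prune_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency is >= min_df and <= max_df * n_docs."""
    n_docs = len(docs)
    # Count each term once per document (document frequency, not raw count).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


kept = prune_vocabulary(docs)
print(kept)  # ['car', 'engine']
```

Here "the" is dropped because it occurs in every document, and words such as "moon" are dropped because they occur in only one.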

Feature Extraction

Extract features from the dataset: tf-idf features for NMF and raw term counts for LDA.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
tfidf = tfidf_vectorizer.fit_transform(data_samples)

# Use raw term count features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
tf = tf_vectorizer.fit_transform(data_samples)
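
To make the difference between the two feature types concrete, here is a minimal pure-Python sketch of raw counts versus tf-idf weights. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the documented defaults of `TfidfVectorizer` as far as we know; the tokenized toy documents and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents.
docs = [["car", "engine", "car"], ["car", "tire"], ["moon", "earth"]]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_vector(doc):
    """Raw term counts, the kind of feature passed to LDA."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf_vector(doc):
    """Counts reweighted by smoothed idf, then scaled to unit L2 norm."""
    idf = [math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab]
    weighted = [count * w for count, w in zip(tf_vector(doc), idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


print(vocab)
print(tf_vector(docs[0]))
print([round(v, 3) for v in tfidf_vector(docs[1])])
```

In the second document, "car" and "tire" each occur once, but "tire" receives the larger tf-idf weight because it appears in fewer documents.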

Apply NMF

Apply NMF with two different objective functions: the Frobenius norm and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing.

import matplotlib.pyplot as plt
from sklearn.decomposition import NMF

n_components = 10
n_top_words = 20
init = "nndsvda"

# Fit the NMF model (Frobenius norm)
print(
    "Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=%d and n_features=%d..." % (n_samples, n_features)
)
nmf = NMF(
    n_components=n_components,
    random_state=1,
    init=init,
    beta_loss="frobenius",
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=1,
).fit(tfidf)

# Plot the top words of each topic in a fitted model
def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
    # Set the figure title once, after all topic panels are drawn.
    fig.suptitle(title, fontsize=40)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf, tfidf_feature_names, n_top_words, "Topics in NMF model (Frobenius norm)"
)
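
The slice `[: -n_top_words - 1 : -1]` used in `plot_top_words` is compact but easy to misread. The sketch below reproduces it on a plain Python list, with `sorted` standing in for NumPy's `argsort`; the vocabulary and weights are invented for illustration.

```python
# Hypothetical vocabulary and topic weights, invented for illustration.
feature_names = ["car", "earth", "engine", "moon", "tire"]
topic = [0.9, 0.0, 0.7, 0.1, 0.4]
n_top = 3

# Stand-in for numpy's argsort: indices ordered by ascending weight.
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Same slice as in plot_top_words: start at the last (largest) index and
# step backwards, stopping after n_top entries.
top_idx = ascending[: -n_top - 1 : -1]
top_words = [(feature_names[i], topic[i]) for i in top_idx]
print(top_words)  # [('car', 0.9), ('engine', 0.7), ('tire', 0.4)]
```

The result is the `n_top` highest-weighted words in descending order, which is why the plot then calls `invert_yaxis()` to put the heaviest word at the top of each bar chart.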

# Fit the NMF model with the generalized Kullback-Leibler divergence
print(
    "\n" * 2,
    "Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=%d and n_features=%d..."
    % (n_samples, n_features),
)
nmf = NMF(
    n_components=n_components,
    random_state=1,
    init=init,
    beta_loss="kullback-leibler",
    solver="mu",
    max_iter=1000,
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)

# Plot the top words of the NMF model fitted with the generalized Kullback-Leibler divergence
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf,
    tfidf_feature_names,
    n_top_words,
    "Topics in NMF model (generalized Kullback-Leibler divergence)",
)
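
For intuition about the two objectives, the following pure-Python sketch evaluates both reconstruction losses on a tiny hand-made factorization. It uses the usual definitions, 0.5 * ||X - WH||^2 for the Frobenius loss and sum(X * log(X / WH) - X + WH) for the generalized KL divergence, and leaves out the regularization terms controlled by `alpha_W`, `alpha_H`, and `l1_ratio`. The matrices are made up; this is not how scikit-learn computes the losses internally.

```python
import math


def matmul(W, H):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]


def frobenius_loss(X, W, H):
    """0.5 * squared Frobenius norm of the reconstruction error."""
    Y = matmul(W, H)
    return 0.5 * sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def generalized_kl_loss(X, W, H):
    """Sum of x * log(x / y) - x + y, with the convention 0 * log(0) = 0."""
    Y = matmul(W, H)
    total = 0.0
    for rx, ry in zip(X, Y):
        for x, y in zip(rx, ry):
            total += (x * math.log(x / y) if x > 0 else 0.0) - x + y
    return total


# Hypothetical data matrix and a rank-1 approximation of it.
X = [[1.0, 0.0], [0.0, 2.0]]
W = [[1.0], [1.0]]
H = [[0.5, 1.0]]

print(frobenius_loss(X, W, H))       # 1.25
print(generalized_kl_loss(X, W, H))  # 3 * ln(2), about 2.079
```

The two losses penalize the same reconstruction error differently, which is one reason the fitted topics can differ between the two NMF runs.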

# Fit the MiniBatchNMF model
from sklearn.decomposition import MiniBatchNMF

batch_size = 128

print(
    "\n" * 2,
    "Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf features, n_samples=%d and n_features=%d, batch_size=%d..."
    % (n_samples, n_features, batch_size),
)
# ... (remaining code omitted)

Apply LDA

Fit an LDA model on the tf features.

from time import time

from sklearn.decomposition import LatentDirichletAllocation

print(
    "\n" * 2,
    "Fitting the LDA model with tf features, n_samples=%d and n_features=%d..."
    % (n_samples, n_features),
)
lda = LatentDirichletAllocation(
    n_components=n_components,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

tf_feature_names = tf_vectorizer.get_feature_names_out()
plot_top_words(lda, tf_feature_names, n_top_words, "Topics in LDA model")
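
The rows of `lda.components_` plotted above behave like word pseudo-counts per topic rather than probabilities. If a proper topic-word distribution is wanted, each row can be divided by its sum. The sketch below shows that step on a made-up 2-topic, 3-word matrix; the values and the helper name `normalize_rows` are hypothetical.

```python
# Hypothetical pseudo-count matrix: 2 topics (rows) by 3 words (columns).
components = [
    [4.0, 1.0, 5.0],
    [1.0, 8.0, 1.0],
]


def normalize_rows(matrix):
    """Divide each row by its sum so that it becomes a probability distribution."""
    return [[value / sum(row) for value in row] for row in matrix]


topic_word = normalize_rows(components)
for row in topic_word:
    print([round(p, 2) for p in row])
```

Normalizing rescales each row by a positive constant, so the ranking of top words within a topic stays the same; only the bar lengths in the plot would change.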

Summary

In this lab, we learned how to apply Non-negative Matrix Factorization and Latent Dirichlet Allocation to a corpus of documents to extract additive models of its topic structure. We also learned how to plot each topic as a bar chart of its top few words ranked by weight.
