Text Vectorization | Machine Learning | Data Preprocessing

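Introduction

In this lab, we explore text vectorization: representing non-numerical input data (such as dictionaries or text documents) as vectors of real numbers. We compare two methods, FeatureHasher and DictVectorizer, for vectorizing text documents that have been preprocessed (tokenized) with a custom Python function.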

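Loading the Data

We load data from the 20 newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. To keep things simple and reduce the computational cost, we select a subset of 7 topics and use the training set only.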

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "comp.graphics",
    "comp.sys.ibm.pc.hardware",
    "misc.forsale",
    "rec.autos",
    "sci.space",
    "talk.religion.misc",
]

print("Loading 20 newsgroups training data")
raw_data, _ = fetch_20newsgroups(subset="train", categories=categories, return_X_y=True)
data_size_mb = sum(len(s.encode("utf-8")) for s in raw_data) / 1e6
print(f"{len(raw_data)} documents - {data_size_mb:.3f}MB")

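Defining the Preprocessing Functions

A token may be a word, part of a word, or anything that sits between spaces or symbols in a string. Here we define a function that extracts tokens using a simple regular expression (regex) matching Unicode word characters. This includes most characters that can be part of a word in any language, as well as digits and the underscore.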

import re

def tokenize(doc):
    """Extract tokens from doc.

    This uses a simple regex that matches word characters to break strings
    into tokens. For a more principled approach, see CountVectorizer or
    TfidfVectorizer.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
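
As a quick check of the regex behavior described above, the following sketch runs the same tokenizer on a made-up sentence (the sample string is an assumption for illustration, not taken from the dataset). Note that the apostrophe splits a contraction into two tokens, while the underscore and non-Latin letters stay inside a token.

```python
import re


def tokenize(doc):
    """Same tokenizer as above: lowercased runs of Unicode word characters."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))


# Hypothetical sample text, used only to illustrate the splitting rules.
sample = "It's a 3D_model, Привет!"
tokens = list(tokenize(sample))
print(tokens)
# ['it', 's', 'a', '3d_model', 'привет']
```

Because tokenize returns a generator, it has to be materialized (for example with list) before it can be printed or iterated more than once.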

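We define an additional function that counts the number of occurrences (the frequency) of each token in a given document. It returns a frequency dictionary for the vectorizers to consume.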

from collections import defaultdict

def token_freqs(doc):
    """Extract a dict mapping tokens from doc to their occurrences."""

    freq = defaultdict(int)
    for tok in tokenize(doc):
        freq[tok] += 1
    return freq
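
The same frequency dictionary can also be built with collections.Counter from the standard library. The sketch below is an equivalent alternative rather than what the lab code uses; the name token_freqs_counter and the sample sentence are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(doc):
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    """Version from the lab: count tokens with a defaultdict."""
    freq = defaultdict(int)
    for tok in tokenize(doc):
        freq[tok] += 1
    return freq


def token_freqs_counter(doc):
    """Hypothetical alternative: Counter consumes the token generator directly."""
    return Counter(tokenize(doc))


doc = "the cat saw the other cat"
freqs = dict(token_freqs(doc))
print(freqs)
# {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}
print(freqs == dict(token_freqs_counter(doc)))
# True
```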

DictVectorizer

First, we benchmark DictVectorizer, a method that takes dictionaries as input.

from sklearn.feature_extraction import DictVectorizer
from time import time

t0 = time()
vectorizer = DictVectorizer()
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print(f"done in {duration:.3f} s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")
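
To illustrate the idea behind a vocabulary-based vectorizer, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the helper names fit_vocabulary and transform are hypothetical, and the rows are dense lists, whereas DictVectorizer returns a sparse matrix by default.

```python
def fit_vocabulary(freq_dicts):
    """Assign a column index to every distinct token (sorted for determinism)."""
    names = sorted({tok for freqs in freq_dicts for tok in freqs})
    return {tok: i for i, tok in enumerate(names)}


def transform(freq_dicts, vocabulary):
    """Turn each frequency dict into a dense row of counts."""
    rows = []
    for freqs in freq_dicts:
        row = [0] * len(vocabulary)
        for tok, count in freqs.items():
            if tok in vocabulary:  # tokens not seen during fitting are ignored
                row[vocabulary[tok]] = count
        rows.append(row)
    return rows


docs = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
print(vocabulary)  # {'bar': 0, 'baz': 1, 'foo': 2}
print(matrix)      # [[2, 0, 1], [0, 1, 3]]
```

The key point is that the full token-to-column mapping must be held in memory, which is what makes the transformation invertible but also costs time and space on large vocabularies.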

FeatureHasher

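Next, we benchmark FeatureHasher. It builds a vector of predefined length by applying a hash function to the features (for example, tokens), using the hash values directly as feature indices, and updating the resulting vector at those indices.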
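
To make the hashing trick described above concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: zlib.crc32 stands in for the hash function, the index is the hash value reduced modulo the vector length, the sign alternation that FeatureHasher can apply is omitted, and hash_vectorize is a hypothetical name.

```python
import zlib


def hash_vectorize(freqs, n_features=8):
    """Map a token-frequency dict to a fixed-length count vector."""
    vec = [0] * n_features
    for token, count in freqs.items():
        # Assumption: crc32 as a stand-in hash; any stable hash would do.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vec[index] += count  # tokens that collide share the same slot
    return vec


freqs = {"space": 2, "rocket": 1, "orbit": 3}
vec = hash_vectorize(freqs)
print(len(vec), sum(vec))
# 8 6
```

No vocabulary is stored, so the vector length is fixed up front and no fitting pass is needed. The trade-off is that two tokens may land in the same slot, and the original tokens cannot be recovered from the indices.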

from sklearn.feature_extraction import FeatureHasher
import numpy as np

t0 = time()
hasher = FeatureHasher(n_features=2**18)
X = hasher.transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print(f"done in {duration:.3f} s")
print(f"Found {len(np.unique(X.nonzero()[1]))} unique tokens")
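
The results plot in this lab lists a "FeatureHasher on raw tokens" entry. Assuming that label refers to FeatureHasher's input_type="string" mode, where each sample is an iterable of tokens rather than a frequency dictionary, a small sketch could look like this (the two-document corpus is made up for illustration):

```python
import re

from sklearn.feature_extraction import FeatureHasher


def tokenize(doc):
    return (tok.lower() for tok in re.findall(r"\w+", doc))


# Hypothetical toy corpus standing in for raw_data.
docs = [
    "The rocket reached orbit.",
    "Graphics cards for sale, graphics only.",
]

# With input_type="string", each occurrence of a token contributes a value of 1,
# so no separate frequency-counting step is needed.
hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform(tokenize(d) for d in docs)
print(X.shape)
# (2, 262144)
```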

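Comparison with Special-Purpose Text Vectorizers

We compare the previous methods with CountVectorizer, HashingVectorizer, and TfidfVectorizer, which accept raw documents and tokenize them internally.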

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

t0 = time()
vectorizer = CountVectorizer()
vectorizer.fit_transform(raw_data)
duration = time() - t0
print(f"done in {duration:.3f} s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")

t0 = time()
vectorizer = HashingVectorizer(n_features=2**18)
vectorizer.fit_transform(raw_data)
duration = time() - t0
print(f"done in {duration:.3f} s")

t0 = time()
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(raw_data)
duration = time() - t0
print(f"done in {duration:.3f} s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")

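Plotting the Results

We plot the speed of the vectorizers above. The speeds in the snippet are hard-coded values in MB/s; when reproducing the benchmark, substitute your own measurements (data_size_mb divided by duration).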

import matplotlib.pyplot as plt

dict_count_vectorizers = {
    "vectorizer": [
        "DictVectorizer\non freq dicts",
        "FeatureHasher\non freq dicts",
        "FeatureHasher\non raw tokens",
        "CountVectorizer",
        "HashingVectorizer",
        "TfidfVectorizer"
    ],
    "speed": [
        2.4, 4.4, 7.2, 5.1, 11.7, 2.9
    ]
}

fig, ax = plt.subplots(figsize=(12, 6))

y_pos = np.arange(len(dict_count_vectorizers["vectorizer"]))
ax.barh(y_pos, dict_count_vectorizers["speed"], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels(dict_count_vectorizers["vectorizer"])
ax.invert_yaxis()
_ = ax.set_xlabel("speed (MB/s)")
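
The benchmark snippets above print durations in seconds, while the plot uses MB/s. A sketch of how such a throughput figure can be derived is shown below; the helper name throughput_mb_per_s and the toy workload are assumptions for illustration, not part of the lab code.

```python
from time import perf_counter


def throughput_mb_per_s(func, data_size_mb):
    """Run func once and return the processed data size divided by the elapsed time."""
    t0 = perf_counter()
    func()
    duration = perf_counter() - t0
    return data_size_mb / duration


# Hypothetical workload: split 10,000 copies of a 20-byte string on whitespace.
docs = ["some text to process"] * 10_000
size_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6
speed = throughput_mb_per_s(lambda: [d.split() for d in docs], size_mb)
print(f"{size_mb:.1f} MB processed at {speed:.1f} MB/s")
```

In the lab itself, func would be a call such as the fit_transform of a vectorizer on raw_data, and data_size_mb would be the value computed when the data was loaded.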

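Summary

In this lab, we explored text vectorization by comparing two general-purpose methods, FeatureHasher and DictVectorizer, with three special-purpose text vectorizers: CountVectorizer, HashingVectorizer, and TfidfVectorizer. We benchmarked the vectorizers and plotted their throughput. In the plotted figures, HashingVectorizer (11.7 MB/s) is faster than CountVectorizer (5.1 MB/s), at the expense of the invertibility of the transformation because of hash collisions. CountVectorizer and HashingVectorizer are also faster than their DictVectorizer and FeatureHasher counterparts on manually tokenized documents, because the internal tokenization step of the former compiles a regular expression once and then reuses it for all documents.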

Comparison of FeatureHasher and DictVectorizer