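Mastering Dimensionality Reduction and Classification with Python

Introduction

In this lab, we build a dimensionality reduction and classification pipeline using principal component analysis (PCA) and logistic regression. Using the scikit-learn library, we apply PCA to the digits dataset for unsupervised dimensionality reduction, then classify the reduced data with a logistic regression model. GridSearchCV selects the PCA dimensionality by searching for the best combination of PCA truncation and classifier regularization.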

Import the Required Libraries

First, import the libraries needed to implement the pipeline.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

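Define the Pipeline Components

Define the pipeline components: a StandardScaler, PCA, and logistic regression, chained in that order. The solver tolerance is set to a large value to make the example run faster.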

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# Define a Standard Scaler to normalize inputs
scaler = StandardScaler()

# A large tolerance (tol=0.1) lets the solver stop early, which speeds up the example
logistic = LogisticRegression(max_iter=10000, tol=0.1)

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])
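
To see how the three steps behave as a single estimator, the following minimal sketch fits a pipeline of the same shape on a training split and scores it on a held-out split. The fixed n_components=30, the 75/25 split, random_state=0, and the demo_pipe name are illustrative assumptions, not values from this lab.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)
# Illustrative split; the lab itself relies on cross-validation instead.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=30)),  # assumed value, chosen only for the demo
        ("logistic", LogisticRegression(max_iter=10000, tol=0.1)),
    ]
)

# fit() runs fit_transform on the scaler and PCA, then fits the classifier.
demo_pipe.fit(X_train, y_train)

# score() pushes X_test through the fitted scaler and PCA before classifying.
test_accuracy = demo_pipe.score(X_test, y_test)

# Slicing off the last step gives the scaler + PCA part on its own.
reduced = demo_pipe[:-1].transform(X_test)

print(f"held-out accuracy: {test_accuracy:.3f}")
print("reduced shape:", reduced.shape)
```

Because the scaler and PCA are fitted only on the training split, the held-out score is not contaminated by statistics from the test data, which is the main reason to wrap preprocessing in a Pipeline.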

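Load the Dataset and Define the GridSearchCV Parameters

Load the digits dataset and define the parameter grid for GridSearchCV. The grid covers the PCA truncation (the number of components kept) and the regularization strength of the classifier.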

X_digits, y_digits = datasets.load_digits(return_X_y=True)

param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
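
The keys in the grid follow scikit-learn's step-name, double underscore, parameter-name convention, so "pca__n_components" targets the n_components parameter of the step named "pca". As a small sketch of what the grid expands to, the snippet below lists the C values and counts the candidate combinations; the c_values and candidates names are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}

# logspace(-4, 4, 4) yields 4 values evenly spaced in the exponent,
# from 10**-4 up to 10**4.
c_values = param_grid["logistic__C"]
print(c_values)

# ParameterGrid enumerates the Cartesian product that GridSearchCV will try.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 5 values of n_components x 4 values of C = 20
print(candidates[0])
```

With the default 5-fold cross-validation, these 20 candidates amount to 100 pipeline fits, plus one final refit of the best candidate on the full data.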

Run GridSearchCV

Run GridSearchCV to find the best combination of PCA truncation and classifier regularization.

search = GridSearchCV(pipe, param_grid, n_jobs=2)
search.fit(X_digits, y_digits)
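
One detail that is easy to miss is that GridSearchCV fits clones of the estimator it is given, so the objects passed in stay unfitted. The sketch below illustrates this with a deliberately tiny grid and 3 folds (both assumed here for speed, as are the demo_ names); it is not the search configuration used in this lab.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)

demo_pca = PCA()
demo_pipe = Pipeline(
    steps=[
        ("pca", demo_pca),
        ("logistic", LogisticRegression(max_iter=10000, tol=0.1)),
    ]
)

# A deliberately tiny grid (assumed values) so the demo runs quickly.
demo_search = GridSearchCV(demo_pipe, {"pca__n_components": [5, 15]}, cv=3)
demo_search.fit(X, y)

# The search fits clones, so the object we passed in has no fitted attributes.
original_is_fitted = hasattr(demo_pca, "components_")
print(original_is_fitted)  # False

# The refitted winner is stored on the search object instead.
fitted_pca = demo_search.best_estimator_.named_steps["pca"]
print(fitted_pca.components_.shape[0], demo_search.best_params_)
```

This is why any later use of the original pca object requires an explicit fit, while the tuned model is read from search.best_estimator_.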

Print the Best Parameters and Score

Print the best parameter combination found by GridSearchCV together with its mean cross-validated score.

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
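
The reported score is a cross-validation average rather than a score on separate test data. As a rough sketch of where the number comes from, the snippet below runs a small search over C only (grid values, 3 folds, and the demo_search name are assumptions made for speed) and recomputes best_score_ from the per-fold entries in cv_results_.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_digits(return_X_y=True)

# Tiny illustrative grid over C only (assumed values), with 3 folds for speed.
demo_search = GridSearchCV(
    LogisticRegression(max_iter=10000, tol=0.1), {"C": [0.01, 1.0]}, cv=3
)
demo_search.fit(X, y)

cv = demo_search.cv_results_
best = demo_search.best_index_

# best_score_ is the mean test score of the best candidate ...
mean_of_best = cv["mean_test_score"][best]

# ... and that mean is the average of the per-fold scores.
fold_scores = [cv[f"split{i}_test_score"][best] for i in range(3)]

print("best_score_:", demo_search.best_score_)
print("per-fold scores:", fold_scores)
print("mean of folds:", np.mean(fold_scores))
```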

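Plot the PCA Spectrum

Plot the PCA spectrum to visualize the explained variance ratio of each principal component. GridSearchCV fits clones of the pipeline, so the pca object defined earlier is still unfitted and is fitted here on the full dataset. The vertical line marks the number of components selected by the search.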

pca.fit(X_digits)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(
    np.arange(1, pca.n_components_ + 1), pca.explained_variance_ratio_, "+", linewidth=2
)
ax0.set_ylabel("PCA explained variance ratio")

ax0.axvline(
    search.best_estimator_.named_steps["pca"].n_components,
    linestyle=":",
    label="n_components chosen",
)
ax0.legend(prop=dict(size=12))
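
The spectrum can also be read numerically. The following sketch accumulates the explained variance ratio and finds the smallest number of components that reaches a target share of the variance; the 95% threshold and the full_pca, cumulative, and k names are assumptions chosen for illustration, not values used elsewhere in this lab.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X, _ = datasets.load_digits(return_X_y=True)
full_pca = PCA().fit(X)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the (assumed) 95% threshold.
# searchsorted returns the first index where cumulative >= threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold)) + 1

print(f"{k} components explain at least {threshold:.0%} of the variance")
```

A variance threshold like this is an unsupervised heuristic; the grid search instead picks n_components by classification accuracy, so the two choices need not agree.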

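Find the Best Classifier Results

For each number of principal components, find the best classifier result, that is, the row of the cross-validation results with the highest mean test score.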

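results = pd.DataFrame(search.cv_results_)
components_col = "param_pca__n_components"
# Keep, for each n_components, the row with the highest mean CV score.
# idxmax avoids groupby.apply, whose handling of the grouping column
# has changed across pandas versions.
best_clfs = results.loc[results.groupby(components_col)["mean_test_score"].idxmax()]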
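
The per-group selection is easier to follow on a toy table. The sketch below uses made-up scores (hypothetical, standing in for search.cv_results_) and picks the top row per group with idxmax, which yields the same rows as taking the single largest mean_test_score within each group.

```python
import pandas as pd

# Hypothetical scores standing in for search.cv_results_.
toy = pd.DataFrame(
    {
        "param_pca__n_components": [5, 5, 15, 15],
        "param_logistic__C": [0.01, 1.0, 0.01, 1.0],
        "mean_test_score": [0.70, 0.75, 0.90, 0.85],
    }
)

# Index label of the best-scoring row within each n_components group.
best_idx = toy.groupby("param_pca__n_components")["mean_test_score"].idxmax()

# Select those rows: one per distinct n_components value.
best_rows = toy.loc[best_idx]
print(best_rows)
```

Here row 1 (C=1.0) wins for 5 components and row 2 (C=0.01) wins for 15 components, so the result has one row per component count.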

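Plot the Classification Accuracy

Plot the best cross-validated classification accuracy for each number of principal components, with the standard deviation across folds as error bars.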

best_clfs.plot(
    x=components_col, y="mean_test_score", yerr="std_test_score", legend=False, ax=ax1
)
ax1.set_ylabel("Classification accuracy (validation)")
ax1.set_xlabel("Number of principal components")

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()

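Summary

In this lab, we learned how to build a pipeline for dimensionality reduction and classification using principal component analysis (PCA) and logistic regression. Using the scikit-learn library, we applied PCA to the handwritten digits dataset for unsupervised dimensionality reduction and then classified the result with a logistic regression model. We used GridSearchCV to select the PCA dimensionality by finding the best combination of PCA truncation and classifier regularization. Finally, we plotted the PCA spectrum and the classification accuracy for each number of principal components.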
