Building Pipelines in Scikit-Learn

Introduction

This lab is a step-by-step guide to building pipelines in Scikit-Learn and displaying them as diagrams.

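Building a Simple Pipeline with a Preprocessing Step and a Classifier

In this step, we build a simple pipeline with one preprocessing step and a classifier, then display its visual representation.

First, import the required modules.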

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

Next, define the pipeline steps. Each step is a (name, estimator) tuple, and the steps run in the order listed.

steps = [
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression()),
]

Then create the pipeline.

pipe = Pipeline(steps)

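Finally, display the visual representation of the pipeline. With display set to "diagram", evaluating pipe as the last expression of a notebook cell renders it as an HTML diagram.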

set_config(display="diagram")
pipe
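
Outside a notebook, the diagram is not rendered automatically. The following is a minimal sketch of two alternatives, assuming a scikit-learn version that provides estimator_html_repr in sklearn.utils: printing the plain-text form, and generating the HTML source of the diagram as a string.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = Pipeline(
    [
        ("preprocessing", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# print() always gives the plain-text form, regardless of the display setting.
print(pipe)

# HTML source of the diagram that a notebook would render.
html = estimator_html_repr(pipe)
print(f"HTML diagram: {len(html)} characters")
```

The resulting string can be saved to an .html file and opened in a browser if a notebook is not available.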

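Building a Pipeline That Chains Multiple Preprocessing Steps and a Classifier

In this step, we build a pipeline with several preprocessing steps followed by a classifier, then display its visual representation.

First, import the required modules.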

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

Next, define the pipeline steps.

steps = [
    ("standard_scaler", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=3)),
    ("classifier", LogisticRegression(C=2.0)),
]

Then create the pipeline.

pipe = Pipeline(steps)

Finally, display the visual representation of the pipeline.

pipe
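
To see what the chained preprocessing steps actually do to the data, the sketch below runs only the transformers of this pipeline. The iris dataset is an assumption made for illustration; it is not part of the original lab. With 4 input features, a degree-3 polynomial expansion (including the bias column) produces 35 features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = Pipeline(
    [
        ("standard_scaler", StandardScaler()),
        ("polynomial", PolynomialFeatures(degree=3)),
        ("classifier", LogisticRegression(C=2.0)),
    ]
)

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Slicing a pipeline returns a sub-pipeline; here, every step but the classifier.
X_expanded = pipe[:-1].fit_transform(X)
print(X.shape, "->", X_expanded.shape)

# Steps can be looked up by name, and their parameters changed afterwards.
pipe.set_params(classifier__C=0.5)
print(pipe.named_steps["classifier"].C)
```

The step names chosen in the steps list are what make the named_steps lookup and the classifier__C parameter path work.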

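Building a Pipeline with Dimensionality Reduction and a Classifier

In this step, we build a pipeline with a dimensionality reduction step and a classifier, then display its visual representation.

First, import the required modules.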

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

Next, define the pipeline steps.

steps = [("reduce_dim", PCA(n_components=4)), ("classifier", SVC(kernel="linear"))]

Then create the pipeline.

pipe = Pipeline(steps)

Finally, display the visual representation of the pipeline.

pipe
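
As a rough illustration of how this pipeline behaves once fitted, the sketch below trains it on synthetic data. The dataset from make_classification and its size are assumptions chosen for the example, not part of the original lab.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, used here only for illustration: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline(
    [("reduce_dim", PCA(n_components=4)), ("classifier", SVC(kernel="linear"))]
)
pipe.fit(X, y)

# The classifier only ever sees the 4 principal components.
X_reduced = pipe[:-1].transform(X)
print(X_reduced.shape)

# Share of the total variance kept by each component.
print(pipe.named_steps["reduce_dim"].explained_variance_ratio_)
print(f"Training accuracy: {pipe.score(X, y):.2f}")
```

Note that n_components must not exceed the number of input features, so PCA(n_components=4) needs data with at least 4 columns.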

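Building a Complex Pipeline with a Chained Column Transformer

In this step, we build a more complex pipeline that combines a column transformer with a classifier, then display its visual representation.

First, import the required modules.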

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

Next, define the preprocessing steps for the numeric and categorical features.

numeric_preprocessor = Pipeline(
    steps=[
        ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

categorical_preprocessor = Pipeline(
    steps=[
        (
            "imputation_constant",
            SimpleImputer(fill_value="missing", strategy="constant"),
        ),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

Then create the column transformer, which applies each preprocessor to the columns listed for it.

preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_preprocessor, ["state", "gender"]),
        ("numerical", numeric_preprocessor, ["age", "weight"]),
    ]
)

Next, create the pipeline.

pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
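
Unlike Pipeline, make_pipeline does not take step names; it derives them from the lowercased class names. The short sketch below shows the generated names. StandardScaler stands in for the column transformer here purely to keep the example small, and demo_pipe is a name made up for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler stands in for the column transformer to keep the sketch short.
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```

These generated names are the ones to use in parameter paths, for example logisticregression__max_iter.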

Finally, display the visual representation of the pipeline.

pipe
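
The lab only displays this pipeline, so the following sketch shows one way it might be fitted. The DataFrame is hypothetical toy data invented for the example (only the column names come from the lab), and pandas is assumed to be installed. The preprocessors are written with make_pipeline for brevity, which changes the inner step names but not the behavior.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with a few missing values (np.nan).
X = pd.DataFrame(
    {
        "state": pd.Series(["CA", "NY", np.nan, "CA", "TX", "NY"], dtype=object),
        "gender": pd.Series(["F", "M", "F", np.nan, "M", "F"], dtype=object),
        "age": [25.0, 40.0, np.nan, 31.0, 58.0, 46.0],
        "weight": [60.0, 82.5, 71.0, np.nan, 90.0, 55.0],
    }
)
y = np.array([0, 1, 0, 1, 1, 0])

numeric_preprocessor = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_preprocessor, ["state", "gender"]),
        ("numerical", numeric_preprocessor, ["age", "weight"]),
    ]
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
pipe.fit(X, y)

# 4 state columns + 3 gender columns (each including "missing") + 2 numeric columns.
X_transformed = pipe[:-1].transform(X)
print(X_transformed.shape)
print(pipe.predict(X))
```

Because the columns are selected by name, the input has to be a pandas DataFrame rather than a plain NumPy array.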

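Building a Grid Search over a Pipeline with a Classifier

In this step, we build a grid search over a pipeline that ends in a classifier, then display its visual representation.

First, import the required modules.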

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Next, define the preprocessing steps for the numeric and categorical features.

numeric_preprocessor = Pipeline(
    steps=[
        ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

categorical_preprocessor = Pipeline(
    steps=[
        (
            "imputation_constant",
            SimpleImputer(fill_value="missing", strategy="constant"),
        ),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

Then create the column transformer.

preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_preprocessor, ["state", "gender"]),
        ("numerical", numeric_preprocessor, ["age", "weight"]),
    ]
)

Next, create the pipeline.

pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)

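Then define the parameter grid for the grid search. Each key has the form <step name>__<parameter name>, so classifier__n_estimators sets n_estimators on the RandomForestClassifier step.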

param_grid = {
    "classifier__n_estimators": [200, 500],
    # "auto" (an alias of "sqrt" for classifiers) was removed in scikit-learn 1.3.
    "classifier__max_features": ["sqrt", "log2"],
    "classifier__max_depth": [4, 5, 6, 7, 8],
    "classifier__criterion": ["gini", "entropy"],
}

Next, create the grid search.

grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=1)

Finally, display the visual representation of the grid search.

grid_search
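
The grid search above is only constructed and displayed, never fitted. As a minimal sketch of what running one looks like, the example below fits a much smaller grid on synthetic data so that it finishes quickly. The dataset, the StandardScaler step, the reduced grid, and cv=3 are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(random_state=0)),
    ]
)

# A deliberately small grid: 2 x 2 = 4 candidates, each scored with 3-fold CV.
param_grid = {
    "classifier__n_estimators": [10, 20],
    "classifier__max_depth": [2, 4],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(f"Best cross-validated accuracy: {grid_search.best_score_:.2f}")

# After fitting, the search object predicts with the refitted best pipeline.
print(grid_search.predict(X[:5]))
```

The number of model fits grows with the product of the value lists times the number of folds, so larger grids such as the one in this step can take considerably longer to run.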

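Summary

This lab walked through building and displaying pipelines in Scikit-Learn step by step. It covered a simple pipeline with a preprocessing step and a classifier, a pipeline chaining multiple preprocessing steps and a classifier, a pipeline with dimensionality reduction and a classifier, a complex pipeline combining a column transformer with a classifier, and a grid search over a pipeline with a classifier.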