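Comparing Random Forests and Histogram Gradient Boosting for Regression

Introduction

In this lab, we compare the performance of two popular ensemble models, Random Forest (RF) and Histogram Gradient Boosting (HGBT), on a regression dataset in terms of score and computation time. We vary the parameter that controls the number of trees for each estimator and plot the results to visualize the trade-off between elapsed computing time and mean test score.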

Load the Dataset

We load the California housing dataset with scikit-learn's fetch_california_housing function. The dataset consists of 20,640 samples and 8 features.

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

print(f"The dataset consists of {n_samples} samples and {n_features} features")

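Define the Models and Parameter Grids

We define two models, a random forest and a histogram gradient boosting regressor, together with their parameter grids, using scikit-learn's RandomForestRegressor, HistGradientBoostingRegressor, and GridSearchCV classes. We also set N_CORES to the number of physical cores on the host machine and pass it to the random forest through n_jobs for parallel processing. The loop at the end runs a 4-fold cross-validated grid search for each model and stores its cv_results_ as a DataFrame.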

import joblib
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Physical core count of the host; passed to the random forest as n_jobs.
N_CORES = joblib.cpu_count(only_physical_cores=True)

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}

param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}

cv = KFold(n_splits=4, shuffle=True, random_state=0)

results = []

for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)

    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)
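
The loop above keeps the cv_results_ table of each grid search. As a minimal sketch of what that table contains, the snippet below runs a much smaller search on a synthetic dataset. The synthetic data, the two-value grid, and the names X_toy, y_toy, toy_search, and toy_results are assumptions made for this illustration and are not part of the lab.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic regression problem (an assumption for this sketch;
# the lab itself uses the California housing data).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

toy_search = GridSearchCV(
    estimator=RandomForestRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"n_estimators": [5, 10]},
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
    return_train_score=True,
).fit(X_toy, y_toy)

toy_results = pd.DataFrame(toy_search.cv_results_)

# One row per grid point. Times are in seconds and, like the scores,
# are averaged over the cross-validation folds.
columns_of_interest = [
    "param_n_estimators",
    "mean_fit_time",
    "std_fit_time",
    "mean_score_time",
    "std_score_time",
    "mean_test_score",
    "std_test_score",
]
print(toy_results[columns_of_interest])
```

Because no scoring argument is given, mean_test_score is the regressor's default score, which is R2. The timing values depend on the machine, so they will differ from run to run.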

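Compute Scores and Computation Times

The cv_results_ attribute of each GridSearchCV object already holds, for every hyperparameter setting, the mean fit time and mean score time across folds (mean_fit_time and mean_score_time) along with the mean test score. We plot these values with plotly.express.scatter and plotly.express.line to visualize the trade-off between elapsed computing time and mean test score.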

import plotly.express as px
import plotly.colors as colors
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
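    # Round numeric columns to 3 decimals, expose the tuned parameter
    # under its plain name, and tag each row with its model.
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = list(param_grids[model_name].keys())[0]
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name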

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=[param_name],
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=[param_name],
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)
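
The trade-off drawn above can also be checked numerically. The sketch below is one possible way to do it, not part of the original lab: it keeps only the configurations that no other configuration beats on both time and score (a Pareto front). The function name pareto_front and the (name, time, score) tuple layout are assumptions chosen for this illustration.

```python
def pareto_front(points):
    """Return the names of points that no other point dominates.

    Each point is a (name, time_seconds, score) tuple. A point is
    dominated when another point is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    front = []
    for name, time_a, score_a in points:
        dominated = any(
            time_b <= time_a
            and score_b >= score_a
            and (time_b < time_a or score_b > score_a)
            for _, time_b, score_b in points
        )
        if not dominated:
            front.append(name)
    return front


# Possible usage with the `results` list built earlier (not run here):
# points = [
#     (f'{r["model"]} / {row.params}', row.mean_fit_time, row.mean_test_score)
#     for r in results
#     for row in r["cv_results"].itertuples()
# ]
# print(pareto_front(points))
```

Replacing mean_fit_time with mean_score_time gives the same check for the prediction-time panel.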

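Interpreting the Results

We can observe that increasing the number of trees in the ensemble improves both the HGBT and RF models. However, the scores reach a plateau at some point, after which adding new trees only makes fitting and scoring slower. The RF models reach that plateau earlier and never attain the test score of the largest HGBT model. HGBT models uniformly dominate RF models in the "test score vs. training speed" trade-off, and the "test score vs. prediction speed" trade-off is also favorable to HGBT. HGBT almost always offers a more favorable speed-accuracy trade-off than RF, whether with the default hyperparameters or once the cost of hyperparameter tuning is included.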
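
To make the plateau concrete, the following sketch finds the smallest number of trees whose mean test score is within a small margin of the best one. The helper name smallest_setting_near_best and the default margin of 0.01 are assumptions chosen for illustration; the right margin depends on how much accuracy one is willing to trade for speed.

```python
def smallest_setting_near_best(param_values, mean_scores, tolerance=0.01):
    """Return the smallest parameter value whose mean score is within
    `tolerance` of the best mean score in the grid."""
    pairs = sorted(zip(param_values, mean_scores))
    if not pairs:
        raise ValueError("param_values and mean_scores must not be empty")
    best = max(score for _, score in pairs)
    # Pairs are sorted by parameter value, so the first hit is the smallest.
    for value, score in pairs:
        if score >= best - tolerance:
            return value
```

With the tables from the grid search, a call such as smallest_setting_near_best(cv_results["param_max_iter"], cv_results["mean_test_score"]) would report where the HGBT curve flattens, and the same call with "param_n_estimators" does so for RF.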

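Summary

In this lab, we compared two popular ensemble models, Random Forest and Histogram Gradient Boosting, on a regression dataset in terms of score and computation time. We varied the parameter that controls the number of trees for each estimator and plotted the results to visualize the trade-off between elapsed computing time and mean test score. HGBT models uniformly dominated RF models in the "test score vs. training speed" trade-off, and the "test score vs. prediction speed" trade-off was also favorable to HGBT. HGBT almost always offers a more favorable speed-accuracy trade-off than RF, whether with the default hyperparameters or once the cost of hyperparameter tuning is included.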

Figure: comparison plot of random forests and histogram gradient boosting