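Comparing Random Forests and Histogram Gradient Boosting for Regression

Introduction

In this lab, we compare two popular ensemble models, Random Forest (RF) and Histogram Gradient Boosting (HGBT), on a regression dataset in terms of score and computation time. For each estimator we vary the parameter that controls the number of trees (n_estimators for RF, max_iter for HGBT) and plot the results to visualize the trade-off between elapsed computation time and mean test score.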

Load the Dataset

We load the California housing dataset with scikit-learn's fetch_california_housing function. The dataset contains 20,640 samples and 8 features.

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

print(f"The dataset consists of {n_samples} samples and {n_features} features")

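Define the Models and Parameter Grids

We define the two models, Random Forest and Histogram Gradient Boosting, with scikit-learn's RandomForestRegressor and HistGradientBoostingRegressor classes, give each one a parameter grid, and search those grids with GridSearchCV. We also determine the number of physical cores on the host machine to use for parallel processing.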

import joblib
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Number of physical CPU cores on the host, passed to n_jobs below.
N_CORES = joblib.cpu_count(only_physical_cores=True)

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}

param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}

cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Run one grid search per model and keep each cv_results_ table for plotting.
results = []

for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)

    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)
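
The mean_* and std_* columns stored in cv_results_ are aggregates over the cross-validation folds. The sketch below reproduces that aggregation in plain Python for a single parameter setting. The per-fold numbers are made up for illustration and are not results from this lab, and the sketch assumes the standard deviation is the population form (the default behavior of numpy.std).

```python
from statistics import fmean, pstdev

# Hypothetical per-fold measurements for ONE parameter setting with 4 folds
# (illustrative values only, not measured in this lab).
fold_fit_times = [0.50, 0.54, 0.46, 0.50]    # seconds
fold_test_scores = [0.80, 0.82, 0.78, 0.80]  # R2 on each held-out fold


def summarize(values):
    """Return (mean, population standard deviation) over the folds."""
    return fmean(values), pstdev(values)


mean_fit_time, std_fit_time = summarize(fold_fit_times)
mean_test_score, std_test_score = summarize(fold_test_scores)

print(f"mean_fit_time={mean_fit_time:.3f} std_fit_time={std_fit_time:.3f}")
print(f"mean_test_score={mean_test_score:.3f} std_test_score={std_test_score:.3f}")
```

The standard deviations are what the plotting code later passes as error_x and error_y, so a wide error bar signals that the folds disagreed about time or score for that setting.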

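Compute Scores and Computation Times

We use the cv_results_ attribute of each GridSearchCV object, which records the mean fit time and mean score time for every hyperparameter combination. We then plot the results with plotly.express.scatter and plotly.express.line to visualize the trade-off between elapsed computation time and mean test score.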

import plotly.express as px
import plotly.colors as colors
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = list(param_grids[model_name].keys())[0]
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=param_name,
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=param_name,
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)
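
The y-axis of the figure is the R2 score, which is what GridSearchCV reports for regressors when no scoring argument is given (it falls back to the estimator's score method). As a reminder of what that number measures, here is a minimal plain-Python sketch of the definition R2 = 1 - SS_res / SS_tot. The function name r2 and the sample values are illustrative assumptions, not part of the lab.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot would be zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                 # every prediction is exact
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, baseline)
```

A score of 1 means the predictions match the targets exactly, and a score of 0 means the model does no better than always predicting the mean of the targets.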

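Interpret the Results

Both the HGBT and RF models improve as the number of trees in the ensemble increases. The scores eventually reach a plateau, however, beyond which adding trees only makes fitting and scoring slower. The RF model reaches this plateau earlier and never attains the test score of the largest HGBT model. HGBT models consistently dominate RF models in the "test score vs. training speed" trade-off, and they also compare favorably in the "test score vs. prediction speed" trade-off. Whether default hyperparameters are used or the cost of hyperparameter tuning is included, HGBT almost always offers a more favorable speed-accuracy trade-off than RF.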
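
One way to make the plateau concrete is to look for the smallest ensemble whose mean test score is within a small tolerance of the best score in the grid. The helper below is a hypothetical sketch: the name smallest_within_tolerance, the tolerance value, and the example numbers are assumptions and not part of the lab. With a cv_results table like the one built earlier, the inputs could be taken from the parameter column and the mean_test_score column.

```python
def smallest_within_tolerance(n_trees, scores, tol=0.005):
    """Return the smallest tree count whose score is within tol of the best."""
    best = max(scores)
    candidates = [n for n, s in zip(n_trees, scores) if best - s <= tol]
    return min(candidates)


# Made-up scores that rise and then flatten (illustration only).
n_trees = [10, 20, 50, 100, 300, 500]
scores = [0.70, 0.76, 0.81, 0.83, 0.84, 0.841]

plateau_start = smallest_within_tolerance(n_trees, scores)
print(plateau_start)
```

Choosing the tolerance is a judgment call: a looser value selects a smaller, faster ensemble at the cost of a slightly lower score.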

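Summary

In this lab, we compared two popular ensemble models, Random Forest (RF) and Histogram Gradient Boosting (HGBT), on a regression dataset in terms of score and computation time. For each estimator we varied the parameter that controls the number of trees and plotted the results to visualize the trade-off between elapsed computation time and mean test score. We observed that HGBT models consistently dominate RF models in the "test score vs. training speed" trade-off and also compare favorably in the "test score vs. prediction speed" trade-off. Whether default hyperparameters are used or the cost of hyperparameter tuning is included, HGBT almost always offers a more favorable speed-accuracy trade-off than RF.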
