Comparing Random Forests and Histogram-Based Gradient Boosting for Regression

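Introduction

In this lab, we compare two popular ensemble models, random forest (RF) and histogram-based gradient boosting (HGBT), on a regression dataset in terms of score and computation time. We vary the parameter that controls the number of trees in each estimator and plot the results to visualize the trade-off between elapsed computation time and mean test score.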

Load the Dataset

We load the California housing dataset with scikit-learn's fetch_california_housing function. The dataset consists of 20,640 samples and 8 features.

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

print(f"The dataset consists of {n_samples} samples and {n_features} features")

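Define Models and Parameter Grids

We define two models, a random forest and a histogram-based gradient boosting model, together with their parameter grids, using scikit-learn's RandomForestRegressor, HistGradientBoostingRegressor, and GridSearchCV classes. We also set the number of physical cores of the host machine to use for parallel processing. The grid search is then run for each model with 4-fold cross-validation, and each model's cv_results_ is stored as a DataFrame.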

import joblib
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

N_CORES = joblib.cpu_count(only_physical_cores=True)

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}

param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}

cv = KFold(n_splits=4, shuffle=True, random_state=0)

results = []

for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)

    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)

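Compute Scores and Computation Times

We use the cv_results_ attribute of each GridSearchCV object, which holds the mean fit time and mean score time for every hyperparameter combination. We then plot the results with plotly.express.scatter and plotly.express.line to visualize the trade-off between computation time and mean test score.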

import plotly.express as px
import plotly.colors as colors
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = list(param_grids[model_name].keys())[0]
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=param_name,
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=param_name,
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)
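
The y-axis above is labelled as an R2 score even though no scoring argument is passed to GridSearchCV. When scoring is left unset, GridSearchCV falls back to the estimator's own score method, which for scikit-learn regressors is the coefficient of determination. A minimal sketch that checks this equivalence, assuming synthetic data and illustrative variable names that are not part of the lab:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the lab itself uses California housing).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=10, min_samples_leaf=5, random_state=0)
forest.fit(X_train, y_train)

# The regressor's default score and an explicit R2 computation should agree.
default_score = forest.score(X_test, y_test)
explicit_r2 = r2_score(y_test, forest.predict(X_test))
print(np.isclose(default_score, explicit_r2))
```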

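Interpret the Results

We can observe that both the HGBT and RF models improve as the number of trees in the ensemble increases. However, the scores reach a plateau, beyond which adding new trees only makes fitting and scoring slower. The RF models reach this plateau earlier and never match the test score of the largest HGBT model. The HGBT models uniformly dominate the RF models in the "test score vs. training speed" trade-off, and the "test score vs. prediction speed" trade-off can also be more favorable to HGBT. HGBT almost always offers a better speed-accuracy trade-off than RF, whether with the default hyperparameters or when the cost of hyperparameter tuning is included.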
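
One way to act on the plateau described above is to pick the smallest ensemble whose mean test score lies within a chosen tolerance of the best score. The sketch below is one possible heuristic, not something the lab prescribes: the helper smallest_near_best, its tolerance parameter, and the toy table are all hypothetical, and the scores in the table are invented for illustration rather than measured results.

```python
import pandas as pd


def smallest_near_best(cv_results, param_name, tolerance=0.01):
    """Return the smallest value of `param_name` whose mean test score is
    within `tolerance` of the best mean test score in `cv_results`."""
    best = cv_results["mean_test_score"].max()
    near_best = cv_results[cv_results["mean_test_score"] >= best - tolerance]
    return int(near_best[param_name].min())


# Invented scores for illustration only; these are not results from the lab.
toy = pd.DataFrame(
    {
        "n_estimators": [10, 20, 50, 100],
        "mean_test_score": [0.70, 0.75, 0.78, 0.785],
    }
)

print(smallest_near_best(toy, "n_estimators"))  # prints 50
```

With the lab's own tables, the same helper could be applied to each cv_results DataFrame after the parameter column has been copied to a plain column name, as the plotting loop does.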

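Summary

In this lab, we applied two popular ensemble models, random forest and histogram-based gradient boosting, to a regression dataset and compared them in terms of score and computation time. We varied the parameter that controls the number of trees in each estimator and plotted the results to visualize the trade-off between elapsed computation time and mean test score. We observed that the HGBT models uniformly dominate the RF models in the "test score vs. training speed" trade-off, and that the "test score vs. prediction speed" trade-off can also be more favorable to HGBT. HGBT almost always offers a better speed-accuracy trade-off than RF, whether with the default hyperparameters or when the cost of hyperparameter tuning is included.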
