Gradient Boosting Estimators | Handling Categorical Features

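Introduction

In this lab, we use the Ames housing dataset to compare different ways of handling categorical features in gradient boosting estimators. The dataset contains both numerical and categorical features, and the target is the sale price of each house. We will compare the performance of the following four pipelines:

Dropping the categorical features
One-hot encoding the categorical features
Treating the categorical features as ordinal values
Using the native categorical support of the gradient boosting estimator

We will evaluate the fit time and predictive performance of each pipeline using cross-validation.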

Load the dataset

We load the Ames housing dataset with scikit-learn's fetch_openml function and select a subset of the features to make the example run faster. We also convert the categorical features to the 'category' dtype.

from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True, parser="pandas")

# Select only a subset of the features of X to make the example run faster
categorical_columns_subset = [
    "BldgType",
    "GarageFinish",
    "LotConfig",
    "Functional",
    "MasVnrType",
    "HouseStyle",
    "FireplaceQu",
    "ExterCond",
    "ExterQual",
    "PoolQC",
]

numerical_columns_subset = [
    "3SsnPorch",
    "Fireplaces",
    "BsmtHalfBath",
    "HalfBath",
    "GarageCars",
    "TotRmsAbvGrd",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "GrLivArea",
    "ScreenPorch",
]

X = X[categorical_columns_subset + numerical_columns_subset]
X[categorical_columns_subset] = X[categorical_columns_subset].astype("category")
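To check the effect of the dtype conversion, it can help to list the categorical columns and count their distinct categories. The following is a minimal sketch on a made-up toy frame (the values and the names toy, categorical_cols, and n_categories are illustrative assumptions, not part of the lab); the same calls should apply to X after the conversion above.

```python
import pandas as pd

# Toy stand-in for the Ames data: values are made up for illustration.
toy = pd.DataFrame(
    {
        "BldgType": ["1Fam", "Duplex", "1Fam", "Twnhs"],
        "GrLivArea": [1710, 1262, 1786, 1717],
    }
)

# Same conversion as in the lab: string column -> pandas 'category' dtype.
toy["BldgType"] = toy["BldgType"].astype("category")

# List the columns that now have the 'category' dtype.
categorical_cols = toy.select_dtypes(include="category").columns.tolist()

# Count the distinct categories in each categorical column.
n_categories = {col: toy[col].cat.categories.size for col in categorical_cols}
print(categorical_cols, n_categories)
```

Knowing the cardinality of each column is useful later, because one-hot encoding creates one output column per category.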

Baseline pipeline - dropping the categorical features

We create a pipeline that drops the categorical features and trains a HistGradientBoostingRegressor estimator.

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

dropper = make_column_transformer(
    ("drop", make_column_selector(dtype_include="category")), remainder="passthrough"
)
hist_dropped = make_pipeline(dropper, HistGradientBoostingRegressor(random_state=42))
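As a quick illustration of what the "drop" transformer does, the sketch below applies the same column transformer to a small made-up frame (the toy values are assumptions for illustration only). Only the non-categorical columns should remain in the output.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer

# Toy frame with made-up values; only the column names echo the Ames data.
toy = pd.DataFrame(
    {
        "BldgType": pd.Categorical(["1Fam", "Duplex", "Twnhs"]),
        "GrLivArea": [1710, 1262, 1717],
        "Fireplaces": [0, 1, 1],
    }
)

# Drop every 'category' column and pass the remaining columns through unchanged.
toy_dropper = make_column_transformer(
    ("drop", make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
numeric_only = toy_dropper.fit_transform(toy)
print(numeric_only.shape)
```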

One-hot encoding pipeline

We create a pipeline that one-hot encodes the categorical features and trains a HistGradientBoostingRegressor estimator.

from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)

hist_one_hot = make_pipeline(
    one_hot_encoder, HistGradientBoostingRegressor(random_state=42)
)

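Ordinal encoding pipeline

We create a pipeline that treats the categorical features as ordinal values and trains a HistGradientBoostingRegressor estimator. An OrdinalEncoder encodes the categorical features as integer codes. Categories that were not seen during fit are encoded as NaN, which the estimator treats as missing values.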

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
)

hist_ordinal = make_pipeline(
    ordinal_encoder, HistGradientBoostingRegressor(random_state=42)
)
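The following sketch, on made-up data, illustrates two properties of this encoding: the integer codes follow the sorted order of the category labels (not any meaningful quality order), and an unseen category becomes NaN. The frames train and new are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up quality ratings; "Po" appears only at transform time.
train = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Gd"]})
new = pd.DataFrame({"ExterQual": ["Gd", "Po"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(train)

# Categories are sorted lexicographically: Ex -> 0, Gd -> 1, TA -> 2.
print(encoder.categories_[0])

# "Gd" maps to 1.0; the unseen "Po" maps to NaN.
codes = encoder.transform(new)
print(codes)
```

Because the order of the codes is arbitrary with respect to the target, a tree may need several splits to isolate a given group of categories.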

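Native categorical support pipeline

We create a pipeline that relies on the native categorical support of HistGradientBoostingRegressor to handle the categorical features. The data is still preprocessed with an OrdinalEncoder; the categorical_features parameter then tells the estimator which columns to treat as categorical rather than ordered.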

hist_native = make_pipeline(
    ordinal_encoder,
    HistGradientBoostingRegressor(
        random_state=42,
        categorical_features=categorical_columns_subset,
    ),
).set_output(transform="pandas")

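Model comparison

We use cross-validation to compare the four pipelines, then plot their fit times and mean absolute percentage error (MAPE) scores. Because scikit-learn reports this metric as a negated score, the plotting code takes its absolute value.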

from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt

scoring = "neg_mean_absolute_percentage_error"
n_cv_folds = 3

dropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)
one_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)
ordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)
native_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)
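Before plotting, it can be convenient to reduce each cross-validation result to a few numbers. The helper below is a hypothetical sketch (summarize is not part of scikit-learn, and the fake_result values are made up); it flips the sign of the negated MAPE scores and averages the fit times.

```python
import numpy as np


def summarize(cv_result):
    """Return mean/std of the positive MAPE and the mean fit time."""
    # The scorer returns negated MAPE, so flip the sign to get the error.
    mape = -np.asarray(cv_result["test_score"])
    return {
        "mape_mean": float(mape.mean()),
        "mape_std": float(mape.std()),
        "fit_time_mean": float(np.mean(cv_result["fit_time"])),
    }


# Made-up numbers with the same keys that cross_validate returns.
fake_result = {
    "fit_time": np.array([0.2, 0.4, 0.3]),
    "test_score": np.array([-0.10, -0.12, -0.14]),
}
summary = summarize(fake_result)
print(summary)
```

The same helper could be called on dropped_result, one_hot_result, ordinal_result, and native_result.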

def plot_results(figure_title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))

    plot_info = [
        ("fit_time", "Fit Time (s)", ax1, None),
        ("test_score", "Mean Absolute Percentage Error", ax2, None),
    ]

    x, width = np.arange(4), 0.9
    for key, title, ax, y_limit in plot_info:
        items = [
            dropped_result[key],
            one_hot_result[key],
            ordinal_result[key],
            native_result[key],
        ]

        mape_cv_mean = [np.mean(np.abs(item)) for item in items]
        mape_cv_std = [np.std(item) for item in items]

        ax.bar(
            x=x,
            height=mape_cv_mean,
            width=width,
            yerr=mape_cv_std,
            color=["C0", "C1", "C2", "C3"],
        )
        ax.set(
            xlabel="Model",
            title=title,
            xticks=x,
            xticklabels=["Dropped", "One Hot", "Ordinal", "Native"],
            ylim=y_limit,
        )
    fig.suptitle(figure_title)

plot_results("Gradient Boosting on Ames Housing")

Limiting the number of splits

We rerun the same analysis with underfitting models, artificially limiting both the number of trees and the depth of each tree.

for pipe in (hist_dropped, hist_one_hot, hist_ordinal, hist_native):
    pipe.set_params(
        histgradientboostingregressor__max_depth=3,
        histgradientboostingregressor__max_iter=15,
    )

dropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)
one_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)
ordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)
native_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)

plot_results("Gradient Boosting on Ames Housing (few and small trees)")

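Summary

In this lab, we used the Ames housing dataset to compare four pipelines for handling categorical features in gradient boosting estimators. Dropping the categorical features led to poorer predictive performance, while the three models that used the categorical features had comparable error rates. One-hot encoding was the slowest approach; treating the categorical features as ordinal values and using the native categorical support of HistGradientBoostingRegressor had similar fit times. When the number of splits was limited, the native categorical support strategy performed best.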
