Gradient Boosting with Out-of-Bag Estimates

Introduction

This lab will guide you through implementing a Gradient Boosting Classifier with out-of-bag (OOB) estimates using the scikit-learn library in Python. OOB estimates are an alternative to cross-validation estimates and can be computed on-the-fly without the need for repeated model fitting. This lab will cover the following steps:

Generate data
Fit classifier with OOB estimates
Estimate best number of iterations using cross-validation
Compute best number of iterations for test data
Plot the results

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.

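
Generate Data

The step list starts with data generation, but the code for that step does not appear in this text. The following is a minimal sketch, assuming a synthetic binary classification problem built with make_classification and an even train/test split. The sample count, feature counts, and seeds are illustrative assumptions, not the lab's actual values. X and y serve as the training data and X_test and y_test as held-out test data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Assumed synthetic data set: 1000 samples, 10 features, binary labels.
X_all, y_all = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    random_state=13,
)

# Hold out half of the samples as test data (the split ratio is an assumption).
X, X_test, y, y_test = train_test_split(
    X_all, y_all, test_size=0.5, random_state=9
)

print(X.shape, X_test.shape)
```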

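Fit Classifier with OOB Estimates

Next, we create a Gradient Boosting classifier using the GradientBoostingClassifier class from the sklearn.ensemble module. We set the number of estimators to 100 and the learning rate to 0.1. OOB estimates are only available when subsample is less than 1.0, so each tree is fit on a random fraction of the training data (0.5 here). The constructor has no oob_score parameter; the OOB improvements are exposed through the fitted oob_improvement_ attribute instead.

from sklearn.ensemble import GradientBoostingClassifier

params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    # Must be below 1.0, otherwise no out-of-bag samples exist.
    "subsample": 0.5,
    "max_depth": 3,
    "min_samples_leaf": 1,
    "random_state": 1,
}

# X, y: the training data from the data generation step.
clf = GradientBoostingClassifier(**params)
clf.fit(X, y)

# Improvement in OOB loss at each boosting iteration (length n_estimators).
oob_improvement = clf.oob_improvement_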
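
To show how an OOB estimate of the best number of iterations can be read off a fitted model, here is a minimal sketch. It assumes its own small synthetic data set (X_demo, y_demo, and oob_clf are illustrative names, not part of the lab). The negative cumulative sum of oob_improvement_ gives the OOB loss relative to the initial model, and its minimum marks the OOB-estimated best iteration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative data; any binary classification data set would do.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# subsample < 1.0 is required for oob_improvement_ to be available.
oob_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, subsample=0.5, random_state=1
)
oob_clf.fit(X_demo, y_demo)

# oob_improvement_[i] is the drop in OOB loss at iteration i, so the
# negative cumulative sum tracks the OOB loss relative to the initial model.
oob_loss = -np.cumsum(oob_clf.oob_improvement_)

# argmin is zero-based; add 1 to express it as a number of iterations.
oob_best_iter = int(np.argmin(oob_loss)) + 1
print("OOB-estimated best number of iterations:", oob_best_iter)
```

The scikit-learn documentation describes the OOB estimate as typically pessimistic, so it is best treated as a cheap approximation to compare against the cross-validation and test results.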

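Estimate Best Number of Iterations Using Cross-Validation

We can estimate the best number of iterations using cross-validation. We use 5-fold cross-validation and compute the mean log loss for each number of iterations. The neg_log_loss scorer returns the negated log loss (higher is better), so we flip its sign to get the loss itself. Note that this loop refits the model for every ensemble size, 500 fits in total, so it is slow for large ensembles.

from sklearn.base import clone
from sklearn.model_selection import cross_val_score

cv_scores = []
for i in range(1, params["n_estimators"] + 1):
    # Work on a copy so the fitted clf keeps its own settings.
    model = clone(clf).set_params(n_estimators=i)
    # Negate the neg_log_loss scores to recover the log loss per fold.
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    cv_scores.append(scores.mean())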
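
The same curve can be obtained with one fit per fold by scoring the staged predictions of each fold's model. The following is a minimal sketch of that idea, not the lab's own method. The helper name staged_cv_log_loss and the demo data are hypothetical, and StratifiedKFold is used because it is what cv=5 resolves to for classifiers.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold


def staged_cv_log_loss(model, X, y, n_splits=5):
    """Mean validation log loss after each boosting iteration."""
    n_estimators = model.get_params()["n_estimators"]
    fold_losses = np.zeros((n_splits, n_estimators))
    cv = StratifiedKFold(n_splits=n_splits)
    for k, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        # One fit per fold instead of one fit per fold and ensemble size.
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        staged = fold_model.staged_predict_proba(X[val_idx])
        for i, proba in enumerate(staged):
            fold_losses[k, i] = log_loss(
                y[val_idx], proba, labels=fold_model.classes_
            )
    return fold_losses.mean(axis=0)


# Illustrative data and model; sizes are kept small so the demo runs quickly.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
demo_model = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, subsample=0.5, random_state=1
)

cv_curve = staged_cv_log_loss(demo_model, X_demo, y_demo)
cv_best_iter = int(np.argmin(cv_curve)) + 1
print("CV-estimated best number of iterations:", cv_best_iter)
```

This needs 5 fits rather than 5 fits per ensemble size, because a boosted model with i iterations is a prefix of the same model with more iterations.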

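Compute Best Number of Iterations for Test Data

We can also compute the best number of iterations on the test data. We compute the log loss on the test data after each boosting iteration and take the iteration with the lowest loss.

import numpy as np
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt

# X_test, y_test: assumed to be a held-out split created in the data
# generation step. Scoring on the training data (X, y) would not measure
# test performance.
test_scores = []
for y_pred in clf.staged_predict_proba(X_test):
    test_scores.append(log_loss(y_test, y_pred))

# argmin is zero-based; add 1 to get the number of iterations.
best_n_estimators = np.argmin(test_scores) + 1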

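Plot the Results

Finally, we can plot the results to visualize model performance across different numbers of iterations. The y-axis shows the log loss and the x-axis shows the number of iterations. The dashed red line marks the best number of iterations on the test data.

plt.figure(figsize=(10, 5))
plt.plot(range(1, params["n_estimators"] + 1), cv_scores, label="CV")
plt.plot(range(1, params["n_estimators"] + 1), test_scores, label="Test")
# Best number of iterations according to the test loss.
plt.axvline(x=best_n_estimators, color="red", linestyle="--")
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.legend()
plt.show()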
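
When more than one curve is compared (for example OOB, CV, and test), it can help to mark the minimum of each curve in that curve's own color. The following is a minimal sketch under that assumption. The helper plot_loss_curves and the toy loss values are hypothetical and only illustrate the plotting logic.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_loss_curves(curves):
    """Plot each loss curve and mark its minimum with a dashed line."""
    fig, ax = plt.subplots(figsize=(10, 5))
    best = {}
    for label, losses in curves.items():
        losses = np.asarray(losses)
        iterations = np.arange(1, len(losses) + 1)
        (line,) = ax.plot(iterations, losses, label=label)
        # argmin is zero-based; add 1 to get the number of iterations.
        best[label] = int(np.argmin(losses)) + 1
        ax.axvline(x=best[label], color=line.get_color(), linestyle="--")
    ax.set_xlabel("Number of iterations")
    ax.set_ylabel("Log loss")
    ax.legend()
    return fig, best


# Toy loss values, purely illustrative.
toy_curves = {
    "CV": [0.9, 0.6, 0.5, 0.55],
    "Test": [0.8, 0.5, 0.45, 0.4],
}
fig, best = plot_loss_curves(toy_curves)
print(best)
```

In a notebook, calling plt.show() afterwards displays the figure. Passing a dictionary such as {"CV": cv_scores, "Test": test_scores} would reuse the curves computed earlier.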

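Summary

In this lab, you implemented a Gradient Boosting classifier with out-of-bag estimates and used cross-validation to estimate the best number of iterations. You also computed the best number of iterations on the test data and plotted the results to visualize model performance across different numbers of iterations.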
