AdaBoost Decision Stump Classifier | Machine Learning Tutorial

Introduction

This is a step-by-step lab that uses AdaBoost to fit decision stumps and classify a two-dimensional dataset made up of two "Gaussian quantiles" clusters.

Import the Required Libraries

In this step, we import the libraries needed for this lab.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.inspection import DecisionBoundaryDisplay

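Generate the Dataset

In this step, we use the make_gaussian_quantiles function from the sklearn.datasets module to build a classification dataset that is not linearly separable. It consists of two Gaussian quantiles clusters: one centered at the origin and one centered at (3, 3). We then concatenate the two clusters and assign the labels, flipping the labels of the second cluster with -y2 + 1.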

X1, y1 = make_gaussian_quantiles(
    cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1
)
X2, y2 = make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=1
)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))
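
To see what the generated data looks like before training, a quick inspection along the following lines can help. This is a minimal sketch that repeats the dataset code above; the printed checks and the helper variable r1 are our own additions, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Same dataset construction as in the step above.
X1, y1 = make_gaussian_quantiles(
    cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1
)
X2, y2 = make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=1
)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))  # -y2 + 1 maps label 0 -> 1 and 1 -> 0

print("X shape:", X.shape)
print("samples per class:", np.bincount(y))

# In the first cluster, class 0 is the inner quantile: every class-0 point
# lies closer to the cluster center (the origin) than any class-1 point.
r1 = np.linalg.norm(X1, axis=1)  # hypothetical helper: distance from the origin
print("max radius of class 0:", r1[y1 == 0].max())
print("min radius of class 1:", r1[y1 == 1].min())
```

Because the classes are nested rings rather than two blobs side by side, no single straight line can separate them, which is why a boosted ensemble is used here.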

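Create and Fit the AdaBoosted Decision Trees

In this step, we use the AdaBoostClassifier class from the sklearn.ensemble module to build a boosted ensemble of decision trees. The base estimator is a DecisionTreeClassifier with max_depth set to 1, that is, a decision stump that makes a single split. We also set the algorithm parameter to "SAMME" and the n_estimators parameter to 200. Finally, we fit the classifier to the dataset.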

bdt = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=200
)

bdt.fit(X, y)
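
To make the idea behind boosting stumps more concrete, here is a small NumPy-only sketch of the two-class SAMME procedure (discrete AdaBoost). It is an illustration under our own assumptions, not scikit-learn's implementation: the functions fit_stump, stump_predict, samme_fit, and samme_predict are hypothetical names, the stump search is a brute-force scan, and the toy ring-shaped data is made up for the demo.

```python
import numpy as np


def stump_predict(X, j, t, left):
    """Predict `left` where feature j <= t, and the other label elsewhere."""
    return np.where(X[:, j] <= t, left, 1 - left)


def fit_stump(X, y, w):
    """Brute-force search for the stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for left in (0, 1):
                err = np.sum(w[stump_predict(X, j, t, left) != y])
                if best is None or err < best[0]:
                    best = (err, j, t, left)
    return best


def samme_fit(X, y, n_rounds=50):
    """Two-class SAMME: returns a list of (alpha, feature, threshold, left)."""
    w = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
    model = []
    for _ in range(n_rounds):
        err, j, t, left = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        if err >= 0.5:  # no better than chance: stop boosting
            break
        alpha = np.log((1 - err) / err)  # the log(K - 1) term is 0 for K = 2
        miss = stump_predict(X, j, t, left) != y
        w = w * np.exp(alpha * miss)  # up-weight misclassified samples
        w /= w.sum()
        model.append((alpha, j, t, left))
    return model


def samme_predict(X, model):
    """Weighted vote of the stumps, with labels mapped to -1 / +1."""
    score = np.zeros(len(X))
    for alpha, j, t, left in model:
        score += alpha * (2 * stump_predict(X, j, t, left) - 1)
    return (score > 0).astype(int)


# Toy data (an assumption for this demo): label 1 outside a circle, 0 inside.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (np.sum(X_toy**2, axis=1) > 1.4).astype(int)

model = samme_fit(X_toy, y_toy)
acc_stump = np.mean(samme_predict(X_toy, model[:1]) == y_toy)
acc_boost = np.mean(samme_predict(X_toy, model) == y_toy)
print("rounds used:", len(model))
print("training accuracy, first stump only:", acc_stump)
print("training accuracy, full ensemble:", acc_boost)
```

Each round re-weights the samples so that the next stump concentrates on the points the previous ones got wrong. The final prediction is a weighted vote of the stumps.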

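Plot the Decision Boundary and Training Points

In this step, we plot the decision boundary together with the training points. We create a DecisionBoundaryDisplay object with the from_estimator class method of DecisionBoundaryDisplay (from the sklearn.inspection module), passing it the AdaBoost classifier, the dataset, and a few display parameters. We then draw the training points, using a different color for each class.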

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
ax = plt.subplot(121)
disp = DecisionBoundaryDisplay.from_estimator(
    bdt,
    X,
    cmap=plt.cm.Paired,
    response_method="predict",
    ax=ax,
    xlabel="x",
    ylabel="y",
)
x_min, x_max = disp.xx0.min(), disp.xx0.max()
y_min, y_max = disp.xx1.min(), disp.xx1.max()
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(
        X[idx, 0],
        X[idx, 1],
        c=c,
        s=20,
        edgecolor="k",
        label="Class %s" % n,
    )
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc="upper right")

plt.title("Decision Boundary")
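
For readers curious about what a decision-boundary plot is built from, the following sketch evaluates a classifier on a regular grid by hand. It is only an approximation of the idea under our own assumptions: the grid step of 0.1, the padding of 1, the smaller single-cluster dataset, and the variable names are illustrative choices, and the algorithm argument is left at the library default, which may differ from "SAMME" depending on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A small stand-in dataset and classifier (assumptions for this sketch).
X_demo, y_demo = make_gaussian_quantiles(
    n_samples=200, n_features=2, n_classes=2, random_state=1
)
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
clf.fit(X_demo, y_demo)

# Build a regular grid that covers the data with some padding.
step = 0.1
gx_min, gx_max = X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1
gy_min, gy_max = X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(gx_min, gx_max, step), np.arange(gy_min, gy_max, step)
)

# Predict a class for every grid point and reshape back to the grid.
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict(grid_points).reshape(xx.shape)

print("grid shape:", Z.shape)
print("classes on the grid:", np.unique(Z))
```

An array like Z can be handed to plt.contourf(xx, yy, Z) to color the regions. DecisionBoundaryDisplay.from_estimator takes care of this kind of bookkeeping for us.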

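Plot the Two-Class Decision Scores

In this step, we plot the decision scores of the two classes. We call the decision_function method of the AdaBoost classifier to get a decision score for every sample in the dataset, and then plot a histogram of the scores for each class.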

# Plot the two-class decision scores
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(
        twoclass_output[y == i],
        bins=10,
        range=plot_range,
        facecolor=c,
        label="Class %s" % n,
        alpha=0.5,
        edgecolor="k",
    )
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc="upper right")
plt.ylabel("Samples")
plt.xlabel("Score")
plt.title("Decision Scores")

plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()
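
The histogram is easier to read once the link between the decision score and the predicted class is clear. For a two-class problem, the sign of the score decides the class: positive scores are predicted as the second class (label 1 here) and all other scores as the first (label 0). The sketch below checks this on the lab's dataset. It is a self-contained rerun under one assumption: the algorithm argument is left at the library default so that the snippet runs across scikit-learn versions, so the exact score values may differ from those in the plot above.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Rebuild the dataset from the earlier step.
X1, y1 = make_gaussian_quantiles(
    cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1
)
X2, y2 = make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=1
)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
bdt.fit(X, y)

scores = bdt.decision_function(X)  # one score per sample
pred = bdt.predict(X)

# The predicted label follows the sign of the decision score.
same = np.array_equal(pred, (scores > 0).astype(int))
print("predict matches sign of score:", same)
print("training accuracy:", bdt.score(X, y))
print("score range:", scores.min(), scores.max())
```

Samples whose score is far from zero are classified with more agreement among the stumps, while scores near zero mark points close to the decision boundary.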

Summary

In this lab, we learned how to use AdaBoost to fit decision stumps and classify a two-dimensional dataset made up of two Gaussian quantiles clusters. We also learned how to visualize the classifier's decision boundary and decision scores.