Visualizing Decision Boundaries with Ensemble Methods

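Introduction

This lab uses Python's scikit-learn library to visualize the decision boundaries of forests of randomized trees trained on the Iris dataset, a small dataset widely used for classification tasks. It compares the decision boundaries learned by a decision tree classifier, a random forest classifier, an extra-trees classifier, and an AdaBoost classifier, each trained on pairs of Iris features.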

Importing Libraries

This step imports the libraries needed to visualize decision boundaries on the Iris dataset.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn.datasets import load_iris
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    AdaBoostClassifier,
)
from sklearn.tree import DecisionTreeClassifier

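Defining Parameters

This step defines the parameters that control the ensembles and the plots: the number of estimators, the color map, the two mesh step widths, and the random seed.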

# Parameters
n_classes = 3
n_estimators = 30
cmap = plt.cm.RdYlBu
plot_step = 0.02  # fine step width for decision boundary contours
plot_step_coarser = 0.5  # step width for coarse classifier guesses
RANDOM_SEED = 13  # fix the seed on each iteration
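
To get a feel for what the two step widths mean in practice, the following minimal sketch counts the grid points each one produces. The plotting range of -3 to 3 on both axes is an assumption chosen for illustration; in the lab the range is derived from the standardized data.

```python
import numpy as np

plot_step = 0.02          # fine step, as defined in the parameters
plot_step_coarser = 0.5   # coarse step, as defined in the parameters

# Hypothetical plotting range (an assumption, not taken from the lab data)
x_min, x_max = -3.0, 3.0
y_min, y_max = -3.0, 3.0


def grid_shape(step):
    """Return the shape of the mesh built with the given step width."""
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step), np.arange(y_min, y_max, step)
    )
    return xx.shape


fine_shape = grid_shape(plot_step)
coarse_shape = grid_shape(plot_step_coarser)

print("fine grid:", fine_shape, "->", fine_shape[0] * fine_shape[1], "points")
print("coarse grid:", coarse_shape, "->", coarse_shape[0] * coarse_shape[1], "points")
```

The fine grid is dense enough to draw smooth filled contours, while the coarse grid yields a sparse set of points that can be drawn as individual markers.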

Loading the Data

This step loads the Iris dataset.

# Load data
iris = load_iris()
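
As a quick sanity check, the loaded object can be inspected before any model is trained. The sketch below prints the shape of the data and the feature and class names, then selects one feature pair the way the plotting loop does; the variable name `pair` mirrors the loop, and the choice of `[0, 1]` is just one of the three pairs used.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples, 4 features, 3 classes
print(iris.data.shape)
print(iris.feature_names)
print(list(iris.target_names))

# Keep only two feature columns, as the plotting loop does for each pair
pair = [0, 1]
X = iris.data[:, pair]
y = iris.target
print(X.shape, y.shape)
```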

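Defining Models

This step defines the models whose decision boundaries will be visualized on the Iris dataset. The single decision tree is grown to full depth, each ensemble uses up to n_estimators base estimators, and AdaBoost boosts trees of depth 3.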

models = [
    DecisionTreeClassifier(max_depth=None),
    RandomForestClassifier(n_estimators=n_estimators),
    ExtraTreesClassifier(n_estimators=n_estimators),
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators),
]
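
The plotting code later relies on the `estimators_` attribute that fitted ensembles expose. The sketch below, a minimal illustration rather than part of the lab, fits a single tree and a random forest on two Iris features, shows that only the forest has `estimators_`, and checks that the forest's class probabilities are the mean of its trees' probabilities. The `random_state=0` argument is an assumption added here for reproducibility; the lab's models do not set it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [0, 1]], iris.target

tree = DecisionTreeClassifier(max_depth=None).fit(X, y)
# random_state is an assumption for this sketch only
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

# A single tree has no sub-estimators; a fitted forest holds one tree per estimator
print(hasattr(tree, "estimators_"))
print(len(forest.estimators_))

# Average the per-tree class probabilities and compare with the forest's output
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))
```

This averaging is why blending the individual trees' regions with a low alpha gives a reasonable picture of the ensemble as a whole.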

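Visualizing Decision Boundaries

This step fits each model on three pairs of Iris features and plots its decision boundary in a 3 x 4 grid of subplots: one row per feature pair and one column per model.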

plot_idx = 1

for pair in ([0, 1], [0, 2], [2, 3]):
    for model in models:
        # Use only the two corresponding features
        X = iris.data[:, pair]
        y = iris.target

        # Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std

        # Train
        model.fit(X, y)

        scores = model.score(X, y)
        # Build a title for each column and for the console output from the
        # class name, dropping the trailing "Classifier"
        model_title = type(model).__name__[: -len("Classifier")]

        model_details = model_title
        if hasattr(model, "estimators_"):
            model_details += " with {} estimators".format(len(model.estimators_))
        print(model_details + " with features", pair, "has a score of", scores)

        plt.subplot(3, 4, plot_idx)
        if plot_idx <= len(models):
            # Add a title at the top of each column
            plt.title(model_title, fontsize=9)

        # Plot the decision boundary as a filled contour plot, using a fine mesh as input
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(
            np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
        )

        # Plot a single DecisionTreeClassifier directly, or alpha-blend the
        # decision boundaries of an ensemble of classifiers
        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = plt.contourf(xx, yy, Z, cmap=cmap)
        else:
            # Choose the alpha blend level according to the number of estimators in use
            # (AdaBoost can use fewer estimators than the maximum if it reaches
            # a good enough fit early on)
            estimator_alpha = 1.0 / len(model.estimators_)
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)

        # Build a coarser grid to plot the ensemble's aggregate predictions.
        # These points are regularly spaced and have no black outline.
        xx_coarser, yy_coarser = np.meshgrid(
            np.arange(x_min, x_max, plot_step_coarser),
            np.arange(y_min, y_max, plot_step_coarser),
        )
        Z_points_coarser = model.predict(
            np.c_[xx_coarser.ravel(), yy_coarser.ravel()]
        ).reshape(xx_coarser.shape)
        cs_points = plt.scatter(
            xx_coarser,
            yy_coarser,
            s=15,
            c=Z_points_coarser,
            cmap=cmap,
            edgecolors="none",
        )

        # Plot the training points; these cluster together and have a black outline
        plt.scatter(
            X[:, 0],
            X[:, 1],
            c=y,
            cmap=ListedColormap(["r", "y", "b"]),
            edgecolor="k",
            s=20,
        )
        plot_idx += 1  # move on to the next plot

plt.suptitle("Classifiers on feature subsets of the Iris dataset", fontsize=12)
plt.axis("tight")
plt.tight_layout(h_pad=0.2, w_pad=0.2, pad=2.5)
plt.show()
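
The loop standardizes each feature pair by hand, subtracting the column mean and dividing by the column standard deviation. As a hedged cross-check, the sketch below compares that arithmetic with scikit-learn's `StandardScaler`, which is not used in the lab itself and appears here only for comparison. Both use the population standard deviation, so the results should agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data[:, [0, 1]]

# Manual standardization, as done inside the plotting loop
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation through StandardScaler (used here only for comparison)
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
print(X_manual.mean(axis=0).round(6), X_manual.std(axis=0).round(6))
```

After standardization each feature has zero mean and unit variance, which is why padding the plot range by 1 on each side leaves a visible margin around the data.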

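Summary

In this lab, you learned how to use Python's scikit-learn library to visualize the decision boundaries of forests of randomized trees on the Iris dataset. You compared the decision boundaries learned by a decision tree classifier, a random forest classifier, an extra-trees classifier, and an AdaBoost classifier. Along the way, you loaded the data, defined the models, and plotted their decision boundaries.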
