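Feature Importance Evaluation with Random Forests

Introduction

In this lab, we use a random forest to evaluate the importance of features on an artificial classification task. We generate a synthetic dataset with only 3 informative features. The forest's feature importances are plotted together with their inter-tree variability, shown as error bars.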

Import libraries

Import the libraries needed for this lab.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import time

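Generate data

We generate a synthetic dataset with only 3 informative features. We explicitly do not shuffle the dataset (shuffle=False) so that the informative features correspond to the first three columns of X. We also split the dataset into training and testing subsets.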

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
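
As a quick sanity check on the split above, the following minimal sketch prints the subset shapes and class proportions. It assumes the default test_size of train_test_split (25% of the samples); the variable names train_ratio and test_ratio are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same dataset and split as in the lab.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# With the default test_size, 25% of the samples go to the test subset.
print("train shape:", X_train.shape)
print("test shape:", X_test.shape)

# stratify=y keeps the class proportions nearly identical in both subsets.
train_ratio = np.bincount(y_train) / len(y_train)
test_ratio = np.bincount(y_test) / len(y_test)
print("class proportions (train):", train_ratio)
print("class proportions (test):", test_ratio)
```

Because the split is stratified, the two proportion vectors should differ only by rounding.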

Train the random forest

We train a random forest classifier so that we can compute the feature importances.

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)
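
Before interpreting importances, it can help to confirm that the fitted forest actually predicts well, since importances from a poor model are not very meaningful. The sketch below is one possible check, not part of the original lab: it compares test accuracy against a majority-class baseline. The names majority_baseline and test_accuracy are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same dataset, split, and model as in the lab.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Accuracy obtained by always predicting the most frequent test class.
majority_baseline = np.bincount(y_test).max() / len(y_test)

# Mean accuracy of the forest on the held-out test subset.
test_accuracy = forest.score(X_test, y_test)

print(f"number of trees: {len(forest.estimators_)}")
print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"forest test accuracy:    {test_accuracy:.3f}")
```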

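Feature importance based on mean decrease in impurity

Feature importances are provided by the fitted attribute feature_importances_. They are computed as the mean of the accumulated impurity decrease within each tree; the standard deviation of the per-tree values gives the error bars. We plot the impurity-based importances.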

start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
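
To make the relationship between the forest-level and tree-level values concrete, the following sketch recomputes the forest importances by averaging the per-tree feature_importances_ arrays. It assumes every tree in the forest has at least one split, which should hold for this dataset; the names per_tree and manual_mean are introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same dataset, split, and model as in the lab.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# One row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over trees should reproduce the forest-level attribute.
manual_mean = per_tree.mean(axis=0)
matches = np.allclose(manual_mean, forest.feature_importances_)

print("manual mean matches feature_importances_:", matches)
print("importances sum:", manual_mean.sum())
```

Each tree's importances are normalized to sum to 1, so the averaged values also sum to 1. This is why MDI values are relative shares rather than absolute effects.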

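Feature importance based on feature permutation

Permutation feature importance overcomes limitations of impurity-based feature importance: it has no bias toward high-cardinality features and it can be computed on a held-out test set. We compute the full permutation importance. Each feature is shuffled n times (here n_repeats=10) and the already fitted model is re-scored on the shuffled data to estimate that feature's importance; the model is not retrained. We then plot the importance ranking.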

start_time = time.time()
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
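
The idea behind permutation importance can be sketched by hand: shuffle one column of the test data, re-score the fitted model, and record the drop in accuracy. The helper manual_permutation_importance below is a hypothetical illustration, not the scikit-learn implementation, and its numbers will not match permutation_importance exactly because it uses a different random number stream.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same dataset, split, and model as in the lab.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)


def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Hypothetical helper: mean and std of the accuracy drop per feature."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # accuracy on the unshuffled data
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffle a single column to break its link with the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            # The model is only re-scored here, never refitted.
            drops[col, rep] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)


means, stds = manual_permutation_importance(forest, X_test, y_test)
for i, (m, s) in enumerate(zip(means, stds)):
    print(f"feature {i}: {m:.3f} +/- {s:.3f}")
```

Shuffling an uninformative column should leave the accuracy roughly unchanged, so its mean drop stays near zero and can even be slightly negative.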

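Summary

In this lab, we generated a synthetic dataset with only 3 informative features and used a random forest to evaluate feature importance. We plotted the forest's feature importances together with the inter-tree variability, shown as error bars. We computed the importances in two ways: impurity-based importance (MDI) and permutation importance. Both methods identified the same features as the most important.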
