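Mastering Multi-Output Regression with Random Forests

Introduction

This lab shows how to use scikit-learn's MultiOutputRegressor meta-estimator to perform multi-output regression. Because RandomForestRegressor supports multi-output regression natively, the same kind of model can be fit both ways, once directly and once wrapped in MultiOutputRegressor, and the two sets of results compared.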

Import Libraries

First, import the required libraries: numpy, matplotlib, and scikit-learn's RandomForestRegressor, train_test_split, and MultiOutputRegressor.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

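Generate a Random Dataset

Next, generate a random dataset for the regression. Using numpy, draw 600 x values between -100 and 100 and sort them. Then compute two targets for each x, pi * sin(x) and pi * cos(x), and add uniform random noise of at most 0.5 in magnitude to each target.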

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
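
As a quick sanity check on the recipe above, the following sketch rebuilds the data while keeping the clean signal and the noise apart. The names y_clean, noise, and radius are illustrative and do not appear in the lab. It shows that the noise-free targets lie on a circle of radius pi, which may help explain the ring shape in the final scatter plot.

```python
import numpy as np

# Same recipe as the lab: 600 sorted inputs in [-100, 100) and two targets.
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)

# Illustrative split into clean signal and noise (names are not from the lab).
y_clean = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
noise = 0.5 - rng.rand(*y_clean.shape)  # uniform noise in (-0.5, 0.5]
y = y_clean + noise

print(X.shape, y.shape)  # (600, 1) (600, 2)

# (pi*sin x)^2 + (pi*cos x)^2 = pi^2, so the clean targets sit on a circle.
radius = np.hypot(y_clean[:, 0], y_clean[:, 1])
print(np.allclose(radius, np.pi))  # True
```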

Split into Training and Test Data

Use scikit-learn's train_test_split function to split the data into 400 training samples and 200 test samples.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=400, test_size=200, random_state=4)
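
The sketch below is one way to confirm what the split produces. It rebuilds the data without noise (which does not affect the shapes) and assumes the default shuffle=True behaviour of train_test_split, so the training rows no longer come out in sorted order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=400, test_size=200, random_state=4
)

print(X_train.shape, X_test.shape)  # (400, 1) (200, 1)
print(y_train.shape, y_test.shape)  # (400, 2) (200, 2)

# Rows are shuffled before splitting, so X_train is not monotonically sorted.
still_sorted = bool(np.all(np.diff(X_train.ravel()) >= 0))
print(still_sorted)
```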

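Create the Random Forest Regressor

Use scikit-learn's RandomForestRegressor to create a random forest regression model with a maximum depth of 30 and 100 estimators (trees), and fit it directly on the two-column target.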

max_depth = 30
regr_rf = RandomForestRegressor(n_estimators=100, max_depth=max_depth, random_state=2)
regr_rf.fit(X_train, y_train)
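
To illustrate the native multi-output support, this sketch fits a forest on the full dataset (skipping the train/test split for brevity, which is a simplification relative to the lab) and inspects the fitted attributes n_outputs_ and estimators_. The variable name forest is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)

forest = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=2)
forest.fit(X, y)  # y has two columns; no wrapper is needed

print(forest.n_outputs_)             # 2
print(len(forest.estimators_))       # 100
print(forest.predict(X[:5]).shape)   # (5, 2)
```

Each tree in this forest predicts both targets at once, so a single set of 100 trees covers the whole problem.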

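Create the MultiOutputRegressor

Create a MultiOutputRegressor that uses a random forest regressor as its base estimator. It uses the same n_estimators and max_depth as the previous step; only random_state differs (0 here, 2 for the plain forest).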

regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, max_depth=max_depth, random_state=0))
regr_multirf.fit(X_train, y_train)
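
The meta-estimator fits one independent copy of the base estimator per target column. The sketch below, fitted on the full dataset for brevity, inspects the fitted estimators_ list and compares the first sub-model with a forest trained by hand on the first target only. The names base, multi, and manual are illustrative, and the equality check assumes an integer random_state so that both fits are deterministic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)

base = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=0)
multi = MultiOutputRegressor(base).fit(X, y)

# One fitted single-output forest per target column.
print(len(multi.estimators_))                         # 2
print([est.n_outputs_ for est in multi.estimators_])  # [1, 1]

# Hand-rolled equivalent for the first target only.
manual = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=0)
manual.fit(X, y[:, 0])
same = bool(np.allclose(manual.predict(X), multi.predict(X)[:, 0]))
print(same)
```

The practical consequence is that the wrapped model trains 200 trees in total (100 per target), whereas the plain forest shares its 100 trees across both targets.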

Predict on New Data

Use both the random forest regressor and the multi-output regressor to make predictions on the test data.

y_rf = regr_rf.predict(X_test)
y_multirf = regr_multirf.predict(X_test)
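
Before plotting, it can help to look at the error for each target separately. The following sketch repeats the lab's setup and reports a per-target root mean squared error using mean_squared_error with multioutput="raw_values". The rmse dictionary is an illustrative name, and no particular values are claimed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=400, test_size=200, random_state=4
)

regr_rf = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=2)
regr_multirf = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, max_depth=30, random_state=0)
)

rmse = {}  # illustrative container: model name -> RMSE for each target
for name, model in [("RF", regr_rf), ("Multi RF", regr_multirf)]:
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred, multioutput="raw_values")
    rmse[name] = np.sqrt(mse)
    print(name, pred.shape, rmse[name])
```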

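Plot the Results

Plot the results to compare the two regression models. With matplotlib, draw scatter plots of target 1 against target 2 for the actual test data, the random forest predictions, and the multi-output regressor predictions. The legend shows each model's score on the test set.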

plt.figure()
s = 50
a = 0.4
plt.scatter(y_test[:, 0], y_test[:, 1], edgecolor="k", c="navy", s=s, marker="s", alpha=a, label="Data")
plt.scatter(y_rf[:, 0], y_rf[:, 1], edgecolor="k", c="c", s=s, marker="^", alpha=a, label="RF score=%.2f" % regr_rf.score(X_test, y_test))
plt.scatter(y_multirf[:, 0], y_multirf[:, 1], edgecolor="k", c="cornflowerblue", s=s, alpha=a, label="Multi RF score=%.2f" % regr_multirf.score(X_test, y_test))
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel("target 1")
plt.ylabel("target 2")
plt.title("Comparing random forests and the multi-output meta estimator")
plt.legend()
plt.show()
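
The score shown in the legend is the coefficient of determination, R^2. For multi-output targets, and assuming scikit-learn 0.23 or later, score averages R^2 uniformly over the targets. The sketch below checks that relationship for the plain forest using r2_score; the names per_target and overall are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=400, test_size=200, random_state=4
)

regr_rf = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=2)
regr_rf.fit(X_train, y_train)

# R^2 for each target column, then the single number reported by score().
per_target = r2_score(y_test, regr_rf.predict(X_test), multioutput="raw_values")
overall = regr_rf.score(X_test, y_test)

print(per_target)
print(overall)
print(np.isclose(overall, per_target.mean()))  # True under the stated assumption
```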

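Summary

This lab showed how to perform multi-output regression with scikit-learn's MultiOutputRegressor. Using randomly generated data, it compared the multi-output meta-estimator with a standard random forest regressor fit directly on both targets. In this run, the multi-output regressor scored slightly higher than the plain random forest; because random forests are stochastic, the size of that gap can vary with the random seeds.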
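
One way to see how much the comparison depends on the seed is to repeat it with a few different random_state values. This is a minimal sketch rather than a rigorous benchmark: the seed list and the results variable are hypothetical choices, and both models share the same seed in each round, unlike the lab's 2 and 0.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=400, test_size=200, random_state=4
)

results = []  # (seed, plain forest R^2, wrapped forest R^2)
for seed in [0, 1, 2, 3, 4]:  # hypothetical seeds, chosen only for illustration
    rf = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=seed)
    multi = MultiOutputRegressor(
        RandomForestRegressor(n_estimators=100, max_depth=30, random_state=seed)
    )
    rf.fit(X_train, y_train)
    multi.fit(X_train, y_train)
    results.append((seed, rf.score(X_test, y_test), multi.score(X_test, y_test)))

for seed, rf_score, multi_score in results:
    print(f"seed={seed}  RF={rf_score:.3f}  Multi RF={multi_score:.3f}")
```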
