Bayesian Regression Techniques | Machine Learning Tutorial

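Introduction

In this lab, we use a synthetic dataset to compare two Bayesian regression models: Automatic Relevance Determination (ARD) and Bayesian Ridge Regression. In the first part, we use an Ordinary Least Squares (OLS) model as a baseline and compare each model's coefficients against the true coefficients. In the last section, we use a polynomial feature expansion to fit a non-linear relationship between X and y, then plot the predictions and uncertainties of ARD and Bayesian Ridge Regression.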

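VM Tips

After the VM has started, click the top-left corner to switch to the Notebook tab and open the Jupyter Notebook for this exercise.

Jupyter Notebook may take a few seconds to finish loading. Because of limitations in Jupyter Notebook, validation of your work cannot be automated.

If you run into problems while learning, ask Labby. Leave feedback after the session and we will resolve the issue promptly.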

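Generate a Synthetic Dataset

We generate a synthetic dataset in which X and y are linearly related. Ten of the features of X are used to generate y; the remaining features are not useful for predicting y. In addition, the dataset has n_samples == n_features. Such a setting is challenging for an OLS model and can lead to arbitrarily large weights. Placing a prior on the weights, which acts as a penalty, alleviates the problem. Finally, Gaussian noise is added.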

from sklearn.datasets import make_regression

X, y, true_weights = make_regression(
    n_samples=100,
    n_features=100,
    n_informative=10,
    noise=8,
    coef=True,
    random_state=42,
)
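
As a quick sanity check on the description above, the following sketch regenerates the same dataset and inspects its shape and the number of non-zero true weights. The variable name informative_idx is introduced here for illustration only and is not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same call as in the tutorial: 100 samples, 100 features, 10 of them informative.
X, y, true_weights = make_regression(
    n_samples=100,
    n_features=100,
    n_informative=10,
    noise=8,
    coef=True,
    random_state=42,
)

# Square design matrix: as many samples as features.
print("X shape:", X.shape, "| y shape:", y.shape)

# Only the informative features carry a non-zero weight in the generative model.
informative_idx = np.flatnonzero(true_weights)  # hypothetical helper name
print("number of informative features:", informative_idx.size)
print("their column indices:", informative_idx)
```

The informative columns are scattered across X rather than grouped at the front, which is why the later coefficient heatmap shows isolated non-zero cells.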

Fit the Regression Models

We fit both Bayesian models and the OLS model so that we can compare their coefficients later.

import pandas as pd
from sklearn.linear_model import ARDRegression, LinearRegression, BayesianRidge

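olr = LinearRegression().fit(X, y)
# max_iter replaces the n_iter argument, which was deprecated in scikit-learn 1.3
# and later removed; on older releases pass n_iter=30 instead.
brr = BayesianRidge(compute_score=True, max_iter=30).fit(X, y)
ard = ARDRegression(compute_score=True, max_iter=30).fit(X, y)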
df = pd.DataFrame(
    {
        "Weights of true generative process": true_weights,
        "ARDRegression": ard.coef_,
        "BayesianRidge": brr.coef_,
        "LinearRegression": olr.coef_,
    }
)
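
Before plotting, it can help to put a single number on how close each model gets to the true weights. The sketch below is one possible way to do that, not part of the original lab: it refits the three models with their default settings and reports the mean absolute difference between estimated and true coefficients. The dictionary name coef_mae is an assumption made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ARDRegression, BayesianRidge, LinearRegression

X, y, true_weights = make_regression(
    n_samples=100,
    n_features=100,
    n_informative=10,
    noise=8,
    coef=True,
    random_state=42,
)

# Default hyperparameters are used here to keep the sketch version-independent.
models = {
    "ARDRegression": ARDRegression(),
    "BayesianRidge": BayesianRidge(),
    "LinearRegression": LinearRegression(),
}

# Mean absolute error between estimated and true coefficients (hypothetical metric choice).
coef_mae = {}
for name, model in models.items():
    model.fit(X, y)
    coef_mae[name] = float(np.mean(np.abs(model.coef_ - true_weights)))

# Print from closest to farthest.
for name, err in sorted(coef_mae.items(), key=lambda item: item[1]):
    print(f"{name:>16}: mean |coef - true| = {err:.3f}")
```

A single summary number hides which coefficients are wrong, so it complements the heatmap in the next step rather than replacing it.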

Plot the True and Estimated Coefficients

We compare the coefficients of each model with the weights of the true generative model.

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import SymLogNorm

plt.figure(figsize=(10, 6))
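ax = sns.heatmap(
    df.T,
    # Symmetric log scale: linear within +/- linthresh (10e-4 == 1e-3), logarithmic outside.
    norm=SymLogNorm(linthresh=10e-4, vmin=-80, vmax=80),
    cbar_kws={"label": "coefficients' values"},
    cmap="seismic_r",
)
plt.ylabel("linear model")
plt.xlabel("coefficients")
plt.tight_layout(rect=(0, 0, 1, 0.95))
_ = plt.title("Models' coefficients")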
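
The heatmap relies on SymLogNorm so that coefficients spanning several orders of magnitude, with both signs, remain distinguishable. The short sketch below shows how the same normalizer maps a few sample values to the [0, 1] color range; the sample values are chosen for illustration and do not come from the fitted models.

```python
import numpy as np
from matplotlib.colors import SymLogNorm

# Same normalizer as in the heatmap; note that 10e-4 is just another way to write 1e-3.
norm = SymLogNorm(linthresh=10e-4, vmin=-80, vmax=80)

# Illustrative values only: large, moderate, tiny, and zero coefficients of both signs.
values = np.array([-80.0, -1.0, -1e-3, 0.0, 1e-3, 1.0, 80.0])
scaled = np.asarray(norm(values), dtype=float)

for value, position in zip(values, scaled):
    print(f"{value:>8.3f} -> {position:.3f}")
```

Because the scale is symmetric, zero lands in the middle of the colormap, so with a diverging colormap such as seismic_r near-zero coefficients appear close to white while the two signs take opposite colors.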

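Plot the Marginal Log-Likelihood

We plot the marginal log-likelihood of both Bayesian models across fitting iterations. The scores_ attribute stores the log marginal likelihood at each iteration, and the code plots its negative.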

import numpy as np

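# scores_ holds the log marginal likelihood at each iteration (compute_score=True).
# It is negated here, so the curves show the negative log marginal likelihood.
ard_scores = -np.array(ard.scores_)
brr_scores = -np.array(brr.scores_)
plt.plot(ard_scores, color="navy", label="ARD")
plt.plot(brr_scores, color="red", label="BayesianRidge")
plt.ylabel("Negative log marginal likelihood")
plt.xlabel("Iterations")
plt.xlim(1, 30)
plt.legend()
_ = plt.title("Models' negative log marginal likelihood")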
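
If a plot is not needed, the recorded scores can also be inspected directly. The following sketch, which assumes default iteration settings rather than the 30 iterations used above, prints how many scores each model recorded along with the first and last values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ARDRegression, BayesianRidge

X, y, _ = make_regression(
    n_samples=100,
    n_features=100,
    n_informative=10,
    noise=8,
    coef=True,
    random_state=42,
)

# compute_score=True is required; otherwise scores_ is not populated.
fitted = {
    "BayesianRidge": BayesianRidge(compute_score=True).fit(X, y),
    "ARDRegression": ARDRegression(compute_score=True).fit(X, y),
}

score_traces = {}
for name, model in fitted.items():
    scores = np.asarray(model.scores_, dtype=float)
    score_traces[name] = scores
    print(
        f"{name:>14}: {scores.size} recorded scores, "
        f"first = {scores[0]:.2f}, last = {scores[-1]:.2f}"
    )
```

The two models are fitted under different priors, so the traces are mainly useful for checking that each fit has levelled off.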

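Generate a Non-Linear Synthetic Dataset

We create a target that is a non-linear function of the input feature and add Gaussian noise with a standard deviation of 1.35. The input feature itself is drawn from a uniform distribution on the interval (0, 10].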

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
n_samples = 110

# Sort the data to make plotting easier later.
X = np.sort(-10 * rng.rand(n_samples) + 10)
noise = rng.normal(0, 1, n_samples) * 1.35
y = np.sqrt(X) * np.sin(X) + noise
full_data = pd.DataFrame({"input_feature": X, "target": y})
X = X.reshape((-1, 1))

# Extrapolation: extend the plotting grid slightly beyond the training range.
X_plot = np.linspace(10, 10.4, 10)
y_plot = np.sqrt(X_plot) * np.sin(X_plot)
X_plot = np.concatenate((X, X_plot.reshape((-1, 1))))
y_plot = np.concatenate((y - noise, y_plot))

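Fit the Regressors

We try a degree-10 polynomial, which could overfit, although the Bayesian linear models regularize the size of the polynomial coefficients. Because fit_intercept=True by default for ARDRegression and BayesianRidge, PolynomialFeatures should not introduce an additional bias feature, hence include_bias=False. With return_std=True, predict also returns the standard deviation of the predictive distribution at each query point.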

ard_poly = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    ARDRegression(),
).fit(X, y)
brr_poly = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    BayesianRidge(),
).fit(X, y)

y_ard, y_ard_std = ard_poly.predict(X_plot, return_std=True)
y_brr, y_brr_std = brr_poly.predict(X_plot, return_std=True)
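
To make the role of return_std more concrete, here is a minimal sketch that queries the BayesianRidge pipeline at three hand-picked points and builds a one-standard-deviation interval around each prediction. The query points and the names X_query, lower, and upper are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same data-generating recipe as in the tutorial.
rng = np.random.RandomState(0)
n_samples = 110
X = np.sort(-10 * rng.rand(n_samples) + 10)
noise = rng.normal(0, 1, n_samples) * 1.35
y = np.sqrt(X) * np.sin(X) + noise
X = X.reshape((-1, 1))

brr_poly = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    BayesianRidge(),
).fit(X, y)

# Hypothetical query points: two inside the training range, one just outside it.
X_query = np.array([[2.0], [5.0], [10.4]])

# The pipeline forwards return_std to the final estimator's predict method.
y_mean, y_std = brr_poly.predict(X_query, return_std=True)
lower, upper = y_mean - y_std, y_mean + y_std

for x, m, lo, hi in zip(X_query.ravel(), y_mean, lower, upper):
    print(f"x = {x:5.1f}: mean = {m:7.3f}, one-std interval = [{lo:7.3f}, {hi:7.3f}]")

# The predictive std includes the estimated noise level, so it never drops below it.
noise_std = np.sqrt(1.0 / brr_poly.named_steps["bayesianridge"].alpha_)
print(f"estimated noise std: {noise_std:.3f}")
```

The returned standard deviation combines uncertainty about the weights with the estimated noise level, which is why the bands in the final plot never shrink to zero width even where the data are dense.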

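Plot Polynomial Regressions with Standard Deviations of the Predictions

The error bars, drawn as shaded bands, represent one standard deviation of the predicted Gaussian distribution at the query points. Notice that ARD regression captures the ground truth best when both models use their default parameters, but reducing the lambda_init hyperparameter of Bayesian Ridge further can reduce its bias. Finally, because of the intrinsic limitations of polynomial regression, both models fail when extrapolating.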

ax = sns.scatterplot(
    data=full_data, x="input_feature", y="target", color="black", alpha=0.75
)
ax.plot(X_plot, y_plot, color="black", label="Ground Truth")
ax.plot(X_plot, y_brr, color="red", label="BayesianRidge with polynomial features")
ax.plot(X_plot, y_ard, color="navy", label="ARD with polynomial features")
ax.fill_between(
    X_plot.ravel(),
    y_ard - y_ard_std,
    y_ard + y_ard_std,
    color="navy",
    alpha=0.3,
)
ax.fill_between(
    X_plot.ravel(),
    y_brr - y_brr_std,
    y_brr + y_brr_std,
    color="red",
    alpha=0.3,
)
ax.legend()
_ = ax.set_title("Polynomial fit of a non-linear feature")
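
The remark about lambda_init can be explored with a small experiment. The sketch below is an illustration under stated assumptions, not a result from the lab: it fits the BayesianRidge pipeline once with defaults and once with lambda_init=1e-3 (a value picked here for demonstration) and compares each fit against the noise-free target on the training inputs. The helper fit_brr and the dictionary mse_to_truth are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same data-generating recipe as in the tutorial.
rng = np.random.RandomState(0)
n_samples = 110
X = np.sort(-10 * rng.rand(n_samples) + 10)
noise = rng.normal(0, 1, n_samples) * 1.35
y_true = np.sqrt(X) * np.sin(X)  # noise-free target
y = y_true + noise
X = X.reshape((-1, 1))


def fit_brr(**brr_params):
    """Fit the degree-10 polynomial pipeline, forwarding parameters to BayesianRidge."""
    return make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        BayesianRidge(**brr_params),
    ).fit(X, y)


candidates = {
    "default lambda_init": fit_brr(),
    "lambda_init=1e-3 (illustrative value)": fit_brr(lambda_init=1e-3),
}

# Mean squared error against the noise-free curve on the training inputs.
mse_to_truth = {}
for label, model in candidates.items():
    mse_to_truth[label] = float(np.mean((model.predict(X) - y_true) ** 2))
    print(f"{label:>40}: MSE to ground truth = {mse_to_truth[label]:.4f}")
```

Whether the smaller initial value helps depends on the data and on where the iterative hyperparameter updates converge, so it is worth comparing a few values rather than assuming a fixed improvement.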

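Summary

This lab compared two Bayesian regressors on synthetic datasets. The first part used an Ordinary Least Squares (OLS) model as a baseline and compared each model's coefficients against the true coefficients. The last section used a polynomial feature expansion to fit a non-linear relationship between X and y, then plotted the predictions and uncertainties of ARD and Bayesian Ridge Regression.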
