머신러닝에서 과소적합 및 과적합 극복하기

소개

이 실습은 머신 러닝에서 과소적합 (underfitting) 과 과대적합 (overfitting) 의 문제점을 보여주고, 다항식 특징을 가진 선형 회귀를 사용하여 비선형 함수를 근사하는 방법을 보여줍니다. scikit-learn 을 사용하여 데이터를 생성하고, 모델을 학습시키며, 모델 성능을 평가할 것입니다.

VM 팁

VM 시작이 완료되면 왼쪽 상단 모서리를 클릭하여 Notebook 탭으로 전환하여 연습용 Jupyter Notebook에 접근합니다.

때때로 Jupyter Notebook 이 완전히 로드되기까지 몇 초 정도 기다려야 할 수 있습니다. Jupyter Notebook 의 제한으로 인해 작업 검증을 자동화할 수 없습니다.

학습 중 문제가 발생하면 Labby 에 문의하십시오. 세션 후 피드백을 제공하면 문제를 신속하게 해결해 드리겠습니다.

라이브러리 가져오기

먼저, 이 실습에 필요한 라이브러리를 가져옵니다.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

데이터 생성

코사인 함수에서 30 개의 샘플을 생성하고, 샘플에 약간의 랜덤 노이즈를 추가합니다.

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

데이터 시각화

실제 함수와 생성된 샘플을 플롯합니다.

plt.figure(figsize=(6, 4))
plt.plot(np.linspace(0, 1, 100), true_fun(np.linspace(0, 1, 100)), label="True function")
plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc="best")
plt.show()

다항식 특징으로 모델 적합

1, 4, 15 차 다항식 특징을 가진 모델을 적합하고 결과를 플롯합니다.

degrees = [1, 4, 15]

plt.figure(figsize=(14, 5))

for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}".format(degrees[i]))

plt.show()

모델 성능 평가

교차 검증을 사용하여 모델을 평가하고 검증 세트에서 평균 제곱 오차 (MSE) 를 계산합니다.

degrees = [1, 4, 15]

plt.figure(figsize=(14, 5))

for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)

    ## 교차 검증을 사용하여 모델 평가
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )

plt.show()

요약

이 실험에서는 다항식 특징을 가진 선형 회귀를 사용하여 비선형 함수를 근사하는 방법과 교차 검증을 사용하여 모델 성능을 평가하는 방법을 보여주었습니다. 선형 함수는 학습 샘플을 적합하기에 충분하지 않으며, 4 차 다항식은 실제 함수를 거의 완벽하게 근사합니다. 그러나 더 높은 차수의 경우 모델은 학습 데이터를 과적합하고 학습 데이터의 노이즈를 학습하게 됩니다. 교차 검증과 평균 제곱 오차 (MSE) 를 사용하여 모델 성능을 평가하고 과적합을 방지할 수 있습니다.

과소적합과 과적합 이해

소개