Machine Learning | PCA | LinearSVC | Digits Dataset

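Introduction

In this lab, you will learn how to balance model complexity against cross-validated score: among the models whose accuracy lies within one standard deviation of the best accuracy, you select the one with the fewest PCA components. The lab uses the digits dataset from scikit-learn and a pipeline composed of PCA and LinearSVC.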
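The selection rule described above can be written compactly. The notation below is introduced here for illustration only and is not part of the original lab: for candidate i, let \mu_i and \sigma_i be the mean and standard deviation of its cross-validated accuracy, and let k_i be its number of PCA components.

```latex
\begin{aligned}
i^{*} &= \arg\max_{i} \; \mu_i
  && \text{(highest mean accuracy)} \\
\tau &= \mu_{i^{*}} - \sigma_{i^{*}}
  && \text{(lower bound: best score minus its own standard deviation)} \\
\hat{\imath} &= \arg\min_{i \,:\, \mu_i \ge \tau} \; k_i
  && \text{(fewest components among acceptable candidates)}
\end{aligned}
```

The standard deviation used for the threshold is the one of the best candidate, not a pooled value across candidates.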

Import Libraries

The lab starts by importing the required libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

Define Functions

Define two functions that are used later in this lab.

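def lower_bound(cv_results):
    """
    Calculate the lower bound within 1 standard deviation
    of the best `mean_test_score`.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`.

    Returns
    -------
    float
        Lower bound within 1 standard deviation of the
        best `mean_test_score`.
    """
    # Index of the candidate with the highest mean cross-validated score.
    best_score_idx = np.argmax(cv_results["mean_test_score"])

    # Subtract that same candidate's standard deviation from its mean score.
    return (
        cv_results["mean_test_score"][best_score_idx]
        - cv_results["std_test_score"][best_score_idx]
    )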


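def best_low_complexity(cv_results):
    """
    Balance model complexity with cross-validated score.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`.

    Returns
    -------
    int
        Index of the model that has the fewest PCA components
        while its test score is within 1 standard deviation of
        the best `mean_test_score`.
    """
    threshold = lower_bound(cv_results)
    # Indices of all candidates whose mean score reaches the threshold.
    candidate_idx = np.flatnonzero(cv_results["mean_test_score"] >= threshold)
    # Among those candidates, pick the one with the fewest PCA components.
    best_idx = candidate_idx[
        cv_results["param_reduce_dim__n_components"][candidate_idx].argmin()
    ]
    return best_idx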
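To see what the two functions do without running a grid search, the sketch below applies the same logic to a small hand-written dictionary. The scores in toy_results are made up for illustration and are not results from this lab; the functions are repeated in compact form only so that the block runs on its own.

```python
import numpy as np


def lower_bound(cv_results):
    # Best mean score minus the standard deviation of that same candidate.
    best = np.argmax(cv_results["mean_test_score"])
    return cv_results["mean_test_score"][best] - cv_results["std_test_score"][best]


def best_low_complexity(cv_results):
    # Fewest PCA components among candidates at or above the threshold.
    threshold = lower_bound(cv_results)
    candidate_idx = np.flatnonzero(cv_results["mean_test_score"] >= threshold)
    n_components = cv_results["param_reduce_dim__n_components"]
    return candidate_idx[n_components[candidate_idx].argmin()]


# Made-up scores for illustration only (NOT output of the lab's grid search).
toy_results = {
    "mean_test_score": np.array([0.80, 0.88, 0.90, 0.91, 0.905]),
    "std_test_score": np.array([0.030, 0.020, 0.020, 0.025, 0.020]),
    "param_reduce_dim__n_components": np.ma.array([6, 8, 10, 12, 14]),
}

threshold = lower_bound(toy_results)
chosen = best_low_complexity(toy_results)

print("threshold:", round(float(threshold), 3))
print("chosen index:", int(chosen))
print(
    "chosen n_components:",
    int(toy_results["param_reduce_dim__n_components"][chosen]),
)
```

In this toy data the highest mean score belongs to the candidate with 12 components, but the candidate with 10 components still clears the threshold, so the simpler model is selected.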

Load Data and Define the Pipeline

Define a pipeline composed of PCA and LinearSVC, then load the digits dataset from scikit-learn.

pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LinearSVC(random_state=42, C=0.01, dual="auto")),
    ]
)

X, y = load_digits(return_X_y=True)
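The step names given to the pipeline matter later, because grid-search parameters are addressed as "step name, two underscores, parameter name". The sketch below inspects this naming and the shape of the data. It leaves the dual argument of LinearSVC at its default, which is an assumption made here so that the snippet does not depend on a scikit-learn version that accepts dual="auto".

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        # `dual` is left at its default here (assumption for portability).
        ("classify", LinearSVC(random_state=42, C=0.01)),
    ]
)

# Step parameters are exposed as "<step name>__<parameter name>".
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith("reduce_dim__")])

# set_params accepts the same names; GridSearchCV relies on this mechanism.
pipe.set_params(reduce_dim__n_components=8)
print(pipe.named_steps["reduce_dim"].n_components)

# The digits dataset holds 8x8 images flattened to 64 features per sample.
X, y = load_digits(return_X_y=True)
print(X.shape, y.shape)
```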

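Define Parameters for GridSearchCV

Define the parameter grid for GridSearchCV. The key targets the n_components parameter of the reduce_dim step, and the list holds the candidate numbers of PCA components to evaluate.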

param_grid = {"reduce_dim__n_components": [6, 8, 10, 12, 14]}
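To check which candidates a grid expands to, and in which order they are indexed, ParameterGrid from scikit-learn can be used directly. This is a small sketch rather than part of the original lab; the fold count of 10 matches the cv value this lab passes to GridSearchCV.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"reduce_dim__n_components": [6, 8, 10, 12, 14]}

# Expand the grid into one parameter dictionary per candidate.
candidates = list(ParameterGrid(param_grid))
for index, params in enumerate(candidates):
    print(index, params)

# With 10-fold cross-validation, every candidate is fitted once per fold.
n_folds = 10
n_cv_fits = len(candidates) * n_folds
print("cross-validation fits:", n_cv_fits)
```

The position of a candidate in this list is the index that a custom refit function such as best_low_complexity has to return.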

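Define the GridSearchCV Object

Define the GridSearchCV object and fit the model. Because refit is set to best_low_complexity, the final pipeline is refitted on the candidate that this function returns rather than on the highest-scoring one.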

grid = GridSearchCV(
    pipe,
    cv=10,
    n_jobs=1,
    param_grid=param_grid,
    scoring="accuracy",
    refit=best_low_complexity,
)

grid.fit(X, y)
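When refit is a callable, the fitted search object exposes best_index_, best_params_, and best_estimator_ for the returned index, but it does not set best_score_. The reduced sketch below illustrates this with a smaller grid and 3 folds to keep it quick. The helper name fewest_components_within_one_std is hypothetical, and the dual argument of LinearSVC is left at its default as a portability assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def fewest_components_within_one_std(cv_results):
    # Hypothetical helper: the same rule as best_low_complexity, inlined.
    means = cv_results["mean_test_score"]
    stds = cv_results["std_test_score"]
    threshold = means.max() - stds[means.argmax()]
    candidate_idx = np.flatnonzero(means >= threshold)
    n_components = cv_results["param_reduce_dim__n_components"]
    return int(candidate_idx[n_components[candidate_idx].argmin()])


X, y = load_digits(return_X_y=True)
pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LinearSVC(random_state=42, C=0.01)),
    ]
)

search = GridSearchCV(
    pipe,
    param_grid={"reduce_dim__n_components": [6, 10, 14]},
    cv=3,
    scoring="accuracy",
    refit=fewest_components_within_one_std,
)
search.fit(X, y)

# Attributes set from the index returned by the callable.
print(search.best_index_, search.best_params_)
# best_score_ is not set when refit is a callable.
print(hasattr(search, "best_score_"))
# The refitted pipeline uses the selected number of components.
refit_components = search.best_estimator_.named_steps["reduce_dim"].n_components_
print(refit_components)
```

To read the score of the selected candidate, index cv_results_["mean_test_score"] with best_index_, as the visualization step in this lab does.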

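Visualize the Results

Plot the accuracy as a function of the number of PCA components, together with the best score and the lower bound one standard deviation below it.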

n_components = grid.cv_results_["param_reduce_dim__n_components"]
test_scores = grid.cv_results_["mean_test_score"]

plt.figure()
plt.bar(n_components, test_scores, width=1.3, color="b")

lower = lower_bound(grid.cv_results_)
plt.axhline(np.max(test_scores), linestyle="--", color="y", label="Best score")
plt.axhline(lower, linestyle="--", color=".5", label="Best score - 1 std")

plt.title("Balance model complexity and cross-validated score")
plt.xlabel("Number of PCA components used")
plt.ylabel("Digit classification accuracy")
plt.xticks(n_components.tolist())
plt.ylim((0, 1.0))
plt.legend(loc="upper left")

best_index_ = grid.best_index_

print("The best_index_ is %d" % best_index_)
print("The n_components selected is %d" % n_components[best_index_])
print(
    "The corresponding accuracy score is %.2f"
    % grid.cv_results_["mean_test_score"][best_index_]
)
plt.show()

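Summary

In this lab, you learned how to balance model complexity against cross-validated score using PCA and LinearSVC. With GridSearchCV and a custom refit function, you selected the smallest number of PCA components whose accuracy stays within one standard deviation of the best score. You also visualized the results to better understand the trade-off between model complexity and accuracy.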
