Self-Training Classifier | Machine Learning | Python

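Introduction

This lab illustrates the effect of varying the threshold on self-training. The breast_cancer dataset is loaded and labels are deleted so that only 50 of the 569 samples remain labeled. A SelfTrainingClassifier is then fitted on this dataset with a range of thresholds.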

Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

First, import the libraries this experiment needs.

Load the Data

X, y = datasets.load_breast_cancer(return_X_y=True)
X, y = shuffle(X, y, random_state=42)
y_true = y.copy()
y[50:] = -1
total_samples = y.shape[0]

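Load the breast_cancer dataset and shuffle it. Then copy the true labels into y_true and remove every label in y except those of the first 50 samples by setting them to -1, the value SelfTrainingClassifier treats as "unlabeled". This simulates a semi-supervised learning scenario.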
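As a quick sanity check on the masking step, the sketch below applies the same `y[50:] = -1` pattern to a stand-in label array. The array `y_demo` and its random contents are assumptions made for illustration; only its length (569) and the cutoff (50) come from the lab.

```python
import numpy as np

# Stand-in binary labels with the same length as the breast_cancer targets.
# The values are random placeholders, not the real dataset.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=569)

# Keep a copy of the ground truth, then mark everything after the
# first 50 samples as unlabeled (-1).
y_demo_true = y_demo.copy()
y_demo[50:] = -1

n_labeled = int(np.sum(y_demo != -1))
n_unlabeled = int(np.sum(y_demo == -1))
print(n_labeled, n_unlabeled)  # 50 519
```

Running the same two counts on the lab's own `y` is a cheap way to confirm the split before fitting anything.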

Define the Classifier

base_classifier = SVC(probability=True, gamma=0.001, random_state=42)

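Define the base classifier as a support vector machine (SVM) with a low gamma value (0.001). Setting probability=True makes the SVC expose predict_proba, which self-training needs in order to compare prediction confidence against the threshold.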

Define the Threshold Values

x_values = np.arange(0.4, 1.05, 0.05)
x_values = np.append(x_values, 0.99999)

Define an array of thresholds from 0.4 to 1 in steps of 0.05. Then append a very high threshold, 0.99999, so that the sweep includes a threshold at which no samples are self-labeled.
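To make the role of the threshold concrete, here is a minimal sketch of the selection rule: an unlabeled sample is pseudo-labeled only when its highest predicted class probability exceeds the threshold. The probability matrix `proba_demo` and the helper `count_selected` are hypothetical and exist only for this illustration.

```python
import numpy as np


def count_selected(proba, threshold):
    """Count rows whose largest class probability exceeds the threshold."""
    proba = np.asarray(proba)
    return int(np.sum(proba.max(axis=1) > threshold))


# Hypothetical predicted probabilities for four unlabeled samples.
proba_demo = np.array(
    [
        [0.55, 0.45],
        [0.10, 0.90],
        [0.97, 0.03],
        [0.30, 0.70],
    ]
)

for threshold in (0.5, 0.8, 0.95, 0.99999):
    print(threshold, count_selected(proba_demo, threshold))
# 0.5 4
# 0.8 2
# 0.95 1
# 0.99999 0
```

A low threshold admits nearly every prediction, including uncertain ones, while a threshold close to 1 admits none, which is the range the sweep above is meant to cover.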

Define Arrays to Store the Results

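# Number of cross-validation folds. It is used here and in the loop below
# but was never defined; 3 is the value used in the scikit-learn example
# this lab is based on.
n_splits = 3

# One row per threshold, one column per cross-validation fold
scores = np.empty((x_values.shape[0], n_splits))
amount_labeled = np.empty((x_values.shape[0], n_splits))
amount_iterations = np.empty((x_values.shape[0], n_splits))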

Define arrays to store the results of the experiment, with one row per threshold and one column per cross-validation fold.

Self-Training with Varying Thresholds

for i, threshold in enumerate(x_values):
    self_training_clf = SelfTrainingClassifier(base_classifier, threshold=threshold)

    skfolds = StratifiedKFold(n_splits=n_splits)
    for fold, (train_index, test_index) in enumerate(skfolds.split(X, y)):
        X_train = X[train_index]
        y_train = y[train_index]
        X_test = X[test_index]
        y_test = y[test_index]
        y_test_true = y_true[test_index]

        self_training_clf.fit(X_train, y_train)

        amount_labeled[i, fold] = (
            total_samples
            - np.unique(self_training_clf.labeled_iter_, return_counts=True)[1][0]
        )

        amount_iterations[i, fold] = np.max(self_training_clf.labeled_iter_)

        y_pred = self_training_clf.predict(X_test)
        scores[i, fold] = accuracy_score(y_test_true, y_pred)

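Perform self-training with each threshold by wrapping the base classifier in scikit-learn's SelfTrainingClassifier. Stratified k-fold cross-validation splits the data into training and test sets. For each fold, the self-training classifier is fitted on the training set and its accuracy is computed on the test set against the true labels in y_true. The number of labeled samples and the number of iterations are also stored for each fold.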
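The loop that SelfTrainingClassifier runs internally can be approximated by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: the function `self_train_sketch`, the synthetic data from `make_classification`, and the choice of LogisticRegression as the base estimator are all assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train_sketch(estimator, X, y, threshold=0.75, max_iter=10):
    """Pseudo-label samples marked -1 whose top predicted probability
    exceeds `threshold`, refitting on the enlarged labeled set each round."""
    y_work = np.array(y, copy=True)
    # 0 = labeled from the start, -1 = never labeled, k > 0 = labeled in round k
    labeled_iter = np.where(y_work == -1, -1, 0)

    for iteration in range(1, max_iter + 1):
        unlabeled_idx = np.flatnonzero(y_work == -1)
        if unlabeled_idx.size == 0:
            break  # nothing left to label
        labeled_mask = y_work != -1
        estimator.fit(X[labeled_mask], y_work[labeled_mask])

        proba = estimator.predict_proba(X[unlabeled_idx])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # no prediction is confident enough; stop early

        picked = unlabeled_idx[confident]
        y_work[picked] = estimator.classes_[proba[confident].argmax(axis=1)]
        labeled_iter[picked] = iteration

    # Final fit on everything that ended up labeled
    labeled_mask = y_work != -1
    estimator.fit(X[labeled_mask], y_work[labeled_mask])
    return estimator, y_work, labeled_iter


# Synthetic stand-in data: 200 samples, only the first 30 keep their labels.
X_demo, y_demo_true = make_classification(n_samples=200, n_features=5, random_state=0)
y_demo = y_demo_true.copy()
y_demo[30:] = -1

model, y_filled, labeled_iter = self_train_sketch(
    LogisticRegression(max_iter=1000), X_demo, y_demo, threshold=0.9
)
print("labeled at the end:", int(np.sum(y_filled != -1)))
print("last round that added labels:", int(labeled_iter.max()))
```

Re-running the demo with a lower or higher `threshold` shows the same trade-off the lab measures: more pseudo-labels at low thresholds, fewer or none near 1.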

Visualize the Results

ax1 = plt.subplot(211)
ax1.errorbar(
    x_values, scores.mean(axis=1), yerr=scores.std(axis=1), capsize=2, color="b"
)
ax1.set_ylabel("Accuracy", color="b")
ax1.tick_params("y", colors="b")

ax2 = ax1.twinx()
ax2.errorbar(
    x_values,
    amount_labeled.mean(axis=1),
    yerr=amount_labeled.std(axis=1),
    capsize=2,
    color="g",
)
ax2.set_ylim(bottom=0)
ax2.set_ylabel("Amount of labeled samples", color="g")
ax2.tick_params("y", colors="g")

ax3 = plt.subplot(212, sharex=ax1)
ax3.errorbar(
    x_values,
    amount_iterations.mean(axis=1),
    yerr=amount_iterations.std(axis=1),
    capsize=2,
    color="b",
)
ax3.set_ylim(bottom=0)
ax3.set_ylabel("Amount of iterations")
ax3.set_xlabel("Threshold")

plt.show()

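Visualize the results with Matplotlib. The top plot shows the number of labeled samples available to the classifier at the end of fitting, together with the classifier's accuracy. The bottom plot shows the last iteration in which a sample was labeled. Error bars show the standard deviation across folds.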
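Both plotted quantities are derived from the fitted classifier's labeled_iter_ attribute, where 0 marks a sample that was labeled from the start, -1 marks a sample that was never labeled, and a positive value gives the iteration in which the sample was pseudo-labeled. The sketch below summarizes such an array directly; the helper `summarize_labeled_iter` and the array `demo_iter` are hypothetical.

```python
import numpy as np


def summarize_labeled_iter(labeled_iter):
    """Summarize a labeled_iter_-style array into simple counts."""
    labeled_iter = np.asarray(labeled_iter)
    return {
        "originally_labeled": int(np.sum(labeled_iter == 0)),
        "self_labeled": int(np.sum(labeled_iter > 0)),
        "still_unlabeled": int(np.sum(labeled_iter == -1)),
        "last_iteration": int(labeled_iter.max()),
    }


# Hypothetical values: two samples labeled up front, three pseudo-labeled
# in rounds 1 and 2, two never labeled.
demo_iter = np.array([0, 0, 1, 2, -1, -1, 1])
summary = summarize_labeled_iter(demo_iter)
print(summary)
```

One thing to keep in mind when reading the top plot: the loop computes the labeled count as total_samples minus the unlabeled count of the training fold, and total_samples is the size of the whole dataset, so the plotted value appears to include the held-out fold as well. Summing `originally_labeled` and `self_labeled` as above is one way to count only the training fold.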

Summary

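In this lab, you learned how to perform self-training with a range of thresholds using scikit-learn. The best threshold lies between the very low and the very high values, and choosing a suitable threshold can noticeably improve accuracy.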
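To read the best threshold off the results numerically rather than from the plot, the mean score per threshold can be compared directly. The sketch below assumes a scores array shaped like the lab's `scores` (one row per threshold, one column per fold); the helper `best_threshold` and the numbers in `demo_scores` are made up for illustration and are not results from this experiment.

```python
import numpy as np


def best_threshold(thresholds, scores):
    """Return (threshold, mean, std) for the row with the highest mean score."""
    thresholds = np.asarray(thresholds)
    scores = np.asarray(scores)
    mean = scores.mean(axis=1)
    std = scores.std(axis=1)
    best = int(mean.argmax())
    return float(thresholds[best]), float(mean[best]), float(std[best])


# Made-up accuracies for three thresholds and three folds.
demo_thresholds = [0.4, 0.7, 0.99999]
demo_scores = [
    [0.80, 0.82, 0.78],
    [0.90, 0.92, 0.91],
    [0.85, 0.86, 0.84],
]

thr, mean_acc, std_acc = best_threshold(demo_thresholds, demo_scores)
print(f"best threshold: {thr}, mean accuracy: {mean_acc:.2f}")
```

With the lab's own arrays, the call would be `best_threshold(x_values, scores)`. Because the folds are few, the standard deviation is worth checking before treating small differences between thresholds as meaningful.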
