Outlier Detection Algorithms | Two-Dimensional Datasets

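Introduction

In this lab, we compare different outlier detection algorithms on two-dimensional datasets. The datasets contain one or two modes (regions of high density) to illustrate how well each algorithm copes with multimodal data. In each dataset, 15% of the samples are generated as random uniform noise. Decision boundaries between inliers and outliers are drawn in black, except for Local Outlier Factor (LOF), which has no predict method that can be applied to new data when it is used for outlier detection.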

Import the Required Libraries

Import the libraries needed for the experiment.

import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_moons, make_blobs
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

Set the Parameters

Set the sample count and the fraction of outliers, then derive the number of outliers and inliers from them.

n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
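
As a quick sanity check on these parameters, the following sketch recomputes the split between inliers and outliers. It reuses only the names defined above; the printed values follow from taking 15% of 300.

```python
n_samples = 300
outliers_fraction = 0.15

# int() truncates toward zero; here 0.15 * 300 evaluates to 45.0
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

print(n_outliers, n_inliers)  # 45 255
```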

Define the Outlier Detection Algorithms

Define the outlier detection algorithms to compare.

anomaly_algorithms = [
    (
        "Robust covariance",
        EllipticEnvelope(contamination=outliers_fraction, random_state=42),
    ),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    (
        "One-Class SVM (SGD)",
        make_pipeline(
            Nystroem(gamma=0.1, random_state=42, n_components=150),
            SGDOneClassSVM(
                nu=outliers_fraction,
                shuffle=True,
                fit_intercept=True,
                random_state=42,
                tol=1e-6,
            ),
        ),
    ),
    (
        "Isolation Forest",
        IsolationForest(contamination=outliers_fraction, random_state=42),
    ),
    (
        # This name must match the "Local Outlier Factor" checks in the plotting loop
        "Local Outlier Factor",
        LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),
    ),
]
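
The LOF entry above uses the default outlier-detection mode, which is why the lab treats it separately from the other estimators. As a minimal sketch (the toy arrays `X_train` and `X_new` are assumptions made for illustration, not part of the lab), scikit-learn's `novelty=True` option switches LOF to a mode where `predict` is available for unseen points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: one Gaussian cluster around the origin
rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# Outlier-detection mode (default, novelty=False): labels come from fit_predict
lof_outlier = LocalOutlierFactor(n_neighbors=35, contamination=0.15)
labels = lof_outlier.fit_predict(X_train)  # +1 for inliers, -1 for outliers

# Novelty-detection mode: fit on training data, then predict on new points
lof_novelty = LocalOutlierFactor(n_neighbors=35, novelty=True)
lof_novelty.fit(X_train)
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])  # one central point, one far away
new_labels = lof_novelty.predict(X_new)

print(labels[:10])
print(new_labels)
```

In novelty mode, the scikit-learn documentation advises calling `predict` only on new data and not on the training set, so it is not a drop-in replacement for `fit_predict` in this comparison.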

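Define the Datasets

Define the datasets used in the experiment: three Gaussian blob layouts, a two-moons shape, and a uniform cloud.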

blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
    4.0
    * (
        make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
        - np.array([0.5, 0.25])
    ),
    14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),
]
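
One detail worth noticing in the list above is that the blob datasets are generated with `n_inliers` points, while the moons and uniform datasets use `n_samples`. The sketch below re-creates two of the five datasets with the same arguments and prints their shapes before any noise points are appended.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_moons

n_samples = 300
n_outliers = int(0.15 * n_samples)
n_inliers = n_samples - n_outliers

# Two-blob dataset, built with n_inliers points as in the lab
blobs = make_blobs(
    centers=[[2, 2], [-2, -2]],
    cluster_std=[0.5, 0.5],
    random_state=0,
    n_samples=n_inliers,
    n_features=2,
)[0]

# Moons dataset, built with n_samples points, then shifted and scaled
moons = 4.0 * (
    make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
    - np.array([0.5, 0.25])
)

print(blobs.shape, moons.shape)  # (255, 2) (300, 2)
```

Since the comparison loop appends `n_outliers` uniform noise points to every dataset, the blob datasets end up with 300 rows and the moons dataset with 345, so the share of injected noise is somewhat below 15% for the latter.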

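Compare the Classifiers

Fit each algorithm on each dataset, record the fit time, and plot the predicted inliers and outliers together with the decision boundary where one is available.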

xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))

plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1
rng = np.random.RandomState(42)

for i_dataset, X in enumerate(datasets):
    # Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)

    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

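        # Fit the data and tag outliers
        if isinstance(algorithm, LocalOutlierFactor):
            # LOF in outlier-detection mode exposes only fit_predict
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)

        # Plot the level lines and the points
        if not isinstance(algorithm, LocalOutlierFactor):  # LOF does not implement predict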
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")

        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()
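
The scatter call in the loop colors points with `colors[(y_pred + 1) // 2]`, which relies on the estimators returning -1 for outliers and +1 for inliers. As a minimal sketch of that mapping, using Isolation Forest on one of the blob datasets (the names `X_in`, `color_index`, and `flagged_fraction` are introduced here only for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

n_samples, outliers_fraction = 300, 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# Two-blob inliers plus uniform noise, mirroring the loop above
X_in = make_blobs(
    centers=[[2, 2], [-2, -2]],
    cluster_std=0.5,
    random_state=0,
    n_samples=n_inliers,
    n_features=2,
)[0]
rng = np.random.RandomState(42)
X = np.concatenate([X_in, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)

detector = IsolationForest(contamination=outliers_fraction, random_state=42)
y_pred = detector.fit(X).predict(X)  # -1 for outliers, +1 for inliers

# Map labels to color indices: -1 -> 0 ("#377eb8"), +1 -> 1 ("#ff7f00")
color_index = (y_pred + 1) // 2

# Share of points flagged as outliers; contamination sets this near 15%
flagged_fraction = np.mean(y_pred == -1)
print(np.unique(color_index), round(float(flagged_fraction), 3))
```

Because `contamination` sets the score threshold from the training data, the flagged share stays close to the requested fraction regardless of how well the flagged points match the injected noise.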

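Summary

In this lab, we compared several outlier detection algorithms on two-dimensional datasets. The datasets contained one or two modes (regions of high density) to illustrate how well each algorithm copes with multimodal data. Decision boundaries between inliers and outliers were drawn in black for every algorithm except Local Outlier Factor (LOF), which has no predict method that can be applied to new data when it is used for outlier detection. sklearn.svm.OneClassSVM is sensitive to outliers and therefore did not perform very well for outlier detection. sklearn.linear_model.SGDOneClassSVM is an implementation of the One-Class SVM based on stochastic gradient descent (SGD). sklearn.covariance.EllipticEnvelope assumes the data follow a Gaussian distribution and learns an ellipse, while sklearn.ensemble.IsolationForest and sklearn.neighbors.LocalOutlierFactor appeared to perform reasonably well on the multimodal datasets.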
