Anomaly detection algorithms | Two-dimensional datasets

Introduction

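This lab compares several anomaly detection algorithms on two-dimensional datasets. The datasets contain one or two modes (regions of high density) and illustrate how well each algorithm copes with multimodal data. For each dataset, 15% of the samples are generated as random uniform noise. The decision boundary between inliers and outliers is drawn in black for every algorithm except Local Outlier Factor (LOF), which has no predict method that can be applied to new data when it is used for outlier detection.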
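The LOF caveat can be illustrated with a minimal sketch. The training data, the n_neighbors value, and the two query points are assumptions chosen for illustration and are not part of the lab. In its default outlier-detection mode, LocalOutlierFactor only labels the data it was fitted on, through fit_predict. Scoring unseen points requires constructing the estimator with novelty=True.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 2))          # assumed training data
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])   # assumed query points

# Outlier-detection mode (novelty=False is the default):
# labels are available only for the training data, via fit_predict.
lof = LocalOutlierFactor(n_neighbors=20)
train_labels = lof.fit_predict(X_train)
has_predict = hasattr(lof, "predict")  # predict is not exposed in this mode

# Novelty-detection mode: fit on training data, then predict on new points.
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
new_labels = lof_novelty.predict(X_new)

print("predict available in outlier-detection mode:", has_predict)
print("labels for new points in novelty mode:", new_labels)
```

This is why the lab can scatter LOF's labels for the training points but cannot evaluate LOF on the plotting grid to draw a boundary.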

Import the required libraries

Import the libraries needed for this lab.

import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_moons, make_blobs
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

Set the parameters

Set the sample count and the outlier fraction used throughout this lab.

n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

Define the anomaly detection algorithms

Define the anomaly detection algorithms to compare.

anomaly_algorithms = [
    (
        "Robust covariance",
        EllipticEnvelope(contamination=outliers_fraction, random_state=42),
    ),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    (
        "One-Class SVM (SGD)",
        make_pipeline(
            Nystroem(gamma=0.1, random_state=42, n_components=150),
            SGDOneClassSVM(
                nu=outliers_fraction,
                shuffle=True,
                fit_intercept=True,
                random_state=42,
                tol=1e-6,
            ),
        ),
    ),
    (
        "Isolation Forest",
        IsolationForest(contamination=outliers_fraction, random_state=42),
    ),
    (
        "Local Outlier Factor",
        LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),
    ),
]
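The role of the contamination parameter can be checked with a short sketch. The data below is hypothetical (Gaussian inliers plus uniform noise), and the variable names are introduced only for this illustration. For IsolationForest and LocalOutlierFactor, contamination sets the score threshold so that roughly that fraction of the training samples is labeled -1. The nu parameter plays a related role for the One-Class SVM variants, where it bounds the fraction of training errors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: 255 Gaussian inliers plus 45 uniform noise points.
X = np.concatenate(
    [
        rng.normal(scale=0.5, size=(255, 2)),
        rng.uniform(low=-6, high=6, size=(45, 2)),
    ],
    axis=0,
)
outliers_fraction = 0.15

iso = IsolationForest(contamination=outliers_fraction, random_state=42)
iso_labels = iso.fit(X).predict(X)

lof = LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction)
lof_labels = lof.fit_predict(X)

# Fraction of training points labeled as outliers (-1) by each estimator.
iso_flagged = float(np.mean(iso_labels == -1))
lof_flagged = float(np.mean(lof_labels == -1))
print(f"Isolation Forest flagged {iso_flagged:.2%}, LOF flagged {lof_flagged:.2%}")
```

Both fractions should land close to outliers_fraction, regardless of how many points are truly anomalous, so the parameter expresses a prior belief rather than something the estimators learn.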

Define the datasets

Define the datasets used in this lab.

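# Shared settings for the blob datasets (inliers only; noise is added later).
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    # One mode: both centers coincide at the origin.
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
    # Two well-separated modes with the same spread.
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
    # Two modes with very different spreads.
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
    # Two interleaved half-moons, shifted and scaled to fill the plot area.
    4.0
    * (
        make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
        - np.array([0.5, 0.25])
    ),
    # Uniform points over [-7, 7) in both directions, with no dense region.
    14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),
]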
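A quick worked calculation shows how many points each dataset ends up with. The blob datasets are created with n_inliers points, whereas make_moons and the uniform dataset are created with n_samples points. Because the comparison step adds n_outliers uniform noise points to every dataset, the share of added noise is exactly 15% only for the blobs. The names ending in _total and _share are introduced here for illustration.

```python
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# Blob datasets: n_inliers base points plus n_outliers added noise points.
blobs_total = n_inliers + n_outliers
# Moons and uniform datasets: n_samples base points plus n_outliers noise points.
moons_total = n_samples + n_outliers

# Share of each final dataset that comes from the added uniform noise.
blobs_share = n_outliers / blobs_total
moons_share = n_outliers / moons_total

print(blobs_total, round(blobs_share, 3))
print(moons_total, round(moons_share, 3))
```

For the moons and uniform datasets the added noise is therefore slightly below the contamination value passed to the estimators, which is harmless for a visual comparison but worth keeping in mind when reading the plots.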

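Compare the algorithms

Fit each algorithm on every dataset, time the fit, and plot the predicted labels together with the decision boundary.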

xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))

plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1
rng = np.random.RandomState(42)

for i_dataset, X in enumerate(datasets):
    # Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)

    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        # Fit the data and tag outliers
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)

        # Plot the level lines and the points
        if name != "Local Outlier Factor":  # LOF does not implement predict
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")

        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()
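Two small steps in the plotting loop are easy to miss: how labels become colors, and why the contour is drawn at level 0. The sketch below isolates both. The function toy_predict is a hypothetical stand-in for algorithm.predict, and the label array is made up for illustration.

```python
import numpy as np

# Labels as returned by predict / fit_predict: +1 for inliers, -1 for outliers.
y_pred = np.array([1, -1, 1, -1])
colors = np.array(["#377eb8", "#ff7f00"])

# (y_pred + 1) // 2 maps -1 to 0 and +1 to 1, so it can index the color array:
# outliers get colors[0] (blue) and inliers get colors[1] (orange).
color_index = (y_pred + 1) // 2
point_colors = colors[color_index]

# Same grid as in the lab: 150 x 150 points covering [-7, 7] in both directions.
xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))
grid_points = np.c_[xx.ravel(), yy.ravel()]  # one (x, y) row per grid node


def toy_predict(points):
    """Hypothetical stand-in for algorithm.predict: +1 inside a circle of radius 3."""
    inside = np.hypot(points[:, 0], points[:, 1]) <= 3.0
    return np.where(inside, 1, -1)


# Reshape the flat predictions back to the grid shape expected by plt.contour.
Z = toy_predict(grid_points).reshape(xx.shape)
print(point_colors)
print(grid_points.shape, Z.shape, np.unique(Z))
```

Since Z only takes the values -1 and +1, the single contour level 0 lies between them and traces the border separating the predicted inlier region from the predicted outlier region.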

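Summary

This lab compared several anomaly detection algorithms on two-dimensional datasets. The datasets contained one or two modes (regions of high density) and illustrated how well each algorithm copes with multimodal data. The decision boundary between inliers and outliers was drawn in black for every algorithm except Local Outlier Factor (LOF), which has no predict method that can be applied to new data when it is used for outlier detection. sklearn.svm.OneClassSVM is sensitive to outliers and therefore did not perform very well for outlier detection. sklearn.linear_model.SGDOneClassSVM is an implementation of the One-Class SVM based on stochastic gradient descent (SGD). sklearn.covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. sklearn.ensemble.IsolationForest and sklearn.neighbors.LocalOutlierFactor appeared to perform reasonably well on the multimodal datasets.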
