异常检测算法比较

Machine LearningMachine LearningBeginner
立即练习

This tutorial is from open-source community. Access the source code

💡 本教程由 AI 辅助翻译自英文原版。如需查看原文,您可以 切换至英文原版

简介

本实验在二维数据集上比较了不同的异常检测算法。这些数据集包含一个或两个模式(高密度区域),以说明算法处理多模态数据的能力。对于每个数据集,15% 的样本被生成为随机均匀噪声。除局部离群因子(Local Outlier Factor,LOF)外,内点和外点之间的决策边界以黑色显示,因为在用于异常检测时,它没有可应用于新数据的预测方法。

虚拟机使用提示

虚拟机启动完成后,点击左上角切换到“笔记本”标签页,以访问 Jupyter Notebook 进行练习。

有时,你可能需要等待几秒钟让 Jupyter Notebook 完成加载。由于 Jupyter Notebook 的限制,操作验证无法自动化。

如果你在学习过程中遇到问题,随时向 Labby 提问。课程结束后提供反馈,我们会及时为你解决问题。


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL ml(("Machine Learning")) -.-> ml/FrameworkandSoftwareGroup(["Framework and Software"]) sklearn(("Sklearn")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["Core Models and Algorithms"]) sklearn(("Sklearn")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["Data Preprocessing and Feature Engineering"]) sklearn(("Sklearn")) -.-> sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup(["Advanced Data Analysis and Dimensionality Reduction"]) sklearn(("Sklearn")) -.-> sklearn/UtilitiesandDatasetsGroup(["Utilities and Datasets"]) sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/linear_model("Linear Models") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/neighbors("Nearest Neighbors") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/ensemble("Ensemble Methods") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/pipeline("Pipeline") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/kernel_approximation("Kernel Approximation") sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup -.-> sklearn/covariance("Covariance Estimators") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("Datasets") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("scikit-learn") subgraph Lab Skills sklearn/linear_model -.-> lab-49065{{"异常检测算法比较"}} sklearn/neighbors -.-> lab-49065{{"异常检测算法比较"}} sklearn/ensemble -.-> lab-49065{{"异常检测算法比较"}} sklearn/pipeline -.-> lab-49065{{"异常检测算法比较"}} sklearn/kernel_approximation -.-> lab-49065{{"异常检测算法比较"}} sklearn/covariance -.-> lab-49065{{"异常检测算法比较"}} sklearn/datasets -.-> lab-49065{{"异常检测算法比较"}} ml/sklearn -.-> lab-49065{{"异常检测算法比较"}} end

导入所需库

导入本实验所需的必要库。

import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_moons, make_blobs
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

设置参数

设置本实验所需的参数。

n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

定义异常检测算法

定义要比较的异常检测算法。

anomaly_algorithms = [
    (
        "Robust covariance",
        EllipticEnvelope(contamination=outliers_fraction, random_state=42),
    ),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    (
        "One-Class SVM (SGD)",
        make_pipeline(
            Nystroem(gamma=0.1, random_state=42, n_components=150),
            SGDOneClassSVM(
                nu=outliers_fraction,
                shuffle=True,
                fit_intercept=True,
                random_state=42,
                tol=1e-6,
            ),
        ),
    ),
    (
        "Isolation Forest",
        IsolationForest(contamination=outliers_fraction, random_state=42),
    ),
    (
        "Local Outlier Factor",
        LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),
    ),
]

定义数据集

定义本实验所用的数据集。

blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
    4.0
    * (
        make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
        - np.array([0.5, 0.25])
    ),
    14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),
]

比较分类器

在给定设置下比较给定的分类器。

xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))

plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1
rng = np.random.RandomState(42)

for i_dataset, X in enumerate(datasets):
    ## 添加离群点
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)

    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        ## 拟合数据并标记离群点
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)

        ## 绘制等高线和点
        if name!= "Local Outlier Factor":  ## LOF 不实现 predict 方法
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")

        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()

总结

本实验在二维数据集上比较了不同的异常检测算法。这些数据集包含一种或两种模式(高密度区域),以说明算法处理多模态数据的能力。除了局部离群因子(LOF)外,内点和离群点之间的决策边界以黑色显示,因为在用于离群点检测时,它没有可应用于新数据的预测方法。发现 :class:~sklearn.svm.OneClassSVM 对离群点敏感,因此在离群点检测方面表现不是很好。:class:sklearn.linear_model.SGDOneClassSVM 是基于随机梯度下降(SGD)的一类支持向量机(One-Class SVM)的实现。:class:sklearn.covariance.EllipticEnvelope 假设数据是高斯分布的,并学习一个椭圆,并且 :class:~sklearn.ensemble.IsolationForest 和 :class:~sklearn.neighbors.LocalOutlierFactor 对于多模态数据集似乎表现得相当不错。