Anomaly Detection Algorithms Comparison

Introduction

This lab compares different anomaly detection algorithms on two-dimensional datasets. The datasets contain one or two modes (regions of high density) to illustrate the ability of algorithms to cope with multimodal data. For each dataset, 15% of samples are generated as random uniform noise. Decision boundaries between inliers and outliers are displayed in black except for Local Outlier Factor (LOF) as it has no predict method to be applied on new data when it is used for outlier detection.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.

Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("`Sklearn`")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["`Data Preprocessing and Feature Engineering`"]) sklearn(("`Sklearn`")) -.-> sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup(["`Advanced Data Analysis and Dimensionality Reduction`"]) sklearn(("`Sklearn`")) -.-> sklearn/UtilitiesandDatasetsGroup(["`Utilities and Datasets`"]) sklearn(("`Sklearn`")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["`Core Models and Algorithms`"]) ml(("`Machine Learning`")) -.-> ml/FrameworkandSoftwareGroup(["`Framework and Software`"]) sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/pipeline("`Pipeline`") sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup -.-> sklearn/covariance("`Covariance Estimators`") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("`Datasets`") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/ensemble("`Ensemble Methods`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/kernel_approximation("`Kernel Approximation`") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/linear_model("`Linear Models`") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/neighbors("`Nearest Neighbors`") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("`scikit-learn`") subgraph Lab Skills sklearn/pipeline -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/covariance -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/datasets -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/ensemble -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/kernel_approximation -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/linear_model -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} sklearn/neighbors -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} ml/sklearn -.-> lab-49065{{"`Anomaly Detection Algorithms Comparison`"}} end

Import Required Libraries

Import the necessary libraries for the lab.

import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_moons, make_blobs
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

Set Parameters

Set the required parameters for the lab.

n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

Define Anomaly Detection Algorithms

Define the anomaly detection algorithms to be compared.

anomaly_algorithms = [
    (
        "Robust covariance",
        EllipticEnvelope(contamination=outliers_fraction, random_state=42),
    ),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    (
        "One-Class SVM (SGD)",
        make_pipeline(
            Nystroem(gamma=0.1, random_state=42, n_components=150),
            SGDOneClassSVM(
                nu=outliers_fraction,
                shuffle=True,
                fit_intercept=True,
                random_state=42,
                tol=1e-6,
            ),
        ),
    ),
    (
        "Isolation Forest",
        IsolationForest(contamination=outliers_fraction, random_state=42),
    ),
    (
        "Local Outlier Factor",
        LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),
    ),
]

Define Datasets

Define the datasets for the lab.

blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
    4.0
    * (
        make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
        - np.array([0.5, 0.25])
    ),
    14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),
]

Compare Classifiers

Compare the given classifiers under the given settings.

xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))

plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1
rng = np.random.RandomState(42)

for i_dataset, X in enumerate(datasets):
    ## Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)

    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        ## fit the data and tag outliers
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)

        ## plot the levels lines and the points
        if name != "Local Outlier Factor":  ## LOF does not implement predict
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")

        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()

Summary

This lab compared different anomaly detection algorithms on two-dimensional datasets. The datasets contained one or two modes (regions of high density) to illustrate the ability of algorithms to cope with multimodal data. Decision boundaries between inliers and outliers were displayed in black except for Local Outlier Factor (LOF) as it had no predict method to be applied on new data when it was used for outlier detection. The :class:~sklearn.svm.OneClassSVM was found to be sensitive to outliers and thus did not perform very well for outlier detection. The :class:sklearn.linear_model.SGDOneClassSVM was an implementation of the One-Class SVM based on stochastic gradient descent (SGD). The :class:sklearn.covariance.EllipticEnvelope assumed the data was Gaussian and learned an ellipse, and :class:~sklearn.ensemble.IsolationForest and :class:~sklearn.neighbors.LocalOutlierFactor seemed to perform reasonably well for multi-modal data sets.