Robust Covariance Estimation in Python

Introduction

In this lab, you will learn how to use the scikit-learn library in Python to estimate robust covariance matrices. The tutorial will introduce you to the concept of robust covariance estimation and demonstrate how it can be used to estimate the covariance matrix of datasets that are contaminated with outliers.

Import Libraries

The first step is to import the required libraries. In this tutorial, we will use NumPy, Matplotlib, and scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import MinCovDet, EmpiricalCovariance

Generate Data

In this step, we generate a random dataset of 80 samples and 5 features drawn from a standard normal distribution. We then turn 20 of the samples into outliers by shifting each of their features by +5 or -5.

n_samples = 80
n_features = 5

## Generate random dataset
rng = np.random.RandomState(42)
X = rng.randn(n_samples, n_features)

## Add outliers to the dataset
n_outliers = 20
outliers_index = rng.permutation(n_samples)[:n_outliers]
outliers_offset = 10.0 * (
    rng.randint(2, size=(n_outliers, n_features)) - 0.5
)
X[outliers_index] += outliers_offset
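
As a quick sanity check on the contamination step, the sketch below rebuilds the dataset and compares per-feature variances before and after the outliers are added. The names `X_clean` and `offset` are introduced here for illustration only, and the offsets are drawn from the seeded `rng` so the result is reproducible.

```python
import numpy as np

# Rebuild the dataset, keeping an uncontaminated copy for comparison
rng = np.random.RandomState(42)
n_samples, n_features, n_outliers = 80, 5, 20
X_clean = rng.randn(n_samples, n_features)

# Shift every feature of the selected samples by +5 or -5
outliers_index = rng.permutation(n_samples)[:n_outliers]
offset = 10.0 * (rng.randint(2, size=(n_outliers, n_features)) - 0.5)
X = X_clean.copy()
X[outliers_index] += offset

print("Distinct offset values:", np.unique(offset))
print("Contaminated fraction:", n_outliers / n_samples)
print("Per-feature variance, clean:       ", X_clean.var(axis=0).round(2))
print("Per-feature variance, contaminated:", X.var(axis=0).round(2))
```

The contaminated variances should be several times larger than the clean ones, which is the distortion a robust estimator is meant to resist.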

Estimate Robust Covariance Matrix

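In this step, we estimate a robust covariance matrix of the dataset using the Minimum Covariance Determinant (MCD) estimator. MCD looks for the subset of observations whose empirical covariance has the smallest determinant and bases its estimate on that subset, which limits the influence of outliers.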

## Estimate a robust covariance matrix of the dataset
mcd = MinCovDet().fit(X)
robust_cov = mcd.covariance_
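
Besides `covariance_`, a fitted `MinCovDet` exposes the robust location, the mask of observations it relied on, and squared Mahalanobis distances. The following is a minimal sketch of inspecting them on a dataset built like the one above; `random_state=0` and the helper mask `is_outlier` are assumptions added here for reproducibility and illustration.

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Rebuild a contaminated dataset like the one in this tutorial
rng = np.random.RandomState(42)
n_samples, n_features, n_outliers = 80, 5, 20
X = rng.randn(n_samples, n_features)
outliers_index = rng.permutation(n_samples)[:n_outliers]
X[outliers_index] += 10.0 * (rng.randint(2, size=(n_outliers, n_features)) - 0.5)

mcd = MinCovDet(random_state=0).fit(X)

# Mark which samples were deliberately shifted
is_outlier = np.zeros(n_samples, dtype=bool)
is_outlier[outliers_index] = True

# support_ is a boolean mask of the observations used for the robust estimates
print("Observations in support:", mcd.support_.sum())
print("Planted outliers in support:", (mcd.support_ & is_outlier).sum())

# mahalanobis() returns squared distances under the robust location and covariance
d2 = mcd.mahalanobis(X)
print("Median squared distance, inliers: ", np.median(d2[~is_outlier]))
print("Median squared distance, outliers:", np.median(d2[is_outlier]))
```

If the fit behaves as intended, the planted outliers should sit far from the robust center and have much larger distances than the inliers.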

Estimate Empirical Covariance Matrix

In this step, we estimate the empirical covariance matrix of the dataset, which is the maximum likelihood estimate (MLE) under a Gaussian model. Unlike MCD, it uses every sample, so outliers can distort it.

## Estimate an empirical covariance matrix of the dataset
emp_cov = EmpiricalCovariance().fit(X).covariance_
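
To make the maximum likelihood estimate concrete, the sketch below compares `EmpiricalCovariance` with NumPy's `np.cov` using `bias=True`, which divides by the number of samples rather than by one less. The array `X_demo` is a stand-in dataset created for this illustration.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

# Stand-in data for illustration (not the tutorial's contaminated dataset)
rng = np.random.RandomState(0)
X_demo = rng.randn(80, 5)

emp = EmpiricalCovariance().fit(X_demo)

# bias=True normalizes by n_samples, matching the maximum likelihood estimate
np_cov = np.cov(X_demo, rowvar=False, bias=True)

print("Covariance matches np.cov:", np.allclose(emp.covariance_, np_cov))
print("Location matches the mean:", np.allclose(emp.location_, X_demo.mean(axis=0)))
```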

Compare Covariance Matrices

In this step, we compare the estimated robust and empirical covariance matrices of the dataset.

## Compare the estimated covariance matrices
print("Robust Covariance Matrix:")
print(robust_cov)
print("\nEmpirical Covariance Matrix:")
print(emp_cov)
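
Printed matrices can be hard to compare by eye. Because the inliers here are drawn from a standard normal distribution, their true covariance is the identity matrix, so one way to summarize the comparison is the Frobenius distance of each estimate from the identity. This is a sketch under that assumption; `true_cov`, `emp_err`, `robust_err`, and `random_state=0` are names and settings introduced for illustration.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# Rebuild a contaminated dataset like the one in this tutorial
rng = np.random.RandomState(42)
n_samples, n_features, n_outliers = 80, 5, 20
X = rng.randn(n_samples, n_features)
outliers_index = rng.permutation(n_samples)[:n_outliers]
X[outliers_index] += 10.0 * (rng.randint(2, size=(n_outliers, n_features)) - 0.5)

# Covariance of the uncontaminated samples (standard normal features)
true_cov = np.eye(n_features)

emp_cov = EmpiricalCovariance().fit(X).covariance_
robust_cov = MinCovDet(random_state=0).fit(X).covariance_

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
emp_err = np.linalg.norm(emp_cov - true_cov)
robust_err = np.linalg.norm(robust_cov - true_cov)

print("Frobenius error, MLE:", round(emp_err, 3))
print("Frobenius error, MCD:", round(robust_err, 3))
```

A smaller error for MCD indicates that the robust estimate stays closer to the covariance of the uncontaminated data.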

Visualize Results

In this step, we plot the first two features of the dataset and overlay an ellipse for each covariance estimate.

## Visualize the results
fig, ax = plt.subplots()

## Plot the dataset
inliers_index = np.arange(n_samples)[~np.isin(np.arange(n_samples), outliers_index)]
ax.scatter(
    X[inliers_index, 0], X[inliers_index, 1], color="black", label="Inliers"
)
ax.scatter(X[outliers_index, 0], X[outliers_index, 1], color="red", label="Outliers")

## Plot the estimated covariance matrices (first two features only)
from matplotlib.patches import Ellipse

for covariance, center, color, label in zip(
    [emp_cov, robust_cov],
    [X.mean(axis=0), mcd.location_],
    ["green", "magenta"],
    ["MLE", "MCD"],
):
    ## Restrict to the 2x2 block of the plotted features; eigh returns
    ## eigenvalues in ascending order and eigenvectors as columns
    v, w = np.linalg.eigh(covariance[:2, :2])
    u = w[:, 0]
    angle = np.degrees(np.arctan2(u[1], u[0]))
    v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
    ell = Ellipse(
        xy=center[:2],
        width=v[0],
        height=v[1],
        angle=angle,
        color=color,
        label=label,
        alpha=0.2,
    )
    ax.add_patch(ell)

## Set plot options
plt.legend()
plt.title("Robust Covariance Estimation")
plt.show()

Summary

In this tutorial, you have learned how to use the scikit-learn library in Python to estimate robust covariance matrices. You have also learned how to use the Minimum Covariance Determinant (MCD) estimator to estimate the covariance matrix of datasets that are contaminated with outliers.
