Introduction
In this lab, you will learn how to use the scikit-learn library in Python to estimate robust covariance matrices. The tutorial will introduce you to the concept of robust covariance estimation and demonstrate how it can be used to estimate the covariance matrix of datasets that are contaminated with outliers.
VM Tips
After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.
Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.
If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.
Import Libraries
The first step is to import the required libraries. In this tutorial, we will use NumPy, Matplotlib, and scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import MinCovDet, EmpiricalCovariance
Generate Data
In this step, we generate a random dataset with n_samples samples and n_features features. We also add some outliers to the dataset.
n_samples = 80
n_features = 5
## Generate random dataset
rng = np.random.RandomState(42)
X = rng.randn(n_samples, n_features)
## Add outliers to the dataset
n_outliers = 20
outliers_index = rng.permutation(n_samples)[:n_outliers]
outliers_offset = 10.0 * (
np.random.randint(2, size=(n_outliers, n_features)) - 0.5
)
X[outliers_index] += outliers_offset
Estimate Robust Covariance Matrix
In this step, we estimate a robust covariance matrix of the dataset using the Minimum Covariance Determinant (MCD) estimator.
## Estimate a robust covariance matrix of the dataset
mcd = MinCovDet().fit(X)
robust_cov = mcd.covariance_
Estimate Empirical Covariance Matrix
In this step, we estimate an empirical covariance matrix of the dataset using the Maximum Likelihood Estimate (MLE) estimator.
## Estimate an empirical covariance matrix of the dataset
emp_cov = EmpiricalCovariance().fit(X).covariance_
Compare Covariance Matrices
In this step, we compare the estimated robust and empirical covariance matrices of the dataset.
## Compare the estimated covariance matrices
print("Robust Covariance Matrix:")
print(robust_cov)
print("\nEmpirical Covariance Matrix:")
print(emp_cov)
Visualize Results
In this step, we visualize the results of the robust and empirical covariance estimation.
## Visualize the results
fig, ax = plt.subplots()
## Plot the dataset
inliers_index = np.arange(n_samples)[~np.in1d(np.arange(n_samples), outliers_index)]
ax.scatter(
X[inliers_index, 0], X[inliers_index, 1], color="black", label="Inliers"
)
ax.scatter(X[outliers_index, 0], X[outliers_index, 1], color="red", label="Outliers")
## Plot the estimated covariance matrices
for covariance, color, label in zip(
[emp_cov, robust_cov], ["green", "magenta"], ["MLE", "MCD"]
):
v, w = np.linalg.eigh(covariance)
u = w[0] / np.linalg.norm(w[0])
angle = np.arctan2(u[1], u[0])
angle = 180 * angle / np.pi
v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
ell = mpl.patches.Ellipse(
mcd.location_,
v[0],
v[1],
180 + angle,
color=color,
label=label,
alpha=0.2,
)
ell.set_clip_box(ax.bbox)
ell.set_facecolor(color)
ax.add_artist(ell)
## Set plot options
plt.legend()
plt.title("Robust Covariance Estimation")
plt.show()
Summary
In this tutorial, you have learned how to use the scikit-learn library in Python to estimate robust covariance matrices. You have also learned how to use the Minimum Covariance Determinant (MCD) estimator to estimate the covariance matrix of datasets that are contaminated with outliers.