Comparison of F-Test and Mutual Information


This tutorial is from the open-source community; you can access its source code.

Introduction

In this lab, we will learn about the differences between univariate F-test statistics and mutual information. We will use the scikit-learn library to perform F-test and mutual information regression on a dataset and compare the results.

VM Tips

After the VM has started, click the Notebook tab in the top left corner to access Jupyter Notebook for practice.

Sometimes you may need to wait a few seconds for Jupyter Notebook to finish loading. Because of limitations in Jupyter Notebook, the validation of operations cannot be automated.

If you face issues while learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

This lab practices the following skills: Feature Selection (sklearn/feature_selection) and scikit-learn (ml/sklearn).

Import libraries

We will start by importing the necessary libraries for this lab: numpy, matplotlib, and the feature selection utilities from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression, mutual_info_regression

Create dataset

We will create a dataset of 1000 samples with 3 features: the first feature has a linear relationship with the target, the second has a non-linear (sinusoidal) relationship with the target, and the third is completely irrelevant.

np.random.seed(0)  # fix the random seed for reproducibility
X = np.random.rand(1000, 3)  # 1000 samples, 3 features drawn uniformly from [0, 1)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)  # the third feature is never used
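
As an optional sanity check that is not part of the original lab, you can look at the plain Pearson correlation between each feature and the target. The F-test in the next step is driven by exactly this kind of linear association, so these numbers preview its ranking.

for i in range(3):
    # Pearson correlation between feature i and the target
    r = np.corrcoef(X[:, i], y)[0, 1]
    print("Correlation of x{} with y: {:+.2f}".format(i + 1, r))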

Calculate F-test

We will now calculate the F-test score for each feature. The F-test captures only linear dependency between variables. We then normalize the scores by dividing them by the maximum F-test score.

f_test, _ = f_regression(X, y)  # univariate F-statistics (p-values are discarded)
f_test /= np.max(f_test)  # normalize so the highest score is 1
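
Under the hood, f_regression derives each F statistic from the correlation between that feature and the target, so the normalized scores should echo the correlations computed above. As an optional check, you can inspect them numerically before plotting.

print("Normalized F-test scores:", np.round(f_test, 2))
print("Feature ranked highest by the F-test: x{}".format(np.argmax(f_test) + 1))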

Calculate mutual information

We will now calculate the mutual information score for each feature. Mutual information can capture any kind of dependency between variables. We will normalize the mutual information scores by dividing them by the maximum mutual information score.

mi = mutual_info_regression(X, y)  # estimated mutual information between each feature and y
mi /= np.max(mi)  # normalize so the highest score is 1
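
Before plotting, you can also print both normalized score vectors side by side; this optional snippet simply makes the comparison explicit in text form.

for i in range(3):
    print("x{}: F-test={:.2f}, MI={:.2f}".format(i + 1, f_test[i], mi[i]))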

Plot the results

We will now plot the target against each feature and annotate each plot with the corresponding F-test and mutual information scores.

plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y, edgecolor="black", s=20)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]), fontsize=16)
plt.show()
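
Beyond this visual comparison, both scoring functions can be plugged directly into scikit-learn's feature selection utilities. The sketch below is not part of the original lab; it uses SelectKBest to keep the two highest-scoring features under each criterion, and because the two tests rank the features differently, the retained columns may differ.

from sklearn.feature_selection import SelectKBest

# Keep the top 2 features according to each scoring function
X_f = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)
X_mi = SelectKBest(score_func=mutual_info_regression, k=2).fit_transform(X, y)
print(X_f.shape, X_mi.shape)  # both (1000, 2); the selected columns may differ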

Summary

In this lab, we learned about the differences between univariate F-test statistics and mutual information. We applied F-test and mutual information regression to a dataset and compared the results: the F-test captures only linear dependency between variables, whereas mutual information can capture any kind of dependency.
