Recursive Feature Elimination With Cross-Validation


This tutorial is from the open-source community. You can access the source code.

Introduction

In this lab, we will go through a step-by-step process of implementing Recursive Feature Elimination with Cross-Validation (RFECV) using scikit-learn. RFECV is used for feature selection, which is the process of selecting a subset of relevant features for use in model construction. We will use a classification task with 15 features, out of which 3 are informative, 2 are redundant, and 10 are non-informative.

VM Tips

After the VM startup is complete, click the top-left corner to switch to the Notebook tab and access Jupyter Notebook for practice.

Sometimes you may need to wait a few seconds for Jupyter Notebook to finish loading. Because of limitations in Jupyter Notebook, the validation of operations cannot be automated.

If you run into issues while learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

This lab covers the following skills: sklearn/feature_selection (Feature Selection), sklearn/datasets (Datasets), sklearn/linear_model (Linear Models), sklearn/model_selection (Model Selection), and ml/sklearn (scikit-learn).

Data generation

We will generate a classification task using scikit-learn's make_classification function: 500 samples with 15 features, out of which 3 are informative, 2 are redundant, and 10 are non-informative. The task has 8 classes, each formed by a single cluster, with a class separation of 0.8.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=15,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    class_sep=0.8,
    random_state=0,
)
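As a quick sanity check (not part of the original example), we can inspect the shape of the generated data and the per-class sample counts:

import numpy as np

print(X.shape)         ## (500, 15): 500 samples, 15 features
print(np.bincount(y))  ## samples per class; roughly balanced across the 8 classes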

Model training and selection

We will create the RFECV object and compute the cross-validated scores. The scoring strategy "accuracy" optimizes the proportion of correctly classified samples. We will use logistic regression as the estimator and stratified k-fold cross-validation with 5 folds.

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

min_features_to_select = 1  ## Minimum number of features to consider
clf = LogisticRegression()
cv = StratifiedKFold(5)

rfecv = RFECV(
    estimator=clf,
    step=1,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=min_features_to_select,
    n_jobs=2,
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")

Plot number of features vs. cross-validation scores

We will use matplotlib to plot the number of features selected against the mean cross-validation accuracy, with error bars showing the standard deviation of the test scores across folds.

import matplotlib.pyplot as plt

n_scores = len(rfecv.cv_results_["mean_test_score"])
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Mean test accuracy")
plt.errorbar(
    range(min_features_to_select, n_scores + min_features_to_select),
    rfecv.cv_results_["mean_test_score"],
    yerr=rfecv.cv_results_["std_test_score"],
)
plt.title("Recursive Feature Elimination \nwith correlated features")
plt.show()
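The fitted RFECV object can also be used as a transformer to keep only the selected features, for example before training a downstream model. A minimal sketch:

X_reduced = rfecv.transform(X)  ## drop the features RFECV eliminated
print(X_reduced.shape)          ## (500, rfecv.n_features_) -- (500, 3) in this example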

Summary

In this lab, we went through the process of implementing Recursive Feature Elimination with Cross-Validation (RFECV) using scikit-learn. We generated a classification task with 15 features, out of which 3 were informative, 2 were redundant, and 10 were non-informative. We used logistic regression as the estimator and stratified k-fold cross-validation with 5 folds. We plotted the number of features selected against the cross-validation scores. We found that the optimal number of features was 3, which corresponded to the true generative model. We also noticed a plateau of equivalent scores for 3 to 5 selected features due to the introduction of correlated features.
