Building Machine Learning Pipelines with Scikit-Learn

Machine LearningMachine LearningBeginner
Practice Now

This tutorial is from open-source community. Access the source code

Introduction

In machine learning, a pipeline is a series of steps that are performed sequentially to transform the input data and then build a model. Scikit-learn provides a pipeline class that can be used to chain multiple processing steps together, making it easy to build complex models that involve multiple preprocessing and modeling steps.

In this tutorial, we will demonstrate how to build a pipeline with feature selection and SVM classification using Scikit-learn. We will show how to integrate feature selection within the pipeline to prevent overfitting, and how to inspect the pipeline to better understand the model.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("`Sklearn`")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["`Data Preprocessing and Feature Engineering`"]) sklearn(("`Sklearn`")) -.-> sklearn/ModelSelectionandEvaluationGroup(["`Model Selection and Evaluation`"]) sklearn(("`Sklearn`")) -.-> sklearn/UtilitiesandDatasetsGroup(["`Utilities and Datasets`"]) sklearn(("`Sklearn`")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["`Core Models and Algorithms`"]) ml(("`Machine Learning`")) -.-> ml/FrameworkandSoftwareGroup(["`Framework and Software`"]) sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/feature_selection("`Feature Selection`") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/metrics("`Metrics`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/pipeline("`Pipeline`") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("`Datasets`") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/model_selection("`Model Selection`") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/svm("`Support Vector Machines`") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("`scikit-learn`") subgraph Lab Skills sklearn/feature_selection -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} sklearn/metrics -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} sklearn/pipeline -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} sklearn/datasets -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} sklearn/model_selection -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} sklearn/svm -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} ml/sklearn -.-> lab-49126{{"`Building Machine Learning Pipelines with Scikit-Learn`"}} end

Generate and Split the Dataset

We will start by generating a binary classification dataset using Scikit-learn's make_classification function. We will also split the dataset into training and testing subsets using Scikit-learn's train_test_split function.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Build the Pipeline

We will now build a pipeline that consists of two steps: feature selection and SVM classification. We will use Scikit-learn's SelectKBest function for feature selection, and Scikit-learn's LinearSVC function for SVM classification. The SelectKBest function selects the k most informative features based on the f_classif method, which computes the ANOVA F-value between each feature and the target variable. We will set k=3 in this example.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

anova_filter = SelectKBest(f_classif, k=3)
clf = LinearSVC(dual="auto")
anova_svm = make_pipeline(anova_filter, clf)

Train the Pipeline

We will now train the pipeline on the training subset using the fit method. During training, the SelectKBest function will select the 3 most informative features based on the ANOVA F-value, and the LinearSVC function will train a linear SVM classifier on the selected features.

anova_svm.fit(X_train, y_train)

Evaluate the Pipeline

We will now evaluate the pipeline on the testing subset using the predict method. The pipeline will select the 3 most informative features based on the ANOVA F-value, and the LinearSVC function will make predictions on the selected features.

from sklearn.metrics import classification_report

y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

Inspect the Pipeline

We can inspect the pipeline to better understand the model. We can use the index of the selected features to retrieve the original feature names.

anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)

Summary

In this tutorial, we demonstrated how to build a pipeline with feature selection and SVM classification using Scikit-learn. We showed how to integrate feature selection within the pipeline to prevent overfitting, and how to inspect the pipeline to better understand the model. Pipelines are a powerful tool for building complex models in a modular and efficient way.

Other Machine Learning Tutorials you may like