Ensemble Methods in Machine Learning

Introduction

In this lab, we explore ensemble methods using scikit-learn. Ensemble methods combine the predictions of multiple models, which often gives better predictive performance than any single one of those models. We focus on two popular ensemble methods: Bagging and Random Forests.

Import Dependencies

Let's start by importing the necessary dependencies.

import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score

Load the Data

Next, we load the iris dataset with scikit-learn's load_iris function. The dataset contains 150 samples, each with four numeric features and a label for one of three iris species.

data = load_iris()
X, y = data.data, data.target
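
A quick look at what load_iris returns can help before modeling. The following is a minimal sketch, not part of the original lab; it only prints the array shapes and the feature and class names stored on the returned object.

```python
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

# X holds one row per flower and one column per measurement
print("Feature matrix shape:", X.shape)
# y holds an integer class label (0, 1, or 2) for each row
print("Target vector shape:", y.shape)

# Human-readable names for the columns of X and the labels in y
print("Features:", data.feature_names)
print("Classes:", list(data.target_names))
```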

Split the Data

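We split the data into training and test sets using scikit-learn's train_test_split function. Setting test_size=0.2 holds out 20% of the samples for testing, and random_state=42 fixes the shuffle so the split is reproducible.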

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
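
To see what the split produces, the sketch below prints the resulting shapes and per-class counts. It also passes stratify=y, which is an optional addition that the lab's own split does not use; it keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an assumption added for illustration; the lab's split omits it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Number of samples per class in each set
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```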

Fit a Bagging Classifier

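Now we fit a Bagging Classifier on the training data. Bagging (bootstrap aggregating) trains each base model, here a decision tree, on a bootstrap sample of the training set, that is, a sample drawn with replacement. The ensemble then aggregates the base models' predictions: scikit-learn's BaggingClassifier averages their predicted class probabilities when the base estimator provides predict_proba, and falls back to majority voting otherwise.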

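# 10 decision trees, each trained on its own bootstrap sample; random_state makes the run reproducible
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)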
bagging.fit(X_train, y_train)
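
The bootstrap-and-aggregate idea can be sketched by hand. The code below is an illustration only, not scikit-learn's internal implementation: it draws bootstrap samples with NumPy, fits one tree per sample, and combines the trees with a plain majority vote. The names rng, trees, and votes are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_estimators = 10
n_classes = len(np.unique(y))
trees = []

for _ in range(n_estimators):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# One row of predictions per tree: shape (n_estimators, n_test_samples)
all_preds = np.stack([tree.predict(X_test) for tree in trees])

# Majority vote: for each test sample, pick the most frequently predicted class
votes = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in all_preds.T]
)

manual_accuracy = (votes == y_test).mean()
print(f"Hand-rolled bagging accuracy: {manual_accuracy:.3f}")
```

Because this sketch votes on hard labels while BaggingClassifier averages probabilities for decision trees, the two can disagree on individual samples.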

Evaluate the Bagging Classifier

Let's evaluate the Bagging Classifier on the test data using its score method, which for classifiers returns the mean accuracy, that is, the fraction of test samples classified correctly.

accuracy = bagging.score(X_test, y_test)
print(f"Bagging Classifier Accuracy: {accuracy}")
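
To make explicit what score computes, the sketch below compares it with accuracy_score applied to the model's predictions. It refits the same model so that it runs on its own; random_state=42 is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging.fit(X_train, y_train)

# score() predicts on X_test and compares the result with y_test internally
accuracy = bagging.score(X_test, y_test)

# The same number, computed in two explicit steps
y_pred = bagging.predict(X_test)
manual_accuracy = accuracy_score(y_test, y_pred)

print(f"score():          {accuracy:.3f}")
print(f"accuracy_score(): {manual_accuracy:.3f}")
```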

Fit a Random Forest Classifier

Next, we fit a Random Forest Classifier on the training data. Like bagging, a Random Forest trains multiple decision trees on bootstrap samples, but it adds further randomness by considering only a random subset of the features at each split.

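# 10 trees, each split chosen from a random subset of features; random_state makes the run reproducible
random_forest = RandomForestClassifier(n_estimators=10, random_state=42)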
random_forest.fit(X_train, y_train)
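
The feature subsampling described above is controlled by the max_features parameter. The sketch below sets it explicitly to "sqrt" (the square root of the number of features) and then prints the fitted model's feature_importances_ attribute; the chosen values are for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_features="sqrt": each split considers sqrt(4) = 2 randomly chosen features
random_forest = RandomForestClassifier(
    n_estimators=10, max_features="sqrt", random_state=42
)
random_forest.fit(X_train, y_train)

# Impurity-based importances, normalized to sum to 1 across features
importances = random_forest.feature_importances_
for name, importance in zip(data.feature_names, importances):
    print(f"{name}: {importance:.3f}")
```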

Evaluate the Random Forest Classifier

Let's evaluate the Random Forest Classifier by computing the accuracy score on the test data.

accuracy = random_forest.score(X_test, y_test)
print(f"Random Forest Classifier Accuracy: {accuracy}")
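
A single test set of 30 samples gives a noisy estimate, so a comparison between models is more reliable with cross-validation. The sketch below uses cross_val_score with 5 folds to compare a single decision tree against both ensembles; the models dictionary and the fixed random_state values are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=10, random_state=42
    ),
    "Random forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold serves once as the held-out set
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a dataset as small and well separated as iris, the three scores may turn out very close; differences between ensembles and single trees tend to show up more clearly on larger, noisier data.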

Summary

In this lab, we explored ensemble methods using scikit-learn. We fit a Bagging Classifier and a Random Forest Classifier on the iris dataset and evaluated their accuracy on a held-out test set. Both methods combine many decision trees trained on bootstrap samples, which reduces the variance of a single tree; Random Forests additionally decorrelate the trees by restricting the features considered at each split.