Plot Forest Iris


This tutorial is adapted from an open-source community example.

Introduction

This lab demonstrates how to plot the decision surfaces of forests of randomized trees on the iris dataset using Python's scikit-learn library. The iris dataset is a commonly used dataset for classification tasks. In this lab, we will compare the decision surfaces learned by a decision tree classifier, a random forest classifier, an extra-trees classifier, and an AdaBoost classifier.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.



Import Libraries

In this step, we will import the necessary libraries required to plot the decision surfaces on the iris dataset.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn.datasets import load_iris
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    AdaBoostClassifier,
)
from sklearn.tree import DecisionTreeClassifier
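
Before moving on, you can optionally verify that scikit-learn imports correctly and see which version is installed. This sanity check is our addition, not part of the original example:

## Optional sanity check: confirm scikit-learn is available and print its version
import sklearn

print("scikit-learn version:", sklearn.__version__)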

Define Parameters

In this step, we will define the parameters required to plot the decision surfaces on the iris dataset.

## Parameters
n_classes = 3
n_estimators = 30
cmap = plt.cm.RdYlBu
plot_step = 0.02  ## fine step width for decision surface contours
plot_step_coarser = 0.5  ## step width for coarse classifier guesses
RANDOM_SEED = 13  ## fix the seed on each iteration

Load Data

In this step, we will load the iris dataset.

## Load data
iris = load_iris()
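
If the dataset is new to you, a quick inspection shows its shape and class names. This step is an optional addition and is not required for the rest of the lab:

## Optional: inspect the loaded dataset
print(iris.data.shape)      ## (150, 4): 150 samples, 4 numeric features
print(iris.feature_names)   ## sepal/petal length and width, in cm
print(iris.target_names)    ## ['setosa' 'versicolor' 'virginica']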

Define Models

In this step, we will define the models to be used for plotting the decision surfaces on the iris dataset.

models = [
    DecisionTreeClassifier(max_depth=None),
    RandomForestClassifier(n_estimators=n_estimators),
    ExtraTreesClassifier(n_estimators=n_estimators),
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators),
]
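
Note that `AdaBoostClassifier` boosts a shallow base learner (here a depth-3 decision tree) rather than growing a single deep tree. The base learner is passed positionally above; if you prefer an explicit keyword, it is named `estimator` in scikit-learn 1.2 and later (older releases used `base_estimator`):

## Equivalent AdaBoost definition with an explicit keyword (scikit-learn >= 1.2)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=n_estimators,
)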

Plot Decision Surfaces

In this step, we will plot the decision surfaces of the four models. For each of three feature pairs, we shuffle and standardize the data, train every model on the two selected features, print its training score, and draw its decision boundary over a fine mesh, producing a 3x4 grid of subplots.

plot_idx = 1

for pair in ([0, 1], [0, 2], [2, 3]):
    for model in models:
        ## We only take the two corresponding features
        X = iris.data[:, pair]
        y = iris.target

        ## Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        ## Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std

        ## Train
        model.fit(X, y)

        scores = model.score(X, y)
        ## Build a short title for each column by slicing the class name out
        ## of str(type(model)): e.g. "<class '...DecisionTreeClassifier'>"
        ## becomes "DecisionTree" after dropping the trailing "'>" and the
        ## "Classifier" suffix
        model_title = str(type(model)).split(".")[-1][:-2][: -len("Classifier")]

        model_details = model_title
        if hasattr(model, "estimators_"):
            model_details += " with {} estimators".format(len(model.estimators_))
        print(model_details + " with features", pair, "has a score of", scores)

        plt.subplot(3, 4, plot_idx)
        if plot_idx <= len(models):
            ## Add a title at the top of each column
            plt.title(model_title, fontsize=9)

        ## Now plot the decision boundary using a fine mesh as input to a
        ## filled contour plot
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(
            np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
        )

        ## Plot either a single DecisionTreeClassifier or alpha blend the
        ## decision surfaces of the ensemble of classifiers
        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = plt.contourf(xx, yy, Z, cmap=cmap)
        else:
            ## Choose an alpha blend level with respect to the number of
            ## estimators in use (noting that AdaBoost can use fewer
            ## estimators than its maximum if it achieves a good enough
            ## fit early on)
            estimator_alpha = 1.0 / len(model.estimators_)
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)

        ## Build a coarser grid to plot a set of ensemble classifications
        ## to show how these differ from what we see in the decision
        ## surfaces. These points are regularly spaced and do not have a
        ## black outline
        xx_coarser, yy_coarser = np.meshgrid(
            np.arange(x_min, x_max, plot_step_coarser),
            np.arange(y_min, y_max, plot_step_coarser),
        )
        Z_points_coarser = model.predict(
            np.c_[xx_coarser.ravel(), yy_coarser.ravel()]
        ).reshape(xx_coarser.shape)
        cs_points = plt.scatter(
            xx_coarser,
            yy_coarser,
            s=15,
            c=Z_points_coarser,
            cmap=cmap,
            edgecolors="none",
        )

        ## Plot the training points; these are clustered together and have
        ## a black outline
        plt.scatter(
            X[:, 0],
            X[:, 1],
            c=y,
            cmap=ListedColormap(["r", "y", "b"]),
            edgecolor="k",
            s=20,
        )
        plot_idx += 1  ## move on to the next plot in sequence

plt.suptitle("Classifiers on feature subsets of the Iris dataset", fontsize=12)
plt.axis("tight")
plt.tight_layout(h_pad=0.2, w_pad=0.2, pad=2.5)
plt.show()
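
If you run this code as a script rather than in Jupyter Notebook, you may also want to persist the figure. Note that `plt.savefig` must be called before `plt.show()`, and the filename below is just an example:

## Optional: save the figure to disk (place this line before plt.show())
plt.savefig("plot_forest_iris.png", dpi=150, bbox_inches="tight")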

Summary

In this lab, we learned how to plot the decision surfaces of forests of randomized trees on the iris dataset using Python's scikit-learn library. We compared the decision surfaces learned by a decision tree classifier, a random forest classifier, an extra-trees classifier, and an AdaBoost classifier, and along the way we practiced loading data, defining models, and plotting decision surfaces in Python.
