Cross-Validation | Prediction Error Display | Scikit-Learn

Introduction

In this lab, we will learn how to use cross-validation to visualize model predictions and errors with the cross_val_predict function and the PredictionErrorDisplay class in scikit-learn. We will load the diabetes dataset, create an instance of a linear regression model, and use cross-validation to obtain an array of predictions. We will then use PredictionErrorDisplay to plot the actual versus predicted values, as well as the residuals versus predicted values.

Load and Prepare Data

First, we will load the diabetes dataset and prepare it for modeling. We will use the load_diabetes function from scikit-learn to load the dataset into two arrays: X, the feature matrix, and y, the target values.

from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
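As a quick sanity check, the shapes of the two arrays can be inspected before modeling. This is a minimal sketch that is not part of the original lab; it only assumes the dataset bundled with scikit-learn.

```python
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# X holds one row per sample and one column per feature;
# y holds one target value per sample.
print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
```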

Create a Linear Regression Model

Next, we will create an instance of a linear regression model using the LinearRegression class from scikit-learn.

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

Generate Cross-Validated Predictions

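We will use the cross_val_predict function from scikit-learn to generate cross-validated predictions. With cv=10, the data is split into 10 folds, and the prediction for each sample comes from a model that was fitted on the other folds, so no sample is predicted by a model that saw it during training.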

from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(lr, X, y, cv=10)
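To make the mechanics concrete, the sketch below reproduces the same predictions with an explicit loop. It assumes that an integer cv for a regressor corresponds to an unshuffled KFold split, which is scikit-learn's documented default; the name manual_pred is introduced here for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = load_diabetes(return_X_y=True)
lr = LinearRegression()

y_pred = cross_val_predict(lr, X, y, cv=10)

# Manual equivalent: fit a fresh copy of the model on each training split
# and store its predictions for the held-out samples only.
manual_pred = np.empty_like(y, dtype=float)
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = clone(lr).fit(X[train_idx], y[train_idx])
    manual_pred[test_idx] = model.predict(X[test_idx])

# Compare the hand-rolled predictions with those from cross_val_predict.
print(np.allclose(manual_pred, y_pred))
```

Each entry of y_pred is therefore an out-of-fold prediction, which is what makes the array suitable for the diagnostic plots in the next step.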

Visualize Prediction Errors

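We will use PredictionErrorDisplay from scikit-learn to visualize the prediction errors. We will plot the actual versus predicted values, as well as the residuals versus predicted values. Setting subsample=100 displays a random subset of 100 points so the scatter plots stay readable, and random_state=0 makes that selection reproducible.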

import matplotlib.pyplot as plt
from sklearn.metrics import PredictionErrorDisplay

fig, axs = plt.subplots(ncols=2, figsize=(8, 4))
PredictionErrorDisplay.from_predictions(
    y,
    y_pred=y_pred,
    kind="actual_vs_predicted",
    subsample=100,
    ax=axs[0],
    random_state=0,
)
axs[0].set_title("Actual vs. Predicted Values")
PredictionErrorDisplay.from_predictions(
    y,
    y_pred=y_pred,
    kind="residual_vs_predicted",
    subsample=100,
    ax=axs[1],
    random_state=0,
)
axs[1].set_title("Residuals vs. Predicted Values")
fig.suptitle("Plotting cross-validated predictions")
plt.tight_layout()
plt.show()
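The residuals shown in the second panel are the actual values minus the predicted values. As a rough sketch, they can also be computed directly from the arrays; the variable name residuals is introduced here for illustration and does not appear in the original lab.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = load_diabetes(return_X_y=True)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=10)

# Residual = actual - predicted, matching the sign convention of the
# "residual_vs_predicted" plot.
residuals = y - y_pred

# A mean near zero and no obvious pattern against y_pred are what one
# would hope to see for a well-specified linear model.
print("mean residual:", residuals.mean())
print("std of residuals:", residuals.std())
```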

Interpret Results

The actual versus predicted plot shows a roughly linear relationship between actual and predicted values, with some scatter. The residuals versus predicted values plot shows no clear trend, which suggests that a linear regression model may be a reasonable fit for the data.

Note, however, that we used cross_val_predict for visualization purposes only. Computing a single performance metric from the concatenated predictions it returns can be misleading when the CV folds differ in size or distribution. To assess the model quantitatively, compute per-fold performance metrics with cross_val_score or cross_validate instead.
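Per-fold metrics can be computed along the following lines. This is a minimal sketch using cross_validate; the choice of R² and mean absolute error as metrics is an assumption made for illustration, not something the lab prescribes.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)
lr = LinearRegression()

# One score per fold for each requested metric.
cv_results = cross_validate(
    lr, X, y, cv=10, scoring=("r2", "neg_mean_absolute_error")
)

r2_per_fold = cv_results["test_r2"]
# scikit-learn reports errors as negated scores, so flip the sign back.
mae_per_fold = -cv_results["test_neg_mean_absolute_error"]

print(f"R2:  {r2_per_fold.mean():.3f} +/- {r2_per_fold.std():.3f}")
print(f"MAE: {mae_per_fold.mean():.3f} +/- {mae_per_fold.std():.3f}")
```

Reporting the mean together with the spread across folds gives a sense of how stable the model's performance is, which a single metric over concatenated predictions would hide.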

Summary

In this lab, we learned how to use cross-validation to visualize model predictions and errors with the cross_val_predict function and the PredictionErrorDisplay class in scikit-learn. We loaded the diabetes dataset, created an instance of a linear regression model, and used cross-validation to obtain an array of predictions. We then used PredictionErrorDisplay to plot the actual versus predicted values, as well as the residuals versus predicted values. Finally, we interpreted the results and discussed why quantitative model evaluation should rely on per-fold performance metrics.
