Linear Regression Example: A Hands-on Guide

Introduction

This lab demonstrates how to use linear regression to fit a straight line to a dataset and how to calculate the coefficients, mean squared error, and coefficient of determination. We will use the scikit-learn library to perform linear regression on a single feature of the diabetes dataset.



Load the Diabetes Dataset

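We start by loading the diabetes dataset from scikit-learn and selecting a single feature from it: column index 2, the body mass index (BMI). Indexing with np.newaxis keeps the result two-dimensional, which is the shape scikit-learn estimators expect for the feature matrix.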

import numpy as np
from sklearn import datasets

## Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

## Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
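
As a quick sanity check on the feature selection above, the sketch below prints the array shapes before and after indexing and compares the np.newaxis form with a list index. The variable names (X, bmi_newaxis, bmi_list) are illustrative and not part of the lab's code.

```python
import numpy as np
from sklearn import datasets

# Load the full feature matrix and targets
X, y = datasets.load_diabetes(return_X_y=True)

# np.newaxis keeps the selected column two-dimensional, as in the lab's code
bmi_newaxis = X[:, np.newaxis, 2]

# Selecting with a list index is an equivalent way to keep a 2-D column
bmi_list = X[:, [2]]

print("Full feature matrix shape:", X.shape)
print("Single-feature shape:", bmi_newaxis.shape)
print("Both selections equal:", np.array_equal(bmi_newaxis, bmi_list))
```

Either form works; the list index is arguably easier to read, while the np.newaxis form matches the lab's code.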

Split the Dataset

Next, we split the dataset into training and testing sets. The last 20 of the 442 samples are held out for testing (about 4.5% of the data), and the remaining 422 are used for training.

## Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

## Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
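
The slicing above takes a fixed tail of the data. If a shuffled, proportional split is preferred, scikit-learn's train_test_split can be used instead. The following is a minimal sketch of that alternative, not the lab's method; the 0.2 test fraction and random_state=0 are arbitrary choices made for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the data and keep only the BMI column (index 2) as a 2-D array
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, [2]]

# Shuffle and split: roughly 80% for training, 20% for testing.
# random_state is fixed only so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))
```

A shuffled split will produce different metrics from the fixed 20-sample tail used in the rest of this lab.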

Train the Model

Now, we create a linear regression object and train the model using the training sets.

from sklearn import linear_model

## Create linear regression object
regr = linear_model.LinearRegression()

## Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
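
With a single feature, ordinary least squares has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The sketch below, with illustrative variable names, computes both by hand and compares them with the values fitted by LinearRegression.

```python
import numpy as np
from sklearn import datasets, linear_model

# Same data preparation as in the lab: BMI feature, last 20 samples held out
X, y = datasets.load_diabetes(return_X_y=True)
X_train, y_train = X[:-20, [2]], y[:-20]

# Fit with scikit-learn
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# Closed-form simple linear regression on the same data
x = X_train[:, 0]
x_centered = x - x.mean()
slope = np.sum(x_centered * (y_train - y_train.mean())) / np.sum(x_centered ** 2)
intercept = y_train.mean() - slope * x.mean()

print("Slope (manual, sklearn):", slope, regr.coef_[0])
print("Intercept (manual, sklearn):", intercept, regr.intercept_)
```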

Make Predictions

We can now use the trained model to make predictions on the testing set.

## Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
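
For a linear model, predict amounts to multiplying the features by the fitted coefficients and adding the intercept. The following sketch, using illustrative variable names, reproduces the predictions by hand to make that explicit.

```python
import numpy as np
from sklearn import datasets, linear_model

# Same data preparation as in the lab
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, [2]]
X_train, X_test = X[:-20], X[-20:]
y_train = y[:-20]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# Predictions from the estimator
y_pred = regr.predict(X_test)

# The same predictions computed directly from the fitted parameters
y_pred_manual = X_test @ regr.coef_ + regr.intercept_

print("Predictions match:", np.allclose(y_pred, y_pred_manual))
```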

Calculate Metrics

We can calculate the coefficients, mean squared error, and coefficient of determination.

from sklearn.metrics import mean_squared_error, r2_score

## The coefficients
print("Coefficients: \n", regr.coef_)

## The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

## The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f"
      % r2_score(diabetes_y_test, diabetes_y_pred))
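
To show what these metrics measure, the sketch below recomputes them from their definitions. The mean squared error is the residual sum of squares divided by the number of test samples, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the test targets. Variable names are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Same data preparation and model as in the lab
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, [2]]
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Residual sum of squares and total sum of squares on the test set
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

# Metrics computed from their definitions
mse_manual = ss_res / len(y_test)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE (manual, sklearn):", mse_manual, mean_squared_error(y_test, y_pred))
print("R^2 (manual, sklearn):", r2_manual, r2_score(y_test, y_pred))
```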

Visualize the Results

Finally, we can plot the test data as a scatter plot and draw the fitted regression line over it to visualize how well the model fits the data.

import matplotlib.pyplot as plt

## Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Summary

In this lab, we learned how to use linear regression to fit a straight line to a dataset and how to calculate the coefficients, mean squared error, and coefficient of determination. We also learned how to visualize the fit by drawing the regression line over a scatter plot of the test data.
