Cross-Validation on Digits Dataset

This tutorial is from the open-source community.

Introduction

This lab demonstrates cross-validation with a support vector machine (SVM) on the digits dataset. It is a classification problem in which the task is to recognize handwritten digits from their images.

VM Tips

After the VM starts up, click the top-left corner to switch to the Notebook tab and access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

This lab exercises the following skills: sklearn/model_selection (Model Selection) and ml/sklearn (scikit-learn).

Load the dataset

First, we load the digits dataset from scikit-learn and unpack it into a feature matrix X and a label vector y.

import numpy as np
from sklearn import datasets

X, y = datasets.load_digits(return_X_y=True)
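
As an optional sanity check, you can confirm what was loaded: the digits dataset contains 1,797 samples, each an 8x8 image flattened into 64 features, with labels 0 through 9.

# Optional sanity check: 1797 samples, 64 features (flattened 8x8 images)
print(X.shape)        # (1797, 64)
print(y.shape)        # (1797,)
print(np.unique(y))   # the ten digit classes 0-9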

Create a Support Vector Machine (SVM) model

Next, we create an SVM model with a linear kernel.

from sklearn import svm

svc = svm.SVC(kernel="linear")

Define the hyperparameter values to test

We will test different values of the regularization parameter C, which controls the trade-off between maximizing the margin and minimizing the classification error. We will test 10 logarithmically spaced values between 10^-10 and 1.

C_s = np.logspace(-10, 0, 10)
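
If you want to see the candidate values, an optional print shows that the grid runs from 1e-10 up to 1:

# Optional: inspect the ten candidate C values
print(C_s)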

Perform cross-validation and record results

For each value of C, we perform 10-fold cross-validation and record the mean and standard deviation of the scores.

from sklearn.model_selection import cross_val_score

scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    # Use 10-fold cross-validation, matching the description above
    this_scores = cross_val_score(svc, X, y, cv=10, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))
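
Before plotting, you can already read off the best-performing value of C from the recorded means. This optional snippet reports the C with the highest mean cross-validation score:

# Optional: report the C value with the highest mean CV score
best_idx = int(np.argmax(scores))
print("Best C: %g (mean CV score: %.3f)" % (C_s[best_idx], scores[best_idx]))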

Plot the results

Finally, we plot the mean scores as a function of C, and also include error bars to visualize the standard deviation.

import matplotlib.pyplot as plt

plt.figure()
# Mean CV score as a function of C, on a log-scaled x-axis
plt.semilogx(C_s, scores)
# Dashed lines show one standard deviation above and below the mean
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), "b--")
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), "b--")
# Reformat the y-axis tick labels in plain (non-scientific) notation
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel("CV score")
plt.xlabel("Parameter C")
plt.ylim(0, 1.1)
plt.show()

Summary

In this lab, we performed 10-fold cross-validation with an SVM model on the digits dataset, testing different values of the regularization parameter C. We plotted the results to visualize the relationship between C and the mean cross-validation score. This is a useful technique for tuning hyperparameters and assessing model performance.
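
As a side note, scikit-learn can automate this kind of search. The sketch below uses GridSearchCV to tune C over the same grid with 10-fold cross-validation; it should find essentially the same best C as the manual loop above.

from sklearn.model_selection import GridSearchCV

# Sketch: search the same C grid automatically with 10-fold cross-validation
search = GridSearchCV(svm.SVC(kernel="linear"), {"C": C_s}, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)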
