Digit Dataset Analysis


This tutorial is from the open-source community.

Introduction

In this lab, we explore the scikit-learn digits dataset, which consists of 1,797 images of 8x8 pixels, each showing a handwritten digit from 0 to 9. Our goal is to analyze the dataset and use it to train a machine learning model that classifies handwritten digits.




Importing the Dataset

The first step is to load the digits dataset from scikit-learn:

from sklearn import datasets

# Load the digits dataset
digits = datasets.load_digits()
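
Before going further, it can help to check what the loaded object contains. The following is a minimal sketch, not part of the original lab, that prints the array shapes and the number of samples per class. The names flattened and class_counts are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flattened feature matrix: one row per image, one column per pixel
print(digits.data.shape)    # (1797, 64)

# The same pixels kept as 8x8 arrays, convenient for plotting
print(digits.images.shape)  # (1797, 8, 8)

# Distinct class labels
print(np.unique(digits.target))  # [0 1 2 3 4 5 6 7 8 9]

# digits.data is digits.images with each 8x8 image flattened to 64 values
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))  # True

# Number of samples per digit class (illustrative helper variable)
class_counts = np.bincount(digits.target)
print(class_counts)
```

The flattened digits.data array is the one passed to the classifier later, while digits.images is the more convenient form for plotting.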

Visualizing the Dataset

To get a better understanding of the dataset, we can visualize a sample image using matplotlib. The following code displays the last digit in the dataset:

import matplotlib.pyplot as plt

# Display the last digit
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()
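
A single image gives only a limited impression of the data. As a hedged extension of the snippet above, the sketch below draws several images in a grid with their labels as titles. The helper plot_digit_grid and its parameters n_rows and n_cols are hypothetical names introduced for this example, not part of scikit-learn or matplotlib.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()


def plot_digit_grid(images, labels, n_rows=2, n_cols=5):
    """Draw the first n_rows * n_cols images in a grid, titled with their labels.

    Hypothetical helper for illustration only.
    """
    fig, axes = plt.subplots(
        n_rows, n_cols, squeeze=False, figsize=(1.5 * n_cols, 1.8 * n_rows)
    )
    # zip stops after n_rows * n_cols images because axes.ravel() is the shortest
    for ax, image, label in zip(axes.ravel(), images, labels):
        ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
        ax.set_title(f"Label: {label}")
        ax.set_axis_off()
    fig.tight_layout()
    return fig


fig = plot_digit_grid(digits.images, digits.target)
```

In a Jupyter notebook the figure typically renders inline; in a plain script, call plt.show() afterwards to display it.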

Preparing the Dataset for Machine Learning

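Before training a model, we split the data into training and testing sets so that the model can be evaluated on samples it has not seen. We can do this with scikit-learn's train_test_split function. Here, test_size=0.2 holds out 20% of the samples for testing, and random_state=42 makes the split reproducible: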

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)
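
To confirm what the split produced, one can print the shapes of the resulting arrays. The sketch below repeats the split and also shows an optional stratified variant. The stratify argument is an assumption added here and is not used in the original lab. The Xs_ and ys_ variable names are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Same split as in the lab: 20% of the samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)

# Optional variant (not in the original lab): stratify keeps the class
# proportions in the training and testing sets close to those of the full data
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    digits.data,
    digits.target,
    test_size=0.2,
    random_state=42,
    stratify=digits.target,
)
print(Xs_train.shape, Xs_test.shape)
```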

Training a Machine Learning Model

Now that the data is split, we can train a model on the training set. In this example, we use a Support Vector Machine (SVM) classifier with a linear kernel:

from sklearn.svm import SVC

# Create the SVM classifier
clf = SVC(kernel='linear')

# Train the classifier on the training data
clf.fit(X_train, y_train)
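
After fitting, the classifier exposes a few attributes that show what it learned. The following self-contained sketch, an illustration rather than part of the original lab, repeats the training step and then inspects the learned classes and support vectors. The name sample_pred is introduced here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Class labels seen during training
print(clf.classes_)

# Number of support vectors per class, and the full support-vector array
print(clf.n_support_)
print(clf.support_vectors_.shape)

# Predictions for the first five test images next to their true labels
sample_pred = clf.predict(X_test[:5])
print(sample_pred, y_test[:5])
```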

Evaluating the Model

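To evaluate the model, we predict labels for the test set and compare them with the true labels using scikit-learn's accuracy_score function, which returns the fraction of samples classified correctly: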

from sklearn.metrics import accuracy_score

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the model
print("Accuracy:", accuracy)
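
Accuracy is a single number and does not show which digits are confused with each other. As a hedged addition that goes beyond the original lab, the sketch below computes a confusion matrix and a per-class report with scikit-learn's confusion_matrix and classification_report functions.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Rows are true digits, columns are predicted digits;
# off-diagonal entries count misclassifications
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each digit class
print(classification_report(y_test, y_pred))
```

The diagonal of the confusion matrix holds the correctly classified samples, so its sum divided by the total number of test samples equals the accuracy.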

Improving the Model

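If the accuracy is not satisfactory, we can try to improve it by tuning the hyperparameters of the SVM. For example, we can change the C parameter, which controls regularization strength: smaller values of C regularize more strongly. Note that choosing hyperparameters by comparing test-set scores tends to make the reported accuracy optimistic. The following code retrains the classifier with C=0.1: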

# Create the SVM classifier with a different value of C
clf = SVC(kernel='linear', C=0.1)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the model
print("Accuracy:", accuracy)
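
Trying values of C by hand and comparing test accuracy works for a quick experiment, but a more systematic option is a cross-validated search on the training set. The sketch below uses scikit-learn's GridSearchCV for this. The candidate values in param_grid and the choice of cv=5 are assumptions made for illustration, not settings from the original lab.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Candidate values for C (illustrative, not from the original lab)
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training set only; the test set stays untouched
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated accuracy:", search.best_score_)

# The best estimator is refit on the full training set by default,
# so it can be scored once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the test set is used only once at the end, the final accuracy is a fairer estimate of how the tuned model performs on unseen digits.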

Summary

In this lab, we explored the scikit-learn digits dataset and learned how to train a machine learning model to classify handwritten digits. We also learned how to evaluate the performance of the model and how to improve it by tuning the hyperparameters of the algorithm. This dataset is a great resource for anyone interested in learning about machine learning classification algorithms.
