Active Learning With Label Propagation

This tutorial is from the open-source community.

Introduction

This lab demonstrates an active learning technique for classifying handwritten digits using label propagation. Label propagation is a semi-supervised learning method that builds a similarity graph over the data points and propagates labels from labeled points to unlabeled ones. Active learning is an iterative process in which we select the data points that are most worth labeling, obtain their labels, and retrain the model on the enlarged labeled set.
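To make the idea of propagating labels across a graph concrete, here is a minimal sketch on made-up one-dimensional data. The arrays `X_toy` and `y_toy` and the choice `gamma=1.0` are illustrative assumptions, not part of the lab. Two well-separated clusters each contain one labeled point, and `-1` marks the unlabeled points.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two tight clusters on a line; only one point per cluster is labeled.
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_toy = np.array([0, -1, -1, 1, -1, -1])  # -1 means "unlabeled"

toy_model = LabelSpreading(kernel="rbf", gamma=1.0)
toy_model.fit(X_toy, y_toy)

# transduction_ holds the label inferred for every point, labeled or not.
toy_labels = toy_model.transduction_
print(toy_labels)
```

Each unlabeled point should take the label of the labeled point in its own cluster, because the RBF similarity between the two clusters is negligible.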

Load the Digits Dataset

We will start by loading the digits dataset from the scikit-learn library.

from sklearn import datasets

digits = datasets.load_digits()
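Before going further, it can help to look at what the dataset contains. The short sketch below prints the shapes of the arrays used later in this lab: `data` holds each image flattened to 64 features, `images` holds the same samples as 8x8 arrays, and `target` holds the digit labels.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Feature matrix: one row per image, 64 pixel intensities per row.
print("data shape:", digits.data.shape)

# The same samples as 8x8 grayscale images.
print("images shape:", digits.images.shape)

# The labels are the digits 0 through 9.
digit_classes = sorted(set(digits.target.tolist()))
print("classes:", digit_classes)
```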

Shuffle and Split Data

Next, we will shuffle the dataset and keep a subset of 330 samples. We treat only the first 10 points of this subset as labeled and the remaining 320 as unlabeled.

import numpy as np

rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 10
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
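With only 10 labeled points, some digit classes may not appear in the labeled set at all, and the model can only propagate classes that it has seen among the labeled points. The sketch below is an optional sanity check; the names `seen_classes` and `missing_classes` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
y = digits.target[indices[:330]]

n_labeled_points = 10

# Classes present among the initially labeled points (illustrative names).
seen_classes = np.unique(y[:n_labeled_points])
missing_classes = np.setdiff1d(np.arange(10), seen_classes)

print("Classes in the labeled set:", seen_classes)
print("Classes not yet labeled:", missing_classes)
```

If a class is missing at this stage, it can only enter the model once the active learning loop happens to query a point of that class.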

Train Label Propagation Model

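We will now train a label propagation model and use it to predict the labels of the unlabeled data points. The code uses LabelSpreading, scikit-learn's variant of label propagation. The model is fit on all of X; unlabeled points are marked with the label -1 in y_train.

from sklearn.semi_supervised import LabelSpreading

# hide the labels of the unlabeled points by marking them with -1
y_train = np.copy(y)
y_train[unlabeled_indices] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=20)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_indices]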
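Since the true labels of the "unlabeled" points are known in this lab, the predictions can be scored against them. The following self-contained sketch repeats the setup from the previous steps and computes the accuracy on the unlabeled points. Using `accuracy_score` is a choice made here for illustration; the variable names `y_train`, `true_labels`, and `accuracy` are assumptions of this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]

n_labeled_points = 10
unlabeled_indices = np.arange(len(y))[n_labeled_points:]

# Mark unlabeled points with -1 so the model ignores their true labels.
y_train = np.copy(y)
y_train[unlabeled_indices] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=20)
lp_model.fit(X, y_train)

# Compare the inferred labels with the held-back true labels.
predicted_labels = lp_model.transduction_[unlabeled_indices]
true_labels = y[unlabeled_indices]
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy on {len(unlabeled_indices)} unlabeled points: {accuracy:.3f}")
```

Recomputing this score after each round of labeling is one way to see whether the queried labels are helping.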

Select Most Uncertain Points

We will rank the points by the entropy of their predicted label distributions, select the five most uncertain unlabeled points, and request human labels for them.

from scipy import stats

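# entropy of each point's predicted label distribution (higher = more uncertain)
pred_entropies = stats.entropy(lp_model.label_distributions_.T)
uncertainty_index = np.argsort(pred_entropies)[::-1]
# keep only points that are still unlabeled, then take the top five
uncertainty_index = uncertainty_index[np.isin(uncertainty_index, unlabeled_indices)][:5]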
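To see why entropy works as an uncertainty score, consider a small made-up example. The array `label_distributions` below is hypothetical: each row is one point's predicted probabilities over three classes. A point whose probability mass sits on one class has entropy 0, while a uniform distribution has the maximum entropy, ln(3) for three classes.

```python
import numpy as np
from scipy import stats

# Hypothetical label distributions for four points over three classes.
label_distributions = np.array([
    [1.0, 0.0, 0.0],        # fully confident
    [0.6, 0.3, 0.1],        # fairly confident
    [0.4, 0.4, 0.2],        # torn between two classes
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
])

# stats.entropy works column-wise, so transpose to get one value per point.
entropies = stats.entropy(label_distributions.T)

# Indices sorted from most to least uncertain.
ranking = np.argsort(entropies)[::-1]
print(entropies)
print(ranking)
```

The ranking places the uniform row first and the fully confident row last, which is the order in which an active learner would ask for labels.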

Label the Most Uncertain Points

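We will add the human labels for these points to the training labels and retrain the model. In this lab, the known true labels in y stand in for a human annotator. We also remove the newly labeled points from the unlabeled set so that they are not selected again.

y_train[uncertainty_index] = y[uncertainty_index]
lp_model.fit(X, y_train)

# these points are now labeled, so drop them from the unlabeled set
unlabeled_indices = np.setdiff1d(unlabeled_indices, uncertainty_index)
n_labeled_points += len(uncertainty_index)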

Repeat

We will repeat the process for three more rounds: select the five most uncertain points, add their labels to the labeled data points, and retrain the model. This brings the total to 30 labeled data points.

max_iterations = 3

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled items left to label.")
        break

    # select the five most uncertain unlabeled points
    pred_entropies = stats.entropy(lp_model.label_distributions_.T)
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[np.isin(uncertainty_index, unlabeled_indices)][:5]

    # add their true labels to the training labels
    y_train[uncertainty_index] = y[uncertainty_index]

    # retrain the model
    lp_model.fit(X, y_train)

    # remove the newly labeled points from the unlabeled set
    delete_indices = np.array([], dtype=int)
    for image_index in uncertainty_index:
        (delete_index,) = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))
    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)
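The whole procedure can also be wrapped in a single function that records how accuracy changes as labels are added. The sketch below is one possible arrangement, not the lab's reference code: `run_active_learning`, its parameters, and the `history` list are hypothetical names, and accuracy on the still-unlabeled points is an evaluation choice made here. With 10 initial labels, a batch size of 5, and 4 rounds, it ends with 30 labeled points, matching the steps above.

```python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading


def run_active_learning(X, y, n_initial=10, batch_size=5, n_rounds=4):
    """Hypothetical helper: query the most uncertain points each round."""
    y_train = np.full_like(y, -1)          # -1 marks unlabeled points
    y_train[:n_initial] = y[:n_initial]
    unlabeled = np.arange(len(y))[n_initial:]
    history = []                           # (number of labels, accuracy)

    for _ in range(n_rounds):
        model = LabelSpreading(gamma=0.25, max_iter=20)
        model.fit(X, y_train)

        # Score the model on the points that are still unlabeled.
        acc = accuracy_score(y[unlabeled], model.transduction_[unlabeled])
        history.append((int(np.sum(y_train != -1)), acc))

        # Rank all points by entropy, keep the unlabeled ones, take a batch.
        entropies = stats.entropy(model.label_distributions_.T)
        ranked = np.argsort(entropies)[::-1]
        query = ranked[np.isin(ranked, unlabeled)][:batch_size]

        # Simulate the human annotator with the known true labels.
        y_train[query] = y[query]
        unlabeled = np.setdiff1d(unlabeled, query)

    return y_train, history


digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]

final_labels, history = run_active_learning(X, y)
for n_labels, acc in history:
    print(f"{n_labels} labels: accuracy {acc:.3f}")
```

Creating a fresh model in every round keeps the rounds independent of one another; refitting a single model, as in the step-by-step code, should behave the same way because `fit` starts from scratch.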

Summary

In summary, this lab demonstrated an active learning technique that uses label propagation to classify handwritten digits. We started by training a label propagation model with only 10 labeled points, and iteratively selected the five most uncertain points to label until we had 30 labeled data points. This technique can help reduce the number of labeled data points needed to train a model that performs well.
