Semi-Supervised Text Classification Tutorial

Introduction

In this lab, you will learn how to perform semi-supervised classification on a text dataset using scikit-learn. Semi-supervised learning is a type of machine learning where a model is trained on both labeled and unlabeled data. This lab will cover how to use Self-Training and LabelSpreading algorithms for semi-supervised text classification. We will be using the 20 newsgroups dataset to train and test our models.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.

Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("`Sklearn`")) -.-> sklearn/ModelSelectionandEvaluationGroup(["`Model Selection and Evaluation`"]) sklearn(("`Sklearn`")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["`Data Preprocessing and Feature Engineering`"]) sklearn(("`Sklearn`")) -.-> sklearn/UtilitiesandDatasetsGroup(["`Utilities and Datasets`"]) sklearn(("`Sklearn`")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["`Core Models and Algorithms`"]) sklearn(("`Sklearn`")) -.-> sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup(["`Advanced Data Analysis and Dimensionality Reduction`"]) ml(("`Machine Learning`")) -.-> ml/FrameworkandSoftwareGroup(["`Framework and Software`"]) sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/metrics("`Metrics`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/pipeline("`Pipeline`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/feature_extraction("`Feature Extraction`") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("`Datasets`") sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/linear_model("`Linear Models`") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/model_selection("`Model Selection`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/preprocessing("`Preprocessing and Normalization`") sklearn/AdvancedDataAnalysisandDimensionalityReductionGroup -.-> sklearn/semi_supervised("`Semi-Supervised Learning`") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("`scikit-learn`") subgraph Lab Skills sklearn/metrics -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/pipeline -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/feature_extraction -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/datasets -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/linear_model -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/model_selection -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/preprocessing -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} sklearn/semi_supervised -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} ml/sklearn -.-> lab-49281{{"`Semi-Supervised Text Classification`"}} end

Load the Dataset

We will be using the 20 newsgroups dataset, which contains around 18,000 newsgroup posts on 20 topics. In this step, we will load the dataset and print out some basic information about it.

import numpy as np
from sklearn.datasets import fetch_20newsgroups

## Load the dataset with the first five categories
data = fetch_20newsgroups(
    subset="train",
    categories=[
        "alt.atheism",
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
    ],
)

## Print out information about the dataset
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))

Create the Pipeline for Supervised Learning

In this step, we will create a pipeline for supervised learning. The pipeline will consist of a CountVectorizer to convert the text data into a matrix of token counts, a TfidfTransformer to apply term frequency-inverse document frequency normalization to the count matrix, and an SGDClassifier to train the model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

## Parameters for the SGDClassifier
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")

## Parameters for the CountVectorizer
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

## Create the pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(**sdg_params)),
    ]
)

Train and Evaluate the Supervised Model

In this step, we will split the dataset into training and testing sets, and then train and evaluate the supervised model pipeline we created in Step 2.

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

## Split the dataset into training and testing sets
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Train and evaluate the supervised model pipeline
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(
    "Micro-averaged F1 score on test set: %0.3f"
    % f1_score(y_test, y_pred, average="micro")
)

Create the Pipeline for Self-Training

In this step, we will create a pipeline for semi-supervised learning using Self-Training. The pipeline will be similar to the supervised pipeline, but we will use the SelfTrainingClassifier instead of the SGDClassifier.

from sklearn.semi_supervised import SelfTrainingClassifier

## Create the Self-Training pipeline
st_pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SelfTrainingClassifier(SGDClassifier(**sdg_params), verbose=True)),
    ]
)

Train and Evaluate the Self-Training Model

In this step, we will use Self-Training on 20% of the labeled data. We will randomly select 20% of the labeled data, train the model on that data, and then use the model to predict labels for the remaining unlabeled data.

import numpy as np

## Select 20% of the training data
y_mask = np.random.rand(len(y_train)) < 0.2
X_20, y_20 = map(
    list, zip(*((x, y) for x, y, m in zip(X_train, y_train, y_mask) if m))
)

## Set the non-masked subset to be unlabeled
y_train[~y_mask] = -1

## Train and evaluate the Self-Training pipeline
st_pipeline.fit(X_train, y_train)
y_pred = st_pipeline.predict(X_test)
print(
    "Micro-averaged F1 score on test set: %0.3f"
    % f1_score(y_test, y_pred, average="micro")
)

Create the Pipeline for LabelSpreading

In this step, we will create a pipeline for semi-supervised learning using LabelSpreading. The pipeline will be similar to the supervised pipeline, but we will use the LabelSpreading algorithm instead of the SGDClassifier.

from sklearn.semi_supervised import LabelSpreading
from sklearn.preprocessing import FunctionTransformer

## Create the LabelSpreading pipeline
ls_pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
        ("clf", LabelSpreading()),
    ]
)

Train and Evaluate the LabelSpreading Model

In this step, we will use LabelSpreading on 20% of the labeled data. We will randomly select 20% of the labeled data, train the model on that data, and then use the model to predict labels for the remaining unlabeled data.

## Train and evaluate the LabelSpreading pipeline
ls_pipeline.fit(X_train, y_train)
y_pred = ls_pipeline.predict(X_test)
print(
    "Micro-averaged F1 score on test set: %0.3f"
    % f1_score(y_test, y_pred, average="micro")
)

Summary

In this lab, we learned how to perform semi-supervised classification on a text dataset using scikit-learn. We used Self-Training and LabelSpreading algorithms to train and test our models. Semi-supervised learning can be useful when there is a limited amount of labeled data available, and it can help improve the performance of a model by incorporating unlabeled data.

Semi-Supervised Text Classification