Working with Text Data

Machine LearningMachine LearningBeginner
Practice Now

This tutorial is from open-source community. Access the source code

Introduction

In this lab, we will explore how to work with text data using scikit-learn, a popular machine learning library in Python. We will learn how to load text data, preprocess it, extract features, train a model, and evaluate its performance.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("`Sklearn`")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["`Core Models and Algorithms`"]) sklearn(("`Sklearn`")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["`Data Preprocessing and Feature Engineering`"]) sklearn(("`Sklearn`")) -.-> sklearn/UtilitiesandDatasetsGroup(["`Utilities and Datasets`"]) ml(("`Machine Learning`")) -.-> ml/FrameworkandSoftwareGroup(["`Framework and Software`"]) sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/naive_bayes("`Naive Bayes`") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/feature_extraction("`Feature Extraction`") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("`Datasets`") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("`scikit-learn`") subgraph Lab Skills sklearn/naive_bayes -.-> lab-71103{{"`Working with Text Data`"}} sklearn/feature_extraction -.-> lab-71103{{"`Working with Text Data`"}} sklearn/datasets -.-> lab-71103{{"`Working with Text Data`"}} ml/sklearn -.-> lab-71103{{"`Working with Text Data`"}} end

Loading the Text Data

First, we need to load the text data that we will be working with. We will use the 20 Newsgroups dataset, which contains news articles from twenty different topics. To load the dataset, we can use the fetch_20newsgroups function from scikit-learn.

from sklearn.datasets import fetch_20newsgroups

## Load the dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Now we have loaded the data, and we can explore its structure and content.

Preprocessing the Text Data

Before we can use the text data for machine learning, we need to preprocess it. This involves several steps, such as removing punctuation, converting all text to lowercase, and tokenizing the text into individual words. We can perform these preprocessing steps using scikit-learn's CountVectorizer and TfidfTransformer.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Preprocess the text data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Now our text data is preprocessed and ready for feature extraction.

Feature Extraction

To represent the text data as feature vectors, we can use the bags of words representation. This representation assigns a fixed integer id to each word in the training set and counts the number of occurrences of each word in each document. We can extract these feature vectors using scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

## Extract feature vectors
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

Now we have extracted the feature vectors, and we can use them for training our model.

Training the Model

Now that we have our feature vectors, we can train a model to classify the text data. In this example, we will use the Multinomial Naive Bayes algorithm, which is a popular algorithm for text classification.

from sklearn.naive_bayes import MultinomialNB

## Train the model
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Now our model is trained and ready for prediction.

Evaluating the Model

To evaluate the performance of our model, we can use a held-out test set. We can load the test set using the same process as the training set.

twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

Now we can preprocess the test set and extract the feature vectors.

X_test_counts = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

Finally, we can use our trained model to make predictions on the test set and calculate the accuracy.

predicted = clf.predict(X_test_tfidf)
accuracy = np.mean(predicted == twenty_test.target)

Summary

In this lab, we learned how to work with text data using scikit-learn. We loaded the text data, preprocessed it, extracted feature vectors, trained a model, and evaluated its performance. Working with text data can be challenging, but scikit-learn provides powerful tools and algorithms to make the process easier.

Other Machine Learning Tutorials you may like