Introduction
In this lab, we will learn how to perform feature extraction using the scikit-learn library. Feature extraction is the process of transforming raw data, such as text or images, into numerical features that machine learning algorithms can use. This lab covers features stored in dicts and features derived from text.
Loading features from dicts
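In this step, we will learn how to load features from dictionaries using the DictVectorizer class in scikit-learn. DictVectorizer turns a list of feature dicts into a numeric array: string-valued features are one-hot encoded as name=value columns, and numeric features are passed through unchanged.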
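from sklearn.feature_extraction import DictVectorizer
# Each sample is a dict mapping feature names to values.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]
vec = DictVectorizer()
# fit_transform returns a sparse matrix; toarray() makes it dense for printing.
features = vec.fit_transform(measurements).toarray()
# One column per 'city=<value>' category, plus one for 'temperature'.
feature_names = vec.get_feature_names_out()
print(features)
print(feature_names)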
Feature hashing
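In this step, we will learn how to perform feature hashing using the FeatureHasher class in scikit-learn. Feature hashing maps features to a fixed-length vector by applying a hash function to each feature name and using the result as a column index. Because the index is computed rather than looked up in a vocabulary, the hasher is stateless and needs no fitting, but distinct features can collide in the same column.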
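from sklearn.feature_extraction import FeatureHasher
movies = [
    {'category': ['thriller', 'drama'], 'year': 2003},
    {'category': ['animation', 'family'], 'year': 2011},
    {'year': 1974},
]
# input_type='string' expects each sample to be an iterable of strings, so
# flatten every dict into "name=value" tokens before hashing.
def to_tokens(movie):
    tokens = [f"category={c}" for c in movie.get('category', [])]
    tokens.append(f"year={movie['year']}")
    return tokens
# n_features=16 is an illustrative width; the default (2**20) is too wide to print.
hasher = FeatureHasher(n_features=16, input_type='string')
hashed_features = hasher.transform([to_tokens(m) for m in movies]).toarray()
print(hashed_features)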
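The idea behind feature hashing can be sketched in a few lines of plain Python. This is a toy illustration, not scikit-learn's implementation: `hash_features` is a hypothetical helper, and MD5 stands in for the signed MurmurHash3 function that FeatureHasher uses.

```python
import hashlib

def hash_features(tokens, n_features=8):
    """Toy hashing trick: map string tokens into a fixed-length count vector."""
    vector = [0] * n_features
    for token in tokens:
        # Hash the token and fold the result into the range [0, n_features).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1  # colliding tokens simply add up in one slot
    return vector

row_a = hash_features(["category=thriller", "category=drama", "year=2003"])
row_b = hash_features(["year=1974"])
print(row_a)
print(row_b)
```

Both rows have the same length regardless of how many distinct tokens exist, which is what makes the technique suitable for large or streaming feature sets. The price is that the mapping cannot be inverted to recover feature names.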
Text feature extraction
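In this step, we will learn how to perform text feature extraction using the CountVectorizer and TfidfVectorizer classes in scikit-learn. Both convert a collection of text documents into a numeric matrix with one row per document and one column per token: CountVectorizer stores raw token counts, while TfidfVectorizer reweights those counts by inverse document frequency. The example below uses CountVectorizer.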
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# By default, text is lowercased and split into tokens of two or more
# alphanumeric characters; punctuation is discarded.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(corpus).toarray()
feature_names = vectorizer.get_feature_names_out()
print(features)
print(feature_names)
Customizing the vectorizer classes
In this step, we will learn how to customize the behavior of vectorizer classes by passing callables to them. Here we replace the default tokenization with a custom tokenizer function that splits on whitespace.
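def my_tokenizer(s):
    # Split on whitespace only; punctuation stays attached to the tokens.
    return s.split()
# token_pattern=None silences the warning that the default pattern is unused
# when a custom tokenizer is supplied.
vectorizer = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)
features = vectorizer.fit_transform(corpus).toarray()
print(features)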
Summary
In this lab, we learned how to perform feature extraction using the scikit-learn library. We explored various techniques such as loading features from dicts, feature hashing, and text feature extraction. We also learned how to customize the behavior of vectorizer classes to suit our specific needs. Feature extraction is an important step in machine learning as it helps transform raw data into a format that can be used by algorithms to make predictions or classify data.