Introduction
In this lab, we build a pipeline that chains dimensionality reduction and classification. Using scikit-learn, we standardize the digits dataset, apply Principal Component Analysis (PCA) as an unsupervised dimensionality reduction step, and classify the reduced data with logistic regression. GridSearchCV then selects the best combination of PCA truncation (the number of components kept) and classifier regularization.
Import Required Libraries
First, we import the libraries the pipeline needs.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
Define Pipeline Components
We define the three pipeline steps in the order they run: a StandardScaler, a PCA, and a LogisticRegression classifier. The solver tolerance is set to a large value (tol=0.1) so the example runs faster, at the cost of a looser convergence criterion.
# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# Define a StandardScaler to normalize the inputs
scaler = StandardScaler()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])
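The step names chosen in the pipeline matter later, because nested parameters are addressed as `<step name>__<parameter name>`. The following is a minimal sketch that rebuilds the same pipeline and inspects those names; the values passed to `set_params` are arbitrary illustrations, not tuned settings.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("logistic", LogisticRegression(max_iter=10000, tol=0.1)),
    ]
)

# get_params() exposes every nested parameter under "<step>__<parameter>".
params = pipe.get_params()
print("pca__n_components" in params)
print("logistic__C" in params)

# set_params() accepts the same names (illustrative values only).
pipe.set_params(pca__n_components=15, logistic__C=1.0)
print(pipe.named_steps["pca"].n_components)
```

These are the same names that a parameter grid uses as its keys.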
Load Dataset and Define Parameters for GridSearchCV
We load the digits dataset and define the parameter grid for GridSearchCV. Each key has the form <step name>__<parameter>, so pca__n_components sets the PCA truncation and logistic__C sets the inverse regularization strength of the classifier.
X_digits, y_digits = datasets.load_digits(return_X_y=True)
param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
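To see how many candidates such a grid produces, it can be expanded with scikit-learn's `ParameterGrid`. This is a small sketch that only enumerates the combinations and does not fit anything; the variable name `candidates` is introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}

# Every combination of the two lists: 5 component counts x 4 values of C.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))

# logspace(-4, 4, 4) spaces four values evenly on a log scale from 1e-4 to 1e4.
print(param_grid["logistic__C"])
```

Each candidate is fitted once per cross-validation fold, so the grid size directly drives the runtime of the search.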
Perform GridSearchCV
We run the grid search, which cross-validates the pipeline for every combination in the grid and keeps the best-scoring one. Setting n_jobs=2 runs the fits on two parallel workers.
search = GridSearchCV(pipe, param_grid, n_jobs=2)
search.fit(X_digits, y_digits)
Print Best Parameters and Score
We print the best mean cross-validated score and the parameters that produced it.
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
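The cross-validated score is computed on the same data used to choose the parameters. As a hedged sketch of how the fitted search could be checked on unseen data, the example below holds out a test split first. The split size, the `random_state`, and the reduced grid `small_grid` are assumptions made to keep the sketch fast; they are not part of the lab's setup.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)
# Hold out a test set (25% and random_state=0 are illustrative choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("logistic", LogisticRegression(max_iter=10000, tol=0.1)),
    ]
)
# A deliberately small, hypothetical grid to keep the sketch fast.
small_grid = {"pca__n_components": [15, 30], "logistic__C": [0.1, 1.0]}
search = GridSearchCV(pipe, small_grid)
search.fit(X_train, y_train)

# With the default refit=True, the best pipeline is refit on all of X_train,
# so the search object can predict and score directly.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```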
Plot PCA Spectrum
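We plot the PCA spectrum to visualize the explained variance ratio of each principal component. GridSearchCV fits clones of the pipeline, so the pca object defined earlier is still unfitted; we fit it on the full dataset here. Note that this fit uses the raw, unscaled data, whereas the PCA step inside the pipeline sees standardized inputs. The vertical line marks the number of components selected by the search.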
pca.fit(X_digits)
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(
    np.arange(1, pca.n_components_ + 1), pca.explained_variance_ratio_, "+", linewidth=2
)
ax0.set_ylabel("PCA explained variance ratio")
ax0.axvline(
    search.best_estimator_.named_steps["pca"].n_components,
    linestyle=":",
    label="n_components chosen",
)
ax0.legend(prop=dict(size=12))
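The spectrum is often easier to read as a running total. The sketch below computes the cumulative explained variance ratio and the smallest number of components that reaches a threshold; the 0.90 threshold and the name `n_for_90` are arbitrary choices for illustration, not a recommendation from this lab.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_digits, _ = datasets.load_digits(return_X_y=True)
pca = PCA().fit(X_digits)

# Running total of the per-component explained variance ratios.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# searchsorted returns the first index whose cumulative ratio is >= 0.90;
# adding 1 converts that index into a component count.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_for_90, cumulative[n_for_90 - 1])
```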
Find Best Classifier Results
For each value of n_components, we keep the row of the cross-validation results with the highest mean test score.
results = pd.DataFrame(search.cv_results_)
components_col = "param_pca__n_components"
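# For each n_components value, keep the row with the highest mean test score.
# Selecting rows by label keeps components_col available as a regular column.
best_clfs = results.loc[
    results.groupby(components_col)["mean_test_score"].idxmax()
]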
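The idea of keeping one best row per group can be checked on a small made-up table. This sketch uses hypothetical column names and scores standing in for the real cross-validation results, and `idxmax`, which is one of several ways to select the top row per group.

```python
import pandas as pd

# Hypothetical scores standing in for search.cv_results_.
toy = pd.DataFrame(
    {
        "n_components": [5, 5, 15, 15],
        "C": [0.01, 1.0, 0.01, 1.0],
        "mean_test_score": [0.70, 0.75, 0.90, 0.88],
    }
)

# Row label of the highest score within each n_components group.
best_idx = toy.groupby("n_components")["mean_test_score"].idxmax()

# Select those rows; all original columns are preserved.
best_rows = toy.loc[best_idx]
print(best_rows)
```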
Plot Classification Accuracy
We plot the best classification accuracy for each number of components, with error bars showing the standard deviation of the score across the cross-validation folds.
best_clfs.plot(
    x=components_col, y="mean_test_score", yerr="std_test_score", legend=False, ax=ax1
)
ax1.set_ylabel("Classification accuracy (val)")
ax1.set_xlabel("n_components")
plt.xlim(-1, 70)
plt.tight_layout()
plt.show()
Summary
In this lab, we built a scikit-learn pipeline that standardizes the digits dataset, reduces its dimensionality with Principal Component Analysis (PCA), and classifies the result with logistic regression. We used GridSearchCV to choose the PCA truncation and the classifier regularization together, then plotted the PCA spectrum and the best classification accuracy for each number of components.