Introduction
Multi-dimensional scaling (MDS) is a technique used to visualize high-dimensional data in a lower dimensional space (usually 2D or 3D) while preserving the pairwise distances between the data points as much as possible. It is often used in exploratory data analysis and visualization.
In this tutorial, we will walk through the steps of performing MDS on a generated noisy dataset using the scikit-learn library in Python.
VM Tips
After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.
Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.
If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.
Import Libraries
First, we need to import the necessary libraries. We will be using numpy, matplotlib, scikit-learn, and PCA from scikit-learn.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.collections import LineCollection
from sklearn import manifold
from sklearn.metrics import euclidean_distances
from sklearn.decomposition import PCA
Generate Data
Next, we will generate a noisy dataset using numpy. We will generate 20 samples with 2 features each.
EPSILON = np.finfo(np.float32).eps
n_samples = 20
seed = np.random.RandomState(seed=3)
X_true = seed.randint(0, 20, 2 * n_samples).astype(float)
X_true = X_true.reshape((n_samples, 2))
## Center the data
X_true -= X_true.mean()
Add Noise to Data
We will then add noise to the pairwise distances between the data points using numpy.
similarities = euclidean_distances(X_true)
## Add noise to the similarities
noise = np.random.rand(n_samples, n_samples)
noise = noise + noise.T
noise[np.arange(noise.shape[0]), np.arange(noise.shape[0])] = 0
similarities += noise
Perform MDS
We will then perform MDS on the noisy dataset using scikit-learn's MDS class. We will use the precomputed dissimilarity option since we have already calculated the pairwise distances between the data points. We will also set the number of components to 2 for 2D visualization.
mds = manifold.MDS(
n_components=2,
max_iter=3000,
eps=1e-9,
random_state=seed,
dissimilarity="precomputed",
n_jobs=1,
normalized_stress="auto",
)
pos = mds.fit(similarities).embedding_
Perform Non-Metric MDS
We will also perform non-metric MDS on the same dataset for comparison. We will use the same options as MDS, except we will set the metric option to False.
nmds = manifold.MDS(
n_components=2,
metric=False,
max_iter=3000,
eps=1e-12,
dissimilarity="precomputed",
random_state=seed,
n_jobs=1,
n_init=1,
normalized_stress="auto",
)
npos = nmds.fit_transform(similarities, init=pos)
Rescale and Rotate Data
We will then rescale and rotate the data for visualization using PCA from scikit-learn.
## Rescale the data
pos *= np.sqrt((X_true**2).sum()) / np.sqrt((pos**2).sum())
npos *= np.sqrt((X_true**2).sum()) / np.sqrt((npos**2).sum())
## Rotate the data
clf = PCA(n_components=2)
X_true = clf.fit_transform(X_true)
pos = clf.fit_transform(pos)
npos = clf.fit_transform(npos)
Visualize Results
Finally, we will visualize the results using matplotlib. We will plot the true position of the data points, the position of the data points using MDS, and the position of the data points using non-metric MDS. We will also plot the pairwise distances between the data points using LineCollection from matplotlib.
fig = plt.figure(1)
ax = plt.axes([0.0, 0.0, 1.0, 1.0])
s = 100
plt.scatter(X_true[:, 0], X_true[:, 1], color="navy", s=s, lw=0, label="True Position")
plt.scatter(pos[:, 0], pos[:, 1], color="turquoise", s=s, lw=0, label="MDS")
plt.scatter(npos[:, 0], npos[:, 1], color="darkorange", s=s, lw=0, label="NMDS")
plt.legend(scatterpoints=1, loc="best", shadow=False)
similarities = similarities.max() / (similarities + EPSILON) * 100
np.fill_diagonal(similarities, 0)
## Plot the edges
start_idx, end_idx = np.where(pos)
## a sequence of (*line0*, *line1*, *line2*), where::
## linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [
[X_true[i, :], X_true[j, :]] for i in range(len(pos)) for j in range(len(pos))
]
values = np.abs(similarities)
lc = LineCollection(
segments, zorder=0, cmap=plt.cm.Blues, norm=plt.Normalize(0, values.max())
)
lc.set_array(similarities.flatten())
lc.set_linewidths(np.full(len(segments), 0.5))
ax.add_collection(lc)
plt.show()
Summary
In this tutorial, we learned how to perform MDS on a noisy dataset using scikit-learn in Python. We also learned how to visualize the results using matplotlib. MDS is a useful technique for visualizing high-dimensional data in a lower dimensional space while preserving the pairwise distances between the data points as much as possible.