Scikit-learn Data Preprocessing

Introduction

Welcome to the scikit-learn Data Preprocessing lab. In machine learning, the quality of your data directly impacts the performance of your model. Raw data is often messy, inconsistent, and not in the optimal format for algorithms. Data preprocessing is a crucial step that involves cleaning and transforming data to make it suitable for a machine learning model.

In this lab, you will learn how to perform two fundamental preprocessing tasks using the scikit-learn library:

  • Feature Scaling: Standardizing the range of independent variables or features of data.
  • Label Encoding: Converting categorical labels into a numerical format.

We will use the famous Iris dataset, which is conveniently included with scikit-learn, to practice these techniques. By the end of this lab, you will have a solid understanding of how to prepare your data for machine learning pipelines.

Split data into features and target with X = iris.data, y = iris.target

In this step, we will begin by loading the Iris dataset and separating it into features and the target variable. In machine learning, X is the conventional notation for the features (the input variables), and y is the notation for the target (the output variable you want to predict).

The scikit-learn library provides the Iris dataset through its datasets module. The loaded dataset object behaves like a dictionary.

Iris dataset structure:

  • iris.data: Feature matrix (150 samples × 4 features)
  • iris.target: Target labels (150 samples)
  • iris.feature_names: Names of the 4 features
  • iris.target_names: Names of the 3 flower species

Why separate X and y?

  • X: Input features (what the model learns from)
  • y: Target labels (what the model predicts)
  • This is the standard convention in machine learning
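
The code in the steps below assumes that preprocess.py already imports the required classes and loads the Iris dataset. If you are following along outside the lab environment, a minimal equivalent of that starter setup might look like the following sketch (the lab's actual starter file may differ slightly):

## Starter setup sketch (assumed to already exist in the lab's preprocess.py)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np

iris = load_iris()           ## Bunch object; behaves like a dictionary
print(iris.feature_names)    ## names of the 4 features
print(iris.target_names)     ## names of the 3 species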

First, open the preprocess.py file located in the ~/project directory using the file explorer on the left. We will add our code to this file.

Add the following lines under the ## --- Step 1: Split data --- comment to assign the features and target to X and y respectively. We will also print their shapes to verify.

## --- Step 1: Split data ---
X = iris.data
y = iris.target

print("Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)

Now, save the file and run it from the terminal to see the output.

python3 preprocess.py

You should see the following output, which indicates that we have 150 samples and 4 features, along with 150 corresponding target labels.

Shape of features (X): (150, 4)
Shape of target (y): (150,)

Scale features using StandardScaler from sklearn.preprocessing

In this step, we will prepare to scale our features. Feature scaling is a common requirement for many machine learning algorithms because they can be sensitive to the scale of input features. StandardScaler is a popular technique that standardizes features by removing the mean and scaling them to unit variance.

How StandardScaler works:

  • Formula: z = (x - u) / s, where u is the mean of the training samples, and s is the standard deviation
  • Effect: Transforms data to have a mean of 0 and standard deviation of 1
  • Benefits: Prevents features with larger scales from dominating the learning process

Key parameters of StandardScaler:

  • with_mean=True (default): Centers the data by removing the mean
  • with_std=True (default): Scales the data by dividing by the standard deviation
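
As a quick sanity check of the formula above, the same standardization can be reproduced by hand with NumPy. The sketch below assumes the X array from Step 1 and is not part of the lab's required code:

## Sketch: z = (x - u) / s computed manually, compared with StandardScaler
from sklearn.preprocessing import StandardScaler
import numpy as np

manual = (X - X.mean(axis=0)) / X.std(axis=0)   ## per-feature mean and standard deviation
scaled = StandardScaler().fit_transform(X)      ## the same transformation via scikit-learn
print(np.allclose(manual, scaled))              ## True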

We will use the StandardScaler from sklearn.preprocessing. The first part of the process is to create an instance of the scaler.

In your preprocess.py file, add the following code under the ## --- Step 2: Initialize the scaler --- comment to create an instance of StandardScaler.

## --- Step 2: Initialize the scaler ---
scaler = StandardScaler()

print("Scaler object created:", scaler)

Save the file and run it again.

python3 preprocess.py

The output will now include a line confirming that the StandardScaler object has been successfully created.

Shape of features (X): (150, 4)
Shape of target (y): (150,)
Scaler object created: StandardScaler()

Fit scaler with scaler.fit(X)

In this step, we will fit the StandardScaler to our feature data X. The fit() method is a fundamental concept in scikit-learn.

What fit() does:

  • Calculates necessary statistics (mean and standard deviation) from the training data
  • Stores these parameters internally for later use
  • Important: Only learns from data, does not transform it

Why separate fit() and transform()?

  • Fit on training data only: Prevents data leakage by learning parameters only from training set
  • Apply to any data: Can transform both training and test data using the same learned parameters
  • Consistency: Ensures the same transformation is applied to all data

Real-world best practice (shown as a runnable sketch after this list):

  • scaler.fit(X_train) - Learn parameters from training data only
  • X_train_scaled = scaler.transform(X_train) - Transform training data
  • X_test_scaled = scaler.transform(X_test) - Transform test data with same parameters
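
Written out as a self-contained sketch, that pattern looks like the code below. Note that train_test_split and the variable names here are illustrative and not part of this lab's preprocess.py:

## Illustrative sketch of the fit-on-train, transform-both pattern
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)                         ## learn mean and std from the training data only
X_train_scaled = scaler.transform(X_train)  ## transform the training data
X_test_scaled = scaler.transform(X_test)    ## transform the test data with the same parameters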

Add the following code to your preprocess.py file under the ## --- Step 3: Fit the scaler --- comment. We will also print the mean_ attribute of the scaler to see what it has learned.

## --- Step 3: Fit the scaler ---
scaler.fit(X)

print("Scaler mean:", scaler.mean_)

Save the file and execute it.

python3 preprocess.py

The output will now show the mean for each of the four features, which the scaler has computed from the data.

Shape of features (X): (150, 4)
Shape of target (y): (150,)
Scaler object created: StandardScaler()
Scaler mean: [5.84333333 3.05733333 3.758      1.19933333]

Transform data with scaler.transform(X)

In this step, we will use the fitted scaler to transform our data. The transform() method applies the scaling transformation to the data, using the mean and standard deviation calculated during the fit() step. This will center our data around a mean of 0 with a standard deviation of 1.

We will store the transformed data in a new variable, X_scaled, to keep the original data intact.

Understanding the code:

  • X_scaled = scaler.transform(X): Applies the learned transformation to our data
  • np.set_printoptions(precision=2, suppress=True): Formats output for better readability
    • precision=2: Shows 2 decimal places
    • suppress=True: Suppresses scientific notation so very small values print as plain decimals
  • np.mean(X, axis=0): Calculates mean along axis 0 (columns)
    • axis=0: Computes mean for each feature (column) across all samples
    • Result: One mean value per feature

Add the following code to your preprocess.py file under the ## --- Step 4: Transform the data --- comment. We will print the mean of the original and scaled data to observe the effect of the transformation.

## --- Step 4: Transform the data ---
X_scaled = scaler.transform(X)

## Use numpy to set precision for cleaner output
np.set_printoptions(precision=2, suppress=True)
print("Original data mean:", np.mean(X, axis=0))
print("Scaled data mean:", np.mean(X_scaled, axis=0))
print("Scaled data sample:\n", X_scaled[:5])

Save the file and run it.

python3 preprocess.py

You will see that the mean of the scaled data is effectively zero, and the sample data values have been transformed.

Shape of features (X): (150, 4)
Shape of target (y): (150,)
Scaler object created: StandardScaler()
Scaler mean: [5.84333333 3.05733333 3.758      1.19933333]
Original data mean: [5.84 3.06 3.76 1.2 ]
Scaled data mean: [-0. -0. -0. -0.]
Scaled data sample:
 [[-0.9   1.02 -1.34 -1.32]
 [-1.14 -0.13 -1.34 -1.32]
 [-1.39  0.33 -1.4  -1.32]
 [-1.51  0.1  -1.28 -1.32]
 [-1.02  1.25 -1.34 -1.32]]

Encode categorical target using LabelEncoder from sklearn.preprocessing

In this step, we will preprocess our target variable y. The Iris dataset's target is categorical, represented by the numbers 0, 1, and 2, which correspond to the three different species of Iris flowers. While they are already numeric, it's good practice to understand how to encode categorical labels, especially if they were in string format (e.g., 'setosa', 'versicolor').

LabelEncoder explained:

  • Purpose: Converts categorical labels (e.g., strings) into integers from 0 to n_classes - 1
  • How it works: Assigns a unique integer to each unique category
  • Example: ['cat', 'dog', 'cat'] → [0, 1, 0]

Why use LabelEncoder?

  • Many ML algorithms require numeric inputs
  • Efficient storage and computation
  • Maintains the categorical nature of the data

Key methods:

  • fit(y): Learns the mapping from categories to integers
  • transform(y): Applies the learned mapping
  • fit_transform(y): Combines both steps in one call
  • inverse_transform(y_encoded): Converts integers back to original categories

Important notes:

  • The integer codes follow the sorted order of the unique labels, not their order of appearance (as the sketch below shows)
  • If the categories have a meaningful order, LabelEncoder cannot express it; for ordinal features, use OrdinalEncoder, which accepts an explicit category order
  • For features (not targets), consider OneHotEncoder for nominal data
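
To see the encoding in action with string labels, here is a small standalone sketch (the species strings are illustrative, and this snippet is not part of the lab's required code):

## Standalone sketch: LabelEncoder with string labels
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'versicolor', 'setosa', 'virginica']
encoder = LabelEncoder()
codes = encoder.fit_transform(species)      ## learn the mapping and encode in one call
print(codes)                                ## [0 1 0 2] -- codes follow sorted label order
print(encoder.classes_)                     ## ['setosa' 'versicolor' 'virginica']
print(encoder.inverse_transform(codes))     ## back to the original strings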

Add the following code to your preprocess.py file under the ## --- Step 5: Encode the target --- comment. We will create an instance of LabelEncoder and use the fit_transform() method, which combines fitting and transforming into a single step.

## --- Step 5: Encode the target ---
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print("\nOriginal target sample:", y[:5])  ## Show first 5 original labels
print("Encoded target sample:", y_encoded[:5])  ## Show first 5 encoded labels
print("Unique encoded values:", np.unique(y_encoded))  ## Show all unique encoded values

Save the file and run it one final time.

python3 preprocess.py

The output will show that the target variable has been encoded. Since it was already in the correct integer format, the result is the same, but this demonstrates the process you would use for string-based labels.

Shape of features (X): (150, 4)
Shape of target (y): (150,)
Scaler object created: StandardScaler()
Scaler mean: [5.84333333 3.05733333 3.758      1.19933333]
Original data mean: [5.84 3.06 3.76 1.2 ]
Scaled data mean: [-0. -0. -0. -0.]
Scaled data sample:
 [[-0.9   1.02 -1.34 -1.32]
 [-1.14 -0.13 -1.34 -1.32]
 [-1.39  0.33 -1.4  -1.32]
 [-1.51  0.1  -1.28 -1.32]
 [-1.02  1.25 -1.34 -1.32]]

Original target sample: [0 0 0 0 0]
Encoded target sample: [0 0 0 0 0]
Unique encoded values: [0 1 2]

Summary

Congratulations on completing the lab! You have successfully performed essential data preprocessing tasks using scikit-learn.

In this lab, you learned how to:

  • Load a standard dataset from scikit-learn.
  • Separate the data into features (X) and a target (y).
  • Scale numerical features using StandardScaler by first fitting it to the data to learn the parameters and then transforming the data.
  • Encode categorical target labels into a machine-readable integer format using LabelEncoder.
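
Putting the pieces together, the finished preprocess.py built in this lab looks roughly like the outline below (the import lines and the load_iris() call are assumed to come from the starter file, and the print statements are omitted for brevity):

## Rough outline of the finished preprocess.py
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder

iris = load_iris()

## --- Step 1: Split data ---
X = iris.data
y = iris.target

## --- Step 2: Initialize the scaler ---
scaler = StandardScaler()

## --- Step 3: Fit the scaler ---
scaler.fit(X)

## --- Step 4: Transform the data ---
X_scaled = scaler.transform(X)

## --- Step 5: Encode the target ---
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)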

These preprocessing steps are fundamental to building robust and high-performing machine learning models. You are now better equipped to prepare your own datasets for future machine learning projects.