Scikit-learn Cross-Validation


Introduction

In machine learning, we often split our data into a training set and a testing set to evaluate a model's performance. However, this evaluation can be heavily dependent on which data points end up in the training set versus the testing set. A more robust method is cross-validation (CV).

Why cross-validation?

  • Reduces overfitting risk: Tests model on multiple data splits
  • Better generalization estimate: More reliable performance on unseen data
  • Maximizes data usage: Every sample used for both training and testing

Cross-validation involves splitting the dataset into multiple "folds" and then training and evaluating the model multiple times, using a different fold for testing each time. This gives us a more reliable estimate of the model's performance on unseen data.

In this lab, you will learn how to use scikit-learn's powerful and convenient functions to perform cross-validation on a classifier using the famous Iris dataset. You will learn to use cross_val_score to get performance scores and then calculate their mean and standard deviation to better understand the model's stability and overall accuracy.

Import cross_val_score from sklearn.model_selection

In this step, you will begin by importing the necessary function for performing cross-validation. The cross_val_score function is the primary tool in scikit-learn for this purpose. It simplifies the process of splitting data, training the model, and scoring it over multiple folds.

First, open the main.py file located in the ~/project directory using the file explorer on the left side of your IDE.

Now, add the import statement for cross_val_score to your main.py script. Place it with the other imports at the top of the file.

from sklearn.model_selection import cross_val_score

Your main.py file should now look like this:

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Initialize a Support Vector Classifier (SVC)
# Parameters explained:
# - kernel='linear': Uses a linear kernel for linearly separable data like Iris
# - C=1: Regularization parameter (higher values = less regularization)
# - random_state=42: Ensures reproducible results across runs
clf = SVC(kernel='linear', C=1, random_state=42)

# --- Your code will go below this line ---

You can run the script to ensure there are no syntax errors. Open a terminal in your IDE and execute the following command:

python3 main.py

You should see no output, which is expected as we haven't added any code to produce output yet.

Initialize KFold with n_splits=5 from sklearn.model_selection

While cross_val_score can automatically handle splitting, it's good practice to understand the underlying mechanism. The most common cross-validation strategy is K-Fold, where the dataset is split into 'k' folds. The model is trained on k-1 folds and tested on the remaining fold, repeating this process k times.

KFold parameters:

  • n_splits=5: Divides data into 5 equal parts (folds)
  • shuffle=False (default): Maintains original data order
  • random_state: Controls randomization if shuffle=True

The KFold class in scikit-learn is a cross-validation iterator that provides train/test indices to split data. Although we will use a simpler shortcut in the next step, understanding KFold is fundamental.

Let's import KFold and see how to initialize it. Add the following lines to your main.py file.

First, add the import statement at the top:

from sklearn.model_selection import KFold

Then, you could initialize it. However, for this lab we will rely on the cv parameter of cross_val_score, which is a more direct approach; the purpose of this step is simply to introduce the KFold concept, so we will not add KFold initialization code to our script. We will pass cv=5 in the next step. Note that for a classifier like ours, an integer cv makes scikit-learn use a stratified K-Fold strategy internally, meaning each fold preserves the class proportions of the full dataset. This is the most common and straightforward way to perform cross-validation.
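For illustration only (this code is not part of the lab's main.py), here is a sketch of the manual loop that cross_val_score automates for you. KFold.split yields train/test index arrays, and we fit and score the classifier once per fold:

```python
# Illustrative sketch -- shows what cross_val_score does under the hood.
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=5)  # 5 folds, no shuffling (default)
clf = SVC(kernel='linear', C=1, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    clf.fit(X[train_idx], y[train_idx])                       # train on 4 folds
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print("Manual K-Fold scores:", fold_scores)
```

Note that plain KFold does not stratify, so on an ordered dataset like Iris its per-fold scores can differ from what cross_val_score reports for a classifier.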

Let's proceed to the next step where we will use this concept in practice. Since no code was added in this step, you can click "Continue" to move on.

Perform cross-validation with cross_val_score(clf, X, y, cv=5)

Now it's time to perform the cross-validation. We will use the cross_val_score function we imported earlier. This function takes several arguments:

cross_val_score parameters:

  • estimator: The model to evaluate (our clf classifier)
  • X: The feature data matrix
  • y: The target labels array
  • cv=5: Cross-validation strategy (integer = k-fold, or CV splitter object)
  • scoring: Evaluation metric (default uses estimator's score method)
  • n_jobs: Number of CPU cores to use (default None means 1; -1 uses all cores)

By setting cv=5, we are telling scikit-learn to perform a 5-fold cross-validation. It will automatically split the data into 5 folds, then train and test the model 5 times, returning an array containing the score for each run.

Add the following code to the end of your main.py file, below the comment line:

# Perform 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the array of scores
print("Scores:", scores)

Your complete main.py file should now look like this:

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Initialize a Support Vector Classifier (SVC)
# Parameters explained:
# - kernel='linear': Uses a linear kernel for linearly separable data like Iris
# - C=1: Regularization parameter (higher values = less regularization)
# - random_state=42: Ensures reproducible results across runs
clf = SVC(kernel='linear', C=1, random_state=42)

# --- Your code will go below this line ---

# Perform 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the array of scores
print("Scores:", scores)

Now, run the script from your terminal:

python3 main.py

You will see the output showing an array of 5 scores, one for each fold of the cross-validation.

Scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]

Because the default splitting strategy is deterministic (no shuffling), you should see exactly these scores. This array gives you a detailed look at how the model performed on different subsets of the data.
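As an optional experiment (not part of the lab's required code), the scoring parameter mentioned above lets you swap the default accuracy metric for another one, such as macro-averaged F1:

```python
# Optional experiment: use a different evaluation metric via scoring=
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = SVC(kernel='linear', C=1, random_state=42)

# 'f1_macro' averages the F1 score over the three Iris classes
f1_scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
print("F1 (macro) scores:", f1_scores)
```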

Compute mean CV score with scores.mean()

Having an array of scores is informative, but for a quick summary of the model's performance, we usually calculate the mean of these scores. This single value gives us a general idea of the model's accuracy.

The cross_val_score function returns a NumPy array, which comes with many useful methods, including .mean(). We can call this method directly on our scores variable.

Add the following lines to the end of your main.py script to compute and print the mean score:

# Compute and print the mean of the scores
mean_score = scores.mean()
print("Mean score:", mean_score)

Your main.py file should now contain the following code:

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Initialize a Support Vector Classifier (SVC)
# Parameters explained:
# - kernel='linear': Uses a linear kernel for linearly separable data like Iris
# - C=1: Regularization parameter (higher values = less regularization)
# - random_state=42: Ensures reproducible results across runs
clf = SVC(kernel='linear', C=1, random_state=42)

# --- Your code will go below this line ---

# Perform 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the array of scores
print("Scores:", scores)

# Compute and print the mean of the scores
mean_score = scores.mean()
print("Mean score:", mean_score)

Execute the script again:

python3 main.py

The output will now include the mean of the 5 scores, giving you a single, representative performance metric.

Scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]
Mean score: 0.9800000000000001

Compute standard deviation of CV scores with scores.std()

The mean score tells us the average performance, but it doesn't tell us how consistent that performance is. The standard deviation of the scores gives us a measure of this variance.

Interpreting standard deviation:

  • Low std (< 0.05): Model performs consistently across all data subsets
  • Medium std (0.05-0.15): Moderate variation, acceptable for most cases
  • High std (> 0.15): Large performance variation, may indicate data issues or model instability

A low standard deviation indicates that the model's performance is stable across different subsets of the data, while a high standard deviation suggests that performance is more variable.

Just like .mean(), NumPy arrays also have a .std() method to calculate the standard deviation.

Add the final piece of code to your main.py script to compute and print the standard deviation:

# Compute and print the standard deviation of the scores
std_dev = scores.std()
print("Standard deviation:", std_dev)

Your final main.py script is now complete and should look like this:

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Initialize a Support Vector Classifier (SVC)
# Parameters explained:
# - kernel='linear': Uses a linear kernel for linearly separable data like Iris
# - C=1: Regularization parameter (higher values = less regularization)
# - random_state=42: Ensures reproducible results across runs
clf = SVC(kernel='linear', C=1, random_state=42)

# --- Your code will go below this line ---

# Perform 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the array of scores
print("Scores:", scores)

# Compute and print the mean of the scores
mean_score = scores.mean()
print("Mean score:", mean_score)

# Compute and print the standard deviation of the scores
std_dev = scores.std()
print("Standard deviation:", std_dev)

Run the script one last time:

python3 main.py

The final output will show the array of scores, their mean, and their standard deviation, giving you a comprehensive evaluation of your model's performance.

Scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]
Mean score: 0.9800000000000001
Standard deviation: 0.016329931618554516
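As a side note, a common convention (also used in the scikit-learn documentation) is to summarize both numbers in a single line, reporting the mean together with the standard deviation. A minimal sketch, using the score values from the output above:

```python
import numpy as np

# Score values taken from the lab output above
scores = np.array([0.96666667, 1.0, 0.96666667, 0.96666667, 1.0])

# Report mean accuracy together with the spread across folds
print("%0.2f accuracy with a standard deviation of %0.2f"
      % (scores.mean(), scores.std()))
# -> 0.98 accuracy with a standard deviation of 0.02
```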

Summary

Congratulations on completing this lab! You have successfully learned how to perform and interpret a k-fold cross-validation using scikit-learn.

In this lab, you have:

  • Understood the importance of cross-validation for robust model evaluation.
  • Used the cross_val_score function to easily perform 5-fold cross-validation on a Support Vector Classifier.
  • Analyzed the results by calculating and printing the mean and standard deviation of the cross-validation scores.

Practical tips for cross-validation:

  • Use 5-fold or 10-fold CV for most scenarios
  • Consider stratified CV for imbalanced datasets
  • Use cross_validate instead of cross_val_score for multiple metrics
  • Set random_state (when shuffling is involved) for reproducible results
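The last two tips can be sketched together. This illustrative snippet (not part of the lab's main.py) uses cross_validate to compute several metrics at once, with an explicit StratifiedKFold splitter that keeps class proportions equal in every fold:

```python
# Illustrative sketch: multiple metrics + explicit stratified splitter
from sklearn import datasets
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = SVC(kernel='linear', C=1, random_state=42)

# Shuffle before splitting; random_state makes the shuffle reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(clf, X, y, cv=cv,
                         scoring=['accuracy', 'f1_macro'])
print("Accuracy per fold:", results['test_accuracy'])
print("F1 (macro) per fold:", results['test_f1_macro'])
```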

This technique is a fundamental part of the machine learning workflow, ensuring that your model's performance is reliable and not just a result of a lucky train-test split. You can now apply this knowledge to evaluate your own machine learning models with greater confidence.