Introduction
After training a machine learning model, it's crucial to evaluate its performance to understand how well it generalizes to new, unseen data. Scikit-learn, a powerful Python library for machine learning, provides a comprehensive set of tools for model evaluation in its sklearn.metrics module.
In this lab, you will learn how to evaluate a classification model using some of the most common metrics. We will use a predefined set of true labels and predicted labels to focus solely on the evaluation process. You will learn to compute:
- Accuracy Score
- Confusion Matrix
- Precision Score
- Recall Score
- F1 Score
By the end of this lab, you will be proficient in using these fundamental scikit-learn functions to assess the performance of your classification models.
Compute accuracy score using accuracy_score from sklearn.metrics
In this step, we will calculate the accuracy of our model's predictions. Accuracy is one of the most straightforward classification metrics. It measures the ratio of correctly predicted instances to the total number of instances.
The accuracy_score function from sklearn.metrics computes this value. It takes the true labels and the predicted labels as arguments.
First, open the evaluate.py file from the file explorer on the left. The file already contains the y_true and y_pred lists. Now, add the following code to the end of the file to import the accuracy_score function, calculate the accuracy, and print the result.
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy}")
Your complete evaluate.py file should now look like this:
# In this lab, we will use a predefined set of true labels and predicted labels
# to understand different evaluation metrics.
# y_true represents the actual, ground truth labels for our data points.
# For a binary classification, 0 could mean 'negative' and 1 could mean 'positive'.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
# y_pred represents the labels predicted by our hypothetical classification model.
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
print("Setup complete. True and predicted labels are defined in evaluate.py.")
print(f"True labels: {y_true}")
print(f"Predicted labels: {y_pred}")
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy}")
Now, let's run the script. Open the terminal in your IDE and execute the following command:
python3 evaluate.py
You should see the following output, which includes the accuracy score. An accuracy of 0.8 means that 80% of the predictions were correct.
Setup complete. True and predicted labels are defined in evaluate.py.
True labels: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
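As a side note, accuracy_score can also report the raw number of correct predictions instead of a fraction via its normalize parameter. A minimal sketch using the same labels as above:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Default: fraction of correct predictions
print(accuracy_score(y_true, y_pred))  # 0.8

# normalize=False: count of correct predictions (8 correct out of 10)
print(accuracy_score(y_true, y_pred, normalize=False))
```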
Generate confusion matrix with confusion_matrix from sklearn.metrics
In this step, we will generate a confusion matrix. While accuracy gives a quick summary of performance, it can be misleading, especially for imbalanced datasets. A confusion matrix provides a more detailed breakdown of a classifier's performance by showing the number of correct and incorrect predictions for each class.
The matrix is a table with four combinations of predicted and actual values:
- True Negatives (TN): The model predicted negative, and the actual label was negative.
- False Positives (FP): The model predicted positive, but the actual label was negative.
- False Negatives (FN): The model predicted negative, but the actual label was positive.
- True Positives (TP): The model predicted positive, and the actual label was positive.
We will use the confusion_matrix function from sklearn.metrics. Add the following code to the end of your evaluate.py file.
from sklearn.metrics import confusion_matrix
# Generate confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
Now, run the script again from the terminal:
python3 evaluate.py
The output will now include the confusion matrix.
Setup complete. True and predicted labels are defined in evaluate.py.
True labels: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
This matrix tells us (in scikit-learn's convention, rows are actual classes and columns are predicted classes):
- TN = 4 (Top-left)
- FP = 1 (Top-right)
- FN = 1 (Bottom-left)
- TP = 4 (Bottom-right)
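For binary problems it is often convenient to unpack these four counts into named variables by flattening the 2x2 matrix with ravel(). A minimal sketch using the same labels as in the lab:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=1, FN=1, TP=4
```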
Calculate precision score using precision_score from sklearn.metrics
In this step, we will calculate the precision score. Precision answers the question: "Of all the instances that the model predicted as positive, what proportion was actually positive?" It is a measure of a classifier's exactness.
Precision is calculated as: Precision = True Positives / (True Positives + False Positives)
A low precision indicates a high number of false positives. We will use the precision_score function from sklearn.metrics.
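Before reaching for the library function, you can verify the formula by hand. A quick sketch using the TP and FP counts from the confusion matrix in the previous step:

```python
# Counts taken from the confusion matrix in the previous step
tp, fp = 4, 1

# Precision = TP / (TP + FP)
precision = tp / (tp + fp)
print(precision)  # 0.8
```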
Add the following code to the end of your evaluate.py file to calculate and print the precision.
from sklearn.metrics import precision_score
# Calculate precision
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision}")
Run the script from the terminal:
python3 evaluate.py
You will see the precision score added to the output. Based on our confusion matrix (TP=4, FP=1), the precision is 4 / (4 + 1) = 0.8.
Setup complete. True and predicted labels are defined in evaluate.py.
True labels: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
Calculate recall score using recall_score from sklearn.metrics
In this step, we will compute the recall score. Recall, also known as sensitivity or true positive rate, answers the question: "Of all the actual positive instances, what proportion did the model correctly identify?" It is a measure of a classifier's completeness.
Recall is calculated as: Recall = True Positives / (True Positives + False Negatives)
A low recall indicates a high number of false negatives. We will use the recall_score function from sklearn.metrics.
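As with precision, the formula can be checked by hand. A quick sketch using the TP and FN counts from the confusion matrix computed earlier:

```python
# Counts taken from the confusion matrix computed earlier
tp, fn = 4, 1

# Recall = TP / (TP + FN)
recall = tp / (tp + fn)
print(recall)  # 0.8
```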
Add the following code to the end of your evaluate.py file.
from sklearn.metrics import recall_score
# Calculate recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall}")
Run the script from the terminal:
python3 evaluate.py
The output will now include the recall score. Based on our confusion matrix (TP=4, FN=1), the recall is 4 / (4 + 1) = 0.8.
Setup complete. True and predicted labels are defined in evaluate.py.
True labels: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
Recall: 0.8
Compute F1 score using f1_score from sklearn.metrics
In this final step, we will calculate the F1 score. The F1 score is the harmonic mean of precision and recall. It seeks to find a balance between the two. While precision focuses on minimizing false positives and recall on minimizing false negatives, the F1 score provides a single metric that considers both.
The F1 score is calculated as: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
It is particularly useful when you need a balance between precision and recall and when there is an uneven class distribution. We will use the f1_score function from sklearn.metrics.
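To see the harmonic-mean formula in action, here is a short sketch that computes the F1 score by hand from precision and recall and compares it with f1_score, using the same labels as in the lab:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall; both are 0.8 here, so F1 is 0.8
f1_manual = 2 * (p * r) / (p + r)
print(f1_manual)
print(f1_score(y_true, y_pred))  # should match (up to floating-point rounding)
```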
Add the final piece of code to your evaluate.py file.
from sklearn.metrics import f1_score
# Calculate F1 score
f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1}")
Run the script one last time from the terminal:
python3 evaluate.py
The final output will display all the metrics we have calculated. With a precision and recall of 0.8, the F1 score will also be 0.8.
Setup complete. True and predicted labels are defined in evaluate.py.
True labels: [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Predicted labels: [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
Accuracy: 0.8
Confusion Matrix:
[[4 1]
 [1 4]]
Precision: 0.8
Recall: 0.8
F1 Score: 0.8
Summary
Congratulations on completing the lab! You have successfully learned how to evaluate a classification model using several key metrics from the scikit-learn library.
In this lab, you practiced:
- Calculating accuracy with accuracy_score to get a general sense of model performance.
- Generating a confusion matrix with confusion_matrix to get a detailed view of true/false positives and negatives.
- Computing precision with precision_score to measure the model's exactness.
- Computing recall with recall_score to measure the model's completeness.
- Calculating the F1 score with f1_score to find a balance between precision and recall.
These metrics are fundamental tools for any data scientist or machine learning engineer. Understanding them allows you to better diagnose your model's strengths and weaknesses and choose the right model for your specific problem.
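As a next step, scikit-learn also offers classification_report, which summarizes precision, recall, F1 score, and support for every class in a single call. A minimal sketch with the same labels as in the lab:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# One call prints per-class precision, recall, F1, and support
print(classification_report(y_true, y_pred))
```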