Implement Unbalanced Data Pipeline | Machine Learning

Introduction

In this project, you will learn how to implement an unbalanced data pipeline that can process imbalanced datasets and generate batches with approximately balanced class distributions. This is a common task in machine learning, where the dataset may have significantly more samples from one class compared to others, which can lead to biased model training and poor performance.

🎯 Tasks

In this project, you will learn:

How to implement the functionality of upsampling and downsampling to balance the sample distribution within a batch.
How to output a batch of samples with a sample count equal to the batch size, where the distribution of the labels within the batch is as equal as possible.
How to test the unbalanced data pipeline to ensure it is working as expected.

🏆 Achievements

After completing this project, you will be able to:

Handle imbalanced datasets in machine learning.
Apply techniques for upsampling and downsampling to balance the class distributions.
Implement a data pipeline that can generate balanced batches from an imbalanced dataset.

Implement Upsampling and Downsampling

In this step, you will learn how to implement the functionality of upsampling and downsampling to balance the sample distribution within a batch.

Open the unbalanced_data_pipeline.py file located in the /home/labex/project directory.

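In the unbalanced_data_pipeline function, start by creating a defaultdict called counter that maps each one-hot label vector to the list of feature vectors carrying that label. The label is converted to a tuple because a list cannot be used as a dictionary key. If the file does not already do so, import defaultdict from collections and the random module at the top of the file.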

    counter = defaultdict(list)
    for x, y in data:
        counter[tuple(y)].append(x)

This will group the data by their label vectors, making it easier to perform upsampling and downsampling.
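
To make the grouping concrete, here is a minimal sketch run on a three-sample toy dataset. The toy data is made up for illustration and is not part of the project files.

```python
from collections import defaultdict

# Hypothetical toy data: [feature_vector, one_hot_label] pairs.
data = [
    [[1, 2, 5], [1, 0]],
    [[7, 0, 4], [0, 1]],
    [[5, 9, 4], [0, 1]],
]

counter = defaultdict(list)
for x, y in data:
    # Lists are unhashable, so the label becomes a tuple before it is used as a key.
    counter[tuple(y)].append(x)

print(dict(counter))
# {(1, 0): [[1, 2, 5]], (0, 1): [[7, 0, 4], [5, 9, 4]]}
```

Each key is one label, and the length of its list shows how many samples that class has.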

Next, calculate how many samples each label contributes to a batch. Integer-divide the batch size by the number of unique labels and store the quotient in pre_num, then store the remainder in num_left.

    batch_data = []
    pre_num = batch_size // len(counter.keys())
    num_left = batch_size % len(counter.keys())
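
As a quick check of the arithmetic, the sketch below wraps the two expressions in a small helper. The name split_quota is hypothetical and used only for this illustration.

```python
def split_quota(batch_size, num_labels):
    """Return (per-label quota, leftover slots) for one batch."""
    pre_num = batch_size // num_labels
    num_left = batch_size % num_labels
    return pre_num, num_left


print(split_quota(6, 2))  # (3, 0)
print(split_quota(7, 3))  # (2, 1)
```

With a batch size of 6 and two labels, each label gets 3 slots and nothing is left over. With a batch size of 7 and three labels, each label gets 2 slots and 1 slot remains to be filled.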

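Now, iterate through the counter dictionary and randomly draw pre_num samples for each label, then add them to the batch_data list. random.sample draws without replacement, which downsamples a class that has more than pre_num samples. It raises a ValueError when a class has fewer than pre_num samples, so in that case use random.choices, which draws with replacement and therefore upsamples the class.

    for y, x in counter.items():
        # Downsample without replacement when possible; otherwise upsample with replacement.
        draw = random.sample if len(x) >= pre_num else random.choices
        samples = draw(x, k=pre_num)
        batch_data.extend([[sample, list(y)] for sample in samples])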

Finally, fill the num_left remaining slots. For each slot, randomly select a label, then randomly select a sample from that label's list, and append the pair to the batch_data list.

    for _ in range(num_left):
        y = random.choice(list(counter.keys()))
        x = random.choice(counter[y])
        batch_data.append([x, list(y)])

Return the batch_data list.

    return batch_data

In this step, you have implemented the functionality of upsampling and downsampling to balance the sample distribution within a batch. The unbalanced_data_pipeline function now takes the input data and the batch size, and returns a single batch: a list of batch_size [sample, label] pairs with an approximately balanced class distribution.
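
Putting the snippets from this step together, the complete function might look like the sketch below. The signature unbalanced_data_pipeline(data, batch_size) is inferred from how the function is called later in this project, and the import lines are assumed because the snippets do not show them. The sketch draws with replacement when a class holds fewer samples than its quota, so that small classes can be upsampled. The demo data at the bottom is hypothetical.

```python
import random
from collections import defaultdict


def unbalanced_data_pipeline(data, batch_size):
    # Group feature vectors by their label, using a hashable tuple as the key.
    counter = defaultdict(list)
    for x, y in data:
        counter[tuple(y)].append(x)

    batch_data = []
    pre_num = batch_size // len(counter)   # slots per label
    num_left = batch_size % len(counter)   # slots left after the even split

    for y, x in counter.items():
        if len(x) >= pre_num:
            # Downsample: pick pre_num distinct samples.
            samples = random.sample(x, pre_num)
        else:
            # Upsample: pick with replacement so the quota can still be met.
            samples = random.choices(x, k=pre_num)
        batch_data.extend([[sample, list(y)] for sample in samples])

    # Fill any leftover slots from randomly chosen labels.
    for _ in range(num_left):
        y = random.choice(list(counter))
        x = random.choice(counter[y])
        batch_data.append([x, list(y)])

    return batch_data


# Hypothetical demo data: 2 samples of label [1, 0] and 5 samples of label [0, 1].
demo = [[[i, i], [1, 0]] for i in range(2)] + [[[i, i], [0, 1]] for i in range(5)]
batch = unbalanced_data_pipeline(demo, 8)
print(len(batch))  # 8
```

In the demo call, each label is given 4 slots, so the two-sample class is upsampled and the five-sample class is downsampled. The sketch does not handle an empty dataset, where the division by the number of labels would fail.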


Test the Unbalanced Data Pipeline

In this step, you will test the unbalanced_data_pipeline function to ensure it is working as expected.

Add the following code in the unbalanced_data_pipeline.py file.

if __name__ == "__main__":
    data = [
        [[1, 2, 5], [1, 0]],
        [[1, 6, 0], [1, 0]],
        [[4, 1, 8], [1, 0]],
        [[7, 0, 4], [0, 1]],
        [[5, 9, 4], [0, 1]],
        [[2, 0, 1], [0, 1]],
        [[1, 9, 3], [0, 1]],
        [[5, 5, 5], [0, 1]],
        [[8, 4, 0], [0, 1]],
        [[9, 6, 3], [0, 1]],
        [[7, 7, 0], [0, 1]],
        [[0, 3, 4], [0, 1]],
    ]
    for epoch in range(10):
        batch_data = unbalanced_data_pipeline(data, 6)
        batch_data = list(batch_data)
        print(f"{epoch=}, {batch_data=}")

In the if __name__ == "__main__": block, we call the unbalanced_data_pipeline function with the sample data and a batch size of 6.

Run the unbalanced_data_pipeline.py file to see the output.

python unbalanced_data_pipeline.py

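The samples are drawn at random, so your output will differ in detail from run to run. With a batch size of 6 and two labels, each batch should contain three samples of each label, similar to the example provided in the original challenge: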

epoch=0, batch_data=[[[1, 2, 5], [1, 0]], [[4, 1, 8], [1, 0]], [[1, 6, 0], [1, 0]], [[2, 0, 1], [0, 1]], [[7, 0, 4], [0, 1]], [[5, 9, 4], [0, 1]]]
epoch=1, batch_data=[[[4, 1, 8], [1, 0]], [[1, 2, 5], [1, 0]], [[1, 6, 0], [1, 0]], [[2, 0, 1], [0, 1]], [[9, 6, 3], [0, 1]], [[1, 9, 3], [0, 1]]]
epoch=2, batch_data=[[[4, 1, 8], [1, 0]], [[1, 2, 5], [1, 0]], [[1, 6, 0], [1, 0]], [[5, 5, 5], [0, 1]], [[7, 0, 4], [0, 1]], [[8, 4, 0], [0, 1]]]
epoch=3, batch_data=[[[1, 2, 5], [1, 0]], [[1, 6, 0], [1, 0]], [[4, 1, 8], [1, 0]], [[7, 7, 0], [0, 1]], [[8, 4, 0], [0, 1]], [[0, 3, 4], [0, 1]]]
epoch=4, batch_data=[[[4, 1, 8], [1, 0]], [[1, 6, 0], [1, 0]], [[1, 2, 5], [1, 0]], [[5, 5, 5], [0, 1]], [[0, 3, 4], [0, 1]], [[8, 4, 0], [0, 1]]]
epoch=5, batch_data=[[[1, 6, 0], [1, 0]], [[4, 1, 8], [1, 0]], [[1, 2, 5], [1, 0]], [[2, 0, 1], [0, 1]], [[7, 0, 4], [0, 1]], [[7, 7, 0], [0, 1]]]
epoch=6, batch_data=[[[1, 2, 5], [1, 0]], [[1, 6, 0], [1, 0]], [[4, 1, 8], [1, 0]], [[8, 4, 0], [0, 1]], [[5, 9, 4], [0, 1]], [[0, 3, 4], [0, 1]]]
epoch=7, batch_data=[[[1, 2, 5], [1, 0]], [[1, 6, 0], [1, 0]], [[4, 1, 8], [1, 0]], [[2, 0, 1], [0, 1]], [[0, 3, 4], [0, 1]], [[1, 9, 3], [0, 1]]]
epoch=8, batch_data=[[[1, 6, 0], [1, 0]], [[4, 1, 8], [1, 0]], [[1, 2, 5], [1, 0]], [[7, 7, 0], [0, 1]], [[2, 0, 1], [0, 1]], [[0, 3, 4], [0, 1]]]
epoch=9, batch_data=[[[1, 2, 5], [1, 0]], [[4, 1, 8], [1, 0]], [[1, 6, 0], [1, 0]], [[7, 0, 4], [0, 1]], [[0, 3, 4], [0, 1]], [[5, 5, 5], [0, 1]]]

In this step, you have tested the unbalanced_data_pipeline function to ensure it is working as expected. The function should now be able to process the unbalanced data and return batches of data with approximately balanced class distributions.
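
Rather than inspecting the printed batches by eye, you could count the labels in a batch. The sketch below is one possible way to do that. The helper name label_counts is hypothetical, and the batch is copied from the epoch=0 line of the example output.

```python
from collections import Counter


def label_counts(batch):
    """Count how many samples in a batch carry each one-hot label."""
    return Counter(tuple(label) for _, label in batch)


# The epoch=0 batch from the example output.
batch = [
    [[1, 2, 5], [1, 0]], [[4, 1, 8], [1, 0]], [[1, 6, 0], [1, 0]],
    [[2, 0, 1], [0, 1]], [[7, 0, 4], [0, 1]], [[5, 9, 4], [0, 1]],
]

counts = label_counts(batch)
print(counts)  # Counter({(1, 0): 3, (0, 1): 3})
```

When the batch size is not divisible by the number of labels, the leftover slots are filled from randomly chosen labels, so the counts may differ slightly between labels.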


Summary

Congratulations! You have completed this project. You can practice more labs in LabEx to improve your skills.
