Divide Dataset Into Mini-Batches

Introduction

In this project, you will learn how to implement a function that divides a dataset into mini-batches, a common technique in deep learning training, where a model is updated on small groups of samples rather than on the whole dataset at once.

🎯 Tasks

In this project, you will learn:

  • How to implement the data_pipeline function to divide a dataset into mini-batches
  • How to test the data_pipeline function to ensure it works as expected

๐Ÿ† Achievements

After completing this project, you will be able to:

  • Divide a dataset into mini-batches using the data_pipeline function
  • Verify the functionality of the data_pipeline function through testing

Implement Mini-Batches

In this step, you will learn how to implement the data_pipeline function to divide a dataset into mini-batches.

Open the data_pipeline.py file in your text editor.

Implement the data_pipeline function according to the requirements:

  • The function should take two parameters: data (a list of lists containing integers) and batch_size (an integer representing the size of each mini-batch).
  • The function should return a generator that yields batches of the input data, where each batch contains up to batch_size lists of integers.
  • If fewer than batch_size samples remain at the end of the dataset, the function should yield all of them as a final, smaller batch.
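
The requirements ask for a generator rather than a list, so it may help to see how a generator behaves before reading the solution. The following is a minimal sketch that is independent of the project files; count_up_to is a hypothetical function introduced only for this illustration.

```python
from typing import Generator


def count_up_to(limit: int) -> Generator[int, None, None]:
    # Hypothetical example function, not part of data_pipeline.py.
    for n in range(1, limit + 1):
        yield n  # Execution pauses here until the next value is requested.


numbers = count_up_to(3)
print(next(numbers))  # 1
print(next(numbers))  # 2
print(list(numbers))  # [3]  (only the values that were not consumed yet)
```

Values are produced one at a time, on request, which is why a generator suits batching: only the batch currently being processed has to be built.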

Here's the completed data_pipeline function:

from typing import Generator, List

def data_pipeline(data: List[List[int]], batch_size: int) -> Generator[List[List[int]], None, None]:
    """
    This function takes a list of lists containing integers and divides it into smaller 'batches' of a specified size.
    It returns a generator that yields these batches sequentially.

    Parameters:
    data (List[List[int]]): The input dataset, a list of lists containing integers.
    batch_size (int): The size of each batch, i.e., the number of lists of integers to include in each batch.

    Returns:
    Generator[List[List[int]], None, None]: A generator yielding batches of the input data. Each batch contains 'batch_size' lists of integers, except possibly the last one, which holds the remaining samples.
    """
    for i in range(0, len(data), batch_size):
        batch_data = data[i : i + batch_size]
        yield batch_data

Save the data_pipeline.py file.
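
The function needs no special case for the final, smaller batch, and looking at its pieces separately shows why. The sketch below is illustrative only: it repeats the function so that it runs on its own, and the names starts and expected_batches are introduced here for the example.

```python
import math
from typing import Generator, List


def data_pipeline(data: List[List[int]], batch_size: int) -> Generator[List[List[int]], None, None]:
    # Same logic as the function above, repeated so this snippet is self-contained.
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]


data = [[1, 2], [1, 3], [3, 5], [2, 1], [3, 3]]
batch_size = 2

# range() produces one start index per batch.
starts = list(range(0, len(data), batch_size))
print(starts)  # [0, 2, 4]

# A slice that runs past the end of a list is clamped instead of raising an error,
# so the last slice simply contains whatever samples are left.
print(data[4:6])  # [[3, 3]]

# The number of batches is the dataset size divided by batch_size, rounded up.
expected_batches = math.ceil(len(data) / batch_size)
batches = list(data_pipeline(data, batch_size))
print(len(batches) == expected_batches)  # True
```

Note that this approach assumes batch_size is a positive integer; range() raises a ValueError when its step is zero.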

Test the Mini Batches

In this step, you will test the data_pipeline function to ensure it works as expected.

Open the data_pipeline.py file in your text editor.

Add the following code at the end of the file to test the data_pipeline function:

if __name__ == "__main__":
    data = [[1, 2], [1, 3], [3, 5], [2, 1], [3, 3]]
    batch_size = 2
    batch_data = data_pipeline(data, batch_size)
    for batch in batch_data:
        # The "=" specifier (Python 3.8+) prints the variable name together with its value.
        print(f"{batch=}")

Save the data_pipeline.py file.

Run the data_pipeline.py file in your terminal:

python data_pipeline.py

The output should be:

batch=[[1, 2], [1, 3]]
batch=[[3, 5], [2, 1]]
batch=[[3, 3]]

This output confirms that the data_pipeline function works as expected: the five samples are split into two full mini-batches of size 2, and the final batch holds the one remaining sample.
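
Reading printed output works for one example, but the same expectations can also be written as assertions that fail loudly if the behavior changes. The following is a minimal sketch, not part of the lab's official checks; check_data_pipeline is a hypothetical helper name, and the function is repeated so that the snippet runs on its own.

```python
from typing import Generator, List


def data_pipeline(data: List[List[int]], batch_size: int) -> Generator[List[List[int]], None, None]:
    # Same logic as the function implemented in the previous step.
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]


def check_data_pipeline() -> None:
    data = [[1, 2], [1, 3], [3, 5], [2, 1], [3, 3]]

    # The example from this step: two full batches and one smaller final batch.
    assert list(data_pipeline(data, 2)) == [
        [[1, 2], [1, 3]],
        [[3, 5], [2, 1]],
        [[3, 3]],
    ]

    # A dataset whose size is a multiple of batch_size has no leftover batch.
    assert list(data_pipeline(data[:4], 2)) == [[[1, 2], [1, 3]], [[3, 5], [2, 1]]]

    # A batch_size larger than the dataset yields a single batch with every sample.
    assert list(data_pipeline(data, 10)) == [data]

    # An empty dataset yields no batches at all.
    assert list(data_pipeline([], 2)) == []

    # Joining the batches back together reproduces the dataset in its original order.
    flattened = [sample for batch in data_pipeline(data, 2) for sample in batch]
    assert flattened == data


check_data_pipeline()
print("All checks passed")
```

The generator is wrapped in list() in each check because a generator cannot be compared with a list directly; its batches have to be collected first.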

Summary

Congratulations! You have completed this project: you implemented the data_pipeline generator function and verified its output on a small dataset. You can practice more labs in LabEx to improve your skills.
