Make a Custom Container

Beginner

This tutorial is from the open-source community.

Introduction

In this lab, you will learn about Python containers and memory management. You'll explore how Python handles memory for built-in data structures and discover how to create a memory-efficient custom container class.

The objectives of this lab are to examine the memory allocation behavior of Python lists and dictionaries, create a custom container class to optimize memory usage, and understand the benefits of column-oriented data storage.

This is a Guided Lab, which provides step-by-step instructions to help you learn and practice. Follow the instructions carefully to complete each step and gain hands-on experience.

Understanding List Memory Allocation

In Python, lists are a very useful data structure, especially when you need to add elements to them. Python lists are designed to be efficient for appending operations. Instead of allocating exactly the amount of memory needed, Python allocates extra memory in anticipation of future additions. This strategy minimizes the number of memory reallocations needed when the list grows.

Let's understand this concept better by using the sys.getsizeof() function. This function returns the size of an object in bytes, which helps us see how much memory a list is using at different stages.

  1. First, you need to open a Python interactive shell in your terminal. This is like a playground where you can run Python code immediately. To open it, type the following command in your terminal and press Enter:
python3
  1. Once you're in the Python interactive shell, you need to import the sys module. Modules in Python are like toolboxes that contain useful functions. The sys module has the getsizeof() function we need. After importing the module, create an empty list named items. Here's the code to do that:
import sys
items = []
  1. Now, let's check the initial size of the empty list. We'll use the sys.getsizeof() function with the items list as its argument. Type the following code in the Python interactive shell and press Enter:
sys.getsizeof(items)

You should see a value like 64 bytes. This value represents the overhead for an empty list. The overhead is the basic amount of memory that Python uses to manage the list, even when it has no elements.

  1. Next, we'll start adding items to the list one by one and observe how the memory allocation changes. We'll use the append() method to add an element to the list and then check the size again. Here's the code:
items.append(1)
sys.getsizeof(items)

You should see a larger value, around 96 bytes. This increase in size shows that Python has allocated more memory to accommodate the new element.

  1. Let's continue adding more items to the list and check the size after each addition. Here's the code to do that:
items.append(2)
sys.getsizeof(items)  ## Size remains the same

items.append(3)
sys.getsizeof(items)  ## Size still unchanged

items.append(4)
sys.getsizeof(items)  ## Size still unchanged

items.append(5)
sys.getsizeof(items)  ## Size jumps to a larger value

You'll notice that the size doesn't increase with every append operation. Instead, it jumps up periodically. This demonstrates that Python is allocating memory in chunks rather than individually for each new item.

This behavior is by design. Python allocates more memory than needed initially to avoid frequent reallocations as the list grows. Each time the list exceeds its current capacity, Python allocates a larger chunk of memory.

Remember that a list stores references to objects, not the objects themselves. On a 64-bit system, each reference typically requires 8 bytes of memory. This is important to understand because it affects how much memory a list actually uses when it contains different types of objects.
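The steps above can be condensed into one short script that reports only the appends where a reallocation actually happens. Exact byte counts vary by Python version and platform, so only the growth pattern matters here:

```python
import sys

items = []
sizes = [sys.getsizeof(items)]
print(f"0 elements: {sizes[0]} bytes")

# Append elements and record the allocation after each append;
# print only when the size actually changes (a reallocation)
for i in range(1, 17):
    items.append(i)
    size = sys.getsizeof(items)
    if size != sizes[-1]:
        print(f"{i} elements: {size} bytes (reallocated)")
    sizes.append(size)
```

You should see only a handful of lines printed for the sixteen appends, which is the chunked allocation strategy in action.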

Dictionary Memory Allocation

In Python, dictionaries are a fundamental data structure, just like lists. One important aspect to understand is how they allocate memory, that is, how Python sets aside space in the computer's memory to store the data in your dictionary. Like lists, dictionaries allocate memory in chunks. Let's explore how this works.

  1. First, we need to create a dictionary to work with. In the same Python shell (or open a new one if you closed it), we'll create a dictionary representing a data record. A dictionary in Python is a collection of key-value pairs, where each key is unique and is used to access its corresponding value.
import sys  ## Import sys if you're starting a new session
row = {'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}

Here, we've imported the sys module, which provides access to variables and functions that interact closely with the Python interpreter. We then created a dictionary named row with four key-value pairs.

  1. Now that we have our dictionary, we want to check its initial size. The size of a dictionary refers to the amount of memory it occupies in the computer.
sys.getsizeof(row)

The sys.getsizeof() function returns the size of an object in bytes. When you run this code, you should see a value around 240 bytes. This gives you an idea of how much memory the dictionary takes up initially.

  1. Next, we'll add new key-value pairs to the dictionary and observe how the memory allocation changes. Adding items to a dictionary is a common operation, and understanding how it affects memory is crucial.
row['a'] = 1
sys.getsizeof(row)  ## Size might remain the same

row['b'] = 2
sys.getsizeof(row)  ## Size may increase

When you add the first key-value pair ('a': 1), the size of the dictionary might remain the same, because Python has already allocated a chunk of memory with enough spare space to accommodate the new item. However, when you add the second key-value pair ('b': 2), the size may increase. After a certain number of additions, the dictionary's size suddenly jumps: like lists, dictionaries allocate memory in chunks to optimize performance. Allocating in chunks reduces how often Python has to request more memory from the system, which speeds up the process of adding new items.

  1. Let's try removing an item from the dictionary to see if the memory usage decreases. Removing items from a dictionary is also a common operation, and it's interesting to see how it affects memory.
del row['b']
sys.getsizeof(row)

Interestingly, removing an item doesn't typically reduce the memory allocation. This is because Python keeps the allocated memory to avoid reallocating if items are added again. Reallocating memory is a relatively expensive operation in terms of performance, so Python tries to avoid it as much as possible.
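Both behaviors, chunked growth and no shrinking on delete, can be observed with a short sketch. This relies on CPython's allocation strategy, and exact byte counts vary by version:

```python
import sys

row = {'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
start = sys.getsizeof(row)
print(f"4 keys: {start} bytes")

# Keep adding keys until the allocation grows
size = start
for i in range(10):
    row[f'extra{i}'] = i
    size = sys.getsizeof(row)
    if size != start:
        print(f"{len(row)} keys triggered a resize: {size} bytes")
        break

# Deleting a key does not shrink the allocation
before = sys.getsizeof(row)
del row['extra0']
after = sys.getsizeof(row)
print(f"after del: {after} bytes (was {before} before the delete)")
```

The final two printed sizes should match, confirming that deletion keeps the allocated chunk in place.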

Memory Efficiency Considerations

When working with large datasets where you need to create many records, using dictionaries for each record might not be the most memory-efficient approach. Dictionaries are very flexible and easy to use, but they can consume a significant amount of memory, especially when dealing with a large number of records. Here are some alternatives that consume less memory:

  • Tuples: Simple immutable sequences. A tuple is a collection of values that cannot be changed after it is created. It uses less memory than a dictionary because it doesn't need to store keys and manage the associated key-value mapping.
  • Named tuples: Tuples with field names. Named tuples are similar to regular tuples, but they allow you to access the values by name, which can make the code more readable. They also use less memory than dictionaries.
  • Classes with __slots__: Classes that explicitly define attributes to avoid using a dictionary for instance variables. When you use __slots__ in a class, Python doesn't create a dictionary to store the instance variables, which reduces the memory usage.

These alternatives can significantly reduce memory usage when handling many records.
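To make the comparison concrete, here is a rough sketch measuring one record stored four ways. Note that sys.getsizeof() counts only the container object itself, not the strings and integers it references, and exact byte counts vary by Python version:

```python
import sys
from collections import namedtuple

# One record represented four different ways
d = {'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
t = ('22', '01/01/2001', 'U', 7354)

Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])
nt = Row('22', '01/01/2001', 'U', 7354)

class SlotRow:
    __slots__ = ('route', 'date', 'daytype', 'rides')
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

s = SlotRow('22', '01/01/2001', 'U', 7354)

for label, obj in [('dict', d), ('tuple', t), ('namedtuple', nt), ('__slots__', s)]:
    print(f"{label:>10}: {sys.getsizeof(obj)} bytes")
```

The dictionary should come out noticeably larger than the other three representations, which is exactly the per-record overhead the rest of this lab works to eliminate.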

Optimizing Memory with Column-Oriented Data

In traditional data storage, we often store each record as a separate dictionary, which is called a row-oriented approach. However, this method can consume a significant amount of memory. An alternative way is to store data in columns. In the column-oriented approach, we create separate lists for each attribute, and each list holds all the values for that specific attribute. This can help us save memory.

  1. First, you need to create a new Python file in your project directory. This file will contain the code for reading data in a column-oriented way. Name the file readrides.py. You can use the following commands in the terminal to achieve this:
cd ~/project
touch readrides.py

The cd ~/project command changes the current directory to your project directory, and the touch readrides.py command creates a new empty file named readrides.py.

  1. Next, open the readrides.py file in the WebIDE editor. Then, add the following Python code to the file. This code defines a function read_rides_as_columns that reads bus ride data from a CSV file and stores it in four separate lists, each representing a column of data.
## readrides.py
import csv
import sys
import tracemalloc

def read_rides_as_columns(filename):
    '''
    Read the bus ride data into 4 lists, representing columns
    '''
    routes = []
    dates = []
    daytypes = []
    numrides = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     ## Skip headers
        for row in rows:
            routes.append(row[0])
            dates.append(row[1])
            daytypes.append(row[2])
            numrides.append(int(row[3]))
    return dict(routes=routes, dates=dates, daytypes=daytypes, numrides=numrides)

In this code, we first import the necessary modules csv, sys, and tracemalloc. The csv module is used to read CSV files, sys can be used for system-related operations (although not used in this function), and tracemalloc is used for memory profiling. Inside the function, we initialize four empty lists to store different columns of data. Then, we open the file, skip the header row, and iterate through each row in the file, appending the corresponding values to the appropriate lists. Finally, we return a dictionary containing these four lists.
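If you are following along outside the lab environment and don't have ctabus.csv, you can exercise the same parsing logic with a tiny stand-in file that uses the same four-column layout. The filename sample_rides.csv and the rows below are made up for this sketch:

```python
import csv

# Hypothetical miniature version of the CTA bus data
sample = [
    ('22', '01/01/2001', 'U', 7354),
    ('22', '01/02/2001', 'W', 9904),
    ('4',  '01/01/2001', 'U', 7328),
]

with open('sample_rides.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['route', 'date', 'daytype', 'rides'])  # header row
    writer.writerows(sample)

# Read it back the same way read_rides_as_columns does
with open('sample_rides.csv') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row
    numrides = [int(row[3]) for row in rows]

print(numrides)  # → [7354, 9904, 7328]
```

Pointing read_rides_as_columns at this file should then return a dictionary of four short column lists.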

  1. Now, let's analyze why the column-oriented approach can save memory. We'll do this in the Python shell. Run the following code:
import readrides
import tracemalloc

## Estimate memory for row-oriented approach
nrows = 577563     ## Number of rows in original file
dict_overhead = 240  ## Approximate dictionary overhead in bytes
row_memory = nrows * dict_overhead
print(f"Estimated memory for row-oriented data: {row_memory} bytes ({row_memory/1024/1024:.2f} MB)")

## Estimate memory for column-oriented approach
pointer_size = 8   ## Size of a pointer in bytes on 64-bit systems
column_memory = nrows * 4 * pointer_size  ## 4 columns with one pointer per entry
print(f"Estimated memory for column-oriented data: {column_memory} bytes ({column_memory/1024/1024:.2f} MB)")

## Estimate savings
savings = row_memory - column_memory
print(f"Estimated memory savings: {savings} bytes ({savings/1024/1024:.2f} MB)")

In this code, we first import the readrides module we just created and the tracemalloc module. Then, we estimate the memory usage for the row-oriented approach. We assume that each dictionary has an overhead of 240 bytes, and we multiply this by the number of rows in the original file to get the total memory usage for the row-oriented data. For the column-oriented approach, we assume that on a 64-bit system, each pointer takes 8 bytes. Since we have 4 columns and one pointer per entry, we calculate the total memory usage for the column-oriented data. Finally, we calculate the memory savings by subtracting the column-oriented memory usage from the row-oriented memory usage.

This calculation shows that the column-oriented approach should save roughly 115 MB (about 120 million bytes) of memory compared to the row-oriented approach with dictionaries.

  1. Let's verify this by measuring the actual memory usage with the tracemalloc module. Run the following code:
## Start tracking memory
tracemalloc.start()

## Read the data
columns = readrides.read_rides_as_columns('ctabus.csv')

## Get current and peak memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current/1024/1024:.2f} MB")
print(f"Peak memory usage: {peak/1024/1024:.2f} MB")

## Stop tracking memory
tracemalloc.stop()

In this code, we first start tracking memory using tracemalloc.start(). Then, we call the read_rides_as_columns function to read the data from the ctabus.csv file. After that, we use tracemalloc.get_traced_memory() to get the current and peak memory usage. Finally, we stop tracking memory using tracemalloc.stop().

The output will show you the actual memory usage of your column-oriented data structure. This should be significantly less than our theoretical estimate for the row-oriented approach.

The significant memory savings come from eliminating the overhead of thousands of dictionary objects. Each dictionary in Python has a fixed overhead regardless of how many items it contains. By using column-oriented storage, we only need a few lists instead of thousands of dictionaries.

Creating a Custom Container Class

In data processing, the column-oriented approach is great for saving memory. However, it can cause issues when your existing code expects data to be in the form of a list of dictionaries. To solve this problem, we'll create a custom container class. This class will present a row-oriented interface, which means it will look and act like a list of dictionaries to your code. But internally, it will store data in a column-oriented format, helping us save memory.

  1. First, open the readrides.py file in the WebIDE editor. We're going to add a new class to this file. This class will be the foundation of our custom container.
## Add this to readrides.py
from collections.abc import Sequence

class RideData(Sequence):
    def __init__(self):
        ## Each value is a list with all of the values (a column)
        self.routes = []
        self.dates = []
        self.daytypes = []
        self.numrides = []

    def __len__(self):
        ## All lists assumed to have the same length
        return len(self.routes)

    def __getitem__(self, index):
        return {'route': self.routes[index],
                'date': self.dates[index],
                'daytype': self.daytypes[index],
                'rides': self.numrides[index]}

    def append(self, d):
        self.routes.append(d['route'])
        self.dates.append(d['date'])
        self.daytypes.append(d['daytype'])
        self.numrides.append(d['rides'])

In this code, we define a class named RideData that inherits from Sequence. The __init__ method initializes four empty lists, each representing a column of data. The __len__ method returns the length of the container, which is the same as the length of the routes list. The __getitem__ method allows us to access a specific record by index, returning it as a dictionary. The append method adds a new record to the container by appending values to each column list.
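If you want to experiment with the class before wiring it up to the CSV file, here is a self-contained copy of RideData exercised with a couple of hand-written records:

```python
from collections.abc import Sequence

class RideData(Sequence):
    def __init__(self):
        # Each attribute is a list holding one column of values
        self.routes = []
        self.dates = []
        self.daytypes = []
        self.numrides = []

    def __len__(self):
        return len(self.routes)

    def __getitem__(self, index):
        return {'route': self.routes[index],
                'date': self.dates[index],
                'daytype': self.daytypes[index],
                'rides': self.numrides[index]}

    def append(self, d):
        self.routes.append(d['route'])
        self.dates.append(d['date'])
        self.daytypes.append(d['daytype'])
        self.numrides.append(d['rides'])

records = RideData()
records.append({'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354})
records.append({'route': '22', 'date': '01/02/2001', 'daytype': 'U', 'rides': 9904})

print(len(records))         # → 2
print(records[0]['rides'])  # → 7354

# Because RideData is a Sequence, iteration works too
for rec in records:
    print(rec['route'], rec['rides'])
```

From the caller's point of view each element looks like a dictionary, even though no per-record dictionaries are stored.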

  1. Now, we need a function to read the bus ride data into our custom container. Add the following function to the readrides.py file.
## Add this to readrides.py
def read_rides_as_dicts(filename):
    '''
    Read the bus ride data as a list of dicts, but use our custom container
    '''
    records = RideData()
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     ## Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = {
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides
            }
            records.append(record)
    return records

This function creates an instance of the RideData class and populates it with data from the CSV file. It reads each row from the file, extracts the relevant information, creates a dictionary for each record, and then appends it to the RideData container. The key thing is that it maintains the same interface as a list of dictionaries, but internally stores the data in columns.

  1. Let's test our custom container in the Python shell. This will help us verify that it works as expected.
import readrides

## Read the data using our custom container
rows = readrides.read_rides_as_dicts('ctabus.csv')

## Check the type of the returned object
type(rows)  ## Should be readrides.RideData

## Check the length
len(rows)   ## Should be 577563

## Access individual records
rows[0]     ## Should return a dictionary for the first record
rows[1]     ## Should return a dictionary for the second record
rows[2]     ## Should return a dictionary for the third record

Our custom container successfully implements the Sequence interface, which means it behaves like a list. You can use the len() function to get the number of records in the container, and you can use indexing to access individual records. Each record appears to be a dictionary, even though the data is stored in columns internally. This is great because existing code that expects a list of dictionaries will continue to work with our custom container without any modification.

  1. Finally, let's measure the memory usage of our custom container. This will show us how much memory we're saving compared to a list of dictionaries.
import tracemalloc

tracemalloc.start()
rows = readrides.read_rides_as_dicts('ctabus.csv')
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current/1024/1024:.2f} MB")
print(f"Peak memory usage: {peak/1024/1024:.2f} MB")
tracemalloc.stop()

When you run this code, you should see that the memory usage is similar to the column-oriented approach, which is much lower than what a list of dictionaries would use. This demonstrates the advantage of our custom container in terms of memory efficiency.

Enhancing the Custom Container for Slicing

Our custom container is great for accessing individual records. However, there's an issue when it comes to slicing. When you attempt to take a slice of our container, the result isn't what you'd typically expect.

Let's understand why this happens. In Python, when you write rows[0:10], the interpreter calls __getitem__ with a slice object rather than an integer. Our current __getitem__ passes that slice straight through to each column list, so instead of a new RideData object containing ten records, you get a single dictionary whose values are sliced column lists.
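You can see the root cause with a minimal Sequence subclass whose __getitem__ simply returns whatever it receives. For a slice expression, that is a single slice object, not a series of integers:

```python
from collections.abc import Sequence

class Probe(Sequence):
    # Report exactly what __getitem__ receives
    def __len__(self):
        return 3

    def __getitem__(self, index):
        return index

p = Probe()
print(p[1])    # → 1
print(p[0:2])  # → slice(0, 2, None) -- the slice object itself
```

Any __getitem__ that should support slicing therefore has to detect the slice case explicitly, which is exactly what we do next.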

  1. Let's test slicing in the Python shell:
import readrides

rows = readrides.read_rides_as_dicts('ctabus.csv')
r = rows[0:10]  ## Take a slice of the first 10 records
type(r)  ## This will be a dict, not a RideData object
print(r)  ## A single dict whose values are sliced column lists

In this code, we first import the readrides module and read the data from the ctabus.csv file into a variable rows. When we take a slice of the first 10 records and check the type of the result, we find that it's a plain dictionary, not a RideData object. Printing it shows one dictionary whose values are lists of sliced column data, rather than a sequence of ten record dictionaries.

  1. Let's modify our RideData class to handle slicing properly. Open readrides.py and update the __getitem__ method:
def __getitem__(self, index):
    if isinstance(index, slice):
        ## Handle slice
        result = RideData()
        result.routes = self.routes[index]
        result.dates = self.dates[index]
        result.daytypes = self.daytypes[index]
        result.numrides = self.numrides[index]
        return result
    else:
        ## Handle single index
        return {'route': self.routes[index],
                'date': self.dates[index],
                'daytype': self.daytypes[index],
                'rides': self.numrides[index]}

In this updated __getitem__ method, we first check if the index is a slice. If it is, we create a new RideData object called result. Then we populate this new object with slices of the original data columns (routes, dates, daytypes, and numrides). This ensures that when we slice our custom container, we get another RideData object instead of a list. If the index is not a slice (i.e., it's a single index), we return a dictionary containing the relevant record.

  1. Let's test our improved slicing capability:
import readrides

rows = readrides.read_rides_as_dicts('ctabus.csv')
r = rows[0:10]  ## Take a slice of the first 10 records
type(r)  ## Should now be readrides.RideData
len(r)   ## Should be 10
r[0]     ## Should be the same as rows[0]
r[1]     ## Should be the same as rows[1]

After updating the __getitem__ method, we can test the slicing again. When we take a slice of the first 10 records, the type of the result should now be readrides.RideData. The length of the slice should be 10, and accessing individual elements in the slice should give us the same results as accessing the corresponding elements in the original container.

  1. You can also test with different slice patterns:
## Get every other record from the first 20
r2 = rows[0:20:2]
len(r2)  ## Should be 10

## Get the last 10 records
r3 = rows[-10:]
len(r3)  ## Should be 10

Here, we're testing different slice patterns. The first slice rows[0:20:2] gets every other record from the first 20 records, and the length of the resulting slice should be 10. The second slice rows[-10:] gets the last 10 records, and its length should also be 10.

By properly implementing slicing, our custom container now behaves even more like a standard Python list, while maintaining the memory efficiency of column-oriented storage.

This approach of creating custom container classes that mimic built-in Python containers but with different internal representations is a powerful technique for optimizing memory usage without changing the interface that your code presents to users.

Summary

In this lab, you have learned several important skills. First, you explored the memory allocation behavior in Python lists and dictionaries and learned to optimize memory usage by shifting from row-oriented to column-oriented data storage. Second, you created a custom container class that maintains the original interface while using less memory and enhanced it to handle slicing operations properly.

These techniques are highly valuable for working with large datasets, as they can significantly reduce memory usage without altering your code's interface. The ability to create custom container classes mimicking built-in Python containers with different internal representations is a powerful optimization tool for Python applications. You can apply these concepts to other memory-critical projects, especially those involving large, regularly structured datasets.