Different Ways of Representing Records


Introduction

In this lab, you will explore memory-efficient ways to store large datasets in Python and discover different ways of representing records, such as tuples, dictionaries, classes, and named tuples.

Moreover, you will compare the memory usage of different data structures. Understanding the trade-offs between these structures is valuable for Python users who perform data analysis, as it helps optimize code.


Skills Graph

This lab draws on the following skills from the Python skill tree: Tuples, Dictionaries, Opening and Closing Files, Reading and Writing Files, File Operations, Using with Statement, and Data Collections.

Exploring the Dataset

Let's start our journey by taking a close look at the dataset we're going to work with. The file ctabus.csv is a CSV (Comma-Separated Values) file. CSV files are a common way to store tabular data, where each line represents a row, and the values within a row are separated by commas. This particular file holds daily ridership data for the Chicago Transit Authority (CTA) bus system, covering the period from January 1, 2001, to August 31, 2013.

To understand the structure of this file, we'll first peek inside it. We'll use Python to read the file and print out some lines. Open a terminal and run the following Python code:

f = open('/home/labex/project/ctabus.csv')
print(next(f))  ## Read the header line
print(next(f))  ## Read the first data line
print(next(f))  ## Read the second data line
f.close()

In this code, we first open the file using the open function and assign the resulting file object to the variable f. The next function reads the next line from the file. We call it three times: once for the header line, which contains the column names, and twice more for the first and second data lines. Finally, we close the file using the close method to free up system resources.

You should see output similar to this:

route,date,daytype,rides
3,01/01/2001,U,7354
4,01/01/2001,U,9288

This output shows that the file has 4 columns of data. Let's break down what each column represents:

  1. route: This is the bus route name or number. It's the first column (Column 0) in the dataset.
  2. date: It's a date string in the MM/DD/YYYY format. This is the second column (Column 1).
  3. daytype: It's a day type code, which is the third column (Column 2).
    • U = Sunday/Holiday
    • A = Saturday
    • W = Weekday
  4. rides: This column records the total number of riders as an integer. It's the fourth column (Column 3).

The rides column tells us how many people boarded a bus on a specific route on a given day. For example, from the output above, we can see that 7,354 people rode the number 3 bus on January 1, 2001.
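
As a quick illustration, here is a minimal sketch, using a sample line copied from the output above, of how one raw CSV line maps onto these four fields:

## A minimal sketch: splitting one raw data line into its four fields
line = "3,01/01/2001,U,7354"
route, date, daytype, rides = line.split(',')
print(f"Route {route} carried {int(rides)} riders on {date} (day type {daytype})")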

Now, let's find out how many lines are in the file. Knowing the number of lines will give us an idea of the size of our dataset. Run the following Python code:

with open('/home/labex/project/ctabus.csv') as f:
    line_count = sum(1 for line in f)
    print(f"Total lines in the file: {line_count}")

In this code, we use the with statement to open the file. The advantage of using with is that it automatically takes care of closing the file when we're done with it. We then use a generator expression (1 for line in f) to create a sequence of 1s, one for each line in the file. The sum function adds up all these 1s, giving us the total number of lines in the file. Finally, we print out the result.

This should output approximately 577,564 lines, which means we're dealing with a substantial dataset. This large dataset will provide us with plenty of data to analyze and draw insights from.

Measuring Memory Usage with Different Storage Methods

In this step, we're going to look at how different ways of storing data can impact memory usage. Memory usage is an important aspect of programming, especially when dealing with large datasets. To measure the memory used by our Python code, we'll use Python's tracemalloc module. This module is very useful as it allows us to track the memory allocations made by Python. By using it, we can see how much memory our data storage methods are consuming.

Method 1: Storing the Entire File as a Single String

Let's start by creating a new Python file. Navigate to the /home/labex/project directory and create a file named memory_test1.py. You can use a text editor to open this file. Once the file is open, add the following code to it. This code will read the entire content of a file as a single string and measure the memory usage.

## memory_test1.py
import tracemalloc

def test_single_string():
    ## Start tracking memory
    tracemalloc.start()

    ## Read the entire file as a single string
    with open('/home/labex/project/ctabus.csv') as f:
        data = f.read()

    ## Get memory usage statistics
    current, peak = tracemalloc.get_traced_memory()

    print(f"File length: {len(data)} characters")
    print(f"Current memory usage: {current/1024/1024:.2f} MB")
    print(f"Peak memory usage: {peak/1024/1024:.2f} MB")

    ## Stop tracking memory
    tracemalloc.stop()

if __name__ == "__main__":
    test_single_string()

After adding the code, save the file. Now, to run this script, open your terminal and execute the following command:

python3 /home/labex/project/memory_test1.py

When you run the script, you should see output similar to this:

File length: 12361039 characters
Current memory usage: 11.80 MB
Peak memory usage: 23.58 MB

The exact numbers might differ on your system, but generally, you'll see a current memory usage of around 12 MB and a peak of about 24 MB. The peak is roughly double the final figure largely because, while decoding the file in text mode, Python briefly holds intermediate buffers alongside the final string.

Method 2: Storing as a List of Strings

Next, we'll test another way of storing the data. Create a new file named memory_test2.py in the same /home/labex/project directory. Open this file in the editor and add the following code. This code reads the file and stores each line as a separate string in a list, and then measures the memory usage.

## memory_test2.py
import tracemalloc

def test_list_of_strings():
    ## Start tracking memory
    tracemalloc.start()

    ## Read the file as a list of strings (one string per line)
    with open('/home/labex/project/ctabus.csv') as f:
        lines = f.readlines()

    ## Get memory usage statistics
    current, peak = tracemalloc.get_traced_memory()

    print(f"Number of lines: {len(lines)}")
    print(f"Current memory usage: {current/1024/1024:.2f} MB")
    print(f"Peak memory usage: {peak/1024/1024:.2f} MB")

    ## Stop tracking memory
    tracemalloc.stop()

if __name__ == "__main__":
    test_list_of_strings()

Save the file and then run the script using the following command in the terminal:

python3 /home/labex/project/memory_test2.py

You should see output similar to this:

Number of lines: 577564
Current memory usage: 43.70 MB
Peak memory usage: 43.74 MB

Notice that the memory usage has increased significantly compared to storing the data as a single string. This is because each line in the list is a separate Python string object, and each object carries its own memory overhead: on a typical 64-bit CPython build, even an empty string takes about 49 bytes, so more than half a million line objects add tens of megabytes of overhead on their own.

Understanding the Memory Difference

The difference in memory usage between the two approaches shows an important concept in Python programming called object overhead. When you store data as a list of strings, each string is a separate Python object. Each object has some additional memory requirements, which include:

  1. The Python object header (usually 16-24 bytes per object). This header contains information about the object, like its type and reference count.
  2. The actual string representation itself, which stores the characters of the string.
  3. Memory alignment padding. This is extra space added to ensure that the object's memory address is properly aligned for efficient access.

On the other hand, when you store the entire file content as a single string, there is only one object, and thus only one set of overhead. This makes it more memory-efficient when considering the total size of the data.

When designing programs that work with large datasets, you need to consider this trade-off between memory efficiency and data accessibility. Sometimes, it might be more convenient to access data when it's stored in a list of strings, but it will use more memory. Other times, you might prioritize memory efficiency and choose to store the data as a single string.
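
You can see this overhead directly with sys.getsizeof, which reports the size of a single object (not anything it references). The figures in the comments below are what a typical 64-bit CPython build reports; they may vary slightly on your system:

## A minimal sketch of per-object overhead using sys.getsizeof
import sys

line = "3,01/01/2001,U,7354\n"
print(sys.getsizeof(""))     ## ~49 bytes: the base cost of an empty string object
print(sys.getsizeof(line))   ## base cost plus roughly one byte per ASCII character
print(sys.getsizeof([]))     ## ~56 bytes: the base cost of an empty list

## Rough estimate of the extra cost of one string object per line of the file
print(f"{49 * 577_564 / 1024 / 1024:.1f} MB of overhead alone")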

Working with Structured Data using Tuples

So far, we've been dealing with storing raw text data. But when it comes to data analysis, we usually need to transform the data into more organized and structured formats. This makes it easier to perform various operations and gain insights from the data. In this step, we'll learn how to read data as a list of tuples using the csv module. Tuples are a simple and useful data structure in Python that can hold multiple values.

Creating a Reader Function with Tuples

Let's create a new file named readrides.py in the /home/labex/project directory. This file will contain the code to read the data from a CSV file and store it as a list of tuples.

## readrides.py
import csv
import tracemalloc

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     ## Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    tracemalloc.start()

    rows = read_rides_as_tuples('/home/labex/project/ctabus.csv')

    current, peak = tracemalloc.get_traced_memory()
    print(f'Number of records: {len(rows)}')
    print(f'First record: {rows[0]}')
    print(f'Second record: {rows[1]}')
    print(f'Memory Use: Current {current/1024/1024:.2f} MB, Peak {peak/1024/1024:.2f} MB')

This script defines a function called read_rides_as_tuples. Here's what it does step by step:

  1. It opens the CSV file specified by the filename parameter. This allows us to access the data inside the file.
  2. It uses the csv module to parse each line of the file. The csv.reader function helps us split the lines into individual values.
  3. It extracts the four fields (route, date, day type, and number of rides) from each row. These fields are important for our data analysis.
  4. It converts the 'rides' field to an integer. This is necessary because the data in the CSV file is initially in string format, and we need a numeric value for calculations.
  5. It creates a tuple with these four values. Tuples are immutable, which means their values cannot be changed once they are created.
  6. It adds the tuple to a list called records. This list will hold all the records from the CSV file.

Now, let's run the script. Open your terminal and enter the following command:

python3 /home/labex/project/readrides.py

You should see output similar to this:

Number of records: 577563
First record: ('3', '01/01/2001', 'U', 7354)
Second record: ('4', '01/01/2001', 'U', 9288)
Memory Use: Current 89.12 MB, Peak 89.15 MB

Notice that the memory usage has increased compared to our previous examples; at roughly 89 MB for 577,563 records, that works out to about 160 bytes per record on average. There are a few reasons for this:

  1. We're now storing the data in a structured format (tuples). Structured data usually requires more memory because it has a defined organization.
  2. Each value in the tuple is a separate Python object. Python objects have some overhead, which contributes to the increased memory usage.
  3. We have an additional list structure that holds all these tuples. Lists also take up memory to store their elements.

The advantage of using this approach is that our data is now properly structured and ready for analysis. We can easily access specific fields of each record by index. For example:

## Example of accessing tuple elements (add this to the end of readrides.py to try it)
first_record = rows[0]
route = first_record[0]
date = first_record[1]
daytype = first_record[2]
rides = first_record[3]
print(f"Route: {route}, Date: {date}, Day type: {daytype}, Rides: {rides}")

However, accessing data by numeric index isn't always intuitive. It can be difficult to remember which index corresponds to which field, especially when dealing with a large number of fields. In the next step, we'll explore other data structures that can make our code more readable and maintainable.


Comparing Different Data Structures

In Python, data structures are used to organize and store related data. They are like containers that hold different types of information in a structured way. In this step, we'll compare different data structures and see how much memory they use.

Let's create a new file called compare_structures.py in the /home/labex/project directory. This file will contain the code to read data from a CSV file and store it in different data structures.

## compare_structures.py
import csv
import tracemalloc
from collections import namedtuple

## Define a named tuple for rides data
RideRecord = namedtuple('RideRecord', ['route', 'date', 'daytype', 'rides'])

## A named tuple is a lightweight class that allows you to access its fields by name.
## It's like a tuple, but with named attributes.

## Define a class with __slots__ for memory optimization
class SlottedRideRecord:
    __slots__ = ['route', 'date', 'daytype', 'rides']

    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

## A class with __slots__ is a memory-optimized class.
## It avoids using an instance dictionary, which saves memory.

## Define a regular class for rides data
class RegularRideRecord:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

## A regular class is an object-oriented way to represent data.
## It has named attributes and can have methods.

## Function to read data as tuples
def read_as_tuples(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        next(rows)  ## Skip headers
        for row in rows:
            record = (row[0], row[1], row[2], int(row[3]))
            records.append(record)
    return records

## This function reads data from a CSV file and stores it as tuples.
## Tuples are immutable sequences, and you access their elements by numeric index.

## Function to read data as dictionaries
def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)  ## Get headers
        for row in rows:
            record = {
                'route': row[0],
                'date': row[1],
                'daytype': row[2],
                'rides': int(row[3])
            }
            records.append(record)
    return records

## This function reads data from a CSV file and stores it as dictionaries.
## Dictionaries use key-value pairs, so you can access elements by their names.

## Function to read data as named tuples
def read_as_named_tuples(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        next(rows)  ## Skip headers
        for row in rows:
            record = RideRecord(row[0], row[1], row[2], int(row[3]))
            records.append(record)
    return records

## This function reads data from a CSV file and stores it as named tuples.
## Named tuples combine the efficiency of tuples with the readability of named access.

## Function to read data as regular class instances
def read_as_regular_classes(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        next(rows)  ## Skip headers
        for row in rows:
            record = RegularRideRecord(row[0], row[1], row[2], int(row[3]))
            records.append(record)
    return records

## This function reads data from a CSV file and stores it as instances of a regular class.
## Regular classes allow you to add methods to your data.

## Function to read data as slotted class instances
def read_as_slotted_classes(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        next(rows)  ## Skip headers
        for row in rows:
            record = SlottedRideRecord(row[0], row[1], row[2], int(row[3]))
            records.append(record)
    return records

## This function reads data from a CSV file and stores it as instances of a slotted class.
## Slotted classes are memory - optimized and still provide named access.

## Function to measure memory usage
def measure_memory(func, filename):
    tracemalloc.start()

    records = func(filename)

    current, peak = tracemalloc.get_traced_memory()

    ## Demonstrate how to use each data structure
    first_record = records[0]
    if func.__name__ == 'read_as_tuples':
        route, date, daytype, rides = first_record
    elif func.__name__ == 'read_as_dicts':
        route = first_record['route']
        date = first_record['date']
        daytype = first_record['daytype']
        rides = first_record['rides']
    else:  ## named tuples and classes
        route = first_record.route
        date = first_record.date
        daytype = first_record.daytype
        rides = first_record.rides

    print(f"Structure type: {func.__name__}")
    print(f"Record count: {len(records)}")
    print(f"Example access: Route={route}, Date={date}, Rides={rides}")
    print(f"Current memory: {current/1024/1024:.2f} MB")
    print(f"Peak memory: {peak/1024/1024:.2f} MB")
    print("-" * 50)

    tracemalloc.stop()

    return current

if __name__ == "__main__":
    filename = '/home/labex/project/ctabus.csv'

    ## Run all memory tests
    print("Memory usage comparison for different data structures:\n")

    results = []
    for reader_func in [
        read_as_tuples,
        read_as_dicts,
        read_as_named_tuples,
        read_as_regular_classes,
        read_as_slotted_classes
    ]:
        memory = measure_memory(reader_func, filename)
        results.append((reader_func.__name__, memory))

    ## Sort by memory usage (lowest first)
    results.sort(key=lambda x: x[1])

    print("\nRanking by memory efficiency (most efficient first):")
    for i, (name, memory) in enumerate(results, 1):
        print(f"{i}. {name}: {memory/1024/1024:.2f} MB")

Run the script to see the comparison results:

python3 /home/labex/project/compare_structures.py

The output will show the memory usage for each data structure, along with a ranking from most to least memory-efficient.

Understanding the Different Data Structures

  1. Tuples:

    • Tuples are lightweight and immutable sequences. This means once you create a tuple, you can't change its elements.
    • You access elements in a tuple by their numeric index, like record[0], record[1], etc.
    • They are very memory-efficient because they have a simple structure.
    • However, they can be less readable because you need to remember the index of each element.
  2. Dictionaries:

    • Dictionaries use key-value pairs, which allows you to access elements by their names.
    • They are more readable, for example, you can use record['route'], record['date'], etc.
    • They have higher memory usage because of the hash table overhead used to store the key-value pairs.
    • They are flexible because you can add or remove fields easily.
  3. Named Tuples:

    • Named tuples combine the efficiency of tuples with the ability to access elements by name.
    • You can access elements using dot notation, like record.route, record.date, etc.
    • They are immutable, just like regular tuples.
    • They are more memory-efficient than dictionaries.
  4. Regular Classes:

    • Regular classes follow an object-oriented approach and have named attributes.
    • You can access attributes using dot notation, like record.route, record.date, etc.
    • You can add methods to a regular class to define behavior.
    • They use more memory because each instance has an instance dictionary to store its attributes.
  5. Classes with __slots__:

    • Classes with __slots__ are memory-optimized classes. They avoid using an instance dictionary, which saves memory.
    • They still provide named access to attributes, like record.route, record.date, etc.
    • They restrict adding new attributes after the object is created (see the sketch after this list).
    • They are more memory-efficient than regular classes.
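
To make points 3 and 5 concrete, here is a minimal sketch, reusing the RideRecord and SlottedRideRecord definitions from compare_structures.py, showing how immutability and __slots__ behave in practice:

## A minimal sketch of named-tuple immutability and __slots__ restrictions
from collections import namedtuple

RideRecord = namedtuple('RideRecord', ['route', 'date', 'daytype', 'rides'])

rec = RideRecord('3', '01/01/2001', 'U', 7354)
print(rec.rides)          ## 7354 -- named access, like an attribute
try:
    rec.rides = 0         ## Named tuples are immutable
except AttributeError as e:
    print(f"AttributeError: {e}")

class SlottedRideRecord:
    __slots__ = ['route', 'date', 'daytype', 'rides']

    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

s = SlottedRideRecord('3', '01/01/2001', 'U', 7354)
s.rides = 7400            ## Existing slots can be reassigned
try:
    s.note = 'holiday'    ## New attributes are rejected: there is no instance __dict__
except AttributeError as e:
    print(f"AttributeError: {e}")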

When to Use Each Approach

  • Tuples: Use tuples when memory is a critical factor and you only need simple indexed access to your data.
  • Dictionaries: Use dictionaries when you need flexibility, such as when the fields in your data may vary.
  • Named Tuples: Use named tuples when you need both readability and memory efficiency.
  • Regular Classes: Use regular classes when you need to add behavior (methods) to your data, as in the sketch below.
  • Classes with __slots__: Use classes with __slots__ when you need behavior and maximum memory efficiency.
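
For example, here is a hypothetical is_weekend method added to RegularRideRecord (the day-type codes come from the dataset description in the first step):

## A hypothetical example of attaching behavior to a regular class
class RegularRideRecord:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

    def is_weekend(self):
        ## 'U' = Sunday/Holiday and 'A' = Saturday in this dataset
        return self.daytype in ('U', 'A')

rec = RegularRideRecord('3', '01/01/2001', 'U', 7354)
print(rec.is_weekend())   ## True: this ride is coded 'U' (Sunday/Holiday)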

By choosing the right data structure for your needs, you can significantly improve the performance and memory usage of your Python programs, especially when working with large datasets.


Summary

In this lab, you have learned different ways to represent records in Python and analyzed their memory efficiency. First, you examined the basic CSV dataset structure and compared raw text storage methods. Then, you worked with structured data using tuples and implemented five different data structures: tuples, dictionaries, named tuples, regular classes, and classes with __slots__.

Key takeaways include that different data structures offer trade-offs among memory efficiency, readability, and functionality. Python's object overhead has a significant impact on memory usage for large datasets, and the choice of data structure can greatly affect memory consumption. Named tuples and classes with __slots__ are good compromises between memory efficiency and code readability. These concepts are valuable for Python developers in data processing, especially when handling large datasets where memory efficiency is crucial.