How to Efficiently Process Large CSV Files in Python


Introduction

Handling large CSV files is a common challenge for Python developers. This tutorial walks you through techniques for processing such files efficiently, with a focus on performance and memory usage. By the end of this lab, you will have the practical knowledge to tackle data-intensive CSV processing tasks in Python.

Whether you are analyzing customer data, processing financial records, or working with any type of structured data, the skills learned in this lab will help you process large datasets efficiently without running into memory issues.



Creating and Reading a Simple CSV File

CSV (Comma-Separated Values) is a popular file format used to store tabular data. In this step, we will create a simple CSV file and learn how to read it using Python's built-in csv module.

Understanding CSV Files

A CSV file stores data in a plain text format where:

  • Each line represents a row of data
  • Values within each row are separated by a delimiter (typically a comma)
  • The first row often contains column headers

Let's start by creating a simple CSV file to work with.

Creating a CSV File

First, let's create a directory to work in and then create a simple CSV file:

  1. Open the terminal in the WebIDE
  2. Create a new Python file named csv_basics.py in the editor

Now, add the following code to csv_basics.py:

import csv

## Data to write to CSV
data = [
    ['Name', 'Age', 'City'],              ## Header row
    ['John Smith', '28', 'New York'],
    ['Sarah Johnson', '32', 'San Francisco'],
    ['Michael Brown', '45', 'Chicago'],
    ['Emily Davis', '36', 'Boston'],
    ['David Wilson', '52', 'Seattle']
]

## Writing data to a CSV file
with open('sample_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("CSV file 'sample_data.csv' created successfully.")

Run this code by executing the following command in the terminal:

python3 csv_basics.py

Expected output:

CSV file 'sample_data.csv' created successfully.

This will create a new CSV file named sample_data.csv in your current directory. You can view the contents of this file by running:

cat sample_data.csv

Expected output:

Name,Age,City
John Smith,28,New York
Sarah Johnson,32,San Francisco
Michael Brown,45,Chicago
Emily Davis,36,Boston
David Wilson,52,Seattle

Reading a CSV File

Now let's read the CSV file we just created. Create a new file named read_csv.py with the following code:

import csv

## Reading a CSV file
with open('sample_data.csv', 'r') as file:
    reader = csv.reader(file)

    print("Contents of sample_data.csv:")
    print("--------------------------")

    for row in reader:
        print(row)

## Reading and accessing specific columns
print("\nReading specific columns:")
print("--------------------------")

with open('sample_data.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)  ## Skip the header row

    for row in reader:
        name = row[0]
        age = row[1]
        city = row[2]
        print(f"Name: {name}, Age: {age}, City: {city}")

Run this code with:

python3 read_csv.py

Expected output:

Contents of sample_data.csv:
--------------------------
['Name', 'Age', 'City']
['John Smith', '28', 'New York']
['Sarah Johnson', '32', 'San Francisco']
['Michael Brown', '45', 'Chicago']
['Emily Davis', '36', 'Boston']
['David Wilson', '52', 'Seattle']

Reading specific columns:
--------------------------
Name: John Smith, Age: 28, City: New York
Name: Sarah Johnson, Age: 32, City: San Francisco
Name: Michael Brown, Age: 45, City: Chicago
Name: Emily Davis, Age: 36, City: Boston
Name: David Wilson, Age: 52, City: Seattle

Understanding the CSV Module

The Python csv module provides two main interfaces for row-based access:

  • csv.reader: Reads CSV files and returns each row as a list of strings
  • csv.writer: Writes data to CSV files

This module handles all the complexities of dealing with different CSV formats, such as escaping special characters and handling quotes.
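
To see that quoting behavior in action, here is a small standalone sketch (not one of the lab files; the quoting_demo.csv filename is just for illustration). It writes a value that contains a comma, which csv.writer wraps in quotes automatically, and csv.reader parses it back into a single field:

import csv

## A field that contains the delimiter itself
rows = [
    ['Name', 'Quote'],
    ['Ada Lovelace', 'Numbers, when combined, tell stories']
]

with open('quoting_demo.csv', 'w', newline='') as file:
    csv.writer(file).writerows(rows)  ## The comma-containing field is quoted automatically

with open('quoting_demo.csv', 'r') as file:
    for row in csv.reader(file):  ## Quotes are stripped and the field stays intact
        print(row)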

In this step, you've learned how to create and read a simple CSV file. In the next step, we'll explore more efficient ways to handle larger CSV files.

Using DictReader for Convenient CSV Processing

In the previous step, we worked with the basic csv.reader and csv.writer functions. Now, let's explore a more convenient way to process CSV files using the csv.DictReader class, which is especially useful when working with data that has column headers.

What is DictReader?

csv.DictReader reads CSV files and returns each row as a dictionary where:

  • The keys are taken from the column headers (first row of the CSV file by default)
  • The values are the corresponding data from each row

This approach makes your code more readable and less error-prone because you can reference columns by name instead of by index.
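
As a minimal sketch before we build a larger dataset, the sample_data.csv file from the previous step can already be read with DictReader in just a few lines:

import csv

## Each row becomes a dictionary keyed by the header row: 'Name', 'Age', 'City'
with open('sample_data.csv', 'r') as file:
    for row in csv.DictReader(file):
        print(f"{row['Name']} is {row['Age']} and lives in {row['City']}")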

Create a Larger Test File

First, let's create a slightly larger CSV file to demonstrate the benefits of DictReader. Create a new file named create_users_data.py with the following code:

import csv
import random

## Generate some sample user data
def generate_users(count):
    users = [['id', 'name', 'email', 'age', 'country']]  ## Header row

    domains = ['gmail.com', 'yahoo.com', 'outlook.com', 'example.com']
    countries = ['USA', 'Canada', 'UK', 'Australia', 'Germany', 'France', 'Japan', 'Brazil']
    first_names = ['John', 'Jane', 'Michael', 'Emily', 'David', 'Sarah', 'Robert', 'Lisa']
    last_names = ['Smith', 'Johnson', 'Brown', 'Davis', 'Wilson', 'Miller', 'Jones', 'Taylor']

    for i in range(1, count + 1):
        first_name = random.choice(first_names)
        last_name = random.choice(last_names)
        name = f"{first_name} {last_name}"
        email = f"{first_name.lower()}.{last_name.lower()}@{random.choice(domains)}"
        age = random.randint(18, 65)
        country = random.choice(countries)

        users.append([str(i), name, email, str(age), country])

    return users

## Create a CSV file with 100 users
users_data = generate_users(100)

## Write data to CSV file
with open('users_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(users_data)

print(f"Created 'users_data.csv' with 100 user records")

Run the script to create the file:

python3 create_users_data.py

Expected output:

Created 'users_data.csv' with 100 user records

Let's examine the first few lines of this new file:

head -n 5 users_data.csv

You should see the header row followed by 4 rows of data:

id,name,email,age,country
1,John Smith,john.smith@gmail.com,25,USA
2,Emily Brown,emily.brown@yahoo.com,32,Canada
3,David Jones,david.jones@outlook.com,45,UK
4,Sarah Wilson,sarah.wilson@example.com,28,Australia

Using DictReader to Process the CSV File

Now, let's create a script to process this file using DictReader. Create a new file named dict_reader_example.py with the following code:

import csv

## Read the CSV file using DictReader
with open('users_data.csv', 'r') as file:
    csv_reader = csv.DictReader(file)

    ## Print the field names (column headers)
    print(f"Column headers: {csv_reader.fieldnames}")
    print("\nFirst 5 records:")
    print("-----------------")

    ## Print the first 5 records
    for i, row in enumerate(csv_reader):
        if i < 5:
            ## Access fields by name
            print(f"User {row['id']}: {row['name']}, {row['age']} years old, from {row['country']}")
            print(f"  Email: {row['email']}")
        else:
            break

## Using DictReader for data analysis
with open('users_data.csv', 'r') as file:
    csv_reader = csv.DictReader(file)

    ## Calculate average age
    total_age = 0
    user_count = 0

    ## Count users by country
    countries = {}

    for row in csv_reader:
        user_count += 1
        total_age += int(row['age'])

        ## Count users by country
        country = row['country']
        if country in countries:
            countries[country] += 1
        else:
            countries[country] = 1

    avg_age = total_age / user_count if user_count > 0 else 0

    print("\nData Analysis:")
    print("--------------")
    print(f"Total users: {user_count}")
    print(f"Average age: {avg_age:.2f} years")
    print("\nUsers by country:")

    for country, count in sorted(countries.items(), key=lambda x: x[1], reverse=True):
        print(f"  {country}: {count} users")

Run this script:

python3 dict_reader_example.py

Expected output (your exact values may vary since the data is randomly generated):

Column headers: ['id', 'name', 'email', 'age', 'country']

First 5 records:
-----------------
User 1: John Smith, 25 years old, from USA
  Email: john.smith@gmail.com
User 2: Emily Brown, 32 years old, from Canada
  Email: emily.brown@yahoo.com
User 3: David Jones, 45 years old, from UK
  Email: david.jones@outlook.com
User 4: Sarah Wilson, 28 years old, from Australia
  Email: sarah.wilson@example.com
User 5: Michael Taylor, 37 years old, from Germany
  Email: michael.taylor@gmail.com

Data Analysis:
--------------
Total users: 100
Average age: 41.35 years

Users by country:
  USA: 16 users
  Canada: 14 users
  Japan: 13 users
  UK: 12 users
  Germany: 12 users
  Australia: 12 users
  France: 11 users
  Brazil: 10 users

Benefits of Using DictReader

As you can see, using DictReader provides several advantages:

  1. Readable code: You can access fields by name instead of remembering index positions
  2. Self-documenting: The code clearly shows which field you're accessing
  3. Flexibility: If the column order changes in the CSV file, your code still works as long as the column names remain the same

This approach is particularly useful when working with real-world data that has many columns or when the column order might change over time.
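
To illustrate the flexibility point, here is a small hypothetical example (the reordered.csv filename is made up for this sketch): even with the columns written in a different order than users_data.csv, the same name-based lookups keep working:

import csv

## Same fields as users_data.csv, but stored in a different column order
with open('reordered.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['country', 'name', 'id'])
    writer.writerow(['Canada', 'Jane Miller', '1'])

## Lookups by header name are unaffected by the column order
with open('reordered.csv', 'r') as file:
    for row in csv.DictReader(file):
        print(f"User {row['id']}: {row['name']} from {row['country']}")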

In the next step, we'll explore efficient techniques for processing larger CSV files without loading everything into memory at once.

Processing Large CSV Files Efficiently

In real-world scenarios, you may need to process CSV files that are several gigabytes in size. Loading such files entirely into memory can cause your application to crash or slow down significantly. In this step, we'll explore techniques to process large CSV files efficiently.

The Memory Challenge with Large Files

When working with CSV files, there are three common approaches, each with different memory requirements:

  1. Loading the entire file into memory - Simple but uses the most memory
  2. Streaming the file line by line - Uses minimal memory but may be slower for complex operations
  3. Chunking - A middle ground that processes the file in manageable chunks

Let's explore each of these approaches with practical examples.
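
For reference, approach 1 looks like the short sketch below (an illustration only, not one of the lab scripts; it assumes the transactions.csv file we generate in the next subsection already exists). Calling list() on the reader pulls every row into memory at once, which is simple but makes memory usage grow with the file size:

import csv

## Approach 1: load the entire file into memory at once
## Assumes transactions.csv (created in the next subsection) exists
with open('transactions.csv', 'r') as file:
    all_rows = list(csv.DictReader(file))  ## Every row is now held in a Python list

print(f"Loaded {len(all_rows):,} rows into memory")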

Creating a Larger Sample File

First, let's create a larger sample file to demonstrate these techniques. Create a new file named create_large_csv.py:

import csv
import random
from datetime import datetime, timedelta

def create_large_csv(filename, num_rows):
    ## Define column headers
    headers = ['transaction_id', 'date', 'customer_id', 'product_id', 'amount', 'status']

    ## Create a file with the specified number of rows
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(headers)

        ## Generate random data
        start_date = datetime(2022, 1, 1)
        status_options = ['completed', 'pending', 'failed', 'refunded']

        for i in range(1, num_rows + 1):
            ## Generate random values
            transaction_id = f"TXN-{i:08d}"
            days_offset = random.randint(0, 365)
            date = (start_date + timedelta(days=days_offset)).strftime('%Y-%m-%d')
            customer_id = f"CUST-{random.randint(1001, 9999)}"
            product_id = f"PROD-{random.randint(101, 999)}"
            amount = round(random.uniform(10.0, 500.0), 2)
            status = random.choice(status_options)

            ## Write row to CSV
            writer.writerow([transaction_id, date, customer_id, product_id, amount, status])

            ## Print progress indicator for every 10,000 rows
            if i % 10000 == 0:
                print(f"Generated {i} rows...")

## Create a CSV file with 50,000 rows (roughly 3-4 MB)
create_large_csv('transactions.csv', 50000)
print("Large CSV file 'transactions.csv' has been created.")

Run this script to create the file:

python3 create_large_csv.py

Expected output:

Generated 10000 rows...
Generated 20000 rows...
Generated 30000 rows...
Generated 40000 rows...
Generated 50000 rows...
Large CSV file 'transactions.csv' has been created.

You can check the size of the file with:

ls -lh transactions.csv

Expected output (size may vary slightly):

-rw-r--r-- 1 labex labex 3.8M Apr 15 12:30 transactions.csv

Approach 1: Line-by-Line Processing (Streaming)

The most memory-efficient approach is to process a CSV file line by line. Create a file named streaming_example.py:

import csv
import time

def process_csv_streaming(filename):
    print(f"Processing {filename} using streaming (line by line)...")
    start_time = time.time()

    ## Track some statistics
    row_count = 0
    total_amount = 0
    status_counts = {'completed': 0, 'pending': 0, 'failed': 0, 'refunded': 0}

    ## Process the file line by line
    with open(filename, 'r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            ## Increment row counter
            row_count += 1

            ## Process row data
            amount = float(row['amount'])
            status = row['status']

            ## Update statistics
            total_amount += amount
            status_counts[status] += 1

    ## Calculate and display results
    end_time = time.time()
    processing_time = end_time - start_time

    print(f"\nResults:")
    print(f"Processed {row_count:,} rows in {processing_time:.2f} seconds")
    print(f"Total transaction amount: ${total_amount:,.2f}")
    print(f"Average transaction amount: ${total_amount/row_count:.2f}")
    print("\nTransaction status breakdown:")
    for status, count in status_counts.items():
        percentage = (count / row_count) * 100
        print(f"  {status}: {count:,} ({percentage:.1f}%)")

## Process the file
process_csv_streaming('transactions.csv')

Run this script:

python3 streaming_example.py

Expected output (your exact numbers may vary):

Processing transactions.csv using streaming (line by line)...

Results:
Processed 50,000 rows in 0.17 seconds
Total transaction amount: $12,739,853.35
Average transaction amount: $254.80

Transaction status breakdown:
  completed: 12,432 (24.9%)
  pending: 12,598 (25.2%)
  failed: 12,414 (24.8%)
  refunded: 12,556 (25.1%)

Approach 2: Chunked Processing

For more complex operations or when you need to process data in batches, you can use a chunked approach. Create a file named chunked_example.py:

import csv
import time

def process_csv_chunked(filename, chunk_size=10000):
    print(f"Processing {filename} using chunks of {chunk_size} rows...")
    start_time = time.time()

    ## Track some statistics
    row_count = 0
    total_amount = 0
    status_counts = {'completed': 0, 'pending': 0, 'failed': 0, 'refunded': 0}

    ## Process the file in chunks
    with open(filename, 'r') as file:
        reader = csv.DictReader(file)

        chunk = []
        for row in reader:
            ## Add row to current chunk
            chunk.append(row)

            ## When chunk reaches desired size, process it
            if len(chunk) >= chunk_size:
                ## Process the chunk
                for row_data in chunk:
                    ## Update statistics
                    row_count += 1
                    amount = float(row_data['amount'])
                    status = row_data['status']

                    total_amount += amount
                    status_counts[status] += 1

                print(f"Processed chunk of {len(chunk)} rows... ({row_count:,} total)")
                ## Clear the chunk for next batch
                chunk = []

        ## Process any remaining rows in the last chunk
        if chunk:
            for row_data in chunk:
                row_count += 1
                amount = float(row_data['amount'])
                status = row_data['status']

                total_amount += amount
                status_counts[status] += 1

            print(f"Processed final chunk of {len(chunk)} rows... ({row_count:,} total)")

    ## Calculate and display results
    end_time = time.time()
    processing_time = end_time - start_time

    print(f"\nResults:")
    print(f"Processed {row_count:,} rows in {processing_time:.2f} seconds")
    print(f"Total transaction amount: ${total_amount:,.2f}")
    print(f"Average transaction amount: ${total_amount/row_count:.2f}")
    print("\nTransaction status breakdown:")
    for status, count in status_counts.items():
        percentage = (count / row_count) * 100
        print(f"  {status}: {count:,} ({percentage:.1f}%)")

## Process the file with chunks of 10,000 rows
process_csv_chunked('transactions.csv', chunk_size=10000)

Run this script:

python3 chunked_example.py

Expected output:

Processing transactions.csv using chunks of 10000 rows...
Processed chunk of 10000 rows... (10,000 total)
Processed chunk of 10000 rows... (20,000 total)
Processed chunk of 10000 rows... (30,000 total)
Processed chunk of 10000 rows... (40,000 total)
Processed chunk of 10000 rows... (50,000 total)

Results:
Processed 50,000 rows in 0.20 seconds
Total transaction amount: $12,739,853.35
Average transaction amount: $254.80

Transaction status breakdown:
  completed: 12,432 (24.9%)
  pending: 12,598 (25.2%)
  failed: 12,414 (24.8%)
  refunded: 12,556 (25.1%)

Memory Usage Comparison

The key differences between these approaches are:

  1. Streaming (line by line):

    • Uses minimal memory regardless of file size
    • Best for very large files
    • Simple operations on each row
  2. Chunked processing:

    • Uses more memory than streaming but still efficient
    • Good for operations that need to process rows in batches
    • Provides progress updates during processing
    • Can be combined with multiprocessing for parallel processing

For most practical purposes, the streaming approach is recommended unless you specifically need batch processing capabilities. It provides the best memory efficiency while still maintaining good performance.
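
If you want to verify these memory characteristics yourself, Python's standard-library tracemalloc module can report peak memory allocations. The sketch below is an optional addition (not one of the lab scripts) that compares loading transactions.csv all at once against streaming it row by row:

import csv
import tracemalloc

def peak_memory(func):
    ## Run func and return the peak memory (in bytes) allocated while it ran
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def load_all():
    with open('transactions.csv', 'r') as file:
        rows = list(csv.DictReader(file))  ## Entire file held in memory

def stream():
    with open('transactions.csv', 'r') as file:
        for row in csv.DictReader(file):  ## Only one row in memory at a time
            pass

print(f"Load everything: peak {peak_memory(load_all) / (1024 * 1024):.1f} MB")
print(f"Streaming:       peak {peak_memory(stream) / (1024 * 1024):.1f} MB")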

In the next step, we'll explore using third-party libraries like pandas for even more powerful CSV processing capabilities.

Using Pandas for Advanced CSV Processing

While Python's built-in csv module is powerful, the pandas library offers more advanced functionality for data analysis and manipulation. In this step, we'll explore how to use pandas for efficient CSV processing.

Installing Pandas

First, let's install the pandas library:

pip install pandas

Expected output:

Collecting pandas
  Downloading pandas-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/12.3 MB 42.6 MB/s eta 0:00:00
...
Successfully installed pandas-2.0.0 numpy-1.24.3 python-dateutil-2.8.2 pytz-2023.3 tzdata-2023.3

Reading CSV Files with Pandas

Pandas makes it easy to read, analyze, and manipulate CSV data. Create a file named pandas_basic.py:

import pandas as pd
import time

def process_with_pandas(filename):
    print(f"Processing {filename} with pandas...")
    start_time = time.time()

    ## Read the CSV file into a DataFrame
    df = pd.read_csv(filename)

    ## Display basic information
    print(f"\nDataFrame Info:")
    print(f"Shape: {df.shape} (rows, columns)")
    print(f"Column names: {', '.join(df.columns)}")

    ## Display the first 5 rows
    print("\nFirst 5 rows:")
    print(df.head())

    ## Basic statistics
    print("\nSummary statistics for numeric columns:")
    print(df.describe())

    ## Group by analysis
    print("\nTransaction counts by status:")
    status_counts = df['status'].value_counts()
    print(status_counts)

    ## Calculate average amount by status
    print("\nAverage transaction amount by status:")
    avg_by_status = df.groupby('status')['amount'].mean()
    print(avg_by_status)

    ## Calculate total amount by date (first 5 dates)
    print("\nTotal transaction amount by date (first 5 dates):")
    total_by_date = df.groupby('date')['amount'].sum().sort_values(ascending=False).head(5)
    print(total_by_date)

    end_time = time.time()
    print(f"\nProcessed in {end_time - start_time:.2f} seconds")

## Process the transactions file
process_with_pandas('transactions.csv')

Run this script:

python3 pandas_basic.py

Expected output (truncated for brevity):

Processing transactions.csv with pandas...

DataFrame Info:
Shape: (50000, 6) (rows, columns)
Column names: transaction_id, date, customer_id, product_id, amount, status

First 5 rows:
  transaction_id        date customer_id product_id   amount    status
0    TXN-00000001  2022-12-19   CUST-5421   PROD-383  385.75  refunded
1    TXN-00000002  2022-02-01   CUST-7078   PROD-442  286.83  completed
2    TXN-00000003  2022-12-24   CUST-2356   PROD-701  364.87    failed
3    TXN-00000004  2022-04-09   CUST-3458   PROD-854  247.73   pending
4    TXN-00000005  2022-03-07   CUST-6977   PROD-307  298.69  completed

Summary statistics for numeric columns:
              amount
count  50000.000000
mean     254.797067
std      141.389125
min       10.010000
25%      127.732500
50%      254.865000
75%      381.387500
max      499.990000

Transaction counts by status:
pending     12598
refunded    12556
completed   12432
failed      12414
Name: status, dtype: int64

Average transaction amount by status:
status
completed    255.028733
failed       254.709444
pending      254.690785
refunded     254.760390
Name: amount, dtype: float64

Total transaction amount by date (first 5 dates):
date
2022-01-20    38883.19
2022-08-30    38542.49
2022-03-10    38331.67
2022-11-29    38103.61
2022-06-24    37954.87
Name: amount, dtype: float64

Processed in 0.11 seconds

Processing Large CSV Files with Pandas

While pandas is powerful, it can consume a lot of memory when loading large CSV files. For very large files, you can use the chunksize parameter to read the file in chunks. Create a file named pandas_chunked.py:

import pandas as pd
import time

def process_large_csv_with_pandas(filename, chunk_size=10000):
    print(f"Processing {filename} with pandas using chunks of {chunk_size} rows...")
    start_time = time.time()

    ## Initialize variables to store aggregated results
    total_rows = 0
    total_amount = 0
    status_counts = {'completed': 0, 'pending': 0, 'failed': 0, 'refunded': 0}
    daily_totals = {}

    ## Process the file in chunks
    for chunk_num, chunk in enumerate(pd.read_csv(filename, chunksize=chunk_size)):
        ## Update row count
        chunk_rows = len(chunk)
        total_rows += chunk_rows

        ## Update total amount
        chunk_amount = chunk['amount'].sum()
        total_amount += chunk_amount

        ## Update status counts
        for status, count in chunk['status'].value_counts().items():
            status_counts[status] += count

        ## Update daily totals
        for date, group in chunk.groupby('date'):
            if date in daily_totals:
                daily_totals[date] += group['amount'].sum()
            else:
                daily_totals[date] = group['amount'].sum()

        print(f"Processed chunk {chunk_num + 1} ({total_rows:,} rows so far)")

    ## Calculate results
    end_time = time.time()
    processing_time = end_time - start_time

    print(f"\nResults after processing {total_rows:,} rows in {processing_time:.2f} seconds:")
    print(f"Total transaction amount: ${total_amount:,.2f}")
    print(f"Average transaction amount: ${total_amount/total_rows:.2f}")

    print("\nTransaction status breakdown:")
    for status, count in status_counts.items():
        percentage = (count / total_rows) * 100
        print(f"  {status}: {count:,} ({percentage:.1f}%)")

    ## Show top 5 days by transaction amount
    print("\nTop 5 days by transaction amount:")
    top_days = sorted(daily_totals.items(), key=lambda x: x[1], reverse=True)[:5]
    for date, amount in top_days:
        print(f"  {date}: ${amount:,.2f}")

## Process the transactions file with chunks
process_large_csv_with_pandas('transactions.csv', chunk_size=10000)

Run this script:

python3 pandas_chunked.py

Expected output:

Processing transactions.csv with pandas using chunks of 10000 rows...
Processed chunk 1 (10,000 rows so far)
Processed chunk 2 (20,000 rows so far)
Processed chunk 3 (30,000 rows so far)
Processed chunk 4 (40,000 rows so far)
Processed chunk 5 (50,000 rows so far)

Results after processing 50,000 rows in 0.34 seconds:
Total transaction amount: $12,739,853.35
Average transaction amount: $254.80

Transaction status breakdown:
  completed: 12,432 (24.9%)
  pending: 12,598 (25.2%)
  failed: 12,414 (24.8%)
  refunded: 12,556 (25.1%)

Top 5 days by transaction amount:
  2022-01-20: $38,883.19
  2022-08-30: $38,542.49
  2022-03-10: $38,331.67
  2022-11-29: $38,103.61
  2022-06-24: $37,954.87

Filtering and Transforming Data with Pandas

One of the great benefits of pandas is the ability to easily filter and transform data. Create a file named pandas_filter.py:

import pandas as pd
import time

def filter_and_transform(filename):
    print(f"Filtering and transforming data from {filename}...")
    start_time = time.time()

    ## Read the CSV file
    df = pd.read_csv(filename)

    ## 1. Filter: Get only completed transactions
    completed = df[df['status'] == 'completed']
    print(f"Number of completed transactions: {len(completed)}")

    ## 2. Filter: Get high-value transactions (over $400)
    high_value = df[df['amount'] > 400]
    print(f"Number of high-value transactions (>${400}): {len(high_value)}")

    ## 3. Filter: Transactions from first quarter of 2022
    df['date'] = pd.to_datetime(df['date'])  ## Convert to datetime
    q1_2022 = df[(df['date'] >= '2022-01-01') & (df['date'] <= '2022-03-31')]
    print(f"Number of transactions in Q1 2022: {len(q1_2022)}")

    ## 4. Add a new column: transaction_month
    df['month'] = df['date'].dt.strftime('%Y-%m')

    ## 5. Group by month and status
    monthly_by_status = df.groupby(['month', 'status']).agg({
        'transaction_id': 'count',
        'amount': 'sum'
    }).rename(columns={'transaction_id': 'count'})

    ## Calculate success rate by month (completed / total)
    print("\nTransaction success rates by month:")
    for month, month_data in df.groupby('month'):
        total = len(month_data)
        completed = len(month_data[month_data['status'] == 'completed'])
        success_rate = (completed / total) * 100
        print(f"  {month}: {success_rate:.1f}% ({completed}/{total})")

    ## Save filtered data to a new CSV file
    completed_high_value = df[(df['status'] == 'completed') & (df['amount'] > 300)]
    output_file = 'high_value_completed.csv'
    completed_high_value.to_csv(output_file, index=False)

    end_time = time.time()
    print(f"\nFiltering completed in {end_time - start_time:.2f} seconds")
    print(f"Saved {len(completed_high_value)} high-value completed transactions to {output_file}")

## Filter and transform the data
filter_and_transform('transactions.csv')

Run this script:

python3 pandas_filter.py

Expected output:

Filtering and transforming data from transactions.csv...
Number of completed transactions: 12432
Number of high-value transactions (>$400): 6190
Number of transactions in Q1 2022: 12348

Transaction success rates by month:
  2022-01: 24.8% (1048/4225)
  2022-02: 25.0% (1010/4034)
  2022-03: 25.4% (1042/4089)
  2022-04: 24.2% (978/4052)
  2022-05: 24.4% (1047/4297)
  2022-06: 24.4% (1046/4280)
  2022-07: 24.7% (1071/4341)
  2022-08: 25.1% (1090/4343)
  2022-09: 26.1% (1091/4177)
  2022-10: 24.1% (1008/4182)
  2022-11: 24.8% (1009/4075)
  2022-12: 25.2% (992/3905)

Filtering completed in 0.38 seconds
Saved 6304 high-value completed transactions to high_value_completed.csv

Advantages of Using Pandas

As you can see from these examples, pandas offers several advantages for CSV processing:

  1. Rich functionality: Built-in methods for filtering, grouping, aggregating, and transforming data
  2. Performance: Optimized C code under the hood for fast operations on large datasets
  3. Easy data analysis: Simple ways to calculate statistics and gain insights
  4. Visualization capabilities: Easy integration with plotting libraries (not shown in examples)
  5. Chunked processing: Ability to handle files larger than available memory

For most data analysis tasks involving CSV files, pandas is the recommended approach unless you have specific memory constraints that require using the pure Python csv module.

Summary

In this tutorial, you learned several approaches to efficiently process CSV files in Python, from small datasets to large ones that require careful memory management:

  1. Basic CSV Processing: Using Python's built-in csv module to read and write CSV files with csv.reader and csv.writer.

  2. Dictionary-Based Processing: Using csv.DictReader to work with CSV data in a more intuitive way, accessing fields by name instead of index.

  3. Efficient Processing Techniques:

    • Streaming: Processing files line by line for minimal memory usage
    • Chunking: Processing files in batches for better memory management
  4. Advanced Processing with Pandas:

    • Reading CSV files into DataFrames
    • Analyzing and filtering data
    • Processing large files in chunks
    • Transforming and exporting data

These techniques provide a comprehensive toolkit for handling CSV files of any size in Python. For most data analysis tasks, pandas is the recommended library due to its rich functionality and performance. However, for very large files or simple processing tasks, the streaming and chunking approaches with the built-in csv module can be more memory-efficient.

By applying the appropriate technique based on your specific requirements, you can efficiently process CSV files of any size without running into memory issues.