How to Efficiently Group a Python List by a Given Function

Introduction

Organizing and manipulating data collections is a fundamental task in Python programming. One common operation is grouping list elements based on certain criteria. This process transforms your data into organized categories, making it easier to analyze and work with.

In this tutorial, you will learn how to efficiently group elements in a Python list using various techniques. We will start with basic approaches and gradually introduce more powerful built-in functions for this purpose. By the end of this lab, you will have a practical understanding of different ways to group list data in Python.


Basic List Grouping with Dictionaries

Let's begin by understanding what list grouping means and how to implement a basic grouping technique using Python dictionaries.

What is List Grouping?

List grouping is the process of organizing list elements into categories based on a specific characteristic or function. For example, you might want to group a list of numbers by whether they are even or odd, or group a list of words by their first letter.

Using Dictionaries for Basic Grouping

The most straightforward way to group list elements in Python is to use a dictionary:

  • The keys represent the groups
  • The values are lists containing the elements belonging to each group

Let's create a simple example where we group numbers based on whether they are even or odd.

Step 1: Create a Python File

First, let's create a new Python file to write our code:

  1. Open the WebIDE and create a new file named group_numbers.py in the /home/labex/project directory.

  2. Add the following code to the file:

## Basic list grouping using dictionaries
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

## Initialize empty dictionary to store our groups
even_odd_groups = {"even": [], "odd": []}

## Group numbers based on whether they are even or odd
for num in numbers:
    if num % 2 == 0:
        even_odd_groups["even"].append(num)
    else:
        even_odd_groups["odd"].append(num)

## Print the resulting groups
print("Grouping numbers by even/odd:")
print(f"Even numbers: {even_odd_groups['even']}")
print(f"Odd numbers: {even_odd_groups['odd']}")
  3. Save the file.

Step 2: Run the Python Script

Run the script to see the results:

  1. Open a terminal in the WebIDE.

  2. Execute the script:

python3 /home/labex/project/group_numbers.py

You should see output similar to:

Grouping numbers by even/odd:
Even numbers: [2, 4, 6, 8, 10]
Odd numbers: [1, 3, 5, 7, 9]

Step 3: Group by a More Complex Criterion

Now, let's modify our script to group numbers based on their remainder when divided by 3:

  1. Add the following code to your group_numbers.py file:
## Group numbers by remainder when divided by 3
remainder_groups = {0: [], 1: [], 2: []}

for num in numbers:
    remainder = num % 3
    remainder_groups[remainder].append(num)

print("\nGrouping numbers by remainder when divided by 3:")
for remainder, nums in remainder_groups.items():
    print(f"Numbers with remainder {remainder}: {nums}")
  2. Save the file.

  3. Run the script again:

python3 /home/labex/project/group_numbers.py

Now you should see additional output:

Grouping numbers by remainder when divided by 3:
Numbers with remainder 0: [3, 6, 9]
Numbers with remainder 1: [1, 4, 7, 10]
Numbers with remainder 2: [2, 5, 8]

This basic technique using dictionaries provides a straightforward way to group list elements. However, as your grouping needs become more complex, Python offers more powerful and efficient methods, which we'll explore in the next steps.
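The two loops above differ only in how the group key is computed. As a minimal sketch, that pattern can be wrapped in a reusable helper that takes the key function as a parameter. The name `group_by` and its signature are illustrative assumptions, not part of the standard library:

```python
def group_by(items, key_func):
    """Group items into a dict of lists, keyed by key_func(item)."""
    groups = {}
    for item in items:
        # setdefault returns the existing list for this key, or stores
        # and returns a new empty list the first time the key is seen
        groups.setdefault(key_func(item), []).append(item)
    return groups


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

by_parity = group_by(numbers, lambda n: "even" if n % 2 == 0 else "odd")
by_remainder = group_by(numbers, lambda n: n % 3)

print(by_parity)
print(by_remainder)
```

Unlike the pre-initialized dictionaries above, this helper creates each group only when its key is first encountered, so the key order in the result follows the order of first appearance in the input.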

Using itertools.groupby() for Efficient Grouping

Now that you understand the basic concept of grouping, let's explore a more powerful approach using the itertools.groupby() function from the standard library. This function is particularly useful when working with sorted data.

Understanding itertools.groupby()

The groupby() function from the itertools module groups consecutive elements in an iterable based on a key function. It returns an iterator that produces pairs of:

  • The value returned by the key function
  • An iterator producing the items in the group

Important note: groupby() only groups consecutive items, so to collect every item that shares a key into a single group, the input must first be sorted by that same key.
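To make the "consecutive" behavior concrete, here is a small sketch on a hand-picked string (the sample data is an assumption chosen for illustration). Each run of equal adjacent characters becomes its own group, even when the same character shows up again later:

```python
import itertools

text = "aaabbbaa"

# Each (key, group) pair covers one run of equal adjacent characters
runs = [(char, len(list(group))) for char, group in itertools.groupby(text)]

print(runs)
```

With no key function supplied, groupby() compares the elements themselves. The letter "a" appears in two separate runs here because the two runs are not adjacent.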

Let's implement an example to see how this works in practice.

Step 1: Create a New Python File

  1. Create a new file named groupby_example.py in the /home/labex/project directory.

  2. Add the following code to import the necessary module:

import itertools

## Sample data
words = ["apple", "banana", "avocado", "blueberry", "apricot", "blackberry"]

Step 2: Group Words by First Letter

Now, let's use itertools.groupby() to group the words by their first letter:

  1. Add the following code to your groupby_example.py file:
## First, we need to sort the list by the key we'll use for grouping
## In this case, the first letter of each word
words.sort(key=lambda x: x[0])
print("Sorted words:", words)

## Now group by first letter
grouped_words = {}
for first_letter, group in itertools.groupby(words, key=lambda x: x[0]):
    grouped_words[first_letter] = list(group)

## Print the resulting groups
print("\nGrouping words by first letter:")
for letter, words_group in grouped_words.items():
    print(f"Words starting with '{letter}': {words_group}")
  2. Save the file.

  3. Run the script:

python3 /home/labex/project/groupby_example.py

You should see output similar to:

Sorted words: ['apple', 'avocado', 'apricot', 'banana', 'blueberry', 'blackberry']

Grouping words by first letter:
Words starting with 'a': ['apple', 'avocado', 'apricot']
Words starting with 'b': ['banana', 'blueberry', 'blackberry']

Step 3: Understanding the Importance of Sorting

To demonstrate why sorting is crucial when using groupby(), let's add another example without sorting:

  1. Add the following code to your groupby_example.py file:
## Sample data (unsorted)
unsorted_words = ["apple", "banana", "avocado", "blueberry", "apricot", "blackberry"]

print("\n--- Without sorting first ---")
print("Original words:", unsorted_words)

## Try to group without sorting
## Collect (key, items) pairs in a list: a dictionary would overwrite repeated keys
unsorted_grouped = []
for first_letter, group in itertools.groupby(unsorted_words, key=lambda x: x[0]):
    unsorted_grouped.append((first_letter, list(group)))

print("\nGrouping without sorting:")
for letter, words_group in unsorted_grouped:
    print(f"Words starting with '{letter}': {words_group}")
  2. Save the file.

  3. Run the script again:

python3 /home/labex/project/groupby_example.py

In the output, you'll notice that the grouping without sorting produces different results:

--- Without sorting first ---
Original words: ['apple', 'banana', 'avocado', 'blueberry', 'apricot', 'blackberry']

Grouping without sorting:
Words starting with 'a': ['apple']
Words starting with 'b': ['banana']
Words starting with 'a': ['avocado']
Words starting with 'b': ['blueberry']
Words starting with 'a': ['apricot']
Words starting with 'b': ['blackberry']

Notice how we have multiple groups with the same key. This happens because groupby() only groups consecutive items. When the data isn't sorted, items with the same key but appearing in different positions in the list will be placed in separate groups.

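The itertools.groupby() function is part of the standard library and processes its input lazily in a single pass, which makes it a good fit for data that is already ordered by the grouping key. If the data has to be sorted first, the sort adds O(n log n) work, and the sort and the grouping must use the same key function.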
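Since the sort and the grouping rely on the same key, it can help to bundle both steps in one helper. The following is a sketch under that assumption; `group_sorted` is a hypothetical name, not an itertools function:

```python
import itertools


def group_sorted(items, key_func):
    """Sort items by key_func, then group them with the same key."""
    ordered = sorted(items, key=key_func)  # leaves the input list unchanged
    return {
        key: list(group)
        for key, group in itertools.groupby(ordered, key=key_func)
    }


words = ["apple", "banana", "avocado", "blueberry", "apricot", "blackberry"]
by_letter = group_sorted(words, lambda w: w[0])

print(by_letter)
```

Python's sort is stable, so words that share a first letter keep their original relative order inside each group.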

Grouping with collections.defaultdict

Another powerful tool for grouping in Python is the defaultdict class from the collections module. This approach offers a cleaner, more efficient way to group data compared to using regular dictionaries.

Understanding defaultdict

A defaultdict is a dictionary subclass that takes a factory function, such as list, and calls it to create the value the first time a missing key is accessed. This eliminates the need to check if a key exists before adding an item to a dictionary. For grouping purposes, this means we can avoid writing conditional code to initialize empty lists for new groups.
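A short sketch of this behavior, using made-up keys, shows both the convenience and one side effect worth knowing about:

```python
from collections import defaultdict

groups = defaultdict(list)

# Accessing a missing key calls list() and stores the result under that key
groups["fruit"].append("apple")
groups["fruit"].append("pear")

print(dict(groups))
print("vegetable" in groups)

# Merely reading a missing key with [] also inserts it
_ = groups["vegetable"]
print("vegetable" in groups)
```

Because a lookup with square brackets creates the key, use the `in` operator or `.get()` when you only want to check whether a group exists.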

Let's see how defaultdict simplifies the grouping process.

Step 1: Create a New Python File

  1. Create a new file named defaultdict_grouping.py in the /home/labex/project directory.

  2. Add the following code to import the necessary module and create some sample data:

from collections import defaultdict

## Sample data - a list of people with their ages
people = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Boston"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "David", "age": 25, "city": "Denver"},
    {"name": "Eve", "age": 30, "city": "Boston"},
    {"name": "Frank", "age": 35, "city": "Chicago"},
    {"name": "Grace", "age": 25, "city": "New York"}
]

Step 2: Group People by Age

Now, let's use defaultdict to group people by their age:

  1. Add the following code to your defaultdict_grouping.py file:
## Group people by age using defaultdict
age_groups = defaultdict(list)

for person in people:
    age_groups[person["age"]].append(person["name"])

## Print the resulting groups
print("Grouping people by age:")
for age, names in age_groups.items():
    print(f"Age {age}: {names}")
  2. Save the file.

  3. Run the script:

python3 /home/labex/project/defaultdict_grouping.py

You should see output similar to:

Grouping people by age:
Age 25: ['Alice', 'David', 'Grace']
Age 30: ['Bob', 'Eve']
Age 35: ['Charlie', 'Frank']

Step 3: Compare with Regular Dictionary Approach

To understand the advantage of using defaultdict, let's compare it with the regular dictionary approach:

  1. Add the following code to your defaultdict_grouping.py file:
print("\n--- Comparison with regular dictionary ---")

## Using a regular dictionary (the conventional way)
regular_dict_groups = {}

for person in people:
    age = person["age"]
    name = person["name"]

    ## Need to check if the key exists
    if age not in regular_dict_groups:
        regular_dict_groups[age] = []

    regular_dict_groups[age].append(name)

print("\nRegular dictionary approach:")
for age, names in regular_dict_groups.items():
    print(f"Age {age}: {names}")
  2. Save the file.

  3. Run the script again:

python3 /home/labex/project/defaultdict_grouping.py

You'll notice that both approaches produce the same result, but the defaultdict approach is cleaner and requires less code.

Step 4: Group by Multiple Criteria

Now, let's extend our example to group people by both city and age:

  1. Add the following code to your defaultdict_grouping.py file:
## Grouping by city and then by age
city_age_groups = defaultdict(lambda: defaultdict(list))

for person in people:
    city = person["city"]
    age = person["age"]
    name = person["name"]

    city_age_groups[city][age].append(name)

print("\nGrouping people by city and then by age:")
## Use a new name here so the earlier age_groups dictionary is not overwritten
for city, ages_in_city in city_age_groups.items():
    print(f"\nCity: {city}")
    for age, names in ages_in_city.items():
        print(f"  Age {age}: {names}")
  2. Save the file.

  3. Run the script again:

python3 /home/labex/project/defaultdict_grouping.py

You should see additional output similar to:

Grouping people by city and then by age:

City: New York
  Age 25: ['Alice', 'Grace']

City: Boston
  Age 30: ['Bob', 'Eve']

City: Chicago
  Age 35: ['Charlie', 'Frank']

City: Denver
  Age 25: ['David']

This nested defaultdict approach allows for more complex grouping hierarchies with minimal code. The defaultdict is particularly useful when you don't know all the group keys in advance, as it creates new groups automatically when needed.
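When the hierarchy itself is not needed, a flatter alternative is to group on a tuple key. This sketch uses a shortened version of the people list (trimmed for illustration) and groups on (city, age) pairs:

```python
from collections import defaultdict

# Shortened sample data, for illustration only
people = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Boston"},
    {"name": "Eve", "age": 30, "city": "Boston"},
    {"name": "Grace", "age": 25, "city": "New York"},
]

flat_groups = defaultdict(list)
for person in people:
    # A tuple is hashable, so it can serve as a composite dictionary key
    flat_groups[(person["city"], person["age"])].append(person["name"])

for (city, age), names in flat_groups.items():
    print(f"{city}, age {age}: {names}")
```

The nested form is easier to walk one city at a time, while the tuple-keyed form is easier to look up when both values are already known.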

Practical Application: Analyzing Data with Grouping Techniques

Now that you understand several methods for grouping data, let's apply these techniques to solve a real-world problem: analyzing a dataset of student records. We'll use different grouping methods to extract useful information from the data.

Setting Up the Example Dataset

First, let's create our student records dataset:

  1. Create a new file named student_analysis.py in the /home/labex/project directory.

  2. Add the following code to set up the example data:

import itertools
from collections import defaultdict

## Sample student data
students = [
    {"id": 1, "name": "Emma", "grade": "A", "subject": "Math", "score": 95},
    {"id": 2, "name": "Noah", "grade": "B", "subject": "Math", "score": 82},
    {"id": 3, "name": "Olivia", "grade": "A", "subject": "Science", "score": 90},
    {"id": 4, "name": "Liam", "grade": "C", "subject": "Math", "score": 75},
    {"id": 5, "name": "Ava", "grade": "B", "subject": "Science", "score": 88},
    {"id": 6, "name": "William", "grade": "A", "subject": "History", "score": 96},
    {"id": 7, "name": "Sophia", "grade": "B", "subject": "History", "score": 85},
    {"id": 8, "name": "James", "grade": "C", "subject": "Science", "score": 72},
    {"id": 9, "name": "Isabella", "grade": "A", "subject": "Math", "score": 91},
    {"id": 10, "name": "Benjamin", "grade": "B", "subject": "History", "score": 84}
]

print("Student Records:")
for student in students:
    print(f"ID: {student['id']}, Name: {student['name']}, Subject: {student['subject']}, Grade: {student['grade']}, Score: {student['score']}")
  3. Save the file.

Using defaultdict to Group Students by Subject

Let's analyze which students are taking each subject:

  1. Add the following code to your student_analysis.py file:
print("\n--- Students Grouped by Subject ---")

## Group students by subject using defaultdict
subject_groups = defaultdict(list)

for student in students:
    subject_groups[student["subject"]].append(student["name"])

## Print students by subject
for subject, names in subject_groups.items():
    print(f"{subject}: {names}")
  2. Save the file.

Calculating Average Scores by Subject

Let's calculate the average score for each subject:

  1. Add the following code to your student_analysis.py file:
print("\n--- Average Scores by Subject ---")

## Calculate average scores for each subject
subject_scores = defaultdict(list)

for student in students:
    subject_scores[student["subject"]].append(student["score"])

## Calculate and print averages
for subject, scores in subject_scores.items():
    average = sum(scores) / len(scores)
    print(f"{subject} Average: {average:.2f}")
  2. Save the file.

Using itertools.groupby() to Analyze Grades

Now let's use itertools.groupby() to analyze the distribution of grades:

  1. Add the following code to your student_analysis.py file:
print("\n--- Grade Distribution (using itertools.groupby) ---")

## Sort students by grade first
sorted_students = sorted(students, key=lambda x: x["grade"])

## Group and count students by grade
grade_counts = {}
for grade, group in itertools.groupby(sorted_students, key=lambda x: x["grade"]):
    grade_counts[grade] = len(list(group))

## Print grade distribution
for grade, count in grade_counts.items():
    print(f"Grade {grade}: {count} students")
  2. Save the file.
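When only the counts matter, collections.Counter from the standard library is a possible alternative that needs no sorting. The sketch below uses a shortened stand-in for the student list, so its numbers differ from the full dataset:

```python
from collections import Counter

# Shortened stand-in for the students list, for illustration only
students = [
    {"name": "Emma", "grade": "A"},
    {"name": "Noah", "grade": "B"},
    {"name": "Liam", "grade": "C"},
    {"name": "Olivia", "grade": "A"},
]

# Counter tallies each grade in a single pass over the data
grade_counts = Counter(student["grade"] for student in students)

for grade, count in sorted(grade_counts.items()):
    print(f"Grade {grade}: {count} students")
```

The groupby() version remains the better fit when you need the grouped records themselves rather than just how many there are.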

Combining Techniques: Advanced Analysis

Finally, let's perform a more complex analysis by combining our grouping techniques:

  1. Add the following code to your student_analysis.py file:
print("\n--- Advanced Analysis: Grade Distribution by Subject ---")

## Group by subject and grade
subject_grade_counts = defaultdict(lambda: defaultdict(int))

for student in students:
    subject = student["subject"]
    grade = student["grade"]
    subject_grade_counts[subject][grade] += 1

## Print detailed grade distribution by subject
for subject, grades in subject_grade_counts.items():
    print(f"\n{subject}:")
    for grade, count in grades.items():
        print(f"  Grade {grade}: {count} students")
  2. Save the file.

  3. Run the complete script:

python3 /home/labex/project/student_analysis.py

You should see a comprehensive analysis of the student data, including:

  • Student records
  • Students grouped by subject
  • Average scores by subject
  • Overall grade distribution
  • Grade distribution by subject

This example demonstrates how different grouping techniques can be combined to perform complex data analysis with relatively simple code. Each approach has its strengths:

  • defaultdict is excellent for simple grouping without having to check for key existence
  • itertools.groupby() is efficient for working with sorted data
  • Combining techniques allows for multi-level grouping and complex analysis

Selecting the right grouping technique depends on your specific needs and the structure of your data.

Summary

In this tutorial, you learned several efficient methods for grouping lists in Python:

  1. Basic Dictionary Grouping: You started with a fundamental approach using regular dictionaries to create groups based on specific criteria.

  2. itertools.groupby(): You explored this standard-library function, which groups consecutive elements and therefore works on data sorted by the grouping key, and you saw what happens when the input is not sorted.

  3. collections.defaultdict: You used this convenient dictionary subclass that automatically handles missing keys, making your grouping code cleaner and more concise.

  4. Practical Data Analysis: You applied these techniques to analyze a dataset, seeing how they can be used individually and in combination to extract meaningful insights.

Each of these methods has its strengths and ideal use cases:

  • Use basic dictionaries for simple grouping when clarity is more important than conciseness
  • Use itertools.groupby() when your data is sorted or can be sorted by the grouping key
  • Use defaultdict when you want clean, concise code and don't know all group keys in advance
  • Combine techniques for complex, multi-level grouping and analysis

By mastering these grouping techniques, you've added powerful tools to your Python programming toolkit that will help you organize, analyze, and manipulate data more efficiently.