Introduction
Grouping and aggregating data are fundamental operations in data analysis. They allow you to split a large dataset into smaller groups based on certain criteria and then compute summary statistics for each group. This process is essential for uncovering patterns, comparing segments, and deriving meaningful insights from raw data.
In the Python data analysis library, Pandas, this "split-apply-combine" strategy is primarily handled by the powerful groupby() method. In this lab, you will learn how to use groupby() to perform grouping and aggregation. You will start by grouping data by a single column, apply aggregation functions, use multiple functions at once, group by multiple columns, and finally, format the output into a standard DataFrame.
Group by single column using groupby
In this step, you will learn the basic usage of the groupby() method to split a DataFrame into groups. The groupby() method itself doesn't compute anything but returns a DataFrameGroupBy object. This object holds all the information needed to then apply calculations to each group.
First, let's group our sample DataFrame by the Category column. This will create distinct groups for 'Electronics', 'Clothing', and 'Books'.
Open the main.py file in the editor on the left. The initial code to create a DataFrame is already there. Add the following code to the end of the main.py file:
## Group by the 'Category' column
grouped_by_category = df.groupby('Category')
## The result is a DataFrameGroupBy object
print("Type of the grouped object:")
print(type(grouped_by_category))
## To see the contents, you can iterate over the groups
print("\nIterating over groups to see their content:")
for name, group in grouped_by_category:
print(f"\nGroup: {name}")
print(group)
Now, run the script from the terminal to see the output.
python3 main.py
You will see the original DataFrame, followed by the type of the groupby object, and then the content of each group. This demonstrates that the DataFrame has been successfully split based on the unique values in the 'Category' column.
Original DataFrame:
Category Region Sales Units
0 Electronics North 1200 10
1 Clothing South 800 25
2 Electronics North 1500 8
3 Books West 300 15
4 Clothing East 900 20
5 Books West 450 18
==============================
Type of the grouped object:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Iterating over groups to see their content:
Group: Books
Category Region Sales Units
3 Books West 300 15
5 Books West 450 18
Group: Clothing
Category Region Sales Units
1 Clothing South 800 25
4 Clothing East 900 20
Group: Electronics
Category Region Sales Units
0 Electronics North 1200 10
2 Electronics North 1500 8
Apply sum aggregation on groups
In this step, you will learn how to apply an aggregation function to the groups you've created. After grouping the data, the most common next step is to perform a calculation, such as sum(), mean(), count(), or max(), on each group.
Let's calculate the total sales for each product category. To do this, you first group by Category and then select the Sales column to apply the sum() function.
Add the following code to the end of your main.py file. You can remove the for loop from the previous step to keep the output clean.
## Group by 'Category' and calculate the sum of 'Sales' for each group
category_sales_sum = df.groupby('Category')['Sales'].sum()
print("Total sales per category:")
print(category_sales_sum)
Save the file and run it again.
python3 main.py
The output will now show a Pandas Series where the index is the category name and the values are the total sales for that category.
... (previous output) ...
Total sales per category:
Category
Books 750
Clothing 1700
Electronics 2700
Name: Sales, dtype: int64
This is a powerful and concise way to summarize your data. You can apply the same logic to other numeric columns or use other aggregation functions like mean() to find the average.
Aggregate multiple functions with agg
In this step, you'll learn how to apply multiple aggregation functions to your groups simultaneously using the agg() method. This is very useful when you want to compute several summary statistics at once, such as both the total and the average sales.
The agg() method can be passed a list of strings, where each string is the name of an aggregation function. Let's calculate both the sum and mean of Sales for each Category.
Add the following code to the end of your main.py file:
## Group by 'Category' and apply multiple aggregations on 'Sales'
category_agg = df.groupby('Category')['Sales'].agg(['sum', 'mean'])
print("\nSum and mean of sales per category:")
print(category_agg)
Save the file and execute it.
python3 main.py
The output is now a DataFrame. The index is still the Category, but the columns are hierarchical, showing both sum and mean for the Sales data.
... (previous output) ...
Sum and mean of sales per category:
sum mean
Category
Books 750 375.0
Clothing 1700 850.0
Electronics 2700 1350.0
The agg() method provides a flexible way to generate comprehensive summaries of your grouped data.
Group by multiple columns
In this step, you will learn how to group a DataFrame by more than one column. This is useful for creating more granular groups and analyzing interactions between different categories. To do this, you simply pass a list of column names to the groupby() method.
Let's find the total sales for each combination of Region and Category. This will show us how sales of different product categories are distributed across regions.
Add the following code to the end of your main.py file:
## Group by multiple columns: 'Region' and 'Category'
multi_group_sum = df.groupby(['Region', 'Category'])['Sales'].sum()
print("\nTotal sales per Region and Category:")
print(multi_group_sum)
Save the file and run the script.
python3 main.py
The output will have a MultiIndex on the rows, with the first level being Region and the second level being Category. This provides a detailed breakdown of sales.
... (previous output) ...
Total sales per Region and Category:
Region Category
East Clothing 900
North Electronics 2700
South Clothing 800
West Books 750
Name: Sales, dtype: int64
As you can see, grouping by multiple columns allows for a deeper and more hierarchical analysis of your dataset.
Reset index on grouped DataFrame
In this step, you will learn how to convert the grouped output back into a regular DataFrame, where the grouping keys are columns instead of the index. By default, groupby() makes the grouping keys the index of the resulting Series or DataFrame. Sometimes, you want a "flat" DataFrame for further processing or visualization.
The easiest way to achieve this is by using the as_index=False parameter within the groupby() method.
Let's repeat the single-column grouping from Step 2, but this time we'll keep Category as a regular column.
Add the following code to the end of your main.py file:
## Group by 'Category' and aggregate, but keep 'Category' as a column
category_sales_flat = df.groupby('Category', as_index=False)['Sales'].sum()
print("\nGrouped data with 'Category' as a column:")
print(category_sales_flat)
Save the file and run it one last time.
python3 main.py
Observe the output. Instead of Category being the index, it is now the first column of a new DataFrame, and the DataFrame has a standard integer index (0, 1, 2).
... (previous output) ...
Grouped data with 'Category' as a column:
Category Sales
0 Books 750
1 Clothing 1700
2 Electronics 2700
This format is often more convenient for subsequent data manipulation tasks. An alternative method is to call .reset_index() on the grouped result, which achieves the same outcome.
Summary
Congratulations on completing this lab on Pandas grouping and aggregation! You have learned one of the most powerful and commonly used features of the Pandas library for data analysis.
In this lab, you practiced:
- Splitting a DataFrame into groups using
df.groupby('column_name'). - Applying a single aggregation function like
.sum()to the groups. - Using the
.agg()method to apply multiple aggregation functions at once. - Grouping by multiple columns to create a hierarchical summary using
df.groupby(['col1', 'col2']). - Creating a flat DataFrame from a
groupbyoperation by using theas_index=Falseparameter.
Mastering these techniques is a crucial step toward becoming proficient in data manipulation and analysis with Python and Pandas.



