Pandas Filtering Data


Introduction

Welcome to the Pandas Filtering Data lab! Data filtering is one of the most common and essential tasks in data analysis. It allows you to select a subset of your data that meets specific criteria, enabling you to focus on the information that is most relevant to your analysis.

Pandas is a powerful Python library for data manipulation and analysis. Its core data structure, the DataFrame, is a two-dimensional table of data with rows and columns. In this lab, you will learn several fundamental techniques to filter rows from a Pandas DataFrame. We will cover:

  • Filtering with a single boolean condition.
  • Combining multiple conditions using logical operators.
  • Using the isin() method for matching multiple values.
  • Removing rows with missing data (NaN).
  • Counting the results of your filtered data.

By the end of this lab, you will have a solid foundation for selecting and isolating data in your Pandas projects.

Filter rows with boolean condition

In this step, you will learn the most basic filtering technique: boolean indexing. This method involves creating a boolean condition that returns a Pandas Series of True and False values. When you pass this Series to a DataFrame, it returns only the rows where the value is True.
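To make this concrete, here is a quick standalone sketch (using a small made-up Series, not the lab's DataFrame) showing what the boolean mask itself looks like before it is used for indexing:

```python
import pandas as pd

# A small illustrative Series of ages
ages = pd.Series([25, 45, 38], index=['Alice', 'Bob', 'Charlie'], name='Age')

# The comparison produces a boolean Series, one True/False per row
mask = ages > 30
print(mask)

# Indexing with the mask keeps only the rows where it is True
print(ages[mask].tolist())  # [45, 38]
```

Alice's entry is False, so it is dropped; Bob's and Charlie's are True, so they are kept.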

Let's start by filtering our DataFrame to find all employees who are older than 30.

First, open the main.py file in the editor on the left. The file has been pre-populated with a sample DataFrame.

Now, replace the existing print statements at the bottom of main.py with the following code. This code creates a boolean condition df['Age'] > 30 and uses it to filter the DataFrame.

# Filter for employees older than 30
older_than_30 = df[df['Age'] > 30]

print("Employees older than 30:")
print(older_than_30)

Your complete main.py file should now look like this:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}

df = pd.DataFrame(data)

# Filter for employees older than 30
older_than_30 = df[df['Age'] > 30]

print("Employees older than 30:")
print(older_than_30)

Now, let's run the script. Open the terminal and execute the following command:

python3 main.py

You should see the following output, which lists only the employees whose age is greater than 30.

Employees older than 30:
      Name Department  Age   Salary
1      Bob         HR   45  80000.0
2  Charlie      Sales   38  75000.0
3    David         IT   52  95000.0
5    Frank    Finance   33      NaN

Combine filters using & operator

In this step, we will combine multiple conditions to perform more complex filtering. You can combine boolean conditions using logical operators. The most common ones are & for AND and | for OR.

A crucial syntax rule is that each condition must be enclosed in its own parentheses (), because the & and | operators bind more tightly than comparison operators such as > and ==.

Let's filter the DataFrame to find employees who are in the 'IT' department and are older than 30.

Modify your main.py file. Replace the filtering code from the previous step with the new code below.

# Filter for employees in IT and older than 30
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]

print("IT employees older than 30:")
print(it_and_older)

Notice how each condition, df['Department'] == 'IT' and df['Age'] > 30, is wrapped in its own set of parentheses.
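The | (OR) operator follows the same pattern. As an aside, here is a small self-contained sketch (using a trimmed-down copy of the sample data) that keeps rows matching either condition:

```python
import pandas as pd

# A trimmed copy of the lab's sample data, for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'HR', 'Sales', 'IT'],
    'Age': [25, 45, 38, 52],
})

# Employees in Sales OR younger than 30 -- each condition still in parentheses
sales_or_young = df[(df['Department'] == 'Sales') | (df['Age'] < 30)]
print(sales_or_young['Name'].tolist())  # ['Alice', 'Charlie']
```

Alice matches the age condition, Charlie matches the department condition, and a row only needs one match to be kept.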

Save the main.py file and run it from the terminal:

python3 main.py

The output will show only the employees who satisfy both conditions. In our dataset, only David meets these criteria.

IT employees older than 30:
    Name Department  Age   Salary
3  David         IT   52  95000.0

Use isin method for value matching

In this step, you'll learn how to filter for rows where a column's value is one of several possible values. While you could use multiple | (OR) conditions, a more efficient and readable way is to use the isin() method. This method takes a list of values and returns True for each row where the column's value is in that list.

Let's find all employees who work in either the 'HR' or 'Finance' departments.

Update the filtering logic in your main.py file with the following code:

# Filter for employees in HR or Finance departments
hr_or_finance = df[df['Department'].isin(['HR', 'Finance'])]

print("Employees in HR or Finance:")
print(hr_or_finance)

Here, isin(['HR', 'Finance']) checks which rows in the Department column have a value of either 'HR' or 'Finance'.
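A related trick worth knowing: prefixing the condition with ~ inverts the boolean mask, selecting rows whose value is not in the list. A small sketch with illustrative data:

```python
import pandas as pd

# Illustrative data based on the lab's sample
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'Finance'],
})

# ~ negates the mask: keep rows NOT in HR or Finance
not_hr_finance = df[~df['Department'].isin(['HR', 'Finance'])]
print(not_hr_finance['Name'].tolist())  # ['Charlie']
```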

Save the file and execute the script in the terminal:

python3 main.py

The output will display all employees from the specified departments.

Employees in HR or Finance:
    Name Department  Age   Salary
0  Alice         HR   25  50000.0
1    Bob         HR   45  80000.0
5  Frank    Finance   33      NaN

Filter with notnull to remove NaN

In this step, we will address how to handle missing data. In Pandas, missing values are typically represented by NaN (Not a Number). It's often necessary to filter out rows that contain these missing values before performing calculations.

The notnull() method returns a boolean Series that is True for non-missing values and False for missing (NaN) values. You can use this to easily remove rows with NaN in a specific column.

Let's filter our DataFrame to see only the employees for whom we have salary information.

Modify the main.py file to use the notnull() method on the Salary column.

# Filter out rows with missing Salary
valid_salary = df[df['Salary'].notnull()]

print("Employees with valid salary information:")
print(valid_salary)

This code will select all rows where the Salary column does not contain NaN.
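An equivalent shortcut you will often see in the wild is dropna() with its subset parameter, which drops rows where the listed columns are NaN. A minimal sketch with a two-row example:

```python
import pandas as pd
import numpy as np

# Two illustrative rows, one with a missing salary
df = pd.DataFrame({
    'Name': ['Eve', 'Frank'],
    'Salary': [62000.0, np.nan],
})

# dropna(subset=...) removes rows where Salary is NaN,
# matching the result of df[df['Salary'].notnull()]
valid = df.dropna(subset=['Salary'])
print(valid['Name'].tolist())  # ['Eve']
```

Both approaches give the same result here; notnull() is handy when you want to combine the mask with other conditions.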

Save the file and run the script from the terminal:

python3 main.py

As you can see in the output, Frank, who had a NaN salary, has been excluded from the result.

Employees with valid salary information:
      Name Department  Age   Salary
0    Alice         HR   25  50000.0
1      Bob         HR   45  80000.0
2  Charlie      Sales   38  75000.0
3    David         IT   52  95000.0
4      Eve         IT   29  62000.0

Count filtered rows using len function

In this final step, you will learn how to count the number of rows in a filtered DataFrame. After applying a filter, you often need to know how many rows matched your criteria. A simple way to do this is by using Python's built-in len() function, which returns the number of rows in a DataFrame.

Let's filter for all employees in the 'IT' department and then count them.

Update your main.py file with the following code. We will first filter the DataFrame and then pass the resulting filtered DataFrame to the len() function.

# Filter for employees in the IT department
it_employees = df[df['Department'] == 'IT']

# Count the number of IT employees
num_it_employees = len(it_employees)

print(f"Number of employees in IT: {num_it_employees}")

This code first creates a new DataFrame it_employees containing only the rows for the IT department. Then, it calculates the length of this new DataFrame and prints a formatted string with the result.
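If you only need the count and not the filtered rows themselves, you can also sum the boolean mask directly, since each True counts as 1. A short sketch with illustrative data:

```python
import pandas as pd

# Illustrative department column
df = pd.DataFrame({'Department': ['HR', 'IT', 'IT', 'Sales']})

# Summing a boolean mask counts the True values, i.e. the matching rows
count = (df['Department'] == 'IT').sum()
print(count)  # 2
```

This skips building the intermediate DataFrame, which can be convenient for quick checks.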

Save the file and run the script:

python3 main.py

The output will be a single line telling you the count.

Number of employees in IT: 2

Summary

Congratulations on completing the Pandas Filtering Data lab!

In this lab, you have learned and practiced the essential techniques for selecting subsets of data from a Pandas DataFrame. You have covered:

  • Boolean Indexing: Filtering data based on a single condition (e.g., df[df['Age'] > 30]).
  • Combining Filters: Using the & (AND) operator to apply multiple conditions simultaneously, remembering to wrap each condition in parentheses.
  • Value Matching with isin(): Efficiently filtering for rows where a column's value matches any value in a given list.
  • Handling Missing Data: Using the notnull() method to remove rows with NaN values.
  • Counting Filtered Rows: Using the len() function to get the number of rows in a filtered DataFrame.

These filtering skills are fundamental to nearly all data analysis tasks. Mastering them will allow you to effectively explore and prepare your datasets for further analysis and visualization. Keep practicing these techniques to become more proficient with Pandas.