Introduction
Welcome to the Pandas Filtering Data lab! Data filtering is one of the most common and essential tasks in data analysis. It allows you to select a subset of your data that meets specific criteria, enabling you to focus on the information that is most relevant to your analysis.
Pandas is a powerful Python library for data manipulation and analysis. Its core data structure, the DataFrame, is a two-dimensional table of data with rows and columns. In this lab, you will learn several fundamental techniques to filter rows from a Pandas DataFrame. We will cover:
- Filtering with a single boolean condition.
- Combining multiple conditions using logical operators.
- Using the
isin()method for matching multiple values. - Removing rows with missing data (
NaN). - Counting the results of your filtered data.
By the end of this lab, you will have a solid foundation for selecting and isolating data in your Pandas projects.
Filter rows with boolean condition
In this step, you will learn the most basic filtering technique: boolean indexing. This method involves creating a boolean condition that returns a Pandas Series of True and False values. When you pass this Series to a DataFrame, it returns only the rows where the value is True.
Let's start by filtering our DataFrame to find all employees who are older than 30.
First, open the main.py file in the editor on the left. The file has been pre-populated with a sample DataFrame.
Now, replace the existing print statements at the bottom of main.py with the following code. This code creates a boolean condition df['Age'] > 30 and uses it to filter the DataFrame.
## Filter for employees older than 30
older_than_30 = df[df['Age'] > 30]
print("Employees older than 30:")
print(older_than_30)
Your complete main.py file should now look like this:
import pandas as pd
import numpy as np
## Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
## Filter for employees older than 30
older_than_30 = df[df['Age'] > 30]
print("Employees older than 30:")
print(older_than_30)
Now, let's run the script. Open the terminal and execute the following command:
python3 main.py
You should see the following output, which lists only the employees whose age is greater than 30.
Employees older than 30:
Name Department Age Salary
1 Bob HR 45 80000.0
2 Charlie Sales 38 75000.0
3 David IT 52 95000.0
5 Frank Finance 33 NaN
Combine filters using & operator
In this step, we will combine multiple conditions to perform more complex filtering. You can combine boolean conditions using logical operators. The most common ones are & for AND and | for OR.
A crucial syntax rule is that each condition must be enclosed in parentheses () due to Python's operator precedence.
Let's filter the DataFrame to find employees who are in the 'IT' department and are older than 30.
Modify your main.py file. Replace the filtering code from the previous step with the new code below.
## Filter for employees in IT and older than 30
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]
print("IT employees older than 30:")
print(it_and_older)
Notice how each condition, df['Department'] == 'IT' and df['Age'] > 30, is wrapped in its own set of parentheses.
Save the main.py file and run it from the terminal:
python3 main.py
The output will show only the employees who satisfy both conditions. In our dataset, only David matches this criteria.
IT employees older than 30:
Name Department Age Salary
3 David IT 52 95000.0
Use isin method for value matching
In this step, you'll learn how to filter for rows where a column's value is one of several possible values. While you could use multiple | (OR) conditions, a more efficient and readable way is to use the isin() method. This method takes a list of values and returns True for each row where the column's value is in that list.
Let's find all employees who work in either the 'HR' or 'Finance' departments.
Update the filtering logic in your main.py file with the following code:
## Filter for employees in HR or Finance departments
hr_or_finance = df[df['Department'].isin(['HR', 'Finance'])]
print("Employees in HR or Finance:")
print(hr_or_finance)
Here, isin(['HR', 'Finance']) checks which rows in the Department column have a value of either 'HR' or 'Finance'.
Save the file and execute the script in the terminal:
python3 main.py
The output will display all employees from the specified departments.
Employees in HR or Finance:
Name Department Age Salary
0 Alice HR 25 50000.0
1 Bob HR 45 80000.0
5 Frank Finance 33 NaN
Filter with notnull to remove NaN
In this step, we will address how to handle missing data. In Pandas, missing values are typically represented by NaN (Not a Number). It's often necessary to filter out rows that contain these missing values before performing calculations.
The notnull() method returns a boolean Series that is True for non-missing values and False for missing (NaN) values. You can use this to easily remove rows with NaN in a specific column.
Let's filter our DataFrame to see only the employees for whom we have salary information.
Modify the main.py file to use the notnull() method on the Salary column.
## Filter out rows with missing Salary
valid_salary = df[df['Salary'].notnull()]
print("Employees with valid salary information:")
print(valid_salary)
This code will select all rows where the Salary column does not contain NaN.
Save the file and run the script from the terminal:
python3 main.py
As you can see in the output, Frank, who had a NaN salary, has been excluded from the result.
Employees with valid salary information:
Name Department Age Salary
0 Alice HR 25 50000.0
1 Bob HR 45 80000.0
2 Charlie Sales 38 75000.0
3 David IT 52 95000.0
4 Eve IT 29 62000.0
Count filtered rows using len function
In this final step, you will learn how to count the number of rows in a filtered DataFrame. After applying a filter, you often need to know how many rows matched your criteria. A simple way to do this is by using Python's built-in len() function, which returns the number of rows in a DataFrame.
Let's filter for all employees in the 'IT' department and then count them.
Update your main.py file with the following code. We will first filter the DataFrame and then pass the resulting filtered DataFrame to the len() function.
## Filter for employees in the IT department
it_employees = df[df['Department'] == 'IT']
## Count the number of IT employees
num_it_employees = len(it_employees)
print(f"Number of employees in IT: {num_it_employees}")
This code first creates a new DataFrame it_employees containing only the rows for the IT department. Then, it calculates the length of this new DataFrame and prints a formatted string with the result.
Save the file and run the script:
python3 main.py
The output will be a single line telling you the count.
Number of employees in IT: 2
Summary
Congratulations on completing the Pandas Filtering Data lab!
In this lab, you have learned and practiced the essential techniques for selecting subsets of data from a Pandas DataFrame. You have covered:
- Boolean Indexing: Filtering data based on a single condition (e.g.,
df[df['Age'] > 30]). - Combining Filters: Using the
&(AND) operator to apply multiple conditions simultaneously, remembering to wrap each condition in parentheses. - Value Matching with
isin(): Efficiently filtering for rows where a column's value matches any value in a given list. - Handling Missing Data: Using the
notnull()method to remove rows withNaNvalues. - Counting Filtered Rows: Using the
len()function to get the number of rows in a filtered DataFrame.
These filtering skills are fundamental to nearly all data analysis tasks. Mastering them will allow you to effectively explore and prepare your datasets for further analysis and visualization. Keep practicing these techniques to become more proficient with Pandas.



