Using a callable (like a lambda function) in the column position of .loc is an advanced technique. The function receives the DataFrame being indexed as its only argument and must return a valid column selection (like a list of column names or a boolean array with one entry per column).
This is particularly useful when you don't know the exact column names or index positions beforehand, or when you are chaining commands together.
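To make the mechanics concrete, here is a minimal sketch. The sample DataFrame and the helper name last_two_columns are made up for illustration. It shows that passing a callable gives the same result as calling the function on the DataFrame yourself, and that a boolean array works as a return value too.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "major": ["math", "physics"],
    "score": [88, 92],
})

def last_two_columns(frame):
    # Receives the DataFrame being indexed and returns a list of labels
    return list(frame.columns[-2:])

# Passing the function itself...
via_callable = df.loc[:, last_two_columns]
# ...selects the same columns as passing its result
via_direct = df.loc[:, last_two_columns(df)]

# A callable may also return a boolean array, one entry per column
mask_result = df.loc[:, lambda x: x.columns != "score"]

print(list(via_callable.columns))  # ['major', 'score']
print(list(mask_result.columns))   # ['name', 'major']
```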
Here are the most common ways to use callables for columns:
1. Filtering by Column Name Properties
You can filter columns based on their names (strings). For example, finding columns that contain a specific word or exceed a certain length.
# Select columns whose names are longer than 4 characters
# 'x' is the DataFrame inside the lambda (assumes every column name is a string)
long_names = df.loc[:, lambda x: [col for col in x.columns if len(col) > 4]]
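For the "contains a specific word" case, the string accessor on the column index is a compact alternative to a list comprehension. This is a sketch on a made-up DataFrame; the column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "math_score": [90, 75],
    "art_score": [60, 85],
    "student_id": [1, 2],
})

# Keep columns whose name contains the word "score"
score_cols = df.loc[:, lambda x: x.columns.str.contains("score")]

# Keep columns whose name is longer than 9 characters;
# astype(str) guards against labels that are not strings
long_label_cols = df.loc[:, lambda x: x.columns.astype(str).str.len() > 9]

print(list(score_cols.columns))       # ['math_score', 'art_score']
print(list(long_label_cols.columns))  # ['math_score', 'student_id']
```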
2. Filtering by Data Type
You can use a callable to select only numeric columns or only object (string) columns.
# Select only columns that are numeric (int or float)
# Note: comparing dtypes to 'object' would also keep datetime and boolean columns
numeric_cols = df.loc[:, lambda x: x.select_dtypes('number').columns]
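To see dtype-based selection on a mixed frame, here is a small sketch. The sample DataFrame and its column names are made up. It uses select_dtypes inside the callable and returns the matching labels, so the datetime column is left out of the numeric selection.

```python
import pandas as pd

# Hypothetical sample data with text, integer, float, and datetime columns
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "score": [88, 92],
    "gpa": [3.5, 3.9],
    "enrolled": pd.to_datetime(["2023-09-01", "2024-01-15"]),
})

# Numeric columns only: the callable returns the matching labels
numeric_cols = df.loc[:, lambda x: x.select_dtypes(include="number").columns]

# Everything that is not numeric
non_numeric_cols = df.loc[:, lambda x: x.select_dtypes(exclude="number").columns]

print(list(numeric_cols.columns))      # ['score', 'gpa']
print(list(non_numeric_cols.columns))  # ['name', 'enrolled']
```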
3. Filtering by Data Values
A callable can look at the data inside the columns to decide whether to keep them. For example, selecting columns where the average value is above a certain threshold.
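# Select numeric columns where the mean value is greater than 50
# Note: the lambda returns the matching labels. A boolean Series that covers only
# the numeric columns cannot be aligned with df.columns when df has other dtypes.
high_value_cols = df.loc[:, lambda x: x.mean(numeric_only=True).loc[lambda m: m > 50].index]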
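If you need the threshold to be adjustable, one option is a small factory that builds the selector for you. This is a sketch under assumptions: the helper name columns_with_mean_above and the sample data are hypothetical, not part of pandas.

```python
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "score": [88, 92],
    "absences": [2, 5],
})

def columns_with_mean_above(threshold):
    """Build a column selector for .loc (hypothetical helper)."""
    def selector(frame):
        # Means of the numeric columns only; text columns are skipped
        means = frame.mean(numeric_only=True)
        # Return the labels of the columns that pass the threshold
        return means[means > threshold].index
    return selector

high_mean = df.loc[:, columns_with_mean_above(50)]
print(list(high_mean.columns))  # ['score']
```

Returning labels rather than a boolean mask means the text column never has to be matched against the means.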
Why use a callable instead of just a list?
The main reason is method chaining. If you are performing multiple operations in one go, each step returns a new DataFrame, so a list built from the original df can be out of date. A callable is evaluated against the "current" state of the data at that point in the chain:
# Example of chaining (conceptual)
result = (df.rename(columns={'major': 'specialization'})
            .loc[:, lambda x: x.columns.str.startswith('s')])
# The lambda's 'x' is the renamed DataFrame, so it sees the 'specialization' column!
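Here is a runnable version of that idea, as a sketch on made-up data. The score_pct column is an assumption added to show that the callable also sees columns created earlier in the same chain.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "major": ["math", "physics"],
    "score": [88, 92],
})

result = (
    df.rename(columns={"major": "specialization"})
      # Add a derived column in the middle of the chain
      .assign(score_pct=lambda x: x["score"] / 100)
      # 'x' here is the renamed, extended DataFrame, not the original df
      .loc[:, lambda x: x.columns.str.startswith("s")]
)

print(list(result.columns))  # ['specialization', 'score', 'score_pct']
print(list(df.columns))      # ['name', 'major', 'score'] (original is unchanged)
```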
Try this in your environment:
If you want to see it in action, add this to your main.py:
# Select columns where the name contains the letter 'a'
lambda_filter = df.loc[:, lambda x: [c for c in x.columns if 'a' in c]]
print(lambda_filter)
It's a very dynamic way to handle data that might change structure! Do you have a specific filtering goal in mind for your project?