Pandas Data Selection: Select Columns and Rows

Introduction

Welcome to the Pandas Selecting Data lab! Pandas is a powerful library for data manipulation and analysis in Python. One of the most fundamental tasks in data analysis is selecting specific subsets of your data. Whether you need to examine a single column, a few specific rows, or a complex slice of your dataset, Pandas provides a variety of flexible and efficient methods to get the job done.

In this lab, you will work with a sample dataset of student information. You will learn how to:

Select single and multiple columns using bracket notation.
Select rows by their labels using the .loc accessor.
Select rows by their integer position using the .iloc accessor.
Combine row and column selection to extract precise slices of data.

By the end of this lab, you will have a solid understanding of the core data selection techniques in Pandas, which are essential for any data-related task.

Select single column using bracket notation

In this step, you will learn the most common way to select a single column from a Pandas DataFrame. This is done using bracket notation [], similar to how you would access a value in a Python dictionary.

First, we need to load our data from the students.csv file into a DataFrame. Then, we can select a column by passing its name as a string inside the brackets.

Please open the main.py file from the file explorer on the left and add the following code.

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

# Select the 'name' column
name_column = df['name']

# Print the selected column
print(name_column)

Now, let's run the script. Open a terminal in the WebIDE and execute the following command:

python3 main.py

You will see the output, which is a Pandas Series object containing all the names from the 'name' column.

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: name, dtype: object

As you can see, selecting a single column returns a Series, which is essentially a one-dimensional labeled array.
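If you want to check this without students.csv, the following minimal sketch builds a small stand-in DataFrame in memory. It assumes only the 'name' and 'score' values shown in this lab's output; the 'grade' key is a made-up column name used to show what happens when a column does not exist.

```python
import pandas as pd

# Stand-in for students.csv (assumption: only the two columns needed here)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "score": [85, 92, 95, 88, 90],
})

name_column = df["name"]

# A single string key returns a one-dimensional Series
print(type(name_column).__name__)   # Series
print(name_column.shape)            # (5,)

# The Series keeps the row labels and takes the column name as its own name
print(name_column.name)             # name
print(name_column.index.tolist())   # [0, 1, 2, 3, 4]

# As with a dictionary, a key that does not exist raises KeyError
# ('grade' is a hypothetical column that is not in the data)
try:
    df["grade"]
except KeyError:
    print("no column named 'grade'")
```

The dictionary comparison carries over to errors as well: asking for a missing column fails loudly instead of returning an empty result.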

Select multiple columns by list

In this step, we will expand on the previous technique to select multiple columns at once. To do this, instead of passing a single string, you pass a list of column names inside the selection brackets. Notice the use of double brackets [[]]: the outer brackets are for the selection itself, and the inner brackets create the list of columns.

Let's modify the main.py file to select both the name and score columns.

Update your main.py with the following code:

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

# Select the 'name' and 'score' columns
subset = df[['name', 'score']]

# Print the resulting subset DataFrame
print(subset)

Now, run the script again from your terminal:

python3 main.py

The output will be a new DataFrame containing only the columns you specified.

      name  score
0    Alice     85
1      Bob     92
2  Charlie     95
3    David     88
4      Eve     90

Unlike selecting a single column, which returns a Series, selecting multiple columns returns a new DataFrame.
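The return type depends on whether you pass a string or a list, not on how many columns you ask for. Here is a short sketch of that rule, using an in-memory stand-in for students.csv that assumes only the 'name' and 'score' values shown above.

```python
import pandas as pd

# Stand-in for students.csv (assumption: only the two columns needed here)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "score": [85, 92, 95, 88, 90],
})

as_series = df["name"]     # string key: Series
as_frame = df[["name"]]    # one-element list: DataFrame with a single column

print(type(as_series).__name__, as_series.shape)   # Series (5,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (5, 1)

# The order of names in the list sets the column order of the result
reordered = df[["score", "name"]]
print(reordered.columns.tolist())                  # ['score', 'name']
```

Passing a one-element list is a handy way to keep a DataFrame when later code expects two-dimensional data.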

Use loc to select rows by label

In this step, you will learn how to select data based on its label using the .loc accessor. The .loc indexer is primarily label-based, meaning you use the index names (or labels) to make selections. By default, when you load a CSV without specifying an index column, Pandas assigns a default integer index starting from 0. These integers act as the labels for the rows.

Let's use .loc to select the third row of our DataFrame, which has the index label 2.

Update your main.py file with the following code:

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

# Select the row with index label 2
charlie_data = df.loc[2]

# Print the selected row
print(charlie_data)

Run the script from your terminal:

python3 main.py

The output will be a Series containing all the data for the row with index label 2.

name         Charlie
age               21
major    Mathematics
score             95
Name: 2, dtype: object

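This shows the data for the student "Charlie". The column names become the index of the returned Series, and the row label (2) appears as its Name. Using .loc is the direct way to access rows when you know their labels.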
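Labels do not have to be integers. The sketch below shows what .loc looks like when a column is used as the index. It reads an inline stand-in for students.csv through io.StringIO. The header layout is an assumption based on the output shown in this lab, and only the four rows whose values appear in full are included.

```python
import io
import pandas as pd

# Inline stand-in for students.csv (assumed layout, first four rows only)
csv_text = """name,age,major,score
Alice,20,Physics,85
Bob,22,Computer Science,92
Charlie,21,Mathematics,95
David,23,Engineering,88
"""

# index_col makes the 'name' column the row labels instead of 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), index_col="name")

# .loc now expects a name, not an integer
charlie = df.loc["Charlie"]
print(charlie["major"])        # Mathematics

# A list of labels returns a DataFrame with rows in the requested order
pair = df.loc[["David", "Alice"]]
print(pair.index.tolist())     # ['David', 'Alice']

# The integer 2 is no longer a label, so looking it up raises KeyError
try:
    df.loc[2]
except KeyError:
    print("2 is not a row label")
```

With names as the index, the same lookup that needed the label 2 earlier now reads as df.loc["Charlie"], which is usually easier to follow.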

Use iloc to select rows by integer position

In this step, we'll explore another selection method: .iloc. The .iloc indexer is integer-position based. It follows the same rules as Python list indexing: positions start at 0, negative positions count from the end, and slices exclude the end position. This is different from .loc, which uses labels. With our default index the labels and positions happen to match, but the distinction becomes crucial when the labels are not integers, or when integer labels no longer line up with positions (for example, after sorting or filtering rows).

Let's use .iloc to select the very first row of the DataFrame, which is at integer position 0.

Update your main.py file with the following code:

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

# Select the first row (at integer position 0)
first_row = df.iloc[0]

# Print the selected row
print(first_row)

Run the script from your terminal:

python3 main.py

The output will be a Series containing the data for the first student, "Alice".

name       Alice
age           20
major    Physics
score         85
Name: 0, dtype: object

Remember the key difference: .loc is for labels, .iloc is for integer positions.
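The difference is easiest to see when labels and positions disagree. The following sketch uses hypothetical row labels 10, 20, and 30 (they are not part of students.csv) on a three-row stand-in built from values shown in this lab.

```python
import pandas as pd

# Hypothetical labels 10, 20, 30 replace the default 0, 1, 2
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [85, 92, 95]},
    index=[10, 20, 30],
)

print(df.iloc[0]["name"])    # Alice: first row by position
print(df.loc[10]["name"])    # Alice: row whose label is 10

# Positions behave like list indices, including negative ones
print(df.iloc[2]["name"])    # Charlie
print(df.iloc[-1]["name"])   # Charlie

# .iloc slices exclude the end position, like list slicing
first_two = df.iloc[0:2]["name"].tolist()
print(first_two)             # ['Alice', 'Bob']

# There is no row labeled 0, so .loc raises KeyError
try:
    df.loc[0]
except KeyError:
    print("0 is not a row label")
```

Here df.iloc[0] and df.loc[10] return the same row, while df.loc[0] fails because .loc never falls back to positions.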

Slice rows and columns with loc

In this final step, you will combine what you've learned to perform more powerful selections. Both .loc and .iloc can select rows and columns simultaneously using the syntax df.loc[row_selector, column_selector].

We will use .loc to select a slice of rows and a slice of columns. A key feature of .loc is that when you slice with labels (e.g., 1:3), the end label (3) is inclusive, unlike ordinary Python slices and .iloc, which exclude the end.

Let's select the rows from index label 1 to 3 and the columns from name to major.

Update your main.py file with the following code:

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

# Select rows with index labels 1 through 3 (inclusive)
# and columns from 'name' to 'major' (inclusive)
data_slice = df.loc[1:3, 'name':'major']

# Print the resulting slice
print(data_slice)

Run the script from your terminal:

python3 main.py

The output is a new DataFrame that is a specific "slice" of the original data.

      name  age             major
1      Bob   22  Computer Science
2  Charlie   21       Mathematics
3    David   23       Engineering

This technique is extremely useful for isolating specific regions of your dataset for analysis.
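The same region can also be selected by position with .iloc, where both end points are excluded. The sketch below compares the two forms on an in-memory stand-in for students.csv, assuming the column order shown above and only the four rows whose values appear in full in this lab.

```python
import pandas as pd

# Stand-in for students.csv (assumed column order, first four rows only)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [20, 22, 21, 23],
    "major": ["Physics", "Computer Science", "Mathematics", "Engineering"],
    "score": [85, 92, 95, 88],
})

# Label slices include both ends: rows 1, 2, 3 and columns name, age, major
by_label = df.loc[1:3, "name":"major"]

# Position slices exclude the end: rows 1 to 3 and columns 0 to 2
by_position = df.iloc[1:4, 0:3]

print(by_label.equals(by_position))   # True

# The same 1:3 slice selects a different number of rows in each accessor
print(df.loc[1:3].shape)              # (3, 4)
print(df.iloc[1:3].shape)             # (2, 4)
```

Prefer .loc when the column names matter to the reader, and .iloc when you only care about where the data sits in the table.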

Summary

Congratulations on completing the lab! You have successfully learned the fundamental methods for selecting data in Pandas.

In this lab, you practiced:

Selecting a single column using bracket notation df['column'], which returns a Series.
Selecting multiple columns using a list in bracket notation df[['col1', 'col2']], which returns a DataFrame.
Selecting rows by their label with .loc, which matches index labels rather than positions and includes the end label when slicing.
Selecting rows by their integer position with .iloc, which follows standard Python slicing rules.
Combining row and column selectors with .loc to extract specific, two-dimensional slices of your data.

Mastering these selection techniques is a critical first step in becoming proficient with Pandas for data analysis and manipulation. You can now confidently access any part of your DataFrame to inspect, analyze, or modify it.