Pandas: Reading External Data from CSV Files

Introduction

In data analysis, the first step is often to load your data into a structured format. Pandas, a powerful Python library, excels at this task. The most common function for reading data from flat files like CSVs is pandas.read_csv(). This function is incredibly flexible, offering a wide range of parameters to handle different file formats and potential issues.

In this lab, you will learn how to use pd.read_csv() to import data from a CSV file into a Pandas DataFrame. We will cover how to handle comments, specify headers, manage missing values, and perform initial inspections of the loaded data.

Read CSV file using read_csv

In this step, you will learn the basic usage of the pd.read_csv() function to load data from a CSV file into a Pandas DataFrame.

Our project directory, ~/project, contains a file named data.csv. Let's examine its content. It includes some comment lines at the beginning, which start with a # symbol. The read_csv function can be instructed to ignore these lines using the comment parameter.

First, open the main.py file from the file explorer on the left. We will write our Python code in this file.

Add the following code to main.py to import Pandas and read the data.csv file. We'll assign the resulting DataFrame to a variable named df and then print it.

import pandas as pd

## Read the CSV file, treating lines starting with '#' as comments
df = pd.read_csv('data.csv', comment='#')

## Print the DataFrame
print(df)

Now, to run the script, open a terminal in the WebIDE (you can use the menu Terminal > New Terminal) and execute the following command:

python3 main.py

You should see the following output, which shows the content of the CSV file loaded into a structured DataFrame.

   id     name          age         city
0   1    Alice           25     New York
1   2      Bob           30  Los Angeles
2   3  Charlie  Not Available       London
3   4    David           35          NaN
4   5      Eve           22        Paris

Specify header row in read_csv

In this step, we will explore the header parameter of the read_csv function.

By default, read_csv assumes that the first non-commented, non-skipped row in your file is the header row containing column names. In our data.csv file, after skipping the comment lines, the line id,name,age,city is correctly inferred as the header.

The header parameter allows you to explicitly specify which row to use as the header. It takes an integer representing the row index (starting from 0). Since the first data line (after comments) is our header, its index is 0.

Let's modify the main.py file to explicitly set header=0. While this is the default behavior in our case, being explicit can prevent errors with unusually formatted files.

Update your main.py with the following code:

import pandas as pd

## Explicitly specify that the first row (index 0) after comments is the header
df = pd.read_csv('data.csv', comment='#', header=0)

print(df)

Run the script again from the terminal:

python3 main.py

The output will be identical to the previous step, as we have only confirmed the default behavior. This practice is useful for code clarity and robustness.

   id     name          age         city
0   1    Alice           25     New York
1   2      Bob           30  Los Angeles
2   3  Charlie  Not Available       London
3   4    David           35          N/A
4   5      Eve           22        Paris

Handle missing values with na_values parameter

In this step, you'll learn how to handle custom representations of missing data.

If you look at the DataFrame from the previous step, you'll see the values "Not Available" in the age column and NaN in the city column. Pandas automatically recognizes some common missing value indicators like empty strings, NA, or N/A, but not custom ones like "Not Available". We can use the na_values parameter to provide a list of strings that should be interpreted as NaN (Not a Number), which is Pandas' standard marker for missing data.

Modify your main.py to include the na_values parameter.

import pandas as pd

## Define a list of strings to be treated as missing values
missing_values = ["Not Available", "N/A"]

## Read the CSV, specifying the custom missing values
df = pd.read_csv('data.csv', comment='#', na_values=missing_values)

print(df)

Now, run the script:

python3 main.py

Observe the new output. The "Not Available" string has been replaced with NaN.

   id     name   age         city
0   1    Alice  25.0     New York
1   2      Bob  30.0  Los Angeles
2   3  Charlie   NaN       London
3   4    David  35.0          NaN
4   5      Eve  22.0        Paris

Notice that the age column's data type has also changed to float64 to accommodate the NaN value.

Display first rows using head method

In this step, we will learn how to inspect the first few rows of a DataFrame.

When working with large datasets, printing the entire DataFrame is inefficient and can clutter your screen. The .head() method is a convenient way to get a quick preview of your data. By default, it returns the first 5 rows.

You can also pass an integer to .head() to specify the number of rows you want to see. Let's display only the first 3 rows of our DataFrame.

Update your main.py file as follows. Instead of printing the whole DataFrame, we will now print only its head.

import pandas as pd

missing_values = ["Not Available", "N/A"]
df = pd.read_csv('data.csv', comment='#', na_values=missing_values)

## Display the first 3 rows of the DataFrame
print(df.head(3))

Execute the script from your terminal:

python3 main.py

The output will now be much shorter, showing only the header and the first three records.

   id     name   age         city
0   1    Alice  25.0     New York
1   2      Bob  30.0  Los Angeles
2   3  Charlie   NaN       London

Check DataFrame shape using shape attribute

In this step, you will learn how to check the dimensions of your DataFrame.

Knowing the number of rows and columns is a fundamental check in data analysis. Pandas DataFrames have a .shape attribute that returns a tuple containing the number of rows and columns.

Note that .shape is an attribute, not a method, so you don't use parentheses () when accessing it.

Let's modify main.py to print the shape of our DataFrame.

import pandas as pd

missing_values = ["Not Available", "N/A"]
df = pd.read_csv('data.csv', comment='#', na_values=missing_values)

## Get the dimensions (rows, columns) of the DataFrame
df_shape = df.shape

print(df_shape)

Run the script one last time:

python3 main.py

The output will be a tuple indicating that our DataFrame has 5 rows and 4 columns.

(5, 4)

Summary

Congratulations on completing this lab! You have learned the essential skills for reading and inspecting data using Pandas.

In this lab, you have mastered:

Reading a CSV file into a DataFrame using pd.read_csv().
Skipping comment lines with the comment parameter.
Explicitly defining the header row with the header parameter.
Handling custom missing values using the na_values parameter.
Previewing a DataFrame with the .head() method.
Checking the dimensions of a DataFrame with the .shape attribute.

These are fundamental operations that form the starting point for nearly every data analysis task. With these skills, you are now better equipped to handle a variety of data import challenges.

Pandas Reading External Data