Handling Missing Data in Pandas

Introduction

In this lab, we will learn how to handle missing data in pandas, a common issue in data analysis. We'll cover how to detect missing values, insert them, calculate with them, drop them, fill them in by interpolation, and replace values. We will also discuss the experimental NA scalar that pandas can use to denote missing values.

Import Necessary Libraries and Create DataFrame

To start, we import pandas and NumPy, then build a DataFrame and reindex it with labels that are not in the original index. Each new label ("b", "d", "g") produces a row of missing values.

import pandas as pd
import numpy as np

## Create a DataFrame with missing values
df = pd.DataFrame(
   np.random.randn(5, 3),
   index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"],
)
df["four"] = "bar"
df["five"] = df["one"] > 0
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

Detect Missing Values

Next, we'll use the isna and notna functions to detect missing values.

## Use isna and notna to detect missing values
pd.isna(df2["one"])
df2["four"].notna()
df2.isna()
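
Because isna returns booleans, its result can be summed or used as a filter. The following is a minimal sketch on a small hypothetical frame (the name demo and its values are illustrative assumptions, not part of the lab data) that counts missing entries per column and selects the rows that contain any.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not the lab's df2.
demo = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0], "y": ["a", None, None]}
)

# isna() gives a boolean frame; summing it counts missing entries per column.
missing_per_column = demo.isna().sum()

# any(axis=1) flags rows that hold at least one missing value.
rows_with_missing = demo[demo.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

The same two expressions should work unchanged on df2.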

Insert Missing Data

Missing values can also be inserted by assignment. For a numeric Series, assigning None stores NaN and the dtype stays float64.

## Insert missing values
s = pd.Series([1., 2., 3.])
s.loc[0] = None
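
The marker that pandas stores depends on the dtype of the container. As a small sketch (the Series names here are illustrative assumptions), assigning None to a float Series yields NaN, while assigning it to a datetime Series yields NaT.

```python
import numpy as np
import pandas as pd

# Float data: None is stored as NaN and the dtype is unchanged.
float_s = pd.Series([1.0, 2.0, 3.0])
float_s.loc[0] = None

# Datetime data: None is stored as NaT (not-a-time).
date_s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
date_s.loc[0] = None

print(float_s)
print(date_s)
```

In both cases isna reports the first element as missing.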

Perform Calculations with Missing Data

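Next, we'll run some arithmetic and statistical calculations on df2, which contains the missing rows. By default, sum and mean skip missing values, while cumsum leaves them in place and keeps accumulating across them.

## Perform calculations with missing data
numeric = df2[["one", "two", "three"]]  # restrict to the numeric columns
df2["one"].sum()
numeric.mean(axis=1)
numeric.cumsum()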
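
To make the treatment of missing values concrete, here is a short sketch on a hypothetical three-element Series (vals is an illustrative name). It contrasts the default behavior with skipna=False, which lets a missing value propagate into the result.

```python
import numpy as np
import pandas as pd

vals = pd.Series([1.0, np.nan, 3.0])

total = vals.sum()                    # NaN is skipped: 1.0 + 3.0
strict_total = vals.sum(skipna=False) # NaN propagates into the result
average = vals.mean()                 # mean of the two valid values
running = vals.cumsum()               # NaN stays in place, sum continues after it

print(total, strict_total, average)
print(running)
```

Whether skipping is appropriate depends on the analysis; skipna=False is useful when a missing input should invalidate the aggregate.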

Drop Axis Labels with Missing Data

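We'll use dropna to exclude rows or columns that contain missing data. We apply it to df2, since df itself has no missing values.

df2.dropna(axis=0)  # drop rows with any missing value
df2.dropna(axis=1)  # drop columns with any missing value (here, every column)
df2["one"].dropna()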
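
Dropping every row that has any missing value can discard more data than intended. The sketch below, on a hypothetical frame named demo, shows three dropna parameters that narrow the rule: how, thresh, and subset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, np.nan, 6.0, 7.0],
        "c": [1.0, np.nan, 1.0, 1.0],
    }
)

# Drop a row only when all of its values are missing.
drop_empty_rows = demo.dropna(how="all")

# Keep rows that have at least two non-missing values.
at_least_two = demo.dropna(thresh=2)

# Consider only column "a" when deciding which rows to drop.
require_a = demo.dropna(subset=["a"])

print(drop_empty_rows)
print(at_least_two)
print(require_a)
```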

Interpolate Missing Values

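We'll use the interpolate function to fill in missing values in a DataFrame. By default it interpolates linearly between the neighboring valid values, treating the rows as equally spaced.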

df = pd.DataFrame(
   {
       "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
       "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   }
)
df.interpolate()
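
Interpolation is one of several ways to fill gaps. As a rough comparison, the sketch below applies linear interpolation, interpolation limited to one consecutive value, a constant fill, and a forward fill to the same hypothetical column (demo is an illustrative name, separate from the lab's df).

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0]})

linear = demo.interpolate()          # fills the gap with 2.0 and 3.0
limited = demo.interpolate(limit=1)  # fills at most one consecutive NaN
constant = demo.fillna(0)            # replaces every NaN with 0
forward = demo.ffill()               # repeats the last valid value

print(linear["A"].tolist())
print(limited["A"].tolist())
print(constant["A"].tolist())
print(forward["A"].tolist())
```

Which strategy is suitable depends on the data: interpolation assumes a smooth trend, while forward fill assumes the last observation still holds.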

Replace Generic Values

We'll learn how to replace arbitrary values with other values using replace.

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
ser.replace(0, 5)
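
replace is also a convenient way to turn placeholder values into proper missing values. In this sketch we assume, purely for illustration, that -999.0 was used as a sentinel for "no measurement"; the second call shows that a dict can map several values at once.

```python
import numpy as np
import pandas as pd

# -999.0 is a hypothetical sentinel value chosen for this example.
readings = pd.Series([0.0, 1.0, -999.0, 3.0])

# Convert the sentinel to NaN so that isna, dropna, etc. can see it.
cleaned = readings.replace(-999.0, np.nan)

# A dict maps each listed value to its own replacement.
mapped = readings.replace({0.0: 10.0, 1.0: 100.0})

print(cleaned)
print(mapped)
```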

Understand NA Scalar to Denote Missing Values

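Finally, we'll look at pd.NA, the experimental scalar that pandas uses to denote missing values in its nullable data types such as Int64. In the Series below, None is stored as <NA> and the remaining values stay integers instead of being cast to float.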

s = pd.Series([1, 2, None], dtype="Int64")
s
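
The following sketch illustrates how pd.NA behaves in a few operations. It is meant as an illustration rather than a complete description: arithmetic with pd.NA propagates the missing value, comparisons on a nullable Series give a nullable boolean result, and logical "or" with True is True because the outcome does not depend on the unknown value.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

total = s.sum()             # missing value is skipped
plus_one = pd.NA + 1        # arithmetic propagates NA
greater = s > 1             # nullable boolean Series; last element is <NA>
either = pd.NA | True       # True regardless of the unknown operand

print(s.dtype, total)
print(plus_one)
print(greater)
print(either)
```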

Summary

In this lab, we have learned how to handle missing data using pandas. We have covered how to detect, insert, calculate with, and drop missing data, how to fill gaps by interpolation, and how to replace values. Lastly, we have discussed the experimental NA scalar that pandas uses to denote missing values. These techniques are useful in real-world data analysis, where missing data is common.
