Pandas Basic Data Cleaning


Introduction

Welcome to the Pandas Basic Data Cleaning lab. Data cleaning is a crucial first step in any data analysis or machine learning project. Real-world data is often messy, containing missing values, duplicate entries, or incorrect data types. Using raw, unclean data can lead to inaccurate analysis and unreliable conclusions.

Pandas is a powerful Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is the go-to tool for data cleaning and manipulation in Python.

In this lab, you will learn the fundamental techniques for cleaning a dataset using Pandas. You will practice:

  • Dropping rows with missing values using dropna().
  • Filling missing values with fillna().
  • Removing duplicate rows with drop_duplicates().
  • Renaming columns with rename().
  • Converting column data types with astype().

By the end of this lab, you will have a solid understanding of the basic data cleaning workflow in Pandas.

Drop rows with dropna method

In this step, you will learn how to handle missing data by dropping rows that contain null values. In Pandas, missing data is represented by NaN (Not a Number). One of the simplest strategies for dealing with NaN values is to remove the rows or columns that contain them.

The dropna() method allows you to do this easily. By default, it removes any row that contains at least one NaN value.

First, let's run the initial script to see our starting DataFrame. The setup script has already created a file named main.py in the ~/project directory.

Open a terminal in the WebIDE and run the following command:

python3 main.py

You should see the original DataFrame, which contains NaN values in the age and city columns.

Original DataFrame:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva   NaN       Boston         90000
6    Frank  45.0          NaN        100000

Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   name           7 non-null      object
 1   age            6 non-null      float64
 2   city           6 non-null      object
 3   SALARY_IN_USD  7 non-null      object
dtypes: float64(1), object(3)
memory usage: 352.0+ bytes
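The setup script itself is not shown in this lab, but if you want to reproduce the starting DataFrame outside the lab environment, a minimal sketch like the following (with values read off the printed output above) should behave the same way:

```python
# Sketch of a setup script that recreates the lab's starting DataFrame.
# The real main.py is not shown; this is an assumption based on the output.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Eva', 'Frank'],
    'age': [25.0, 30.0, 35.0, 40.0, 25.0, np.nan, 45.0],
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago',
             'New York', 'Boston', np.nan],
    # Salaries stored as strings, which is why info() reports Dtype 'object'
    'SALARY_IN_USD': ['50000', '60000', '70000', '80000',
                      '50000', '90000', '100000'],
})

print("Original DataFrame:")
print(df)
print("\nOriginal DataFrame Info:")
df.info()
```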

Now, let's use dropna() to see how it works. Open the main.py file from the file explorer on the left panel. Add the following code to the end of the file. This code will create a new DataFrame df_dropped with the NaN rows removed and print it. We are not modifying the original df yet, so we can explore other methods in the next steps.

# Add this to the end of main.py

print("\nDataFrame after dropping rows with any missing values:")
df_dropped = df.dropna()
print(df_dropped)

Save the file (Ctrl+S or Cmd+S) and run it again from the terminal:

python3 main.py

The output will now include a new section showing the DataFrame after the rows with NaN values (rows for Eva and Frank) have been removed.

# ... (previous output) ...

DataFrame after dropping rows with any missing values:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
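Beyond the default behavior, dropna() takes a few parameters worth knowing: how='all' drops a row only if every value is missing, subset restricts the check to certain columns, and axis=1 drops columns instead of rows. The following sketch on a small throwaway DataFrame (not the lab's df) shows the differences:

```python
# Demonstrating dropna() variants on a tiny example DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
})

# Default: drop a row if it contains *any* NaN -> keeps only row 0
print(df.dropna())

# how='all': drop a row only if *every* value is NaN -> keeps rows 0 and 1
print(df.dropna(how='all'))

# subset: only look for NaN in column 'b' -> keeps rows 0 and 1
print(df.dropna(subset=['b']))

# axis=1: drop *columns* that contain NaN instead of rows
print(df.dropna(axis=1))
```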

Fill missing values using fillna

In this step, you will learn another way to handle missing data: filling the missing values with a specific value using the fillna() method. Dropping rows with missing data can sometimes lead to significant data loss, especially if many rows have missing values. Filling them is often a better alternative.

You can fill NaN values with a constant, like 0 or "Unknown", or with a calculated value, like the mean or median of the column.

Let's modify our main.py file. Instead of just printing the dropped DataFrame, we will now clean our main DataFrame df by filling the missing values. We will fill the missing age with the mean of the existing ages and the missing city with the string 'Unknown'.

Remove the dropna() code you added in the last step and replace it with the following code. We use inplace=True to modify the DataFrame directly.

# Replace the dropna() code with this at the end of main.py

# Calculate the mean of the 'age' column
mean_age = df['age'].mean()

# Fill missing values using the recommended approach to avoid FutureWarnings
df.fillna({'age': mean_age, 'city': 'Unknown'}, inplace=True)

print("\nDataFrame after filling missing values:")
print(df)

The inplace=True argument modifies the DataFrame in place, meaning you don't need to assign the result back to a variable (e.g., df = df.fillna(...)).

Save the file and run it from the terminal:

python3 main.py

You will see that the NaN values have been replaced. Eva's age is now the mean of the other ages, and Frank's city is 'Unknown'.

# ... (original DataFrame output) ...

DataFrame after filling missing values:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva  33.3       Boston         90000
6    Frank  45.0      Unknown        100000

Note: The mean age is (25+30+35+40+25+45)/6 ≈ 33.33. Pandas fills the NaN with this value; the output above shows 33.3 for brevity, while your actual output will display Eva's age at full float precision (33.333333).
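Before choosing a fill strategy, it helps to count the missing values per column with isna().sum(). The sketch below, on a small throwaway DataFrame rather than the lab's df, also shows the assignment style that avoids inplace=True:

```python
# Inspecting missing values before filling, on a tiny example DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25.0, np.nan, 45.0],
    'city': ['Boston', 'Chicago', np.nan],
})

# Count missing values per column to decide how to fill them
print(df.isna().sum())  # age and city each have one NaN

# Equivalent fill without inplace=True: assign the result back
df = df.fillna({'age': df['age'].mean(), 'city': 'Unknown'})
print(df)
```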

Drop duplicate rows with drop_duplicates

In this step, you will learn how to remove duplicate rows from your DataFrame. Duplicate data can skew analysis and lead to incorrect results. The drop_duplicates() method helps identify and remove them.

In our original dataset, the row for 'Alice' appears twice with the exact same information. Now that we have handled missing values, our next cleaning task is to remove this duplicate entry.

By default, drop_duplicates() considers a row a duplicate if all of its column values are identical to another row's values.

Add the following code to the end of your main.py script. We will continue to use inplace=True to modify our DataFrame.

# Add this to the end of main.py

df.drop_duplicates(inplace=True)

print("\nDataFrame after dropping duplicates:")
print(df)

Save the file and execute it from the terminal:

python3 main.py

The output will now show a DataFrame with only 6 rows, as one of the 'Alice' rows has been removed.

# ... (previous output) ...

DataFrame after dropping duplicates:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
5      Eva  33.3       Boston         90000
6    Frank  45.0      Unknown        100000

Notice that the index 4 is now missing, which corresponds to the duplicate row that was dropped. The DataFrame now has 6 unique rows.
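drop_duplicates() also accepts a subset parameter, which treats rows as duplicates when only certain columns match, and a keep parameter ('first', 'last', or False) that controls which occurrence survives. A quick sketch on a throwaway DataFrame:

```python
# Demonstrating drop_duplicates() options on a tiny example DataFrame
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob'],
    'city': ['New York', 'Boston', 'Chicago'],
})

# Default: rows must match in *all* columns, so nothing is dropped here
print(df.drop_duplicates())

# subset: rows count as duplicates when only 'name' matches;
# keep='first' (the default) keeps the first occurrence
print(df.drop_duplicates(subset=['name']))

# keep='last' keeps the last occurrence instead
print(df.drop_duplicates(subset=['name'], keep='last'))
```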

Rename columns using rename method

In this step, you will learn how to rename columns. Consistent and clear column names are essential for code readability and maintainability. It's a common practice to use a consistent naming convention, such as all lowercase with underscores.

Our DataFrame has a column named SALARY_IN_USD. Let's rename it to a simpler, lowercase name: salary. The rename() method is perfect for this. You pass a dictionary to the columns argument where keys are the old names and values are the new names.

Add the following code to the end of your main.py script:

# Add this to the end of main.py

df.rename(columns={'SALARY_IN_USD': 'salary'}, inplace=True)

print("\nDataFrame after renaming columns:")
print(df)

Save the file and run it from the terminal:

python3 main.py

You will see that the SALARY_IN_USD column has been successfully renamed to salary.

# ... (previous output) ...

DataFrame after renaming columns:
      name   age         city   salary
0    Alice  25.0     New York    50000
1      Bob  30.0  Los Angeles    60000
2  Charlie  35.0     New York    70000
3    David  40.0      Chicago    80000
5      Eva  33.3       Boston    90000
6    Frank  45.0      Unknown   100000

This simple change makes the column name easier to type and follows a common Python style guide.
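When many columns need the same treatment, you can normalize them all at once by assigning to df.columns instead of listing each rename. A short sketch on a throwaway DataFrame with assumed messy names:

```python
# Normalizing all column names at once (example names are hypothetical)
import pandas as pd

df = pd.DataFrame({'First Name': ['Ada'], 'SALARY_IN_USD': [50000]})

# Lowercase every column name and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
print(df.columns.tolist())
```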

Convert column types with astype

In this final step, you will learn how to convert the data type of a column. Correct data types are crucial for performing calculations and for memory efficiency.

If you look at the initial df.info() output, you'll notice the SALARY_IN_USD column had a Dtype of object, which means it was storing the numbers as strings. We cannot perform mathematical operations (like calculating the average salary) on strings. We need to convert this column to a numeric type, such as int (integer).

The astype() method is used for this purpose. Let's convert our new salary column to the int type.

Add the following code to the end of main.py. We will also print the DataFrame's info again to confirm the change.

# Add this to the end of main.py

df['salary'] = df['salary'].astype(int)

print("\nDataFrame after converting data types:")
print(df)

print("\nFinal DataFrame Info:")
df.info()

Save the file and run it for the last time:

python3 main.py

The final output will show the cleaned DataFrame and its new info. Look closely at the Dtype for the salary column in the info output. It should now be int64, not object.

# ... (previous output) ...

DataFrame after converting data types:
      name   age         city  salary
0    Alice  25.0     New York   50000
1      Bob  30.0  Los Angeles   60000
2  Charlie  35.0     New York   70000
3    David  40.0      Chicago   80000
5      Eva  33.3       Boston   90000
6    Frank  45.0      Unknown  100000

Final DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    6 non-null      object
 1   age     6 non-null      float64
 2   city    6 non-null      object
 3   salary  6 non-null      int64
dtypes: float64(1), int64(1), object(2)
memory usage: 240.0+ bytes

With the salary column as an integer type, you can now perform calculations like df['salary'].mean().
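One caveat worth knowing: astype(int) raises a ValueError if any value in the column cannot be parsed as a number. When the data might contain stray strings, pd.to_numeric with errors='coerce' is a gentler alternative that turns unparseable values into NaN. A sketch with a made-up 'N/A' entry:

```python
# Safer numeric conversion when the column may contain bad values
import pandas as pd

s = pd.Series(['50000', '60000', 'N/A'])

# astype(int) would raise ValueError on 'N/A';
# errors='coerce' converts unparseable values to NaN instead
salary = pd.to_numeric(s, errors='coerce')
print(salary)

# Once the NaN is handled, an integer conversion is safe
salary = salary.fillna(0).astype(int)
print(salary.tolist())
```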

Summary

Congratulations on completing the Pandas Basic Data Cleaning lab! You have successfully transformed a messy, raw dataset into a clean, analysis-ready format.

In this lab, you have learned and practiced several essential data cleaning techniques in Pandas:

  • Handling Missing Values: You saw how to remove rows with NaN using dropna() and how to fill them with meaningful data using fillna().
  • Removing Duplicates: You used drop_duplicates() to eliminate redundant rows from your dataset.
  • Renaming Columns: You learned to make column names more consistent and readable with the rename() method.
  • Converting Data Types: You used astype() to change a column's data type to the correct format, enabling further analysis.

These fundamental skills are the building blocks for any serious data work. Mastering them will allow you to confidently tackle real-world data challenges. Keep practicing these techniques to become proficient in data manipulation with Pandas.