Introduction
Welcome to the Pandas Basic Data Cleaning lab. Data cleaning is a crucial first step in any data analysis or machine learning project. Real-world data is often messy, containing missing values, duplicate entries, or incorrect data types. Using raw, unclean data can lead to inaccurate analysis and unreliable conclusions.
Pandas is a powerful Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is the go-to tool for data cleaning and manipulation in Python.
In this lab, you will learn the fundamental techniques for cleaning a dataset using Pandas. You will practice:
- Dropping rows with missing values using dropna().
- Filling missing values with fillna().
- Removing duplicate rows with drop_duplicates().
- Renaming columns with rename().
- Converting column data types with astype().
By the end of this lab, you will have a solid understanding of the basic data cleaning workflow in Pandas.
Drop rows with dropna method
In this step, you will learn how to handle missing data by dropping rows that contain null values. In Pandas, missing data is represented by NaN (Not a Number). One of the simplest strategies for dealing with NaN values is to remove the rows or columns that contain them.
The dropna() method allows you to do this easily. By default, it removes any row that contains at least one NaN value.
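Before editing the lab script, here is a minimal, self-contained sketch of how dropna()'s `how` and `subset` parameters change what gets dropped. The tiny DataFrame below is illustrative only, not the lab's dataset:

```python
import pandas as pd
import numpy as np

# Toy data: one missing name, one missing age
df = pd.DataFrame({
    "name": ["Ann", None, "Cal"],
    "age": [25.0, 30.0, np.nan],
})

# Default: drop any row containing at least one NaN
print(len(df.dropna()))                  # 1 -- only Ann's row survives

# how='all': drop a row only if EVERY value in it is NaN
print(len(df.dropna(how="all")))         # 3 -- no row is entirely NaN

# subset: look for NaN only in the listed columns
print(len(df.dropna(subset=["name"])))   # 2 -- drops only the row missing a name
```

The default behavior (drop on any NaN) is exactly what the lab uses below; `how` and `subset` are useful when you want finer control.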
First, let's run the initial script to see our starting DataFrame. The setup script has already created a file named main.py in the ~/project directory.
Open a terminal in the WebIDE and run the following command:
python3 main.py
You should see the original DataFrame, which contains NaN values in the age and city columns.
Original DataFrame:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva NaN Boston 90000
6 Frank 45.0 NaN 100000
Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
#   Column         Non-Null Count  Dtype
--- ------ -------------- -----
0 name 7 non-null object
1 age 6 non-null float64
2 city 6 non-null object
3 SALARY_IN_USD 7 non-null object
dtypes: float64(1), object(3)
memory usage: 352.0+ bytes
Now, let's use dropna() to see how it works. Open the main.py file from the file explorer on the left panel. Add the following code to the end of the file. This code will create a new DataFrame df_dropped with the NaN rows removed and print it. We are not modifying the original df yet, so we can explore other methods in the next steps.
# Add this to the end of main.py
print("\nDataFrame after dropping rows with any missing values:")
df_dropped = df.dropna()
print(df_dropped)
Save the file (Ctrl+S or Cmd+S) and run it again from the terminal:
python3 main.py
The output will now include a new section showing the DataFrame after the rows with NaN values (rows for Eva and Frank) have been removed.
# ... (previous output) ...
DataFrame after dropping rows with any missing values:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
Fill missing values using fillna
In this step, you will learn another way to handle missing data: filling the missing values with a specific value using the fillna() method. Dropping rows with missing data can sometimes lead to significant data loss, especially if many rows have missing values. Filling them is often a better alternative.
You can fill NaN values with a constant, like 0 or "Unknown", or with a calculated value, like the mean or median of the column.
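As a quick illustration of these options (toy Series, not the lab's DataFrame), the same missing value can be filled with a constant or with a column statistic:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

# Constant fill
print(s.fillna(0).tolist())          # [10.0, 0.0, 30.0]

# Statistical fill: mean() skips NaN, so the mean of 10 and 30 is 20.0
print(s.fillna(s.mean()).tolist())   # [10.0, 20.0, 30.0]

# Median is often preferred when the column contains outliers
print(s.fillna(s.median()).tolist())
```

The lab below uses the mean-fill approach for the age column and a constant string for the city column.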
Let's modify our main.py file. Instead of just printing the dropped DataFrame, we will now clean our main DataFrame df by filling the missing values. We will fill the missing age with the mean of the existing ages and the missing city with the string 'Unknown'.
Remove the dropna() code you added in the last step and replace it with the following code. We use inplace=True to modify the DataFrame directly.
# Replace the dropna() code with this at the end of main.py
# Calculate the mean of the 'age' column
mean_age = df['age'].mean()
# Fill missing values using the recommended dict approach to avoid FutureWarnings
df.fillna({'age': mean_age, 'city': 'Unknown'}, inplace=True)
print("\nDataFrame after filling missing values:")
print(df)
The inplace=True argument modifies the DataFrame in place, meaning you don't need to assign the result back to a variable (e.g., df = df.fillna(...)).
Save the file and run it from the terminal:
python3 main.py
You will see that the NaN values have been replaced. Eva's age is now the mean of the other ages, and Frank's city is 'Unknown'.
# ... (original DataFrame output) ...
DataFrame after filling missing values:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva 33.3 Boston 90000
6 Frank 45.0 Unknown 100000
Note: The mean age is (25+30+35+40+25+45)/6 ≈ 33.33, and Pandas fills the NaN with this value. The sample output above rounds it to 33.3 for readability; your actual output will show the full-precision value (33.333333).
Drop duplicate rows with drop_duplicates
In this step, you will learn how to remove duplicate rows from your DataFrame. Duplicate data can skew analysis and lead to incorrect results. The drop_duplicates() method helps identify and remove them.
In our original dataset, the row for 'Alice' appears twice with the exact same information. Now that we have handled missing values, our next cleaning task is to remove this duplicate entry.
By default, drop_duplicates() considers a row a duplicate if all of its column values are identical to another row's values.
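A small illustrative sketch (toy data, not the lab's DataFrame) of how the `subset` and `keep` parameters change which rows count as duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Alice", "Alice"],
    "city": ["New York", "New York", "Boston"],
})

# Default: a duplicate must match in EVERY column; keep='first' retains row 0
print(len(df.drop_duplicates()))                  # 2 -- rows 0 and 2 remain

# subset: compare only the listed columns
print(len(df.drop_duplicates(subset=["name"])))   # 1 -- all three share a name

# keep='last' retains the final occurrence instead of the first
print(df.drop_duplicates(subset=["name"], keep="last").index.tolist())  # [2]
```

The lab uses the default (all columns, keep the first occurrence), which is what we want here since the two Alice rows are identical.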
Add the following code to the end of your main.py script. We will continue to use inplace=True to modify our DataFrame.
# Add this to the end of main.py
df.drop_duplicates(inplace=True)
print("\nDataFrame after dropping duplicates:")
print(df)
Save the file and execute it from the terminal:
python3 main.py
The output will now show a DataFrame with only 6 rows, as one of the 'Alice' rows has been removed.
# ... (previous output) ...
DataFrame after dropping duplicates:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
5 Eva 33.3 Boston 90000
6 Frank 45.0 Unknown 100000
Notice that the index 4 is now missing, which corresponds to the duplicate row that was dropped. The DataFrame now has 6 unique rows.
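If you want a contiguous index again after dropping rows, a common follow-up (optional for this lab) is reset_index(). A minimal sketch with toy data:

```python
import pandas as pd

# Dropping the duplicate leaves a gap in the index
df = pd.DataFrame({"x": [1, 1, 2]}).drop_duplicates()
print(df.index.tolist())   # [0, 2]

# drop=True discards the old index instead of keeping it as a new column
df = df.reset_index(drop=True)
print(df.index.tolist())   # [0, 1]
```

Without `drop=True`, the old index values would be preserved as a new column named `index`, which is rarely what you want after deduplication.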
Rename columns using rename method
In this step, you will learn how to rename columns. Consistent and clear column names are essential for code readability and maintainability. It's a common practice to use a consistent naming convention, such as all lowercase with underscores.
Our DataFrame has a column named SALARY_IN_USD. Let's rename it to a simpler, lowercase name: salary. The rename() method is perfect for this. You pass a dictionary to the columns argument where keys are the old names and values are the new names.
Add the following code to the end of your main.py script:
# Add this to the end of main.py
df.rename(columns={'SALARY_IN_USD': 'salary'}, inplace=True)
print("\nDataFrame after renaming columns:")
print(df)
Save the file and run it from the terminal:
python3 main.py
You will see that the SALARY_IN_USD column has been successfully renamed to salary.
# ... (previous output) ...
DataFrame after renaming columns:
name age city salary
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
5 Eva 33.3 Boston 90000
6 Frank 45.0 Unknown 100000
This simple change makes the column name easier to type and follows a common Python style guide.
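A related trick: to normalize every column name at once rather than mapping them one by one, you can assign a transformed copy of df.columns. This is a common convention, not something the lab requires (toy DataFrame below):

```python
import pandas as pd

df = pd.DataFrame({"SALARY_IN_USD": [50000], "Name": ["Alice"]})

# Lowercase all column names in one assignment
df.columns = df.columns.str.lower()
print(df.columns.tolist())   # ['salary_in_usd', 'name']
```

rename() is still the better choice when you only need to change a few names or when the new names are not a simple string transformation of the old ones.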
Convert column types with astype
In this final step, you will learn how to convert the data type of a column. Correct data types are crucial for performing calculations and for memory efficiency.
If you look at the initial df.info() output, you'll notice the SALARY_IN_USD column had a Dtype of object, which means it was storing the numbers as strings. We cannot perform mathematical operations (like calculating the average salary) on strings. We need to convert this column to a numeric type, such as int (integer).
The astype() method is used for this purpose. Let's convert our new salary column to the int type.
Add the following code to the end of main.py. We will also print the DataFrame's info again to confirm the change.
# Add this to the end of main.py
df['salary'] = df['salary'].astype(int)
print("\nDataFrame after converting data types:")
print(df)
print("\nFinal DataFrame Info:")
df.info()
Save the file and run it for the last time:
python3 main.py
The final output will show the cleaned DataFrame and its new info. Look closely at the Dtype for the salary column in the info output. It should now be int64, not object.
# ... (previous output) ...
DataFrame after converting data types:
name age city salary
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
5 Eva 33.3 Boston 90000
6 Frank 45.0 Unknown 100000
Final DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 6
Data columns (total 4 columns):
#   Column  Non-Null Count  Dtype
--- ------ -------------- -----
0 name 6 non-null object
1 age 6 non-null float64
2 city 6 non-null object
3 salary 6 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 240.0+ bytes
With the salary column as an integer type, you can now perform calculations like df['salary'].mean().
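One caveat worth knowing: astype(int) raises a ValueError if any value in the column cannot be parsed as a number. For messier columns, pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN. A toy example (not part of the lab script):

```python
import pandas as pd

s = pd.Series(["50000", "60000", "not available"])

# s.astype(int) would raise ValueError here because of the bad value.
# to_numeric with errors='coerce' degrades gracefully instead:
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned.tolist())   # [50000.0, 60000.0, nan]
print(cleaned.dtype)      # float64 -- the NaN forces a float dtype
```

In this lab, astype(int) is safe because every salary value is a clean numeric string; reach for to_numeric when you are less sure of your data.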
Summary
Congratulations on completing the Pandas Basic Data Cleaning lab! You have successfully transformed a messy, raw dataset into a clean, analysis-ready format.
In this lab, you have learned and practiced several essential data cleaning techniques in Pandas:
- Handling Missing Values: You saw how to remove rows with NaN using dropna() and how to fill them with meaningful data using fillna().
- Removing Duplicates: You used drop_duplicates() to eliminate redundant rows from your dataset.
- Renaming Columns: You learned to make column names more consistent and readable with the rename() method.
- Converting Data Types: You used astype() to change a column's data type to the correct format, enabling further analysis.
These fundamental skills are the building blocks for any serious data work. Mastering them will allow you to confidently tackle real-world data challenges. Keep practicing these techniques to become proficient in data manipulation with Pandas.