Introduction
In this project, you will learn how to clean and purify CSV data by removing incomplete, incorrect, and invalid data. The goal is to create a clean dataset from the raw data, which can be used for further analysis or processing.
🎯 Tasks
In this project, you will learn:
- How to set up the project environment and prepare the necessary files
- How to import the required libraries for data cleaning
- How to read and process the raw data, checking for various types of dirty data
- How to write the cleaned data to a new CSV file
🏆 Achievements
After completing this project, you will be able to:
- Use Python and its standard library to work with CSV data
- Apply techniques for validating and cleaning data, such as checking for missing values, invalid formats, and unrealistic data
- Implement a data cleaning process to create a high-quality dataset
- Generate a new CSV file with the cleaned data
Understanding the Data Format
In this step, you will review the format of the raw data before cleaning it.
- Navigate to the /home/labex/project directory.
- Inside the project directory, you should find a raw_data.csv file. This file contains the raw data that needs to be cleaned.
- Open the raw_data.csv file to see all the columns. The correct format for each column is:
  - Name Column: At least one character long (not empty).
  - Gender Column: Either 'F' or 'M'.
  - Birth Date Column: Formatted as %Y-%m-%d.
  - Email Column: Conforms to username@domain.com.
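To make these rules concrete, here is a minimal sketch that parses a few made-up rows from an in-memory string. The sample rows are hypothetical, and the header names (name, gender, birth date, mail) are assumed from the script used later in this project, so check them against your actual raw_data.csv.

```python
import csv
import io

# Hypothetical sample rows; the real raw_data.csv will differ.
SAMPLE = """name,gender,birth date,mail
Alice Smith,F,1990-04-12,alice.smith@example.com
,M,1985-01-30,bob@example.com
Carol Jones,X,1979-13-01,carol(at)example.com
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE)
for row in rows:
    print(row)
```

Only the first sample row follows every rule. The second has an empty name, and the third has an invalid gender, an impossible month, and a malformed email address.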
Import Necessary Libraries
In this step, you will import the required libraries for the data cleaning process.
- Open the data_clean.py file in a text editor.
- Add the following code at the beginning of the file:
import csv
import re
from datetime import datetime
These standard-library modules are used for reading and writing CSV files, matching regular expressions, and parsing dates.
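As a quick illustration of what re and datetime contribute, the sketch below wraps each in a small helper. The function names are hypothetical, and the email pattern here is a deliberately simplified stand-in for illustration, not the pattern used in the project script.

```python
import re
from datetime import datetime


def parses_as_date(text):
    """Return True if text is a real calendar date in %Y-%m-%d form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # strptime raises ValueError for a wrong layout or an impossible date
        return False
    return True


# Simplified pattern for illustration only (an assumption, not the project's regex)
SIMPLE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text):
    """Return True if the whole string has the rough shape user@host.tld."""
    return SIMPLE_EMAIL.fullmatch(text) is not None
```

Note that strptime rejects impossible dates such as 1990-02-30 as well as strings in the wrong layout, which is why the script can rely on catching ValueError.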
Initialize the Cleaned Data List
In this step, you will create an empty list to store the cleaned data.
- In the data_clean.py file, add the following code below the imports:
## Initialize an empty list to store cleaned data
clean_data = []
This list will be used to store the cleaned data rows.
Read and Process the Raw Data
In this step, you will read the raw data from the raw_data.csv file, process each row, and add the valid rows to the clean_data list.
- In the data_clean.py file, add the following code below the clean_data list initialization:
## Open and read the raw data CSV file
with open("raw_data.csv", "r") as f:
    reader = csv.DictReader(f)  ## Use DictReader for easy access to columns by name
    for row in reader:
        ## Extract relevant fields from each row
        name = row["name"]
        sex = row["gender"]
        date = row["birth date"]
        mail = row["mail"]
        ## Check if the name field is empty and skip the row if it is
        if len(name) < 1:
            continue
        ## Check if the gender field is valid (either 'M' or 'F') and skip the row if not
        if sex not in ["M", "F"]:
            continue
        ## Attempt to parse the birth date and calculate age; skip the row if parsing fails
        try:
            date = datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            continue
        age = datetime.now().year - date.year
        ## Skip the row if the calculated age is unrealistic (less than 0 or more than 200)
        if age < 0 or age > 200:
            continue
        ## Define a regex pattern for validating email addresses
        r = r"^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}$"
        ## Check if the email field matches the regex pattern and skip the row if it doesn't
        if not re.match(r, mail):
            continue
        ## If all checks pass, append the row to the cleaned data list
        clean_data.append(row)
This code reads raw_data.csv row by row and keeps a row only if the name is not empty, the gender is 'M' or 'F', the birth date parses as %Y-%m-%d and gives an age between 0 and 200, and the email matches the regex pattern.
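One possible way to make these checks easier to test is to move them into a single function. The sketch below is an alternative arrangement, not part of the project script: the name is_valid_row and the today parameter are hypothetical, while the checks and the regex mirror the loop above.

```python
import re
from datetime import date, datetime

# Same pattern as in the script above, compiled once instead of on every row
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}$"
)


def is_valid_row(row, today=None):
    """Return True if the row passes the same four checks as the loop above."""
    today = today or date.today()
    if len(row["name"]) < 1:
        return False
    if row["gender"] not in ("M", "F"):
        return False
    try:
        born = datetime.strptime(row["birth date"], "%Y-%m-%d")
    except ValueError:
        return False
    # Year difference only, as in the script; it ignores month and day
    age = today.year - born.year
    if age < 0 or age > 200:
        return False
    return EMAIL_RE.match(row["mail"]) is not None
```

Passing today explicitly makes the age check reproducible in tests. Also note that because the dotted groups in the pattern may repeat zero times, an address with no dot after the @ (such as user@localhost) is accepted.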
Write the Cleaned Data to a New CSV File
In this step, you will write the cleaned data from the clean_data list to a new CSV file named clean_data.csv.
- In the data_clean.py file, add the following code below the data processing section:
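## Write the cleaned data to a new CSV file
## Take the column names from the first cleaned row; if no row passed the checks,
## fall back to the columns read above so the header can still be written
fieldnames = list(clean_data[0].keys()) if clean_data else ["name", "gender", "birth date", "mail"]
with open("clean_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)  ## DictWriter to write using column names
    writer.writeheader()  ## Write the header row
    writer.writerows(clean_data)  ## Write all the cleaned rows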
This code creates a new CSV file named clean_data.csv and writes the cleaned data from the clean_data list to it.
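For comparison, here is a minimal sketch of a streaming variant that takes the column names from the reader's header instead of from a data row, so the header is written even when no row is kept. The function name filter_csv and the keep callback are hypothetical, and the sketch assumes the input has a header row.

```python
import csv
import io


def filter_csv(src, dst, keep):
    """Copy rows for which keep(row) is true from src to dst, header included."""
    reader = csv.DictReader(src)
    # reader.fieldnames is read from the header row (None if the input is empty)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
    writer.writeheader()
    kept = 0
    for row in reader:
        if keep(row):
            writer.writerow(row)
            kept += 1
    return kept


# Usage with in-memory files and made-up data
source = io.StringIO("name,gender\nAlice,F\n,M\n")
target = io.StringIO()
kept_rows = filter_csv(source, target, lambda row: row["name"] != "")
print(kept_rows)
print(target.getvalue())
```

With real files, open both with newline="" as the csv module documentation recommends, so that line endings are handled by the csv module itself.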
Run the Data Cleaning Script
In this final step, you will run the data_clean.py script to generate the clean_data.csv file.
- Save the data_clean.py file.
- In the terminal, navigate to the /home/labex/project directory if you haven't already.
- Run the following command to execute the data cleaning script:
python data_clean.py
After running the script, you should find a new clean_data.csv file in the /home/labex/project directory, containing the cleaned data.
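As an optional sanity check, you could count the data rows in each file to see how many were removed. The helper below is a hypothetical sketch and is not part of data_clean.py.

```python
import csv


def count_data_rows(path):
    """Count the data rows (excluding the header) in a CSV file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))
```

For example, calling count_data_rows("raw_data.csv") and count_data_rows("clean_data.csv") from the project directory and comparing the two numbers shows how many rows the cleaning step dropped.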
Congratulations! You have successfully completed the CSV data purification project.
Summary
In this project, you read raw CSV data, validated the name, gender, birth date, and email of each row, and wrote the valid rows to a new CSV file. You can practice more labs in LabEx to improve your skills.



