Data Cleaning and Purification | Python CSV Tutorial

Introduction

In this project, you will learn how to clean and purify CSV data by removing incomplete, incorrect, and invalid data. The goal is to create a clean dataset from the raw data, which can be used for further analysis or processing.

🎯 Tasks

In this project, you will learn:

How to set up the project environment and prepare the necessary files
How to import the required libraries for data cleaning
How to read and process the raw data, checking for various types of dirty data
How to write the cleaned data to a new CSV file

🏆 Achievements

After completing this project, you will be able to:

Use Python and its standard library to work with CSV data
Apply techniques for validating and cleaning data, such as checking for missing values, invalid formats, and unrealistic data
Implement a data cleaning process to create a high-quality dataset
Generate a new CSV file with the cleaned data

Understanding the Data Format

In this step, you will understand the data before the data cleaning process.

Navigate to the /home/labex/project directory.
Inside the project directory, you should find a raw_data.csv file. This file contains the raw data that needs to be cleaned.
Open the raw_data.csv file, you can see all the columns, and its correct format should be:
- Name Column: Over 1 word long.
- Gender Column: Expect 'F' or 'M'.
- Birth Date Column: Formatted as %Y-%m-%d.
- Email Column: Conforms to username@domain.com.

Import Necessary Libraries

In this step, you will import the required libraries for the data cleaning process.

Open the data_clean.py file in a text editor.
Add the following code at the beginning of the file:

import csv
import re
from datetime import datetime

These libraries will be used for working with CSV files, regular expressions, and date/time operations.

Initialize the Cleaned Data List

In this step, you will create an empty list to store the cleaned data.

In the data_clean.py file, add the following code below the imports:

## Initialize an empty list to store cleaned data
clean_data = []

This list will be used to store the cleaned data rows.

Read and Process the Raw Data

In this step, you will read the raw data from the raw_data.csv file, process each row, and add the valid rows to the clean_data list.

In the data_clean.py file, add the following code below the clean_data list initialization:

## Open and read the raw data CSV file
with open("raw_data.csv", "r") as f:
    reader = csv.DictReader(f)  ## Use DictReader for easy access to columns by name
    for row in reader:
        ## Extract relevant fields from each row
        name = row["name"]
        sex = row["gender"]
        date = row["birth date"]
        mail = row["mail"]

        ## Check if the name field is empty and skip the row if it is
        if len(name) < 1:
            continue

        ## Check if the gender field is valid (either 'M' or 'F') and skip the row if not
        if sex not in ["M", "F"]:
            continue

        ## Attempt to parse the birth date and calculate age; skip the row if parsing fails
        try:
            date = datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            continue
        age = datetime.now().year - date.year
        ## Skip the row if the calculated age is unrealistic (less than 0 or more than 200)
        if age < 0 or age > 200:
            continue

        ## Define a regex pattern for validating email addresses
        r = r"^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+){0,4}$"
        ## Check if the email field matches the regex pattern and skip the row if it doesn't
        if not re.match(r, mail):
            continue

        ## If all checks pass, append the row to the cleaned data list
        clean_data.append(row)

This code reads the raw data from the raw_data.csv file, processes each row, and adds the valid rows to the clean_data list.

Write the Cleaned Data to a New Csv File

In this step, you will write the cleaned data from the clean_data list to a new CSV file named clean_data.csv.

In the data_clean.py file, add the following code below the data processing section:

## Write the cleaned data to a new CSV file
with open("clean_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=row.keys()
    )  ## DictWriter to write using column names
    writer.writeheader()  ## Write the header row
    writer.writerows(clean_data)  ## Write all the cleaned rows

This code creates a new CSV file named clean_data.csv and writes the cleaned data from the clean_data list to it.

Run the Data Cleaning Script

In this final step, you will run the data_clean.py script to generate the clean_data.csv file.

Save the data_clean.py file.
In the terminal, navigate to the /home/labex/project directory if you haven't already.
Run the following command to execute the data cleaning script:

python data_clean.py

After running the script, you should find a new clean_data.csv file in the /home/labex/project directory, containing the cleaned data.

Congratulations! You have successfully completed the CSV data purification project.

Summary

Congratulations! You have completed this project. You can practice more labs in LabEx to improve your skills.

Data Cleaning and Purification with Python