Strategies for Cleaning
Data Cleaning Approaches
graph TD
A[Data Cleaning Strategies] --> B[Deletion]
A --> C[Imputation]
A --> D[Advanced Techniques]
Deletion Methods
Listwise Deletion
import pandas as pd
import numpy as np
## Create sample DataFrame
df = pd.DataFrame({
'age': [25, 30, np.nan, 35, None],
'salary': [50000, 60000, 75000, None, 80000]
})
## Remove rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)
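Listwise deletion can discard a large share of a small dataset, so it helps to measure the loss before committing to it. The sketch below rebuilds the sample DataFrame from above and uses the `subset` and `how` parameters of `dropna` to narrow the rule; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the listwise deletion example
df = pd.DataFrame({
    "age": [25, 30, np.nan, 35, None],
    "salary": [50000, 60000, 75000, None, 80000],
})

# Fraction of rows that plain listwise deletion would discard
rows_lost = 1 - len(df.dropna()) / len(df)

# Drop a row only when 'salary' is missing; a missing 'age' is tolerated
salary_known = df.dropna(subset=["salary"])

# Drop a row only when every value in it is missing
not_empty = df.dropna(how="all")

print(f"Listwise deletion would drop {rows_lost:.0%} of rows")
print(salary_known)
```

Restricting the check to the columns an analysis actually needs usually keeps more rows than a blanket `dropna()`.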
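Column-wise Deletion
## Drop every column that contains at least one missing value.
## (This is column-wise deletion; true pairwise deletion keeps all rows and
## uses whichever values are available for each individual calculation.)
## In the sample DataFrame both columns contain gaps, so no columns remain.
cleaned_columns = df.dropna(axis=1)
print(cleaned_columns)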
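Pairwise deletion in the statistical sense removes nothing from the DataFrame. Each calculation simply uses the rows where the values it needs are present. A minimal sketch, assuming the same sample data as above: pandas column statistics skip missing values by default, and `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 35, None],
    "salary": [50000, 60000, 75000, None, 80000],
})

# Each column statistic uses all observed values of that column
age_mean = df["age"].mean()        # based on the 3 observed ages
salary_mean = df["salary"].mean()  # based on the 4 observed salaries

# Correlations are computed per column pair, using rows where both are present
pairwise_corr = df.corr()

# Number of rows in which both 'age' and 'salary' are observed
n_pairs = int(df[["age", "salary"]].notna().all(axis=1).sum())

print(age_mean, salary_mean, n_pairs)
```

Because every statistic may rest on a different number of rows, it is worth reporting the sample size alongside each result. Here only two rows have both values, which is too few for a meaningful correlation.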
Imputation Techniques
| Method | Description | Use Case |
| --- | --- | --- |
| Mean Imputation | Replace with column mean | Numeric columns |
| Median Imputation | Replace with column median | Skewed distributions |
| Mode Imputation | Replace with most frequent value | Categorical data |
Practical Imputation Example
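## Mean imputation
## (assign the result back: recent pandas versions warn about calling
## fillna(..., inplace=True) on a selected column, and it may not update df)
df['salary'] = df['salary'].fillna(df['salary'].mean())
## Median imputation
df['age'] = df['age'].fillna(df['age'].median())
## Mode imputation for categorical data
## ('department' is not in the sample DataFrame; a hypothetical column is added for illustration)
df['department'] = ['HR', 'IT', None, 'IT', 'IT']
df['department'] = df['department'].fillna(df['department'].mode()[0])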
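To see why the median is the usual choice for skewed columns, it can help to compare both fill values on data with one extreme observation. The sketch below uses a made-up salary series (the figures are hypothetical, not taken from the sample DataFrame).

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value skews the distribution
salary = pd.Series([40000, 42000, 45000, np.nan, 1_000_000], name="salary")

mean_fill = salary.mean()      # pulled upward by the outlier
median_fill = salary.median()  # stays close to the typical salary

# fillna returns a new Series; the original is left unchanged
filled_mean = salary.fillna(mean_fill)
filled_median = salary.fillna(median_fill)

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
```

With these numbers the mean fill is 281,750 while the median fill is 43,500, so the mean would insert a value that resembles none of the typical salaries.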
Advanced Cleaning Techniques
Machine Learning Imputation
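## IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
## Both imputers below expect numeric input, so select the numeric columns
numeric_cols = ['age', 'salary']
## Simple imputation (fit_transform returns a NumPy array, so rebuild the DataFrame)
simple_imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(simple_imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index)
## Advanced iterative imputation: each column is modelled from the other columns
iterative_imputer = IterativeImputer(random_state=0)
advanced_imputed = pd.DataFrame(iterative_imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index)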
Handling Different Data Types
graph TD
A[Imputation Strategy] --> B[Numeric Data]
A --> C[Categorical Data]
A --> D[Time Series Data]
Specialized Imputation Approaches
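- Use domain knowledge to decide whether a gap should be filled at all and which values are plausible
- Consider the data distribution, for example preferring the median over the mean for skewed columns
- Validate imputation results by comparing summary statistics before and after filling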
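The three data types in the diagram above can each be given their own strategy. The following is a minimal sketch using pandas only; the DataFrame, its column names, and the choice of linear interpolation for the time series column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records; column names are illustrative only
df_mixed = pd.DataFrame(
    {
        "temperature": [20.0, np.nan, 24.0, np.nan, 28.0],  # time series reading
        "salary": [50000, 60000, np.nan, 75000, 80000],     # numeric
        "department": ["IT", "HR", None, "IT", "IT"],       # categorical
    },
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

filled = df_mixed.copy()

# Numeric data: fill with the column median
filled["salary"] = filled["salary"].fillna(filled["salary"].median())

# Categorical data: fill with the most frequent value
filled["department"] = filled["department"].fillna(filled["department"].mode()[0])

# Time series data: interpolate linearly between neighbouring observations
filled["temperature"] = filled["temperature"].interpolate(method="linear")

print(filled)
```

Interpolation assumes the series changes smoothly between observations; for values that stay constant until the next reading, a forward fill with `ffill()` may be a better fit.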
Best Practices
- Understand data context
- Choose appropriate imputation method
- Validate cleaned data
- Document cleaning process
At LabEx, we emphasize a systematic approach to data cleaning that balances statistical rigor with practical considerations.
Final Validation
def validate_cleaning(original_df, cleaned_df):
    print("Original Missing Values:", original_df.isnull().sum())
    print("Cleaned Missing Values:", cleaned_df.isnull().sum())
    return cleaned_df
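Counting missing values confirms that the gaps are gone, but not that the filled values are sensible. One possible extension, sketched below, also compares column means before and after cleaning. The helper name `compare_before_after` and the median-based cleaning step are hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

def compare_before_after(original_df, cleaned_df):
    """Return per-column missing counts and means before and after cleaning."""
    return pd.DataFrame({
        "missing_before": original_df.isnull().sum(),
        "missing_after": cleaned_df.isnull().sum(),
        "mean_before": original_df.mean(numeric_only=True),
        "mean_after": cleaned_df.mean(numeric_only=True),
    })

original = pd.DataFrame({
    "age": [25, 30, np.nan, 35, None],
    "salary": [50000, 60000, 75000, None, 80000],
})

# Illustrative cleaning step: fill each column with its own median
cleaned = original.fillna(original.median())

report = compare_before_after(original, cleaned)
print(report)
```

A large shift between `mean_before` and `mean_after` is a signal to revisit the chosen imputation method.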
By applying these strategies, data scientists can effectively handle missing data while maintaining data integrity and analytical accuracy.