## Practical Examples

### Real-World Uniquification Scenarios
#### 1. Log File Deduplication

```bash
# Remove duplicate log entries
cat system.log | sort | uniq > clean_system.log

# Count unique error messages
grep "ERROR" system.log | sort | uniq -c
```
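When the duplicate counts are not needed, the same deduplication can be done in a single step with sort's standard `-u` flag; the sketch below assumes the same `system.log` file as above.

```bash
# Deduplicate in one step with sort -u
sort -u system.log > clean_system.log
```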
#### 2. IP Address Tracking

```bash
# Extract unique IP addresses from access log
cat access.log | awk '{print $1}' | sort | uniq > unique_ips.txt

# Count IP address occurrences
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr
```
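Building on the counted output, a short awk filter can keep only the busiest clients; the threshold of 100 requests is an arbitrary example value.

```bash
# Show only IP addresses with more than 100 requests
# (uniq -c prefixes each line with its count)
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | awk '$1 > 100'
```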
### Uniquification Workflow

```mermaid
graph TD
    A[Raw Data Source] --> B[Stream Processing]
    B --> C{Duplicate Check}
    C -->|Duplicate| D[Remove]
    C -->|Unique| E[Preserve]
    D --> F[Cleaned Stream]
    E --> F
```
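One way to implement this check-and-preserve flow in a single pass, without sorting first, is awk's associative array; `raw_data.txt` and `cleaned_stream.txt` are placeholder file names.

```bash
# seen[$0] acts as the duplicate check: only the first occurrence of
# each line reaches the cleaned stream, and input order is preserved
awk '!seen[$0]++' raw_data.txt > cleaned_stream.txt
```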
#### 3. DNS Resolver Cleanup

```bash
# Remove duplicate DNS entries
cat /etc/resolv.conf | grep "nameserver" | sort | uniq > clean_resolv.conf
```
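If the resolver file also contains comment, search, or options lines, a hedged awk variant can deduplicate only the nameserver entries while passing everything else through unchanged; the output path is a placeholder.

```bash
# Deduplicate nameserver lines only; leave all other lines untouched
awk '!/^nameserver/ || !seen[$0]++' /etc/resolv.conf > /tmp/resolv.conf.clean
```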
| Scenario          | Method      | Processing Time | Memory Usage |
| ----------------- | ----------- | --------------- | ------------ |
| Small Files       | sort + uniq | Fast            | Low          |
| Large Logs        | awk         | Very Fast       | Moderate     |
| Complex Filtering | sed         | Slow            | High         |
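To make the comparison concrete, the sketch below applies each method from the table to the same placeholder file. Note that the sed one-liner, like `uniq`, only collapses consecutive duplicates, so it is usually combined with `sort` for full deduplication.

```bash
# Three ways to deduplicate the same stream, matching the rows above
sort sample.txt | uniq                       # small files: simple and predictable
awk '!seen[$0]++' sample.txt                 # large logs: one pass, keeps input order
sed '$!N; /^\(.*\)\n\1$/!P; D' sample.txt    # removes consecutive duplicates only
```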
#### 4. Data Deduplication in CSV

```bash
# Remove duplicate lines in a CSV while preserving the header
(head -n 1 data.csv && tail -n +2 data.csv | sort | uniq) > unique_data.csv
```
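If the original row order matters, an awk sketch can keep the header and the first occurrence of every data row in one pass; `data.csv` is the same placeholder file.

```bash
# Keep the header (NR == 1) plus the first occurrence of each row,
# without reordering the data
awk 'NR == 1 || !seen[$0]++' data.csv > unique_data.csv
```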
### Advanced Techniques

#### Case-Insensitive Uniquification

```bash
# Remove duplicates regardless of case
cat names.txt | tr '[:upper:]' '[:lower:]' | sort | uniq
```
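The tr approach lowercases the output. If the original capitalization should survive, the standard `-f` (sort) and `-i` (uniq) flags fold case only during comparison.

```bash
# Compare case-insensitively but keep each line's original capitalization
sort -f names.txt | uniq -i
```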
#### Partial Matching Uniquification

```bash
# Keep only the first line seen for each distinct value in column 3
awk '!seen[$3]++' data.txt
```
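The same idiom extends to delimited data by setting a field separator; the column number and file name below are illustrative.

```bash
# Deduplicate a CSV on its second column (-F',' sets the field separator)
awk -F',' '!seen[$2]++' records.csv
```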
### Best Practices

At LabEx, we recommend:

- Choose the right tool for your data: `sort | uniq` for simple streams, `awk` for field-based or order-preserving deduplication
- Consider stream size and complexity before committing to an approach
- Test performance with representative sample datasets (see the timing sketch below)
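A quick way to follow the last recommendation is to time the candidate pipelines against the same sample file; `sample.txt` is a placeholder.

```bash
# Compare wall-clock time of two deduplication approaches on the same input
time sort sample.txt | uniq > /dev/null
time awk '!seen[$0]++' sample.txt > /dev/null
```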
### Error Handling

```bash
# With pipefail, the pipeline fails if either sort or uniq fails
set -o pipefail
sort input.txt | uniq || echo "Uniquification failed"
```
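For use inside a script, a slightly fuller sketch combines pipefail with an upfront existence check; the file names are placeholders.

```bash
#!/bin/bash
# Fail the pipeline if any stage exits non-zero
set -o pipefail

input="input.txt"
output="unique_output.txt"

# Verify the input exists before processing
if [ ! -f "$input" ]; then
  echo "Input file not found: $input" >&2
  exit 1
fi

if ! sort "$input" | uniq > "$output"; then
  echo "Uniquification failed" >&2
  exit 1
fi
```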
### Conclusion

Effective text stream uniquification requires:

- Understanding your data
- Selecting appropriate tools
- Implementing efficient processing strategies