Introduction
In this lab, we will explore the uniq command in Linux, a powerful tool for identifying and filtering duplicate lines in text files. We'll use a scenario where you're a data analyst at an e-commerce company, tasked with analyzing customer purchase data. The uniq command will help you process this data efficiently, providing valuable insights into customer behavior.
Examining the Raw Customer Data
Let's begin by examining our raw customer purchase data. This data represents daily purchases made by customers.
First, we need to navigate to the project directory. In Linux, we use the cd command to change directories. The tilde (~) is a shortcut that represents your home directory.
cd ~/project
This command changes our current working directory to /home/labex/project. Now that we're in the correct directory, let's view the contents of our customer data file. We'll use the cat command, which is short for "concatenate". It's commonly used to display the contents of files.
cat customer_purchases.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
This file contains customer names and their purchases, with some customers making multiple purchases. Each line represents a single purchase, with the customer's name followed by the category of item they purchased, separated by a comma.
Sorting the Data
Before we can use the uniq command effectively, we need to sort our data. The uniq command works on adjacent duplicate lines, so sorting ensures that any duplicate entries are next to each other.
We'll use the sort command to alphabetically sort our customer data:
sort customer_purchases.txt > sorted_purchases.txt
Let's break down this command:
- sort is the command to sort lines of text.
- customer_purchases.txt is the input file we're sorting.
- > is a redirection operator. It takes the output of the command on its left and writes it to the file on its right.
- sorted_purchases.txt is the new file where we're saving the sorted data.
Now, let's view the contents of the sorted file:
cat sorted_purchases.txt
You should see output similar to this:
Alice,Electronics
Alice,Electronics
Alice,Electronics
Bob,Books
Bob,Books
Charlie,Clothing
Charlie,Clothing
David,Home Goods
Eve,Toys
Frank,Sports
Notice how the entries are now sorted alphabetically by customer name. This alphabetical sorting brings all purchases by the same customer together, which is crucial for the next steps.
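To see why sorting matters, you can try running uniq on the unsorted file first. A minimal sketch (it recreates the sample data so it can run anywhere):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# uniq only collapses ADJACENT duplicate lines. In the unsorted
# file no duplicates sit next to each other, so nothing is removed:
uniq customer_purchases.txt | wc -l    # prints 10

# Sorting first makes the duplicates adjacent, so uniq can remove them:
sort customer_purchases.txt | uniq | wc -l    # prints 6
```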
Using uniq to Remove Duplicate Entries
Now that our data is sorted, we can use the uniq command to remove duplicate entries. This will give us a list of unique customer purchases.
Run the following command:
uniq sorted_purchases.txt unique_purchases.txt
Let's break down this command:
- uniq is the command to filter out repeated lines in a file.
- sorted_purchases.txt is our input file (the sorted data).
- unique_purchases.txt is the output file where we're saving the results.
The uniq command reads the sorted data from sorted_purchases.txt, removes adjacent duplicate lines, and saves the result in a new file called unique_purchases.txt.
Now, let's view the contents of the new file:
cat unique_purchases.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
David,Home Goods
Eve,Toys
Frank,Sports
Now we have a list of unique customer purchases, with each customer appearing only once. This gives us a clear view of the different types of purchases made, without repetition.
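The intermediate sorted file is optional: sort and uniq are commonly chained with a pipe. A sketch of two equivalent approaches (the sample data is recreated so the commands run standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# Pipe sort straight into uniq instead of using a temporary file:
sort customer_purchases.txt | uniq > unique_purchases.txt

# sort -u sorts and de-duplicates in a single step:
sort -u customer_purchases.txt
```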
Counting Purchases with uniq -c
The uniq command becomes even more powerful when we use its options. Let's use the -c option to count how many times each customer made a purchase.
Run the following command:
uniq -c sorted_purchases.txt purchase_counts.txt
Let's break down this command:
- uniq is our command to filter repeated lines.
- -c is an option that tells uniq to prefix lines with the number of occurrences.
- sorted_purchases.txt is our input file.
- purchase_counts.txt is the output file where we're saving the results.
This command counts the number of occurrences of each unique line and saves the result in purchase_counts.txt.
Now, let's view the contents of this new file:
cat purchase_counts.txt
You should see output similar to this:
3 Alice,Electronics
2 Bob,Books
2 Charlie,Clothing
1 David,Home Goods
1 Eve,Toys
1 Frank,Sports
The number at the beginning of each line indicates how many times that line appears in the file. For example, Alice made 3 purchases of Electronics, while Frank made 1 purchase of Sports items. Strictly speaking, uniq -c counts identical lines (customer and category pairs); because each customer here buys only one category, the count also works as a per-customer purchase count.
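A common follow-up is to rank these counts. A sketch that pipes uniq -c into a second, numeric sort so the most frequent purchases come first (sample data recreated so the command runs standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# Count duplicates, then sort the counts numerically (-n) in
# reverse order (-r) to put the biggest counts at the top:
sort customer_purchases.txt | uniq -c | sort -rn
```

The first line of the output is the most frequent entry, 3 Alice,Electronics.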
Finding Repeat Customers with uniq -d
As a data analyst, you might be interested in identifying repeat customers. We can use the -d option of the uniq command to display one copy of each duplicated line, which represents the customers who made multiple purchases.
Run the following command:
uniq -d sorted_purchases.txt repeat_customers.txt
Let's break down this command:
- uniq is our command to filter repeated lines.
- -d is an option that tells uniq to only print duplicate lines.
- sorted_purchases.txt is our input file.
- repeat_customers.txt is the output file where we're saving the results.
This command identifies duplicate lines in sorted_purchases.txt and saves them to repeat_customers.txt.
Let's view the contents of this new file:
cat repeat_customers.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
These are the customers who made more than one purchase. This information could be valuable for customer loyalty programs or targeted marketing campaigns.
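If you only need the names for, say, a mailing list, you can strip the category with cut, which here keeps the first comma-separated field. A standalone sketch (sample data recreated):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# uniq -d prints one copy of each duplicated line; cut keeps only
# the customer name (field 1, with the comma as the delimiter):
sort customer_purchases.txt | uniq -d | cut -d',' -f1
# prints:
# Alice
# Bob
# Charlie
```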
Summary
In this lab, we explored the uniq command in Linux and its application in analyzing customer purchase data. We learned how to:
- Prepare data for use with uniq by sorting it first.
- Use uniq to remove duplicate entries from a sorted file.
- Use uniq -c to count occurrences of each unique line.
- Use uniq -d to identify duplicate lines.
These skills are valuable for data analysis tasks, helping you to efficiently process and extract insights from large datasets.
Additional uniq command options not covered in this lab include:
- -u: Display only unique lines (lines that appear exactly once)
- -i: Ignore case when comparing lines
- -f N: Skip the first N fields when comparing lines
- -s N: Skip the first N characters when comparing lines
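As a quick taste of the first two options, a sketch using the same sample data (recreated so it runs standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# -u keeps only lines that occur exactly once: the one-time customers
sort customer_purchases.txt | uniq -u
# prints:
# David,Home Goods
# Eve,Toys
# Frank,Sports

# -i ignores case, so these two adjacent lines count as duplicates:
printf 'alice,Electronics\nAlice,Electronics\n' | uniq -i
# prints a single line: alice,Electronics
```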