Introduction
In this lab, we will explore the uniq command in Linux, a powerful tool for identifying and filtering duplicate lines in text files. We'll use a scenario where you're a data analyst at an e-commerce company, tasked with analyzing customer purchase data. The uniq command will help you process this data efficiently, providing valuable insights into customer behavior.
Examining the Raw Customer Data
Let's begin by examining our raw customer purchase data. This data represents daily purchases made by customers.
First, we need to navigate to the project directory. In Linux, we use the cd command to change directories. The tilde (~) is a shortcut that represents your home directory.
cd ~/project
This command changes our current working directory to /home/labex/project. Now that we're in the correct directory, let's view the contents of our customer data file. We'll use the cat command, which is short for "concatenate". It's commonly used to display the contents of files.
cat customer_purchases.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
This file contains customer names and their purchases, with some customers making multiple purchases. Each line represents a single purchase, with the customer's name followed by the category of item they purchased, separated by a comma.
Sorting the Data
Before we can use the uniq command effectively, we need to sort our data. The uniq command works on adjacent duplicate lines, so sorting ensures that any duplicate entries are next to each other.
We'll use the sort command to alphabetically sort our customer data:
sort customer_purchases.txt > sorted_purchases.txt
Let's break down this command:
- sort is the command to sort lines of text.
- customer_purchases.txt is the input file we're sorting.
- > is a redirection operator. It takes the output of the command on its left and writes it to the file on its right.
- sorted_purchases.txt is the new file where we're saving the sorted data.
Now, let's view the contents of the sorted file:
cat sorted_purchases.txt
You should see output similar to this:
Alice,Electronics
Alice,Electronics
Alice,Electronics
Bob,Books
Bob,Books
Charlie,Clothing
Charlie,Clothing
David,Home Goods
Eve,Toys
Frank,Sports
Notice how the entries are now sorted alphabetically by customer name. This alphabetical sorting brings all purchases by the same customer together, which is crucial for the next steps.
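To see why sorting matters, you can try running uniq on the unsorted file first. A minimal sketch (it recreates the sample data so it can run anywhere):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# uniq only collapses ADJACENT duplicate lines. In the unsorted
# file no duplicates sit next to each other, so nothing is removed:
uniq customer_purchases.txt | wc -l    # prints 10

# Sorting first makes the duplicates adjacent, so uniq can remove them:
sort customer_purchases.txt | uniq | wc -l    # prints 6
```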
Using uniq to Remove Duplicate Entries
Now that our data is sorted, we can use the uniq command to remove duplicate entries. This will give us a list of unique customer purchases.
Run the following command:
uniq sorted_purchases.txt unique_purchases.txt
Let's break down this command:
- uniq is the command to filter out repeated lines in a file.
- sorted_purchases.txt is our input file (the sorted data).
- unique_purchases.txt is the output file where we're saving the results.
The uniq command reads the sorted data from sorted_purchases.txt, removes adjacent duplicate lines, and saves the result in a new file called unique_purchases.txt.
Now, let's view the contents of the new file:
cat unique_purchases.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
David,Home Goods
Eve,Toys
Frank,Sports
Now we have a list of unique customer purchases, with each customer appearing only once. This gives us a clear view of the different types of purchases made, without repetition.
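The intermediate sorted file is optional: sort and uniq are commonly chained with a pipe. A sketch of two equivalent approaches (the sample data is recreated so the commands run standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# Pipe sort straight into uniq instead of using a temporary file:
sort customer_purchases.txt | uniq > unique_purchases.txt

# sort -u sorts and de-duplicates in a single step:
sort -u customer_purchases.txt
```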
Counting Purchases with uniq -c
The uniq command becomes even more powerful when we use its options. Let's use the -c option to count how many times each customer made a purchase.
Run the following command:
uniq -c sorted_purchases.txt purchase_counts.txt
Let's break down this command:
- uniq is our command to filter repeated lines.
- -c is an option that tells uniq to prefix lines with the number of occurrences.
- sorted_purchases.txt is our input file.
- purchase_counts.txt is the output file where we're saving the results.
This command counts the number of occurrences of each unique line and saves the result in purchase_counts.txt.
Now, let's view the contents of this new file:
cat purchase_counts.txt
You should see output similar to this:
3 Alice,Electronics
2 Bob,Books
2 Charlie,Clothing
1 David,Home Goods
1 Eve,Toys
1 Frank,Sports
The number at the beginning of each line indicates how many times that line appears in the file. For example, Alice made 3 purchases of Electronics, while Frank made 1 purchase of Sports items. Strictly speaking, uniq -c counts identical lines (customer and category pairs); because each customer here buys only one category, the count also works as a per-customer purchase count.
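A common follow-up is to rank these counts. A sketch that pipes uniq -c into a second, numeric sort so the most frequent purchases come first (sample data recreated so the command runs standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# Count duplicates, then sort the counts numerically (-n) in
# reverse order (-r) to put the biggest counts at the top:
sort customer_purchases.txt | uniq -c | sort -rn
```

The first line of the output is the most frequent entry, 3 Alice,Electronics.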
Finding Repeat Customers with uniq -d
As a data analyst, you might be interested in identifying repeat customers. We can use the -d option of the uniq command to display one copy of each duplicated line, which represents the customers who made multiple purchases.
Run the following command:
uniq -d sorted_purchases.txt repeat_customers.txt
Let's break down this command:
- uniq is our command to filter repeated lines.
- -d is an option that tells uniq to only print duplicate lines.
- sorted_purchases.txt is our input file.
- repeat_customers.txt is the output file where we're saving the results.
This command identifies duplicate lines in sorted_purchases.txt and saves them to repeat_customers.txt.
Let's view the contents of this new file:
cat repeat_customers.txt
You should see output similar to this:
Alice,Electronics
Bob,Books
Charlie,Clothing
These are the customers who made more than one purchase. This information could be valuable for customer loyalty programs or targeted marketing campaigns.
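If you only need the names for, say, a mailing list, you can strip the category with cut, which here keeps the first comma-separated field. A standalone sketch (sample data recreated):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# uniq -d prints one copy of each duplicated line; cut keeps only
# the customer name (field 1, with the comma as the delimiter):
sort customer_purchases.txt | uniq -d | cut -d',' -f1
# prints:
# Alice
# Bob
# Charlie
```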
Summary
In this lab, we explored the uniq command in Linux and its application in analyzing customer purchase data. We learned how to:
- Prepare data for use with uniq by sorting it first.
- Use uniq to remove duplicate entries from a sorted file.
- Use uniq -c to count occurrences of each unique line.
- Use uniq -d to identify duplicate lines.
These skills are valuable for data analysis tasks, helping you to efficiently process and extract insights from large datasets.
Additional uniq command options not covered in this lab include:
- -u: Display only unique lines (lines that appear exactly once)
- -i: Ignore case when comparing lines
- -f N: Skip the first N fields when comparing lines
- -s N: Skip the first N characters when comparing lines
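As a quick taste of the first two options, a sketch using the same sample data (recreated so it runs standalone):

```shell
# Recreate the sample data so this sketch is self-contained
cat > customer_purchases.txt << 'EOF'
Alice,Electronics
Bob,Books
Charlie,Clothing
Alice,Electronics
David,Home Goods
Bob,Books
Eve,Toys
Charlie,Clothing
Frank,Sports
Alice,Electronics
EOF

# -u keeps only lines that occur exactly once: the one-time customers
sort customer_purchases.txt | uniq -u
# prints:
# David,Home Goods
# Eve,Toys
# Frank,Sports

# -i ignores case, so these two adjacent lines count as duplicates:
printf 'alice,Electronics\nAlice,Electronics\n' | uniq -i
# prints a single line: alice,Electronics
```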