Introduction
Welcome to the world of text processing with AWK. In this lab, you will learn how to use the awk command to analyze log files, a common task for system administrators and data analysts. AWK is a powerful tool for processing structured text data in Linux, allowing you to extract, filter, and transform information efficiently.
Imagine you are a junior system administrator tasked with analyzing server logs to identify potential security threats and performance issues. The awk command will be your primary tool for this task, enabling you to quickly sift through large log files and extract meaningful insights.
Examining the Log File
Let's start by examining the contents of our sample log file. This file contains simulated server access logs that we'll analyze throughout this lab.
First, navigate to the project directory:
cd ~/project
Now, let's view the first few lines of the log file:
head -n 5 server_logs.txt
You should see output similar to this:
2023-08-01 08:15:23 192.168.1.100 GET /index.html 200
2023-08-01 08:16:45 192.168.1.101 GET /about.html 200
2023-08-01 08:17:30 192.168.1.102 POST /login.php 302
2023-08-01 08:18:12 192.168.1.103 GET /products.html 404
2023-08-01 08:19:05 192.168.1.104 GET /services.html 200
This log file contains information about server requests, including the date and time, IP address, HTTP method, requested resource, and status code.
Basic AWK Usage - Printing Specific Fields
Now that we've seen the structure of our log file, let's use AWK to extract specific information. By default, AWK splits each line into fields based on whitespace. We can refer to these fields using $1, $2, etc., where $1 is the first field, $2 is the second, and so on.
Let's extract the IP addresses (the third field) from our log file:
awk '{print $3}' server_logs.txt | head -n 5
You should see output similar to this:
192.168.1.100
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104
In this command:
- awk '{print $3}' tells AWK to print the third field of each line.
- We pipe (|) the output to head -n 5 to limit the display to the first 5 lines.
Now, let's print both the IP address and the requested resource:
awk '{print $3, $5}' server_logs.txt | head -n 5
Output:
192.168.1.100 /index.html
192.168.1.101 /about.html
192.168.1.102 /login.php
192.168.1.103 /products.html
192.168.1.104 /services.html
Here, we're printing the third field (IP address) and the fifth field (requested resource) for each line.
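If you want the output in aligned columns, awk also provides printf, which works much like C's printf. A minimal sketch, run over one inline sample line so it does not depend on the log file:

```shell
# %-15s left-justifies the IP address in a 15-character column
printf '2023-08-01 08:15:23 192.168.1.100 GET /index.html 200\n' |
  awk '{printf "%-15s %s\n", $3, $5}'
```

You can run the same awk program against server_logs.txt to get a neatly aligned two-column listing.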
Filtering Log Entries
One of AWK's strengths is its ability to filter data based on conditions. Let's use this feature to find all POST requests in our log file, as these might be more security-sensitive than GET requests.
Run the following command:
awk '$4 == "POST" {print $0}' server_logs.txt
This command may print hundreds of lines because the sample file contains 5,000 log entries. If you only want to inspect a manageable sample while learning, add | head -n 10:
awk '$4 == "POST" {print $0}' server_logs.txt | head -n 10
The verification still accepts the plain awk command, so use whichever version helps you read the output more comfortably.
Let's break down this command's syntax to understand how AWK filtering works:
- $4 == "POST" - This is the pattern or condition that AWK evaluates for each line:
  - $4 refers to the fourth field in the current line (in our log file, the HTTP method)
  - == is the equality operator that checks whether two values are equal
  - "POST" is the string we're comparing against
- {print $0} - This is the action AWK performs when the condition is true:
  - The curly braces {} enclose the action
  - print is the command to output text
  - $0 represents the entire current line (all fields)
The command structure follows the AWK pattern: condition {action}. AWK reads each line, and if the condition evaluates to true, it performs the action. If no condition is specified (as in our earlier examples), the action is performed for every line.
You should see output similar to this:
2023-08-01 08:17:30 192.168.1.102 POST /login.php 302
2023-08-01 09:23:45 192.168.1.110 POST /submit_form.php 200
2023-08-01 10:45:12 192.168.1.115 POST /upload.php 500
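Either half of the condition {action} pair can be omitted. With no action, AWK defaults to printing the matching line; with no condition, the action runs for every line. A quick sketch on two inline sample lines:

```shell
# condition only: the default action prints each matching line
printf 'a 1\nb 2\n' | awk '$2 == 2'

# action only: runs for every line of input
printf 'a 1\nb 2\n' | awk '{print $1}'
```

This is why awk '$4 == "POST"' (without {print $0}) behaves identically to the command above.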
Now, let's find all requests that resulted in a 404 (Not Found) status:
awk '$6 == "404" {print $1, $2, $5}' server_logs.txt
This command follows the same pattern but with different values:
- The condition $6 == "404" checks if the sixth field (status code) equals 404
- The action {print $1, $2, $5} prints only specific fields:
  - $1 - First field (date)
  - $2 - Second field (time)
  - $5 - Fifth field (requested resource)
This selective printing allows you to focus on just the information you need.
Output:
2023-08-01 08:18:12 /products.html
2023-08-01 09:30:18 /nonexistent.html
2023-08-01 11:05:30 /missing_page.html
You can combine multiple conditions using logical operators:
- && for AND (both conditions must be true)
- || for OR (at least one condition must be true)
- ! for NOT (negates a condition)
For example, to find all POST requests that resulted in an error (status code >= 400):
awk '$4 == "POST" && $6 >= 400 {print $0}' server_logs.txt
These filters can help you quickly identify potential issues or suspicious activities in your server logs.
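Conditions are not limited to exact equality. AWK also supports regular-expression matching with the ~ operator. For example, to show only requests for .php resources (inline sample lines are piped in here so the sketch is self-contained):

```shell
# $5 ~ /\.php$/ matches lines whose requested resource ends in .php
printf '%s\n' \
  '2023-08-01 08:16:45 192.168.1.101 GET /about.html 200' \
  '2023-08-01 08:17:30 192.168.1.102 POST /login.php 302' |
  awk '$5 ~ /\.php$/ {print $3, $5}'
```

Against the real file, awk '$5 ~ /\.php$/' server_logs.txt would list every request for a PHP script.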
Counting and Summarizing Data
AWK is excellent for counting occurrences and summarizing data. Let's use it to count the number of requests for each HTTP status code.
Run this command:
awk '{count[$6]++} END {for (code in count) print code, count[code]}' server_logs.txt | sort -n
This command is more complex, so let's break it down step by step:
- {count[$6]++} - This is the main action, performed for each line:
  - count is an array (associative array, or dictionary) we're creating
  - [$6] uses the value of the 6th field (status code) as the array index/key
  - ++ is the increment operator, adding 1 to the current value
  - So for each line, we increment the counter for the specific status code found
- END {for (code in count) print code, count[code]} - This is executed after processing all lines:
  - END is a special pattern that matches the end of the input
  - {...} contains the action to perform after all input is processed
  - for (code in count) is a loop that iterates through all keys in the count array
  - print code, count[code] prints each status code and its count
- | sort -n - Pipes the output to the sort command, which sorts numerically
When AWK processes an array like count[$6]++, it automatically:
- Creates the array if it doesn't exist
- Creates a new element with value 0 if the key doesn't exist
- Then increments the value by 1
You should see output similar to this:
200 3562
301 45
302 78
304 112
400 23
403 8
404 89
500 15
This summary quickly shows you the distribution of status codes in your log file.
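To see the counting idiom in isolation, here is the same program run over three inline sample lines instead of the full log file:

```shell
# count[$6]++ builds a key for each distinct status code;
# the END loop dumps the counters, and sort -n orders them
printf '%s\n' \
  'd t 10.0.0.1 GET /a 200' \
  'd t 10.0.0.2 GET /b 404' \
  'd t 10.0.0.3 GET /a 200' |
  awk '{count[$6]++} END {for (code in count) print code, count[code]}' | sort -n
```

With two 200s and one 404 in the input, the output is 200 2 followed by 404 1. Note that for-in iteration order is unspecified in AWK, which is why the external sort matters.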
Now, let's find the top 5 most frequently accessed resources:
awk '{count[$5]++} END {for (resource in count) print count[resource], resource}' server_logs.txt | sort -rn | head -n 5
This command follows a similar pattern with a few changes:
- {count[$5]++} - Counts occurrences of the 5th field (the requested resource)
- END {for (resource in count) print count[resource], resource} - After processing all lines:
  - Prints the count first, followed by the resource
  - This order change facilitates numerical sorting by count
- | sort -rn - Sorts numerically in reverse order (highest counts first)
- | head -n 5 - Limits output to the first 5 lines (top 5 results)
Output:
1823 /index.html
956 /about.html
743 /products.html
512 /services.html
298 /contact.html
These AWK commands demonstrate the power of using arrays for counting and summarizing. You can adapt this pattern to count any field or combination of fields in your data.
For example, to count the number of requests per IP address:
awk '{count[$3]++} END {for (ip in count) print ip, count[ip]}' server_logs.txt
To count requests by both method and status:
awk '{key=$4"-"$6; count[key]++} END {for (k in count) print k, count[k]}' server_logs.txt
These summaries can help you understand traffic patterns and identify popular (or problematic) resources on your server.
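Summaries don't always need arrays. The built-in variable NR holds the number of records read so far, so in the END block it is the total line count, which makes ratios like an error rate a one-liner. A sketch over four inline sample lines, one of which is an error:

```shell
# count lines with status >= 400, then divide by the total (NR) at the end
printf '%s\n' \
  'd t i GET /a 200' \
  'd t i GET /b 404' \
  'd t i GET /c 200' \
  'd t i GET /d 200' |
  awk '$6 >= 400 {errors++} END {printf "%.1f%%\n", 100 * errors / NR}'
```

Run against server_logs.txt, the same program reports the error rate for the whole log.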
Creating a Simple Report
For our final task, let's create a simple HTML report summarizing some key information from our log file. We'll use an AWK script stored in a separate file for this more complex operation.
This step combines several AWK ideas from earlier sections:
- counters such as total++
- arrays such as ip_count[$3]++
- an END block that prints the final summary
If the script feels long at first glance, focus on one block at a time. You do not need to memorize the whole file before running it.
First, create a file named log_report.awk with the following content:
Tip: Copy the content below and paste it into your terminal to create the file.
cat << 'EOF' > log_report.awk
BEGIN {
print "<html><body>"
print "<h1>Server Log Summary</h1>"
total = 0
errors = 0
}
{
total++
if ($6 >= 400) errors++
ip_count[$3]++
resource_count[$5]++
}
END {
print "<p>Total requests: " total "</p>"
print "<p>Error rate: " (errors/total) * 100 "%</p>"
print "<h2>Top 5 IP Addresses</h2>"
print "<ul>"
for (ip in ip_count) {
top_ips[ip] = ip_count[ip]
}
n = asort(top_ips, sorted_ips, "@val_num_desc")
for (i = 1; i <= 5 && i <= n; i++) {
for (ip in ip_count) {
if (ip_count[ip] == sorted_ips[i]) {
print "<li>" ip ": " ip_count[ip] " requests</li>"
delete ip_count[ip]
break
}
}
}
print "</ul>"
print "<h2>Top 5 Requested Resources</h2>"
print "<ul>"
for (resource in resource_count) {
top_resources[resource] = resource_count[resource]
}
n = asort(top_resources, sorted_resources, "@val_num_desc")
for (i = 1; i <= 5 && i <= n; i++) {
for (resource in resource_count) {
if (resource_count[resource] == sorted_resources[i]) {
print "<li>" resource ": " resource_count[resource] " requests</li>"
delete resource_count[resource]
break
}
}
}
print "</ul>"
print "</body></html>"
}
EOF
Let's understand this AWK script section by section:
BEGIN Block: Executes before processing any input lines
BEGIN {
    print "<html><body>"                ## Start HTML structure
    print "<h1>Server Log Summary</h1>"
    total = 0                           ## Initialize counter for total requests
    errors = 0                          ## Initialize counter for error requests
}

Main Processing Block: Executes for each line of the input file

{
    total++                             ## Increment total request counter
    if ($6 >= 400) errors++             ## Count error responses (status codes >= 400)
    ip_count[$3]++                      ## Count requests by IP address (field 3)
    resource_count[$5]++                ## Count requests by resource (field 5)
}

END Block: Executes after processing all input lines

END {
    ## Print summary statistics
    print "<p>Total requests: " total "</p>"
    print "<p>Error rate: " (errors/total) * 100 "%</p>"

    ## Process and print the top 5 IP addresses
    ## ...

    ## Process and print the top 5 requested resources
    ## ...

    print "</body></html>"              ## End HTML structure
}
Before moving on, notice the overall flow:
- BEGIN prints the opening HTML tags and initializes counters.
- The middle block processes each log line and updates totals.
- END prints the final report after every line has been analyzed.
Let's examine the sorting logic for the top IPs (the resources section works the same way):
## Copy the counts to a new array for sorting
for (ip in ip_count) {
top_ips[ip] = ip_count[ip]
}
## Sort the array by value in descending order
n = asort(top_ips, sorted_ips, "@val_num_desc")
## Print the top 5 entries
for (i = 1; i <= 5 && i <= n; i++) {
## Find the original IP that matches this count
for (ip in ip_count) {
if (ip_count[ip] == sorted_ips[i]) {
print "<li>" ip ": " ip_count[ip] " requests</li>"
delete ip_count[ip]
break
}
}
}
In this script:
- The asort() function sorts the array; note that asort() and its "how" argument are GNU awk (gawk) extensions, not part of POSIX awk
- "@val_num_desc" is a special argument that tells asort() to sort numerically by value in descending order
- The nested loops find and print the top 5 entries
You can think of the nested loops like this:
- the first loop decides which counts belong in the top 5
- the second loop finds which IP address or resource produced each count
- after printing one match, the script deletes that key so equal counts do not duplicate the same entry
That lookup pattern is more advanced than the previous steps, so it is normal if this is the first part of the lab that feels like real scripting instead of a one-line command.
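Because asort() is a gawk extension, the script above will not run under other awk implementations such as mawk or BSD awk. A portable alternative is to let awk print count-value pairs and hand the ranking to sort and head, exactly as in the earlier top-5 one-liner. A self-contained sketch over inline sample lines (in practice you would read server_logs.txt instead of the printf):

```shell
# plain awk emits "count ip" pairs; sort -rn ranks them, head keeps the top 5
printf '%s\n' \
  'd t 10.0.0.1 GET /a 200' \
  'd t 10.0.0.1 GET /b 200' \
  'd t 10.0.0.2 GET /a 200' |
  awk '{count[$3]++} END {for (ip in count) print count[ip], ip}' |
  sort -rn | head -n 5
```

The trade-off: the pure-gawk version keeps everything in one process, while the pipeline version works with any POSIX awk and avoids the nested-loop lookup entirely.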
Now, let's run our AWK script to generate the report:
awk -f log_report.awk server_logs.txt > log_report.html
The -f option tells AWK to read the script from the specified file:
- -f log_report.awk - Reads the AWK script from the file log_report.awk
- server_logs.txt - Processes this file using the script
- > log_report.html - Redirects the output to the file log_report.html
You can view the contents of the report using the cat command:
cat log_report.html
If the HTML output feels hard to scan in the terminal, preview just the first part first:
head -n 15 log_report.html
This report provides a summary of total requests, error rate, top 5 IP addresses, and top 5 requested resources. In a real-world scenario, you could open this HTML file in a web browser for a formatted view.
The approach we've used in this script demonstrates how AWK can be used for more complex data analysis tasks. You can extend this script to include additional statistics or different visualizations based on your specific needs.
Summary
Congratulations! You've completed this lab on using the AWK command for log analysis. Let's recap what you've learned:
- Basic AWK usage: Printing specific fields from a structured text file.
- Filtering data: Using conditions in AWK to select specific log entries.
- Counting and summarizing: Using AWK to generate statistics from log data.
- Creating reports: Writing more complex AWK scripts to generate formatted reports.
These skills will be invaluable for analyzing log files, processing data, and generating reports in your future work as a system administrator or data analyst.
Here are some additional AWK parameters and features we didn't cover in depth in this lab:
- -F: Specifies a field separator other than whitespace.
- -v: Assigns a value to a variable from the command line.
- NR: A built-in variable holding the current record (line) number.
- NF: A built-in variable holding the number of fields in the current record.
- BEGIN and END blocks: Special patterns for initialization and finalization (used briefly in the report script).
- Built-in functions: Mathematical functions, string functions, and more.
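A quick sketch tying three of those together: -F sets a custom field separator, while NR and NF report the line number and field count. Here it is run over two inline colon-separated sample records:

```shell
# -F: splits on colons; NR is the line number, NF the field count per line
printf 'alice:x:1000\nbob:x:1001\n' |
  awk -F: '{print NR, $1, "(" NF " fields)"}'
```

The same idea applies to any delimited data, such as /etc/passwd or CSV files (though quoted CSV fields need more care than a plain -F, can provide).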
Remember, practice is key to mastering AWK. Try modifying the commands and scripts from this lab to analyze different aspects of the log file or to process other types of structured text data.