Mastering Grep, Sed, and Awk Commands

Introduction

This lab introduces three commands: grep, sed, and awk. All three rely on regular expressions to describe the text they operate on.

Regular Expressions

Many programming languages support the use of regular expressions for string manipulation. For example, Perl has a powerful regular expression engine built into the language.

In terms of form and function, regular expressions are similar to wildcards. However, there is a big difference between them, especially in the meanings of some special matching characters, so make sure you can tell the two apart.

Suppose we have a text file including two strings, "labex" and "exlab".

lab*

If it is a regular expression, "lab*" means "la" followed by zero or more "b", because * repeats the preceding sub-expression (here the single character "b") zero or more times. It therefore matches "la", "lab", "labb" and so on, and a search tool such as grep finds the "lab" inside both "labex" and "exlab". If it is a wildcard, * stands for any number of arbitrary characters, so "lab*" matches any string that starts with "lab": it matches "labex" but not "exlab".
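The difference can be checked directly in a shell. The following is a minimal sketch, not part of the original lab: it feeds the two sample strings to grep (regular expression) and to a case statement (wildcard). The variable names are made up for this illustration.

```shell
# Regular expression: lab* is "la" followed by zero or more "b",
# and grep looks for it anywhere in the line.
regex_matches=$(printf 'labex\nexlab\n' | grep 'lab*')

# Wildcard (glob): lab* is "lab" followed by anything,
# and the whole string has to fit the pattern.
glob_matches=""
for word in labex exlab; do
  case "$word" in
    lab*) glob_matches="$glob_matches$word " ;;
  esac
done

echo "regex matches:"
echo "$regex_matches"
echo "glob matches: $glob_matches"
```

The regular expression finds both strings, while the wildcard accepts only "labex".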

Basic Syntax of Regular Expressions

A regular expression, regex or regexp is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually, this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.

A regular expression is often referred to as a pattern used to describe or match a string conforming to a syntactic rule.

What is `|`?

| separates alternate possibilities. So, for example, "boy|girl" can match "boy" or "girl".

How to limit the number of matches?

+ can match the preceding pattern element one or multiple times. For example, "goo+gle" can match "google", "gooogle", "goooogle" and so on;
? can match the preceding pattern element zero or one time. For example, "colou?r" can match "color" or "colour";
* can match the preceding pattern element zero or more times. For example, "0*42" can match "42", "042", "0042", "00042" and so on.
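The three quantifiers can be tried with grep. This is a small sketch under stated assumptions: the candidate words and variable names are invented for the illustration, -E selects extended regular expressions, and -x makes grep accept a line only when the whole line matches.

```shell
# +  : one or more of the preceding "o" (so at least two "o" in total here)
plus=$(printf 'gogle\ngoogle\ngooogle\n' | grep -xE 'goo+gle')

# ?  : the preceding "u" is optional
optional=$(printf 'color\ncolour\ncolouur\n' | grep -xE 'colou?r')

# *  : any number of leading zeros, including none
star=$(printf '42\n0042\n142\n' | grep -xE '0*42')

printf '%s\n' "$plus" "$optional" "$star"
```

"gogle", "colouur", and "142" are rejected; every other candidate is printed.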

Range and Priority

() can be used to define the scope and priority of the pattern string, which can be understood as a string within the parentheses as a whole. For example, "gr(a|e)y" is equivalent to "gray|grey". Likewise, "(grand)?father" can match "father" and "grandfather".
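As a quick check of grouping, the sketch below runs the two patterns from this paragraph against a few made-up candidate words (the words and variable names are assumptions for the example).

```shell
# (a|e) limits the alternation to one character position
grouped=$(printf 'gray\ngrey\ngruy\n' | grep -xE 'gr(a|e)y')

# (grand)? makes the whole group optional
family=$(printf 'father\ngrandfather\ngrandmother\n' | grep -xE '(grand)?father')

printf '%s\n' "$grouped" "$family"
```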

Syntax

Regular expressions have many different styles. Here are some of the commonly used rules for regular expression matching for Perl and Python programming languages:

PCRE (Perl Compatible Regular Expressions) is a regular expression library written in C by Philip Hazel. It is a lightweight library, much smaller than regular expression libraries such as Boost. PCRE is straightforward to use and very powerful, and it often performs better than POSIX regular expression libraries and some other classic regular expression libraries.

\: Mark the next character as a special or literal character. For example, "n" matches the character "n". "\n" matches a line break. "\\" matches "\" and "\(" matches "(".
^: Match the start position of the input string.
$: Match the end position of the input string.
{n}: n in {n} is a non-negative integer that means matching n times. For example, "o{2}" cannot match "o" in "Bob", but it can match two "o" in "food".
{n,}: n in {n,} is a non-negative integer that means matching at least n times. For example, "o{2,}" cannot match "o" in "Bob", but it can match all "o" in "fooooood".
{n,m}: m and n are non-negative integers, where n <= m. Match at least n times and at most m times. For example, "o{1,3}" will match the first three "o" in "fooooood". "o{0,1}" is equivalent to "o?". Please note that there cannot be spaces between the comma and the two numbers.
*: Match the previous subexpression zero or more times. For example, "zo*" can match "z", "zo", and "zoo". * is equivalent to {0,}.
+: Match the preceding subexpression one or more times. For example, "zo+" can match "zo" and "zoo" but cannot match "z". "+" is equivalent to {1,}.
?: Match the previous subexpression zero times or once. For example, "do(es)?" can match "do" and "does". ? is equivalent to {0,1}.
?: When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. A non-greedy pattern matches as short a string as possible, while the default greedy pattern matches as long a string as possible. For example, for the string "oooo", "o+?" will match a single "o", and "o+" will match all "o".
.: Match any single character except "\n".
(pattern): Match the pattern and get the matching substring.
x|y: Match x or y. For example, "z|food" can match "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]: Character class matching any one of the characters contained in [ ]. For example, "[abc]" can match "a" in "plain". Inside the brackets, \ keeps its special meaning, while other special characters, such as * and +, are treated as ordinary characters. If ^ appears in the first position, it means a negative character set; if it appears anywhere else in the class, it is only a normal character.
[^xyz]: A negative character set. Match any character that is not listed. For example, "[^abc]" can match "p", "l", "i", and "n" in "plain".
[a-z]: Match any character within the specified range. For example, "[a-z]" can match any lowercase alphabetic character in the range of "a" to "z".
[^a-z]: Match any character that is not within the specified range. For example, "[^a-z]" can match any character that is not in the range of "a" to "z".
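Several of these rules can be observed with grep. The sketch below assumes a grep that supports the -o option (print only the matched part, available in GNU and BSD grep); the sample strings and variable names are chosen for the illustration.

```shell
# {n}: "Bob" has no run of two "o", "food" does.
two_o=$(printf 'Bob\nfood\n' | grep -E 'o{2}')

# {n,m} is greedy: five "o" are consumed as a run of three, then a run of two.
runs=$(echo 'ooooo' | grep -oE 'o{1,3}' | tr '\n' ' ')

# [^abc]: every character of "plain" that is not a, b, or c.
not_abc=$(echo 'plain' | grep -oE '[^abc]' | tr -d '\n')

echo "$two_o"
echo "$runs"
echo "$not_abc"
```

The last command prints "plin", which is the part of "plain" that the negative character set can match.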

The Priority of Operators

Priority decreases from top to bottom; operators on the same line are applied from left to right:

\ : Escape character
(), (?:), (?=), []
*, +, ?, {n}, {n,}, {n,m} : Quantifiers
^, $ : Anchors
| : The choice (also known as alternation or set union) operator matches either the expression before or after the operator.
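Because | has the lowest priority, it splits the whole pattern unless parentheses limit it. The following sketch illustrates this with three made-up words; the variable names are assumptions for the example.

```shell
words=$(printf 'gray\ngrey\nkey\n')

# Without parentheses, | splits the whole pattern: "gra" OR "ey".
loose=$(printf '%s\n' "$words" | grep -cE 'gra|ey')

# Parentheses limit the alternation to a single character position.
tight=$(printf '%s\n' "$words" | grep -cE 'gr(a|e)y')

echo "gra|ey matches $loose lines, gr(a|e)y matches $tight lines"
```

The first pattern also accepts "key" because that word contains "ey".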

For more regular expressions, you may refer to the following link:

Regular_expression wiki

`grep`

grep prints the lines of its input that match a pattern, using a regular expression as the matching condition. grep supports three regular expression engines, each selected by its own parameter:

Parameter	Description
`-E`	POSIX extended regular expression, ERE
`-G`	POSIX basic regular expression, BRE
`-P`	Perl regular expression, PCRE

In most cases, you will only use ERE and BRE.

Before using grep to work on regular expressions, let us first introduce some grep parameters:

Parameter	Description
`-b`	The offset in bytes of a matched pattern is displayed in front of the respective matched line.
`-c`	Only a count of the selected lines is written to the standard output.
`-i`	Ignores case.
`-n`	Displays the line number where the matching text is located.
`-v`	Selects the lines that do not match the specified patterns.
`-r`	Recursively searches the listed directories and their subdirectories.
`-A n`	Print `n` lines of trailing context after each match.
`-B n`	Print `n` lines of leading context before each match.
`--color=when`	Mark up the matching text with the color stored in the GREP_COLOR environment variable. The possible values of `when` are `never`, `always`, or `auto`.

In most distributions, grep's color highlighting is enabled by default. You can change the highlight color by setting the GREP_COLOR environment variable.
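A few of these parameters in action, as a minimal sketch. The four sample lines and the variable names are invented for the example; -A is assumed to be available, as it is in GNU and BSD grep.

```shell
sample=$(printf 'Alpha\nbeta\nalpha beta\ngamma\n')

# -c with -i: count the lines containing "alpha" in any letter case
count=$(printf '%s\n' "$sample" | grep -c -i 'alpha')

# -n: prefix each matching line with its line number
numbered=$(printf '%s\n' "$sample" | grep -n 'beta')

# -v: keep only the lines that do NOT match
inverted=$(printf '%s\n' "$sample" | grep -v 'beta')

# -A 1: print one line of context after each match
after=$(printf '%s\n' "$sample" | grep -A 1 'Alpha')

printf '%s\n' "$count" "$numbered" "$inverted" "$after"
```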

Using POSIX Basic Regular Expression by `grep`

Position match

The first command below finds every line that contains "labex" in /etc/group; the second uses ^ to find only the lines that start with "labex":

grep 'labex' /etc/group
grep '^labex' /etc/group

labex:project/ $ grep 'labex' /etc/group
sudo:x:27:labex
ssl-cert:x:121:labex
labex:x:5000:
public:x:5002:labex
labex:project/ $ grep '^labex' /etc/group
labex:x:5000:

Restriction match

Match lines that contain a 'z' followed, anywhere later in the line, by an 'o':

echo 'zero\nzo\nzoo' | grep 'z.*o'

Match strings beginning with 'z', ending with 'o', and with an arbitrary character in the middle:

echo 'zero\nzo\nzoo' | grep 'z.o'

Match lines that contain a 'z' followed by any number of 'o' (including none):

echo 'zero\nzo\nzoo' | grep 'zo*'

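\n is the line break. These examples rely on the echo of zsh, the shell used in this lab, which interprets \n; in bash, use echo -e or printf instead.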

labex:project/ $ echo 'zero\nzo\nzoo' | grep 'z.*o'
zero
zo
zoo
labex:project/ $ echo 'zero\nzo\nzoo' | grep 'zo*'
zero
zo
zoo
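Since grep prints whole lines, it can be hard to see what each pattern actually matched. The sketch below is an illustration that assumes a grep with the -o option (GNU and BSD grep); it uses printf so that it also works in bash, and the variable names are made up.

```shell
words=$(printf 'zero\nzo\nzoo\n')

# z.*o : "z", then anything, then "o"
any_run=$(printf '%s\n' "$words" | grep -o 'z.*o' | tr '\n' ' ')

# z.o  : exactly one character between "z" and "o"
one_char=$(printf '%s\n' "$words" | grep -o 'z.o' | tr '\n' ' ')

# zo*  : "z" followed by zero or more "o"
zero_or_more=$(printf '%s\n' "$words" | grep -o 'zo*' | tr '\n' ' ')

echo "z.*o -> $any_run"
echo "z.o  -> $one_char"
echo "zo*  -> $zero_or_more"
```

Note that 'zo*' selects the line "zero" only because of its leading "z"; no "o" is part of that match.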

Choice match

By default, grep is case-sensitive. So, for example, the command below will match all lowercase letters:

echo '1234\nabcd' | grep '[a-z]'

Match all the numbers:

echo '1234\nabcd' | grep '[0-9]'

Match all the numbers, this time with a POSIX character class:

echo '1234\nabcd' | grep '[[:digit:]]'

Match all the lowercase letters:

echo '1234\nabcd' | grep '[[:lower:]]'

Match all the uppercase letters:

echo '1234\nabcd' | grep '[[:upper:]]'

Match all the letters and numbers, including 0-9, a-z, A-Z:

echo '1234\nabcd' | grep '[[:alnum:]]'

Match all the letters:

echo '1234\nabcd' | grep '[[:alpha:]]'

The following table lists the POSIX character classes and their meanings:

Special Symbol	Description
`[:alnum:]`	Upper and lower case letters and digits (0-9, A-Z, a-z)
`[:alpha:]`	Any English uppercase and lowercase letters (A-Z, a-z)
`[:blank:]`	Space and Tab
`[:cntrl:]`	Control characters, including CR, LF, Tab, Del, and so on
`[:digit:]`	Numeral digits (0-9)
`[:graph:]`	All visible characters, that is, every printable character except the space
`[:lower:]`	Lowercase letters (a-z)
`[:print:]`	Characters that can be printed out, including the space
`[:punct:]`	Punctuation symbols (" ' ? ! ; : # ...)
`[:upper:]`	Uppercase letters (A-Z)
`[:space:]`	Whitespace characters, including space, Tab, CR, LF, and so on
`[:xdigit:]`	Hexadecimal digits, including 0-9, A-F, a-f

Note that [a-z] does not behave the same way in every environment, because its meaning depends on the locale set in the LANG environment variable. [:lower:] can be used in all cases.
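A short sketch of the character classes, using three made-up input lines. Note the double brackets: the class name such as [:digit:] must itself sit inside a bracket expression.

```shell
lines=$(printf '1234\nabcd\nABCD\n')

digits=$(printf '%s\n' "$lines" | grep '[[:digit:]]')
lower=$(printf '%s\n' "$lines" | grep '[[:lower:]]')
upper=$(printf '%s\n' "$lines" | grep '[[:upper:]]')

echo "digit: $digits"
echo "lower: $lower"
echo "upper: $upper"
```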

Exclude characters:

echo 'geek\ngood' | grep '[^o]'

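Note that when ^ is the first character inside a class (the square brackets), it excludes the listed characters. Outside a class, ^ matches the start of a line. Both lines are printed here because each contains at least one character other than "o".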

labex:project/ $ echo 'geek\ngood' | grep '[^o]'
geek
good

Using POSIX 'Extended Regular Expression' by `grep`

Using Extended Regular Expression with grep requires adding the -E parameter or using egrep.

Restriction match

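Match lines containing "z" followed by one "o". Because grep searches anywhere in the line, this prints both "zo" and "zoo":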

echo 'zero\nzo\nzoo' | grep -E 'zo{1}'

Match lines containing "z" followed by one or more "o":

echo 'zero\nzo\nzoo' | grep -E 'zo{1,}'
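To accept a line only when the entire line fits the pattern, the -x option can be added. The following sketch compares the two behaviours; the variable names are assumptions for the illustration.

```shell
words=$(printf 'zero\nzo\nzoo\n')

# Pattern may match anywhere in the line: "zoo" also contains "zo".
contains=$(printf '%s\n' "$words" | grep -E 'zo{1}' | tr '\n' ' ')

# -x: the whole line must be "z" followed by exactly one "o".
exact=$(printf '%s\n' "$words" | grep -xE 'zo{1}' | tr '\n' ' ')

# -x with {1,}: the whole line must be "z" followed by one or more "o".
one_or_more=$(printf '%s\n' "$words" | grep -xE 'zo{1,}' | tr '\n' ' ')

echo "contains:    $contains"
echo "exact:       $exact"
echo "one or more: $one_or_more"
```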

Choice match

Match "www.labex.io" and "www.google.com":

echo 'www.labex.io\nwww.baidu.com\nwww.google.com' | grep -E 'www\.(labex\.io|google\.com)'

Or match the content that does not contain "baidu":

echo 'www.labex.io\nwww.baidu.com\nwww.google.com' | grep -Ev 'www\.baidu\.com'

Since . has a special meaning, we need to use \. to escape it.
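The effect of escaping can be seen with a deliberately wrong input. In this sketch the second line, "wwwxlabexyio", is a made-up string that contains no dots at all.

```shell
hosts=$(printf 'www.labex.io\nwwwxlabexyio\n')

# Unescaped dots match any character, so both lines are counted.
unescaped=$(printf '%s\n' "$hosts" | grep -c 'www.labex.io')

# Escaped dots match only a literal ".".
escaped=$(printf '%s\n' "$hosts" | grep -c 'www\.labex\.io')

echo "unescaped: $unescaped line(s), escaped: $escaped line(s)"
```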

`sed`

sed is short for "stream editor". Its manual describes it as a "stream editor for filtering and transforming text".

Linux/UNIX has many powerful editors, such as vi/vim (the "god of editors"), emacs (the "editor of the gods"), and gedit. sed is unique in that it is a non-interactive editor, so let us introduce it here.

Commonly Used Parameters of `sed`

sed [Parameters]... [Command] [File]...
# For example:
sed -i '1s/sad/happy/' test
# Replace the first "sad" in the first line of the file test with "happy"

Parameter	Description
`-n`	By default, each line of input is echoed to the standard output after all the commands have been applied. The `-n` option suppresses this behavior.
`-e`	Append the editing commands specified by the command argument to the list of commands.
`-f filename`	Execute the commands stored in the `filename` file.
`-r`	Use extended regular expressions instead of the default basic regular expressions.
`-i`	Directly modify the contents of the input file instead of printing to standard output.

The Execution Command of `sed`

[n1],[n2]command
[n1]~[step]command
# Some commands accept flags that control their scope, such as:
sed -i 's/sad/happy/g' test # g replaces every match in each line
sed -i 's/sad/happy/4' test # 4 replaces only the 4th match in each line

[n1],[n2] means all lines from n1 to n2. [n1]~[step] means every step-th line starting from line n1 (this form is a GNU sed extension). command is the command to execute. Here are some commonly used commands:

Command	Description
`s`	Substitute the text that matches a pattern in a line
`c`	Change the selected line to the new text
`a`	Insert the text below the selected line (a=append)
`i`	Insert the text above the selected line (i=insert)
`p`	Print the selected line
`d`	Delete the selected line
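The following sketch combines line addresses with the p, d, and s commands. It works on four made-up lines piped through standard input, so no file is modified; the variable names are assumptions for the example.

```shell
lines=$(printf 'one\ntwo\nthree\nfour\n')

# -n with 2,3p: print only lines 2 to 3
middle=$(printf '%s\n' "$lines" | sed -n '2,3p' | tr '\n' ' ')

# 1d: delete line 1 and print the rest
without_first=$(printf '%s\n' "$lines" | sed '1d' | tr '\n' ' ')

# 2s/.../.../: substitute on line 2 only
changed=$(printf '%s\n' "$lines" | sed '2s/two/TWO/' | tr '\n' ' ')

echo "$middle"
echo "$without_first"
echo "$changed"
```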

Operation Example

First, let's choose a text file for practice:

cp /etc/passwd ~/project

Then print the specified lines:

# Print lines 2 to 5
nl passwd | sed -n '2,5p'
# Print odd lines
nl passwd | sed -n '1~2p'

labex:project/ $ nl passwd | sed -n '2,5p'
2 daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
4 sys:x:3:3:sys:/dev:/usr/sbin/nologin
5 sync:x:4:65534:sync:/bin:/bin/sync
labex:project/ $ nl passwd | sed -n '1~2p'
1 root:x:0:0:root:/root:/bin/bash
3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
5 sync:x:4:65534:sync:/bin:/bin/sync

Replace the specified character in a line

# Replace every "labex" in the input with "hehe" and print only the lines that changed. Because -n suppresses the default output, the trailing p flag cannot be omitted here.
sed -n 's/labex/hehe/gp' passwd

Change the selected line to the new text

nl passwd | grep "labex"
# Change line 21 to the new text; c prints it even though -n is set
sed -n '21c\www.labex.io' passwd

labex:project/ $ sed -n 's/labex/hehe/gp' passwd
hehe:x:5000:5000::/home/hehe:/usr/bin/zsh
labex:project/ $ nl passwd | grep "labex"
32 labex:x:5000:5000::/home/labex:/usr/bin/zsh
labex:project/ $ sed -n '21c\www.labex.io' passwd
www.labex.io
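The flags of the s command are easier to compare on a single line of text. This sketch uses a made-up string with five occurrences of "sad" and invented variable names; it reads from standard input, so nothing is changed on disk.

```shell
text='sad sad sad sad sad'

# No flag: only the first match in the line is replaced.
first=$(echo "$text" | sed 's/sad/happy/')

# g: every match in the line is replaced.
all=$(echo "$text" | sed 's/sad/happy/g')

# 4: only the 4th match in the line is replaced.
fourth=$(echo "$text" | sed 's/sad/happy/4')

# -n plus the p flag: print only the lines where a substitution happened.
printed=$(printf 'keep\nsad\n' | sed -n 's/sad/happy/p')

printf '%s\n' "$first" "$all" "$fourth" "$printed"
```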

If you want to learn more about the advanced use of sed, you can explore the following link sed Reference

`awk`

AWK is an excellent text processing tool, one of the most powerful data processing engines available in Linux and Unix environments. It allows you to create short programs that read input files, sort data, process data, perform calculations on input and generate reports, as well as countless other functions. Most simply, AWK is a programming language tool for handling text.

Basic Concepts of `awk`

All operations are based on the pattern-action statements, as in the following form:

pattern {action}

You can see that, as in many programming languages, all the actions are enclosed in {}. pattern is usually a relational expression or a regular expression that is matched against the input text, and action is what is executed once a match has been made.

In a complete awk operation, you may supply only one of the two. If there is no pattern, every line of input matches by default. If there is no action, the default is to print the matching lines to the screen.
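The three possible forms can be sketched as follows. The name-and-age data and the variable names are invented for this example.

```shell
data=$(printf 'alice 30\nbob 25\ncarol 35\n')

# Pattern only: lines whose second field is greater than 28 are printed as-is.
pattern_only=$(printf '%s\n' "$data" | awk '$2 > 28' | tr '\n' ';')

# Action only: the action runs for every line.
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }' | tr '\n' ';')

# Pattern and action: the action runs only for lines starting with "b".
both=$(printf '%s\n' "$data" | awk '/^b/ { print $2 }')

echo "$pattern_only"
echo "$action_only"
echo "$both"
```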

The Basic Format of `awk`

awk [-F fs] [-v var=value] [-f prog-file | 'program text'] [file...]

-F is used to pre-specify the field delimiter.
-v is used to specify variables for the awk program in advance.
-f is used to specify the program file to be executed by the awk command.
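A minimal sketch of the three options. The input line, the variable name label, and the temporary program file are assumptions made for the illustration; mktemp is assumed to be available on the system.

```shell
# -F sets the field delimiter to ":", -v defines the awk variable "label".
result=$(echo 'root:x:0:0' | awk -F: -v label='user' '{ print label "=" $1 }')

# -f reads the awk program from a file instead of the command line.
prog=$(mktemp)
echo '{ print $NF }' > "$prog"
last=$(echo 'root:x:0:0' | awk -F: -f "$prog")
rm -f "$prog"

echo "$result"
echo "$last"
```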

Operation Experience

Create a new text document using vim:

vim test

The text should include the following content:

I like linux
www.labex.io

Use awk to print the text:

awk '{print}' test

labex:project/ $ awk '{print}' test
I like linux
www.labex.io

In this operation, we have omitted the pattern. So, awk will match the entire contents of the input text by default.

Print each field of the first line of test on its own line:

awk '{
  if (NR == 1) {
    print $1 "\n" $2 "\n" $3
  } else {
    print
  }
}' test

I
like
linux
www.labex.io

Here, we use the if branching statement. It works much as it does in other high-level programming languages such as C, C++, and Java. If you have basic knowledge of these languages, you will understand the code easily.

In addition, you need to pay attention to NR and OFS. Both are awk built-in variables. NR is the number of the record (line) currently being processed. OFS is the output field separator; its default value is a space.

In the example above, we joined the fields with \n (line breaks) instead of relying on OFS, so each field is printed on its own line. $N, where N is the field number, is also an awk built-in variable; it refers to the corresponding field. The first line has only three fields.

Hence only $1, $2, and $3 are referenced. In addition, there is another variable, $0, that does not appear here. It refers to the entire contents of the current record (the current line).

Then, split the second line on "." and print its fields separated by tabs:

awk -F'.' '{
  if (NR == 2) {
    print $1 "\t" $2 "\t" $3
  }
}' test

www labex io
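The same results can be obtained by setting OFS and separating the fields with commas in print, rather than concatenating the separator by hand. This is an alternative sketch, not the lab's own method; it pipes the two lines of the test file through standard input, and the variable names are made up.

```shell
input=$(printf 'I like linux\nwww.labex.io\n')

# OFS is inserted wherever print has a comma between its arguments.
joined=$(printf '%s\n' "$input" | awk 'BEGIN { OFS = "-" } NR == 1 { print $1, $2, $3 }')

# Split line 2 on "." and join its fields with tabs.
tabbed=$(printf '%s\n' "$input" | awk -F. 'BEGIN { OFS = "\t" } NR == 2 { print $1, $2, $3 }')

echo "$joined"
echo "$tabbed"
```

Here the pattern NR == 1 replaces the if statement: the action simply runs only for the first record.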

As beginners, we should treat awk as a programming language and spread a program over several lines rather than cramming all the code into a single line.

Some Built-in Variables of `awk`

Name	Description
`FILENAME`	Name of the current input file. If the input comes from the standard input, it is empty.
`$0`	The contents of the current record.
`$N`	N represents the field number. The maximum value is the value of the NF variable.
`FS`	Field separator, which can be a regular expression; defaults to a space.
`RS`	Input record separator (default is newline).
`NF`	Number of fields in the current record.
`NR`	Ordinal number of the current record.
`FNR`	Ordinal number of the current record in the current file.
`OFS`	Output field separator (default is a space).
`ORS`	Output record separator (default is newline).
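A short sketch of NR, NF, and $NF on two made-up input lines: for each record it prints the record number, the number of fields, and the last field.

```shell
# $NF is the field whose number equals NF, that is, the last field.
report=$(printf 'a b c\nd e\n' | awk '{ print NR, NF, $NF }' | tr '\n' ';')

echo "$report"
```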

If you want to know more about awk, look out for our follow-up courses or see the link below:

awk Guide

Summary

In this lab, you learned how to write regular expressions and how to use them with three commands: grep to search for patterns in text files, sed to replace patterns in text files, and awk to search for patterns in text files and print the results.
