Introduction
This lab introduces three commands: grep, sed, and awk. All three use regular expressions to describe the text they search for or transform.
Many programming languages support regular expressions for string manipulation. Perl, for example, has a powerful regular expression engine built in.
In form and function, regular expressions resemble shell wildcards. However, the two differ considerably, especially in the meaning of some special matching characters, so it is important to keep them apart.
Suppose we have a text file containing two strings, "labex" and "exlab", and the following pattern:
lab*
As a regular expression, it matches the substring "lab" in both strings. As a wildcard, it matches only "labex". Why? In a regular expression, * means that the preceding sub-expression (here the character "b") is matched zero or more times, so lab* matches "la", "lab", "labb", and so on, anywhere in the text; that is why tools such as grep find it inside "labs", "labex", and "exlab" alike. As a wildcard, * stands for any number of arbitrary characters and the pattern must cover the whole string, so lab* matches "lab", "labs", and "labex" but not "exlab", which does not start with "lab".
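The difference can be checked directly in the shell. The following is a minimal sketch, assuming a POSIX shell and grep; the variable names are only for illustration. grep treats lab* as a regular expression, while the case statement treats it as a wildcard.

```shell
# As a regular expression, lab* is "la" plus zero or more "b":
# grep finds it somewhere inside both strings.
regex_hits=$(printf 'labex\nexlab\n' | grep -c 'lab*')

# As a shell wildcard, lab* is "lab" plus anything, and it must cover the whole string.
glob_hits=0
for word in labex exlab; do
  case "$word" in
    lab*) glob_hits=$((glob_hits + 1)) ;;
  esac
done

echo "regex: $regex_hits line(s), wildcard: $glob_hits string(s)"
# prints: regex: 2 line(s), wildcard: 1 string(s)
```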
A regular expression (regex or regexp) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. String searching algorithms typically use this pattern for "find" or "find and replace" operations on strings.
A regular expression is often referred to as a pattern used to describe or match a string conforming to a syntactic rule.
|: Separates alternatives. For example, "boy|girl" matches "boy" or "girl".
+: Matches the preceding pattern element one or more times. For example, "goo+gle" matches "google", "gooogle", "goooogle", and so on.
?: Matches the preceding pattern element zero or one time. For example, "colou?r" matches "color" or "colour".
*: Matches the preceding pattern element zero or more times. For example, "0*42" matches "42", "042", "0042", "00042", and so on.
(): Defines the scope and priority of part of the pattern; the string within the parentheses is treated as a whole. For example, "gr(a|e)y" is equivalent to "gray|grey". Likewise, "(grand)?father" matches "father" and "grandfather".
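These operators can be tried out with grep -E. The sketch below assumes a POSIX shell; count_matches is a hypothetical helper written for this example, and the extra test words are made up. The -x option makes grep require that the pattern cover the entire line, so each count shows how many words the pattern fully matches.

```shell
# count_matches PATTERN WORD...: print how many of the words the pattern matches in full.
count_matches() {
  pattern=$1
  shift
  printf '%s\n' "$@" | grep -cxE "$pattern"
}

alt=$(count_matches 'boy|girl' boy girl man)                            # boy, girl
plus=$(count_matches 'goo+gle' gogle google gooogle)                    # google, gooogle
opt=$(count_matches 'colou?r' color colour colouur)                     # color, colour
star=$(count_matches '0*42' 42 042 0042 142)                            # 42, 042, 0042
group=$(count_matches '(grand)?father' father grandfather grandmother)  # father, grandfather

echo "$alt $plus $opt $star $group"
# prints: 2 2 2 3 2
```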
PCRE (Perl Compatible Regular Expressions) is a regular expression library written in C by Philip Hazel. It is a lightweight library, much smaller than regular expression libraries such as Boost's. PCRE is straightforward to use yet powerful, and it performs better than the POSIX regular expression library and some classic regular expression libraries.
Regular expressions come in many different styles. Here are some commonly used matching rules in the style of the Perl and Python programming languages:
\: Marks the next character as a special or literal character. For example, "n" matches the character "n", "\n" matches a line break, "\\" matches "\", and "\(" matches "(".
^: Matches the start position of the input string.
$: Matches the end position of the input string.
{n}: n is a non-negative integer; matches exactly n times. For example, "o{2}" cannot match the "o" in "Bob", but it matches the two "o" in "food".
{n,}: n is a non-negative integer; matches at least n times. For example, "o{2,}" cannot match the "o" in "Bob", but it matches all the "o" in "fooooood".
{n,m}: m and n are non-negative integers, where n <= m; matches at least n times and at most m times. For example, "o{1,3}" matches the first three "o" in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no spaces between the comma and the two numbers.
*: Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo". * is equivalent to {0,}.
+: Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?: Matches the preceding subexpression zero times or once. For example, "do(es)?" matches "do" and "does". ? is equivalent to {0,1}.
?: When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. The non-greedy pattern matches as short a string as possible, while the default greedy pattern matches as long a string as possible. For example, for the string "oooo", "o+?" matches a single "o", and "o+" matches all the "o".
.: Matches any single character except "\n".
(pattern): Matches the pattern and captures the matching substring.
x|y: Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]: Character class; matches any one of the characters contained in [ ]. For example, "[abc]" matches the "a" in "plain". Inside the brackets only \ keeps a special meaning; other special characters, such as * and +, have their ordinary meanings. If ^ appears in the first position, it denotes a negative character set; anywhere else in the class it is an ordinary character.
[^xyz]: Negative character set; matches any character that is not listed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain".
[a-z]: Matches any character within the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" to "z".
[^a-z]: Matches any character that is not within the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".
The priority decreases from top to bottom and from left to right:
\: Escape character
(), (?:), (?=), []: Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}: Quantifiers (restrictions)
^, $: Anchors (location points)
|: The choice (also known as alternation or set union) operator; matches either the expression before or the expression after the operator.
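The effect of these priorities can be observed with grep -E. This is a rough sketch that assumes a grep supporting the -o option (GNU and BSD grep do; it is not required by POSIX); the sample strings are invented for the illustration.

```shell
# Alternation has the lowest priority: 'ab|cd' means (ab) or (cd), not a(b|c)d.
low=$(printf 'ab\ncd\nabd\nacd\n' | grep -cxE 'ab|cd')        # ab, cd
grouped=$(printf 'ab\ncd\nabd\nacd\n' | grep -cxE 'a(b|c)d')  # abd, acd

# A quantifier binds only to the element right before it:
# 'ab+' repeats "b", while '(ab)+' repeats the whole group "ab".
single=$(printf 'abab\n' | grep -oE 'ab+' | head -n 1)
whole=$(printf 'abab\n' | grep -oE '(ab)+' | head -n 1)

# Interval quantifier: o{2} needs two consecutive "o".
interval=$(printf 'Bob\nfood\n' | grep -E 'o{2}')

echo "$low $grouped $single $whole $interval"
# prints: 2 2 ab abab food
```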
grep
grep prints the lines of its input that match a pattern; the pattern is given as a regular expression. grep supports three regular expression engines, each selected by its own parameter:
Parameter | Description |
---|---|
-E | POSIX extended regular expression, ERE |
-G | POSIX basic regular expression, BRE |
-P | Perl regular expression, PCRE |
In most cases, you will only use ERE and BRE.
Before using grep with regular expressions, let us first introduce some grep parameters:
Parameter | Description |
---|---|
-b | Display the byte offset of each matched line in front of the line. |
-c | Write only a count of the selected lines to standard output. |
-i | Ignore case. |
-n | Display the line number where the matching text is located. |
-v | Select the lines that do not match the specified patterns. |
-r | Recursively search the listed directories and their subdirectories. |
-A n | Print n lines of trailing context after each match. |
-B n | Print n lines of leading context before each match. |
--color=when | Highlight the matching text using the color stored in the GREP_COLOR environment variable. The possible values of when are never, always, and auto. |
Most distributions enable colored grep output by default. You can change the highlight color through the GREP_COLOR environment variable.
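Several of these parameters can be tried on a small throwaway file. The sketch below assumes mktemp is available; the file contents and variable names are made up for the example.

```shell
# Build a small sample file so the options have something to act on.
sample=$(mktemp)
printf 'alpha\nBeta\ngamma\nbeta test\ndelta\n' > "$sample"

count=$(grep -c 'beta' "$sample")       # lines containing "beta" (case-sensitive)
count_i=$(grep -ci 'beta' "$sample")    # the same, ignoring case
numbered=$(grep -n 'gamma' "$sample")   # the match prefixed with its line number
inverted=$(grep -vc 'beta' "$sample")   # lines that do NOT contain "beta"
after=$(grep -A 1 'gamma' "$sample")    # the match plus one line of trailing context

echo "$count $count_i $numbered $inverted"
# prints: 1 2 3:gamma 4
echo "$after"
# prints "gamma" and then "beta test"
rm -f "$sample"
```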
grep
Find the lines in /etc/group that contain "labex", then only the lines that start with "labex":
grep 'labex' /etc/group
grep '^labex' /etc/group
labex:project/ $ grep 'labex' /etc/group
sudo:x:27:labex
ssl-cert:x:121:labex
labex:x:5000:
public:x:5002:labex
labex:project/ $ grep '^labex' /etc/group
labex:x:5000:
Match lines that contain 'z' followed, after any number of characters, by 'o':
echo 'zero\nzo\nzoo' | grep 'z.*o'
Match lines that contain 'z' and 'o' with exactly one arbitrary character between them:
echo 'zero\nzo\nzoo' | grep 'z.o'
Match lines that contain 'z' followed by any number of 'o' (including none):
echo 'zero\nzo\nzoo' | grep 'zo*'
\n is the line break. The zsh shell used in this lab expands it inside echo; in bash, use echo -e or printf instead.
labex:project/ $ echo 'zero\nzo\nzoo' | grep 'z.*o'
zero
zo
zoo
labex:project/ $ echo 'zero\nzo\nzoo' | grep 'zo*'
zero
zo
zoo
By default, grep is case-sensitive. For example, the command below matches only the lines that contain lowercase letters:
echo '1234\nabcd' | grep '[a-z]'
Match all the numbers:
echo '1234\nabcd' | grep '[0-9]'
Match all the numbers using a POSIX character class:
echo '1234\nabcd' | grep '[[:digit:]]'
Match all the lowercase letters:
echo '1234\nabcd' | grep '[[:lower:]]'
Match all the uppercase letters:
echo '1234\nabcd' | grep '[[:upper:]]'
Match all the letters and numbers, including 0-9, a-z, A-Z:
echo '1234\nabcd' | grep '[[:alnum:]]'
Match all the letters:
echo '1234\nabcd' | grep '[[:alpha:]]'
The following table lists the special symbols and their meanings:
Special Symbol | Description |
---|---|
[:alnum:] | Uppercase and lowercase letters and digits (0-9, A-Z, a-z) |
[:alpha:] | Any English uppercase or lowercase letter (A-Z, a-z) |
[:blank:] | Space and Tab |
[:cntrl:] | Control characters, including CR, LF, Tab, Del, and so on |
[:digit:] | Numeral digits (0-9) |
[:graph:] | All printable characters except the space |
[:lower:] | Lowercase letters (a-z) |
[:print:] | Characters that can be printed out, including the space |
[:punct:] | Punctuation symbols (" ' ? ! ; : # ...) |
[:upper:] | Uppercase letters (A-Z) |
[:space:] | Whitespace characters, including space, Tab, CR, and so on |
[:xdigit:] | Hexadecimal digits, including 0-9, A-F, a-f |
Note that [a-z] does not behave the same way everywhere: which characters the range covers depends on the locale selected by the LANG environment variable. [:lower:] can be used in all cases.
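To see which characters a class actually picks out, grep -o can print each match on its own. This is a small sketch, assuming a grep with the -o option; the sample line is invented, and LC_ALL=C pins the locale so the result does not depend on LANG.

```shell
line='Lab 42: ex_AB!'

# -o prints every match on its own line; tr joins the matches back together.
digits=$(printf '%s\n' "$line" | LC_ALL=C grep -o '[[:digit:]]' | tr -d '\n')
uppers=$(printf '%s\n' "$line" | LC_ALL=C grep -o '[[:upper:]]' | tr -d '\n')
puncts=$(printf '%s\n' "$line" | LC_ALL=C grep -o '[[:punct:]]' | tr -d '\n')

echo "digits=$digits uppers=$uppers puncts=$puncts"
# prints: digits=42 uppers=LAB puncts=:_!
```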
Exclude characters:
echo 'geek\ngood' | grep '[^o]'
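Note that when ^ is the first character inside a class (the square brackets around a set of characters), it excludes the listed characters. Outside the brackets, ^ matches the start of a line. A line is printed as soon as any one of its characters matches, which is why "good" still appears in the output: its "g" and "d" are not "o".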
labex:project/ $ echo 'geek\ngood' | grep '[^o]'
geek
good
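The three uses of these characters are easy to confuse, so here is a short sketch contrasting them; the extra input line "ooo" is added only for this example.

```shell
# '[^o]' selects a line as soon as it contains one character other than "o",
# so only the all-"o" line is left out.
any_non_o=$(printf 'geek\ngood\nooo\n' | grep '[^o]' | tr '\n' ' ')

# To drop every line that contains an "o", invert the match with -v instead.
no_o_lines=$(printf 'geek\ngood\nooo\n' | grep -v 'o' | tr '\n' ' ')

# Outside brackets, ^ anchors to the start of the line: count lines starting with "g".
starts_g=$(printf 'geek\ngood\nooo\n' | grep -c '^g')

echo "[$any_non_o] [$no_o_lines] $starts_g"
# prints: [geek good ] [geek ] 2
```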
grep
Using extended regular expressions with grep requires adding the -E parameter or using the egrep command (recent GNU grep releases mark egrep as obsolescent, so -E is the safer choice).
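Match lines that contain "zo" (both "zo" and "zoo" qualify, because grep looks for the pattern anywhere in the line):
echo 'zero\nzo\nzoo' | grep -E 'zo{1}'
Match lines that contain "z" followed by one or more "o":
echo 'zero\nzo\nzoo' | grep -E 'zo{1,}'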
Match "www.labex.io" and "www.google.com":
echo 'www.labex.io\nwww.baidu.com\nwww.google.com' | grep -E 'www\.(labex\.io|google\.com)'
Or match the lines that do not contain "www.baidu.com":
echo 'www.labex.io\nwww.baidu.com\nwww.google.com' | grep -Ev 'www\.baidu\.com'
Since . has a special meaning (it matches any character), we need to write \. to match a literal dot.
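Two details above are worth checking by hand: a pattern is matched anywhere in the line unless told otherwise, and an unescaped dot is more permissive than it looks. The sketch below uses printf for portable line breaks; the string "wwwxlabex-io" is a made-up counterexample.

```shell
# zo{1} selects every line that contains "zo";
# -x restricts the match to lines that are exactly "zo".
contains=$(printf 'zero\nzo\nzoo\n' | grep -cE 'zo{1}')
exact=$(printf 'zero\nzo\nzoo\n' | grep -xE 'zo{1}')

# An unescaped dot matches any character, so it also accepts "wwwxlabex-io".
loose=$(printf 'www.labex.io\nwwwxlabex-io\n' | grep -cE 'www.labex.io')
strict=$(printf 'www.labex.io\nwwwxlabex-io\n' | grep -cE 'www\.labex\.io')

echo "$contains $exact $loose $strict"
# prints: 2 zo 2 1
```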
sed
sed is short for "stream editor"; its manual describes it as a "stream editor for filtering and transforming text".
Linux/UNIX offers many powerful editors, such as vi/vim (nicknamed "the god of editors"), emacs ("the editor of the gods"), and gedit. sed is unique among them in that it is a non-interactive editor. So here we start introducing sed.
sed
sed [Parameters]... [Command] [File]...
## For example:
sed -i '1s/sad/happy/' test
## Replace the "sad" in the first line of the test with "happy"
Parameter | Description |
---|---|
-n | By default, each line of input is echoed to standard output after all the commands have been applied. The -n option suppresses this behavior. |
-e | Append the editing commands specified by the command argument to the list of commands. |
-f filename | Execute the commands stored in the file filename. |
-r | Use extended regular expressions instead of the default basic regular expressions. |
-i | Modify the contents of the input file directly instead of printing to standard output. |
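The options can be explored safely without -i, since sed then leaves the file untouched. This is a minimal sketch on a temporary file, assuming mktemp is available; the file contents and the replacement words are invented for the example.

```shell
f=$(mktemp)
printf 'sad day\nsad night\n' > "$f"

# Without -i, sed writes the edited text to standard output and leaves the file alone.
first_only=$(sed '1s/sad/happy/' "$f")

# -n suppresses automatic printing; the p flag prints only the line that was changed.
changed=$(sed -n '2s/sad/calm/p' "$f")

# Several -e commands are applied in order to each line.
both=$(sed -e '1s/sad/happy/' -e '2s/sad/calm/' "$f")

unchanged=$(cat "$f")
rm -f "$f"

echo "$changed"
# prints: calm night
```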
sed
[n1],[n2]command
[n1]~[step]command
## Some commands accept flags that control their scope, for example:
sed -i 's/sad/happy/g' test ## g replaces every match on each line
sed -i 's/sad/happy/4' test ## 4 replaces only the 4th match on each line
[n1],[n2] selects all lines from n1 to n2. [n1]~[step] selects every step-th line starting at line n1 (this form is a GNU sed extension). command is the command to execute. Here are some commonly used commands:
Command | Description |
---|---|
s | Substitute the text matched by a pattern within a line |
c | Change the selected lines to the new text |
a | Insert the text below the selected line (a=append) |
i | Insert the text above the selected line (i=insert) |
p | Print the selected lines |
d | Delete the selected lines |
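The line-oriented commands in the table can be compared side by side on a three-line input. The sketch below uses the portable form of a, i, and c, in which the command is followed by a backslash and the text starts on the next line; the input text is made up.

```shell
input=$(printf 'one\ntwo\nthree\n')

# d: delete line 2.
deleted=$(printf '%s\n' "$input" | sed '2d')

# a: append text below line 1.
appended=$(printf '%s\n' "$input" | sed '1a\
after one')

# i: insert text above line 1.
inserted=$(printf '%s\n' "$input" | sed '1i\
before one')

# c: replace line 2 with new text.
changed=$(printf '%s\n' "$input" | sed '2c\
TWO')

printf '%s\n' "$changed"
# prints "one", "TWO", "three" on separate lines
```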
First, let's choose a text file for practice:
cp /etc/passwd ~/project
Then print the specified lines:
## Print lines 2 to 5
nl passwd | sed -n '2,5p'
## Print the odd-numbered lines
nl passwd | sed -n '1~2p'
labex:project/ $ nl passwd | sed -n '2,5p'
2 daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
4 sys:x:3:3:sys:/dev:/usr/sbin/nologin
5 sync:x:4:65534:sync:/bin:/bin/sync
labex:project/ $ nl passwd | sed -n '1~2p'
1 root:x:0:0:root:/root:/bin/bash
3 bin:x:2:2:bin:/bin:/usr/sbin/nologin
5 sync:x:4:65534:sync:/bin:/bin/sync
Replace the specified character in a line
## Replace every "labex" in the input text with "hehe" and print only the lines that changed. Because -n suppresses the default output, the trailing "p" flag cannot be omitted here.
sed -n 's/labex/hehe/gp' passwd
Change the selected line to the new text
nl passwd | grep "labex"
## change line 21
sed -n '21c\www.labex.io' passwd
labex:project/ $ sed -n 's/labex/hehe/gp' passwd
hehe:x:5000:5000::/home/hehe:/usr/bin/zsh
labex:project/ $ nl passwd | grep "labex"
32 labex:x:5000:5000::/home/labex:/usr/bin/zsh
labex:project/ $ sed -n '21c\www.labex.io' passwd
www.labex.io
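The flags of the s command decide which occurrences on a line are replaced. A small sketch on a made-up line shows the difference between no flag, g, and a number:

```shell
line='sad sad sad'

first=$(printf '%s\n' "$line" | sed 's/sad/happy/')    # no flag: only the first match
all=$(printf '%s\n' "$line" | sed 's/sad/happy/g')     # g: every match on the line
second=$(printf '%s\n' "$line" | sed 's/sad/happy/2')  # 2: only the second match

printf '%s\n' "$first" "$all" "$second"
# prints:
# happy sad sad
# happy happy happy
# sad happy sad
```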
If you want to learn more about the advanced use of sed, you can explore the sed Reference.
awk
AWK is an excellent text processing tool, one of the most powerful data processing engines available in Linux and Unix environments. It allows you to create short programs that read input files, sort data, process data, perform calculations on input and generate reports, as well as countless other functions. Most simply, AWK is a programming language tool for handling text.
awk
All operations are based on pattern-action statements of the following form:
pattern {action}
As in many programming languages, all the actions are enclosed in {}. pattern is usually a relational expression or a regular expression that selects the input text to act on, and action is what is executed once a match has been made.
An awk statement may contain only one of the two. If there is no pattern, every line of the input matches. If there is no action, the default is to print the matching lines to the screen.
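The three possible shapes of a statement (pattern only, action only, and both) can be sketched as follows. The two input lines are the same sample text used later in this lab; the variable names are only for illustration.

```shell
data=$(printf 'I like linux\nwww.labex.io\n')

# Pattern only: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$data" | awk '/labex/')

# Action only: with no pattern, the action runs for every line (NF is the field count).
action_only=$(printf '%s\n' "$data" | awk '{ print NF }')

# Both: a relational pattern selects line 1, and the action prints its second field.
both=$(printf '%s\n' "$data" | awk 'NR == 1 { print $2 }')

echo "$pattern_only / $both"
# prints: www.labex.io / like
```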
awk
awk [-F fs] [-v var=value] [-f prog-file | 'program text'] [file...]
-F pre-specifies the field delimiter.
-v assigns a value to a variable of the awk program in advance.
-f specifies the program file to be executed by the awk command.
Create a new text document using vim:
vim test
The text should include the following content:
I like linux
www.labex.io
Use awk to print the text:
awk '{print}' test
labex:project/ $ awk '{print}' test
I like linux
www.labex.io
In this operation, we have omitted the pattern. So, awk matches the entire contents of the input text by default.
Print each field of the first line of test on its own line:
awk '{
  if (NR == 1) {
    print $1 "\n" $2 "\n" $3
  } else {
    print
  }
}' test
I
like
linux
www.labex.io
Here, we use the branch statement if. It works much as it does in other high-level programming languages such as C, C++, and Java, so if you have basic knowledge of those languages, the code will be easy to follow.
In addition, pay attention to NR and OFS, two of awk's built-in variables. NR is the number of the record (line) currently being processed. OFS is the output field separator; its default value is a space.
In the code above, the fields are joined with \n (line breaks) rather than with OFS, so each field lands on its own line. $N, where N is the field number, is also built into awk and refers to the corresponding field. The first line has only three fields, hence only $1, $2, and $3 are used. There is also $0, which does not appear here; it refers to the entire contents of the current record (the current line).
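As a sketch of how OFS itself comes into play: when the arguments of print are separated by commas, awk places OFS between them on output. Setting OFS in a BEGIN block is one possible way to get one field per line; it is an alternative to the string concatenation used above, not a description of what that code does.

```shell
# OFS set to a line break: the commas in print become newlines.
fields=$(printf 'I like linux\nwww.labex.io\n' |
  awk 'BEGIN { OFS = "\n" } NR == 1 { print $1, $2, $3 }')

# Default OFS: the comma becomes a single space.
default_ofs=$(printf 'I like linux\n' | awk '{ print $3, $1 }')

# $0 is the whole current record.
whole=$(printf 'I like linux\n' | awk '{ print $0 }')

echo "$default_ofs"
# prints: linux I
```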
Then, split the second line on "." and print its fields separated by tabs:
awk -F'.' '{
  if (NR == 2) {
    print $1 "\t" $2 "\t" $3
  }
}' test
www labex io
As beginners, we should treat awk as a programming language and spread a program over several lines rather than writing all the code on a single line.
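The -F and -v options from the syntax summary can be sketched in the same spirit. The variable name prefix is hypothetical and chosen only for this example.

```shell
# -F sets the field delimiter: split the host name on "." and take the second field.
host=$(printf 'www.labex.io\n' | awk -F '.' '{ print $2 }')

# -v defines an awk variable before the program starts running.
greeting=$(printf 'linux\n' | awk -v prefix='I like' '{ print prefix, $1 }')

echo "$host / $greeting"
# prints: labex / I like linux
```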
awk
Name | Description |
---|---|
FILENAME | The name of the current input file. It is empty when the input comes from standard input. |
$0 | The contents of the current record. |
$N | The N-th field of the current record. The maximum value of N is the value of the NF variable. |
FS | Input field separator, treated as a regular expression; defaults to whitespace. |
RS | Input record separator (default is newline). |
NF | Number of fields in the current record. |
NR | Ordinal number of the current record. |
FNR | Ordinal number of the current record in the current file. |
OFS | Output field separator (default is a space). |
ORS | Output record separator (default is newline). |
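The difference between NR and FNR only shows up with more than one input file. The following sketch, assuming mktemp is available and using made-up file contents, prints NR, FNR, and NF for every record of two temporary files, and then changes ORS.

```shell
a=$(mktemp)
b=$(mktemp)
printf 'x y\nz\n' > "$a"
printf 'p q r\n' > "$b"

# NR keeps counting across files, FNR restarts in each file, NF is the field count.
report=$(awk '{ print NR, FNR, NF }' "$a" "$b")

# ORS replaces the newline that print normally writes after each record.
joined=$(awk 'BEGIN { ORS = ";" } { print }' "$a")

rm -f "$a" "$b"

printf '%s\n' "$report"
# prints:
# 1 1 2
# 2 2 1
# 3 1 3
echo "$joined"
# prints: x y;z;
```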
If you want to know more about awk, look out for our follow-up courses.
In this lab, you learned how to use regular expressions to describe patterns in text. You also learned how to use the grep command to search for those patterns in text files, the sed command to replace them, and the awk command to select lines and fields and print the results.