Sys LW-08EN Regex-Filters
3. REPORT
Write a report on this work and email it to the teacher (use the docx report form).
REPORT FOR LAB WORK 08: UNIX REGULAR EXPRESSIONS AND FILTERS.
Student Name Surname Student ID (nV) Date
a) Write your surname in letters of the English alphabet. It must have at least 7 letters; if it is shorter, add the required number of letters from your name (if that is still not enough, repeat the surname and name).
b) Replace the first 7 letters with their ordinal numbers in the alphabet.
Finding text sometimes requires complex queries that follow a pattern. Many utilities with editing capabilities use a standard set of special characters when searching for a pattern. A pattern containing such special characters is called a regular expression (RE).
The concept arose in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language.
REs let you express searches such as: find all four-letter words starting with d; find all strings containing real numbers; or find all lines starting with a valid IP address.
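As a minimal sketch of the first search mentioned above (four-letter words starting with d), assuming a hypothetical word list with one word per line:

```shell
# Hypothetical word list, one word per line (not part of the lab data)
words='dark
door
dog
drive
desk'

# -E selects extended REs; -x requires the whole line to match.
# d followed by exactly three lowercase letters: prints dark, door, desk.
printf '%s\n' "$words" | grep -Ex 'd[a-z]{3}'
```

The -x option is used here only because each line holds a single word; for words inside longer lines a word-boundary element would be needed instead.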
Components of REs
Meta-sequence - several consecutive characters that together have a special meaning. For example, “\k” or “\.” or “\\”.
Atom - an RE element that has a nonzero width: a symbol, symbol class, symbol group, or meta-sequence forming a symbol. For example, the element "[a-zA-Z0-9]" or “.” is an atom, and “^” is not an atom.
Basic Regular Expression (BRE) - the POSIX standard set of RE components found in any utility / program that works with REs.
Extended Regular Expression (ERE) - additional components that are not present in every utility / program that uses the RE mechanism.
Element Description
c Character. An ordinary, non-special symbol. Matches itself.
c1c2c3 Character sequence. A sequence of consecutive characters that does not form a meta-sequence. Matches itself.
. Dot. Matches any single character.
$ Dollar. End of line. At the end of an RE or sub-RE, it matches the position “end of line”.
^ Caret. Start of line or inversion of a class. At the beginning of an RE or sub-RE, it matches the position “beginning of line”. As the first character in the description of a character class, it inverts that class. Otherwise, it matches itself.
* Star. Multiplier >= 0. Matches zero or more instances of the atom directly in front of it.
\ Backslash. A very powerful symbol. It can cancel the special meaning of any other metacharacter or, conversely, form a meta-sequence together with a suitable character.
[c1c2c3] Character class. A character class specified by enumeration. Inside the class, the action of any metacharacters, except for “^”, “-”, “[”, “:” and “\”, is canceled. Matches one of the characters listed.
[c1-c2] Character class. A character class defined by a character range from c1 to c2. Matches a single character belonging to the given range.
[^c1c2c3] Character class. An inverse character class specified by enumeration. Matches a single character that does not belong to the class [c1c2c3].
[^c1-c2] Character class. An inverse character class specified by a range of characters. Matches a single character that does not belong to the class [c1-c2].
[c1c2-c3c4] Character class. A character class defined in a mixed way (enumeration and range together).
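The basic elements in the table can be tried out directly with grep, which uses BRE by default. The following is a small sketch; the sample lines in the variable sample are an assumption made up for illustration.

```shell
# Hypothetical sample lines, one string per line
sample='cat
cot
coat
c9t
Cut'

# Dot: c, any single character, t (anchored to the whole line)
printf '%s\n' "$sample" | grep '^c.t$'        # cat, cot, c9t

# Enumerated class: only a or o in the middle
printf '%s\n' "$sample" | grep '^c[ao]t$'     # cat, cot

# Inverse range class: the middle character must not be a digit
printf '%s\n' "$sample" | grep '^c[^0-9]t$'   # cat, cot

# Star: zero or more o, then zero or more a
printf '%s\n' "$sample" | grep '^co*a*t$'     # cat, cot, coat
```

The anchors ^ and $ make each pattern match the full string, in line with Remark 1 later in this handout.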
Element Description
\< and \> Word boundaries. Matching positions: beginning of a word “\<”; end of a word “\>”; a whole word is "\<word\>". A "word" here means a maximal sequence of word characters (in GNU grep: letters, digits and the underscore).
\b Word boundary. Matches the position at the edge of a word: between a word character and a non-word character, including at the beginning or end of a line.
\B Non-word boundary. Matches any position that is not a word boundary.
( and ) Buffer grouping. Groups a subexpression and stores the substring it matched in a numbered buffer.
\( and \) The same as the previous one, for BRE mode.
| Bar. A disjunction operator (OR operation) that combines two or more regular subexpressions so that the resulting regular expression matches any string that matches any of the subexpressions.
\| The same as the previous one, for BRE mode.
+ Plus. Multiplier > 0. Matches one or more instances of the atom directly in front of it.
\+ The same as the previous one, for BRE mode.
? Question mark. Optional-element multiplier. Matches zero or one instance of the atom directly in front of it.
\? The same as the previous one, for BRE mode.
{n,m} Universal multiplier. Matches from n to m instances of the atom directly in front of it. There are restrictions on the values of n and m; for example, in perl they do not exceed 65535. Use cases:
{n} - exactly n repetitions of an atom;
{n,} - n or more repetitions of an atom;
{0,} - is equivalent to the multiplier *;
{1,} - is equivalent to the multiplier +;
{0,1} - is equivalent to the multiplier ?.
\{n,m\} The same as the previous one, for BRE mode.
\k Backreference. The operator allows you to access the substring previously stored in the buffer, which coincided with the
subpattern. k is the number of the buffer. In perl k <= 65535, for other programs <= 9.
[[:class:]] Character class. A predefined (named) character class. Matches a single character from the named character class. Supports localization. For example:
[[:alpha:]] - any alphabetic character, i.e. a letter;
[^[:xdigit:]] - any character that is not a hex digit;
[[:blank:]] - space and tab;
[^[:lower:]ABC0-9] - any character that is not a lowercase letter, not A, B or C, and not a digit 0-9.
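The extended elements can be exercised with grep -E. The sketch below uses a made-up word list (the variable words is an assumption) and -x so that each pattern must match the whole line; the last command shows a backreference in BRE mode.

```shell
# Hypothetical test strings, one per line
words='color
colour
colouur
gray
grey
1999
20
abcabc'

# ? : the u is optional
printf '%s\n' "$words" | grep -Ex 'colou?r'          # color, colour

# | inside a group: a or e
printf '%s\n' "$words" | grep -Ex 'gr(a|e)y'         # gray, grey

# Named class with {n}: exactly four digits
printf '%s\n' "$words" | grep -Ex '[[:digit:]]{4}'   # 1999

# + : one or more digits
printf '%s\n' "$words" | grep -Ex '[0-9]+'           # 1999, 20

# BRE grouping and backreference: the same three letters twice
printf '%s\n' "$words" | grep -x '\(abc\)\1'         # abcabc
```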
Remark 1.
In all of the questions below, the question is whether the regular expression matches the full string.
The slash (/) is the delimiter character showing where the regular expression begins and ends.
Strings to be matched start and end with non-blank characters: there are no leading or trailing blanks.
Remark 2. Select the odd-numbered questions for an odd variant number and the even-numbered questions for an even variant number.
The most common use of grep (egrep) is to filter lines of text containing (or not containing) a certain string. The egrep command is extended grep (equivalent to grep -E).
$ cat tennis.txt
Amelie Mauresmo, Fra
Kim Clijsters, BEL
Justine Henin, Bel
Serena Williams, usa
Venus Williams, USA
$ grep Williams tennis.txt
Serena Williams, usa
Venus Williams, USA
One of the most useful options of grep is grep -i, which filters in a case-insensitive way (it ignores letter case).
Another very useful option is grep -v which outputs lines not matching the string.
With grep -A1 one line after the result is also displayed.
With grep -B1 one line before the result is also displayed.
With grep -C1 (context) one line before and one after are also displayed. All three options (A,B, and C) can display any number of lines
(using e.g. A2, B4 or C20).
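The options above can be tried on the tennis.txt data shown earlier. This sketch recreates the file in a scratch directory (the variable names dir and f are assumptions); -A, -B and -C are available in GNU and BSD grep.

```shell
# Recreate the sample file in a temporary directory
dir=$(mktemp -d)
f="$dir/tennis.txt"
cat > "$f" <<'EOF'
Amelie Mauresmo, Fra
Kim Clijsters, BEL
Justine Henin, Bel
Serena Williams, usa
Venus Williams, USA
EOF

grep -i bel "$f"        # case-insensitive: the Clijsters and Henin lines
grep -v Williams "$f"   # lines NOT containing Williams: the first three
grep -A1 Henin "$f"     # the Henin line plus one line after it
grep -B1 Henin "$f"     # the Henin line plus one line before it
grep -C1 Henin "$f"     # one line of context on each side
```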
The shell script below prints the lines of the file under test that contain syntactically valid IPv4 addresses from 0.0.0.0 to 255.255.255.255, with possible leading zeros in each octet. The test file is passed as the script parameter $1.
$ ./egrep-script test-file
#!/bin/sh
# Print the lines of the file given as $1 that contain a valid IPv4 address.
egrep "\<\
([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])\.\
([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])\.\
([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])\.\
([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])\
\>" "$1"
Here each line ends with the escape character “\”, which escapes the line feed character for the shell. This escaping is processed by the shell and lets you spread the RE over several lines, which makes it more readable.
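One way to test such a pattern without a separate test file is to feed it a few lines on standard input. The sketch below is a variant, not the script above: it builds the same RE from a shell variable (octet and ipv4 are names assumed for illustration) and uses grep -E, which is equivalent to egrep.

```shell
# One octet 0-255 with optional leading zeros (same alternatives as the script)
octet='([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])'

# Four octets separated by literal dots, between word boundaries.
# Inside double quotes, \\ yields a single backslash for grep.
ipv4="\\<$octet\\.$octet\\.$octet\\.$octet\\>"

# Hypothetical test lines: the first and third contain a valid address
printf '%s\n' \
  'host 192.168.1.10 up' \
  'bad 256.1.1.1' \
  'zeros 010.000.001.255' \
  'short 1.2.3' | grep -E "$ipv4"
```

Note that the word boundaries treat the dot as a non-word character, so a string of five dotted numbers such as 1.2.3.4.5 would still produce a match on its first four fields.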
Select your sub-task Nr = (your Variant Nr) mod 4 + 1. Example for Var.Nr=104 → 104 mod 4 + 1 = 0 + 1 = 1.
0. Use the egrep command to create and test REs that find a valid time in the format Mm.Ss (00.00 – 59.59).
Remark. Create a pattern with optional leading zeros in each field (for example, 59.01 or 59.1 or 01.00 or 01.0 or 1.0).
1. Use the egrep command to create and test REs that find a valid 24-hour clock time in the format Hh:Mm:Ss (00:00:00 – 23:59:59).
Remark. Create a pattern with required leading zeros in each field (for example, 23:00:07).
2. Use the egrep command to create and test REs that find a valid 12-hour clock time in the format Hh:Mm.Ss (00:00.00 – 11:59.59).
Remark. Create a pattern with required leading zeros in each field (for example, 11:00.07).
3. Use the egrep command to create and test REs that find a valid 24-hour clock time in the format Hh:Mm (00:00 - 23:59).
Remark. Create a pattern with optional leading zeros in each field (for example, 23:01 or 23:1 or 03:00 or 03:0 or 3:0).
4. Use the egrep command to create and test REs that find a valid 12-hour clock time in the format Hh:Mm (00:00 - 11:59).
Remark. Create a pattern with optional leading zeros in each field (for example, 11:01 or 11:1 or 01:00 or 01:0 or 1:0).
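The difference between "required" and "optional" leading zeros in the remarks above can be sketched on a single 00-59 field. This is only a building block under that assumption, not a solution to any sub-task; the variable names strict and loose are made up for illustration.

```shell
# Field 00-59 with a required leading zero: always two digits
strict='[0-5][0-9]'

# Same range with an optional leading zero: the tens digit may be omitted
loose='[0-5]?[0-9]'

# -x makes the field match the whole test line
printf '%s\n' 07 7 59 60 | grep -Ex "$strict"   # 07, 59
printf '%s\n' 07 7 59 60 | grep -Ex "$loose"    # 07, 7, 59
```

Testing each field on its own like this, before joining fields with the separator, makes it easier to see which part of a longer RE rejects a value.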
Select your sub-task Nr = (your Variant Nr) mod 5 + 1. Example for Var.Nr=104 → 104 mod 5 + 1 = 4 + 1 = 5.
0. Use the egrep command to create and test REs that find a valid IPv4 address in a network of any class (0.0.0.0-255.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet. (See the solution example above.)
1. Use the egrep command to create and test REs that find a valid IPv4 address that matches all Class A networks (1.0.0.0-126.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet.
2. Use the egrep command to create and test REs that find a valid IPv4 address that matches all Class B networks (128.0.0.0-191.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet.
3. Use the egrep command to create and test REs that find a valid IPv4 address that matches all Class C networks (192.0.0.0-223.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet.
4. Use the egrep command to create and test REs that find a valid IPv4 address that matches all Class D networks (224.0.0.0-239.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet.
5. Use the egrep command to create and test REs that find a valid IPv4 address that matches all Class E networks (240.0.0.0-254.255.255.255).
Remark. Create a pattern with optional leading zeros in each octet.
Select your sub-task Nr = (your Variant Nr) mod 6 + 1. Example for Var.Nr=104 → 104 mod 6 + 1 = 2 + 1 = 3.
1. Use the egrep command to create and test REs that find a valid date in the format Dd/Mm/YYyy (21/03/2019).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
2. Use the egrep command to create and test REs that find a valid date in the format YYyy.Mm.Dd (2019.03.21).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
3. Use the egrep command to create and test REs that find a valid date in the format Mm.Dd.yy (03.21.19).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
4. Use the egrep command to create and test REs that find a valid date in the format yy.Mm.Dd (19.03.21).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
5. Use the egrep command to create and test REs that find a valid date in the format YYyyMmDd (20190321).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
6. Use the egrep command to create and test REs that find a valid date in the format DdMmyy (210319).
Remark. Treat leap years as having 365 days (February always has 28 days). Accept only years between 1000 and 9999. Create a pattern with required leading zeros in each field.
Select your sub-task Nr = (your Variant Nr) mod 7 + 1. Example for Var.Nr=104 → 104 mod 7 + 1 = 6 + 1 = 7.
1. Use the egrep command to create and test REs that find valid new Visa card numbers, which start with a 4 and have 16 digits. Visa puts digits in sets of 4.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
2. Use the egrep command to create and test REs that find valid American Express card numbers, which start with 34 or 37 and have 15 digits. Amex uses groups of 4-6-5 digits.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
3. Use the egrep command to create and test REs that find valid Diners Club card numbers, which begin with 300 through 305, or 36, or 38. All have 14 digits. Diners Club uses groups of 4-6-4 digits.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
4. Use the egrep command to create and test REs that find valid Discover card numbers, which begin with 6011 or 65. All have 16 digits. Discover puts digits in sets of 4.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
5. Use the egrep command to create and test REs that find valid JCB card numbers. JCB cards beginning with 2131 or 1800 have 15 digits; JCB cards beginning with 35 have 16 digits. JCB puts digits in sets of 4.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
6. Use the egrep command to create and test REs that find valid MasterCard numbers, which start either with the numbers 51 through 55 or with the numbers 2221 through 2720. All have 16 digits. MasterCard puts digits in sets of 4.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
7. Use the egrep command to create and test REs that find valid Universal Electronic Card (UEC) numbers, which start with the digit 7. All have 16 digits. UEC puts digits in sets of 4.
Remark. Create a pattern with optional spaces (" ") or dashes ("-") in card numbers.
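The "optional spaces or dashes" requirement in the remarks can be expressed with a small character class followed by ?. The sketch below matches any 16 digits in four sets of 4 and deliberately ignores the issuer prefixes, so it is a building block rather than a solution; sep, group and pattern are names assumed for illustration.

```shell
# Optional separator: a space or a dash, or nothing at all
sep='[ -]?'

# One set of exactly four digits
group='[0-9]{4}'

# Four sets with an optional separator between them
pattern="$group$sep$group$sep$group$sep$group"

# Hypothetical test numbers: the first three lines match
printf '%s\n' \
  '1234 5678 9012 3456' \
  '1234-5678-9012-3456' \
  '1234567890123456' \
  '1234_5678_9012_3456' \
  '1234 5678 9012' | grep -Ex "$pattern"
```

This form also accepts mixed separators such as 1234 5678-9012 3456; whether that should count as valid is a design decision for the sub-task.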
Commands that are created to be used with a pipe are often called filters. These filters are very small programs that do one specific thing
very efficiently. They can be used as building blocks. The combination of simple commands and filters in a long pipe allows you to design
elegant solutions.
cat, tac
When placed between two pipes, the cat command does nothing (except copying stdin to stdout). The tac command is cat in reverse: it outputs the lines from last to first.
tee
Writing long pipes in Unix is fun, but sometimes you may want intermediate results. This is where tee comes in handy. The tee filter puts
stdin on stdout and also into a file. So tee is almost the same as cat, except that it has two or more identical outputs.
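A short sketch of tee inside a pipe, assuming a scratch directory created with mktemp (the names dir and all.txt are made up for illustration):

```shell
# Scratch directory for the intermediate result
dir=$(mktemp -d)

# tee saves a copy of the stream in all.txt and passes it on unchanged;
# grep -c then counts the lines containing the letter o (prints 2).
printf '%s\n' one two three | tee "$dir/all.txt" | grep -c o

# The saved copy still holds all three lines
cat "$dir/all.txt"
```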
The cut filter can select columns from files, depending on a delimiter or a count of bytes. For example, cut can filter for the username and userid in the /etc/passwd file by using the colon as a delimiter and selecting fields 1 and 3.
When using a space as the delimiter for cut, you have to quote the space.
cut can also select by position, for example to display the second to the seventh character of each line of /etc/passwd.
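The three uses of cut described above can be sketched as follows. To keep the example independent of the local system, it uses two made-up lines in /etc/passwd format (the variable sample is an assumption) instead of the real file.

```shell
# Two hypothetical lines in /etc/passwd format
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# Colon as delimiter, fields 1 and 3: username and userid
printf '%s\n' "$sample" | cut -d: -f1,3          # root:0 and daemon:1

# A space as delimiter has to be quoted
printf '%s\n' 'one two three' | cut -d' ' -f2    # two

# Characters 2 to 7 of each line
printf '%s\n' "$sample" | cut -c2-7              # oot:x: and aemon:
```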
You can translate characters with tr, for example translating all occurrences of e to E.
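A minimal sketch of the e to E translation, using one line of the tennis data as input; the second command shows the same idea with character ranges.

```shell
# Translate every e to E
printf '%s\n' 'Justine Henin, Bel' | tr e E     # JustinE HEnin, BEl

# Translate a whole range: lowercase to uppercase
printf '%s\n' 'hello' | tr 'a-z' 'A-Z'          # HELLO
```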
$ cat count.txt
one
two
three
four
five
$ cat count.txt | tr '\n' ' '
one two three four five
$
$ cat spaces.txt
one two three
four five six
$ cat spaces.txt | tr -s ' '
one two three
four five six
$
$ cat tennis.txt | tr -d e
Amli Maursmo, Fra
Kim Clijstrs, BEL
Justin Hnin, Bl
Srna Williams, usa
Vnus Williams, USA
$
$ wc tennis.txt
5 15 100 tennis.txt
$
$ wc -l tennis.txt
5 tennis.txt
$
$ wc -w tennis.txt
15 tennis.txt
$
$ wc -c tennis.txt
100 tennis.txt
$
sort
$ cat music.txt
Queen
Brel
Led Zeppelin
Abba
$
$ sort music.txt
Abba
Brel
Led Zeppelin
Queen
$
An alphabetical sort and a numerical sort on the same column can give different results (for example, sort -k3 versus sort -n -k3 on the third column).
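The difference can be sketched with a small three-column list. The data below is invented for illustration (the variable data is an assumption); only the ordering of the third column matters.

```shell
# Hypothetical three-column data; the third column is a number
data='Belgium, Brussels, 10
Germany, Berlin, 100
France, Paris, 9'

# Alphabetical sort on the third column: 10, 100, 9
printf '%s\n' "$data" | sort -k3

# Numerical sort on the third column: 9, 10, 100
printf '%s\n' "$data" | sort -n -k3
```

In the alphabetical sort, 9 comes last because the character 9 sorts after the character 1; with -n the values are compared as numbers.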
$ cat music.txt
Queen
Brel
Queen
Abba
$ sort music.txt
Abba
Brel
Queen
Queen
$ sort music.txt |uniq
Abba
Brel
Queen
Comparing streams (or files) can be done with comm. By default comm outputs three columns: lines only in the first file, lines only in the second file, and lines in both. In this example, Bowie and Sweet are only in the first file, Turner is only in the second, and Abba, Cure and Queen are in both lists.
The output of comm can be easier to read when only a single column is output. The digits given as options indicate which output columns should not be displayed.
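The comparison described above can be reproduced with two small sorted files. The file names list1.txt and list2.txt and the scratch directory are assumptions; the contents follow the names mentioned in the text.

```shell
# Two sorted lists in a scratch directory
dir=$(mktemp -d)
printf '%s\n' Abba Bowie Cure Queen Sweet > "$dir/list1.txt"
printf '%s\n' Abba Cure Queen Turner      > "$dir/list2.txt"

# Three columns: only in list1, only in list2, in both
comm "$dir/list1.txt" "$dir/list2.txt"

# Suppress columns 1 and 2: lines in both files (Abba, Cure, Queen)
comm -12 "$dir/list1.txt" "$dir/list2.txt"

# Suppress columns 2 and 3: lines only in list1 (Bowie, Sweet)
comm -23 "$dir/list1.txt" "$dir/list2.txt"

# Suppress columns 1 and 3: lines only in list2 (Turner)
comm -13 "$dir/list1.txt" "$dir/list2.txt"
```

comm expects both inputs to be sorted, which is why it is usually combined with sort in a pipe.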
Humans like to work with ASCII characters, but computers store files in bytes. The od command shows the contents of a file as bytes, for example in hexadecimal (od -t x1), in octal (od -b), or as ASCII characters (od -c).
$ od -b text.txt
0000000 141 142 143 144 145 146 147 012 061 062 063 064 065 066 067 012
0000020
$ od -c text.txt
0000000 a b c d e f g \n 1 2 3 4 5 6 7 \n
0000020
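For the hexadecimal view, od accepts the output type x1 (one byte per value in hex). The sketch below pipes the same 16 bytes into od instead of reading text.txt, and uses -An to suppress the offset column.

```shell
# Same content as text.txt above: two lines of 7 characters plus newlines
printf 'abcdefg\n1234567\n' | od -An -t x1
# Hex values: 61..67 for a..g, 0a for newline, 31..37 for 1..7, 0a
```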
The stream editor sed can perform editing functions in the stream, using regular expressions.
Add g for global replacements (all occurrences of the string per line).
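A minimal sketch of a sed substitution with and without the g flag; the input string is made up for illustration.

```shell
# Without g: only the first occurrence on each line is replaced
echo 'level 5, level 6' | sed 's/level/jump/'     # jump 5, level 6

# With g: all occurrences on each line are replaced
echo 'level 5, level 6' | sed 's/level/jump/g'    # jump 5, jump 6
```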
$ cat tennis.txt
Venus Williams, USA
Martina Hingis, SUI
Justine Henin, BE
Serena williams, USA
Kim Clijsters, BE
Yanina Wickmayer, BE
$ cat tennis.txt | sed '/BE/d'
Venus Williams, USA
Martina Hingis, SUI
Serena williams, USA
4.4.1. Put a sorted list of all bash users (from /etc/passwd file) in bashusers.txt file.
4.4.3. Make a list of all filenames in /etc/ directory that contain the string conf in their filename.
4.4.4. Make a sorted list of all files in the /etc/ directory that contain the string conf in their filename, ignoring letter case.
4.4.5. Look at the output of /sbin/ifconfig. Write a line that displays only the IP address and the subnet mask.
$ /sbin/ifconfig | head -2 | grep 'inet ' | tr -s ' ' | cut -d' ' -f3,5
$ cat text1
This is, yes really! , a text with ?&* too many str$ange# characters ;-)
$ cat text1 | tr -d ',!$?.*&^%#@;()-'
This is yes really a text with too many strange characters
© Yuriy Shamshin, 2021
4.4.7. Write a line that receives a text file, and outputs all words on a separate line.
$ cat text2
it is very cold today without the sun
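One possible approach to 4.4.7, given here as a sketch rather than the intended answer, is to translate each space into a newline with tr. The -s option squeezes repeated newlines so that several spaces in a row do not produce empty lines.

```shell
# The sample sentence from text2, fed on standard input
printf '%s\n' 'it is very cold today without the sun' | tr -s ' ' '\n'
# Prints the eight words, one per line
```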
4.4.8. Write a spell checker on the command line. (There may be a dictionary in /usr/share/dict/ .)
$ cat text3 | tr 'A-Z ' 'a-z\n' | sort | uniq | comm -23 - DICT
zun
4.4.9. Here’s a way to get a sorted list of the unique file extensions in the current directory, with a count of each type.
The output shows the list of file extensions, sorted alphabetically with a count of each unique type.
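The command itself is not shown above, so the following is only one possible pipeline under stated assumptions: it works on a scratch directory with made-up file names instead of the current directory, and it treats everything after the last dot as the extension.

```shell
# Scratch directory with a few hypothetical files
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt" "$dir/c.sh" "$dir/README"

# Keep names containing a dot, strip everything up to the last dot,
# sort the extensions, then count each unique one.
ls "$dir" | grep '\.' | sed 's/.*\.//' | sort | uniq -c
# Counts: 1 sh, 2 txt (README has no extension and is skipped)
```

uniq -c only merges adjacent duplicates, which is why the list is sorted first.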