
Python REGEX

A Little Guide

Scientific Programmer
This book is for sale at http://leanpub.com/pythonregex

This version was published on 2018-09-09

This is a Leanpub book. Leanpub empowers authors and publishers
with the Lean Publishing process. Lean Publishing is the act of
publishing an in-progress ebook using lightweight tools and many
iterations to get reader feedback, pivot until you have the right
book and build traction once you do.

© 2018 Scientific Programmer


Contents

Python RegEx
Python regex match function
Python regex search function
Python regex match vs. search functions
Python regex group functions
Python regex sub function for search and replace
Python regex split function
Python regex findall function
Python regex compile function
Python regex finditer function
Python Regex - Lookarounds and Greedy Search
Python Lookahead
Python Look behind
Python Lazy and Greedy Search
Project 2: Parsing data from a HTML file with Python and REGEX
Project 3: PDF scraping in Python + REGEX
Project 4: Web scraping in Python + REGEX
Project 5: Amazon web crawling in Python + REGEX
Quiz # 1 - REGEX Patterns
Quiz # 2 Python REGEX Functions
References

Python RegEx
Hello coders! Let’s start our quest with regular expressions (RegEx).
In Python, the re module provides full support for Perl-like regular
expressions. Many characters, the backslash above all, have a special
meaning both in Python string literals and in regular expressions. To
avoid bugs caused by this double interpretation, we write patterns as
raw strings: r'expression'.
The re module provides several functions for querying an input
string. Here are the most commonly used ones:

• re.match()
• re.search()
• re.split()
• re.sub()
• re.findall()
• re.compile()

We will look at these functions and their related flags, with
examples, in the next sections.
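To see why raw strings matter, here is a minimal sketch (the sample text and variable names are made up for illustration). In a normal string literal "\b" is the backspace character, so the word-boundary marker never reaches the regex engine:

```python
import re

text = "one two three"

# In a normal string literal, "\b" is the backspace character (\x08),
# so this pattern looks for backspace + "two" + backspace.
plain = re.search("\btwo\b", text)

# In a raw string the backslash is passed through to the regex engine,
# where \b means "word boundary".
raw = re.search(r"\btwo\b", text)

print(plain)        # None
print(raw.group())  # two
```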

Python Regular Expression Patterns List

The following table lists the regular expression syntax that is
available in Python. Note that regular expressions can be concatenated
to form new regular expressions; if X and Y are both regular
expressions, then XY is also a regular expression.

Pattern    Description
.          Matches any single character except newline. With the
           re.S (re.DOTALL) flag it matches newline as well.
^          Matches the start of the string, and in re.MULTILINE mode
           (flags are described in the next section) also matches
           immediately after each newline.
$          Matches the end of the string or just before a newline at
           the end of the string. In re.MULTILINE mode it also
           matches before each newline.
[...]      Matches any single character in the brackets.
[^...]     Matches any single character not in the brackets.
*          Matches 0 or more occurrences of the preceding expression.
+          Matches 1 or more occurrences of the preceding expression.
?          Matches 0 or 1 occurrence of the preceding expression.
{n}        Matches exactly n occurrences of the preceding expression.
{n,}       Matches n or more occurrences of the preceding expression.
{n,m}      Matches at least n and at most m occurrences of the
           preceding expression. For example, x{3,5} will match from
           3 to 5 'x' characters.
x|y        Matches either x or y.
\d         Matches any decimal digit. Equivalent to [0-9] when the
           re.ASCII flag is set.
\D         Matches nondigits.
\w         Matches word characters (letters, digits and underscore).
\W         Matches nonword characters.
\z         Perl-style end-of-string anchor. Most Python versions
           reject it in the re module; use \Z instead.
\G         Perl-style anchor for the point where the last match
           finished. Python's re module does not support it.
\b         Matches the empty string, but only at the beginning or
           end of a word, i.e. at the boundary between a word and a
           non-word character. \B is the opposite of \b. Example:
           r"\btwo\b" for searching two in 'one two three'.
\B         Matches nonword boundaries.
\n, \t     Match a newline and a tab; other escapes such as \r
           (carriage return) work the same way.
\s         Matches whitespace.
\S         Matches nonwhitespace.
\A         Matches the beginning of the string.
\Z         Matches only at the end of the string.
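A few of the patterns from the table can be tried out directly. The following is only a small sketch; the sample text and variable names are invented for illustration:

```python
import re

text = "cat 7 cats, 42 dogs\nbird 100"

# \d+ : one or more digits
numbers = re.findall(r"\d+", text)               # ['7', '42', '100']

# \bcat\b : the whole word "cat" only, not "cats"
whole_word = re.findall(r"\bcat\b", text)        # ['cat']

# o{2,3} : between two and three repetitions of "o"
runs = re.findall(r"o{2,3}", "o oo ooo oooo")    # ['oo', 'ooo', 'ooo']

# ^ matches only at the start of the string unless re.MULTILINE is set
first_word = re.findall(r"^\w+", text)                  # ['cat']
line_starts = re.findall(r"^\w+", text, re.MULTILINE)   # ['cat', 'bird']

print(numbers, whole_word, runs, first_word, line_starts)
```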

Groups and Lookarounds

These constructs are covered in more detail later:

Pattern    Description
(re)       Groups regular expressions and remembers matched text.
(?:re)     Groups regular expressions without remembering matched
           text. For example, the expression (?:x{6})* matches any
           multiple of six ‘x’ characters.
(?#...)    Comment.
(?=...)    Matches if ... matches next, but doesn’t consume any of
           the string. This is called a lookahead assertion. For
           example, Scientific (?=Python) will match Scientific
           only if it’s followed by Python.
(?!...)    Matches if ... doesn’t match next. This is a negative
           lookahead assertion.
(?<=...)   Matches if the current position in the string is preceded
           by a match for ... that ends at the current position.
           This is a positive lookbehind assertion.
(?<!...)   Matches if the current position in the string is not
           preceded by a match for .... This is a negative
           lookbehind assertion.

Python regex match function


The match function attempts to match a regular expression pattern at
the beginning of a string, with optional flags.
Here is the syntax for this function:

    re.match(pattern, string, flags=0)

Where:

• pattern is the regular expression to be matched,
• string is the string that is checked for a match at its beginning,
  and
• flags is an optional modifier; several flags can be combined using
  bitwise OR (|).

Match Flags

Modifier   Description
re.I       Performs case-insensitive matching.
re.L       Interprets words according to the current locale. This
           interpretation affects the alphabetic group (\w and \W),
           as well as word boundary behavior (\b and \B). In
           Python 3 it can only be used with bytes patterns.
re.M       Makes $ match the end of a line and makes ^ match the
           start of any line.
re.S       Makes a period (dot) match any character, including a
           newline.
re.U       Interprets letters according to the Unicode character
           set. This flag affects the behavior of \w, \W, \b, \B.
           In Python 3 it is already the default for str patterns.
re.X       Ignores whitespace in the pattern (except inside a set []
           or when escaped by a backslash) and treats an unescaped #
           as a comment marker.
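As a rough illustration of how flags change a match and how they combine with bitwise OR, consider the sketch below (the sample text and the variable names are assumptions made for this example):

```python
import re

text = "Python\npython rocks\nPYTHON"

# No flags: case-sensitive, and ^ only matches at the start of the string.
no_flags = re.findall(r"^python", text)            # []

# re.I alone: case-insensitive, but still only the start of the string.
ignore_case = re.findall(r"^python", text, re.I)   # ['Python']

# re.I | re.M: case-insensitive, and ^ matches at the start of every line.
both = re.findall(r"^python", text, re.I | re.M)   # ['Python', 'python', 'PYTHON']

# re.X lets the pattern be spread over several lines and commented.
number = re.compile(r"""
    (\d+)    # integer part
    \.       # literal dot
    (\d+)    # fractional part
""", re.X)
parts = number.match("124.13").groups()            # ('124', '13')

print(no_flags, ignore_case, both, parts)
```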

Return values

• The re.match function returns a match object on success and None
  on failure.
• Use the group(n) method of the match object to get the matched
  text: group() or group(0) returns the entire match, and group(n)
  returns subgroup n.
• The groups() method returns all subgroups in a tuple (empty if the
  pattern has no groups).

Example 1

Let’s find the words before and after the word to:

    #!/usr/bin/env python3
    import re

    line = "Learn to Analyze Data with Scientific Python"

    m = re.match(r'(.*) to (.*?) .*', line, re.M | re.I)

    if m:
        print("m.group() :", m.group())
        print("m.group(1) :", m.group(1))
        print("m.group(2) :", m.group(2))
    else:
        print("No match!!")

The first group (.*) captured the string Learn and the second group
(.*?) captured the string Analyze. Output:

    m.group() : Learn to Analyze Data with Scientific Python
    m.group(1) : Learn
    m.group(2) : Analyze

Example 2

groups([default]) returns a tuple containing all the subgroups of
the match, from 1 up to however many groups are in the pattern.

    #!/usr/bin/env python3
    import re

    line = "Learn Data, Python"

    m = re.match(r'(\w+) (\w+)', line, re.M | re.I)

    if m:
        print("m.groups() :", m.groups())
        print("m.group(1, 2) :", m.group(1, 2))
    else:
        print("No match!!")

Output:

    m.groups() : ('Learn', 'Data')
    m.group(1, 2) : ('Learn', 'Data')

Example 3

groupdict([default]) returns a dictionary containing all the named
subgroups of the match, keyed by the subgroup name.

    #!/usr/bin/env python3
    import re

    number = "124.13"

    m = re.match(r'(?P<Integer>\d+)\.(?P<Fraction>\d+)', number)

    if m:
        print("m.groupdict() :", m.groupdict())
    else:
        print("No match!!")

Output:

    m.groupdict() : {'Integer': '124', 'Fraction': '13'}

Example 4

Start, end. How can we match the start or end of a string? We can
use the \A and \Z metacharacters (the letters A and Z preceded by a
backslash). Here we match strings that start with a certain letter,
and those that end with another.

    import re

    values = ["Learn", "Live", "Python"]

    for value in values:
        # Match the start of a string.
        result = re.match(r"\AL.+", value)
        if result:
            print("START MATCH [L]:", value)

        # Match the end of a string.
        result2 = re.match(r".+n\Z", value)
        if result2:
            print("END MATCH [n]:", value)

Output:

    START MATCH [L]: Learn
    END MATCH [n]: Learn
    START MATCH [L]: Live
    END MATCH [n]: Python

Example 5

start([group]) and end([group]) return the indices of the start and
end of the substring matched by group. See the next lesson for an
example.
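As a quick preview, here is a minimal sketch of start(), end() and the related span() method. The sample text and variable names are invented for this illustration:

```python
import re

text = "Learn Scientific Python"
m = re.search(r"(\w+) (Python)", text)

# Group 0 is the whole match; groups 1 and 2 are the parenthesised parts.
whole = (m.start(), m.end())     # (6, 23)
first = (m.start(1), m.end(1))   # (6, 16)
second = m.span(2)               # (17, 23), same as (m.start(2), m.end(2))

# The indices can be used to slice the original string.
print(text[m.start(1):m.end(1)])  # Scientific
```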

Python regex search function

The Search Function


The search function searches for the first occurrence of a regular
expression pattern anywhere in a string, with optional flags.
Here is the syntax for this function:

    re.search(pattern, string, flags=0)

Where:

• pattern is the regular expression to be matched,
• string is the string to be searched for the pattern, at any
  position, and
• flags is an optional modifier; several flags can be combined using
  bitwise OR (|).

Match Flags

Modifier   Description
re.I       Performs case-insensitive matching.
re.L       Interprets words according to the current locale. This
           interpretation affects the alphabetic group (\w and \W),
           as well as word boundary behavior (\b and \B). In
           Python 3 it can only be used with bytes patterns.
re.M       Makes $ match the end of a line and makes ^ match the
           start of any line.
re.S       Makes a period (dot) match any character, including a
           newline.
re.U       Interprets letters according to the Unicode character
           set. This flag affects the behavior of \w, \W, \b, \B.
           In Python 3 it is already the default for str patterns.
re.X       Ignores whitespace in the pattern (except inside a set []
           or when escaped by a backslash) and treats an unescaped #
           as a comment marker.

Return values

• The re.search function returns a match object on success and None
  on failure.
• Use the group(n) method of the match object to get the matched
  text: group() or group(0) returns the entire match, and group(n)
  returns subgroup n.
• The groups() method returns all subgroups in a tuple (empty if the
  pattern has no groups).

Example 1

Let’s find the words before and after the word to:

    #!/usr/bin/env python3
    import re

    line = "Learn to Analyze Data with Scientific Python"

    m = re.search(r'(.*) to (.*?) .*', line, re.M | re.I)

    if m:
        print("m.group() :", m.group())
        print("m.group(1) :", m.group(1))
        print("m.group(2) :", m.group(2))
    else:
        print("No match!!")

Output

    m.group() : Learn to Analyze Data with Scientific Python
    m.group(1) : Learn
    m.group(2) : Analyze

The first group (.*) captured the string Learn and the second group
(.*?) captured the string Analyze.

Example 2

groups([default]) returns a tuple containing all the subgroups of
the match, from 1 up to however many groups are in the pattern.

    #!/usr/bin/env python3
    import re

    line = "Learn Data, Python"

    m = re.search(r'(\w+) (\w+)', line, re.M | re.I)

    if m:
        print("m.groups() :", m.groups())
        print("m.group(1, 2) :", m.group(1, 2))
    else:
        print("No match!!")

Output

    m.groups() : ('Learn', 'Data')
    m.group(1, 2) : ('Learn', 'Data')

Example 3

groupdict([default]) returns a dictionary containing all the named
subgroups of the match, keyed by the subgroup name.

    #!/usr/bin/env python3
    import re

    number = "124.13"

    m = re.search(r'(?P<Integer>\d+)\.(?P<Fraction>\d+)', number)

    if m:
        print("m.groupdict() :", m.groupdict())
    else:
        print("No match!!")

Output

    m.groupdict() : {'Integer': '124', 'Fraction': '13'}

Python regex match vs. search functions

We have learned so far that Python offers two different primitive
operations:

• match
• search

So, how are they different from each other?

Note that match checks for a match only at the beginning of a string,
while search checks for a match anywhere in the string!

Example 1
Let’s try to find the word Python:

    #!/usr/bin/env python3
    import re

    line = "Learn to Analyze Data with Scientific Python"

    m = re.search(r'(python)', line, re.M | re.I)

    if m:
        print("m.group() :", m.group())
    else:
        print("No match by re.search!!")

    m = re.match(r'(python)', line, re.M | re.I)

    if m:
        print("m.group() :", m.group())
    else:
        print("No match by re.match")

Output

    m.group() : Python
    No match by re.match

As you can see, the match function won’t find the word “Python”,
but search can! Also note the use of the re.I (case-insensitive)
option.

Example 2:
start([group]) and end([group]) return the indices of the start
and end of the substring matched by group. We need to use search
instead of match for this example:

    #!/usr/bin/env python3
    import re

    email = "hello@leremove_thisarntoanalayzedata.com"

    # m = re.match("remove_this", email)  # This will not work!
    m = re.search("remove_this", email)

    if m:
        print("email address :", email[:m.start()] + email[m.end():])
    else:
        print("No match!!")

Output

    email address : hello@learntoanalayzedata.com

Python regex group functions


A regular expression can have named groups. This makes it easier
to retrieve those groups after calling match(), at the cost of a
more complex pattern.
The following example shows two named groups (first and last).

    #!/usr/bin/env python3
    import re

    # A string.
    name = "Learn Scientific"

    # Match with named groups.
    m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

    # Print groups using names as id.
    if m:
        print(m.group("first"))
        print(m.group("last"))

Output:

    Learn
    Scientific

We can get the first name with the string “first” and the group()
method. We use “last” for the last name.

Group dictionary (groupdict)

A regular expression with named groups can fill a dictionary. This
is done with the groupdict() method. In the dictionary, each group
name is a key and each value is the data matched by that group. So
we receive a key-value store based on groups.

    import re

    name = "Scientific Python"

    # Match names.
    m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

    if m:
        # Get dict.
        d = m.groupdict()

        # Loop over dictionary with for-loop.
        for t in d:
            print("  key:", t)
            print("value:", d[t])

Output

      key: first
    value: Scientific
      key: last
    value: Python

Python regex sub function for search and replace

Python search and replace

The sub() function replaces every occurrence of a pattern with a
string or the result of a function.

Syntax

    re.sub(pattern, repl, string, count=0, flags=0)

This function replaces all occurrences of the pattern in string with
repl, unless a nonzero count limits the number of replacements. It
returns the modified string.
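The effect of count, and of group references inside repl, can be sketched as follows (the date string and variable names are assumptions chosen for this illustration):

```python
import re

text = "2018-09-09 and 2019-01-15"

# Replace every run of digits.
all_replaced = re.sub(r"\d+", "#", text)         # '#-#-# and #-#-#'

# count=2 stops after the first two replacements.
first_two = re.sub(r"\d+", "#", text, count=2)   # '#-#-09 and 2019-01-15'

# \1, \2 and \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
# '09/09/2018 and 15/01/2019'

print(all_replaced)
print(first_two)
print(swapped)
```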

Example 1

Format the phone number +61-927 479-548 by removing everything
except the digits:

    #!/usr/bin/env python3
    import re

    phone = "Please call the phone # +61-927 479-548"

    # Remove anything other than digits
    num = re.sub(r'\D', "", phone)
    print("The raw phone number is :", num)

Output

    The raw phone number is : 61927479548

Example 2:

Let’s use the sub() function to “munge” a text, i.e., randomize the
order of all the characters in each word of a sentence except for the
first and last characters:

    #!/usr/bin/env python3
    import re
    import random

    def repl(m):
        # Shuffle the inner letters; keep the first and last in place.
        inner_word = list(m.group(2))
        random.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    line = "Learn Scientific Python with Regex"
    munged = re.sub(r"(\w)(\w+)(\w)", repl, line)
    print("munged :", munged)

Output (the shuffle is random, so this is one possible result)

    munged : Laren Scifneitic Phtoyn with Rgeex


Python regex split function

Python string splitter

The split() function splits a string by the occurrences of a pattern.

Syntax

    re.split(pattern, string, maxsplit=0, flags=0)

If maxsplit is nonzero, at most maxsplit splits occur, and the
remainder of the string is returned as the final element of the list.
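A short sketch of maxsplit in action, plus the related behaviour of capturing groups in the separator pattern (sample data and names are made up for this illustration):

```python
import re

line = "Learn, Scientific, Python, Regex"

# Split on every comma followed by optional whitespace.
everything = re.split(r",\s*", line)
# ['Learn', 'Scientific', 'Python', 'Regex']

# maxsplit=1 splits once and returns the rest as the last element.
limited = re.split(r",\s*", line, maxsplit=1)
# ['Learn', 'Scientific, Python, Regex']

# If the pattern contains a capturing group, the separators are kept.
with_seps = re.split(r"(,)\s*", "a, b")
# ['a', ',', 'b']

print(everything, limited, with_seps)
```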

Example 1

Break the string ‘Learn, Scientific, Python’ into three elements,
Learn, Scientific and Python:

    #!/usr/bin/env python3
    import re

    line = "Learn, Scientific, Python"

    m = re.split(r'\W+', line)

    if m:
        print(m)
    else:
        print("No match!")

Output

    ['Learn', 'Scientific', 'Python']


Example 2
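Now let’s make a second example, where the separator can be any run
of letters from [a-z], in either case. Note that the flag must be
passed by keyword: the third positional argument of re.split is
maxsplit, so re.split(pattern, line, re.I) would set maxsplit rather
than the flag.

    #!/usr/bin/env python3
    import re

    line = "+61Lean7489Scientific324234"

    m = re.split(r'[a-z]+', line, flags=re.I)

    if m:
        print(m)
    else:
        print("No match!")

Output

    ['+61', '7489', '324234']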

Python regex findall function

Python string findall


findall() is a powerful function in the re module. It finds all
the matches and returns them as a list of strings, with each string
representing one match.

Syntax

    re.findall(pattern, string, flags=0)

The string is scanned left-to-right, and matches are returned in
the order found. If one or more groups are present in the pattern,
findall() returns a list of groups. Empty matches are included in
the result.

Example 1

Find all and return the email addresses:

    #!/usr/bin/env python3
    import re

    line = 'your alpha@scientificprograming.io, blah beta@scientificprogramming.io blah user'

    emails = re.findall(r'[\w\.-]+@[\w\.-]+', line)

    if emails:
        print(emails)
    else:
        print("No match!")

Output

    ['alpha@scientificprograming.io', 'beta@scientificprogramming.io']

Example 2: findall and Groups

Now let’s make a second example. Groups () can be combined with
findall(). If the pattern includes 2 or more parenthesis groups,
then instead of returning a list of strings, findall() returns a list
of tuples. Each tuple represents one match of the pattern, and inside
the tuple are group(1), group(2), etc.
The following example will find 'alpha', 'scientificprograming.io',
'beta', and 'scientificprogramming.me'.

    #!/usr/bin/env python3
    import re

    line = 'your alpha@scientificprograming.io, blah beta@scientificprogramming.me blah user'

    tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', line)

    if tuples:
        print(tuples)
    else:
        print("No match!")

Output

    [('alpha', 'scientificprograming.io'), ('beta', 'scientificprogramming.me')]

Python regex compile function


The compile function compiles a regular expression pattern into a
regular expression object, which can be used for matching using
its match(), search(), etc. methods.

Syntax

    re.compile(pattern, flags=0)

The expression’s behaviour can be modified by specifying a flags
value (discussed earlier). The value can be any of the flags listed
in the Match Flags table, combined using bitwise OR (the | operator).
For example:

    m = re.match(pattern, string)

is equivalent to:

    p = re.compile(pattern)
    m = p.match(string)

Note that programs that use only a few regular expressions at a time
don’t need to compile them explicitly: recently used patterns are
cached automatically (the cache size is set by re._MAXCACHE).
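When the same pattern is applied many times, compiling it once keeps the code tidy. The following is a small sketch with invented sample data; the compiled object offers the same operations as the module-level functions:

```python
import re

# Compile once, reuse many times; the flags are fixed at compile time.
word = re.compile(r"\bpython\b", re.I)

lines = ["Learn Python", "pythonic code", "PYTHON rocks"]
hits = [line for line in lines if word.search(line)]
# ['Learn Python', 'PYTHON rocks']

replaced = word.sub("Regex", "Python and python")       # 'Regex and Regex'
count = len(word.findall("Python, python, pythons"))    # 2

print(hits, replaced, count)
```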

Example

Consider that you have an HTML file index.html like below:

    <html>
    <header>SP:</header>
    <body>
    <h1>Learn</h1>
    <p>Scientific Programming</p>
    </body>
    </html>

You want to read this file and output:

    SP: Learn Scientific Programming

The following code can do this:

    import re

    def main():
        pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
        output_text = []
        with open('index.html') as f:
            for text in f:
                # strip() lets the pattern match even if a line is indented
                match = pattern.match(text.strip())
                if match is not None:
                    output_text.append(match.group('content'))

        fixed_content = ' '.join(output_text)
        print(fixed_content)

    if __name__ == '__main__':
        main()

Output

    SP: Learn Scientific Programming

Learning Tasks

• Work out how the regex pattern
  (?P<start><.+?>)(?P<content>.*?)(</.+?>) matches the string
  <h1>Learn</h1>.
• Learn the use of the Python append and join methods. Hints: join()
  returns a string in which the string elements of a sequence have
  been joined by the separator string, and append() adds an item to
  the end of a list; it is equivalent to a[len(a):] = [x].

Python regex finditer function

Python string finditer

finditer() is a powerful function in the re module. It returns an
iterator yielding match objects over all non-overlapping matches for
the RE pattern in string.

Syntax

    re.finditer(pattern, string, flags=0)

Here the string is scanned left-to-right, and matches are returned
in the order found. Empty matches are included in the result.

Example 1

Here is a simple example which demonstrates the use of finditer. It
reads in a page of HTML text, finds all the occurrences of the word
“the” and prints “the” and the following word. It also prints the
character position of each match using the match object’s start()
method.

    import re
    from urllib.request import urlopen

    url = 'https://docs.python.org/2/library/re.html'
    html = urlopen(url).read().decode('utf-8')
    pattern = r'\b(the\s+\w+)\s+'
    regex = re.compile(pattern, re.IGNORECASE)
    for match in regex.finditer(html):
        print("%s: %s" % (match.start(), match.group(1)))

Because finditer returns an iterator of match objects, you can loop
over it to do some computation for each match.
Expected output (the positions depend on the current content of the
page, so they may differ):

    3261: The Python
    4210: the backslash
    4451: the same
    4474: the same
    4651: the pattern
    4679: the regular
    4930: The solution
    5937: The functions
    6301: the standard
    and so on...
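The example above needs network access, and its output depends on the live page. As an offline variant, here is a minimal sketch that runs the same kind of search over a fixed sentence (the sentence and variable names are assumptions made for this illustration):

```python
import re

text = "The module re provides the match function and the search function."
regex = re.compile(r"\b(the\s+\w+)", re.IGNORECASE)

# Each item yielded by finditer is a match object.
found = [(m.start(), m.group(1)) for m in regex.finditer(text)]

for position, phrase in found:
    print("%s: %s" % (position, phrase))
# 0: The module
# 23: the match
# 46: the search
```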

Python Regex - Lookarounds and Greedy Search

Lookarounds often cause confusion to new regex learners. There
are four lookarounds:

    (?<= … ) and (?= … ),
    (?<! … ) and (?! … )

Collectively, lookbehinds and lookaheads are known as lookarounds.
Let’s see the following table of examples:

Lookaround   Name                  What it Does
(?=learn)    Lookahead             Asserts that what immediately
                                   follows the current position in
                                   the string is learn
(?<=learn)   Lookbehind            Asserts that what immediately
                                   precedes the current position in
                                   the string is learn
(?!learn)    Negative Lookahead    Asserts that what immediately
                                   follows the current position in
                                   the string is not learn
(?<!learn)   Negative Lookbehind   Asserts that what immediately
                                   precedes the current position in
                                   the string is not learn
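The four lookarounds can be compared side by side on one string. This is only a sketch with made-up data; note that the lookaround text itself never appears in the results:

```python
import re

text = "price: $30, cost: 45, fee: $7"

# Lookahead: digits that are followed by a comma.
followed = re.findall(r"\d+(?=,)", text)             # ['30', '45']

# Negative lookahead: whole numbers not followed by a comma.
not_followed = re.findall(r"\b\d+\b(?!,)", text)     # ['7']

# Lookbehind: digits that are preceded by a dollar sign.
preceded = re.findall(r"(?<=\$)\d+", text)           # ['30', '7']

# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_preceded = re.findall(r"(?<!\$)\b\d+\b", text)   # ['45']

print(followed, not_followed, preceded, not_preceded)
```

The \b anchors in the two negative variants stop the engine from backtracking into part of a number (for example matching only the 3 of 30).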

Python Lookahead
A positive lookahead matches at a position where the pattern inside
the lookahead can be matched. It matches only the position: it does
not consume any characters or expand the match.

Example

Consider the following string:

    begin:learner1:scientific:learner2:scientific:learner3:end

A positive lookahead assertion can help us to find all words followed
by the word scientific.

    import re

    string = "begin:learner1:scientific:learner2:scientific:learner3:end"
    print(re.findall(r"(\w+)(?=:scientific)", string))

Output

    ['learner1', 'learner2']

Note that the output contains learner1 and learner2, but not
learner3, which is followed by :end.

Negative Lookahead

Similar to positive lookahead, except that negative lookahead only
succeeds if the regex inside the lookahead fails to match.

Example

Let’s now proceed to an example, where we find the word (learner3)
followed by end.

    import re

    string = "begin:learner1:scientific:learner2:scientific:learner3:end"
    print(re.findall(r"(learner\d+)(?!:scientific)", string))

Output

    ['learner3']

This matched the only learner word that is not followed by
:scientific!

Python Look behind

Positive Lookbehind

(?<=regex) matches at a position if the pattern inside the
lookbehind can be matched ending at that position.

Example

Consider the following string:

    begin:learner1:scientific:learner2:scientific:learner3:end

A positive lookbehind assertion can help us to find the words
'scientific', 'scientific' and 'end', each preceded by one of the
words learner1 to learner3.

    import re

    string = "begin:learner1:scientific:learner2:scientific:learner3:end"
    print(re.findall(r"(?<=learner\d:)(\b\w*\b)", string))

Output

    ['scientific', 'scientific', 'end']

Negative Lookbehind

Similar to positive lookbehind, (?<!regex) matches at a position if
the pattern inside the lookbehind cannot be matched ending at that
position.

Example

Let’s now proceed to an example, where we find the word (begin),
which is not preceded by any of the words learner1 to learner3. The
^ anchor restricts the search to the start of the string.

    import re

    string = "begin:learner1:scientific:learner2:scientific:learner3:end"
    print(re.findall(r"^(?<!learner\d:)(\b\w*\b)", string))

Output

    ['begin']

Python Lazy and Greedy Search


There are times when you want to match a pattern only optionally!
The ? character flags the group that precedes it as an optional part
of the pattern. For example, enter the following into the interactive
shell:

    import re

    regex = re.compile(r'(scientific )?programming')
    m1 = regex.search('Learn programming')
    m2 = regex.search('Learn scientific programming')

    print(m1.group())
    print(m2.group())

The output will be:

    programming
    scientific programming

The (scientific )? part of the regular expression means that the
pattern scientific (notice the white space!) is an optional group.
The regex will match text that has zero instances or one instance of
scientific in it. This is why the regex matches both ‘programming’
and ‘scientific programming’.
Note that the ‘*’, ‘+’, and ‘?’ quantifiers are all greedy; they
match as much text as possible. Sometimes this behavior isn’t
desired; if the RE pattern <.*> is matched against ‘<H1>Learn
Scientific Programming</H1>’, it will match the entire string, and
not just ‘<H1>’. Adding ‘?’ after the quantifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible
will be matched. Using <.*?> in the previous expression will match
only ‘<H1>’.
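The difference between the greedy and the lazy form can be checked with a few lines. This is a minimal sketch using the sample string from the paragraph above; the variable names are made up for illustration:

```python
import re

html = "<H1>Learn Scientific Programming</H1>"

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.match(r"<.*>", html).group()

# Lazy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()

# With the lazy form, findall returns each tag separately.
tags = re.findall(r"<.*?>", html)

print(greedy)  # <H1>Learn Scientific Programming</H1>
print(lazy)    # <H1>
print(tags)    # ['<H1>', '</H1>']
```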

Project 1: Fun with DNA (REGEX Lookaround)!

DNA is a sequence of bases, A, C, G, or T. They are translated into
proteins three bases at a time, and each three-base sequence is
called a codon. There is a special start codon, ATG, and three stop
codons, TGA, TAG, and TAA.
Example:

    cgcgcATGcATGcgTGAcTAAcgTAGcgcgcgcgc

An open reading frame, or ORF, consists of a start codon, followed
by some more codons, and ending with a stop codon.
The above example has two overlapping ORFs:

• ATGcATGcgTGA and
• ATGcgTGAcTAA.

The following pattern only finds the first ORF ('atgcatgcgtga').
Since it consumes the first ORF, it also consumes the beginning
of the second ORF.

    import re

    dna = 'cgcgcATGcATGcgTGAcTAAcgTAGcgcgcgcgc'
    dna = dna.lower()
    orfpat = r'(?x) ( atg (?: (?!tga|tag|taa) ... )* (?:tga|tag|taa) )'
    print(re.findall(orfpat, dna))

Output

    ['atgcatgcgtga']

To find an ORF without consuming it, we can use a positive lookahead
assertion, (?= ... ). We put the whole ORF pattern inside the
lookahead and find both atgcatgcgtga and atgcgtgactaa.

    import re

    dna = 'cgcgcATGcATGcgTGAcTAAcgTAGcgcgcgcgc'
    dna = dna.lower()
    orfpat = r'(?x) (?= ( atg (?: (?!tga|tag|taa) ... )* (?:tga|tag|taa) ))'
    s = re.findall(orfpat, dna)
    if s:
        print(', '.join(s))

Output

    atgcatgcgtga, atgcgtgactaa

This project adapts and simplifies the Splitsville examples (DNA)
from Rex Dwyer’s IPython notebook
(https://github.com/rexdwyer/Splitsville).

Project 2: Parsing data from a HTML file with Python and REGEX

In this project, we want to extract tabular information from an
HTML file (see below). Our goal is to extract the information
available between <td> and </td>, except the first numerical index
(1..6).
Consider the data.html file below:

    <html>
    <head>
    <style>
    table, th, td {
        border: 1px solid black;
        border-collapse: collapse;
    }
    th, td {
        padding: 5px;
    }
    th {
        text-align: left;
    }
    </style>
    </head>
    <body>
    <table style="width:100%">
    <tr align="center"><td>1</td> <td>England</td> <td>English</td></tr>
    <tr align="center"><td>2</td> <td>Japan</td> <td>Japanese</td></tr>
    <tr align="center"><td>3</td> <td>China</td> <td>Chinese</td></tr>
    <tr align="center"><td>4</td> <td>Middle-east</td> <td>Arabic</td></tr>
    <tr align="center"><td>5</td> <td>India</td> <td>Hindi</td></tr>
    <tr align="center"><td>6</td> <td>Thailand</td> <td>Thai</td></tr>
    </table>
    </body>
    </html>

If we load the HTML file into a browser it should look like below:

Table: HTML tabular data scraping

Solution

In this code, we first read the HTML file (data.html) and then find
and extract the values from the HTML code.

    import re

    with open('data.html', 'r') as myfile:
        data = myfile.read().replace('\n', '')

    result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', data)
    print(result)

Output (wrapped here for readability):

    [('England', 'English'),
     ('Japan', 'Japanese'),
     ('China', 'Chinese'),
     ('India', 'Hindi'),
     ('Thailand', 'Thai')]

The Middle-east row is missing because \w+ does not match the hyphen
in its name.
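If rows whose cells contain characters other than letters, digits and underscores should be captured as well, one possible variation is to use [^<]+ for the cell contents. The sketch below is an assumption, not part of the original solution; it inlines two rows from data.html so that it runs on its own:

```python
import re

# Two sample rows copied from data.html, inlined for a standalone run.
data = (
    '<tr align="center"><td>3</td> <td>China</td> <td>Chinese</td></tr>'
    '<tr align="center"><td>4</td> <td>Middle-east</td> <td>Arabic</td></tr>'
)

# \w+ cannot cross the hyphen, so the Middle-east row is skipped.
strict = re.findall(r"<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>", data)

# [^<]+ accepts any run of characters up to the next tag.
relaxed = re.findall(r"<td>\w+</td>\s<td>([^<]+)</td>\s<td>([^<]+)</td>", data)

print(strict)   # [('China', 'Chinese')]
print(relaxed)  # [('China', 'Chinese'), ('Middle-east', 'Arabic')]
```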

Project 3: PDF scraping in Python + REGEX

In this project we will use a PDF file (see the screenshot below)
from the diabetes.org website. Our goal is to list all the equipment
models made by manufacturers whose names contain the word tandem
(case insensitive).

Figure: Find all the product models by the manufacturer called Tandem

Input file

You can download the input file from here: data.pdf
(http://main.diabetes.org/dforg/pdfs/2015/2015-cg-insulin-pumps.pdf).

Solution

A complete explanation of the Python code is out of scope for this
course (hint: learn the Python module pdfquery). It should be easy
enough to follow how we capture the product_name entries from the
PDF file using the bounding-box selector
LTTextLineHorizontal:in_bbox("40, 48, 181, 633"), then iterate over
the products, search each one with a regex, and print only those
whose manufacturer is Tandem.


    import re

    import pdfquery
    from lxml import etree


    PDF_FILE = 'data.pdf'

    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()

    product_info = []
    page_count = len(pdf._pages)
    for pg in range(page_count):
        data = pdf.extract([
            ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
            ('with_formatter', None),
            ('product_name', 'LTTextLineHorizontal:in_bbox("40, 48, 181, 633")'),
        ])

        for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True)):
            if ix % 2 == 0:
                product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': float(pn.get('y1')), 'y_end': float(pn.get('y1'))-150})
                if ix > 0:
                    product_info[-2]['y_end'] = float(pn.get('y0'))+10.0
            else:
                product_info[-1]['Model'] = pn.text.strip()

    pdf.file.close()


    for p in product_info:
        s = p['Manufacturer']
        m = re.search(r"Tandem", s, re.I)
        if m:
            print('Manufacturer: {}[Model {}]\n'.format(p['Manufacturer'], p['Model']))

We have preloaded the data onto educative.io’s server and you


should be able to run the code straight ahead and get the output
as follows:

1 Manufacturer: Tandem Diabetes Care[Model T:flex]


2 Manufacturer: Tandem Diabetes Care[Model T:slim]

From this result we can see that there are two models, T:flex
and T:slim, supplied by the manufacturer called ‘Tandem Diabetes
Care’. The problem solution has been adapted and simplified from
the reddit user insainodwayno³.
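
The final filtering step can be tried without a PDF or pdfquery. The
sketch below assumes a small hand-written product_info list shaped like
the one built above (the second entry is an invented placeholder, not
taken from the PDF) and applies the same case-insensitive re.search.
The helper name models_by_manufacturer is ours.

```python
import re

# Hypothetical sample data shaped like the product_info list built above.
# 'Example Pumps Inc' / 'X100' are invented placeholders.
product_info = [
    {'Manufacturer': 'Tandem Diabetes Care', 'Model': 'T:flex'},
    {'Manufacturer': 'Example Pumps Inc', 'Model': 'X100'},
    {'Manufacturer': 'Tandem Diabetes Care', 'Model': 'T:slim'},
]


def models_by_manufacturer(products, pattern):
    """Return the models whose manufacturer matches the regex, ignoring case."""
    return [p['Model'] for p in products
            if re.search(pattern, p['Manufacturer'], re.I)]


tandem_models = models_by_manufacturer(product_info, r"Tandem")
print(tandem_models)  # ['T:flex', 'T:slim']
```

Because re.I is passed, the pattern r"tandem" would select the same
two models.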

Project 4: Web scraping in Python + REGEX
Web scraping, or web data extraction, is the use of data scraping to
extract data from websites. In this project, we will extract tabular
data of criminal records from the Boone County Sheriff’s Dept website⁴
(see the image below) and then find all the people whose last name
starts with the letter “A”.
3 https://www.reddit.com/user/insainodwayno
4 http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp

Scraping tabular data from the Boone County Sheriff’s Dept website

Solution

The problem solution uses BeautifulSoup⁵. A detailed explanation
of the code is out of scope for this course (hint: read the BS docs). In
this code, we first fetch the HTML and parse it with BS’s
BeautifulSoup() function, then find and extract the table from the
HTML code.

5 https://www.crummy.com/software/BeautifulSoup/

import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []

for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        # bs4 decodes &nbsp; to the non-breaking space character
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for line in list_of_rows:
    row = '\t'.join(str(i) for i in line)
    s = row[0:5]  # start of the row, i.e. the last name (1st column)
    m = re.search(r"^A", s, re.I)
    if m:
        print(row)

Expected output (surnames starting with the letter “A”):

ACTON ANTHONY SEAN M B 25 COLUMBIA MO Details
ADAM OMER SIRAJ M B 29 COLUMBIA MO Details
ALEXANDER CHARLES CODY M W 23 COLUMBIA MO Details
ALLEN WILLIAM LAMAR M B 55 COLUMBIA MO Details
AVALOS-AVALOS JOSE M H 19 ST.ANN MO Details

This example has been adapted and extended from the
Python-BeautifulSoup’s first web scraper⁶, originally developed by Chase
Davis, Jackie Kazil, Sisi Wei and Matt Wynn for bootcamps held by
Investigative Reporters and Editors at the University of Missouri in
Columbia, Missouri.
6 https://first-web-scraper.readthedocs.io/en/latest/
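
To experiment with the row extraction and the ^A filter offline, here
is a minimal sketch. It swaps BeautifulSoup for the standard library’s
html.parser so that it runs without third-party packages, and it
assumes an invented HTML snippet (SAMPLE_HTML) whose names are
placeholders, not real records. The class name TableRows is ours.

```python
import re
from html.parser import HTMLParser

# Invented sample markup; the names are placeholders, not real records.
SAMPLE_HTML = """
<table><tbody class="stripe">
<tr><td>ADAMS</td><td>JOHN</td><td>30</td></tr>
<tr><td>BAKER</td><td>JANE</td><td>41</td></tr>
<tr><td>avery</td><td>SAM</td><td>27</td></tr>
</tbody></table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])      # start a new row
        elif tag == 'td':
            self._in_cell = True
            self.rows[-1].append('')  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableRows()
parser.feed(SAMPLE_HTML)

# Keep rows whose first column (the last name) starts with "A", ignoring case.
a_rows = [row for row in parser.rows
          if row and re.search(r"^A", row[0], re.I)]
for row in a_rows:
    print('\t'.join(row))
```

Testing the first column directly, rather than a slice of the joined
row, keeps the filter tied to the last name even if the column layout
changes.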

Project 5: Amazon web crawling in Python + REGEX
A Web crawler, sometimes called a spider, is an Internet bot
that systematically browses the World Wide Web, typically for the
purpose of Web indexing (web spidering). In this project, we crawl
the amazon.com⁷ search results for ‘startrek’ under Movies & TV (see
the image below). Then, we find the list of movies with ‘bonus’ content.

Amazon web scraping for Startrek DVD movies with ‘bonus’ content

Solution

The problem solution uses BeautifulSoup⁸. A detailed explanation
of the code is out of scope for this course (hint: read the BS docs). In
this code, we first fetch the HTML and parse it with BS’s
BeautifulSoup() function, then find and extract the movies from the
HTML code.

7 https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dmovies-tv&field-keywords=startrek&rh=n%3A2625373011%2Ck%3Astartrek
8 https://www.crummy.com/software/BeautifulSoup/

import re

import requests
from bs4 import BeautifulSoup


def crawl_amazon_web(page, WebUrl):
    if page > 0:
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")

        for link in s.find_all('a', {'class': 's-access-detail-page'}):
            # Links without a title attribute are treated as empty titles
            movie_title = link.get('title') or ''
            m = re.search(r'Bonus', movie_title)
            if m:
                print(movie_title)
                html_link = link.get('href')
                print(html_link)


crawl_amazon_web(1, 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Dmovies-tv&field-keywords=starwars&rh=n%3A2625373011%2Ck%3Astarwars')

Expected output (Star Wars movies with bonus content, since the code
above searches for ‘starwars’):

Star Wars: The Force Awakens (Plus Bonus Features)
https://www.amazon.com/Star-Wars-For
Rogue One: A Star Wars Story (With Bonus Content)
https://www.amazon.com/Rogu

Easy!

This solution has been adapted and extended from the Dev.to
post⁹ written by Pranay Das.
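
Note that the pattern r'Bonus' used in the crawler is case-sensitive.
The sketch below, which assumes three invented titles, shows how the
result changes when re.IGNORECASE is passed; whether that is
desirable depends on how the titles are capitalised on the page.

```python
import re

# Invented titles for illustration only.
titles = [
    'Example Movie (Plus Bonus Features)',
    'Example Movie (with bonus content)',
    'Example Movie (Standard Edition)',
]

# Case-sensitive search: only the capitalised "Bonus" is found.
case_sensitive = [t for t in titles if re.search(r'Bonus', t)]

# Case-insensitive search: "Bonus" and "bonus" are both found.
case_insensitive = [t for t in titles
                    if re.search(r'bonus', t, re.IGNORECASE)]

print(len(case_sensitive))    # 1
print(len(case_insensitive))  # 2
```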

Quiz # 1 - REGEX Patterns


Challenge yourself with the Regex pattern quizzes.

Question 1 of 10
By default, a single dot (.) matches:

1. A single char
2. Nothing
3. Unlimited numbers of chars
4. A and B

Correct answer: 1

Question 2 of 10
Regex pattern a+ matches

1. b and aaab
2. ab and aaab
3. b and a+b
4. a+b

Correct answer: 2
9 https://dev.to/pranay749254/build-a-simple-python-web-crawler

Question 3 of 10
Regex [0-9]+ matches one or more occurrence of any digit

1. True
2. False

Correct answer: 1

Question 4 of 10
[^...] matches any single character that is not in the class.

1. True
2. False

Correct answer: 1

Question 5 of 10
^abc matches:

1. 123abc
2. abc123
3. aabc123
4. ^abc

Correct answer: 2 (^ anchors the pattern to the start of the string)
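
A quick way to check how the ^ anchor behaves is to run re.search
over the four candidate strings from this question. This is only a
sketch; the variable names are ours.

```python
import re

candidates = ['123abc', 'abc123', 'aabc123', '^abc']

# ^ anchors the match to the start of the string.
anchored = [s for s in candidates if re.search(r'^abc', s)]

# Escaping the caret makes it a literal character instead of an anchor.
literal_caret = [s for s in candidates if re.search(r'\^abc', s)]

print(anchored)       # ['abc123']
print(literal_caret)  # ['^abc']
```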



Question 6 of 10
The vertical bar separates two or more alternatives. A match
occurs if any of the alternatives is satisfied. For example,
learn|scientific matches

1. both learn and scientific
2. only learn

Correct answer: 1
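
Alternation can be tried directly with re.search and re.findall.
The sample strings below are our own.

```python
import re

pattern = r'learn|scientific'
texts = ['learn regex', 'scientific programmer', 'python']

# A text matches if either alternative occurs somewhere in it.
matched = [t for t in texts if re.search(pattern, t)]

# findall returns every occurrence of either alternative, in order.
found = re.findall(pattern, 'learn to be scientific')

print(matched)  # ['learn regex', 'scientific programmer']
print(found)    # ['learn', 'scientific']
```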

Question 7 of 10
By default, regular expressions are case-sensitive.

1. True
2. False

Correct answer: 1

Question 8 of 10
When you put a plus sign (+) after something in a regular expression,

1. it indicates that the element may be repeated one or more times.
2. it indicates summation

Correct answer: 1

Question 9 of 10
The star (*) has a similar meaning but also allows the pattern to
match zero times.

1. True
2. False

Correct answer: 1

Question 10 of 10
Putting {4} after an element, \d{4}

1. Requires digits to occur exactly four times.
2. None
3. Four letters need to be printed
4. Requires a mix of four letters and digits

Correct answer: 1
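
The three quantifiers covered in the last questions (+, * and {4}) can
be compared side by side. A small sketch with sample strings of our
own choosing:

```python
import re

# + needs at least one occurrence: 'b' alone contributes nothing.
plus = re.findall(r'a+', 'b ab aaab')

# * also accepts zero occurrences, so 'ab*' fully matches a bare 'a' ...
star_allows_zero = re.fullmatch(r'ab*', 'a') is not None
# ... while 'ab+' does not.
plus_allows_zero = re.fullmatch(r'ab+', 'a') is not None

# {4} requires exactly four digits; \b keeps longer runs from matching.
four_digits = re.findall(r'\b\d{4}\b', '2018 09 12345 1999')

print(plus)              # ['a', 'aaa']
print(star_allows_zero)  # True
print(plus_allows_zero)  # False
print(four_digits)       # ['2018', '1999']
```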

Quiz # 2 Python REGEX Functions


Challenge yourself with the Py Regex (re) module quizzes.

Question 1 of 5
When you deal with HTML and XML, you may need:

1. XML
2. Verbose
3. Non-greedy matching
4. HTML code

Correct answer: 3
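
The reason non-greedy matching matters for markup is easy to see on a
short snippet. The HTML string below is an invented example.

```python
import re

html = '<b>one</b> and <b>two</b>'

# Greedy .* runs to the last </b>, swallowing both elements.
greedy = re.findall(r'<b>.*</b>', html)

# Non-greedy .*? stops at the first </b> it can.
lazy = re.findall(r'<b>.*?</b>', html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```
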
Question 2 of 5
Split the string into a list, splitting it wherever the RE matches

1. splitter()
2. sub()
3. splitn()
4. split()

Correct answer: 4
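
A short sketch of re.split, using a sample string of our own. The
optional maxsplit argument limits how many splits are made.

```python
import re

# Split on a comma or semicolon followed by any amount of whitespace.
parts = re.split(r'[,;]\s*', 'regex, python;scraping;  pdf')

# With maxsplit=1 only the first separator is used.
limited = re.split(r'[,;]\s*', 'regex, python;scraping', maxsplit=1)

print(parts)    # ['regex', 'python', 'scraping', 'pdf']
print(limited)  # ['regex', 'python;scraping']
```
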
Question 3 of 5
We use RE compile

1. All the time
2. Only when there are many regexes

Correct answer: 2
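
As a sketch of when compiling helps, the example below compiles one
pattern and reuses it across several lines. The sample lines are
invented. The re module also keeps a cache of recently used patterns,
so the gain is mostly in readability and reuse.

```python
import re

# Compile once, reuse for every line.
year = re.compile(r'\b(?:19|20)\d{2}\b')

lines = ['published 2018-09-09', 'no year here', 'pumps guide 2015']

years = []
for line in lines:
    m = year.search(line)
    if m:
        years.append(m.group())

print(years)  # ['2018', '2015']
```
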
Question 4 of 5
What is (?=...) - Multiple correct

1. Positive lookahead assertion
2. Lookahead assertion
3. Lookaround
4. Look-behind

Correct answers: 1-3
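
A positive lookahead checks what follows without including it in the
match. A minimal sketch on an invented string:

```python
import re

text = 'price: 100USD, weight: 250kg, cost: 75USD'

# Match digits only when they are immediately followed by 'USD';
# the 'USD' itself is not part of the match.
amounts = re.findall(r'\d+(?=USD)', text)

print(amounts)  # ['100', '75']
```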


Question 5 of 5
Groups are marked by the metacharacters:

1. ( and )
2. < and >
3. None
4. 1 and 2

Correct answer: 1
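
Groups can be read back with group() and groups(). The sketch
below captures the parts of a date; the sample sentence is ours.

```python
import re

m = re.search(r'(\d{4})-(\d{2})-(\d{2})',
              'This version was published on 2018-09-09')

whole = m.group(0)   # the entire match
year = m.group(1)    # first parenthesised group
parts = m.groups()   # all groups as a tuple

print(whole)  # 2018-09-09
print(year)   # 2018
print(parts)  # ('2018', '09', '09')
```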

References
• Python REGEX
– Python Regular expression operations¹⁰ - official documentation.
– Regular Expression HOWTO¹¹ by A.M. Kuchling
• Educative courses
– Python 3: An interactive deep dive¹² by Mark Pilgrim,
implemented by the Educative Team
– Python 101: Interactively learn how to program with
Python 3¹³ by Michael Driscoll
– Python 201 - Interactively Learn Advanced Concepts in
Python 3¹⁴ by Michael Driscoll
10 https://docs.python.org/2/library/re.html
11 https://docs.python.org/2/howto/regex.html
12 https://www.educative.io/collection/10370001/5705097937944576?authorName=Educative
13 https://www.educative.io/collection/5663684521099264/5707702298738688?authorName=Michael%20Driscoll
14 https://www.educative.io/collection/5663684521099264/5693417237512192?authorName=Michael%20Driscoll

• Online Python Courses
– edX’s Introduction to Computer Science and Programming
Using Python¹⁵. The companion book can be found here¹⁶.
– MIT Open Courseware also offers a gentler “lead-in”
course designed for those with no programming background
that you can take before taking the above: Building
Programming Experience: A Lead-In to 6.001¹⁷.
– MIT Open Courseware’s A Gentle Introduction to
Programming Using Python¹⁸
– Coursera’s Programming for Everybody (Python)¹⁹. For
beginners.
– Codecademy’s Python track²⁰. For beginners; tends to focus
primarily on syntax.
– Udacity’s Programming Foundations with Python²¹.
Focuses on object-oriented programming.
– Team Treehouse’s Python course²².
– Udemy’s Python Regular Expressions with Data Scraping
Projects²³, this book - video version!
• Interactive resources:
– LearnPython²⁴ An interactive online guide that teaches
basic Python.
– Try Python²⁵ Another interactive online guide.
15 https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-0#.VJw5pv-kAA
16 http://mitpress.mit.edu/books/introduction-computation-and-programming-using-python-0
17 http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-090-building-programming-experience-a-lead-in-to-6-001-january-iap-2005/
18 http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-189-a-gentle-introduction-to-programming-using-python-january-iap-2011/
19 https://www.coursera.org/course/pythonlearn
20 http://www.codecademy.com/tracks/python
21 https://www.udacity.com/course/ud036
22 http://teamtreehouse.com/features/python
23 https://www.udemy.com/python-regular-expressions-with-data-scraping-projects/
24 http://learnpython.org
25 http://www.trypython.org/

– Educative’s Python Regular Expressions with Data Scraping
Projects²⁶, this book - interactive version!
• Books and tutorials (online):
– Learn Python the Hard Way²⁷ Part of the “Learn X the
Hard Way” series. Despite its name, this is one of the
easiest introductions to Python available.
– Automate the Boring Stuff with Python²⁸. From the
Invent with Python²⁹ author.
– How to Think Like a Computer Scientist (Python 2
version³⁰ and Python 3 version³¹)
– Think Python³² Comprehensive introductory text on
Python.
– The official Python tutorial (for Python 2³³ and Python
3³⁴). Moves a little quickly, but is very comprehensive
and thorough.
– Problem Solving with Algorithms and Data Structures³⁵
– Dive into Python 3³⁶ An accelerated introduction to
Python.
– Program Arcade Games With Python And Pygame³⁷
– The Hitchhiker’s Guide to Python³⁸
– pycrumbs³⁹ A huge list of many useful articles, tutorials,
and snippets on Python, ranging from basic to advanced.
26 https://www.educative.io/collection/5183519089229824/5682462386552832
27 http://learnpythonthehardway.org/book/
28 http://automatetheboringstuff.com/
29 http://inventwithpython.com/
30 http://www.openbookproject.net/thinkcs/python/english2e/
31 http://www.openbookproject.net/thinkcs/python/english3e/
32 http://www.greenteapress.com/thinkpython/
33 https://docs.python.org/2/tutorial/
34 https://docs.python.org/3/tutorial/
35 http://interactivepython.org/runestone/static/pythonds/index.html
36 http://www.diveintopython3.net/
37 http://ProgramArcadeGames.com
38 https://python-guide.readthedocs.org/en/latest/
39 http://resrc.io/list/4/pycrumbs/

– PyMOTW⁴⁰ A tour of the Python standard library through
short examples.
– Import Python⁴¹ A catalog of Python books (some are
free)
• Exercises:
– Pyschools⁴² Exercises and challenges in Python.

40 http://pymotw.com/
41 http://importpython.com/books/
42 http://www.pyschools.com/
