Overview

day one
  0. getting set up
  1. text output and manipulation
day two
  2. reading and writing files
  3. lists and loops
day three
  4. writing functions
  5. conditional statements
day four (today)
  6. regular expressions
  7. dictionaries
day five
  8. files, programs and user input
day six
  9. biopython

This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation-Sequencing data

CHAPTER 6

Regular expressions
Searching for patterns in strings

In biological data, many things of interest can be regarded as string patterns:
● protein domains
● DNA transcription factor binding motifs
● restriction enzyme cut sites
● degenerate PCR primer sites
● runs of mononucleotides

Regular expressions are a tool for searching for patterns in strings, a task also referred to as pattern matching
Regular expressions in the Unix world

Linux/Unix tools that use regular expressions:

● grep → basically a regex search engine
● awk, sed, Perl → actually programming languages, but can be called from the command line
● less
Regular expressions are powerful

We already searched for simple patterns in chapter 5:

if name.startswith('k') and name.endswith('h'):
    [...]

Testing the same conditions using a regular expression:

if re.match(r"k.*h$", name):
    [...]
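
A minimal sketch comparing the string-method test with its regex counterpart. The list of names and the three helper functions are made up for illustration. Because re.match only ties the pattern to the start of the string, the sketch adds '$' to tie the 'h' to the end, and includes an unanchored variant to show the difference.

```python
import re

# Hypothetical example names, chosen to exercise both tests.
names = ["kh", "kookaburrah", "kohl", "hawk"]

def by_string_methods(name):
    return name.startswith("k") and name.endswith("h")

def by_anchored_regex(name):
    # re.match anchors at the start; "$" anchors the "h" at the end.
    return re.match(r"k.*h$", name) is not None

def by_unanchored_regex(name):
    # Without "$" the match may stop before the end of the string.
    return re.match(r"k.*h", name) is not None

for name in names:
    print(name, by_string_methods(name), by_anchored_regex(name),
          by_unanchored_regex(name))
```

Only "kohl" separates the variants: the unanchored pattern is satisfied by the prefix "koh", although the name does not end with 'h'.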
Python modules

● regular expressions are made available as a module

● modules are a way to provide additional functionality when
you need it
● advantages of modules:
– the Python language is extensible
– avoiding overhead → better performance
(most programs need only basic functionality)
● modules contain
– functions
– data types (+ methods)
Using modules

● to use a module, first import it:

import re

● now we can use functions defined in the module by using the module name as a prefix:

if re.match("k.*h", name):
    [...]
Raw strings

● brevity comes at a price: regexes need a lot of special characters
● to avoid ambiguities in regexes, we can tell Python not to interpret backslash sequences such as "\t" or "\n" inside a string:

print("\t\n") vs. print(r"\t\n")

(the 'r' means "raw")

● this way, regex special characters cannot clash with Python ones
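
A small sketch of what the 'r' prefix changes. The "\b" case is an extra illustration that does not appear on the slides: in a normal Python string "\b" is a backspace character, whereas the re module reads a backslash followed by 'b' as a word boundary.

```python
import re

plain = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"    # four characters: backslash, t, backslash, n
print(len(plain), len(raw))

text = "the cat sat"

# Python turns "\b" into a backspace before the regex engine sees it,
# so this pattern looks for backspace characters around "cat".
without_raw = re.findall("\bcat\b", text)

# In the raw string the backslash survives and means "word boundary".
with_raw = re.findall(r"\bcat\b", text)

print(without_raw, with_raw)
```

Writing every regex as a raw string, even when it contains no backslash, avoids having to think about which case applies.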
Searching for patterns

● Let's find a restriction enzyme's recognition site within a DNA sequence (EcoRI cuts the pattern "GAATTC"):

dna = "CTTAGAATTCCG"
if re.search(r"GAATTC", dna):
    print("EcoRI recognition site found!")

● a more interesting case (AvaII cuts the pattern "GGWCC", where W stands for A or T):

dna = "CTTAGAATTCCG"
if re.search(r"GG(A|T)CC", dna):
    print("AvaII recognition site found!")

→ "(A|T)" = "A" or "T"
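
The example sequence above contains an EcoRI site but no AvaII site, so the second test never prints anything. A short sketch with made-up sequences (assumptions for illustration only) shows the alternation matching both variants of the site:

```python
import re

# Hypothetical sequences: the first contains GGACC, the second GGTCC,
# the third is the slide's sequence, which has no AvaII site.
sequences = ["ATGGACCTA", "CCGGTCCAA", "CTTAGAATTCCG"]

hits = [bool(re.search(r"GG(A|T)CC", dna)) for dna in sequences]

for dna, hit in zip(sequences, hits):
    print(dna, "AvaII site found" if hit else "no AvaII site")
```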

Character classes

● This regex will match BisI's recognition site (BisI cuts the pattern "GCNGC"):

dna = "CTTAGAATTCCG"
if re.search(r"GC(A|C|G|T)GC", dna):
    print("BisI recognition site found!")

● But we can make it shorter by using a character class:

dna = "CTTAGAATTCCG"
if re.search(r"GC[ACGT]GC", dna):
    print("BisI recognition site found!")

→ "[ACGT]" = "A" or "C" or "G" or "T"
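
Recognition sites are usually written with IUPAC ambiguity codes (the W in "GGWCC", the N in "GCNGC"), and each code corresponds to a character class. The sketch below is one possible way to automate that translation. The helper name iupac_to_regex and the lookup table are assumptions made for this example, and the table covers only the unambiguous bases, the two-base codes and N.

```python
import re

# Subset of the standard IUPAC nucleotide codes, written as regex fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def iupac_to_regex(site):
    """Translate a recognition site such as 'GGWCC' into a regex string."""
    return "".join(IUPAC[base] for base in site)

avaii = iupac_to_regex("GGWCC")
bisi = iupac_to_regex("GCNGC")
print(avaii, bisi)

m = re.search(bisi, "CTTAGCAGCTTACG")
print(m.group() if m else "no BisI site")
```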

Matching arbitrary characters

● If we do not care about a certain position, we use a dot ('.'), which matches any character:

dna = "CTTAGAATTCCG"
if re.search(r"GC.GC", dna):
    print("BisI recognition site found!")

● "GC.GC" matches the same strings as "GC[ACGT]GC", but also other ones, e.g.
– "GCNGC", "GCWGC", "GCXGC", "GC8GC", "GC#GC"

→ "." = any character (letters, digits, symbols) except newlines ('\n')
Excluding characters

● Sometimes you want to specify what not to match; this can be done by negating a character class with a caret ('^'):

dna = "CTTAGAATTCCG"
if re.search(r"GC[^N]GC", dna):
    print("BisI recognition site found!")

● "GC[^N]GC" matches almost the same strings as "GC.GC": it excludes "GCNGC", but it does accept a newline in the middle position

→ "[^N]" = any character (letters, digits, symbols, even newlines ('\n')) except 'N'
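
A quick check of how the three ways of writing the middle position differ. The candidate strings are illustrative, and re.fullmatch is used so that each pattern has to cover the whole five-character string. Note that a negated character class does match a newline, which the dot does not.

```python
import re

candidates = ["GCAGC", "GCNGC", "GC8GC", "GC#GC", "GC\nGC"]

results = {}
for s in candidates:
    results[s] = (
        bool(re.fullmatch(r"GC.GC", s)),       # any character but newline
        bool(re.fullmatch(r"GC[^N]GC", s)),    # any character but 'N'
        bool(re.fullmatch(r"GC[ACGT]GC", s)),  # one of the four bases
    )
    print(repr(s), results[s])
```

For real sequence data "[ACGT]" is usually the safest of the three, because it states exactly which characters are acceptable.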
Quantifiers

● what if we have (sub)patterns of variable length?
● quantifiers enable us to specify how often a certain part of a pattern can be repeated
● remember our earlier example:

if re.match(r"k.*h$", name):
    [...]

matches a string starting with 'k' and ending with 'h', having an arbitrary number (0 to infinity) of characters in between (the '$' ties the 'h' to the end of the string)
● quantifiers (in this case '*') are specified after the pattern (here: '.') they belong to
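
A sketch of what "belongs to" means in practice: a quantifier repeats only the element directly in front of it, which is a single character unless parentheses group several characters together. The test strings are made up, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# "*" after "." repeats "any character": zero or more of them fit.
dot_star = [bool(re.fullmatch(r"k.*h", s)) for s in ["kh", "koh", "kitchen bench"]]

# "*" after "C" repeats only the C, not the preceding A.
single_char = {
    "ACCCT": bool(re.fullmatch(r"AC*T", "ACCCT")),
    "ACACT": bool(re.fullmatch(r"AC*T", "ACACT")),
}

# Parentheses turn "AC" into one unit that the "*" then repeats.
grouped = bool(re.fullmatch(r"(AC)*T", "ACACT"))

print(dot_star, single_char, grouped)
```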
More quantifiers

The following quantifiers are available:

quantifier   occurrences      examples
?            0 to 1           "ACG?T" ~ "ACT", "ACGT"
*            0 to infinity    "ACG*T" ~ "ACT", "ACGT", "ACGGT", "ACGGGT", ...
+            1 to infinity    "ACG+T" ~ "ACGT", "ACGGT", "ACGGGT", ...
{n}          exactly n        "ACG{2}T" ~ "ACGGT"
{n,m}        n to m           "ACG{2,4}T" ~ "ACGGT", "ACGGGT", "ACGGGGT"
{n,}         n to infinity    "ACG{2,}T" ~ "ACGGT", "ACGGGT", "ACGGGGT", ...
Positions

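● Sometimes it is important where the pattern is found
● There are two positional operators:

operator   matches
^          beginning of string
$          end of string

● Both are useful with re.search, which otherwise looks for the pattern anywhere in the string
● With re.match, '^' is redundant (re.match always starts at the beginning of the string), but '$' is still needed if the pattern has to reach the end of the string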
re.match vs. re.search

● re.match finds the pattern only at the beginning of the string (add '$' to require that it also reaches the end):

if re.match(r"k.*h$", name):
    print(name + " starts with k and ends with h")

● re.search finds the given pattern anywhere in the string:

if re.search(r"k.*h", name):
    print("pattern 'k.*h' occurs in " + name)
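
To make the difference concrete, here is a sketch that runs the same pattern through re.match, re.search and re.fullmatch. The last function is not covered on the slides; it is part of the re module and succeeds only if the pattern covers the entire string. The three names are invented for the example.

```python
import re

pattern = r"k.*h"
names = ["koh", "kohl", "akoh"]

results = {}
for name in names:
    results[name] = (
        re.match(pattern, name) is not None,      # pattern at the start?
        re.search(pattern, name) is not None,     # pattern anywhere?
        re.fullmatch(pattern, name) is not None,  # pattern covers everything?
    )
    print(name, results[name])
```

"kohl" passes re.match because its prefix "koh" fits the pattern, and "akoh" is found only by re.search.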
Combinations

● the true power of regexes lies in the combination of the different operators
● Here's an example regex that matches a (much simplified) eukaryotic messenger RNA sequence:
"^AUG[ACGU]{30,1000}A{5,10}$"
● Read from left to right:
– ^AUG: starts with AUG
– [ACGU]{30,1000}: followed by 30 to 1000 nucleotides
– A{5,10}$: ended by 5 to 10 A's
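
A sketch that applies this pattern to a few synthetic sequences. The sequences and the helper looks_like_mrna are assumptions made up for this example; they only exercise the three parts of the regex and say nothing about real transcripts.

```python
import re

mrna_pattern = re.compile(r"^AUG[ACGU]{30,1000}A{5,10}$")

def looks_like_mrna(seq):
    """True if seq fits the simplified start/body/poly-A pattern."""
    return mrna_pattern.search(seq) is not None

body = "GCU" * 12                      # 36 nucleotides of filler

good = "AUG" + body + "A" * 6          # start codon, body, 6 A's
no_start = "CUG" + body + "A" * 6      # wrong first codon
short_tail = "AUG" + body + "AAA"      # only 3 A's at the end
short_body = "AUG" + "GCU" * 5 + "A" * 6   # fewer than 30 + 5 characters

for seq in (good, no_start, short_tail, short_body):
    print(looks_like_mrna(seq), seq)
```

Because 'A' is also allowed by "[ACGU]", the regex engine is free to count some trailing A's as part of the body, so the pattern checks a minimum tail length rather than an exact one.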
Extracting matches

● When using re.search one is usually interested in getting the matching substrings
● re.search returns a match object containing more information about the match
● the match object can be queried with the group method:

dna = "CTTAGCAGCTTACG"
# store the match object in the variable m
m = re.search(r"GC[^N]GC", dna)
print("BisI recognition site found: " + m.group())
Capturing parts of matches

● What if we want to extract specific parts of a match?
● re.search can return (or capture) several results as groups in the match object
● parts that should be captured are indicated in the regex with parentheses:

dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
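
For reference, a sketch of what this example yields, using the same sequence and pattern. It also uses the groups method of the match object, which is not mentioned on the slides and returns all captured groups at once as a tuple.

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

whole = m.group()            # same as m.group(0): the entire match
first, second = m.groups()   # all captured groups as a tuple

print("entire match:", whole)    # GACGTACGTAC
print("first bit:", first)       # CGT
print("second bit:", second)     # GT
```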
Getting the position of a match

● the match object can tell us what matched our regex but
also where the match occurred in the string
● the match object provides the methods start and end to
get positional information:
dna = "CTTAGCAGCTTACG"
m = re.search(r"GC[^N]GC", dna)
print("match starts at: " + str(m.start()))
print("match ends at: " + str(m.end()))

● Caution! This code raises an error if no match is found: re.search then returns None instead of a match object, and None has no start or end method
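
One way to guard against the missing match is to test the return value before using it. The helper find_site below is a hypothetical name introduced for this sketch, not a function of the re module.

```python
import re

def find_site(pattern, dna):
    """Return (start, end) of the first match, or None if there is none."""
    m = re.search(pattern, dna)
    if m is None:
        return None
    return m.start(), m.end()

with_site = find_site(r"GC[^N]GC", "CTTAGCAGCTTACG")
without_site = find_site(r"GC[^N]GC", "CTTAGAATTCCG")

print(with_site)      # (4, 9)
print(without_site)   # None
```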
Getting the position of captured groups

● as with group, we can access the position of each captured group by addressing it by its index:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("whole match: "+str(m.start())+" to "+str(m.end()))
print("1st bit: "+str(m.start(1))+" to "+str(m.end(1)))
print("2nd bit: "+str(m.start(2))+" to "+str(m.end(2)))

● Caution! This code raises an error if no match is found: re.search then returns None instead of a match object, and None has no start or end method
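
A short sketch showing that the reported positions follow Python's slicing convention: start is the index of the first matched character and end is the index just after the last one, so dna[start:end] gives back the matched text. The span method used here belongs to the match object and returns start and end together.

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

spans = []
if m is not None:
    # Index 0 is the whole match, 1 and 2 are the captured groups.
    for index in range(3):
        start, end = m.span(index)
        spans.append((start, end))
        print(index, start, end, dna[start:end])
```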
Splitting a string with a regex

● the re module provides a split function which takes a regex as delimiter to split by:

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
print(runs)

→ "[^ATGC]" = any character except 'A','T','G' or 'C'

● This code prints out
['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
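
A sketch of two follow-up points: turning the pieces into lengths, and what happens when two ambiguous bases sit next to each other. The second sequence ("ACTNNGCA") is made up for the example.

```python
import re

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
lengths = [len(run) for run in runs]
print(lengths)

# Two delimiters in a row produce an empty string between them ...
gappy = re.split(r"[^ATGC]", "ACTNNGCA")
# ... unless the pattern treats a run of delimiters as one ("+").
merged = re.split(r"[^ATGC]+", "ACTNNGCA")
print(gappy, merged)
```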
Finding all occurrences in a text

● the search function returns only the first match found in the string
● using findall we can extract all occurrences of a pattern:

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{4,100}", dna)
print(runs)

→ "[AT]{4,100}" = runs of only 'A' or 'T' bases, 4-100 bp long

● runs will contain the following list:
['ATTATAT', 'AAATTATA']
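
One behaviour of findall that is easy to trip over: if the pattern contains capturing parentheses, findall returns the captured parts instead of the whole matches. The sketch below uses a made-up sequence with two AvaII-like sites. The "(?:...)" syntax is a non-capturing group, which is not covered on the slides.

```python
import re

dna = "ATGGACCTTGGTCCA"   # hypothetical sequence with GGACC and GGTCC

captured = re.findall(r"GG(A|T)CC", dna)         # only the group contents
whole = re.findall(r"GG[AT]CC", dna)             # character class: full sites
non_capturing = re.findall(r"GG(?:A|T)CC", dna)  # grouping without capturing

print(captured, whole, non_capturing)
```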
Finding all matches in a text

● findall only returns the matching substrings, but what if we need the positions of the substrings?
● finditer returns a sequence of match objects (called an iterator) that can be used in a for loop:

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{4,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT run from "+str(run_start)+" to "+str(run_end))
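
As a sketch, the same loop can collect position and text together. Since end points just past the last matched base, end minus start is the length of the run. The variable name at_runs is an assumption for this example.

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# One (start, end, text) tuple per AT-rich run.
at_runs = [(m.start(), m.end(), m.group())
           for m in re.finditer(r"[AT]{4,100}", dna)]

for start, end, text in at_runs:
    print(start, end, end - start, text)
```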
Recap

In this unit you learned about:

● the notion of modules in python
● the concept and syntax of regular expressions
● using raw strings to avoid special character clashes
● how to use the re module to find occurrences of
patterns in a text
● how to extract information from match objects
● splitting strings with regular expressions
Recap

Here is a shortlist of functions provided by the re module:

function      application
re.match      returns a match object if the pattern matches at the beginning of the string
re.search     returns a match object for the first occurrence of the pattern anywhere in the string
re.findall    returns a list of all substrings matching the pattern
re.finditer   returns a sequence of match objects representing all occurrences of the pattern
re.split      splits a text at each occurrence of the pattern
Recap

These methods can be used to extract information from a match object:

method   application
group    returns the substring matching the pattern (indexable, i.e. if more than one group was captured by the regex, each group can be addressed by its index)
start    returns the start position of a (sub)match (indexable)
end      returns the end position of a (sub)match (indexable)

Exercise 6-1: Accession names

● Write a script that reads a set of accession names from the file "accessions.txt" (comma-separated list)
● print the accessions fulfilling the following criteria:
a) contain the number 5
b) contain the letter d or e
c) contain the letters d and e (in that order)
d) contain the letters d and e (in that order) with a letter in between
e) contain the letters d and e (in any order)
f) start with x or y and end with e
g) contain 3 or more numbers in a row
h) end with d followed by either a, r or p
Exercise 6-2: Double digest

● read the DNA sequence in file dna.txt

● predict the fragment lengths we would get by digesting
the sequence with the (made-up) restriction enzymes
– AbcI: cutting site "ANT*AAT"
– AbcII: cutting site "GCRW*TG"
(asterisks indicate where the enzyme cuts the DNA)
