06_regex

This document outlines a course based on the book 'Python for Biologists', focusing on using Python for analyzing biological data, particularly through regular expressions. It covers various topics including pattern matching, modules, and string manipulation, with practical examples related to DNA sequences and enzyme recognition sites. Additionally, it includes exercises to reinforce learning about regex and its applications in bioinformatics.

Uploaded by

faranpourali1383
Copyright
© © All Rights Reserved

Overview

day one                              day four (today)
  0. getting set up                    6. regular expressions
  1. text output and manipulation      7. dictionaries
day two                              day five
  2. reading and writing files         8. files, programs and user input
  3. lists and loops                 day six
day three                              9. biopython
  4. writing functions
  5. conditional statements

This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation-Sequencing data

CHAPTER 6

Regular expressions
Searching for patterns in strings

In biological data, many things of interest can be regarded as string patterns:
● protein domains
● DNA transcription factor binding motifs
● restriction enzyme cut sites
● degenerate PCR primer sites
● runs of mononucleotides

Regular expressions are a tool for searching for patterns in strings, a task also referred to as pattern matching
Regular expressions in the Unix world

Linux/Unix tools that use regular expressions:
● grep → basically a regex search engine
● awk, sed, Perl → actually programming languages, but can be called from the command line
● less
Regular expressions are powerful

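We already searched for simple patterns in chapter 5:
if name.startswith('k') and name.endswith('h'):
    [...]

Testing the same conditions using a regular expression:
if re.match(r"k.*h$", name):
    [...]

(re.match only anchors the pattern at the start of the string, so the '$' is needed to require that the name also ends with 'h')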
Python modules

● regular expressions are made available as a module
● modules are a way to provide additional functionality when you need it
● advantages of modules:
– the Python language is extensible
– avoiding overhead → better performance (most programs only need basic functionality)
● modules contain
– functions
– data types (+ methods)
Using modules

● to use a module, first import it:
import re

● now we can use functions defined in the module by using the module name as a prefix:
if re.match("k.*h", name):
    [...]
Raw strings

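● brevity comes at a price: regexes need a lot of special characters, many of them written with a backslash
● Python string literals also give backslash sequences a special meaning (e.g. "\t" = tab, "\n" = newline)
● to avoid ambiguities in regexes, we can tell Python not to interpret backslash sequences in a string:

print("\t\n") vs. print(r"\t\n")

(the 'r' prefix means "raw": the first call prints a tab and a newline, the second prints the four characters \t\n)

● this way, regex special characters cannot clash with Python ones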
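
A minimal sketch of why the raw prefix matters for regexes. The example text "the cat sat" is made up for illustration; "\b" is used because it means "backspace" to Python but "word boundary" to the re module.

```python
import re

# A normal string literal interprets backslash escapes; a raw one does not.
normal = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"     # four characters: backslash, t, backslash, n
print(len(normal), len(raw))

# Made-up example text, for illustration only.
text = "the cat sat"

# Without r, Python turns "\b" into a backspace character before re sees it.
without_raw = re.search("\bcat\b", text)
# With r, the re module receives \b and treats it as a word boundary.
with_raw = re.search(r"\bcat\b", text)

print(without_raw)        # None: there is no backspace character in the text
print(with_raw.group())   # cat
```

Writing every regex as a raw string, even when it contains no backslash, is a common habit that avoids having to think about this case by case.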
Searching for patterns

● Let's find a restriction enzyme's recognition site within a DNA sequence (EcoRI cuts the pattern "GAATTC"):
import re
dna = "CTTAGAATTCCG"
if re.search(r"GAATTC", dna):
    print("EcoRI recognition site found!")

● a more interesting case (AvaII cuts the pattern "GGWCC", where W stands for A or T):
dna = "CTTAGAATTCCG"
if re.search(r"GG(A|T)CC", dna):
    # not printed for this dna: it contains no GGWCC site
    print("AvaII recognition site found!")

→ "(A|T)" = "A" or "T"
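
To see the alternation actually match, it helps to try it on sequences that do contain the site. The following is a small sketch; the sequence names and the first two sequences are invented for illustration, the third is the sequence used on the slide.

```python
import re

# Hypothetical test sequences (names and contents are assumptions).
sequences = {
    "seq_a": "CTTAGGACCTCG",   # contains GGACC
    "seq_b": "CTTAGGTCCTCG",   # contains GGTCC
    "seq_c": "CTTAGAATTCCG",   # sequence from the slide, no GGWCC site
}

has_site = {}
for name, dna in sequences.items():
    # (A|T) accepts either an A or a T at the middle position
    has_site[name] = bool(re.search(r"GG(A|T)CC", dna))
    if has_site[name]:
        print(name + ": AvaII recognition site found")
    else:
        print(name + ": no AvaII recognition site")
```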


Character classes

● This regex will match BisI's recognition site (BisI cuts the pattern "GCNGC", where N stands for any base):
import re
dna = "CTTAGAATTCCG"
if re.search(r"GC(A|C|G|T)GC", dna):
    # not printed for this dna: it contains no GCNGC site
    print("BisI recognition site found!")

● But we can make it shorter by using a character class:
dna = "CTTAGAATTCCG"
if re.search(r"GC[ACGT]GC", dna):
    print("BisI recognition site found!")

→ "[ACGT]" = "A" or "C" or "G" or "T"


Matching arbitrary characters

● If we do not care about a certain position, we use a dot ('.') which matches any character:
dna = "CTTAGAATTCCG"
if re.search(r"GC.GC", dna):
    print("BisI recognition site found!")

"GC.GC" matches the same strings as "GC[ACGT]GC", but also other ones, e.g.
– "GCNGC", "GCWGC", "GCXGC", "GC8GC", "GC#GC"

→ "." = any single character (letters, digits, symbols, whitespace) except a newline ('\n')
Excluding characters

● Sometimes you want to specify what not to match; this can be done by negating a character class with a caret ('^') placed directly after the opening bracket:
dna = "CTTAGAATTCCG"
if re.search(r"GC[^N]GC", dna):
    print("BisI recognition site found!")

"GC[^N]GC" matches the same strings as "GC.GC", except that it specifically excludes "GCNGC"; unlike the dot, a negated character class also matches a newline

→ "[^N]" = any single character (letters, digits, symbols, whitespace, even '\n') except 'N'
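
The difference between "[ACGT]", "." and "[^N]" can be checked directly. This is a small sketch using re.fullmatch (which requires the whole string to match); the candidate strings are made up, and the last one contains a newline on purpose.

```python
import re

# Made-up five-character candidates; the last one has a newline in the middle.
candidates = ["GCAGC", "GCNGC", "GC8GC", "GC#GC", "GC\nGC"]

results = {}
for s in candidates:
    results[s] = (
        bool(re.fullmatch(r"GC[ACGT]GC", s)),  # only the four bases
        bool(re.fullmatch(r"GC.GC", s)),       # anything except a newline
        bool(re.fullmatch(r"GC[^N]GC", s)),    # anything except 'N'
    )
    print(repr(s), results[s])
```

In this sketch the dot rejects only the newline, while the negated class rejects only 'N'; passing the flag re.DOTALL would make the dot match newlines as well.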
Quantifiers

● what if we have (sub)patterns of variable length?
● quantifiers enable us to specify how often a certain part of a pattern can be repeated
● remember our earlier example:
if re.match(r"k.*h$", name):
    [...]

matches a string starting with 'k' and ending with 'h', having an arbitrary number (0 to infinity) of characters in between (the '$' anchors the 'h' at the end of the string)
● quantifiers (in this case '*') are specified after the pattern (here: '.') they belong to
More quantifiers

The following quantifiers are available:

quantifier   occurrences      examples
?            0 to 1           "ACG?T" ~ "ACT", "ACGT"
*            0 to infinity    "ACG*T" ~ "ACT", "ACGT", "ACGGT", "ACGGGT", ...
+            1 to infinity    "ACG+T" ~ "ACGT", "ACGGT", "ACGGGT", ...
{n}          exactly n        "ACG{2}T" ~ "ACGGT"
{n,m}        n to m           "ACG{2,4}T" ~ "ACGGT", "ACGGGT", "ACGGGGT"
{n,}         n to infinity    "ACG{2,}T" ~ "ACGGT", "ACGGGT", "ACGGGGT", ...
Positions

● Sometimes it is important where the pattern is found
● There are two positional operators (anchors):

operator   matches
^          beginning of string
$          end of string

● '^' mainly makes sense with re.search, because re.match always starts matching at the beginning of the string anyway; '$' is useful with both functions
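
A short sketch of the two anchors in use with re.search. The sequence is a made-up toy example, and the variable names are only illustrative.

```python
import re

# Hypothetical toy sequence: start codon, a few bases, stop codon.
dna = "ATGGCATTGTAA"

# '^' ties the pattern to the beginning of the string
starts_with_atg = bool(re.search(r"^ATG", dna))
# '$' ties the pattern to the end of the string
ends_with_stop = bool(re.search(r"(TAA|TAG|TGA)$", dna))
# ATG does occur in the sequence, but not at its end
atg_at_end = bool(re.search(r"ATG$", dna))

print(starts_with_atg, ends_with_stop, atg_at_end)
```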
re.match vs. re.search

● re.match only finds the pattern at the beginning of the string; add '$' (or use re.fullmatch) to require that the entire string matches:
if re.match(r"k.*h$", name):
    print(name + " starts with k and ends with h")

● re.search finds the given pattern anywhere in the string:
if re.search(r"k.*h", name):
    print("pattern 'k.*h' occurs in " + name)
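
The behaviour of the matching functions is easiest to compare side by side. The sketch below applies the same pattern with re.match, re.search and re.fullmatch (a third function of the re module that requires the whole string to match); the names in the list are invented test cases.

```python
import re

# Invented test names, chosen to separate the three functions.
names = ["kh", "kookaburrah", "kahuna", "bakh"]

outcome = {}
for name in names:
    outcome[name] = (
        bool(re.match(r"k.*h", name)),      # pattern must start at position 0
        bool(re.search(r"k.*h", name)),     # pattern may start anywhere
        bool(re.fullmatch(r"k.*h", name)),  # pattern must cover the whole string
    )
    print(name, outcome[name])
```

"kahuna" is the instructive case: re.match succeeds because the string begins with "kah", even though the name does not end with 'h'.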
Combinations

● the true power of regexes lies in the combination of the different operators
● Here's an example regex that describes a (highly simplified) eukaryotic messenger RNA sequence:
"^AUG[ACGU]{30,1000}A{5,10}$"
● Read from left to right:
– ^AUG: starts with AUG
– [ACGU]{30,1000}: followed by 30 to 1000 nucleotides
– A{5,10}$: ends with 5 to 10 A's
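
The combined regex can be tried on a few toy sequences. This is only a sketch: the sequences below are constructed for illustration and are not real transcripts, and the variable names are assumptions.

```python
import re

mrna_pattern = re.compile(r"^AUG[ACGU]{30,1000}A{5,10}$")

# Hypothetical toy sequences built from repeats, not real mRNAs.
body = "GCU" * 12                               # 36 nucleotides
good = "AUG" + body + "AAAAAA"                  # start codon, body, 6 A's
no_start = "GUG" + body + "AAAAAA"              # does not begin with AUG
short_body = "AUG" + "GCU" * 3 + "AAAAAA"       # far fewer than 30 nucleotides
no_tail = "AUG" + body + "GCU"                  # does not end with A's

for label, seq in [("good", good), ("no_start", no_start),
                   ("short_body", short_body), ("no_tail", no_tail)]:
    print(label, bool(mrna_pattern.search(seq)))
```

Note that 'A' is also part of "[ACGU]", so the regex does not check that the tail has at most 10 A's: extra A's before the last ten simply count as part of the 30 to 1000 nucleotides.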
Extracting matches

● When using re.search one is usually interested in getting the matching substring
● re.search returns a match object containing more information about the match
● the match object can be queried with the group method:
import re
dna = "CTTAGCAGCTTACG"
# store the match object in the variable m
m = re.search(r"GC[^N]GC", dna)
print("BisI recognition site found: " + m.group())
Capturing parts of matches

● What if we want to extract specific parts of a match?
● re.search can capture several parts of the match as groups in the match object
● parts that should be captured are indicated in the regex with parentheses:
import re
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
Getting the position of a match

● the match object can tell us what matched our regex but also where the match occurred in the string
● the match object provides the methods start and end to get positional information:
import re
dna = "CTTAGCAGCTTACG"
m = re.search(r"GC[^N]GC", dna)
print("match starts at: " + str(m.start()))
print("match ends at: " + str(m.end()))

● Caution! This code raises an error if no match is found: re.search then returns None instead of a match object, and None has no start or end method
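
One way to avoid that error is to test the result of re.search before using it. The helper below is a minimal sketch; the function name find_bisi_site is hypothetical and not part of the course material.

```python
import re

def find_bisi_site(dna):
    """Return (start, end) of the first GC[^N]GC match, or None if there is none."""
    m = re.search(r"GC[^N]GC", dna)
    if m is None:
        # no match: re.search returned None, so there is nothing to query
        return None
    return m.start(), m.end()

with_site = find_bisi_site("CTTAGCAGCTTACG")
without_site = find_bisi_site("CTTAGAATTCCG")
print(with_site)      # (4, 9)
print(without_site)   # None
```

The end position is exclusive, so dna[4:9] gives back the matched substring "GCAGC".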
Getting the position of multiple matches

● as with group, we can get the position of each captured group by addressing it by its index:
import re
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("whole match: " + str(m.start()) + " to " + str(m.end()))
print("1st bit: " + str(m.start(1)) + " to " + str(m.end(1)))
print("2nd bit: " + str(m.start(2)) + " to " + str(m.end(2)))

● Caution! This code raises an error if no match is found: re.search then returns None instead of a match object
Splitting a string with a regex

● the re module provides a split function which takes a regex as the delimiter to split by:
import re
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
print(runs)

→ "[^ATGC]" = any character except 'A', 'T', 'G' or 'C'

● This code prints out
['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
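
One detail worth knowing when splitting: two delimiters in a row produce an empty string in the result. The sketch below uses a made-up sequence with two adjacent ambiguous bases and shows how adding the '+' quantifier treats the whole stretch as one delimiter.

```python
import re

# Hypothetical sequence with two adjacent ambiguous bases (NN).
dna = "ACTNNGCATRGC"

# every single non-ACGT character is a delimiter
one_at_a_time = re.split(r"[^ATGC]", dna)
# a run of one or more non-ACGT characters is one delimiter
whole_stretch = re.split(r"[^ATGC]+", dna)
# lengths of the unambiguous runs
lengths = [len(run) for run in whole_stretch]

print(one_at_a_time)   # ['ACT', '', 'GCAT', 'GC']
print(whole_stretch)   # ['ACT', 'GCAT', 'GC']
print(lengths)         # [3, 4, 2]
```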
Finding all occurrences in a text

● the search function returns only the first match found in the string
● using findall we can extract all occurrences of a pattern:
import re
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{4,100}", dna)
print(runs)

→ "[AT]{4,100}" = runs of only 'A' or 'T' bases, 4-100 bp long
● runs will contain the following list:
['ATTATAT', 'AAATTATA']
Finding all matches in a text


● findall only returns the matching substrings, but what if we need the positions of the substrings?
● finditer returns a sequence of match objects (called an iterator) that can be used in a for loop:
import re
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{4,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT run from " + str(run_start) + " to " + str(run_end))
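
Since every item delivered by finditer is a full match object, positions and matched text can be collected together. A small sketch using the same sequence; the list name at_runs is just an illustrative choice.

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# collect (start, end, matched text) for every AT-rich run
at_runs = []
for match in re.finditer(r"[AT]{4,100}", dna):
    at_runs.append((match.start(), match.end(), match.group()))

for start, end, seq in at_runs:
    print(str(start) + "-" + str(end) + ": " + seq + " (" + str(end - start) + " bp)")
```

For this sequence the two runs are found at 5 to 12 and 18 to 26; as with slicing, the end position is exclusive.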
Recap

In this unit you learned about:
● the notion of modules in Python
● the concept and syntax of regular expressions
● using raw strings to avoid special character clashes
● how to use the re module to find occurrences of patterns in a text
● how to extract information from match objects
● splitting strings with regular expressions
Recap

Here is a shortlist of functions provided by the re module:

function      application
re.match      returns a match object if the pattern matches at the beginning of the string
re.search     returns a match object for the first occurrence of the pattern anywhere in the string
re.findall    returns a list of all (non-overlapping) substrings matching the pattern
re.finditer   returns a sequence of match objects representing all occurrences of the pattern
re.split      splits a text at each occurrence of the pattern
Recap

These methods can be used to extract information from a match object:

method   application
group    returns the substring matching the pattern (indexable, i.e. if more than one group was captured by the regex, each group can be addressed by its index)
start    returns the start position of a (sub)match (indexable)
end      returns the end position of a (sub)match (indexable)


Exercise 6-1: Accession names

● Write a script that reads a set of accession names from the file "accessions.txt" (comma-separated list)
● print the accessions fulfilling the following criteria:
a) contain the number 5
b) contain the letter d or e
c) contain the letters d and e (in that order)
d) contain the letters d and e (in that order) with a letter in between
e) contain the letters d and e (in any order)
f) start with x or y and end with e
g) contain 3 or more digits in a row
h) end with d followed by either a, r or p
Exercise 6-2: Double digest

● read the DNA sequence in the file "dna.txt"
● predict the fragment lengths we would get by digesting the sequence with the (made-up) restriction enzymes
– AbcI: cutting site "ANT*AAT"
– AbcII: cutting site "GCRW*TG"
(asterisks indicate where the enzyme cuts the DNA; N = any base, R = A or G, W = A or T)
