06_regex
This course (apart from chapter 9) is based on the book "Python for Biologists":
http://pythonforbiologists.com/
from scratch
A primer for scientists working with Next-Generation Sequencing data
CHAPTER 6
Regular expressions
Overview
Searching for patterns in strings
^ beginning of string
$ end of string
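A minimal sketch of how the two anchors restrict where a pattern may match. The sequence below is a made-up example chosen only for illustration; it is not taken from the course material.

```python
import re

# Hypothetical example sequence, used only for illustration.
dna = "ATGCGTACGTAA"

# ^ ties the pattern to the beginning of the string
starts_with_atg = re.search(r"^ATG", dna) is not None

# $ ties the pattern to the end of the string
ends_with_taa = re.search(r"TAA$", dna) is not None

# with both anchors the pattern has to account for the whole string
is_exactly_atg = re.search(r"^ATG$", dna) is not None

print(starts_with_atg, ends_with_taa, is_exactly_atg)
```

Here the first two checks succeed and the third fails, because the string contains more than just ATG.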
import re
# name is assumed to be a string defined earlier
if re.search(r"k.*h", name):
    print("pattern 'k.*h' occurs in " + name)
Combinations
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
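re.search returns None when the pattern is not found, so calling group on the result then raises an error. One possible way to guard against that is sketched below. The function name capture_variable_parts is a hypothetical helper introduced for this example, and the second test string is invented.

```python
import re

def capture_variable_parts(dna):
    """Return the two captured groups as a tuple, or None if there is no match.

    Hypothetical helper for illustration; the pattern is the one used above.
    """
    m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
    if m is None:
        # no match: there is no match object to ask for groups
        return None
    return m.group(1), m.group(2)

found = capture_variable_parts("ATGACGTACGTACGACTG")
missing = capture_variable_parts("AAAAAAAA")  # invented sequence without the pattern

print(found)    # ('CGT', 'GT')
print(missing)  # None
```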
Getting the position of a match
● the match object can tell us what matched our regex but
also where the match occurred in the string
● the match object provides the methods start and end to
get positional information:
dna = "CTTAGCAGCTTACG"
m = re.search(r"GC[^N]GC", dna)
print("match starts at: " + str(m.start()))
print("match ends at: " + str(m.end()))
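The positions returned by start and end are zero-based, and end points one past the last matched character, so they can be used directly as slice indices. A short sketch using the same sequence and pattern:

```python
import re

dna = "CTTAGCAGCTTACG"
m = re.search(r"GC[^N]GC", dna)

start = m.start()
end = m.end()

# slicing with start and end gives back exactly the matched text
matched_by_slice = dna[start:end]

print(start, end)                     # 4 9
print(matched_by_slice)               # GCAGC
print(matched_by_slice == m.group())  # True
print(m.span())                       # (4, 9), both positions as one tuple
```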
● findall only returns the matching substrings, but what if
we need the positions of the substrings?
● finditer returns a sequence of match objects
(called an iterator) that can be used in a for loop:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{4,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT run from " + str(run_start) + " to " + str(run_end))
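To make the difference between findall and finditer concrete, the sketch below runs both on the same sequence and pattern. The variable names substrings and positions are chosen for this example only.

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
pattern = r"[AT]{4,100}"

# findall: a list of the matching substrings, without positions
substrings = re.findall(pattern, dna)

# finditer: match objects, from which we can collect (start, end) pairs
positions = [(m.start(), m.end()) for m in re.finditer(pattern, dna)]

print(substrings)  # ['ATTATAT', 'AAATTATA']
print(positions)   # [(5, 12), (18, 26)]
```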
Recap
function application
re.match returns a match object if the pattern matches at the
beginning of the string
re.search returns a match object for the first place in the string
where the pattern matches
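A small sketch contrasting re.search and re.match on one made-up sequence. re.fullmatch, which is part of the standard re module, is included for comparison because it is the function that requires the pattern to cover the whole string.

```python
import re

# Hypothetical example sequence, used only for illustration.
dna = "ATGACGTACG"

# search looks for the pattern anywhere in the string
search_acg = re.search(r"ACG", dna) is not None

# match only tries at the beginning of the string
match_acg = re.match(r"ACG", dna) is not None
match_atg = re.match(r"ATG", dna) is not None

# fullmatch requires the pattern to cover the entire string
full_atg = re.fullmatch(r"ATG", dna) is not None
full_bases = re.fullmatch(r"[ATGC]+", dna) is not None

print(search_acg, match_acg, match_atg, full_atg, full_bases)
```

Only match_acg and full_atg are False: ACG does not occur at the start of the string, and ATG alone does not cover all of it.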