An Introduction To Regular Expressions (9781492082569)
An Introduction to Regular Expressions
by Thomas Nield
Copyright © 2019 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our corporate/institutional
sales department: 800-998-9938 or corporate@oreilly.com.
Setting Up
You can test the examples I am about to walk through in a number of
places. I recommend Regular Expressions 101 (regex101.com), a free web-based
application for testing a regular expression against text inputs. As we go through
these examples, type the regular expression pattern into the “Regular
Expression” field and a sample text into the “Test String” field. You will then
immediately see in the right panel whether a full or partial match succeeded,
as well as a broken-down explanation of what your regex is doing (see Figure
1-1).
Figure 1-1. The regex101.com site is a helpful tool to test regular expressions against text inputs.
For Python, you can also import and use the built-in re module as shown
below. The fullmatch() function accepts a regex pattern and an input
string to test against. It returns a match object if the entire string matches
the pattern, and None otherwise.
import re
result = re.fullmatch(pattern="[A-Z]{2}", string="TX")
if result:
    print("match")
else:
    print("Doesn't match")
Now that you are set up, we will walk through all the major functionalities
offered by regular expressions.
REGEX: TX
INPUT: TX
MATCH: true
REGEX: TX
INPUT: AZ
MATCH: false
REGEX: Lorem\sIpsum
INPUT: Lorem Ipsum
MATCH: true
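To reproduce the REGEX, INPUT, and MATCH listings in Python, a small helper can wrap re.fullmatch(). This is a minimal sketch that assumes full-match semantics; the name is_full_match is a hypothetical helper for illustration, not part of the re module.

```python
import re

def is_full_match(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Literal characters must match exactly, character for character.
print(is_full_match("TX", "TX"))
print(is_full_match("TX", "AZ"))

# \s stands for a single whitespace character, such as a space.
print(is_full_match(r"Lorem\sIpsum", "Lorem Ipsum"))
```

The raw string prefix r"..." keeps Python from interpreting the backslash before the regex engine sees it.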
Character Ranges
For a given position in a string, we can qualify only a range of characters. To
match a two-character string consisting of a 0, 1, or 3 followed by an F, X, or B,
we can declare a regular expression with character ranges inside square
brackets [].
REGEX: [013][FXB]
INPUT: 1X
MATCH: true
REGEX: [013][FXB]
INPUT: 1Z
MATCH: false
REGEX: [1-4][A-Z]
INPUT: 1X
MATCH: true
REGEX: [1-4][A-Z]
INPUT: 51
MATCH: false
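The character range listings above can be checked in Python along these lines. This is a sketch assuming full-match semantics; the variable names are illustrative only.

```python
import re

# One character from the set 0, 1, 3, followed by one from F, X, B.
explicit_set = re.compile(r"[013][FXB]")
# A digit from 1 through 4, followed by any uppercase letter.
span_range = re.compile(r"[1-4][A-Z]")

cases = [
    (explicit_set, "1X"),
    (explicit_set, "1Z"),
    (span_range, "1X"),
    (span_range, "51"),
]
for pattern, text in cases:
    print(pattern.pattern, text, bool(pattern.fullmatch(text)))
```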
You can also qualify multiple ranges on a single character. For instance, we
can qualify the first character in a two-character string to be an uppercase
letter, a lowercase letter, or a number.
REGEX: [A-Za-z0-9][0-9]
INPUT: i5
MATCH: true
REGEX: [A-Za-z0-9][0-9]
INPUT: 1X
MATCH: false
REGEX: [^AEIOU]
INPUT: X
MATCH: true
REGEX: [^AEIOU]
INPUT: E
MATCH: false
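A ^ placed first inside the square brackets negates the range, so [^AEIOU] accepts any single character except an uppercase vowel. A minimal sketch in Python, with an illustrative variable name:

```python
import re

# Any one character that is NOT an uppercase vowel.
not_vowel = re.compile(r"[^AEIOU]")

print(bool(not_vowel.fullmatch("X")))
print(bool(not_vowel.fullmatch("E")))
# Negation excludes only the listed characters, so a digit also qualifies.
print(bool(not_vowel.fullmatch("7")))
```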
If you want a literal dash - character to be part of the character range, declare
it first in the range.
REGEX: [-0-9][0-9]
INPUT: -9
MATCH: true
REGEX: [-0-9][0-9]
INPUT: 99
MATCH: true
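The literal dash can be tried in Python as follows. This sketch also shows an escaped dash, which Python's re module treats the same way inside square brackets; the variable names are illustrative.

```python
import re

# The leading dash is taken literally because it cannot start a range.
dash_first = re.compile(r"[-0-9][0-9]")
# Escaping the dash is an alternative that works at any position.
dash_escaped = re.compile(r"[0-9\-][0-9]")

for text in ["-9", "99", "9-"]:
    print(text, bool(dash_first.fullmatch(text)), bool(dash_escaped.fullmatch(text)))
```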
Anchors
Sometimes you will want to qualify the start ^ and end $ of a line or string.
This can be handy if you are searching a document and want to qualify the
start or end of a line as part of your regular expression. You can use this
regular expression to match all numbers that start a line in a document as
shown here:
^[0-9]
Figure 1-2. Using Atom Editor to search for numbers that start a line.
Likewise, this regular expression matches all numbers that end a line:
[0-9]$
REGEX: [0-9][0-9]
INPUT: 1432
MATCH: true
REGEX: ^[0-9][0-9]$
INPUT: 1432
MATCH: false
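The difference between a partial match and an anchored match can be sketched in Python. This assumes re.search() for partial matching and the re.MULTILINE flag so that ^ and $ apply to every line; the sample document text is made up for illustration.

```python
import re

text = "1432"
# search() succeeds if the pattern occurs anywhere in the string.
partial = re.search(r"[0-9][0-9]", text)
# With both anchors, the whole string must be exactly two digits.
anchored = re.search(r"^[0-9][0-9]$", text)
print(bool(partial), bool(anchored))

# With re.MULTILINE, ^ and $ match at the start and end of each line.
document = "1 apple\nbanana 2\n3 cherries"
line_starts = re.findall(r"^[0-9]", document, flags=re.MULTILINE)
line_ends = re.findall(r"[0-9]$", document, flags=re.MULTILINE)
print(line_starts, line_ends)
```

Wrapping a pattern in ^ and $ gives re.search() roughly the behavior that re.fullmatch() provides directly.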
Quantifiers
A critical feature of regular expressions is quantifiers, which repeat the
preceding clause of a regular expression.
Fixed Repetitions
For instance, it is a bit redundant to express [A-Z] three times to match three
uppercase letters.
REGEX: [A-Z][A-Z][A-Z]
INPUT: YCA
MATCH: true
Instead, we can follow the [A-Z] with a quantifier {3} to specify repeating
that character range three times, as in [A-Z]{3}. This accomplishes the same
task as [A-Z][A-Z][A-Z], but more succinctly expresses it as three
repetitions.
REGEX: [A-Z]{3}
INPUT: YCA
MATCH: true
We can use the regular expression below to match a 10-digit phone number
with dashes.
REGEX: [0-9]{3}-[0-9]{3}-[0-9]{4}
INPUT: 470-127-7501
MATCH: true
REGEX: [0-9]{3}-[0-9]{3}-[0-9]{4}
INPUT: 75663-2372
MATCH: false
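A short Python sketch of fixed repetitions, assuming full-match semantics. The variable names are illustrative.

```python
import re

# {3} repeats the preceding range exactly three times.
verbose = re.compile(r"[A-Z][A-Z][A-Z]")
concise = re.compile(r"[A-Z]{3}")
phone = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

print(bool(verbose.fullmatch("YCA")), bool(concise.fullmatch("YCA")))
print(bool(phone.fullmatch("470-127-7501")))
print(bool(phone.fullmatch("75663-2372")))
```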
REGEX: [A-Z]{2,3}
INPUT: YCA
MATCH: true
REGEX: [A-Z]{2,3}
INPUT: AZ
MATCH: true
Leaving the second argument empty while keeping the comma removes the
upper bound, so the quantifier specifies only a minimum. Below, we
have a regex that will match on a minimum of two alphanumeric characters.
REGEX: [A-Za-z0-9]{2,}
INPUT: YZ1
MATCH: true
REGEX: [A-Za-z0-9]{2,}
INPUT: YZSDjhfhSBH2342SDFSDFsdfw123412
MATCH: true
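The bounded and open-ended forms can be compared in Python. This is a sketch with illustrative names, again assuming full-match semantics.

```python
import re

# Between two and three uppercase letters.
two_to_three = re.compile(r"[A-Z]{2,3}")
# At least two alphanumeric characters, with no upper bound.
two_or_more = re.compile(r"[A-Za-z0-9]{2,}")

for text in ["A", "AZ", "YCA", "ABCD"]:
    print(text, bool(two_to_three.fullmatch(text)))

for text in ["Y", "YZ1", "YZSDjhfhSBH2342SDFSDFsdfw123412"]:
    print(text, bool(two_or_more.fullmatch(text)))
```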
REGEX: [0-9]?[A-Z]{2}
INPUT: 3BC
MATCH: true
REGEX: [0-9]{3}-?[0-9]{3}-?[0-9]{4}
INPUT: 470-127-7501
MATCH: true
REGEX: [0-9]{3}-?[0-9]{3}-?[0-9]{4}
INPUT: 4701277501
MATCH: true
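A ? makes the preceding clause optional, which is equivalent to {0,1}. The sketch below checks the listings above in Python; the variable names are illustrative.

```python
import re

# An optional leading digit, then exactly two uppercase letters.
optional_digit = re.compile(r"[0-9]?[A-Z]{2}")
# A 10-digit phone number whose dashes may be present or absent.
flexible_phone = re.compile(r"[0-9]{3}-?[0-9]{3}-?[0-9]{4}")

for text in ["3BC", "BC", "33BC"]:
    print(text, bool(optional_digit.fullmatch(text)))

for text in ["470-127-7501", "4701277501"]:
    print(text, bool(flexible_phone.fullmatch(text)))
```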
1 or More Repetitions
A + is a shorthand for {1,}, which requires a minimum of 1 repetition, but
will capture any number of repetitions after that.
REGEX: [XYZ]+
INPUT: Z
MATCH: true
REGEX: [XYZ]+
INPUT: XYZZZYZXXX
MATCH: true
REGEX: [XYZ]+[0-9]+
INPUT: XYZZZYZXXX2374676128963453452990
MATCH: true
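In Python, the + quantifier can be exercised like this. A minimal sketch with illustrative names; note that the empty string fails because at least one repetition is required.

```python
import re

one_or_more = re.compile(r"[XYZ]+")
letters_then_digits = re.compile(r"[XYZ]+[0-9]+")

for text in ["", "Z", "XYZZZYZXXX"]:
    print(repr(text), bool(one_or_more.fullmatch(text)))

# Both clauses need at least one character each.
print(bool(letters_then_digits.fullmatch("XYZZZYZXXX2374676128963453452990")))
print(bool(letters_then_digits.fullmatch("XYZ")))
```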
0 or More Repetitions
A * is a shorthand for {0,}, which makes whatever it is quantifying
completely optional, but will capture as many repetitions as it can if they do
exist.
REGEX: [0-3]+[XYZ]*
INPUT: 32
MATCH: true
REGEX: [0-3]+[XYZ]*
INPUT: 32YYXZZ
MATCH: true
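A sketch of the * quantifier in Python, assuming full-match semantics. The inputs here use only digits inside the 0 to 3 range, and the variable name is illustrative.

```python
import re

# One or more digits from 0 to 3, then zero or more of X, Y, Z.
digits_then_optional_letters = re.compile(r"[0-3]+[XYZ]*")

for text in ["32", "32YYXZZ", "XYZ"]:
    print(text, bool(digits_then_optional_letters.fullmatch(text)))
```

The last input fails because the digits are still required by the +, even though the trailing letters are optional.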
Wildcards
A dot . is a wildcard for any character, making it the broadest operator you
can use. It will match not just alphabetic or numeric characters, but also
whitespace, punctuation, and any other symbols. In most regex engines it
does not match newlines by default; a flag such as re.DOTALL in Python
enables that.
REGEX: ...
INPUT: B/C
MATCH: true
REGEX: .{3}
INPUT: B/C
MATCH: true
REGEX: H.{3}O
INPUT: HELLO
MATCH: true
A common operation you may see is .*, which allows 0 or more repetitions of
any character. This is often used to match any text, making it function as an
“everything” wildcard. This can be helpful when regular expressions are used as
qualifying parameters: if you do not want a parameter to restrict anything,
just make it a .*.
REGEX: .*
INPUT: AsdfSJDFJSVdsfBLKJXCasdBNVJWB$TJ$@#ASDFSD@
MATCH: true
REGEX: .*
INPUT: Alpha
MATCH: true
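The wildcard listings can be checked in Python as sketched below. One detail specific to Python's re module is that the dot does not match a newline unless the re.DOTALL flag is passed; the two-line sample string is made up for illustration.

```python
import re

# Each dot consumes exactly one character of any kind.
three_any = re.fullmatch(r"...", "B/C")
hello = re.fullmatch(r"H.{3}O", "HELLO")

# .* accepts any single-line text, including the empty string.
anything = re.fullmatch(r".*", "")

# By default the dot stops at a newline; re.DOTALL lifts that restriction.
two_lines = "Alpha\nBeta"
default_dot = re.fullmatch(r".*", two_lines)
dotall_dot = re.fullmatch(r".*", two_lines, flags=re.DOTALL)

print(bool(three_any), bool(hello), bool(anything))
print(bool(default_dot), bool(dotall_dot))
```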
Grouping
It can be helpful to group up parts of a regular expression in parentheses,
often to use a quantifier on that whole group. For instance, if you want to
qualify an uppercase letter followed by three numeric digits, but want to
repeat that whole operation with a quantifier, you can do so like this:
REGEX: ([A-Z][0-9]{3})+
INPUT: A563
MATCH: true
REGEX: ([A-Z][0-9]{3})+
INPUT: A563X264
MATCH: true
REGEX: ([A-Z][0-9]{3}-?)+
INPUT: A563-X264-C578
MATCH: true
If we wanted to identify phone numbers (with optional dashes -), but make
the area code (the first three digits) optional, we can do so like this:
REGEX: ([0-9]{3}-?)?[0-9]{3}-?[0-9]{4}
INPUT: 470-127-7501
MATCH: true
REGEX: ([0-9]{3}-?)?[0-9]{3}-?[0-9]{4}
INPUT: 127-7501
MATCH: true
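Grouping with a quantifier can be tried in Python as follows. This is a sketch with illustrative names, assuming full-match semantics.

```python
import re

# A letter plus three digits, optionally followed by a dash, repeated.
codes = re.compile(r"([A-Z][0-9]{3}-?)+")
# The whole area code group, including its dash, is optional.
phone = re.compile(r"([0-9]{3}-?)?[0-9]{3}-?[0-9]{4}")

for text in ["A563", "A563X264", "A563-X264-C578", "A56"]:
    print(text, bool(codes.fullmatch(text)))

for text in ["470-127-7501", "127-7501", "1277501"]:
    print(text, bool(phone.fullmatch(text)))
```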
Alternation
Alternation is expressed with a | and essentially operates as an “OR”. It
alternates two or more valid patterns where at least one of those patterns must
match in that position.
For instance, if we want to capture 5-digit U.S. ZIP codes that end in “35” or
“75,” we can tail a repeated numeric range with a (35|75). We must group
that in parentheses so the | does not mangle the 35 with the repeated numeric
range.
REGEX: [0-9]{3}(35|75)
INPUT: 75035
MATCH: true
REGEX: [0-9]{3}(35|75)
INPUT: 75062
MATCH: false
REGEX: ALPHA|BETA|GAMMA
INPUT: BETA
MATCH: true
REGEX: ALPHA|BETA|GAMMA
INPUT: DELTA
MATCH: false
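The effect of the parentheses around an alternation can be seen in Python. In this sketch, the ungrouped pattern is included only to show what goes wrong; the variable names are illustrative.

```python
import re

# Three digits followed by either 35 or 75.
grouped = re.compile(r"[0-9]{3}(35|75)")
# Without parentheses, | splits the whole pattern:
# either "three digits then 35" or just "75".
ungrouped = re.compile(r"[0-9]{3}35|75")
words = re.compile(r"ALPHA|BETA|GAMMA")

print(bool(grouped.fullmatch("75035")), bool(grouped.fullmatch("75062")))
print(bool(grouped.fullmatch("75")), bool(ungrouped.fullmatch("75")))
print(bool(words.fullmatch("BETA")), bool(words.fullmatch("DELTA")))
```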
REGEX: (?<=[A-Z])[0-9]+
INPUT: ALPHA12
MATCH: 12
REGEX: (?<=[A-Z])[0-9]+
INPUT: 167
MATCH: false
A suffix, written as a lookahead (?=...), works similarly to the prefix lookbehind (?<=...), but matches a tail without including that tail.
REGEX: [0-9]+(?=[A-Z]+)
INPUT: 12ALPHA
MATCH: 12
REGEX: [0-9]+(?=[A-Z]+)
INPUT: 167
MATCH: false
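Prefixes and suffixes can be sketched in Python with re.search(), which returns only the matched portion. Python's re module accepts only fixed-width lookbehind patterns, so this sketch uses (?<=[A-Z]), which checks a single preceding uppercase letter. The helper first_match is hypothetical and exists only for this example.

```python
import re

# Lookbehind: the digits must be preceded by an uppercase letter.
prefixed = re.compile(r"(?<=[A-Z])[0-9]+")
# Lookahead: the digits must be followed by one or more uppercase letters.
suffixed = re.compile(r"[0-9]+(?=[A-Z]+)")

def first_match(pattern, text):
    """Return the matched text, or None when nothing matches."""
    found = pattern.search(text)
    return found.group() if found else None

print(first_match(prefixed, "ALPHA12"), first_match(prefixed, "167"))
print(first_match(suffixed, "12ALPHA"), first_match(suffixed, "167"))

# A variable-width lookbehind is rejected by Python's re module.
try:
    re.compile(r"(?<=[A-Z]+)[0-9]+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

In both cases the letters are required for the match to succeed but are left out of the returned text.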
Conclusions
Remember that a regular expression often only needs to be as specific as
your data requires, depending on how predictable that data is.
Qualifying a number with [0-9.]+ will work to match an IP
address such as 172.18.83.200. But keep in mind it will also match
237476231.345342342334.23423756756856234, which is definitely not an
IP address. If you do not know your data well, you should probably err on
the side of being more specific, as demonstrated in this Stack Overflow question.
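As one possible way to tighten the pattern, the sketch below requires exactly four dot-separated groups of one to three digits. It is an illustrative assumption rather than a complete validator: it still accepts out-of-range values such as 999.999.999.999.

```python
import re

# Any run of digits and dots.
loose = re.compile(r"[0-9.]+")
# Four groups of 1 to 3 digits, separated by literal (escaped) dots.
stricter = re.compile(r"[0-9]{1,3}(\.[0-9]{1,3}){3}")

ip = "172.18.83.200"
junk = "237476231.345342342334.23423756756856234"

print(bool(loose.fullmatch(ip)), bool(loose.fullmatch(junk)))
print(bool(stricter.fullmatch(ip)), bool(stricter.fullmatch(junk)))
```

How far to go beyond this depends on how much you trust the incoming data.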
Regular expressions may seem niche, but they can rise up heroically to the
most unexpected tasks in your day-to-day work. Hopefully this article has
helped you feel more comfortable with regular expressions and find them
useful. They can assist in data munging, qualification, categorization, and
parsing as well as document editing.