0 ratings0% found this document useful (0 votes) 35 views3 pagesRegular Expression
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
6 Regular Expressions
Let’s now take a different approach to categorizing problems. Instead of focusing on the power of « computing device,
let's look at the task that we need to perform. In particular, let's consider problems in which our goal is to match
finite or repeating pattems. For example, consider:
© The first step of compiling a program: this step is called lexical analysis. Its job is to break the source code into
meaningful units such as keywords, variables, and numbers. For example, the string void may be a keyword,
while the string 232-12 should be recognized as a floating point number
+ Filtering email for spam.
+ Sorting email into appropriate mailboxes based on sender and/or content words and phrases.
# Searching a complex directory structure by specifying patterss that are known to occur in the file we want.
In this chapter, we will define a simple pattern language. It has limitations. But its strength, as we will soon see, is
that we ean implement pattern matching for this language using finite state machines
In his classic book, A Pattern Language B, Christopher Alexander described common patterns
that can be found in successful buildings, towns and cities. Software engineers read Alexander’s
‘work and realized that the same is true of successful programs and systems. Pattems are
ubiquitous in our world,
6.1 What is a Regular Expression?
“The repular expression language that we are about to deserbe is built onan alphabet that contains two kinds of
symbols:
+ asel ofspecial symbols to which we will atach particular meanings when they occur ina regular expression, These
symbols are , U,¢,(,), *, and ~
# an alphabet 2, which contains the symbols that regular expressions will match against
A regular expression © is string that can be formed according to the following rules:
is a regular expression,
is a regular expression,
Every element in ¥ isa regular expression,
Given two regular expressions a. and B, a is a regular expression,
Given two regular expressions a. and ,, 0. U B is a regular expression
Given a regular expression a, a is a regular expression,
Given a regular expression a, ais a regular expression,
Given a regular expression a, (a) is a regular expression,
So, ifwe let ¥ = {a, b}, the following strings are regular expressions:
Bea,
Jb)", abba Ue.
.
‘The language of regular expressions, as we have just defined it, is useful because every regular expression has a
‘meaning (just like every English sentence and every Java program). In the case of regular expressions, the meaning,
‘ofa string is another language. In other words, every string a (such as alba U e) in the regular expression language
‘has, as its meaning, some new language that contains exactly the strings that match the pattern specified in a,
To make it possible to determine that meaning, we need to describe a semantic interpretation function for regular
expressions. Fortunately, the regular expressions language is simple, So designing a compositional semantic
interpretation function (as defined in Section 2.2.6) for itis straightforward, As you read the definition that we are
about to present, it will become clear why we chose the particular symbol alphabet we did. In particular, you will,
Chapter 6 92 Regular Expressionsnotice the similarity between the operations that are allowed in regular expressions and the operations that we defined
‘on languages in Section 2.2
Define the following semantic interpretation function L for the language of regular expressions:
1, L(@)= 2, the language that contains no strings,
2. L(@)= {e} the language that contains just the empty string,
3. Forany ¢ © E, L(c) = {c}, the language that contains the single, one-character string ¢
4, For any regular expressions «and B, L(«B) = L(«) L(B). In other words, o form the meening of the concatenation
of two regular expressions, first determine the meaning of each of the constituents, Both meanings will be
Tanguages. Then concatenate the two languages together. Recall that the concatenation of two languages £ and
La is {w= xy, whore x € Ly and y © Lz}. Note that, if either L(at) or LB) is equal to Z, then the concatenation
will also be equal to 2.
5. Forany regular expressions a and B, (ce B) = L(a) UW L(P). Again we form the meaning ofthe larger expression
by first determining the meaning of cach of the constituents, Each of them is a language. The meaning of aU B
then, as suggested by our choice of the character U as an operator, isthe union of the two constituent languages.
6. For any regular expression a, L(«*) = (L(a))*, where * is the Kleene star operator defined in Section 2.2.5. So
L(a*) isthe language that is formed by concatenating together zero or more stings drawn from L(q)
7. For any regular expression 0, La’) ~ L(aa*) = L(a) (L(a))*. If La) is equal to Z%, then L(a") is also equal to
D. Otherwise L(a) is the language that ig formed by concatenating together one or more strings drawn from
a,
8. For any regular expression a, L((a)) ~ L(@). In other words, parentheses have no effect on meaning except to
‘group the constituents in an expression,
Ifthe meaning of a regular expression a is the language L, then we say that a defines or describes L.
The definition that we have just given for the regular expression language contains three kinds of rules:
+ Rules 1, 3, 4 5, and 6 give the language its power to define sets, starting with the basic sets defined by rules 1 and
3, and then building larger sets using the operators defined by rules 4,5, and 6.
# Rule 8 has as its only role grouping other operators
Rules 2 and 7 appear to add functionality to the regular expression language. But in fact they don; they serve only
to provide convenient shorthands for languages that can be defined using only rules 1, 3-6, and 8. Let's see why.
First consider rule 2: the language of regular expressions does not need the symbol ¢ because it has an alternative
mechanism for describing L(c). Observe that L(@*) ~ {w: wis formed by concatenating together zero or more strings
from @}. But how many ways are there fo concatenate together zero or more strings from @? If we select zero strings
to concatenate, we get c. We cannot select more than zero since there aren’t any to choose from. So L(2*) = (c}
‘Thus, whenever we would like to write ¢, we could instead write @*. Ibis much elearerto write e, and we shall. But,
‘whenever we wish to make a formal statement about regular expressions or the languages they define, we need not
‘consider rule 2 since we can rewrite any regular expression that contains © as an equivalent one that contains Z*
instead,
Next consider rule 7: as we showed in the statement of rule 7 itself, the regular expression a’ is equivalent to the
slightly longer regular expression at*. The form a” is @ convenient shorteut, and we will use it, But we need not
‘consider rule 7 in any analysis that we may choose to do of regular expressions or the languages that they generate.
‘The compositional semantic interpretation function that we just defined lets us map between regular expressions and
the languages that they define. We begin by analyzing the smallest subexpressions and then work outwatds to larger
and larger expressions,
Chapter 6 93 Regular ExpressionsExample 6.1 Analyzing a Simple Regular Expression
(a U bb) = La vs)*) Lo)
(Li(a U b)))* Lb)
(2) J E(o))* Le)
{a} U ()* {}
~ {a,)* (0)
So the meaning of the regular expression (a U b)*b is the set of al strings over the alphabet (2, } that end in b.
(One straightforward way to read a regular expression and determine its meaning is to imagine it as a procedure that
‘generates strings, Read it left to right and imagine it generating a string left to right. As you are doing tha, think of
any expression that is enclosed in a Kleene star asa loop that can be executed zero or more times. Each time through
the loop, choose any one of the alternatives listed in the expression. So we can read the regular expression of the last
‘example, (2 U b)*, a8, “Go through a loop zero or more times, picking a single a orb each time. Then concatenate
)*),
bb.” Any string that can be generated by this procedure is in L( (a
Regular expressions can be used to sean text and pick out email addresses, € 793
Example 6.2 Another Simple Regular Expression
UW ((aUb) (avd) )a(aUD)*) = 1((av ba Ub) La) Lau )*)
=L(aubyaub)) fa} L(avry)*
=a Ub) (av) {a} {a,b}
= {a,b} {a,b} {a} (a,b}*
So the meaning of the regular expression ( (a Ub) (a Ub) ) a (2 Ub) is
(way: x and y ate strings of a's and js and [x|=2}.
Alternatively, its the language that contains all strings of a’s and b's such that there exists a third character and it is
Example 6.3 Given a Language, Find a Regular Expression
Let L = {w < {a,1}* : he]is even}. There are two simple regular expressions both of which define L:
(@ubiauby* This one ean be read as, “Go through a loop zero or more times.
Each time through, choose an a or b, then choose a second
character (a or b).”
(aa Uab Uba Ubb)* ‘This one can be read as, “Go through a loop zero or more times.
Each time through, choose one of the two-character sequences.”
From this example, itis elear that the semantic interpretation function we have defined for regular expressions is not
‘one-to-one. In fact, given any language L, if there is one regular expression that defines i, there is an infinite number
that do. This is trivially tue since, for any regular expression a, the regular expression a U ot defines the same
language a does.
Recall from our discussion in Section 2.2.6 that this is not unusual. Semantic interpretation functions for English and
for Java are not one-to-one, The practical consequence of this phenomenon for regular expressions is that, if we are
‘ying to design a regular expression that describes some particular language, there will be more than one right answer,
We will generally seek the simplest one that works, both for clarity and to make pattern matching fast
Chapter 6 94 Regular Expressions