CSE528
Natural Language Processing
Venue:ADB-405 Topic: Regular Expressions & Automata
Pr o f. Tu l asi Pr a sa d S a ri ki ,
S CSE, V I T C h ennai Ca mpus
www. l earnersdesk.weeb l y. com
Contents
NLP Example: Chat with Alice
Regular Expressions
Regular Expression Patterns
Operator precedence
Applications
Regular Expressions in MS-Word
Finite Automata
FSA / FST
Applications of FSA & FST
REGULAR EXPRESSIONS AND AUTOMATA 2
NLP Example: Chat with Alice
A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) is an award-
winning free natural language artificial intelligence chat robot. The
software used to create A.L.I.C.E. is available as free ("open source")
Alicebot and AIML software.
http://www.alicebot.org/about.html
REGULAR EXPRESSIONS AND AUTOMATA 3
Regular Expressions
In computer science, RE is a language used for specifying text search string.
A regular expression is a formula in a special language that is used for
specifying a simple class of string.
Formally, a regular expression is an algebraic notation for characterizing a
set of strings.
RE search requires
◦ a pattern that we want to search for, and
◦ a corpus of texts to search through.
REGULAR EXPRESSIONS AND AUTOMATA 4
Regular Expressions
A RE search function will search through the corpus returning all
texts that contain the pattern.
◦ In a Web search engine, they might be the entire documents or Web
pages.
◦ In a word-processor, they might be individual words, or lines of a
document.
◦ E.g., the UNIX grep command
Regular expressions are case sensitive.
We will use Perl based syntax for representation.
REGULAR EXPRESSIONS AND AUTOMATA 5
Regular Expressions
Disjunctions [abc]
Ranges [A-Z]
Negations [^Ss]
Optional characters ? and *
Wild cards .
Anchors ^ and $, also \b and \B
Disjunction, grouping, and precedence |
REGULAR EXPRESSIONS AND AUTOMATA 6
Regular Expression Patterns
regular expression example pattern matched
/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/Claire says,/ Dagmar, my gift please,” Claire says,”
/song/ “all our pretty songs”
/!/ “You’ve left the burglar behind again!” said Nori
REGULAR EXPRESSIONS AND AUTOMATA 7
Regular Expression Patterns
The use of the brackets [] to specify a disjunction of characters.
Regular Expression Match
/[wW]oodchuck/ Woodchuck or woodchuck
/[abc]/ “a”, “b”, or “c”
/[0123456789]/ Any digit
REGULAR EXPRESSIONS AND AUTOMATA 8
Regular Expression Patterns
The use of the brackets [] plus the dash - to specify a range.
Regular expression match sample pattern
/[A-Z]/ any uppercase letter this is Linguistics 5981
/[0-9]/ any single digit this is Linguistics 5981
/[1 2 3 4 5 6 7 8 9 0]/ any single digit this is Linguistics 5981
REGULAR EXPRESSIONS AND AUTOMATA 9
Regular Expression Patterns
To search for negation, i.e. a character that I do NOT want to find we use
the caret: [^]
Regular expression match sample pattern
/[^A-Z]/ not an uppercase letter this is Linguistics 5981
/[^L l]/ neither L nor l this is Linguistics 5981
/[^\.]/ not a period this is Linguistics 598
Special characters:
\* an asterisk “L*I*N*G*U*I*S*T*I*C*S”
\. a period “Dr.Doolittle”
\? a question mark “Is this Linguistics 5981 ?”
\n a newline
\t a tab
REGULAR EXPRESSIONS AND AUTOMATA 10
Regular Expression Patterns
To search for optional characters we use the question mark: [?]
Regular expression match sample pattern
/colou?r/ colour or color beautiful colour
To search for any number of a certain character we use the Kleene star: [*]
Regular expression match
/a*/ any string of zero or more “a”s
/aa*/ at least one a but also any number of “a”s
REGULAR EXPRESSIONS AND AUTOMATA 11
Regular Expression Patterns
To look for at least one character of a type we use the Kleene “+”:
Regular expression match
/[0-9]+/ a sequence of digits
Any combination is possible
Regular expression match
/[ab]*/ zero or more “a”s or “b”s
/[0-9] [0-9]*/ any integer (= a string of digits)
REGULAR EXPRESSIONS AND AUTOMATA 12
Regular Expression Patterns
The “.” is a very special character -> so-called wildcard
Regular expression match sample pattern
/b.ll/ any character between b and ll ball, bell, bull, bill
The /. / symbol is called a wildcard : it matches any single character. For example, the
regular expression /s.ng/ matches the following English words:
sang, sing, song, sung.
Note that /./ will match and not only alphabetic characters, but also numeric and
whitespace characters. Consequently, /s.ng/ will also match non-words such as s3ng.
The pattern /....berry/ finds words like cranberry.
REGULAR EXPRESSIONS AND AUTOMATA 13
Regular Expression Patterns
Anchors (start of line: “^”, end of line:”$”)
Regular expression match sample pattern
/^Linguistics/ “Linguistics” at the beginning of a line Linguistics is fun.
/linguistics\.$/ “linguistics” at the end of a line We like linguistics.
Anchors (word boundary: “\b”, non-boundary:”\B”)
Regular expression match sample pattern
/\bthe\b/ “the” alone This is the place.
/\Bthe\B/ “the” included This is my mother.
REGULAR EXPRESSIONS AND AUTOMATA 14
Regular Expression Patterns
More on alternative characters: the pipe symbol: “|” (disjunction)
Regular expression match sample pattern
/colou?r/ colour or color beautiful colour
/progra(m|mme)/ program or programme linguistics program
REGULAR EXPRESSIONS AND AUTOMATA 15
Predefined Character class
Character class Description
\d A digit. Equivalent to[0-9].
\D A non-digit. Equivalent to [^0-9].
\s A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S A nonwhitespace character. Equivalent to[^\s].
\w A word character. Equivalent to [a-zA-Z_0-9].
\W A non-word character. Equivalent to [^\w].
REGULAR EXPRESSIONS AND AUTOMATA 16
Boundary matchers
Boundary Matcher Description
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A nonword boundary
\A The beginning of the text
\G The end of the previous match
\Z The end of the text (but for the final line terminator, if any)
\z The end of the text
REGULAR EXPRESSIONS AND AUTOMATA 17
Quantifiers
Character Description
{n} n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does
not match the 'o' in "Bob," but matches the two o's in "food".
{n,} n is a nonnegative integer. Matches at least n times. For example, 'o{2,}'
does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}'
is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} M and n are nonnegative integers, where n <= m. Matches at least n and at
most m times. For example, "o{1,3}" matches the first three o's in
"fooooood". 'o{0,1}' is equivalent to 'o
REGULAR EXPRESSIONS AND AUTOMATA 18
Operator precedence
A regular expression is evaluated from left to right and follows an order of
precedence, much like an arithmetic expression.
The following table illustrates, from highest to lowest, the order of precedence
of the various regular expression operators:
Operator(s) Description
\ Escape
(), (?:), (?=), [] Parentheses and Brackets
*, +, ?, {n}, {n,}, {n,m} Quantifiers
^, $, \anymetacharacter, anycharacter Anchors and Sequences
| Alternation
REGULAR EXPRESSIONS AND AUTOMATA 19
Operator precedence
Characters have higher precedence than the alternation operator, which
allows 'm|food' to match "m" or "food". To match "mood" or "food", use
parentheses to create a subexpression, which results in '(m|f)ood'.
REGULAR EXPRESSIONS AND AUTOMATA 20
Applications
Regular Expressions for the Java Programming Language
• java.util.regex for enabling the use of regular expressions
Applications
• Simple word replacement
• Email validation
• Removal of control characters from a file
• File searching
REGULAR EXPRESSIONS AND AUTOMATA 21
Example
write a Perl regular expression to match the English article “the”:
/the/ missed ‘The’
/[tT]he/ included ‘the’ in ‘others’
/\b[tT]he\b/ Missed ‘the25’ ‘the_’
/[^a-zA-Z][tT]he[^a-zA-Z]/ Missed ‘The’ at the beginning of a line
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
REGULAR EXPRESSIONS AND AUTOMATA 22
Example
Write a regular expression that will match “any PC with more than
500MHz and 32 Gb of disk space for less than $1000”:
Price
◦ /$[0-9]+/ # whole dollars
◦ /$[0-9]+\.[0-9][0-9]/ # dollars and cents
◦ /$[0-9]+(\.[0-9][0-9])?/ #cents optional
◦ /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries
REGULAR EXPRESSIONS AND AUTOMATA 23
Example
Specifications for processor speed
◦ /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
Memory size
◦ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
◦ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Vendors
◦ /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
◦ /\b(Mac|Macintosh|Apple)\b/
REGULAR EXPRESSIONS AND AUTOMATA 24
Example
Expression Matches
/^\s*$/ Match a blank line.
/\d{2}-\d{5}/ Validate an ID number consisting of 2
digits, a hyphen, and an additional 5 digits.
/<\s*(\S+)(\s[^>]*)?>[\s\S]*<\s*\/\1\s*>/ Match an HTML tag.
REGULAR EXPRESSIONS AND AUTOMATA 25
Regular Expressions in MS-Word
? and *
The two most basic wildcard characters are ? and *.
? is used to represent a single character and * represents any number of characters.
s?t will find sat, set, sit, sat and any other combination of 3 characters
beginning with “s” and ending with “t”. Ex: inset.
s*t will find all the above, but will also find “secret”, “serpent”, “sailing
boat” and“sign over document”, etc.
@
@ is used to find one or more occurrences of the previous character.
For example, lo@t will find lot or loot, ful@ will find ful or full etc.
REGULAR EXPRESSIONS AND AUTOMATA 26
Regular Expressions in MS-Word
<>
<s*t> would find “secret” and “serpent” and “sailing boat”, but not “sailing boats”
or “sign over documents”. It will also find “'set” in “tea-set” , but not “set” in
“toolset”.
The <> tags can be used in pairs, as above; or individually.
ful@> will find “full” and the appropriate part of “wilful”, but will not find “wilfully”.
REGULAR EXPRESSIONS AND AUTOMATA 27
Regular Expressions in MS-Word
[]
Square brackets are always used in pairs and are used to identify specific characters
or ranges of characters.
[abc] will find any of the letters a, b, or c.
[ A-Z] will find any upper case letter.
[ 13579] will find any odd digit.
\
If you wish to search for a character that has a special meaning in wildcard searches
– the obvious example being “?” – then you can do so by putting a backslash in front
of it.
[\?] will not find “\” followed by any character; but will find “?”
REGULAR EXPRESSIONS AND AUTOMATA 28
Regular Expressions in MS-Word
[!]
[!] is very similar to [ ] except in this case it finds any character not listed in the box
so [!o] would find every character except “o”.
You can use ranges of characters in exactly the same was as with [ ], thus [!A-Z] will
find everything except upper case letters.
REGULAR EXPRESSIONS AND AUTOMATA 29
Regular Expressions in MS-Word
{}
Curly brackets are used for counting occurrences of the previous character or
expression.
{n} This finds exactly the number “n” of occurrences of the previous character (so
for example, a{2} will find “aa”).
{n,m} finds text containing between “n” and “m” occurrences of the previous
character or expression; so a{2,3} will find “aa” and “aaa”, but only the first 3
characters in “aaaa” ).
REGULAR EXPRESSIONS AND AUTOMATA 30
Regular Expressions in MS-Word
()
Round brackets have no effect on the search pattern, but are used to divide the
pattern into logical sequences where you wish to re-assemble those sequences in a
different order during the replace – or to replace only part of that sequence.
They must be used in pairs and are addressed by number in the replacement.
Eg: (Tulasi) (Prasad) replaced by \2 \1 (note the spaces in the search and replace
strings) – will produce Prasad Tulasi or replaced by \2 alone will give Prasad.
^
The ^ (“caret”) character is not specific to wildcard searches but it sometimes has to
be used slightly differently from normal, when searching for wildcards.
REGULAR EXPRESSIONS AND AUTOMATA 31
Finite Automata
The regular expression is more than just a convenient meta-language for text
searching.
Any regular expression can be implemented as a finite-state automaton.
Symmetrically, any finite-state automaton can be described with a regular
expression.
Regular expression is one way of characterizing a particular kind of formal
language called a regular language.
Both regular expressions and finite-state automata can be used to describe
regular languages.
REGULAR EXPRESSIONS AND AUTOMATA 32
Finite Automata
The relationship between finite state automata, regular expression, and
regular language
Finite state automata
(Computataional Device)
Regular Expression Regular language
(Descriptive Notation) (Set of Objects)
REGULAR EXPRESSIONS AND AUTOMATA 33
What is a Finite-State Automaton?
An alphabet of symbols,
A finite set of states,
A transition function from states and symbols to states,
A distinguished member of the set of states called the start state, and
A distinguished subset of the set of states called final states.
FSA recognize the regular languages represented by regular expressions
Directed graph with labeled nodes and arc transitions
REGULAR EXPRESSIONS AND AUTOMATA 34
Formally
FSA is a 5-tuple consisting of
Q: a finite set of N states q0, q1, …, qN
: a finite input alphabet of symbols
q0: the start state
F: the set of final states, F Q
(q,i): a transition function mapping Q x to Q
REGULAR EXPRESSIONS AND AUTOMATA 35
FSA Accepter
Input
String
Output
“Accept”
Finite
or
Automaton
“Reject”
REGULAR EXPRESSIONS AND AUTOMATA 36
Transition Graph
abba -Finite Accepter a, b
q5
a a, b
b a b
initial q0 a q1 b q2 b q3 a q4
state
transition final state “accept”
state
REGULAR EXPRESSIONS AND AUTOMATA 37
Initial Configuration
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4
REGULAR EXPRESSIONS AND AUTOMATA 38
Reading Input
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4
REGULAR EXPRESSIONS AND AUTOMATA 39
Reading Input
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4
REGULAR EXPRESSIONS AND AUTOMATA 40
Reading Input
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4
REGULAR EXPRESSIONS AND AUTOMATA 41
Reading Input
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4
REGULAR EXPRESSIONS AND AUTOMATA 42
Reading Input
Input String
a b b a
a, b
q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4 Output: Accepted
REGULAR EXPRESSIONS AND AUTOMATA 43
Using an FSA to Recognize Sheep talk
REGULAR EXPRESSIONS AND AUTOMATA 44
Using an FSA to Recognize Sheep Talk
Sheep language can be defined as any string from the following (infinite) set:
The regular expression for this kind of sheeptalk is
/baa+!/ baa!
baaa!
All RE can be represented as FSA baaaa!
baaaaa!
....
a
b a a !
q0 q1 q2 q3 q4
REGULAR EXPRESSIONS AND AUTOMATA 45
State Transition Table for Sheep Talk
Input
State
a b a !
b a a ! 0(null) 1 Ø Ø
q0 q1 q2 q3 1 Ø 2 Ø
q4
2 Ø 3 Ø
3 Ø 3 4
4: Ø Ø Ø
REGULAR EXPRESSIONS AND AUTOMATA 46
Algorithm
function D-RECOGNIZE(tape,machine) returns accept or reject
index <- Beginning of tape
current-state <- Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
elseif transition-table[current-state,tape[index]] is empty then
return reject
else
current-state <- transition-table[current-state,tape[index]]
index <- index +1
REGULAR EXPRESSIONS AND AUTOMATA 47
Using an FSA to Recognize Sheep Talk
FSA recognizes (accepts) strings of a regular language
baa! a
baaa! b a a !
baaaa! q0 q1 q2 q3 q4
…
Tracing the execution of FSA on some sheep talk
q q q q q q
0 1 2 3 3 4
... ... ... b a a a ! ... ... ... ... ... ... ... ...
REGULAR EXPRESSIONS AND AUTOMATA 48
Adding a fail state to FSA
a
b a a !
q q q q q
0 1 2 3 4
! ! b
b ! b !
b
?
c a a
qf
REGULAR EXPRESSIONS AND AUTOMATA 49
Adding an else arch
REGULAR EXPRESSIONS AND AUTOMATA 50
Adding ϵ Transition
b a a !
q q q q q4
0 1 2 3
REGULAR EXPRESSIONS AND AUTOMATA 51
Example FSA
An FSA for the words of English numbers 1-99
REGULAR EXPRESSIONS AND AUTOMATA 52
FSA for NLP
Word Recognition
Dictionary Lookup
Spelling Conventions
REGULAR EXPRESSIONS AND AUTOMATA 53
Word Recognition
A word recognizer takes a string of characters as input and returns “yes”
or “no” according as the word is or is not in a given set.
Solves the membership problem.
e.g. Spell Checking, Scrabble(Un-ordered Concatenation)
Approximate methods
Has right set of letters (any order).
Has right sounds (Soundex).
Random (suprimposed) coding (Unix Spell)
REGULAR EXPRESSIONS AND AUTOMATA 54
Word Recognition
Exact Methods
Hashing
Search (linear, binary ...)
Digital search (“Tries”)
Finite-state automata
REGULAR EXPRESSIONS AND AUTOMATA 55
Dictionary Lookup
Dictionary lookup takes a string of characters as input and returns “yes” or “no”
according as the word is or is not in a given set and returns information about
the word.
Lookup Methods
Approximate — guess the information
If it ends in “ed”, it’s a past-tense verb.
Exact — store the information for finitely many words
Table Lookup
Hash
Search
REGULAR EXPRESSIONS AND AUTOMATA 56
Finite State Transducers
A finite state transducer essentially is a finite state automaton that works
on two (or more) tapes. The most common way to think about
transducers is as a kind of ``translating machine''. They read from one of
the tapes and write onto the other.
a:b
q0
a:b at the arc means that in this transition the transducer reads a from the
first tape and writes b onto the second.
REGULAR EXPRESSIONS AND AUTOMATA 57
Finite State Transducers
Transducer behaves as follows in the different modes.
generation mode: It writes a string of as on one tape and a string bs on
the other tape. Both strings have the same length.
recognition mode: It accepts when the word on the first tape consists of
exactly as many as as the word on the second tape consists of bs.
translation mode (left to right): It reads as from the first tape and writes
an b for every a that it reads onto the second tape.
translation mode (right to left): It reads bs from the second tape and
writes an a for every f that it reads onto the first tape.
relator mode: Computes relations between sets
REGULAR EXPRESSIONS AND AUTOMATA 58
FST vs FSA
FSA can act as a FST can act as a
Recognizer Recognizer
Generator Generator
5 tuple Representation Translator
Equivalent to regular languages Set relator
7 tuple Representation
Equivalent to regular relations
REGULAR EXPRESSIONS AND AUTOMATA 59
FST Operations
Inversion: Switching input and output labels
If T maps from I to O, T-1 maps from O to I
Composition:
If T1 is a transducer from I1 to O1 and T2 is a transducer from I2 to O2,
then T1 T2 is a transducer from I1 to O2.
REGULAR EXPRESSIONS AND AUTOMATA 60
FST for NLP
Tokenization
Morphological analysis
Transliteration
Parsing
Translation
Speech recognition
Spoken language understanding
REGULAR EXPRESSIONS AND AUTOMATA 61
REGULAR EXPRESSIONS AND AUTOMATA 62