Scanner
source tokens
code scanner parser IR
errors
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)
1
Specifying patterns
A scanner must recognize the units of syntax
Some parts are easy:
white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */
2
Specifying patterns
A scanner must recognize the units of syntax
Other parts are much harder:
identifiers
alphabetic followed by k alphanumerics ( , $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’
We need a powerful notation to specify these patterns
3
Operations on languages
Operation Definition
union of L and M L ∪ M = {s | s ∈ L or s ∈ M}
written L ∪ M
concatenation of L and M LM = {st | s ∈ L and t ∈ M}
written LM
Kleene closure of L L∗ = ∞ i
S
i=0 L
written L∗
+ S∞
positive closure of L L = i=1 Li
written L+
4
Regular expressions
Patterns are often specified as regular languages
Notations used to describe a regular language (or a regular set) include
both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) L(s)
S
(r)(s) is a RE denoting L(r)L(s)
(r)∗ is a RE denoting L(r)∗
If we adopt a precedence for operators, the extra parentheses can go
away. We assume closure, then concatenation, then alternation as the
order of precedence.
5
Examples
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’
Numbers can get much more complicated
Most programming language tokens can be described with REs
We can use REs to build scanners automatically
6
Algebraic properties of REs
Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent
7
Examples
Let Σ = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {aa, ab, ba, bb}
i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a∗ denotes {ε, a, aa, aaa, . . .}
4. (a|b)∗ denotes the set of all strings of a’s and b’s (including ε)
i.e., (a|b)∗ = (a∗b∗)∗
5. a|a∗b denotes {a, b, ab, aab, aaab, aaaab, . . .}
8
Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier :
letter
digit
letter other
0 1 2
digit accept
other
3
error
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
9
Code for the recognizer
char ← next char();
state ← 0; /* code for state 0 */
done ← false;
token value ← "" /* empty string */
while( not done ) {
class ← char class[char];
state ← next state[class,state];
switch(state) {
case 1: /* building an id */
token value ← token value + char;
char ← next char();
break;
case 2: /* accept state */
token type = identifier;
done = true;
break;
case 3: /* error */
token type = error;
done = true;
break;
}
}
return token type;
10
Tables for the recognizer
Two tables control the recognizer
a−z A−Z 0−9 other
char class:
value letter letter digit other
class 0 1 2 3
letter 1 1 — —
next state:
digit 3 1 — —
other 3 2 — —
To change languages, we can just change tables
11
Automatic construction
Scanner generators automatically construct code from RE-like
descriptions
• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code )
A key issue in automation is an interface to the parser
lex is a scanner generator supplied with UNIX
• emits C code for scanner
• provides macro definitions for each token
(used in the parser)
12
Grammars for regular languages
Can we place a restriction on the form of a grammar to ensure that it
describes a regular language?
Provable fact:
For any RE r, ∃ a grammar g such that L(r) = L(g)
Grammars that generate regular sets are called regular grammars:
They have productions in one of 2 forms:
1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol
These are also called type 3 grammars (Chomsky)
13
More regular languages
Example: the set of strings containing an even number of zeros and an
even number of ones
1
s0 s1
1
0 0 0 0
1
s2 s3
1
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗
14
More regular expressions
What about the RE (a | b)∗abb ?
a jb
a b b
s0 s1 s2 s3
State s0 has multiple transitions on a!
⇒ nondeterministic finite automaton
a b
s0 {s0, s1} {s0}
s1 – {s2}
s2 – {s3}
15
Finite automata
A non-deterministic finite automaton (NFA) consists of:
1. a set of states S = {s0, . . . , sn}
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of
states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F
A Deterministic Finite Automaton (DFA) is a special case of an NFA:
1. no state has a ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled
a leaving s
A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to
a final state such that the edges spell x.
16
DFAs and NFAs are equivalent
1. DFAs are clearly a subset of NFAs
2. Any NFA can be converted into a DFA, by simulating sets of
simultaneous states:
• each DFA state corresponds to a set of NFA states
• possible exponential blowup
17
NFA to DFA using the subset construction: example 1
ajb
a b b
s0 s1 s2 s3
a b
{s0} {s0, s1} {s0}
{s0, s1} {s0, s1} {s0, s2}
{s0, s2} {s0, s1} {s0, s3}
{s0, s3} {s0, s1} {s0}
b
b
a
a b b
fs0 g fs0 ; s1 g fs0 ; s2 g fs0 ; s3 g
a
a
18
Constructing a DFA from a regular expression
DFA
minimized
RE DFA
NFA
ε moves
RE →NFA w/ε moves
build NFA for each term
connect them with ε moves
NFA w/ε moves to DFA
construct the simulation
the “subset” construction
DFA → minimized DFA
merge compatible states
DFA → RE
construct Rkij = Rk−1
ik (Rk−1 ∗ k−1 S k−1
kk ) Rk j Ri j
19
RE to NFA
ε
N(ε)
a
N(a)
N(A) A
ε ε
N(A|B)
ε ε
N(B) B
N(AB) N(A) A N(B) B
N(A∗) ε
ε
ε
N(A) A
20
RE to NFA: example
N(A) A
ε ε
a|b
ε ε
N(B) B
a
2 3
ε ε
(a|b)∗ 0
ε
1 6
ε
7
ε ε
4 5
b
a b b
abb 7 8 9 10
21
NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and using the following operations:
Operation Definition
ε-closure(s) set of NFA states reachable from NFA state s on ε-transitions
alone
ε-closure(T ) set of NFA states reachable from some NFA state s in T on
ε-transitions alone
move(T, a) set of NFA states to which there is a transition on input symbol
a from some NFA state s in T
add state T = ε-closure(s0) unmarked to Dstates
while ∃ unmarked state T in Dstates
mark T
for each input symbol a
U = ε-closure(move(T, a))
if U 6∈ Dstates then add U to Dstates unmarked
Dtrans[T, a] = U
endfor
endwhile
ε-closure(s0) is the start state of D
A state of D is final if it contains at least one final state in N
22
NFA to DFA using subset construction: example 2
ε
a
2 3
ε ε
ε ε a b b
0 1 6 7 8 9 10
ε ε
4 5
b
a b
A B C
A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B B D
B = {1, 2, 3, 4, 6, 7, 8} E = {1, 2, 4, 5, 6, 7, 10}
C B C
C = {1, 2, 4, 5, 6, 7}
D B E
E B C
b
b
b
a
a
a b b
A B D E
a
a
23
Limits of regular languages
Not all languages are regular
One cannot construct DFAs to recognize these languages:
• L = {pk qk }
• L = {wcwr | w ∈ Σ∗}
Note: neither of these is a regular expression!
(DFAs cannot count!)
But, this is a little subtle. One can construct DFAs for:
• alternating 0’s and 1’s
(ε | 1)(01)∗(ε | 0)
• sets of pairs of 0’s and 1’s
(01 | 10)+
24
Ramification - Internet Protocol
How does your browser establish a connection with a web server?
• The client sends a SYN message to the server.
• In response, the server replies with a SYN-ACK.
• Finally the client sends an ACK back to the server.
This is done through two DFAs in the client and server, respectively.
25
Ramification - Intrusion Detection
Code Operating System
FILE * f;
f=fopen("demo", "r"); ------> SYS_OPEN
strcpy(...); //vulnerability
if (!f)
printf("Fail to open\n"); ------> SYS_WRITE
else
fgets(f, buf); ------> SYS_READ
...
A DFA will be exercised simultaneously with the program on the OS side
to detect intrusion.
26