1
COMPILER CONSTRUCTION(CS-636)
Gul Sher Ali
MS(CS & Tech.)-China
______________________________________________________
GIMS- PMAS Arid Agriculture University, Gujrat Campus
B-
Formal Languages & Automata2
Theory (a review in one slide)
Alphabet: a finite set of symbols
String: a finite, possibly empty sequence of
symbols from an alphabet
Language: a set, often infinite, of strings
Finite specifications of (possibly infinite)
languages
Automaton – a recognizer; a machine that
accepts all strings in a language (and rejects all
other strings)
Grammar – a generator; a system for producing
all strings in the language (and no other strings)
A particular language may be specified by
many different grammars and automata
A grammar or automaton specifies only one
language
Lexical Analyzer
Reads the input characters of the source program and
produces a sequence of tokens for each lexeme.
Source Program Lexical Tokens Parser
Analyzer (Syntax analyzer)
(Scanner)
Get next Token
Symbol
Table
3
Lexical Analyzer
Token : A pair consisting of token name & attribute
value.
E.g. :
<id , a>
<keyword, int >
< keyword , if >
< number , 123>
4
Lexical Analyzer
Pattern: Describes the form that a lexeme can take.
Id [R.E]
Letter {letter + digit )*
Letter = a-z
Digit = 0-9
Lexeme: A sequence of characters that matches the pattern of a
token.
Literal: Any thing surrounded by double quotes “”.
5
Lexical Analyzer
Task of Lexical analyzer:
1. Deletion of comments, white spaces, blanks , tabs, etc.
2. Produces the sequence of tokens.
3. Keeps the tracks of lines in case of errors.
6
Tokens 7
Example:
if( i == j )
z = 0;
else
z = 1;
Tokens 8
Input is just a sequence of characters:
i f ( \b i \b = = \b j \n \t ....
Tokens 9
Goal:
partition input string into
substrings
classify them according to
their role
Tokens 10
A token is a syntactic
category
Natural language:
“He wrote the program”
Words: “He”, “wrote”,
“the”, “program”
Tokens 11
Programming language:
“if(b == 0) a = b”
Words:
“if”, “(”, “b”, “==”, “0”,
“)”, “a”, “=”, “b”
Tokens 12
Identifiers:x y11 maxsize
Keywords: if else while
for
Integers: 2 1000 -44 5L
Floats: 2.0 0.0034 1e5
Symbols: ( ) + * / { } < > ==
Strings: “enter x” “error”
Scanning 13
When a token is found:
It is passed to the next phase of compiler.
Sometimes values associated with the token,
called attributes, need to be calculated.
Some tokens, together with their attributes,
must be stored in the symbol/literal table.
it is necessary to check if the token is already in the
table
Examples of attributes
Attributes of a variable are name, address,
type, etc.
An attribute of a numeric constant is its value.
How to construct a 14
scanner
Define tokens in the source language.
Describe the patterns allowed for
tokens.
Write regular expressions describing
the patterns.
Construct an FA for each pattern.
Combine all FA’s which results in an
NFA.
Convert NFA into DFA
Write a program simulating the DFA.
Ad-hoc Lexer 15
Hand-write code to
generate tokens.
Partition the input string
by reading left-to-right,
recognizing one token at
a time
Ad-hoc Lexer 16
Look-ahead required to
decide where one token
ends and the next token
begins.
Ad-hoc Lexer 17
Problems:
Do not know what kind of
token we are going to
read from seeing first
character.
Ad-hoc Lexer 18
Problems:
If token begins with “i”, is
it an identifier “i” or
keyword “if”?
If token begins with “=”, is
it “=” or “==”?
Ad-hoc Lexer 19
Need a more principled
approach
Use lexer generator that
generates efficient
tokenizer automatically.
How to Describe Tokens? 20
Regular
Languages are the
most popular for specifying
tokens
Simple and useful theory
Easy to understand
Efficient implementations
Languages 21
Let S be a set of
characters. S is called
the alphabet.
A language over S is set
of strings of characters
drawn from S.
Example of Languages 22
Alphabet = English characters
Language = English sentences
Alphabet = ASCII
Language = C++ programs,
Java, C#
Notation 23
Languages are sets of
strings (finite sequence of
characters)
Need some notation for
specifying which sets we
want
Notation 24
Forlexical analysis we
care about regular
languages.
Regular languages can
be described using
regular expressions.
Regular Languages 25
Each regular expression is
a notation for a regular
language (a set of words).
If A is a regular expression,
we write L(A) to refer to
language denoted by A.
Regular Expression 26
A regular expression (RE)
is defined inductively
a ordinary character
from S
e the empty string
Regular Expression 27
R|S = either R or S
RS = R followed by S
(concatenation)
R* = concatenation of R
zero or more times
(R*= e |R|RR|RRR...)
RE Extentions 28
R? = e | R (zero or one R)
R + = RR* (one or more R)
(R) = R (grouping)
RE Extentions 29
[abc] = a|b|c (any of
listed)
[a-z] = a|b|....|z (range)
[^ab] = c|d|... (anything
but
‘a’‘b’)
Regular Expression 30
RE Strings in L(R)
a “a”
ab “ab”
a|b “a” “b”
(ab)* “” “ab” “abab” ...
(a|e)b “ab” “b”
Example: integers 31
integer: a non-empty
string
of digits
digit =
‘0’|’1’|’2’|’3’|’4’|
’5’|’6’|’7’|’8’|’9’
integer = digit digit*
Example: identifiers 32
identifier:
string or letters or digits
starting with a letter
C identifier:
[a-zA-Z_][a-zA-Z0-9_]*
Recap 33
Tokens:
strings of characters
representing lexical units
of programs such as
identifiers, numbers,
operators.
Recap 34
Regular Expressions:
concise description of
tokens. A regular
expression describes a
set of strings.
Recap 35
Language L(R):
set of strings represented
by a regular expression
R. L(R) is the language
denoted by regular
expression R.
How to Use REs 36
We need mechanism to
determine if an input
string w belongs to L(R),
the language denoted
by regular expression R.
Acceptor 37
Sucha mechanism is called
an acceptor.
input w
string yes, if w e L
acceptor
no, if w e L
language L
Finite Automata (FA) 38
Specification:
Regular Expressions
Implementation:
Finite Automata
Finite Automata
Finite Automaton consists of
An input alphabet (S)
A set of states
A start (initial) state
A set of transitions
A set of accepting (final) states
39
Finite Automaton 40
State Graphs
A state
The start state
An accepting
state
Finite Automaton 41
State Graphs
a
A transition
Finite Automata 42
A finite automaton
accepts a string if we can
follow transitions labelled
with characters in the
string from start state to
some accepting state.
FA Example 43
A FA that accepts only “1”
1
FA Example 44
A FA that accepts any number of 1’s followed by a single 0
1
0
FA Example 45
A FA that accepts ab*a
Alphabet: {a,b}
b
a a
Table Encoding of FA
Transition b
table a a
0 1 2
a b
0 1 err
1 2 1
2 err err
46
RE → Finite Automata 47
Can we build a finite
automaton for every
regular expression?
Yes, – build FA inductively
based on the definition of
Regular Expression
NFA 48
Nondeterministic Finite
Automaton (NFA)
Can have multiple
transitions for one input
in a given state
Can have e - moves
Epsilon Moves 49
ε – moves
machine can move from state A to state B without consuming
input
e
A B
NFA 50
operation of the automaton is
not completely defined by input
1
0 1
A B C
On input “11”, automaton could be
in either state
Execution of FA 51
A NFA can choose
Whether to make e-
moves.
Which of multiple
transitions to take for a
single input.
Acceptance of NFA 52
NFA can get into multiple states
Rule: NFA accepts if it can get in a final state
1
0 1
A B C
0
DFA and NFA 53
Deterministic Finite
Automata (DFA)
One transition per input
per state.
No e - moves
Execution of FA 54
A DFA
can take only one path
through the state graph.
Completely determined by
input.
NFA vs DFA 55
NFAs and DFAs recognize
the same set of
languages (regular
languages)
DFAs are easier to
implement – table driven.
NFA vs DFA 56
Fora given language,
the NFA can be simpler
than the DFA.
DFA can be
exponentially larger than
NFA.
NFA vs DFA 57
NFAsare the key to
automating RE → DFA
construction.
RE → NFA Construction 58
Thompson’s construction
(CACM 1968)
Build an NFA for each RE
term.
Combine NFAs with
e-moves.
RE → NFA Construction 59
Subset construction
NFA → DFA
Build the simulation.
Minimize number of
states in DFA (Hopcroft’s
algorithm)
RE → NFA Construction 60
Key idea:
NFA pattern for each
symbol and each
operator.
Join them with e-moves in
precedence order.
RE → NFA Construction 61
a
s0 s1
NFA for a
s0 a
s1 e s3 b s4
NFA for ab
RE → NFA Construction 62
a
NFA for a s0 s1
RE → NFA Construction 63
a
NFA for a s0 s1
b
NFA for b s3 s4
RE → NFA Construction 64
a
NFA for a s0 s1
b
NFA for b s3 s4
s0 a b
s1 s3 s4
RE → NFA Construction 65
a
NFA for a s0 s1
b
NFA for b s3 s4
s0 a
s1 e s3 b s4
NFA for ab
RE → NFA Construction 66
a
e s1 s2 e
s0 s5
e b
s3 s4 e
NFA for a | b
RE → NFA Construction 67
s1 a
s2
NFA for a
RE → NFA Construction 68
s1 a
s2
s3 b s4
NFA for a and b
RE → NFA Construction 69
a
e s1 s2 e
s0 s5
e b
s3 s4 e
NFA for a | b
RE → NFA Construction 70
s0 e s1 a
s2 e s4
e
NFA for a*
RE → NFA Construction 71
s1 a
s2
NFA for a
RE → NFA Construction 72
s0 e s1 a
s2 e s4
e
NFA for a*
Example RE → NFA 73
NFA for a ( b|c )* e
b
e e e s4 s5 e
a e
s0 s1 s2 s3 s8 s9
e s6 c
s7 e
e
Example RE → NFA 74
building NFA for a ( b|c )*
a
s0 s1
Example RE → NFA 75
NFA for a, b and c
b
s4 s5
a
s0 s1
s6 c
s7
Example RE → NFA 76
NFA for a and b|c
b
e s4 s5 e
a
s0 s1 s3 s8
e s6 c
s7 e
Example RE → NFA 77
NFA for a and ( b|c )*
e
b
e e s4 s5 e
a e
s0 s1 s2 s3 s8 s9
e s6 c
s7 e
e
Example RE → NFA 78
NFA for a ( b|c )* e
b
e e e s4 s5 e
a e
s0 s1 s2 s3 s8 s9
e s6 c
s7 e