Compiler
(CS3104)
Syntax Analysis,
Parsing
Taken from: https://cse.iitkgp.ac.in/~bivasm/compiler2024.html
Lex – example-1
Input file – input_first
if + 78 else 0
Tokens: if, else, op (+,-), number, other
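For instance, the token stream produced for the input above would be (token names here are illustrative; the exact names depend on the Lex specification):

  Lexeme   Token
  if       if
  +        op
  78       number
  else     else
  0        number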
Parsing
► Every programming language has precise grammar rules that describe the
syntactic structure of well-formed programs
► In C, the rules state that a program consists of functions, a function
consists of declarations and statements, a statement consists of
expressions, and so on.
► The task of a parser is to
(a) Obtain the string of tokens from the lexical analyzer and verify that the string
follows the rules of the source language
(b) Report errors and sometimes recover from them
• Type checking, semantic analysis and translation actions can be interlinked with parsing
• Implemented as a single module
Parsing
► Two major classes of parsing
► top-down and bottom-up
► Input to the parser is scanned from left to right, one symbol at a
time.
► The syntax of programming language constructs can be specified
by context-free grammars
► Grammars systematically describe the syntax of programming
language constructs like expressions and statements.
► Quick recall
Context free grammar
► A CFG is denoted as G = (N, T, P, S)
N : Finite set of non-terminals -- syntactic variables (stmt, expr)
T : Finite set of terminals ---- Tokens, basic symbols from which strings and
programs are formed
S : The start symbol -- set of strings it generates is the language generated
by the grammar
P : Finite set of productions -- specify the manner in which the terminals and
nonterminals can be combined to form strings
Productions
Each production has the form head -> body: the head is a nonterminal, and the body is a string of terminals and/or nonterminals. One nonterminal is designated as the start symbol.
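For example, a tiny grammar for arithmetic expressions (similar grammars are used throughout these slides) is:

  E -> E + T | T
  T -> T * F | F
  F -> ( E ) | id

Here E, T, F are the nonterminals, +, *, (, ), id are the terminals, E is the start symbol, and each line lists the productions for one nonterminal.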
Task of a parser
The output of the parser is some representation of the parse tree for the stream of
tokens coming from the lexical analyzer.
• Top-down parsers work for LL grammars
• Bottom-up parsers work for LR grammars
• These are only subclasses of context-free grammars
• But they are expressive enough to describe most of the syntactic constructs
of modern programming languages.
Concentrate on parsing expressions
• Constructs that begin with keywords like while or int are relatively easy to
parse, because the keyword guides the parsing decisions
• We therefore concentrate on expressions, which present more of a challenge
because of the associativity and precedence of operators
Derivations
The construction of a parse tree can be conceptualized as derivations
Derivation: Beginning with the start symbol, each rewriting step replaces a
nonterminal by the body of one of its productions.
A sentence of G is a sentential form with no nonterminals.
The language L(G) generated by a grammar G is its set of sentences.
Derivations
Consider a grammar G (presumably the standard expression grammar E -> E + E | E * E | - E | ( E ) | id)
1. Derivation of -(id+id) from the start symbol E
2. -(id+id) is a sentence of G
3. At each step in a derivation, there are two choices to be made:
• Which nonterminal to replace? Always replacing the leftmost nonterminal gives a leftmost derivation
• Accordingly, we must also choose one of its productions
Derivations -- Rightmost derivations
Consider the same grammar G
1. Derivation of -(id+id) from E
2. -(id+id) is a sentence of G
3. At each step in a derivation, there are two choices to be made:
• Which nonterminal to replace? Always replacing the rightmost nonterminal gives a rightmost derivation
• Accordingly, we must also pick one of its productions
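For concreteness, assuming the standard expression grammar E -> E + E | E * E | - E | ( E ) | id, the two derivations of -(id+id) are:

  Leftmost:  E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
  Rightmost: E => -E => -(E) => -(E+E) => -(E+id) => -(id+id)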
Parse trees
► A parse tree is a graphical representation of a derivation that
exhibits
► the order in which productions are applied to replace
non-terminals
► Each interior node is labelled by the non-terminal A in the head of the
production applied at that node
► The children of the node are labelled, from left to right,
by the symbols in the body of the production by which
A was replaced during the derivation
► A parse tree corresponds to a unique leftmost and a unique rightmost
derivation; for -(id+id) both derivations yield the same parse tree
Figure: parse tree for - ( id + id ); its leaves, read from left to right, form a sentential form.
Ambiguity
► A grammar that produces more than one parse tree for some sentence is
said to be ambiguous
► An ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence.
Two distinct leftmost derivations for the sentence id + id * id
Figures: the two distinct parse trees for id + id * id, and an unambiguous grammar for the same language.
Unambiguous grammar
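A commonly used unambiguous version of the expression grammar (presumably the one intended here) encodes precedence and left associativity in the nonterminals:

  E -> E + T | T
  T -> T * F | F
  F -> ( E ) | id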
Top-Down Parsing
• Top-down parsing can be viewed as the problem of
• Constructing a parse tree for the input string,
• starting from the root and creating the nodes of the parse tree in
preorder
• Top-down parsing can be viewed as finding a leftmost derivation for an input
string
Top-Down Parsing
Figure: building the parse tree for - ( id + id ) top-down, following the derivation; choosing the wrong production would instead lead towards - ( + id ), which does not match the input.
Top-Down Parsing
Figures: a left-recursive grammar and an equivalent non-left-recursive grammar.
Eliminating left recursion: the rule for immediate left recursion, its generalization, and unfolding all the left recursions (a sketch of the transformation follows).
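For reference, the standard textbook transformation for immediate left recursion is:

  A -> A α | β        becomes        A -> β A'
                                     A' -> α A' | ε

Applied to the expression grammar, E -> E + T | T becomes E -> T E', E' -> + T E' | ε, and T -> T * F | F becomes T -> F T', T' -> * F T' | ε.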
Top-Down Parsing
Challenges:
At each step of a top-down parse, the key problem is determining
the production to be applied for a nonterminal, say A -- i.e., choosing the correct production.
(a) Recursive-descent parsing: may require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: no backtracking! The correct production is chosen by
looking ahead at the input a fixed number of symbols (the next symbols) -- LL(k),
LL(1) grammars
Recursive-Descent Parsing
Nondeterministic: on failure, try other productions!
(a) A recursive-descent parser consists of a set of procedures, one for each
nonterminal.
(b) Execution begins with the procedure for the start symbol S.
(c) It halts and announces success if S() returns and its procedure body has scanned the
entire input string.
(d) Backtracking may require repeated scans over the input.
(The example traced here is presumably the classic one: grammar S -> c A d, A -> a b | a, with input w = cad.)
• The leftmost leaf, labeled c, matches the first symbol of the input w (i.e. c), so we
advance the input pointer to a.
• We have a match for the second input symbol, a, so we advance the input pointer
to d, the third input symbol.
• We compare d against the next leaf, labeled b. Failure!! Backtrack! We must reset
the input pointer to the position of a and try the other A-production.
• Now the leaf a matches the second input symbol of w (i.e. a) and the leaf d matches
the third input symbol d.
• Since S() returns and we have scanned all of w and produced a parse tree for w,
we halt and announce successful completion of parsing.
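A minimal sketch in C of such a backtracking recursive-descent parser, assuming the grammar S -> c A d, A -> a b | a inferred from the trace above (the function and variable names are invented for illustration):

  #include <stdio.h>
  #include <string.h>

  static const char *input;   /* the string being parsed        */
  static int pos;             /* current position in the input  */

  static int match(char c) {           /* consume c if it is the next symbol */
      if (input[pos] == c) { pos++; return 1; }
      return 0;
  }

  static int A(void) {                 /* A -> a b | a */
      int save = pos;                  /* remember position for backtracking */
      if (match('a') && match('b')) return 1;   /* try A -> a b first */
      pos = save;                      /* backtrack */
      return match('a');               /* then try A -> a */
  }

  static int S(void) {                 /* S -> c A d */
      return match('c') && A() && match('d');
  }

  int main(void) {
      input = "cad";
      pos = 0;
      if (S() && pos == (int)strlen(input))
          printf("parse successful\n");
      else
          printf("syntax error\n");
      return 0;
  }

Note that this naive scheme backtracks only locally inside A(); a fully general backtracking parser would also have to retry alternatives in the calling procedures on failure.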
Top-Down Parsing
Challenges (recap):
At each step of a top-down parse, the key problem is determining
the production to be applied for a nonterminal, say A.
(a) Recursive-descent parsing: may require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: no backtracking! The parser looks ahead at the input a
fixed number of symbols (the next symbols) -- LL(k), LL(1) grammars
Basic concept of Predictive parsing
Suppose the input string is w = abcd and the parse has reached a sentential form S => aXY...; a production must now be chosen for X.
• If the X-productions are, say, X -> bA... and X -> cP..., the first symbol of each body (b vs. c) is compared with the next input symbol: only one production can match, and that is the one to apply. This is the role of First(α) for each production X -> α.
• If instead X -> ε is a production and the sentential form is S => aXb, then b follows X in the sentential form, so on input b the production X -> ε should be applied. Knowing which symbols can follow X is the role of Follow(X).
How to compute First(X)
• If X is a terminal, First(X) = {X}.
• If X -> Y1 Y2 ... Yk is a production, add First(Y1) to First(X); if Y1 can derive ε, also add First(Y2), and so on.
• If X -> ε is a production, add ε to First(X).
How to compute Follow(A)
• Place $ in Follow(S), where S is the start symbol.
• If there is a production whose body contains A followed by something, e.g. S -> xAyz, then what can start the part after A (here the terminal y) is in Follow(A).
• If A appears at the right end of a body, as in B -> xαA, or is followed only by symbols that can derive ε, then everything in Follow(B) is in Follow(A); for instance Follow(T) is contained in Follow(F), from the production T -> F T' with T' nullable.
Predictive parsing
Recall: at each step of a top-down parse, the key problem is determining
the production to be applied for a nonterminal, say A.
(a) Recursive-descent parsing: may require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: no backtracking! The parser looks ahead at the input a
fixed number of symbols (the next symbols) -- LL(k), LL(1) grammars
Predictive parsing
Parsing table M
An LL(1) grammar avoids any confusion: each table entry M[A, a] holds at most one production.
In particular, for every pair of productions A -> α | β, First(α) and First(β) must be disjoint sets (the full condition is given below).
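The full LL(1) condition (in its standard formulation): for every pair of distinct productions A -> α | β,
1. First(α) and First(β) are disjoint;
2. at most one of α and β can derive the empty string ε;
3. if β =>* ε, then First(α) is disjoint from Follow(A) (and symmetrically if α =>* ε).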
Basic concept of Predictive parsing (continued)
Again let the input string be w = abcd and the sentential form S => aXY....
• If two X-productions begin with the same symbol, say X -> bA... and X -> bY..., the next input symbol b no longer determines the production: the First sets of the two bodies overlap, so the grammar is not LL(1). Left factoring removes such common prefixes.
• As before, if X -> ε is a production and the sentential form is S => aXb, we know that b follows X in a sentential form, i.e., b is in Follow(X); the entry for X -> ε goes into the table under the symbols of Follow(X).
Left Factoring
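For reference, the standard left-factoring transformation is:

  A -> α β1 | α β2        becomes        A -> α A'
                                         A' -> β1 | β2

For example, stmt -> if expr then stmt else stmt | if expr then stmt is left-factored into stmt -> if expr then stmt rest and rest -> else stmt | ε.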
Parsing table M
How the entries are filled:
• The obvious case: with input w = bacd and sentential form S => bAY..., if one of the A-productions is A -> aX..., then a is in First(aX...), so the entry M[A, a] is set to that production.
• The ε case: with input w = abcd and sentential form S => aAb, if A has a production A -> α where α derives ε, then b follows A in the sentential form, i.e., b is in Follow(A), so M[A, b] is set to A -> α.
Some of the First and Follow facts needed to fill the table for the non-left-recursive expression grammar:
First(FT') = {(, id}    First(*FT') = {*}    Follow(T') = {+, ), $}    First((E)) = {(}    First(id) = {id}
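For reference, the complete First and Follow sets for that grammar (E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> ( E ) | id) are:

  First(E) = First(T) = First(F) = {(, id}
  First(E') = {+, ε}        First(T') = {*, ε}
  Follow(E) = Follow(E') = {), $}
  Follow(T) = Follow(T') = {+, ), $}
  Follow(F) = {+, *, ), $}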
Example of Non-LL(1) grammar
• For every LL(1) grammar, each parsing-table entry uniquely identifies a
production or signals an error.
• Left-recursive or ambiguous grammars are not LL(1).
Input string: ibtibtaea. With the usual abbreviated dangling-else grammar (presumably S -> iEtSS' | a, S' -> eS | ε, E -> b, with i = if, t = then, e = else), this string reads as:
if b then ( if b then a else a )
Example of Non-LL(1) grammar
Predictive Parsing
• Non-recursive version
• maintaining a stack explicitly, rather than implicitly via
recursive calls
Initial configuration
Predictive Parsing
Initial configuration
• The parser considers (i) the symbol
on top of the stack X, and (ii) the
current input symbol a.
• If X is a nonterminal, the parser
chooses an X-production from M[X, a] of
the parsing table.
• Otherwise, it checks for a match
between the terminal X and current
input symbol a.
Predictive Parsing
The stack contains a sequence of grammar symbols, initially the start symbol on top of the end-marker $. Repeatedly replacing the nonterminal on top of the stack by the body of the production chosen from M (and matching terminals against the input) traces out a leftmost derivation of the input w.
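A compact sketch in C of this non-recursive driver loop. To keep it self-contained it hard-codes the parsing-table entries of a deliberately tiny LL(1) grammar, S -> ( S ) S | ε (balanced parentheses); the grammar, the names and the table here are illustrative, not taken from the slides.

  #include <stdio.h>

  int parse(const char *w) {
      char stack[100];                       /* stack of grammar symbols      */
      int top = -1;
      int i = 0;                             /* input pointer                 */
      stack[++top] = '$';                    /* end marker at the bottom      */
      stack[++top] = 'S';                    /* start symbol on top           */
      while (top >= 0) {
          char X = stack[top--];             /* symbol on top of the stack    */
          char a = w[i] ? w[i] : '$';        /* current input symbol          */
          if (X == 'S') {                    /* nonterminal: consult M[S, a]  */
              if (a == '(') {                /* M[S,(] = S -> ( S ) S         */
                  stack[++top] = 'S';        /* push the body in reverse      */
                  stack[++top] = ')';
                  stack[++top] = 'S';
                  stack[++top] = '(';
              } else if (a == ')' || a == '$') {
                  /* M[S,)] = M[S,$] = S -> epsilon: push nothing */
              } else {
                  return 0;                  /* error entry in the table      */
              }
          } else {                           /* terminal (or $): must match   */
              if (X != a) return 0;
              if (a != '$') i++;             /* advance past the matched symbol */
          }
      }
      return w[i] == '\0';                   /* accept iff all input consumed */
  }

  int main(void) {
      printf("%s\n", parse("(()())") ? "accepted" : "rejected");
      printf("%s\n", parse("(()")    ? "accepted" : "rejected");
      return 0;
  }

The structure is exactly the one described above: pop X from the stack; if X is a nonterminal, push the body given by M[X, a] in reverse order; otherwise match X against the current input symbol a.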
Bottom Up Parsing
• A bottom-up parse corresponds to the construction of a parse tree for an
input string,
• beginning at the leaves (the bottom) and working up towards the
root (the top).
• At each step the parser must choose the correct production with which to reduce a portion of the input.
Bottom Up Parsing -- sentential forms
The sequence of sentential forms produced corresponds to a rightmost derivation; bottom-up parsing is therefore the construction of a rightmost derivation in reverse.
Reduction
• A specific substring of the input matching the body of a production
• is replaced by the nonterminal at the head of that production.
Example: with the production A -> Bcd, the string Bcdxy is reduced to Axy (written Bcdxy => Axy when read in the derivation direction).
• Bottom-up parsing can thus be seen as the process of "reducing" a string w to
the start symbol of the grammar.
Challenges
(a) when to reduce and
(b) what production to apply, as the parse proceeds.
Reduction
Example: consider reducing id * id with the expression grammar. Reducing the leading id all the way to E would give E * id, which is not a right-sentential form of the grammar, so that sequence of reductions is wrong. The correct reduction steps are shown below.
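Assuming the unambiguous expression grammar E -> E + T | T, T -> T * F | F, F -> ( E ) | id, the reduction steps for id * id are:

  id * id  =>  F * id  =>  T * id  =>  T * F  =>  T  =>  E

Read from right to left, this sequence is exactly a rightmost derivation of id * id.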
Challenges
(a) when to reduce and
(b) what production to apply, as the parse proceeds.
Handle
• A "handle" is a substring of a right-sentential form that matches the body of a production,
• and whose reduction is one step of a rightmost derivation in reverse, i.e., a step towards the start symbol.
Example: with the production A -> Bcd and the right-sentential form Bcdxy (x, y terminal strings), the handle is Bcd, and reducing it gives the right-sentential form Axy.
Identifying the handle is a challenge.
Shift Reduce parsing
Bottom-up parsing in which
(a) a stack holds grammar symbols and
(b) an input buffer holds the rest of the string to be parsed;
(c) the handle always appears at the top of the stack.
Initial configuration: stack $, input w$.  Final configuration: stack $S, input $ (accept).
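A typical stack/input/action trace for id * id under the same expression grammar (the standard worked example):

  Stack         Input        Action
  $             id * id $    shift
  $ id          * id $       reduce by F -> id
  $ F           * id $       reduce by T -> F
  $ T           * id $       shift
  $ T *         id $         shift
  $ T * id      $            reduce by F -> id
  $ T * F       $            reduce by T -> T * F
  $ T           $            reduce by E -> T
  $ E           $            accept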
Shift Reduce parsing
Why the handle always appears at the top of the stack: consider two consecutive steps of a rightmost derivation, which can take one of two forms.
Case 1 (productions A -> βBy and B -> ɣ):  S =>* αAz => αβByz => αβɣyz.
In reverse, the parser reduces ɣ to B with ɣ on top of the stack, then shifts y; the next handle βBy is again at the top of the stack.
Case 2 (productions A -> y and B -> ɣ):  S =>* αBxAz => αBxyz => αɣxyz.
In reverse, the parser reduces ɣ to B, then shifts x and y; the next handle y is again at the top of the stack.
In both cases, after a reduction the parser only has to shift before the next handle appears at the top, so the handle never lies buried inside the stack.
Conflict
Shift/reduce conflict: Cannot decide whether to shift or to reduce
Reduce/reduce conflict: Cannot decide which of several reductions to
make
Shift/reduce conflict
Shift Reduce parsing
LR Parsing
Challenges in shift-reduce parsing:
(a) when to reduce and
(b) what production to apply, as the parse proceeds.
Examples of LR parsers: Simple LR (SLR), canonical LR(1), LALR
• An LR parser makes its shift-reduce decisions by maintaining the states of an LR(0) automaton.
• Each state represents a set of items.
Items
Intuitively, an item indicates how much of a production body we
have seen at a given point in the parsing process. For a production A -> XYZ:
A -> .XYZ  indicates that we hope to see a string derivable from XYZ on the next input
A -> X.YZ  indicates that we have just seen on the input a string derivable from X and that we hope next to see a string derivable from YZ
A -> XYZ.  indicates that we have seen the body XYZ on the input and that it may be time to reduce XYZ to A
(A -> XY.Z is the remaining item for this production.)
Canonical LR(0) collection
• A set of items => one state
• The collection of sets of items => the canonical LR(0) collection => the collection of states
LR(0) automaton: a deterministic finite automaton, constructed from these states, that is used to make parsing decisions.
To construct the canonical LR(0) collection for a grammar G,
we define (a) the augmented grammar and (b) two functions, CLOSURE and GOTO.
Augmented grammar: if G is a grammar with start symbol S, then the
augmented grammar G' is G with a new start symbol S' and the added production S' -> S.
Closure of Item Sets
CLOSURE(I): if I is a set of items, then CLOSURE(I) is the set of items constructed by two rules:
(a) initially, add every item of I to CLOSURE(I);
(b) if A -> α.Bβ is in CLOSURE(I) and B -> ɣ is a production, add the item B -> .ɣ (if it is not already there); repeat until no new items can be added.
Kernel items: the initial item S' -> .S and all items whose dots are not at the left end.
Nonkernel items (the others, with the dot at the left end) can be easily derived from the kernel items by taking the closure.
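As a worked instance, for the augmented expression grammar E' -> E, E -> E + T | T, T -> T * F | F, F -> ( E ) | id (the usual textbook example), the closure of the single kernel item E' -> .E is the initial state:

  I0 = CLOSURE({E' -> .E})
     = { E' -> .E,  E -> .E + T,  E -> .T,  T -> .T * F,  T -> .F,  F -> .( E ),  F -> .id }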
GOTO of Item Sets
• The second useful function is GOTO(I, X), where I is a set of items and X
is a grammar symbol.
• GOTO(I, X) is defined as the closure of the set of all items A -> αX.β such that A -> α.Xβ is in I.
• It defines the transitions in the LR(0) automaton: if I is the item set of some state, GOTO(I, X) specifies the transition out of that state under the grammar symbol X.
• Several distinct states may move to the same state on the same symbol: for example, both I2 = GOTO(I1, X) and I2 = GOTO(I3, X) are possible.
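Continuing the same expression-grammar example: if I1 = { E' -> E.,  E -> E. + T } is the state reached from I0 on E, then

  GOTO(I1, +) = CLOSURE({ E -> E + .T })
              = { E -> E + .T,  T -> .T * F,  T -> .F,  F -> .( E ),  F -> .id }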
Canonical LR(0) collection
LR(0) automaton: a deterministic finite automaton, constructed from the item sets, that is used to make parsing decisions.
• A set of items => one state
• The collection of sets of items => the canonical LR(0) collection => the collection of states
To construct the canonical LR(0) collection for a grammar G,
we define (a) the augmented grammar and (b) the two functions CLOSURE and GOTO.
Augmented grammar: if G is a grammar with start symbol S, then the
augmented grammar G' adds a new start symbol S' and the production S' -> S.
Canonical collection of sets of items
Figure: the LR(0) automaton, with states I0, I1, I2, ... and transitions labelled by grammar symbols.
LR(0) automaton
(a) The states of this automaton are the sets of items from the canonical LR(0) collection,
(b) the transitions are given by the GOTO function.
We say "state j" to refer to the state corresponding to the set of items Ij; a grammar symbol is written generically as X.
LR-Parsing Algorithm
Parsing table (ACTION and GOTO)
• The stack holds a sequence of states, with s0 at the bottom.
• Where a shift-reduce parser shifts a symbol, an LR parser shifts a state.
• The state on top of the stack (s_m) represents the state of the parser.
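The parser's moves, in the standard formulation (a configuration is the stack of states s0 s1 ... s_m together with the remaining input ai ... an $):

1. If ACTION[s_m, ai] = shift s: push s; the next input symbol becomes ai+1.
2. If ACTION[s_m, ai] = reduce A -> β: pop |β| states, exposing some state s; push GOTO[s, A]; output the production A -> β.
3. If ACTION[s_m, ai] = accept: parsing is complete.
4. Otherwise: report a syntax error.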
Role of LR(0) automata in shift-reduce decisions
Key idea: let the input be w = yaα, where the prefix y has already been scanned, a is the next input symbol, and the parser is in state j (the state reached after scanning y).
• If state j has a transition on a, shift: move to state k on the next input symbol a (diagram: j --a--> k).
• Otherwise, we choose to reduce; the items in state j tell us which production to use.
• All transitions into a state k are on the same grammar symbol. Thus each state has a unique grammar symbol associated with it (except the start state 0).
• Multiple states may, however, be associated with the same grammar symbol.
Key idea: states and reduction
With symbols, reduction is implemented by popping the body of the production (the
body is id) from the stack and pushing the head of the production (in
this case, F).
With states, (a) we pop state 5, which brings state 0 to the top, (b) look for the transition on F, the head of the production, and (c) push state 3 (= GOTO(0, F)).
Shift Reduce parsing with the LR(0) automaton
Pop and push: a reduction by A -> β pops the states corresponding to β, exposing some state i on top, and then pushes j = GOTO(i, A).
Figures: the SLR parsing table and the LR-parsing algorithm (initial state s0), together with the LR(0) automaton from which the table is built.
Constructing SLR-Parsing Tables
• An LR parser using an SLR parsing table is called an SLR parser.
• The overall procedure is the same for LR(1) and LALR parsers.
• Step 1: Given a grammar G, we augment it to produce G', with
a new start symbol S'.
• Step 2: Construct the LR(0) items and the LR(0) automaton:
• we construct the canonical collection of sets of items for
G' together with the GOTO function.
• Step 3: Construct the parsing table:
• determine the ACTION and GOTO entries.
SLR-Parsing Table: Algorithm -- shift entries
If [A -> α.aβ] is in Ii, with a a terminal, and GOTO(Ii, a) = Ij, then ACTION[i, a] = shift j: the stack now corresponds to ...αa, and we are still looking for a handle.
SLR-Parsing Table: Algorithm -- reduce entries
If [A -> α.] is in Ii, the parser *may* have detected a handle: the stack corresponds to ...α.
Reducing by A -> α is correct only if the result is still part of a right-sentential form, i.e., only if S =>* ...Aa... => ...αa..., where a is the next input symbol. In that case a follows A, so a is in Follow(A).
SLR rule: for every [A -> α.] in Ii and every a in Follow(A), set ACTION[i, a] = reduce A -> α.
SLR-Parsing Table: Algorithm -- accept entry
If [S' -> S.] is in Ii, then the whole input has been reduced to the start symbol: set ACTION[i, $] = accept. Done!!
SLR-Parsing Table: Algorithm -- the complete set of rules is summarized below; applying them yields the SLR parsing table.
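Putting the pieces together, the standard SLR table-construction rules for the augmented grammar G' are:

1. Construct the canonical LR(0) collection {I0, I1, ..., In} for G'.
2. If [A -> α.aβ] is in Ii and GOTO(Ii, a) = Ij for a terminal a, set ACTION[i, a] = shift j.
3. If [A -> α.] is in Ii (A not S'), set ACTION[i, a] = reduce A -> α for every a in Follow(A).
4. If [S' -> S.] is in Ii, set ACTION[i, $] = accept.
5. If GOTO(Ii, A) = Ij for a nonterminal A, set GOTO[i, A] = j.
6. All remaining entries are error; if any entry receives two different actions, the grammar is not SLR(1).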
SLR-Parsing Table: Example (the table is constructed from the LR(0) automaton shown earlier).
Non-SLR: Example
Grammar (presumably the classic one): S -> L = R | R,  L -> * R | id,  R -> L
Conflicting action!! In the state containing both S -> L.=R and R -> L., the symbol = is in Follow(R), so the SLR table prescribes both shift and reduce on input = -- a shift/reduce conflict.
Non-SLR: Where is the problem?
Input: id=*id
Rightmost derivation: S => L=R => L=L => L=*R => L=*L => L=*id => id=*id
SLR parsing, making the correct choice (shift =) at the critical point:
Stack: $          Input: id=*id$
Stack: $ id       Input: =*id$
Stack: $ L        Input: =*id$    (reduce by R -> L here??)
Stack: $ L=       Input: *id$     (no: shift = instead)
...
Stack: $ L=*id    Input: $
...
Stack: $ S        Input: $
Non-SLR: Where is the problem?
With the same input, the SLR table also allows the wrong choice, reducing by R -> L on input =:
Stack: $          Input: id=*id$
Stack: $ id       Input: =*id$
Stack: $ L        Input: =*id$    (reduce by R -> L??)
Stack: $ R        Input: =*id$    Incorrect! No right-sentential form of the grammar begins with R = ..., so the parse cannot be completed.
Viable Prefixes
• The LR(0) automaton characterizes the strings of grammar symbols
that can appear on the stack of a shift-reduce parser for the
grammar.
• The stack contents must be a prefix of a right-sentential form.
• If the stack holds α and the rest of the input is x, then a sequence of
reductions will take αx to S.
Not all prefixes of right-sentential forms can appear on the stack
The prefixes of right sentential forms that can appear on the stack of
a shift reduce parser are called viable prefixes.
Because the handle always appears at the top of the stack (recall the two cases with A -> βBy and B -> ɣ), the stack contents always form a viable prefix.
Example: E+T is a viable prefix, and the LR(0) automaton reaches a definite state after reading it.
Viable Prefixes
• The set of valid items for a viable prefix γ is
• the set of items reached from the initial state along the path labeled γ in the LR(0) automaton.
SLR parsing is based on the fact that LR(0) automata recognize viable prefixes (and their valid items).
SLR says...
Suppose the stack holds βα and, starting from I0, the LR(0) automaton reaches state Ii, which contains the complete item A -> α.
Shift or reduce? SLR says: reduce by A -> α whenever the next input symbol a is in Follow(A).
But if the grammar has no derivation S =>* ...βAa..., then βαa is not part of any right-sentential form, and the reduction would produce an invalid right-sentential form. For this particular viable prefix βα (and lookahead a) the item A -> α. is invalid, and the reduction should be avoided -- yet the SLR table cannot tell.
Non-SLR: Where is the problem?
Since = is in Follow(R) -- for example, S =>* *R=id =>* *id=id, so = can follow R in a right-sentential form -- SLR cannot rule out the reduction R -> L in front of =.
It is possible to carry extra information in the state that will
allow us to rule out some of these invalid reductions.
LR(1) Parser, CLR
• Splitting states:
• each state of an LR(1) parser indicates exactly which input symbols
can follow a handle α for which there is a possible reduction to A.
• This extra information is incorporated into the state by redefining
items to include a terminal symbol (the lookahead) as a second component.
LR(1) Parser
An LR(1) item has the form [A -> α.β, a], where the terminal a is the lookahead. For SLR the lookahead is left implicit (all of Follow(A) is used when deciding to reduce); an LR(1) item carries it explicitly.
LR(1) Sets of Items
Closure of Item Sets – LR(1)
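The LR(1) closure rule (standard formulation) extends the LR(0) rule by also computing lookaheads:
if [A -> α.Bβ, a] is in CLOSURE(I) and B -> ɣ is a production, add [B -> .ɣ, b] for every terminal b in First(βa).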
LR(1) automaton -- GOTO
GOTO(I, X) is computed as for LR(0) -- move the dot over X and take the closure -- with the lookahead of each item carried along unchanged.
Figure: the LR(1) automaton for the example grammar. It may have more states than the LR(0) automaton, but no redundant states: the split states are exactly what rules out the invalid reductions.
LR(1) Parsing table
Shift entries are as before: if [A -> α.aβ, b] is in Ii and GOTO(Ii, a) = Ij, set ACTION[i, a] = shift j. The lookahead b is not important for shift entries; after the shift the stack corresponds to ...αa and we are still expecting a handle.
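The remaining canonical-LR(1) table rules (standard formulation) differ from SLR only in the reduce entries:
• If [A -> α., a] is in Ii (A not S'), set ACTION[i, a] = reduce A -> α -- only for that single lookahead a, not for all of Follow(A).
• If [S' -> S., $] is in Ii, set ACTION[i, $] = accept.
• GOTO entries come from the GOTO function on nonterminals, as before.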
LR(1) Parsing table -- example trace for input dd, assuming the usual example grammar S -> CC, C -> cC | d:

Stack       Input    Action
0           dd$      shift
0 4         d$       reduce by C -> d, then goto on C
0 2         d$       shift
0 2 7       $        reduce by C -> d, then goto on C
0 2 5       $        reduce by S -> CC, then goto on S
0 1         $        accept (acc)
LALR
• Considerably smaller than the canonical LR tables
• Most common syntactic constructs of programming languages
can be expressed conveniently by an LALR grammar
Same core items, different lookaheads:
• sets of LR(1) items having the same core, that is, the same set of first
components,
• are merged into one set of LALR items, taking the union of their lookahead sets.
Example (for the grammar S -> CC, C -> cC | d): I3 and I6 share the core {C -> c.C, C -> .cC, C -> .d} and differ only in the lookaheads (c/d versus $), so they are merged; likewise I4 with I7 and I8 with I9.
LALR -- GOTO
• Since the core of GOTO(I,X) depends only on the core,
• Goto's of merged sets can themselves be merged.
• Thus, there is no problem revising the GOTO function as we
merge sets of items.
LR(1) automaton -- GOTO
LALR Parsing table
LALR conflicts
• Merging cannot introduce a new shift/reduce conflict: if an LALR item set has a shift/reduce conflict on a, then, because the merged LR(1) sets share the same core, one of them already had the same conflict on a.
• Merging can, however, introduce reduce/reduce conflicts: two LR(1) item sets may each be conflict-free (no reduce-reduce conflict on d, e), yet the LALR item set obtained by merging them has a reduce-reduce conflict on d and e!
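A classic textbook grammar that exhibits exactly this behaviour (presumably the one intended here) is:

  S -> aAd | bBd | aBe | bAe
  A -> c
  B -> c

It generates acd, ace, bcd, bce. The LR(1) item sets { [A -> c., d], [B -> c., e] } and { [A -> c., e], [B -> c., d] } are each conflict-free, but they share the same core, and merging them yields reduce/reduce conflicts on both d and e.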
Efficient Construction of LALR Parsing Tables
We must attach the proper lookaheads to the LR(0) items in the
kernels, to create the kernels of the sets of LALR(1) items.
• We are now ready to attach lookaheads to the kernels of the sets of
LR(0) items to form the sets of LALR(1) items.
• First, we know that $ is a lookahead for S' -> .S in the initial set of
LR(0) items.
• The algorithm gives us all the lookaheads generated spontaneously.
• After listing all those lookaheads, we must allow them to propagate
until no further propagation is possible.
• Keep track of "new" lookaheads that have propagated into an item
but which have not yet propagated out.
Using Ambiguous Grammars
• The grammar in question (presumably E -> E + E | E * E | ( E ) | id) is ambiguous because it does not specify the associativity or
precedence of the operators + and *.
• The unambiguous grammar gives + lower precedence than * and makes both
operators left associative.
• We might nevertheless prefer to use the ambiguous grammar:
• the parser for the unambiguous grammar spends a substantial
fraction of its time reducing by the productions E -> T and T -> F,
• whose sole function is to enforce associativity and precedence.
• The parser for the ambiguous grammar will not waste time reducing by these
single productions (productions whose body consists of a single nonterminal).
Conflicts: + and * are in Follow(E), so the states containing the complete items E -> E + E. and E -> E * E. receive both shift and reduce actions on + and * -- shift/reduce conflicts.
Conflict resolution
These conflicts can be resolved using the precedence and associativity information for + and *.
Consider the input id + id * id, which causes the parser to enter state 7 after
processing id + id; at that point the stack holds E + E and the remaining input is * id $, a shift/reduce conflict between reducing by E -> E + E and shifting *.
If * takes precedence over +, the parser should shift * onto the stack.
Thus the relative precedence of + followed by * uniquely determines how
the parsing-action conflict between reducing E -> E + E and shifting on * in
state 7 should be resolved.
Conflict resolution
The same kind of conflict is resolved using the associativity information for +.
Consider the input id + id + id, which causes the parser to enter state 7 after
processing id + id; the stack again holds E + E, now with + id remaining. Since + is left associative, the parser should reduce by E -> E + E rather than shift the second +.
Conflict resolution
Likewise for precedence: consider the input id * id + id, which causes the parser to enter state 8 after
processing id * id; the stack holds E * E with + id remaining. Since * takes precedence over +, the parser should reduce by E -> E * E rather than shift +.
Conflict resolution
For the dangling-else grammar, the shift/reduce conflict arises when the parser sees else: should it shift, or reduce the inner if-then statement?
The answer is that we should shift else, because it is "associated"
with the previous then.
We conclude that the shift/reduce conflict should be
resolved in favor of shift on input else.