
Compiler

(CS3104)
Syntax Analysis,
Parsing

Taken from: https://cse.iitkgp.ac.in/~bivasm/compiler2024.html


Lex – example-1
Input file – input_first

if + 78 else 0

Tokens: if, else, op (+,-), number, other
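A minimal sketch of such a lexer in Python. The token classes (if, else, op for + and -, number, other) come from the slide; the regular expressions and the function name tokenize are assumptions of this sketch, not the actual Lex specification used in the course.

import re

# Token classes from the slide: if, else, op (+,-), number, other.
TOKEN_SPEC = [
    ("IF",     r"if\b"),
    ("ELSE",   r"else\b"),
    ("OP",     r"[+\-]"),
    ("NUMBER", r"[0-9]+"),
    ("WS",     r"\s+"),   # whitespace, skipped
    ("OTHER",  r"."),     # anything else, one character at a time
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(tokenize("if + 78 else 0")))
# [('IF', 'if'), ('OP', '+'), ('NUMBER', '78'), ('ELSE', 'else'), ('NUMBER', '0')]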


Parsing
► Every programming language has precise grammar rules that describe the
syntactic structure of well-formed programs
► In C, the rules state that a program consists of functions, a function
consists of declarations and statements, a statement consists of
expressions, and so on.

► The task of a parser is to


(a) Obtain the string of tokens from the lexical analyzer and verify that the string
follows the rules (grammar) of the source language
(b) Report errors and, where possible, recover from them

• Type checking, semantic analysis and translation actions can be interlinked
with parsing
• Implemented as a single module.
Parsing
► Two major classes of parsing
► top-down and bottom-up
► Input to the parser is scanned from left to right, one symbol at a
time.

► The syntax of programming language constructs can be specified


by context-free grammars
► Grammars systematically describe the syntax of programming
language constructs like expressions and statements.

► Quick recall
Context free grammar
► A CFG is denoted as G = (N, T, P, S)
N : Finite set of non-terminals -- syntactic variables (stmt, expr)
T : Finite set of terminals ---- Tokens, basic symbols from which strings and
programs are formed
S : The start symbol -- set of strings it generates is the language generated
by the grammar
P : Finite set of productions -- specify the manner in which the terminals and
nonterminals can be combined to form strings

Productions are written head → body: the head is a nonterminal and the body is a
string of terminals and/or nonterminals. In a grammar listing, the start symbol
is usually the head of the first production.
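For concreteness, the four components can be written down directly; a small sketch in Python, using a tiny expression grammar that is assumed here purely for illustration:

# G = (N, T, P, S) for a small expression grammar.
N = {"E", "T", "F"}                      # nonterminals (syntactic variables)
T = {"+", "*", "(", ")", "id"}           # terminals (tokens)
S = "E"                                  # start symbol
P = {                                    # productions: head -> list of bodies
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}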
Task of a parser
The output of the parser is some representation of the parse tree for the stream
of tokens that comes from the lexical analyzer.

• Top-down parsers work for LL grammars
• Bottom-up parsers work for LR grammars
• These are only subclasses of the context-free grammars,
• but expressive enough to describe most of the syntactic constructs
of modern programming languages.

Concentrate on parsing expressions


• Constructs that begin with keywords like while or int are relatively easy to
parse
• because the keyword guides the parsing decisions
• We therefore concentrate on expressions, which present more of a challenge
because of the associativity and precedence of operators
Derivations
The construction of a parse tree can be conceptualized as derivations

Derivation: Beginning with the start symbol, each rewriting step replaces a
nonterminal by the body of one of its productions.

A sentence of G is a sentential form with no nonterminals.


The language L(G) generated by a grammar G is its set of sentences.
Derivations
The construction of a parse tree can be conceptualized as derivations

Beginning with the start symbol, each rewriting step replaces a nonterminal by
the body of one of its productions.

Consider a grammar G

Derivation

1. Derivation of –(id+id) from start symbol E


2. –(id+id) is a sentence of G
3. At each step in a derivation, there are two choices to be made:
• Which nonterminal to replace? Always replacing the leftmost one gives a leftmost derivation.
• Which of its productions to apply?
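For example, with the usual grammar E -> E + E | E * E | (E) | -E | id (assumed here, since the slide shows the grammar as a figure), a leftmost derivation of -(id+id) replaces the leftmost nonterminal at every step:

E => -E => -(E) => -(E + E) => -(id + E) => -(id + id)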
Derivations-- Rightmost derivations

Consider a grammar G

1. Derivation of –(id+id) from E


2. –(id+id) is a sentence of G
3. At each step in a derivation, there are two choices to be made:
• Which nonterminal to replace? Always replacing the rightmost one gives a rightmost derivation.
• Which of its productions to apply?
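With the same assumed grammar, a rightmost derivation of -(id+id) replaces the rightmost nonterminal at every step:

E => -E => -(E) => -(E + E) => -(E + id) => -(id + id)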
Parse trees
► A parse tree is a graphical representation of a derivation that
exhibits
► the order in which productions are applied to replace
non-terminals

► Each interior node is labelled by a nonterminal A, the head of some
production
► The children of the node are labelled, from left to right,
by the symbols in the body of the production by which
A was replaced during the derivation

► Same parse tree for leftmost and rightmost derivations


Sentential form
(leaves of a
parse tree)

parse tree for - ( id + id)


Ambiguity
► A grammar that produces more than one parse tree for some sentence is
said to be ambiguous
► An ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence.

Two distinct leftmost derivations for the sentence id + id * id
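For instance, assuming the ambiguous grammar E -> E + E | E * E | (E) | id, the two leftmost derivations are:

E => E + E => id + E => id + E * E => id + id * E => id + id * id
E => E * E => E + E * E => id + E * E => id + id * E => id + id * id

The first groups id * id under the +, the second groups id + id under the *, giving two different parse trees.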


Ambiguity
Ambiguity
Unambiguous grammar
Top-Down Parsing
• Top-down parsing can be viewed as the problem of
• Constructing a parse tree for the input string,
• starting from the root and creating the nodes of the parse tree in
preorder
• Top-down parsing can be viewed as finding a leftmost derivation for an input
string
Top-Down Parsing

parse tree for - ( id + id)

Derivation

parse tree for - (+ id) ???


Top-Down Parsing

Left recursive

Non-Left recursive
Eliminating left recursion.

Generalization

Immediate left recursion


Eliminating left recursion.
Top-Down Parsing

Eliminating left recursion.


Eliminating left recursion.

Unfolding all the left recursions
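As a worked example (assuming the usual expression grammar), applying the rule "A -> Aα | β becomes A -> βA', A' -> αA' | ε" to

E -> E + T | T      T -> T * F | F

gives the non-left-recursive grammar

E -> T E'      E' -> + T E' | ε
T -> F T'      T' -> * F T' | ε
F -> ( E ) | id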


Top-Down Parsing
Challenges:
At each step of a top-down parse, the key problem is that of determining
the production to be applied for a nonterminal, say A.

(a) Recursive descent parsing: May require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: No backtracking!
looking ahead at the input a fixed number of symbols (next symbols) – LL(k),
LL(1) grammars
Choose the correct
production
Recursive-Descent Parsing
Nondeterministic

Try other productions!

(a) A recursive-descent parsing consists of a set of procedures, one for each


nonterminal.
(b) Execution begins with the procedure for the start symbol S,
(c) Halts and announces success if S() returns and its procedure body scans the
entire input string.
(d) Backtracking: may require repeated scans over the input
The leftmost leaf, labeled c, matches the first
symbol of input w (i.e. c), so we advance the
input pointer to a

• We have a match for the second input symbol, a,


• So we advance the input pointer to d, the third
input symbol
• Compare d against the next leaf, labeled b
Failure!! Backtrack!
We must reset the input pointer back to the second input symbol, a.

• The leaf a matches the second input symbol of w (i.e. a)


and the leaf d matches the third input symbol d

• Since S() returns and we have scanned w and produced a


parse tree for w,
• We halt and announce successful completion of parsing
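A minimal backtracking recursive-descent sketch of this example in Python. The grammar S -> cAd, A -> ab | a is an assumption inferred from the walkthrough above (the slide shows it only as a figure):

# Assumed grammar: S -> c A d,   A -> a b | a
# Each procedure returns the new input position on success, or None on failure.

def S(w, i):
    # S -> c A d
    if i < len(w) and w[i] == "c":
        j = A(w, i + 1)
        if j is not None and j < len(w) and w[j] == "d":
            return j + 1
    return None

def A(w, i):
    # Try A -> a b first; on failure, backtrack (reset the pointer) and try A -> a.
    if i + 1 < len(w) and w[i] == "a" and w[i + 1] == "b":
        return i + 2
    if i < len(w) and w[i] == "a":
        return i + 1
    return None

def parse(w):
    end = S(w, 0)
    return end is not None and end == len(w)

print(parse("cad"))   # True: A -> ab fails, backtrack, A -> a succeeds
print(parse("cabd"))  # True: A -> ab succeeds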
Top-Down Parsing
Challenges:
At each step of a top-down parse, the key problem is that of determining
the production to be applied for a nonterminal, say A.

(a) Recursive descent parsing: May require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: No backtracking!
looking ahead at the input a fixed number of symbols (next symbols) –
LL(k), LL(1) grammars
Basic concept of Predictive parsing

Input string w=abcd


One sentential form: S => aXY....
Grammar productions: 1. X -> bA....   2. X -> cP....
The first symbol of the remaining input (b or c) tells us which X-production to apply.

Another sentential form: S => aXb
Grammar productions: 1. X -> ε   2. X -> ....
We know that b follows X in this sentential form.
First(X)
How to compute First(X)?
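A sketch of the usual fixed-point computation of First sets in Python. The grammar below is the standard non-left-recursive expression grammar used later in these slides; the symbol EPS standing for ε is an assumption of this sketch:

EPS = "eps"   # stands for ε

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_seq(seq, first):
    """First set of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        if sym == EPS:
            out.add(EPS)
            return out
        if sym not in GRAMMAR:            # terminal: it is its own First set
            out.add(sym)
            return out
        out |= first[sym] - {EPS}         # nonterminal
        if EPS not in first[sym]:
            return out
    out.add(EPS)                          # every symbol in seq can derive ε
    return out

def first_sets():
    first = {x: set() for x in GRAMMAR}
    changed = True
    while changed:                        # iterate until no set grows
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of_seq(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first

print(first_sets())
# First(E) = First(T) = First(F) = {'(', 'id'},  First(E') = {'+', eps},  First(T') = {'*', eps}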
Basic concept of Predictive parsing

Input string w=abcd


One sentential form: S => aXY....
Grammar productions: 1. X -> bA....   2. X -> cP....
The first symbol of the remaining input (b or c) tells us which X-production to apply.

Another sentential form: S => aXb
Grammar productions: 1. X -> ε   2. X -> ....
We know that b follows X in this sentential form.
Follow(A)
How to compute Follow(A)
S -> xAyz : the terminal y is in Follow(A)
A -> xαB : everything in Follow(A) is in Follow(B)
Example: from T -> T * F, Follow(F) contains Follow(T) (here, in fact, Follow(F) = Follow(T))
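Putting the rules together for the non-left-recursive expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id (the grammar behind the parsing-table example below), the sets come out as:

First(E) = First(T) = First(F) = { (, id }    First(E') = { +, ε }    First(T') = { *, ε }
Follow(E) = Follow(E') = { ), $ }    Follow(T) = Follow(T') = { +, ), $ }    Follow(F) = { +, *, ), $ }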
Predictive parsing
Challenges:
At each step of a top-down parse, the key problem is that of determining
the production to be applied for a nonterminal, say A.

(a) Recursive descent parsing: May require backtracking to find the correct
A-production to be applied
(b) Predictive parsing: No backtracking!
looking ahead at the input a fixed number of symbols (next symbols) –
LL(k), LL(1) grammars
Predictive parsing

Parsing table M
LL(1) grammar => avoid confusion!!

First(α) and First(β) Disjoint sets
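More precisely, a grammar is LL(1) if for every pair of distinct productions A -> α | β: (a) First(α) and First(β) are disjoint, and (b) if ε is in First(β), then First(α) is disjoint from Follow(A) (and symmetrically). Under these conditions every entry M[A, a] of the parsing table holds at most one production.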


Basic concept of Predictive parsing

Input string w=abcd

One sentential form: S => aXY....
Grammar productions: 1. X -> bA....   2. X -> bY....
Both bodies begin with the same symbol b, so the next input symbol alone cannot
choose between them.

Another sentential form: S => aXb
Grammar productions: 1. X -> ε   2. X -> ....   3. X -> bY....
We know that b follows X in this sentential form, i.e. Follow(X) = {b}, so on
lookahead b we cannot decide between X -> ε and X -> bY....
Left Factoring
Left Factoring
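The general pattern: two A-productions that share a common prefix,

A -> α β1 | α β2

are left-factored into

A -> α A'      A' -> β1 | β2

so the choice between β1 and β2 is deferred until after α has been seen. The classic instance is stmt -> if expr then stmt else stmt | if expr then stmt.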
Parsing table M

Obvious case: input string w=bacd
One sentential form: S => bAY....
Grammar productions: 1. A -> aX....   2. A -> ....
The first symbol of the remaining input selects the A-production, so A -> aX....
is entered in M[A, a].

Input string w=abcd
One sentential form: S => aAb
Grammar productions: 1. A -> α =>* ε   2. A -> ....
We know that b follows A in this sentential form, i.e. Follow(A) = {b}, so the
ε-deriving production is entered in M[A, b].
First(FT’)={(,id}

First(*FT’)={*}

Follow(T’)={+,),$}

First((E))={(} First(id)={id}
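These sets are exactly what is needed to fill M: for each production A -> α, add A -> α to M[A, a] for every terminal a in First(α); and if ε is in First(α), add A -> α to M[A, b] for every b in Follow(A) (including $ when $ is in Follow(A)). For example, T' -> *FT' goes into M[T', *], while T' -> ε goes into M[T', +], M[T', )] and M[T', $].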
Example of Non-LL(1) grammar

• For every LL(1) grammar, each parsing-table entry uniquely identifies a


production or signals an error.
• left-recursive or ambiguous grammars are not LL(1)

Input string: i b t i b t a e a
(with i = if, b = a boolean expression, t = then, e = else, a = some other statement), i.e.
if b then if b then a else a
Example of Non-LL(1) grammar
Predictive Parsing
• Non-recursive version
• maintaining a stack explicitly, rather than implicitly via
recursive calls

Initial configuration
Recursive-Descent Parsing
Nondeterministic

Try other productions!

(a) A recursive-descent parsing consists of a set of procedures, one for each


nonterminal.
(b) Execution begins with the procedure for the start symbol S,
(c) Halts and announces success if S() returns and its procedure body scans the
entire input string.
(d) Backtracking: may require repeated scans over the input
Predictive Parsing
• Non-recursive version
• maintaining a stack explicitly, rather than implicitly via
recursive calls

Initial configuration
Predictive Parsing

Initial configuration

• The parser considers (i) the symbol


on top of the stack X, and (ii) the
current input symbol a.
• If X is a nonterminal, the parser
chooses an X-production from M[X, a] of
the parsing table.
• Otherwise, it checks for a match
between the terminal X and current
input symbol a.
(Figure: the nonterminal X on top of the stack is replaced by Y1 ... Yk while the
current input symbol is a; expanding nonterminals this way traces out a leftmost
derivation.)
Predictive Parsing
The stack contains a sequence of grammar symbols

w
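A sketch of this table-driven parser in Python. It uses the standard LL(1) table for the expression grammar; both the table and the pre-tokenized input are assumptions of the sketch:

# LL(1) table M[nonterminal][lookahead] = production body; ε is the empty body.
EPS = []
M = {
    "E":  {"id": ["T", "E'"], "(": ["T", "E'"]},
    "E'": {"+": ["+", "T", "E'"], ")": EPS, "$": EPS},
    "T":  {"id": ["F", "T'"], "(": ["F", "T'"]},
    "T'": {"*": ["*", "F", "T'"], "+": EPS, ")": EPS, "$": EPS},
    "F":  {"id": ["id"], "(": ["(", "E", ")"]},
}

def predictive_parse(tokens, start="E"):
    tokens = tokens + ["$"]
    stack = ["$", start]                  # $ marks the bottom of the stack
    i = 0
    while stack:
        X, a = stack.pop(), tokens[i]
        if X == a:                        # terminal (or $): match and advance
            i += 1
        elif X in M:                      # nonterminal: consult the table
            body = M[X].get(a)
            if body is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            print(f"output {X} -> {' '.join(body) or 'ε'}")
            stack.extend(reversed(body))  # push the body, leftmost symbol on top
        else:
            raise SyntaxError(f"expected {X}, found {a}")
    return i == len(tokens)

print(predictive_parse(["id", "+", "id", "*", "id"]))   # prints the leftmost derivation, then True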
Bottom Up Parsing
• A bottom-up parse corresponds to the construction of a parse tree for an
input string
• Beginning at the leaves (the bottom) and working up towards the
root (the top)

Input
Choose the correct
production
Bottom Up Parsing Sentential forms

Derivation --- Rightmost derivation

Bottom-up parsing is therefore to construct a rightmost derivation in


reverse
Reduction
• A specific substring of input matching the body of a production
• Replaced by the nonterminal at the head of that production.

Production
Bcdxy=>Axy
A-> Bcd

• Bottom-up parsing as the process of "reducing" a string w to


the start symbol of the grammar

Challenges
(a) when to reduce and
(b) what production to apply, as the parse proceeds.
Reduction
E*id =>???

Reduction steps

Challenges
(a) when to reduce and
(b) what production to apply, as the parse proceeds.
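For the usual expression grammar, reducing the input id * id back to the start symbol goes:

id * id  =>  F * id  =>  T * id  =>  T * F  =>  T  =>  E

Note that in T * id the grammar would also allow reducing T to E, giving E * id, but E * id is not a right-sentential form, so that reduction leads nowhere; this is exactly the "what to reduce" challenge.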
Handle
• “Handle" is a substring of input that matches the body of a production
• Allows reduction => Towards start symbol=>reverse of a rightmost derivation

Production
Bcdxy=>Axy A-> Bcd
Right sentential forms
Terminals

handle

Identifying the handle is a challenge


Shift Reduce parsing

Bottom-up parsing in which


(a) Stack holds grammar symbols and
(b) Input buffer holds the rest of the string to be parsed.
(c) handle always appears at the top of the stack
Initial config. Final config.
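A configuration trace for id * id with the usual expression grammar (a sketch; the slide's own figure may write id1 and id2):

Stack        Input        Action
$            id * id $    shift
$ id         * id $       reduce by F -> id
$ F          * id $       reduce by T -> F
$ T          * id $       shift
$ T *        id $         shift
$ T * id     $            reduce by F -> id
$ T * F      $            reduce by T -> T * F
$ T          $            reduce by E -> T
$ E          $            accept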
Shift Reduce parsing
Handle always appears at the top of the
stack
A-> βBy
B->ɣ
Handle always appears at the top of the
stack
A-> y
B->ɣ
Conflict
Shift/reduce conflict: Cannot decide whether to shift or to reduce

Reduce/reduce conflict: Cannot decide which of several reductions to


make
Shift/reduce conflict
Shift Reduce parsing
LR Parsing
Challenges in shift-reduce parsing
(a) when to reduce and
(b) what production to apply, as the parse proceeds.

Examples:
Simple LR, LR(1), LALR

• An LR parser makes shift-reduce decisions using an LR(0) automaton,
maintaining states on a stack
• Each state represents a set of items


Items

Intuitively, an item indicates how much of a production body we


have seen at a given point in the parsing process.
A -> .XYZ indicates that we hope to see a string derivable from
XYZ on the next input
A -> X.YZ indicates that we have just seen on the input a string
derivable from X and that we hope next to see a string
derivable from YZ
A -> XYZ. indicates that we have seen the body XYZ on the input
and that it may be time to reduce XYZ to A
Canonical LR(0) collection
• Sets of items => One state
• Collection of sets of items=> canonical LR(0) collection => Collection
of states

LR(0) automaton: Construct a deterministic finite automaton that is


used to make parsing decisions

To construct the canonical LR(0) collection for a grammar G,


we define (a) augmented grammar and (b) two functions, CLOSURE and
GOTO

Augmented grammar: If G is a grammar with start symbol S, then the
augmented grammar G' is G with a new start symbol S' and the added
production S' -> S.
Closure of Item Sets
Similar to I
Closure of Item Sets
Closure of Item Sets

Augmentation
Closure (I)

Can be easily derived from Kernel items


Closure of Item Sets
GOTO of Item Sets
• The second useful function is GOTO(I, X) where I is a set of items and X
is a grammar symbol.
• Defines the transitions in the LR(0) automaton

GOTO(I, X) specifies the transition from the state for I under input X.
(Figure: I2 = GOTO(I1, X) and I2 = GOTO(I3, X); both I1 and I3 have an
X-transition into I2.)
GOTO of Item Sets
• The second useful function is GOTO(I, X) where I is a set of items and X
is a grammar symbol.
• Defines the transitions in the LR(0) automaton

GOTO(I, X) specifies the transition from the state for I under input X.
(Figure: I2 = GOTO(I1, X), I2 = GOTO(I3, X), and I8 = GOTO(I5, X); several
states can have an X-transition, each leading to the state given by GOTO.)
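A compact sketch of CLOSURE and GOTO in Python. Items are represented as (head, body, dot) triples; the small augmented grammar used to exercise the functions is an assumption of this sketch:

# Assumed augmented grammar: S' -> S,  S -> C C,  C -> c C | d
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (head, body, dot) in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:   # nonterminal B after the dot
                B = body[dot]
                for prod in GRAMMAR[B]:
                    new_item = (B, prod, 0)                # add B -> .prod
                    if new_item not in items:
                        items.add(new_item)
                        changed = True
    return frozenset(items)

def goto(items, X):
    # Move the dot over X in every item that has X right after the dot, then close.
    moved = {(h, b, d + 1) for (h, b, d) in items if d < len(b) and b[d] == X}
    return closure(moved)

I0 = closure({("S'", ("S",), 0)})
print(sorted(I0))              # kernel S' -> .S plus the closure items
print(sorted(goto(I0, "C")))   # the state reached from I0 on C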
GOTO of Item Sets

I1 set
Canonical LR(0) collection
LR(0) automaton: Construct a deterministic finite automaton that is
used to make parsing decisions

• Sets of items => One state


• Collection of sets of items=> canonical LR(0) collection => Collection
of states
To construct the canonical LR(0) collection for a grammar G,
we define (a) augmented grammar and (b) two functions, CLOSURE and
GOTO

Augmented grammar: If G is a grammar with start symbol S, then the
augmented grammar G' is G with a new start symbol S' and the added
production S' -> S.
Canonical collection of sets of items

LR(0) automaton

LR(0) automaton

(a) The states of this


automaton are the
sets of items from the
canonical LR(0)
collection,

(b) the transitions are


given by the GOTO
function

We say "state j" to refer to


the state corresponding to the
set of items Ij.
Symbol
representation : X
LR-Parsing Algorithm

S0

Parsing table
The stack holds a sequence of states
Whereas a shift-reduce parser shifts a symbol, an LR parser shifts a state
Top of the stack state (s_m) represents the state of the parser
Role of LR(0) automata in shift-reduce decisions

Key Idea Input: w=yaα


Consider we are in state j (maybe after scanning y symbols)
Next input symbol a
• If state j has a transition on a.
• Shift (to state k) on next input symbol a
• Otherwise, we choose to reduce;
• The items in state j will tell us which production to use

• All transitions into a given state k must be on the same grammar symbol a.
Thus, each state has a unique grammar symbol associated with it (except the
start state 0)
• Multiple states may have the same grammar symbol
Key Idea States

Reduction
With symbols,
Reduction is implemented by popping the body of the production (the
body is id) from the stack and pushing the head of the production (in
this case, F).
With states, (a) we pop state 5, which brings state 0 to the top and
(b) look for a transition on F, the head of the production.
(c) we push state 3
Shift Reduce parsing

Bottom-up parsing in which


(a) Stack holds grammar symbols and
(b) Input buffer holds the rest of the string to be parsed.
(c) handle always appears at the top of the stack
Initial config. Final config.
Shift Reduce parsing
LR(0) automaton
Pop and push

i A j
SLR Parsing
table
LR-parsing algorithm.

S0
Optional
SLR Parsing
table
LR(0) automaton
Constructing SLR-Parsing Tables

• An LR parser using an SLR-parsing table is called an SLR parser
• Similarly for LR(1) and LALR parsers

• Step 1: Given a grammar G, augment G to produce G', with
a new start symbol S'
• Step 2: Construct LR(0) items and LR(0) automata
• We construct canonical collection of sets of items for
G' together with the GOTO function.
• Step 3: Construct the parsing table
• Determine the ACTION and GOTO entries
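Spelled out, the standard rules for step 3 are:
(a) If [A -> α.aβ] is in Ii and GOTO(Ii, a) = Ij for a terminal a, set ACTION[i, a] = shift j.
(b) If [A -> α.] is in Ii (with A different from S'), set ACTION[i, a] = reduce A -> α for every a in Follow(A).
(c) If [S' -> S.] is in Ii, set ACTION[i, $] = accept.
(d) If GOTO(Ii, A) = Ij for a nonterminal A, set GOTO[i, A] = j.
All remaining entries are error. If these rules ever put two different actions in one entry, the grammar is not SLR(1).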
SLR-Parsing Table: Algorithm
Shift entries. If state Ii contains an item A -> α.aβ and GOTO(Ii, a) = Ij,
then on input symbol a the parser shifts and enters state Ij; the stack now
holds ...αa and we are still looking for a handle.
SLR-Parsing Table: Algorithm
Reduce entries. If state Ii contains an item A -> α. and the stack holds ...α,
we *may* have detected a handle.
Reducing by A -> α is justified only if the result is a right-sentential form,
i.e. only if S =>* ...Aa... => ...αa... for the current input symbol a.
But then a follows A in a right-sentential form, so a is in Follow(A).
Hence SLR allows the reduction by A -> α only when the next input symbol is in Follow(A).
SLR-Parsing Table: Algorithm
Input string

S
Ii

Done!!
SLR-Parsing Table: Algorithm
SLR Parsing
table
SLR-Parsing Table: Algorithm
SLR-Parsing Table: Example
LR(0) automaton
SLR-Parsing Table: Example
Non-SLR: Example
Grammar

Conflicting action!!
Non-SLR: Where is the problem?
id=*id
Rightmost derivation:

S => L=R => L=L => L=*R => L=*L => L=*id => id=*id
SLR parsing
Stack: $ Input string: id=*id$

Stack: $ id Input string: =*id$


Stack: $ L Input string: =*id$ (Reduction with R->L??)
Stack: $ L= Input string: *id$
Stack: $ L=*id Input string: $
Stack: $ S Input string: $
Non-SLR: Where is the problem?
id=*id
Rightmost derivation:

S => L=R => L=L => L=*R => L=*L => L=*id => id=*id
SLR parsing
Stack: $ Input string: id=*id$

Stack: $ id Input string: =*id$


Stack: $ L Input string: =*id$ (Reduction with R->L??)
Stack: $ R Input string: =*id$ Incorrect!
Stack: $ L=*id Input string: $
Stack: $ S Input string: $
Viable Prefixes
• The LR(0) automaton characterizes the strings of grammar symbols
that can appear on the stack of a shift-reduce parser for the
grammar.

• The stack contents must be a prefix of a right-sentential form.

• If the stack holds α and the rest of the input is x, then a sequence of
reductions will take αx to S.

Not all prefixes of right-sentential forms can appear on the stack

The prefixes of right sentential forms that can appear on the stack of
a shift reduce parser are called viable prefixes.
Handle always appears at the top of the
stack

A-> βBy
B->ɣ
Viable prefix E+T
LR(0) automaton
Viable Prefixes
• The set of valid items for a viable prefix γ is
• the set of items reached from the initial state along the
path labeled γ in the LR(0) automaton

SLR parsing is based on the fact that LR(0) automata recognize


viable prefixes and valid items.

Viable prefix
Shift

Reduction
SLR says…
Suppose the stack holds βα, state Ii contains the item A -> α., and the next
input symbol is a with a in Follow(A). SLR then reduces by A -> α. But if S
cannot derive any right-sentential form ...βAa... => ...βαa..., the result is
not a right-sentential form and the reduction is wrong.
SLR says…
Avoid the reduction! If S cannot derive ...βAa... => ...βαa..., then in this
left context the item A -> α. is not valid, and the reduction by A -> α must
not be made, even though a is in Follow(A).
Non-SLR: Where is the problem?
id=*id
Rightmost derivation:

S => L=R => L=L => L=*R => L=*L => L=*id => id=*id
SLR parsing
Stack: $ Input string: id=*id$

Stack: $ id Input string: =*id$


Stack: $ L Input string: =*id$ (Reduction with R->L??)
Stack: $ L= Input string: *id$
Stack: $ L=*id Input string: $
Stack: $ S Input string: $
Non-SLR: Where is the problem?
id=*id
Rightmost derivation:

S => L=R => L=L => L=*R => L=*L => L=*id => id=*id
SLR parsing
Stack: $ Input string: id=*id$

Stack: $ id Input string: =*id$


Stack: $ L Input string: =*id$ (Reduction with R->L??)
Stack: $ R Input string: =*id$ Incorrect!
Stack: $ L=*id Input string: $
Stack: $ S Input string: $
Non-SLR: Example
Grammar

Conflicting action!!
Non-SLR: Where is the problem?

Since ……… *id=id


It is possible to carry extra information in the state that will
allow us to rule out some of these invalid reductions
LR(1) Parser, CLR

Since

It is possible to carry extra information in the state that will


allow us to rule out some of these invalid reductions
• Splitting states
• Each state of an LR parser indicates exactly which input symbols
can follow a handle α for which there is a possible reduction to A
• This extra information is incorporated into the state by redefining
items to include a terminal symbol as a second component.
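Concretely, an LR(1) item has the form [A -> α.β, a], where the second component a (a terminal or $) is the lookahead. The lookahead has no effect while β is nonempty; it matters only for items of the form [A -> α., a], which call for a reduction by A -> α only when the next input symbol is a. This is more selective than the SLR rule, which reduces on every symbol in Follow(A).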
LR(1) Parser

Look-ahead a is implicit for SLR


LR(1) Sets of Items

Closure of Item Sets – LR(1)
Closure of Item Sets
LR(1) automaton -- GOTO
LR(1) automaton

LR(1) automaton

No redundant states
LR(1) Parsing table

For shift entries the lookahead b is not important: if state Ii contains
[A -> α.aβ, b] and GOTO(Ii, a) = Ij, then on input a the parser shifts into Ij;
the stack holds ...αa and we are still expecting a handle.
LR(1) Parsing table
Input: dd

Stack      Input   Action
$ 0        dd$     shift 4
$ 0 4      d$      reduce by C -> d
$ 0 2      d$      shift 7
$ 0 2 7    $       reduce by C -> d
$ 0 2 5    $       reduce by S -> CC
$ 0 1      $       accept
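A sketch of the LR driver loop in Python that reproduces this trace. The ACTION/GOTO tables are the canonical LR(1) tables for S -> CC, C -> cC | d; the state numbering follows the standard construction and should be treated as an assumption of this sketch:

# ACTION[state][terminal] is ("s", j) for shift, ("r", head, body) for reduce, or "acc".
ACTION = {
    0: {"c": ("s", 3), "d": ("s", 4)},
    1: {"$": "acc"},
    2: {"c": ("s", 6), "d": ("s", 7)},
    3: {"c": ("s", 3), "d": ("s", 4)},
    4: {"c": ("r", "C", ["d"]), "d": ("r", "C", ["d"])},
    5: {"$": ("r", "S", ["C", "C"])},
    6: {"c": ("s", 6), "d": ("s", 7)},
    7: {"$": ("r", "C", ["d"])},
    8: {"c": ("r", "C", ["c", "C"]), "d": ("r", "C", ["c", "C"])},
    9: {"$": ("r", "C", ["c", "C"])},
}
GOTO = {0: {"S": 1, "C": 2}, 2: {"C": 5}, 3: {"C": 8}, 6: {"C": 9}}

def lr_parse(tokens):
    tokens = tokens + ["$"]
    stack = [0]                                  # the stack holds states only
    i = 0
    while True:
        state, a = stack[-1], tokens[i]
        act = ACTION[state].get(a)
        if act is None:
            raise SyntaxError(f"error in state {state} on {a}")
        if act == "acc":
            return True
        if act[0] == "s":                        # shift: push the next state
            stack.append(act[1])
            i += 1
        else:                                    # reduce A -> body
            _, head, body = act
            del stack[len(stack) - len(body):]   # pop |body| states
            stack.append(GOTO[stack[-1]][head])  # push GOTO[top, A]
            print("reduce", head, "->", " ".join(body))

print(lr_parse(["d", "d"]))   # prints the reductions C -> d, C -> d, S -> C C, then True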
LALR
• Considerably smaller than the canonical LR tables
• Most common syntactic constructs of programming languages
can be expressed conveniently by an LALR grammar

Same core
items, different
lookahead
• Sets of LR(1) items having the same core, that is, set of first
components,
• Merge these sets with common cores into one set of LALR
items.
Merge: the merged items keep the union of the lookahead sets (the figure shows ,$
being appended to each item's lookaheads).
LALR -- GOTO

(Figure: GOTO on C takes the merged state I3,I6 to the merged state I8,I9.)

• Since the core of GOTO(I,X) depends only on the core,


• Goto's of merged sets can themselves be merged.
• Thus, there is no problem revising the GOTO function as we
merge sets of items.
LR(1) automaton -- GOTO
LALR Parsing table
LALR conflicts
• A shift/reduce conflict on a in an LALR item set: some LR(1) set of items with
the same core already has the same conflict, so merging cannot introduce new
shift/reduce conflicts.
• Reduce/reduce conflicts, however, can be introduced by merging: LR(1) items
such as [A -> c., d] and [B -> c., e] cause no reduce/reduce conflict, but
after merging states with the same core, the resulting LALR item set may have
a reduce/reduce conflict on both d and e.
Efficient Construction of LALR Parsing Tables
We must attach the proper lookaheads to the LR(0) items in the
kernels, to create the kernels of the sets of LALR(1) items.
LR(1) Sets of Items

We must attach the proper lookaheads to the LR(0) items in the
kernels, to create the kernels of the sets of LALR(1) items.
LR(1) automaton -- GOTO
• We are now ready to attach lookaheads to the kernels of the sets of
LR(0) items to form the sets of LALR(1) items.

• First, we know that $ is a lookahead for S'-> .S in the initial set of


LR(0) items.

• Algorithm gives us all the lookaheads generated spontaneously.

• After listing all those lookaheads, we must allow them to propagate


until no further propagation is possible.

• Keep track of "new" lookaheads that have propagated into an item


but which have not yet propagated out.
Using Ambiguous Grammars Unambiguous grammar

• This grammar is ambiguous because it does not specify the associativity or


precedence of the operators + and *.

• The unambiguous grammar gives + lower precedence than *, and makes both
operators left associative.

• we might prefer to use the ambiguous grammar


• the parser for the unambiguous grammar will spend a substantial
fraction of its time reducing by the productions E -> T and T -> F,
• whose sole function is to enforce associativity and precedence.

• The parser for the ambiguous grammar will not waste time reducing by these
single productions (productions whose body consists of a single nonterminal).
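(The two grammars being compared here are presumably the usual pair: the ambiguous E -> E + E | E * E | (E) | id and the unambiguous E -> E + T | T, T -> T * F | F, F -> (E) | id.)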
Conflicts

Both + and * are in Follow(E)
Conflict resolution
However, these problems can be resolved using the precedence and
associativity information for + and *.
Consider the input id + id * id, which causes a parser to enter state 7 after
processing id + id;

In particular the parser reaches a configuration

If * takes precedence over +, the parser should shift * onto the stack

Thus the relative precedence of + followed by * uniquely determines how


the parsing action conflict between reducing E -> E + E and shifting on * in
state 7 should be resolved.
Conflict resolution
Problems can be resolved using the associativity information for +.
Consider the input id + id + id, which causes a parser to enter state 7 after
processing id + id; in particular the parser reaches a configuration with
E + E on the stack and + id $ as the remaining input.
Since + is left associative, id + id should be grouped first, so the parser
should reduce by E -> E + E rather than shift the second +.
Conflict resolution
However, these problems can be resolved using the precedence and associativity
information for + and *.
Consider the input id * id + id, which causes a parser to enter state 8 after
processing id * id; in particular the parser reaches a configuration with
E * E on the stack and + id $ as the remaining input.
Since * takes precedence over +, id * id should be grouped first, so the parser
should reduce by E -> E * E rather than shift the +.
Conflicts

Both + and * are in Follow(E)
Conflict resolution
The answer is that we should shift else, because it is "associated"
with the previous then.

We conclude that the shift/reduce conflict should be


resolved in favor of shift on input else
