Chapter 4
Syntax Analysis
Role of parser
Context free grammars
Top down parsing
Bottom up parsing
Parser generators
Lexical token Rest of
Source Parse tree Intermediate
program Analyze Parser Front representation
r getNext End
Token
Symbol
table
A parsing a process of deriving string from the given
grammar. (CFG)
It check the proper grammar of the token generated by lexical
analyzer.
Generate the parse tree (Derivation etc.)
A parser implements a C-F grammar as a recognizer of strings
The role of the parser in a compiler is two fold:
1. To check syntax (= string recognizer)
And to report syntax errors accurately
2. To invoke semantic actions
For static semantics checking, e.g. type checking of
expressions, functions, etc.
For syntax-directed translation of the source code to an
intermediate representation
4
E -> E + T | T s->T+S|T2
T -> T * F | F T->id*T|id|(s)
F -> (E) | id w= id*id*id
Common programming errors
◦ Lexical errors
◦ Syntactic errors
◦ Semantic errors
Error handler goals
◦ Report the presence of errors clearly and
accurately
◦ Recover from each error quickly enough to detect
subsequent errors
◦ Add minimal overhead to the processing of correct
programs
Panic mode recovery
◦ Discard input symbol one at a time until one of
designated set of synchronization tokens is found
Phrase level recovery
◦ Replacing a prefix of remaining input by some
string that allows the parser to continue
Error productions
◦ Augment the grammar with productions that
generate the erroneous constructs
Global correction
◦ Choosing minimal sequence of changes to obtain
a globally least-cost correction
Terminals
Nontermina
expression -> expression + term
ls expression -> expression – term
Start expression -> term
symbol term -> term * factor
productions term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Productions are treated as rewriting rules to
generate a string
Rightmost and leftmost derivations
◦ E -> E + E | E * E | -E | (E) | id
◦ Derivations for –(id+id)
E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
-(id+id)
E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
For some strings there exist more than one
parse tree
Or more than one leftmost derivation
Or more than one rightmost derivation
Example: id+id*id
Idea:
◦ A statement appearing between a then and an
else must be matched
A grammar is left recursive if it has a non-
terminal A such that there is a derivation
+
A=> Aα
Top down parsing methods cant handle
left-recursive grammars
A simple rule for direct left recursion
elimination:
◦ For a rule like:
A -> A α|β
There are cases like following
◦ S -> Aa | b
◦ A -> Ac | Sd | ɛ
Left recursion elimination algorithm:
◦ Arrange the nonterminals in some order A1,A2,
…,An.
◦ For (each i from 1 to n) {
For (each j from 1 to i-1) {
Replace each production of the form Ai-> Aj γ by the
production Ai -> δ1 γ | δ2 γ | … |δk γ where Aj-> δ1 |
δ2 | … |δk are all current Aj productions
}
Eliminate left recursion among the Ai-productions
}
Left factoring is a grammar transformation that is
useful for producing a grammar suitable for
predictive or top-down parsing.
Consider following grammar:
◦ Stmt -> if expr then stmt else stmt
◦ | if expr then stmt
On seeing input if it is not clear for the parser
which production to use
We can easily perform left factoring:
◦ If we have A->αβ1 | αβ2 then we replace it with
A -> αA’
A’ -> β1 | β2
A Top-down parser tries to create a parse tree
from the root towards the leafs scanning input
from left to right
It can be also viewed as finding a leftmost
derivation for an input string
Example: id+id*id
E E E E
E -> TE’ lm lm
E
lm
E
lm lm
E’ -> +TE’ | Ɛ T E’ T E’ T E’ T E’ T E’
T -> FT’
T’ -> *FT’ | Ɛ F T’ F T’ F T’ F T’ + T E’
F -> (E) | id id id Ɛ id Ɛ
Consists of a set of procedures, one for each
nonterminal
Execution begins with the procedure for start
symbol
A typical procedure for a non-terminal
void A() {
choose an A-production, A->X1X2..Xk
for (i=1 to k) {
if (Xi is a nonterminal
call procedure Xi();
else if (Xi equals the current input symbol a)
advance the input to the next symbol;
else /* an error has occurred */
}
}
General recursive descent may require
backtracking
The previous code needs to be modified to
allow backtracking
In general form it cant choose an A-
production easily.
So we need to try all alternatives
If one failed the input pointer needs to be
reset and another alternative should be tried
Recursive descent parsers cant be used for
left-recursive grammars
S->cAd
A->ab | a Input: cad
S S S
c A d c A d c A d
a b a
First() is set of terminals that begins strings
derived from
If α=>ɛ then is also in First(ɛ)
*
In predictive parsing when we have A-> α|β, if
First(α) and First(β) are disjoint sets then we can
select appropriate A-production by looking at the
next input
Follow(A), for any nonterminal A, is set of
terminals a that can appear immediately after A
in some sentential form
◦ If we have *S => αAaβ for some αand βthen a is in
Follow(A)
If A can be the rightmost symbol in some
sentential form, then $ is in Follow(A)
To compute First(X) for all grammar symbols
X, apply following rules until no more
*
terminals or ɛ can be added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a non terminal and X->Y1Y2…Yk is a
production for some k>=1, then place a in
First(X) if for some i a is in First(Yi) and ɛ is in all
of First(Y1),…,First(Yi-1) that is Y1…Yi-1 => ɛ. if
ɛ is in First(Yj)
* for j=1,…,k then add ɛ to First(X).
3. If X-> ɛ is a production then add ɛ to First(X)
Example!
To compute First(A) for all nonterminals A,
apply following rules until nothing can be
added to any follow set:
1. Place $ in Follow(S) where S is the start symbol
2. If there is a production A-> αBβ then everything
in First(β) except ɛ is in Follow(B).
3. If there is a production A->B or a production
A->αBβ where First(β) contains ɛ, then
everything in Follow(A) is in Follow(B)
Example!
Predictive parsers are those recursive descent
parsers needing no backtracking
Grammars for which we can create predictive
parsers are called LL(1)
◦ The first L means scanning input from left to right
◦ The second L means leftmost derivation
◦ And 1 stands for using one input symbol for lookahead
A grammar G is LL(1) if and only if whenever A-> α|
βare two distinct productions of G, the following
conditions hold:
◦ For no terminal a do αandβ both derive strings beginning
with a
◦ At most one of α or βcan derive empty string
◦ If α=> ɛ then βdoes not derive any string beginning with
a terminal
* in Follow(A).
For each production A->α in grammar do
the following:
◦ For each terminal a in First(α) add A-> in M[A,a]
◦ If ɛ is in First(α), then for each terminal b in
Follow(A) add A-> ɛ to M[A,b]. If ɛ is in First(α)
and $ is in Follow(A), add A-> ɛ to M[A,$] as
well
If after performing the above, there is no
production in M[A,a] then set M[A,a] to error
First Follow
E -> TE’ F {(,id} {+, *, ), $}
T {(,id} {+, ), $}
E’ -> +TE’ | Ɛ {(,id} {), $}
T -> FT’ E
E’ {+,ɛ} {), $}
T’ -> *FT’ | Ɛ {+, ), $}
F -> (E) | id T’ {*,ɛ}
Input Symbol
Non -
terminal id + * ( ) $
E E -> TE’ E -> TE’
E’ E’ -> +TE’ E’ -> Ɛ E’ -> Ɛ
T T -> FT’ T -> FT’
T’ T’ -> Ɛ T’ -> *FT’ T’ -> Ɛ T’ -> Ɛ
F F -> id F -> (E)
f(s’)= {e, ℇ)
S -> iEtSS’ | a f(s)={i, a) fo(s)={e,$}
S’ -> eS | Ɛ fo(s’)= {e,
$}
E -> b f(E)= {b} fo(E)= { t}
Input Symbol
Non -
terminal a b e i t $
S S -> a S -> iEtSS’
S’ S’ -> Ɛ S’ -> Ɛ
S’ -> eS
E E -> b
S-> aSbS|bSaS|ℇ
F(S)= {a,b,ℇ} Fo(S)={b,a,$}
a b $
S bSaS S-> aSbS S->ℇ
S->ℇ S->ℇ
in order to reach the start symbol.
The image given below depicts the bottom-up parsers available.
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing.
These steps are known as shift-step and reduce-step.
Shift step: The shift step refers to the advancement of the input
pointer to the next input symbol, which is called the shifted
symbol. This symbol is pushed onto the stack. The shifted symbol
is treated as a single node of the parse tree.
Reduce step : When the parser finds a complete grammar rule
(RHS) and replaces it to (LHS), it is known as reduce-step. This
occurs when the top of the stack contains a handle. To reduce, a
POP function is performed on the stack which pops off the handle
and replaces it with LHS non-terminal symbol.
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser.
It uses a wide class of context-free grammar which makes it the
most efficient syntax analysis technique. LR parsers are also
known as LR(k) parsers, where L stands for left-to-right scanning
of the input stream; R stands for the construction of right-most
derivation in reverse, and k denotes the number of lookahead
symbols to make decisions.
There are three widely used algorithms available for constructing
an LR parser:
SLR(1) – Simple LR Parser:
Works on smallest class of grammar
Few number of states, hence very small table
Simple and fast construction
LR(1) – LR Parser:
Works on complete set of LR(1) Grammar
Generates large table and large number of states
Slow construction
LALR(1) – Look-Ahead LR Parser:
Works on intermediate size of grammar
Number of states are same as in SLR(1)
LR (1) Parser: -
LR (1) parser is an LR(k) parser for
k=1, i.e.; with a single look, ahead terminal. The
special attribute of the parser is that all LR(k) parser
with k=1 can be transformed in LR (1).
It can be handled by all context free language. An
LR(1) parser base reduce decision only on the set of
terminals which can actually follow non-terminals
being reduced.
The LR (1) parser is most powerful. LR parser table
can be extremely large.
SLR Parser: -
A simple LR parser is a type of LR parser with small
parser table.
An SLR parser generator create LR (0) state machine
and compute look ahead from the grammar (first &
follow set).
This is simplified approach and many report conflict
that do not really exist in LR (0) state machine.
LALR Parser: -
An LALR parser is simplified version of CLR parser,
according to a set of productions rules specified by a
formal language for a computer language.
An LALR parser generators creates an LR (0) state
machine and compute the look ahead from LR (0) state
machine.
An LALR parser have same number of states as SLR
parser.
An LALR parser would use the set of all terminals that can
follow the particular production R, their no conflict.
Constructs parse tree for an input string
beginning at the leaves (the bottom) and
working towards the root (the top)
Example: id*id
E -> E + T | T id*id F * id T * id T*F F E
T -> T * F | F
T*F F
F -> (E) | id id F F id
id id F id T*F
id F id
id
The tasks of the Error Handling process
are to detect each error, report it to the
user, and then make some recover strategy
and implement them to handle error. During
this whole process processing time of
program should not be slow. An Error is the
blank entries in the symbol table.
There are two types of error: run-time and compile-
time error:
A run-time error: is an error which takes place
during the execution of a program, and usually
happens because of adverse system parameters
or invalid input data. The lack of sufficient
memory to run an application or a memory
conflict with another program and logical error
are example of this. Logic errors, occur when
executed code does not produce the expected
result. Logic errors are best handled by
meticulous program debugging.
Compile-time errors: rises at compile
time, before execution of the program.
Syntax error or missing file reference that
prevents the program from successfully
compiling is the example of this.
Lexical : This includes misspellings of
identifiers, keywords or operators
Syntactical : missing semicolon or
unbalanced parenthesis
Semantical : incompatible value
assignment or type mismatches between
operator and operand
Logical : code not reachable, infinite loop.
Viable-prefix is the property of a parser which
allows early detection of syntax errors.
Goal: detection of an error as soon as possible
without further consuming unnecessary input
How: detect an error as soon as the prefix of the
input does not match a prefix of any string in the
language.
Example: for(;), this will report an error as for
have two semicolons inside braces.
basic requirement for the compiler is to simply stop
and issue a message, and cease compilation. There
are some common recovery methods that are
follows.
Panic mode recovery: This is the easiest way of
error-recovery and also, it prevents the parser from
developing infinite loops while recovering error. The
parser discards the input symbol one at a time until one
of the designated (like end, semicolon) set of
synchronizing tokens (are typically the statement or
expression terminators) is found. This is adequate when
the presence of multiple errors in same statement is
rare. Example: Consider the erroneous expression- (1 +
+ 2) + 3. Panic-mode recovery: Skip ahead to next
integer and then continue. Bison: use the special
terminal error to describe how much input to skip.
E->int|E+E|(E)|error int|(error)
Phase level recovery: Perform local
correction on the input to repair the error.
But error correction is difficult in this
strategy.
Error productions: Some common errors
are known to the compiler designers that
may occur in the code. Augmented
grammars can also be used, as productions
that generate erroneous constructs when
these errors are encountered. Example:
write 5x instead of 5*x
Global correction: Its aim is to make as
few changes as possible while converting an
incorrect input string to a valid string. This
strategy is costly to implement.
In this phase of compilation, all possible
errors made by the user are detected and
reported to the user in form of error
messages. This process of locating errors
and reporting it to user is called Error
Handling process.
Functions of Error handler
Detection
Reporting
Recovery
Classification of Errors
These errors are detected during the lexical
analysis phase. Typical lexical errors are
Exceeding length of identifier or numeric
constants.
Appearance of illegal characters
Unmatched string
Example 1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $
appears at the end of statement.
Example 2 : This is a comment */
This is an lexical error since end of comment is
present but beginning is not present.
Panic Mode Recovery
In this method, successive characters from
the input are removed one at a time until a
designated set of synchronizing tokens is
found. Synchronizing tokens are delimiters
such as; or }
Advantage is that it is easy to implement
and guarantees not to go to infinite loop
Disadvantage is that a considerable amount
of input is skipped without checking it for
additional errors
These errors are detected during syntax analysis phase.
Typical syntax errors are
Errors in structure
Missing operator
Misspelled keywords
Unbalanced parenthesis
Example : swicth(ch)
{
.......
.......
}
The keyword switch is incorrectly written as
swicth. Hence, “Unidentified
keyword/identifier” error occurs.
Panic Mode Recovery
◦ In this method, successive characters from
input are removed one at a time until a
designated set of synchronizing tokens is found.
Synchronizing tokens are deli-meters such as ;
or }
◦ Advantage is that its easy to implement and
guarantees not to go to infinite loop
◦ Disadvantage is that a considerable amount of
input is skipped without checking it for
additional errors
Statement Mode recovery
◦ In this method, when a parser encounters an
error, it performs necessary correction on
remaining input so that the rest of input
statement allow the parser to parse ahead.
◦ The correction can be deletion of extra
semicolons, replacing comma by semicolon or
inserting missing semicolon.
◦ While performing correction, atmost care should
be taken for not going in infinite loop.
◦ Disadvantage is that it finds difficult to handle
situations where actual error occured before
point of detection.
Error production
◦ If user has knowledge of common errors that
can be encountered then, these errors can be
incorporated by augmenting the grammar with
error productions that generate erroneous
constructs.
◦ If this is used then, during parsing appropriate
error messages can be generated and parsing
can be continued.
◦ Disadvantage is that its difficult to maintain.
Global Correction
◦ The parser examines the whole program and
tries to find out the closest match for it which is
error free.
◦ The closest match program has less number of
insertions, deletions and changes of tokens to
recover from erroneous input.
◦ Due to high time and space complexity, this
method is not implemented practically.
These errors are detected during semantic analysis phase.
Typical semantic errors are
Incompatible type of operands
Undeclared variables
Not matching of actual arguments with formal one
Example : int a[10], b;
.......
.......
a = b;
It generates a semantic error because of an
incompatible type of a and b.
If error “Undeclared Identifier” is encountered then, to
recover from this a symbol table entry for corresponding
identifier is made.
If data types of two operands are incompatible then,
automatic type conversion is done by the compiler.