
Chapter 4

Syntax Analysis
 Role of parser
 Context free grammars
 Top down parsing
 Bottom up parsing
 Parser generators
(Figure: position of the parser in the compiler model. The source program enters the lexical analyzer, which supplies tokens to the parser on demand (getNextToken); the parser builds the parse tree and passes an intermediate representation to the rest of the front end. Both the lexical analyzer and the parser consult the symbol table.)
 Parsing is the process of deriving a string from a given context-free grammar (CFG)
 The parser checks that the stream of tokens produced by the lexical analyzer conforms to the grammar of the language
 It generates the parse tree (derivation, etc.)
 A parser implements a context-free grammar as a recognizer of strings
 The role of the parser in a compiler is twofold:
1. To check syntax (= string recognizer)
 and to report syntax errors accurately
2. To invoke semantic actions
 for static semantics checking, e.g. type checking of expressions, functions, etc.
 for syntax-directed translation of the source code to an intermediate representation

 Example grammar 1:
E -> E + T | T
T -> T * F | F
F -> (E) | id
 Example grammar 2, with input string w = id*id*id:
S -> T + S | T
T -> id * T | id | (S)
 Common programming errors
◦ Lexical errors
◦ Syntactic errors
◦ Semantic errors

 Error handler goals


◦ Report the presence of errors clearly and
accurately
◦ Recover from each error quickly enough to detect
subsequent errors
◦ Add minimal overhead to the processing of correct
programs
 Panic mode recovery
◦ Discard input symbols one at a time until one of a designated set of synchronization tokens is found
 Phrase level recovery
◦ Replace a prefix of the remaining input by some string that allows the parser to continue
 Error productions
◦ Augment the grammar with productions that generate the erroneous constructs
 Global correction
◦ Choose a minimal sequence of changes to obtain a globally least-cost correction
 Terminals
 Nonterminals
 Start symbol
 Productions
 Example grammar:
expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
 Productions are treated as rewriting rules to generate a string
 Rightmost and leftmost derivations
◦ E -> E + E | E * E | -E | (E) | id
◦ Derivations for -(id+id):
 Leftmost: E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
 Rightmost: E => -E => -(E) => -(E+E) => -(E+id) => -(id+id)
 A grammar is ambiguous if for some string there exists more than one parse tree
 or more than one leftmost derivation
 or more than one rightmost derivation
 Example: id+id*id
 Idea for resolving the dangling-else ambiguity:
◦ A statement appearing between a then and an else must be matched
 A grammar is left recursive if it has a nonterminal A such that there is a derivation A =>+ Aα
 Top-down parsing methods can't handle left-recursive grammars
 A simple rule for direct left-recursion elimination:
◦ For a rule like:
 A -> Aα | β
◦ Replace it with:
 A -> βA'
 A' -> αA' | ɛ
 There are also cases of indirect left recursion, like the following:
◦ S -> Aa | b
◦ A -> Ac | Sd | ɛ
 Left recursion elimination algorithm (a Python sketch follows below):
◦ Arrange the nonterminals in some order A1, A2, …, An.
◦ For (each i from 1 to n) {
 For (each j from 1 to i-1) {
 Replace each production of the form Ai -> Aj γ by the productions Ai -> δ1 γ | δ2 γ | … | δk γ, where Aj -> δ1 | δ2 | … | δk are all current Aj-productions
 }
 Eliminate immediate left recursion among the Ai-productions
 }
 Left factoring is a grammar transformation that is
useful for producing a grammar suitable for
predictive or top-down parsing.
 Consider the following grammar:
◦ stmt -> if expr then stmt else stmt
◦      | if expr then stmt
 On seeing the input token if, it is not clear to the parser which production to use
 We can easily perform left factoring (sketched below):
◦ If we have A -> αβ1 | αβ2 then we replace it with
 A -> αA'
 A' -> β1 | β2
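Similarly, a minimal Python sketch of one left-factoring step, under the same assumed dict-of-productions grammar encoding as in the earlier sketch; it factors one nonterminal per call to keep the logic easy to follow.

from itertools import takewhile

# One left-factoring step: A -> αβ1 | αβ2 becomes A -> αA' and A' -> β1 | β2.
# "eps" marks an empty alternative.
def left_factor_once(grammar):
    g = dict(grammar)
    for a, prods in list(g.items()):
        # Group productions by their first symbol to find a shared prefix.
        by_first = {}
        for p in prods:
            by_first.setdefault(p[0], []).append(p)
        for first_sym, group in by_first.items():
            if len(group) < 2:
                continue
            # Longest common prefix α of the grouped productions.
            prefix = [s[0] for s in takewhile(lambda syms: len(set(syms)) == 1,
                                              zip(*group))]
            a_new = a + "'"
            g[a] = [p for p in prods if p not in group] + [prefix + [a_new]]
            g[a_new] = [p[len(prefix):] or ["eps"] for p in group]
            return g          # factor one nonterminal per call, for clarity
    return g

# Example from the slides: stmt -> if expr then stmt else stmt | if expr then stmt
g = {"stmt": [["if", "expr", "then", "stmt", "else", "stmt"],
              ["if", "expr", "then", "stmt"]]}
print(left_factor_once(g))
# Result: stmt -> if expr then stmt stmt',  stmt' -> else stmt | ɛ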
 A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right
 It can also be viewed as finding a leftmost derivation for an input string
 Example: id+id*id

E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id
(Figure: the parse tree for id+id*id built top-down, one leftmost derivation step at a time.)
 Consists of a set of procedures, one for each nonterminal
 Execution begins with the procedure for the start symbol
 A typical procedure for a nonterminal:
void A() {
  choose an A-production, A -> X1X2...Xk;
  for (i = 1 to k) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi equals the current input symbol a)
      advance the input to the next symbol;
    else
      /* an error has occurred */
  }
}
 General recursive descent may require backtracking
 The previous code needs to be modified to allow backtracking
 In its general form it can't choose an A-production easily
 So we need to try all alternatives
 If one fails, the input pointer needs to be reset and another alternative should be tried
 Recursive-descent parsers can't be used for left-recursive grammars
S -> cAd
A -> ab | a          Input: cad
(Figure: the parser expands S to cAd, first tries A -> ab, fails to match d, backtracks, and succeeds with A -> a.)
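A minimal Python sketch of a backtracking recursive-descent parser for exactly this grammar; the function and variable names are illustrative choices, not taken from the slides.

# Backtracking recursive-descent parser for:  S -> c A d,  A -> a b | a
def parse(tokens):
    pos = 0                                  # current input position

    def match(t):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == t:
            pos += 1
            return True
        return False

    def A():
        nonlocal pos
        saved = pos                          # remember position so we can backtrack
        if match('a') and match('b'):        # try A -> a b first
            return True
        pos = saved                          # backtrack
        return match('a')                    # then try A -> a

    def S():
        return match('c') and A() and match('d')

    return S() and pos == len(tokens)

print(parse(list("cad")))    # True
print(parse(list("cabd")))   # True
print(parse(list("cbd")))    # False

Trying the longer alternative A -> ab before A -> a means no further backtracking is needed for this particular grammar.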
 First(α) is the set of terminals that begin strings derived from α
 If α =>* ɛ then ɛ is also in First(α)
 In predictive parsing, when we have A -> α | β, if First(α) and First(β) are disjoint sets then we can select the appropriate A-production by looking at the next input symbol
 Follow(A), for any nonterminal A, is the set of terminals a that can appear immediately after A in some sentential form
◦ If we have S =>* αAaβ for some α and β, then a is in Follow(A)
 If A can be the rightmost symbol in some sentential form, then $ is in Follow(A)
 To compute First(X) for all grammar symbols X, apply the following rules until no more terminals or ɛ can be added to any First set:
1. If X is a terminal, then First(X) = {X}.
2. If X is a nonterminal and X -> Y1Y2…Yk is a production for some k >= 1, then place a in First(X) if for some i, a is in First(Yi) and ɛ is in all of First(Y1), …, First(Yi-1), that is Y1…Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j = 1, …, k, then add ɛ to First(X).
3. If X -> ɛ is a production, then add ɛ to First(X).
 Example!
 To compute Follow(A) for all nonterminals A, apply the following rules until nothing can be added to any Follow set:
1. Place $ in Follow(S), where S is the start symbol.
2. If there is a production A -> αBβ, then everything in First(β) except ɛ is in Follow(B).
3. If there is a production A -> αB, or a production A -> αBβ where First(β) contains ɛ, then everything in Follow(A) is in Follow(B).
 Example! (see the sketch below)
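For illustration, the following Python sketch computes First and Follow sets by iterating the rules above to a fixed point. The grammar encoding (dict of productions, "eps" for ɛ, "$" for the end marker) and the helper names are assumptions made for this sketch.

# Fixed-point computation of First and Follow sets.
def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, prods in grammar.items():
            for prod in prods:
                before = len(first[a])
                nullable_prefix = True
                for x in prod:
                    fx = first.get(x, {x} if x != "eps" else {"eps"})
                    first[a] |= fx - {"eps"}
                    if "eps" not in fx:
                        nullable_prefix = False
                        break
                if nullable_prefix:
                    first[a].add("eps")
                changed |= len(first[a]) != before
    return first

def first_of(symbols, first):
    # First set of a string of grammar symbols.
    result = set()
    for x in symbols:
        fx = first.get(x, {x} if x != "eps" else {"eps"})
        result |= fx - {"eps"}
        if "eps" not in fx:
            return result
    return result | {"eps"}

def compute_follow(grammar, start, first):
    follow = {a: set() for a in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for a, prods in grammar.items():
            for prod in prods:
                for i, b in enumerate(prod):
                    if b not in grammar:
                        continue                     # only nonterminals get Follow sets
                    before = len(follow[b])
                    beta = first_of(prod[i + 1:], first) if i + 1 < len(prod) else {"eps"}
                    follow[b] |= beta - {"eps"}
                    if "eps" in beta:
                        follow[b] |= follow[a]
                    changed |= len(follow[b]) != before
    return follow

# Example: the LL(1) expression grammar used in these slides.
g = {"E":  [["T", "E'"]],
     "E'": [["+", "T", "E'"], ["eps"]],
     "T":  [["F", "T'"]],
     "T'": [["*", "F", "T'"], ["eps"]],
     "F":  [["(", "E", ")"], ["id"]]}
fs = compute_first(g)
print(fs)
print(compute_follow(g, "E", fs))

The printed sets match the First/Follow table shown a few slides further on.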
 Predictive parsers are recursive-descent parsers that need no backtracking
 Grammars for which we can create predictive parsers are called LL(1)
◦ The first L means scanning the input from left to right
◦ The second L means producing a leftmost derivation
◦ And 1 stands for using one input symbol of lookahead
 A grammar G is LL(1) if and only if whenever A -> α | β are two distinct productions of G, the following conditions hold:
◦ For no terminal a do α and β both derive strings beginning with a
◦ At most one of α or β can derive the empty string
◦ If α =>* ɛ then β does not derive any string beginning with a terminal in Follow(A)
 For each production A -> α in the grammar do the following (a sketch is given below):
◦ For each terminal a in First(α), add A -> α to M[A, a]
◦ If ɛ is in First(α), then for each terminal b in Follow(A) add A -> α to M[A, b]. If ɛ is in First(α) and $ is in Follow(A), add A -> α to M[A, $] as well
 If after performing the above there is no production in M[A, a], then set M[A, a] to error
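A minimal Python sketch of this table-construction rule. It assumes First and Follow sets in the shape produced by the earlier sketch (dicts of sets); the conflict list is an extra convenience, added here to make non-LL(1) grammars easy to spot.

# Build the LL(1) predictive parsing table M[A][a] from precomputed First/Follow sets.
def build_ll1_table(grammar, first, follow):
    table = {a: {} for a in grammar}

    def first_of(symbols):
        result = set()
        for x in symbols:
            fx = first.get(x, {x} if x != "eps" else {"eps"})
            result |= fx - {"eps"}
            if "eps" not in fx:
                return result
        return result | {"eps"}

    conflicts = []
    for a, prods in grammar.items():
        for prod in prods:
            fa = first_of(prod)
            # Rule 1: for each terminal in First(α), add A -> α to M[A, a].
            targets = fa - {"eps"}
            # Rule 2: if ɛ is in First(α), add A -> α for every b in Follow(A) (including $).
            if "eps" in fa:
                targets |= follow[a]
            for t in targets:
                if t in table[a] and table[a][t] != prod:
                    conflicts.append((a, t))      # more than one entry: not LL(1)
                table[a][t] = prod
    return table, conflicts

Feeding it the expression grammar together with the First and Follow sets computed earlier reproduces the table on the next slide; for a grammar such as S -> aSbS | bSaS | ɛ (later in this chapter) the conflicts list comes back non-empty, signalling that the grammar is not LL(1).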
Grammar:
E  -> TE'
E' -> +TE' | Ɛ
T  -> FT'
T' -> *FT' | Ɛ
F  -> (E) | id

           First      Follow
E          {(, id}    {), $}
E'         {+, ɛ}     {), $}
T          {(, id}    {+, ), $}
T'         {*, ɛ}     {+, ), $}
F          {(, id}    {+, *, ), $}
Predictive parsing table M (rows: non-terminals, columns: input symbols):

Non-terminal   id          +            *            (           )          $
E              E -> TE'                              E -> TE'
E'                         E' -> +TE'                            E' -> Ɛ    E' -> Ɛ
T              T -> FT'                              T -> FT'
T'                         T' -> Ɛ      T' -> *FT'               T' -> Ɛ    T' -> Ɛ
F              F -> id                               F -> (E)
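The sketch below shows how such a table drives a non-recursive predictive parser. The parse table is transcribed from the slide above; the stack-based driver loop follows the standard algorithm, but the concrete encoding (tuple-keyed dict, "$" end marker, an empty list for ɛ) is an assumption made for this example.

# Table-driven (non-recursive) predictive parser for the expression grammar.
TABLE = {
    ("E",  "id"): ["T", "E'"],   ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", ")"):  [],            ("E'", "$"): [],                    # E' -> ɛ
    ("T",  "id"): ["F", "T'"],   ("T",  "("): ["F", "T'"],
    ("T'", "*"):  ["*", "F", "T'"],
    ("T'", "+"):  [],            ("T'", ")"): [],  ("T'", "$"): [],  # T' -> ɛ
    ("F",  "id"): ["id"],        ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                       # start symbol on top
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                 # terminal (or $) matches the input
            i += 1
        elif top in NONTERMINALS:
            prod = TABLE.get((top, tokens[i]))
            if prod is None:
                return False                 # error entry M[A, a]
            stack.extend(reversed(prod))     # push RHS, leftmost symbol on top
        else:
            return False                     # terminal mismatch
    return i == len(tokens)

print(predictive_parse(["id", "+", "id", "*", "id"]))   # True
print(predictive_parse(["id", "+", "*", "id"]))         # False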


 Another example (the dangling-else grammar):
S  -> iEtSS' | a      First(S)  = {i, a}    Follow(S)  = {e, $}
S' -> eS | Ɛ          First(S') = {e, ɛ}    Follow(S') = {e, $}
E  -> b               First(E)  = {b}       Follow(E)  = {t}

Non-terminal   a         b         e                      i              t    $
S              S -> a                                     S -> iEtSS'
S'                                 S' -> Ɛ, S' -> eS                          S' -> Ɛ
E                        E -> b

The two entries in M[S', e] show that this grammar is not LL(1).
 S -> aSbS | bSaS | ɛ
 First(S) = {a, b, ɛ}    Follow(S) = {a, b, $}

               a                     b                     $
S              S -> aSbS, S -> ɛ     S -> bSaS, S -> ɛ     S -> ɛ

Again, the multiple entries in M[S, a] and M[S, b] show that this grammar is not LL(1).
Bottom-up parsing starts from the input string and reduces it, step by step, in order to reach the start symbol.
The main bottom-up parsing techniques are shift-reduce parsing and the LR family of parsers (SLR, LR(1) and LALR), discussed below.
Shift-Reduce Parsing
Shift-reduce parsing uses two distinctive steps for bottom-up parsing, known as the shift step and the reduce step.
Shift step: The shift step refers to advancing the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack, where it is treated as a single node of the parse tree.
Reduce step: When the parser finds a complete right-hand side (RHS) of a grammar rule on top of the stack and replaces it with the corresponding left-hand side (LHS), this is known as the reduce step. It occurs when the top of the stack contains a handle. To reduce, the handle is popped off the stack and replaced with the LHS non-terminal symbol.
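To make the shift and reduce moves concrete, the small Python script below prints the usual stack/input/action trace for the input id*id under the grammar E -> E + T | T, T -> T * F | F, F -> (E) | id. The sequence of actions is written out by hand for illustration; it is not computed by a real parser.

# Hand-written shift-reduce trace for id*id with the grammar
#   E -> E + T | T,  T -> T * F | F,  F -> (E) | id
# Each step: (stack contents, remaining input, action taken).
trace = [
    ("$",        "id*id$", "shift id"),
    ("$ id",     "*id$",   "reduce by F -> id"),
    ("$ F",      "*id$",   "reduce by T -> F"),
    ("$ T",      "*id$",   "shift *"),
    ("$ T *",    "id$",    "shift id"),
    ("$ T * id", "$",      "reduce by F -> id"),
    ("$ T * F",  "$",      "reduce by T -> T * F"),
    ("$ T",      "$",      "reduce by E -> T"),
    ("$ E",      "$",      "accept"),
]

print(f"{'STACK':<12}{'INPUT':<10}ACTION")
for stack, remaining, action in trace:
    print(f"{stack:<12}{remaining:<10}{action}")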
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free grammars, which makes it the most efficient syntax-analysis technique. LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a rightmost derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
There are three widely used algorithms for constructing an LR parser:
SLR(1) – Simple LR Parser:
  Works on the smallest class of grammars
  Few states, hence a very small table
  Simple and fast construction
LR(1) – Canonical LR Parser:
  Works on the complete set of LR(1) grammars
  Generates a large table and a large number of states
  Slow construction
LALR(1) – Look-Ahead LR Parser:
  Works on an intermediate-sized class of grammars
  The number of states is the same as in SLR(1)
LR(1) Parser:
An LR(1) parser is an LR(k) parser for k = 1, i.e. with a single lookahead terminal. A special property is that every LR(k) grammar with k > 1 can be transformed into an LR(1) grammar. LR(1) parsing can handle all deterministic context-free languages. An LR(1) parser bases its reduce decisions only on the set of terminals that can actually follow the non-terminal being reduced. The LR(1) parser is the most powerful of the three, but its parser tables can be extremely large.
SLR Parser:
A simple LR (SLR) parser is a type of LR parser with a small parser table. An SLR parser generator creates an LR(0) state machine and computes the lookaheads from the grammar (the First and Follow sets). This is a simplified approach, and it may report conflicts that would not arise under a more precise lookahead computation.
LALR Parser:
An LALR parser is a simplified version of a canonical LR (CLR) parser, built from a set of production rules specified by a formal grammar for a computer language. An LALR parser generator creates an LR(0) state machine and computes the lookaheads from that LR(0) state machine. An LALR parser has the same number of states as an SLR parser. Because an LALR parser uses, for each reduction, the set of terminals that can actually follow that particular production in the states where it applies, it avoids many conflicts that an SLR parser would report.
 Constructs the parse tree for an input string beginning at the leaves (the bottom) and working towards the root (the top)
 Example: id*id with the grammar
E -> E + T | T
T -> T * F | F
F -> (E) | id
 The reduction sequence is id*id => F*id => T*id => T*F => T => E
(Figure: the corresponding sequence of partial parse trees grown from the leaves towards the root.)
 The tasks of the error handling process are to detect each error, report it to the user, and then apply some recovery strategy to handle it. During this whole process, the processing time of the program should not slow down noticeably. An error can be something as simple as a blank entry in the symbol table.
There are two types of errors: run-time and compile-time errors.
 A run-time error is an error that takes place during the execution of a program, and it usually happens because of adverse system parameters or invalid input data. Examples include the lack of sufficient memory to run an application, a memory conflict with another program, and logical errors. Logic errors occur when executed code does not produce the expected result; they are best handled by meticulous program debugging.
 Compile-time errors arise at compile time, before execution of the program. A syntax error, or a missing file reference that prevents the program from compiling successfully, are examples of this.
 Lexical: this includes misspellings of identifiers, keywords or operators
 Syntactic: a missing semicolon or unbalanced parentheses
 Semantic: an incompatible value assignment or type mismatches between operator and operand
 Logical: unreachable code, infinite loops
The viable-prefix property of a parser allows early detection of syntax errors.
 Goal: detect an error as soon as possible, without consuming unnecessary further input
 How: detect an error as soon as the prefix of the input no longer matches a prefix of any string in the language
 Example: for(;) will be reported as an error, because a for header must contain two semicolons inside the parentheses

The basic requirement for the compiler is to simply stop, issue a message, and cease compilation. There are, however, some common recovery methods, which follow.
 Panic mode recovery: This is the easiest way of error recovery, and it also prevents the parser from falling into an infinite loop while recovering from an error. The parser discards input symbols one at a time until one of a designated set of synchronizing tokens (typically statement or expression terminators, such as end or a semicolon) is found. This is adequate when the presence of multiple errors in the same statement is rare. Example: consider the erroneous expression (1 + + 2) + 3; panic-mode recovery skips ahead to the next integer and then continues. Bison uses the special terminal error to describe how much input to skip:
E -> int | E + E | (E) | error int | (error)
 Phrase-level recovery: Perform local correction on the input to repair the error. Error correction is difficult with this strategy.
 Error productions: Some common errors that may occur in code are known to the compiler designers. The grammar can be augmented with productions that generate these erroneous constructs, so they are recognized when encountered. Example: writing 5x instead of 5*x.
 Global correction: Its aim is to make as few changes as possible while converting an incorrect input string to a valid string. This strategy is costly to implement.
 In this phase of compilation, all possible errors made by the user are detected and reported to the user in the form of error messages. This process of locating errors and reporting them to the user is called the error handling process.
Functions of the error handler:
 Detection
 Reporting
 Recovery
Classification of Errors
These errors are detected during the lexical analysis phase. Typical lexical errors are:
 Exceeding the length limit of an identifier or numeric constant
 Appearance of illegal characters
 Unmatched strings
Example 1: printf("Geeksforgeeks");$
This is a lexical error, since the illegal character $ appears at the end of the statement.
Example 2: This is a comment */
This is a lexical error, since the end of a comment is present but its beginning is not.
 Panic Mode Recovery
In this method, successive characters from the input are removed one at a time until a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }
 Advantage: it is easy to implement and is guaranteed not to go into an infinite loop
 Disadvantage: a considerable amount of input is skipped without checking it for additional errors
These errors are detected during the syntax analysis phase. Typical syntax errors are:
 Errors in structure
 Missing operators
 Misspelled keywords
 Unbalanced parentheses
 Example: swicth(ch)
{
.......
.......
}
The keyword switch is incorrectly written as swicth; hence an "Unidentified keyword/identifier" error occurs.
 Panic Mode Recovery
◦ In this method, successive characters from the input are removed one at a time until a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }
◦ Advantage: it is easy to implement and is guaranteed not to go into an infinite loop
◦ Disadvantage: a considerable amount of input is skipped without checking it for additional errors
 Statement Mode Recovery
◦ In this method, when the parser encounters an error, it performs the necessary corrections on the remaining input so that the rest of the statement allows the parser to parse ahead.
◦ The correction can be deleting extra semicolons, replacing a comma by a semicolon, or inserting a missing semicolon.
◦ While performing a correction, utmost care should be taken not to go into an infinite loop.
◦ Disadvantage: it is difficult to handle situations where the actual error occurred before the point of detection.
 Error Productions
◦ If the user has knowledge of common errors that can be encountered, then these errors can be accommodated by augmenting the grammar with error productions that generate the erroneous constructs.
◦ If this is used, then during parsing appropriate error messages can be generated and parsing can continue.
◦ Disadvantage: it is difficult to maintain.
 Global Correction
◦ The parser examines the whole program and tries to find the closest error-free match for it.
◦ The closest match is the program that requires the fewest insertions, deletions and changes of tokens to recover from the erroneous input.
◦ Due to its high time and space complexity, this method is not implemented in practice.
These errors are detected during the semantic analysis phase. Typical semantic errors are:
 Incompatible types of operands
 Undeclared variables
 Actual arguments that do not match the formal ones
 Example: int a[10], b;
.......
.......
a = b;
This generates a semantic error because of the incompatible types of a and b.
 If an "Undeclared Identifier" error is encountered, then to recover from it a symbol table entry is made for the corresponding identifier.
 If the data types of two operands are incompatible, then automatic type conversion is done by the compiler.
