CD Unit-2 (R20)
UNIT-II
Syntax Analysis
Syntax analysis is the second phase of the compiler design process, in which the given input string is checked against the rules and structure of the formal grammar. It analyses the syntactic structure and checks whether the given input follows the correct syntax of the programming language.
Syntax analysis comes after the lexical analysis phase. Its output is known as the parse tree or syntax tree. The parse tree is developed with the help of the pre-defined grammar of the language. The syntax analyser checks whether a given program fulfills the rules implied by a context-free grammar. If it does, the parser creates the parse tree of that source program; otherwise, it displays error messages.
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It reports any syntax
errors in the program. It also recovers from commonly occurring errors so that it can continue
processing its input.
Error Recovery Strategies:
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until a synchronizing token is found. The synchronizing tokens are usually delimiters, such as a semicolon or end. This method has the advantage of simplicity and does not go into an infinite loop. When multiple errors in the same statement are rare, it is quite useful.
Phrase level recovery:
On discovering an error, the parser performs local correction on the remaining input that allows it to continue. Example: insert a missing semicolon or delete an extraneous semicolon.
Error productions:
The parser is constructed using augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a parse tree
for a string y, such that the number of insertions, deletions and changes of tokens is as small as
possible. However, these methods are in general too costly in terms of time and space
Context-Free Grammars
Context Free Grammars (CFG) can be classified on the basis of the following two properties: the number of derivation trees and the number of strings they generate.
Note: A recursive context-free grammar that contains no useless rules necessarily produces an
infinite language.
Non-Recursive Grammars
S->Aa
A->b|c
The language generated by the above grammar is :{ba, ca}, which is finite.
Context free grammar is a formal grammar which is used to generate all possible strings in a
given formal language.
G = (V, T, P, S)
Where,
V is a finite set of variables (non-terminals), T is a finite set of terminals, P is a finite set of productions of the form A → α, and S is a designated variable called the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly replacing a non-terminal with the right-hand side of a production, until all non-terminals have been replaced by terminal symbols.
Example:
Production rules:
S → aSa
S → bSb
S→c
Now check whether the string abbcbba can be derived from the given CFG.
By applying the productions S → aSa and S → bSb recursively, and finally applying the production S → c, we get the string abbcbba (S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbcbba).
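The membership check above can be sketched in code (this example is mine, not part of the notes): since every derivation of this grammar peels a matching a or b off both ends of the string and bottoms out at c, a short recursive function suffices.

```python
# Sketch: membership test for the language of S -> aSa | bSb | c.
def derives(s: str) -> bool:
    """Return True if s can be derived from S in the grammar above."""
    if s == "c":                              # base case: S -> c
        return True
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "ab":
        return derives(s[1:-1])               # S -> aSa or S -> bSb
    return False

print(derives("abbcbba"))  # True
print(derives("abca"))     # False
```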
Derivations
Derivation is a sequence of applications of production rules. It is used to obtain the input string through these production rules. During parsing we have to take two decisions: which non-terminal to replace, and which production rule to use for it. To decide which non-terminal to replace, we have two options:
Left-most Derivation
In the leftmost derivation, at each step the leftmost non-terminal in the sentential form is replaced using a production rule. So in leftmost derivation we effectively build the string from left to right.
Example:
Production rules:
1. S=S+S
2. S=S-S
3. S = a | b |c
Input:
a-b+c
S=S+S
S=S-S+S
S=a-S+S
S=a-b+S
S=a-b+c
Right-most Derivation
In the rightmost derivation, at each step the rightmost non-terminal in the sentential form is replaced using a production rule. So in rightmost derivation we effectively build the string from right to left.
Example:
Production rules:
1) S = S + S
2) S = S - S
3) S = a | b |c
Input:
a-b+c
S=S-S
S=S-S+S
S=S-S+c
S=S-b+c
S=a-b+c
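Both derivation orders can be illustrated with a small sketch (my own, with a helper name of my choosing) that replaces the leftmost or rightmost S in a sentential form:

```python
# Sketch: apply a chosen production to the leftmost or rightmost 'S'
# in a sentential form (grammar S -> S+S | S-S | a | b | c).
def apply_production(form: str, rhs: str, leftmost: bool = True) -> str:
    i = form.find("S") if leftmost else form.rfind("S")
    return form[:i] + rhs + form[i + 1:]

# Leftmost derivation of a-b+c (choices follow the notes)
form = "S"
for rhs in ["S+S", "S-S", "a", "b", "c"]:
    form = apply_production(form, rhs, leftmost=True)
print(form)  # a-b+c

# Rightmost derivation of a-b+c (choices follow the notes)
form = "S"
for rhs in ["S-S", "S+S", "c", "b", "a"]:
    form = apply_production(form, rhs, leftmost=False)
print(form)  # a-b+c
```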
Parse Trees
o Parse tree is the graphical representation of a derivation. Its nodes are symbols, which can be terminals or non-terminals.
o In parsing, the string is derived using the start symbol. The root of the parse tree is that start symbol.
o Parse tree follows the precedence of operators. The deepest sub-tree is traversed first, so the operator in a parent node has lower precedence than the operator in its sub-tree.
Example:
Production rules:
1. T= T + T | T * T
2. T = a|b|c
Input:
a*b+c
Steps 1–5: (parse-tree construction figures omitted; the tree is grown from the root by expanding T = T + T, then T = T * T in the left sub-tree, and finally the leaves a, b and c)
Ambiguity
A grammar is said to be ambiguous if there exists more than one leftmost derivation, more than one rightmost derivation, or more than one parse tree for some input string. If no such string exists, the grammar is called unambiguous.
Example:
1. S = aSb | SS
2. S=∈
For the string aabb, the above grammar generates two parse trees:
If a grammar has ambiguity then it is not good for compiler construction. No method can automatically detect and remove ambiguity in general, but you can often remove ambiguity by re-writing the whole grammar without ambiguity.
Context Free Grammars(CFGs) are classified based on:
Number of Derivation trees
Number of strings
Depending on the Number of Derivation trees, CFGs are sub-divided into 2 types:
Ambiguous grammars
Unambiguous grammars
Ambiguous grammar: A CFG is said to be ambiguous if there exists more than one derivation
tree for the given input string i.e., more than one LeftMost Derivation Tree (LMDT)
or RightMost Derivation Tree (RMDT).
Definition: G = (V, T, P, S) is a CFG that is said to be ambiguous if and only if there exists a string in T* that has more than one parse tree, where V is a finite set of variables, T is a finite set of terminals, P is a finite set of productions of the form A → α (where A is a variable and α ∈ (V ∪ T)*), and S is a designated variable called the start symbol.
For Example:
Let us consider this grammar: E -> E+E | id. We can create two parse trees for the string id+id+id, each arising from a different leftmost derivation. As a fuller example, consider the grammar:
E -> I
E -> E + E
E -> E * E
E -> (E)
I -> ε | 0 | 1 | … | 9
From the above grammar the string 3*2+5 can be derived in two ways:
I) First leftmost derivation:
E ⇒ E*E ⇒ I*E ⇒ 3*E ⇒ 3*E+E ⇒ 3*I+E ⇒ 3*2+E ⇒ 3*2+I ⇒ 3*2+5
II) Second leftmost derivation:
E ⇒ E+E ⇒ E*E+E ⇒ I*E+E ⇒ 3*E+E ⇒ 3*I+E ⇒ 3*2+E ⇒ 3*2+I ⇒ 3*2+5
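A brute-force count of parse trees (a sketch of mine, not from the notes) confirms the two derivations above: split the token list at each operator and multiply the counts of parse trees for the two sub-spans.

```python
from functools import lru_cache

# Sketch: count parse trees for E -> E+E | E*E | I (I a digit)
# over a token list, illustrating ambiguity.
def count_trees(tokens):
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):                      # number of trees for toks[i:j]
        if j - i == 1:
            return 1 if toks[i].isdigit() else 0
        total = 0
        for k in range(i + 1, j):         # try each operator as the root
            if toks[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))

print(count_trees(["3", "*", "2", "+", "5"]))  # 2
```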
Following are some examples of ambiguous grammar:
S -> aS | Sa | ε
E-> E +E | E*E| id
A -> AA | (A) | a
S -> SS|AB , A -> Aa|a , B -> Bb|b
Whereas following grammars are unambiguous:
S -> (L) | a, L -> LS | S
S -> AA , A -> aA , A -> b
Inherently ambiguous Language: Let L be a Context Free Language (CFL). If every Context-Free Grammar G with Language L = L(G) is ambiguous, then L is said to be an inherently ambiguous language. Ambiguity is a property of grammars, not languages. An ambiguous grammar is unlikely to be useful for a programming language because two or more parse tree structures for the same string (program) imply two different meanings (executable programs) for that program. An inherently ambiguous language would be absolutely unsuitable as a programming language because we would not have any way of fixing a unique structure for all its programs. For example,
L = {a^n b^n c^m} ∪ {a^n b^m c^m}
Note: Ambiguity of a grammar is undecidable, i.e., there is no algorithm for detecting and removing the ambiguity of an arbitrary grammar, but we can often remove ambiguity by disambiguating the grammar, i.e., rewriting it so that there is only one derivation (and hence one parse tree) possible for each string of the language.
2) If both left and right recursion are not present in grammar, then is the grammar
unambiguous? Explain with an example.
Ans – No, the grammar can still be ambiguous. If both left and right recursion are present in a grammar, then the grammar is ambiguous, but the reverse is not always true.
Example -
S -> aB | ab
A -> AB | a
B -> Abb | b
In the above example, although neither left nor right recursion is present, for the string { ab } we can make more than one parse tree.
From this example we can see that even if both left and right recursion are absent from a grammar, the grammar can still be ambiguous.
3) State whether the grammar is ambiguous or not.
S -> SAB | Є
A -> AaB | a
B -> AS | b
Ans – The grammar is Ambiguous.
If we put
B -> AS in S -> SAB
Then we get S -> SAAS and the grammar clearly contains both left and right recursion. Hence
the grammar is ambiguous.
Left Recursion
A grammar is left-recursive if it has a non-terminal ‘A’ whose derivation contains ‘A’ itself as the left-most symbol. Left-recursive grammar is considered a problematic situation for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a non-terminal. So, when the parser encounters the same non-terminal in its derivation, it cannot judge when to stop expanding the left non-terminal and it goes into an infinite loop.
E → E + T | T
Comparing with A → Aα | β:
∴ A = E, α = +T, β = T
∴ A → Aα | β is changed to A → βA′ and A′ → αA′ | ε
∴ A → βA′ means E → TE′, and A′ → αA′ | ε means E′ → +TE′ | ε
T → T * F | F
Comparing with A → Aα | β:
∴ A = T, α = *F, β = F
∴ A → βA′ means T → FT′, and A′ → αA′ | ε means T′ → *FT′ | ε
The production F → (E) | id does not have any left recursion.
∴ Combining the results, we get
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
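The transformation A → Aα | β ⇒ A → βA′, A′ → αA′ | ε can be sketched as a small routine over productions represented as symbol lists (the representation and function name are my own; [] stands for ε):

```python
# Sketch: eliminate immediate left recursion for one non-terminal.
# A -> A a1 | ... | b1 | ...  becomes  A -> b1 A' | ...,  A' -> a1 A' | ... | ε
def eliminate_immediate_lr(head, prods):
    rec = [p[1:] for p in prods if p and p[0] == head]       # the alphas
    nonrec = [p for p in prods if not p or p[0] != head]     # the betas
    if not rec:                                              # nothing to do
        return {head: prods}
    new = head + "'"
    return {
        head: [beta + [new] for beta in nonrec],             # A  -> beta A'
        new: [alpha + [new] for alpha in rec] + [[]],        # A' -> alpha A' | ε
    }

g = eliminate_immediate_lr("E", [["E", "+", "T"], ["T"]])
print(g)  # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```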
Example 2 − Eliminate the left recursion from the following grammar.
S → a | ^ | (T)
T → T, S | S
Solution
We have immediate left recursion in the T-productions.
Comparing T → T, S | S with A → Aα | β, where A = T, α = ", S" and β = S, we get
T → ST′
T′ → , ST′ | ε
Solution
The productions after removing the left recursion will be
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Solution
Eliminating immediate left recursion among all the productions, we obtain
E → TE′
E′ → (T)E′ | ε
T → FT′
T′ → (F)T′ | ε
F → id
Direct Recursion
For each rule which contains a left-recursive option,
A --> Aα | β
introduce a new nonterminal A' and rewrite the rule as
A --> β A'
A' --> α A' | ε
Thus the production:
E --> E + T | T
is left-recursive with "E" playing the role of A, "+ T" playing the role of α, and "T" playing the role of β.
Introducing the new nonterminal E', the production can be replaced by:
E --> T E'
E' --> + T E' | ε
Of course, there may be more than one left-recursive part on the right-hand side. The general rule is to replace
A --> A α1 | A α2 | ... | A αn | β1 | β2 | ... | βm
by
A --> β1 A' | β2 A' | ... | βm A'
A' --> α1 A' | α2 A' | ... | αn A' | ε
Note that this may change the "structure". For the expressions above, the original grammar is left-associative, while the non-left-recursive one is now right-associative.
Indirect Recursion
Step one describes a rule to eliminate direct left recursion from a production. To eliminate
left-recursion from an entire grammar may be more difficult because of indirect left-
recursion.
For example
A --> B x y | x
B --> C D
Left Factoring
Left factoring is applied when two or more alternatives of a production share a common prefix, which leaves a top-down parser unable to choose between them. A production of the form
A → αβ1 / αβ2 ——- (1)
is rewritten, using a new non-terminal A’, as
A → αA’ ——- (2)
A’ → β1 / β2 ——- (3)
Example:
Consider the grammar and eliminate the left recursion
E ⟶ E + T / T ——- (1)
T ⟶ T * F / F ——- (2)
First, let’s take production (1) and rewrite it using the rule A → Aα / β ⇒ A → βA’, A’ → αA’ / Є:
E ⟶ TE’ ——- (3)
E’ ⟶ +TE’ / Є ——- (4)
Productions 3 and 4 replace the given production (1). Similarly, for production (2):
T ⟶ FT’ ——- (5)
T’ ⟶ *FT’ / Є ——- (6)
Productions 5 and 6 replace the given production (2). The resulting grammar, free of left recursion, is:
E ⟶ TE’
E’ ⟶ +TE’ / Є
T ⟶ FT’
T’ ⟶ *FT’ / Є
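The left-factoring rule itself can be sketched as code (my own representation: productions as symbol lists, [] for ε; this handles a single shared first symbol rather than the longest common prefix):

```python
from collections import defaultdict

# Sketch: one step of left factoring.
# A -> a b1 | a b2  becomes  A -> a A',  A' -> b1 | b2
def left_factor_once(head, prods):
    groups = defaultdict(list)
    for p in prods:
        groups[p[0] if p else None].append(p)   # group by first symbol
    result = {head: []}
    for first, group in groups.items():
        if first is not None and len(group) > 1:
            new = head + "'"
            result[head].append([first, new])    # A  -> a A'
            result[new] = [p[1:] for p in group] # A' -> b1 | b2 (may be ε)
        else:
            result[head].extend(group)
    return result

# Classic dangling-prefix example: S -> iEtS | iEtSeS | a
g = left_factor_once("S", [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]])
print(g)  # {'S': [['i', "S'"], ['a']], "S'": [['E', 't', 'S'], ['E', 't', 'S', 'e', 'S']]}
```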
Syntax analysis or parsing is the process of analyzing the string of symbols using grammar rules.
Parsing is divided into two types: top-down parsing and bottom-up parsing. Parsing means breaking something down into its constituent parts, for example, breaking a sentence down to explain what its verb does.
For data, the compiler unit performs the sequential parsing operation in top-down and bottom-up
parsing after its tokenizer, lexer, or scanner has produced its tokens.
Top-down parsing means parsing the input and constructing the parse tree starting from the root and going down to the leaves. It uses leftmost derivation to build the parse tree. On the contrary, bottom-up parsing is based on reverse rightmost derivation. A top-down parser constructs the parse tree from a grammar that must be free from left recursion and ambiguity, and the grammar must also be left-factored.
Top-down parsing is a method of parsing the input string provided by the lexical analyzer. The top-
down parser parses the input string and then generates the parse tree for it.
Construction of the parse tree starts from the root node i.e. the start symbol of the grammar. Then
using leftmost derivation it derives a string that matches the input string.
In the top-down approach, construction of the parse tree starts from the root node and ends with the leaf nodes. The leaf nodes represent the terminals that match the terminals of the input string.
1. To derive the input string, first, a production in grammar with a start symbol is applied.
2. Now at each step, the parser has to identify which production rule of a non-terminal must be
applied in order to derive the input string.
3. The next step is to match the terminals in the production with the terminals of the input string.
In top-down parsing with backtracking, the parser attempts multiple rules or productions to find a match for the input string, backtracking at every step of the derivation. So, if the applied production does not produce the required input string, or does not match the needed prefix, the parser can undo that expansion and try another alternative.
Example1 − Consider the Grammar
S→aAd
A→bc|b
Make the parse tree for the string abd. Also, show the parse tree when backtracking is required because the wrong alternative is chosen.
Solution
The derivation for the string abd will be −
S ⇒ a A d ⇒ abd (Required String)
If bc is substituted in place of non-terminal A then the string obtained will be abcd.
S ⇒ a A d ⇒ abcd (Wrong Alternative)
Figure (a) represents S → aAd
Figure (b) represents an expansion of tree with production A → bc which gives string abcd which
does not match with string abd. So, it backtracks & chooses another alternative, i.e., A → b in figure
(c) which matches with abd.
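The backtracking behaviour can be sketched as a brute-force parser (mine, not the notes'): a nonterminal tries its alternatives in order, and a failed alternative simply yields nothing, which amounts to backtracking to the next alternative.

```python
# Sketch: top-down parsing with backtracking for S -> aAd, A -> bc | b.
# The parser tries A -> bc first; on failure it backtracks to A -> b.
GRAMMAR = {"S": [["a", "A", "d"]], "A": [["b", "c"], ["b"]]}

def parse(symbols, s, i):
    """Try to match symbols against s starting at index i; yield end positions."""
    if not symbols:
        yield i
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                       # nonterminal: try each alternative
        for alt in GRAMMAR[first]:
            yield from parse(alt + rest, s, i)
    elif i < len(s) and s[i] == first:         # terminal: must match the input
        yield from parse(rest, s, i + 1)

def accepts(s):
    return any(end == len(s) for end in parse(["S"], s, 0))

print(accepts("abd"))   # True (A -> b, reached after backtracking from A -> bc)
print(accepts("abcd"))  # True (A -> bc)
print(accepts("abb"))   # False
```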
A parser that uses a set of recursive procedures to recognize its input with no backtracking is called a recursive descent parser.
For implementing a recursive descent parser for a grammar:
The grammar must not be left recursive
The grammar must be left factored, which means it should not have common prefixes for alternates
We need an implementation language that has a recursion facility
Left factoring Problems:
Eliminate Left recursion and then left factor the following grammar
rexpr → rexpr + rterm | rterm
rterm → rterm rfactor | rfactor
rfactor → rfactor* | rprimary
rprimary → a | b
Design:
It can be written along the same lines as the brute-force approach, that is, one procedure for one non-terminal.
The procedures need not return anything, because they need not try another alternative as in brute force.
Only on failure do we call some error routine to take care of the error.
One can make use of transition diagrams for designing a recursive descent parser, as used in the case of scanners.
As there is no non-determinism in the transition diagrams, the parser code follows directly from them (the code is omitted in the source). We can further improve the transition diagrams to obtain better parser code; for example, the transition diagram for E' can be transformed by collapsing its tail-recursive call into a loop. The code derived from these reduced transition diagrams runs faster than the code written for the original transition diagrams.
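A hand-written recursive descent parser for the transformed grammar E → TE′, E′ → +TE′ | ε, T → FT′, T′ → *FT′ | ε, F → (E) | id might look like the following sketch (the class layout and names are my own; one procedure per non-terminal):

```python
# Sketch: recursive descent parser for the left-recursion-free expression grammar.
class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]   # '$' marks end of input
        self.pos = 0

    def look(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t}, got {self.look()}")
        self.pos += 1

    def E(self):                     # E -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                # E' -> + T E' | ε
        if self.look() == "+":
            self.match("+"); self.T(); self.Eprime()

    def T(self):                     # T -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                # T' -> * F T' | ε
        if self.look() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):                     # F -> ( E ) | id
        if self.look() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

def rd_accepts(tokens):
    p = Parser(tokens)
    try:
        p.E()
        return p.look() == "$"       # all input must be consumed
    except SyntaxError:
        return False

print(rd_accepts(["id", "+", "id", "*", "id"]))  # True
print(rd_accepts(["id", "+", "*"]))              # False
```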
Advantage:
Easy to hand construct a recursive descent parser from an appropriate grammar
Limitations
A recursive descent parser encodes state information in its run-time stack, or call stack, and this may not be particularly efficient.
It uses recursive procedure calls to implement a stack abstraction.
It is very difficult to automate its construction.
One major drawback of recursive-descent parsing is that it can be implemented only in those languages which support recursive procedure calls, and it suffers from the problem of left recursion.
LL (1) Grammars
When we construct a predictive parsing table for some grammars, the table may have some multiply defined entries. This happens when the given grammar is left recursive or ambiguous.
A grammar whose predictive parsing table has no multiply defined entries is known as an LL(1) grammar.
LL(1) means a left-to-right scan of the input producing a leftmost derivation, using 1 symbol of lookahead (this can be extended to k symbols of lookahead).
Rather than constructing a table to check for LL(1) Grammar, we can check for it using the
following definition of LL(1) Grammar.
Definition:
A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G then the
following three conditions hold.
For no terminal a do both α and β derive strings beginning with a
At most one of α and β can derive ε
If β⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A)
The parser needs to find a production to use for nonterminal N when it sees lookahead token t. To select which production to use, it suffices to have a table that has, as a key, a pair (N, t) and gives the number of a production to use.
Let's illustrate with an LL(1) parsing table for the expression grammar that we used earlier, which
looks like this.
1. E → T R
2. R → ε
3. R → + E
4. T → F S
5. S → ε
6. S → * T
7. F → n
8. F → ( E )
Parsing table D below does the job. Each row is labeled by a nonterminal and each column by a
lookahead token, or the special symbol $ that indicates the end of the input.
D(N, t) is the production to use to expand N when the lookahead is t. Blank entries mean syntax
error.
Table D
      n    +    *    (    )    $
E     1              1
R          3              2    2
T     4              4
S          5    6         5    5
F     7              8
With lookahead n, D(E, n) = 1, so E is replaced by T R:
E
/\
T R
Next, D(T, n) = 4, so T is replaced by F S:
E
/\
T R
/\
F S
Then D(F, n) = 7, so F is replaced by n, and the token n is matched:
E
/\
T R
/\
F S
|
n
The lookahead changes to * and D(S, *) = 6. Since production 5 is S → * T, the table tells us to
replace S by * T.
E
/\
T R
/\
F S
| /\
n * T
Now the lookahead is n, and D(T, n) = 4. After using production 4 (T → F S), the table will tell us to use production 7 (F → n), giving
E
/\
T R
/\
F S
| /\
n * T
/\
F S
|
n
The parse is almost finished. Since there are no more tokens, the lookahead is $. D(S, $) = 5 and
D(R, $) = 2, which says to replace S and R by ε.
E
/\
/ \
T R
/\ |
F S ε
| /\
n * T
/\
F S
| |
n ε
A stack-based approach
We use two stacks, called Matched and Todo. Stack Matched only holds tokens.
At any given point, the string that has been derived is m t, where m is the contents of the Matched stack (from bottom to top) and t is the contents of the Todo stack (from top to bottom).
The action column tells the production that is used or, when a token is moved to the Matched stack and removed from the input, a match action.
Predictive parsing can be performed using a pushdown stack, avoiding recursive calls.
Initially the stack holds just the start symbol of the grammar.
At each step a symbol X is popped from the stack:
o if X is a terminal symbol then it is matched with lookahead and lookahead is
advanced,
o if X is a nonterminal, then using lookahead and a parsing table (implementing the
FIRST sets) a production is chosen and its right hand side is pushed onto the stack.
This process goes on until the stack and the input string become empty. It is useful to have
an end_of_stack and an end_of_input symbols. We denote them both by $.
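The loop just described can be sketched for the numbered expression grammar and table D given earlier (the dictionaries below transcribe that grammar and table; the function itself is my own):

```python
# Sketch: stack-based predictive (LL(1)) parsing.
# Productions 1..8 of the expression grammar; [] denotes ε.
PROD = {
    1: ("E", ["T", "R"]), 2: ("R", []), 3: ("R", ["+", "E"]),
    4: ("T", ["F", "S"]), 5: ("S", []), 6: ("S", ["*", "T"]),
    7: ("F", ["n"]), 8: ("F", ["(", "E", ")"]),
}
# Parsing table D: (nonterminal, lookahead) -> production number.
D = {("E", "n"): 1, ("E", "("): 1, ("R", "+"): 3, ("R", ")"): 2, ("R", "$"): 2,
     ("T", "n"): 4, ("T", "("): 4, ("S", "+"): 5, ("S", "*"): 6, ("S", ")"): 5,
     ("S", "$"): 5, ("F", "n"): 7, ("F", "("): 8}

def predictive_parse(tokens):
    input_ = tokens + ["$"]               # '$' marks end of input
    stack = ["$", "E"]                    # '$' at bottom, start symbol on top
    i = 0
    while stack:
        X = stack.pop()
        a = input_[i]
        if X == a:                        # terminal (or $) on top: match it
            i += 1
        elif (X, a) in D:                 # nonterminal: expand via the table
            _, rhs = PROD[D[(X, a)]]
            stack.extend(reversed(rhs))   # push RHS so its first symbol is on top
        else:                             # empty table entry: syntax error
            return False
    return i == len(input_)

print(predictive_parse(["n", "+", "n", "*", "n"]))  # True
print(predictive_parse(["n", "+", "*"]))            # False
```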
Example. Consider the grammar
S → aAa | BAa | ε
A → cA | bA | ε
B → b
and its predictive parsing table:

      a     b     c     $
S     aAa   BAa         ε
A     ε     bA    cA
B           b
where the empty slots correspond to error-entries. Consider the parsing of the word w = bcba.
THE ALGORITHM.
We know that the predictive parser performs a leftmost derivation while parsing the given sentence.
The given sentence may be a valid or an invalid sentence with respect to the specified grammar.
An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol, or when a nonterminal X is on top of the stack, a is the next input symbol, and the parsing table entry PT[X, a] is empty, i.e., there is no entry for X in the parsing table.
Our predictive parser works as follows, with X the symbol on top of the stack and a the current input symbol:
while (stack is not empty) {
if (X is a terminal or $)
{
if (X == a)
{
pop X
move input ptr forward to next symbol
}
else // X != a
{ // Error found }
}
else // X is a nonterminal
{