Introduction to Parsing
Ambiguity and Syntax Errors
Outline
• Parser overview
• Context-free grammars (CFG’s)
• Derivations
• Ambiguity
• Syntax errors
2
What is the job of Syntax Analysis?
• Syntax Analysis is also called Parsing or Hierarchical
Analysis.
• A Parser implements grammar of the language may it be java,
C, C++ etc
• The parser obtains a string of tokens from the lexical analyzer
and :
• verifies that the string can be generated by the
grammar for the source language
• Reports any syntax errors in the program
• Constructs a parse tree representation of the
program
• usually calls the lexical analyzer to supply a token to it
when necessary
• The grammar that a parser implements is called a Context
The Functionality of theParser
• Input: sequence of tokens from lexer
• Output: parse tree of the program
Comparison with Lexical Analysis:
Phase Input Output
Lexer Sequence of Sequence of
characters tokens
Parser Sequence of Parse tree
tokens
4
Example
• If-then-else statement
if (x == y) then z =1; else z = 2;
• Parser input
IF (ID == ID) THEN ID = INT; ELSE ID =
INT;
• Possible parser output
IF-THEN-ELSE
== = =
I I I IN I IN
D D D T D T
5
The Role of the Parser
• Not all sequences of tokens are programs ...
• Parser must distinguish between valid and
invalid sequences of tokens
• The Role is:
1. To check syntax (= string recognizer)
And to report syntax errors accurately
2. To invoke semantic actions
For static semantics checking, e.g. type checking of
expressions, functions, etc.
• We need
– A language for describing valid sequences of
tokens
– A method for distinguishing valid from invalid
6
What is the difference between Syntax and Semantic?
Syntax is the way in which we construct sentences
by following principles and rules.
Semantics is the interpretations of and meanings
derived from the sentence transmission and
understanding of the message or in other words
are the logical sentences making sense or not
Context-Free Grammars
• Many programming language constructs have a
recursive structure
• A STMT is of the form
if COND then STMT else STMT , or
while COND do STMT , or
…
• Context-free grammars are a natural notation
for this recursive structure
8
CFGs (Cont.)
A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions
Assuming X N the productions are of the form
X , or
X Y1 Y2 ... Yn where Yi N T
10
Notational Conventions
Terminals: Lower case letters, operator symbols,
punctuation symbols, digits, boldface strings are all
terminals
Non Terminals: Upper case letters, lower case italic
names are usually non terminals
• Greek letters such as ,, represent strings of
grammars symbols. Thus a generic production can
be written as A
• The start symbol is the left-hand side of the first
production
10
Examples of CFGs
A fragment of our example language
(simplified):
STMT if COND then STMT else STMT
| while COND do STMT
| id = int
11
Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
E E*
E
|E+E
|(E)
|id
12
The Language of a CFG
Read productions as replacement rules:
X Y1 ... Yn
Means X can be replaced by Y1 ... Yn
X
Means X can be erased (replaced with empty string)
13
Key Idea
(1) Begin with a string consisting of the start
symbol “S”
(2) Replace any non-terminal X in the string
by a right-hand side of some production
X Y1 . . Y n
(3) Repeat (2) until there are no non-terminals
in the string
14
The Language of a CFG (Cont.)
More formally, we write
X1 . . X i . . X n X1 ..X i 1 Y 1 . . Y m Xi1 . . X n
if there is a production
Xi Y1 . . Y m
Write
X1 . . X n Y1 . . Y m
if
X1 . . X n .. ..
Y1 . . Y m
15
The Language of a CFG
Let G be a context-free grammar with start
symbol S. Then the language of G is:
{a1 . . a n | S a1 . . a n and every ai is a
terminal }
Terminals
• Terminals are called so because there are no
rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the
language 16
Examples
L(G) is the language of the CFG G
Strings of balanced parentheses
( ) i i
|i
Two grammars:
S (S ) S 0(S)
or
S |
20
Example
A fragment of our example language (simplified :
STMT if COND then STMT
|if COND then STMT else STMT
|while COND do STMT
|id = int
COND (id == id)
|
(id != id)
18
Arithmetic Example
Simple arithmetic expressions:
E E+E | E E | (E) |
id
Some elements
id of the id
language:
+ id
(id) id
(id) id id
(id)
19
Derivations and Parse Trees
A derivation is a sequence of productions
S.. .. ..
A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production
X Y1 . . Y n add
children Y1 . . Y n to node X
20
Derivation Example
• Grammar
E E+E | E E | (E) | id
• String id id + id E
E
E+E
E + E
E E+E
id E + E E * id
id id + E E
id id + id id id
21
Derivation Example
• Grammar
E E+E | E E | (E) | id
• String id id + id
Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• An in-order traversal of the leaves is the
original input
• The parse tree shows the association of
operations, the input string does not
23
Left-most and Right-most
Derivations
E • What was shown before
was a left-most derivation
E+E – At each step, replace the
left-most non-terminal
E+id
E E + id • There is an equivalent
E id + id notion of a right-
most derivation
id id + id – Shown on the right
24
Right-most Derivation in Detail
E E E
E E+E
E + E
E
E
E+E E + E
E+id
id
25
Right-most Derivation in Detail (2)
E E
E+E
E+id E + E
E E + id
E * id
E id + id E
id id + id id
• Note that right-most and left-most
id
derivations have the same parse tree
• The difference is just in the order in
which branches are added
26
Ambiguity
• Grammar:
EE+E|E* E | (E)|
int
• The string int * int + int has two parse trees
E
E E+ E * E
E
E * E int int
E + E
int int int
int 27
Ambiguity (Cont.)
A grammar is ambiguous if it has more
than one parse tree for some string
– Equivalently, there is more than one right-
most or left-most derivation for some string
Ambiguity is bad
– Leaves meaning of some programs ill-defined
Ambiguity is common in programming
languages
– Arithmetic expressions
– IF-THEN-ELSE
28
Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar
unambiguously
ET+E|T
T int * T | int | ( E )
• This grammar enforces precedence of * over +
29
The Dangling Else: Example
Consider the following grammar S if C then S
• The expression |if
C then S else S
if C1 then if C2 then S3 else S4
|
has twoOTHER
parse trees
if if
C1 if S4 C1 if
C2 S3
S3 S4
C2
• Typically we want the second form
30
The Dangling Else: A Fix
• else should match the closest unmatched then
• We can describe this in the grammar
S MIF /* all then are matched */
| /* some then are unmatched */
MIFUIF
if C then MIF else MIF
| OTHER
UIF if C then S
| if C then MIF else UIF
• Describes the same set of strings
The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4
if if
C1 if C1 if S4
C2 C2
S3
S3
• Not valid because the
S4 then expression is
not a MIF
• A valid parse tree 32
Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an
ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the
grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms
33
Precedence and Associativity Declarations
• Instead of rewriting the grammar
– Use the more natural (ambiguous) grammar
– Along with disambiguating declarations
• Most tools allow precedence and associativity
declarations to disambiguate grammars
• Examples …
34
Associativity Declarations
• Consider the grammar EE+E|
int
• Ambiguous: two parse trees of int + int + int
E E
E + E + E
E
E + E int int E + E
int int int int
• Left associativity declaration: %left +
35
Precedence Declarations
• Consider the grammar E E + E * E | int
| E And the string int + int * int
E E
* E + E
E
E + E int int E *
E
int int int
int
• Precedence declarations: %left
+
36
Error Handling
Purpose of the compiler is
– To detect non-valid programs
– To translate the valid ones
Many kinds of possible errors (e.g. in C)
Error kind Example Detected by …
Lexical …$… Lexer
Syntax … x *% … Parser
Semantic … int x; y = x(3); … Type checker
Correctness your favorite program Tester/User
37
Error Handling
A good compiler should assist in identifying and locating errors
◦ Lexical errors: important, compiler can easily recover and
continue such as misspelling an identifier, keyword etc.
◦ Syntax errors: most important for compiler, can almost
always recover such as arithmetic expression with unbalanced
parenthesis
◦ Static semantic errors: important, can sometimes recover
such as operator applied to incompatible operands
◦ Dynamic semantic errors: hard or impossible to detect at
compile time, runtime checks are required
◦ Logical errors: hard or impossible to detect such as infinite
recursive calls
38
Syntax Error Handling
Error handler should
– Report errors accurately and clearly
– Recover from an error quickly
– Not slow down compilation of valid code
• Good error handling is not easy to achieve
39
Approaches to Syntax Error Recovery
Approaches from simple to complex
– Panic mode
– Error productions
– Automatic local or global correction
• Panic mode is the simplest, most popular method
• When an error is detected:
– Discard tokens until one with a clear role is
found
– Continue from there
• Such tokens are called synchronizing tokens
– Typically the statement or expression
terminators
40
Questions???