Programming Languages:Syntax Description and Parsing
Programming Languages:
Syntax Description and Parsing
Onur Tolga Şehitoğlu
Computer Engineering,METU
27 May 2009
Programming Languages:Syntax Description and Parsing
Outline
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction
Introduction
Syntax: the form and structure of a program.
Semantics: meaning of a program
Language definitions are used by:
Programmers
Implementors of the language processors
Language designers
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction
Definitions
A sentence is a string of characters over some alphabet
A language is a set of sentences
A lexeme is the lowest level syntactic unit of the language (i.e.
++, int, total)
A token is a category of lexemes (i.e. identifier )
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction
Definitions
syntax recognition: read input strings of the language and
verify the input belonging to the language
syntax generation: generate sentences of the language (i.e.
from a given data structure)
Compilers and interpreters recognize syntax and convert it
into machine understandable form.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Backus-Naur Form and CFGs
Backus-Naur Form and CFGs
CFG’s introduced by Noam Chomsky (mid 1950s)
Programming languages are usually in context free language
class
BNF introduced by John Bakus and modified by Peter Naur
for describing Algol language
BNF is equivalent to CFGs. It is a meta-language that
decribes other languages
Extended BNF improves readability of BNF
Programming Languages:Syntax Description and Parsing
Describing Syntax
Backus-Naur Form and CFGs
A Grammar Rule
hwhile stmti → while ( hlogic expri ) hstmti
LHS is a non-terminal denoting an intermediate phrase
LHS can be defined (rewritten) as the RHS sequence which
can contain terminals (lexems and tokens) of the language
and other non-terminals
Non-terminals are denoted as strings enclosed in angle
brackets.
::= may be used in BNF notation instead of the arrow
| is used to combine multiple rules with same LHS in a single
rule
hlgc consi ::= true ≡ hlgc consi ::= true | false
hlgc consi ::= false
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Context Free Grammar
A grammar G is defined as G = (V , Σ, R, S):
N, finite set of non terminals
Σ, finite set of terminals
R is a set of grammar rules. A relation from V to (V ∪ Σ)∗ .
S ∈ N the start symbol
Application of a rule maps one sentential form into the other
by replacing a non-terminal element in sentential form with its
right handside seuqence in the rule, u 7→ v .
n o
∗
Language of a grammar L(G ) = w | w ∈ Σ∗ , S 7→ w
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Recursive or list like structures can be represented using
recursion
hexpr listi → hexpri , hexpr listi
hbtreei → hheadi ( hbtreei , hbtreei )
A derivation starts with a starting non-terminal and rules are
applied repeteadly to end with a sentence containing only
terminal symbols.
leftmost derivation: always leftmost non-terminal is chosen for
replacement
rightmost derivation: always rightmost non-terminal is chosen
for replacement
Same sentence can be derived using leftmost, rightmost, or
other derivaionts.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Sample Grammar
hstmti → hidi = hexpri
hexpri → hexpri hopi hexpri | hidi
hopi → + | *
hidi → a | b | c
Leftmost derivation of a = a * b :
hstmti 7→ hidi = hexpri 7→ a = hexpri
7→ a = hidi hopi hexpri 7→ a = b hopi hexpri
7→ a = b * hexpri 7→ a = b * hidi 7→ a = b * c
Rightmost derivation of a = a * b :
hstmti 7→ hidi = hexpri 7→ hidi = hexpri hopi hexpri
7→ hidi = hexpri hopi hidi 7→ hidi = hexpri hopi b
7→ hidi = hexpri * b 7→ hidi = hidi * b
7→ hidi = a * b 7→ a = a * b
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Parse Tree
Steps of a derivation gives the structure of the sentence. This
structure can be represented as a tree.
All non-terminals used in derivation are intermediate nodes.
Each grammar rule replaces the non-terminal node with is
children. Root node is the start symbol.
Terminal nodes are the leaf nodes.
preorder traversal of leaf nodes gives the resulting sentence.
leftmost and rightmos derivations can be retrieved by traversal
of the tree.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Parse Tree Example
a = a * b
hstmti
hidi = hexpri
a hexpri hopi hexpri
hidi * hidi
a b
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar
Parse Tree Generation
A parse tree gives the structure of the program so semantics
of the program is related to this structure.
For example local scopes, evaluation order of expressions etc.
During compilation, parse trees might be required for code
generation, semantic analysis and optimization phases.
After a parse tree generated, it can be traversed to do various
tasks of compilation.
The processing of parse tree takes too long, so creation of
parse trees is usually avoided.
Approaches like syntax directed translation combines parsing
with code generation, semantic analysis etc..
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars
Ambigous Grammars
Consider a = a + b * c in our grammar:
hstmti hstmti
hidi = hexpri vs hidi = hexpri
a hexpri hopi hexpri a hexpri hopi hexpri
hidi + hexpri hopi hexpri
hexpri hopi hexpri * hidi
a hidi * hidi
hidi + hidi c
b c
a b
Both can be derived by the grammar!
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars
A grammar is called ambigous if same sentence can be derived
by following different set of rules, thus resulting in a different
parse tree
If structure changes semantic meaning of the program,
ambiguity is a serious problem.
Even if not, which one is the result?
i.e. Precedence of operators affects the value of the
expression.
Programming languages enforces precedence rules to resolv
ambiguity.
Solution:
1 design grammar not to be ambigous, or
2 during parsing, choose rules to generate the correct parse tree
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars
Precedence and Grammar
Operators with different precedence levels should be treated
differently
Higher precedence operations should be deep in the parse tree
→ their rules should be applied later.
Lower precedence operations should be closer to root →
applied earlier in derivation.
For each precedence level, define a non-terminal
One rewritten on the other based on the precedence lower to
higher
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars
Rewritten Grammar
hstmti
hstmti → hidi = hexpri hidi = hexpri
hexpri → hexpri + htermi | htermi
a hexpri + htermi
htermi → htermi * hfactori | hfactori
htermi htermi * hfactori
hfactori → hidi | ( hexpri )
hidi → a | b | c hfactori hfactori hidi
hidi hidi c
a b
htermi and hexpri has different precedence.
Once inside of htermi, there is no way to derive +
Only one parse possible
Programming Languages:Syntax Description and Parsing
Describing Syntax
Associativity
Associativity
Associativity of operators is another issue
a - b - c ≡ ( a - b ) - c or a - ( b - c)
Recursion of grammar defines how tree is constructed for
operators in the same level.
If left recursive, later operators in the sentence will be closer
to root, if right recursive earlier operators will be closer to root
left recursion implies left associativity, right recursion implies
right associativity.
Consider a + b + c in these grammars:
hexpri → hexpri + hidi | hidi hexpri → hidi + hexpri | hidi
hidi → a | b | c
vs hidi → a | b | c
Programming Languages:Syntax Description and Parsing
Describing Syntax
An Assignment Grammar
Sample Grammar
hasgni → hidi = hasgni | hidi = hexpri
hexpri → hexpri + htermi | htermi
htermi → htermi * hfactori | hfactori
hfactori → hpowi ^ hfactori | hpowi
hpowi → hidi | ( hexpri )
hidi → a | b | c
hasgni is right recursive like right associative C assignments.
hexpri and htermi are left recursive, * and + left associative
hfactori is right recursive for power operation ^ to be right associative.
precedence order is (...) ≺ ^ ≺ * ≺ + ≺ =
Programming Languages:Syntax Description and Parsing
Describing Syntax
An Assignment Grammar
a = a + b * c * a ^ b ^ c
hasgni
hidi = hexpri
a hexpri + htermi
htermi htermi * hfactori
hfactori htermi * hfactori hpowi ^ hfactori
hpowi hfactori hpowi hidi hpowi ^ hfactori
hidi hpowi hidi a hidi hpowi
a hidi c b hidi
b c
Programming Languages:Syntax Description and Parsing
Parsing
Compilation
Compilation
source code
Lexical Analysis
sequence of lexemes
Syntax analysis
parse tree
Intermediate while ( c o u n t e r < 12341) {
Symbol code generation f () ;
Optimization
Table and Semantic c o u n t e r += 12;
Analysis }
intermediate code
WHL LP ID LT I L I T RP LB
Code ID LP RP SC
generation ID PLEQ I L I T SC
RB
machine code
hwhlstmti
Programming Languages:Syntax Description and Parsing
Parsing
Parsing
Parsing
input: sequence of lexemes (output of lexical analysis) or
characters.
output: parse tree, intermediate code, translated code, or
sometimes only if document is valid or not.
Two main classes of parser:
Top down parsing
Tottom up parsing
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing
Top-down Parsing
Start from the starting non-terminal, apply grammar rules to
reach the input sentence
hassigni 7→ a = hexpr i 7→ a = hexpr i + htermi 7→
a = htermi + htermi 7→ a = hfacti + htermi 7→
a = a + htermi 7→ a = a + htermi ∗ hfacti 7→
a = a + hfacti ∗ hfacti 7→ a = a + b ∗ hfacti 7→
a=a+b∗a
Simplest form gives leftmost derivation of a grammar
processing input from left to right.
Left recursion in grammar is a problem. Elimination of left
recursion needed.
Deterministic parsing: Look at input symbols to choose next
rule to apply.
recursive descent parsers, LL family parsers are top-down
parsers
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing
Recursive Descent Parser
typedef enum { i d e n t , number , l p a r e n , r p a r e n , t i m e s ,
s l a s h , p l u s , minus } Symbol ;
int a c c e p t ( Symbol s ) { if (sym == s ) { n e x t (); return 1; }
return 0;
}
void f a c t o r ( void ) {
if ( a c c e p t ( i d e n t )) ;
else if ( a c c e p t ( number )) ;
else if ( a c c e p t ( l p a r e n )) { e x p r e s s i o n (); e x p e c t ( r p a r e n );}
else { e r r o r ( " factor : syntax error at " , c u r r s y m ); n e x t (); }
}
void term ( void ) {
f a c t o r ();
while ( a c c e p t ( t i m e s ) || a c c e p t ( s l a s h ))
f a c t o r ();
}
void e x p r e s s i o n ( void ) {
term ();
while ( a c c e p t ( p l u s ) || a c c e p t ( minus ))
term ();
}
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing
Each non-terminal realized as a parsing function
Parsing functions calls the right handside functions in
sequence
Rule choices are based on the current input symbol. accept
checks a terminal and consumes if matches.
Cannot handle direct or indirect left recursion. A function has
to call itself before anything else.
Hand coded, not flexible.
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing
LL Parsers
First L is ‘left to right input processing’, second is ‘leftmost
derivation’
Checks next N input symbols to decide on which rule to
apply: LL(N) parsing.
For example LL(1) checks the next input symbol only.
LL(N) parsing table: A table for V × ΣN 7→ R
for expanding a nonterminal NT ∈ V , looking at this table
and the next N input symbols, LL(N) parser chooses the
grammar rule r ∈ R to apply in the next step.
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing
Grammar and lookup table for a LL(1) parser:
1 S →E
2 S → −E a b - (
3 E → N+E S 1 1 2 1
4 E → (E ) E 3 3 4
5 N→a N 5 6
6 N→b
What if we add E → N to grammar?
You need an LL(2) grammar. What if N is recursive?
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing
Bottom-up Parsing
Start from input sentence and merge parts of sentential form
matching RHS of a rule into LHS at each step. Try to reach
the starting non-terminal. reach the input sentence
a = a + b ∗ a 7→ a = hfacti + b ∗ a 7→ a = htermi + b ∗ a 7→
a = hexpr i + b ∗ a 7→ a = hexpr i + hfacti ∗ a 7→
a = hexpr i + htermi ∗ a 7→ a = hexpr i + htermi ∗ hfacti 7→
a = hexpr i + htermi 7→ a = hexpr i 7→ hassigni
Simplest form gives rightmost derivation of a grammar (in
reverse) processing input from left to right.
Shift-reduce parsers are bottom-up:
shift: take a symbol from input and push to stack.
reduce: match and pop a RHS from stack and reduce into
LHS.
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing
Shift-Reduce Parser in Prolog
% Grammar is E - > E - T | E + T | T T -> a | b
r u l e ( e ,[ e , - , t ]).
r u l e ( e ,[ e ,+ , t ]).
r u l e ( e ,[ t ]).
r u l e ( t ,[ a ]).
r u l e ( t ,[ b ]).
p a r s e ([] ,[ S]) : - S = e . % s t a r t i n g symbol alone in the stack
% reduce : find RHS of a rule on stack , reduce it to LHS
p a r s e ( I n p u t , S t a c k ) : - match (LHS, S t a c k , R e m a i n d e r ) ,
p a r s e ( I n p u t ,[LHS| R e m a i n d e r ]).
% shift : n o n t e r m i n a l s are removed from input added on stack
p a r s e ([H| I n p u t ] , S t a c k ) : - member(X,[ a ,b , - ,+]) ,
p a r s e ( I n p u t ,[H| S t a c k ]).
% check if RSH of a rule is a prefix of Stack ( r e v e r s e d ).
match (LHS, L i s t ,L) : - r u l e (LHS,RHS) , r e v e r s e (RHS,NRHS) ,
p r e f i x (NRHS, L i s t ,L ).
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing
Shift reduce parser tries all non-deterministic shift
combinations to get all parses.
Deterministic bottom up parsers: LALR, SLR(1).