----------------------- Page 1-----------------------
Programming Language Syntax
CSE 307 – Principles of Programming Languages
Stony Brook University
http://www.cs.stonybrook.edu/~cse307
----------------------- Page 2-----------------------
Programming Languages Syntax
Computer languages must be precise:
Both their form (syntax) and meaning (semantics) must be specified
without ambiguity, so that both programmers and computers can tell
what a program is supposed to do.
Example: the syntax of Arabic numerals:
A digit “is”: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 (read | as “or”)
A non_zero_digit “is”: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
A natural_number (> 0) “is” a non_zero_digit followed
by other digits (a number that doesn’t start with 0) = the
regular expression “non_zero_digit digit*”
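These rules can be tried directly with an ordinary regex engine; a minimal sketch, assuming Python's re module:

import re
# natural_number = non_zero_digit digit*  (no leading zero)
natural_number = re.compile(r"[1-9][0-9]*")
for s in ["42", "7", "042", "0", ""]:
    print(repr(s), bool(natural_number.fullmatch(s)))
# '42' and '7' match; '042', '0' and '' do not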
Specifying the syntax of programming languages has two parts:
Regular Expressions (RE) and Context-Free Grammars
(c) Paul Fodor (CS Stony Brook) and Elsevier
----------------------- Page 3-----------------------
Regular Expressions
A regular expression is one of the following:
a character
the empty string, denoted by ε
two regular expressions concatenated
E.g., letter letter
two regular expressions separated by | (i.e., or),
E.g., letter ( letter | digit )
a regular expression followed by the Kleene star
(zero or more concatenated copies of the previous item)
E.g., letter ( letter | digit )*
----------------------- Page 4-----------------------
Regular Expressions
RE example: the syntax of numeric constants can be
defined with regular expressions:
A digit “is” 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
A number “is” integer | real
An integer “is” digit digit*
A real “is” integer exponent
| decimal ( exponent | ε )
A decimal “is” digit* ( . digit | digit . ) digit*
An exponent “is” ( e | E ) ( + | - | ε ) integer
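Taken together, these rules translate almost mechanically into one pattern; a sketch in Python's regex dialect (one possible encoding, writing [0-9] for digit):

import re
exponent = r"[eE][+-]?[0-9]+"
decimal  = r"[0-9]*(\.[0-9]|[0-9]\.)[0-9]*"
integer  = r"[0-9]+"
real     = f"({integer}{exponent}|{decimal}({exponent})?)"
number   = re.compile(f"{integer}|{real}")
for s in ["37", "3.14", ".5", "6.", "2e10", "3.0E-2"]:
    print(s, bool(number.fullmatch(s)))   # all six match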
----------------------- Page 5-----------------------
Regular Expressions
Regular expressions work well for defining tokens
They are unable to specify nested constructs
For example, a context-free grammar in BNF
form to define arithmetical expressions is:
expr → id | number | - expr | ( expr ) | expr op expr
op → + | - | * | /
A construct with equal numbers of open and closed
parentheses cannot be represented by a RE
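The gap is easy to demonstrate: checking balanced parentheses needs an unbounded counter (or stack), which no finite automaton has; a small Python sketch:

def balanced(s: str) -> bool:
    # Balanced parentheses: the classic non-regular language.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' with no matching '('
                return False
    return depth == 0          # every '(' must be closed

print(balanced("(()())"), balanced("(()"))   # True False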
----------------------- Page 6-----------------------
Chomsky Hierarchy
Context-Free Languages are strictly more powerful than
Regular Expressions, BUT Regular Expressions are way faster
to recognize, so
Regular Expressions are used to create tokens, the leaves of the
syntax tree, while Context-Free Grammars build the syntax tree
Chomsky Hierarchy:
Type-3: Regular Languages (Regex) – implemented by Finite Automata
(called Lexer, Scanner, Tokenizer)
Type-2: Context-Free Languages - Pushdown Automata (called Parsers)
Type-1: Context-Sensitive Languages
Type-0: Unrestricted Languages - Turing Machines
Types 0 and 1 are not of practical use in defining programming languages
Type 2 is of very restricted practical use (O(N³) in the worst case)
Type 3 is fast (linear time to recognize tokens), but not expressive enough
for most languages
----------------------- Page 7-----------------------
Context-Free Grammars (CFG)
Backus–Naur Form (BNF) notation for CFG:
expr → id | number | - expr | ( expr ) | expr op expr
op → + | - | * | /
Each of the rules in a CFG is known as a production
The symbols on the left-hand sides of the productions are
nonterminals (or variables)
A CFG consists of:
a set of terminals/tokens T (that cannot appear on the
left-hand side of any production)
a set of non-terminals N
a non-terminal start symbol S, and
a set of productions
----------------------- Page 8-----------------------
Context-Free Grammars (CFG)
John Backus was the inventor of Fortran (won the
ACM Turing Award in 1977)
John Backus and Peter Naur used the BNF form for
Algol
Peter Naur also won the ACM Turing Award in 2005 for
Report on the Algorithmic Language ALGOL 60
BNF was named by Donald Knuth
----------------------- Page 9-----------------------
Context-Free Grammars (CFG)
The Kleene star * and meta-level parentheses of regular
expressions do not change the expressive power of the
notation
id_list → id ( , id )*
is shorthand for
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ε
or the left-recursive version
id_list → id
id_list → id_list , id
----------------------- Page 10-----------------------
Context-Free Grammars (CFG)
From RE to BNF notation:
Consider the RE: a*( b a* b )*
Start with a*:
As → a As
   | ε
Same with ( b a* b )*. It is:
S → b As b S
  | ε
Now concatenate them into a single non-terminal:
G → As S
----------------------- Page 11-----------------------
Context-Free Grammars (CFG)
Derivations and Parse Trees: A context-free grammar shows us how to
generate a syntactically valid string of terminals
1. Begin with the start symbol
2. Choose a production with the start symbol on the left-hand side;
replace the start symbol with the right-hand side of that production
3. Now choose a nonterminal A in the resulting string, choose a
production P with A on its left-hand side, and replace A with the
right-hand side of P
Repeat this process until no non-terminals remain
The replacement strategy named right-most derivation chooses
at each step to replace the right-most nonterminal with the right-
hand side of some production
o There are many other possible derivations, including left-most
and options in between.
----------------------- Page 12-----------------------
Context-Free Grammars (CFG)
Example: we can use our grammar for expressions to generate the
string “slope * x + intercept”.
Grammar:
expr → id | number | - expr | ( expr ) | expr op expr
op → + | - | * | /
Derivation:
expr ⇒ expr op expr
⇒ expr op id
⇒ expr + id
⇒ expr op expr + id
⇒ expr op id + id
⇒ expr * id + id
⇒ id * id + id
⇒ id(slope) * id(x) + id(intercept)
Notes: The ⇒ metasymbol is often pronounced “derives”
A series of replacement operations that shows how to derive a string
of terminals from the start symbol is called a derivation
Each string of symbols along the way is called a sentential form
The final sentential form, consisting of only terminals, is called
the yield of the derivation
----------------------- Page 13-----------------------
Derivations and Parse Trees
We can represent a derivation graphically as a parse tree
The root of the parse tree is the start symbol of the
grammar
The leaves are its yield
Each node with its
children represent a
production
E.g., the parse tree for the
expression grammar for
3 + 4 * 5 is:
----------------------- Page 14-----------------------
Derivations and Parse Trees
The example grammar is ambiguous (it can generate
multiple parse trees for 3 + 4 * 5): one corresponds to
3 + (4 * 5) and one corresponds to (3 + 4) * 5
Grammar:
expr → id | number
| - expr | ( expr )
| expr op expr
op → + | - | * | /
----------------------- Page 15-----------------------
Context free grammars
A better version of our expression grammar should include
precedence and associativity:
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | - factor | ( expr )
add_op → + | -
mult_op → * | /
----------------------- Page 16-----------------------
Context free grammars
Parse tree for the expression grammar for 10 - 4 - 3
(subtraction has left associativity):
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | - factor | ( expr )
add_op → + | -
mult_op → * | /
----------------------- Page 17-----------------------
Scanning
The scanner and parser for a programming language are
responsible for discovering the syntactic structure of a
program (i.e., the syntax analysis)
The scanner/lexer is responsible for
tokenizing source
removing comments
(often) dealing with pragmas (i.e., significant
comments)
saving text of identifiers, numbers, strings
saving source locations (file, line, column) for error
messages
----------------------- Page 18-----------------------
Scanning
The Scanner turns a program into a string of tokens
It matches regular expressions (usually written in Perl-style
regex) against the program and creates a list of tokens
There are two common syntaxes for regular expressions:
Perl-style regex and EBNF
Scanners tend to be built three ways:
Writing / Generating a finite automaton from REs
Scanner code (usually realized as nested if/case statements)
Table-driven DFA
Writing / Generating a finite automaton generally yields the
fastest, most compact code by doing lots of special-purpose things,
although good automatically-generated scanners come very close
----------------------- Page 19-----------------------
Scanning
Construction of an NFA equivalent to a given regular
expression: cases
----------------------- Page 20-----------------------
Scanning
Construction of an NFA equivalent to a given regular
expression: cases
----------------------- Page 21-----------------------
Scanning
Construction of an NFA equivalent to the regular
expression d* ( .d | d. ) d*
----------------------- Page 22-----------------------
Scanning
Construction of an NFA equivalent to the regular
expression d* ( .d | d. ) d*
----------------------- Page 23-----------------------
Scanning
From an NFA to a DFA:
Reason: With no way to “guess” the right transition to take
from any given state, any practical implementation of an
NFA would need to explore all possible transitions
concurrently or via backtracking
We can instead build a DFA from that NFA:
The state of the DFA after reading any input will be the set
of states that the NFA might have reached on the same input
Our example: Initially, before it consumes any input, the NFA
may be in State 1, or it may make epsilon transitions to
States 2, 4, 5, or 8
o We thus create an initial State A for our DFA to represent
this set: 1, 2, 4, 5, 8
----------------------- Page 24-----------------------
Scanning
On an input of d, our NFA may move from State 2 to State 3,
or from State 8 to State 9.
It has no other transitions on this input from any of the states
in A.
From State 3, however, the NFA may make epsilon transitions
to any of States 2, 4, 5, or 8.
We therefore create DFA State B: 2, 3, 4, 5, 8, 9
On a ., our NFA may move from State 5 to State 6
There are no other transitions on this input from any of the
states in A, and there are no epsilon transitions out of State 6.
We therefore create the singleton DFA State C: 6
We continue the process until we find all the states and
transitions in the DFA (it is a finite process – Why?)
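The construction just described is the classic subset construction; a compact sketch, assuming the NFA is given as a dict mapping (state, symbol) pairs to sets of states, with '' for epsilon (the representation is illustrative, not from the slides):

def epsilon_closure(states, delta):
    # All NFA states reachable from `states` via epsilon moves alone.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ''), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(start, delta, alphabet):
    # Each DFA state is a set of NFA states. The process is finite
    # because an n-state NFA has at most 2**n distinct subsets.
    start_set = epsilon_closure({start}, delta)
    dfa, work = {}, [start_set]
    while work:
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            if moved:
                T = epsilon_closure(moved, delta)
                dfa[S][a] = T
                work.append(T)
    return start_set, dfa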
----------------------- Page 25-----------------------
Scanning
The DFA equivalent to our previous NFA:
----------------------- Page 26-----------------------
Scanner code (usually realized as nested if/case statements)
Suppose we are building an ad-hoc (hand-written) scanner for a
Calculator:
assign → :=
plus → +
minus → -
times → *
div → /
lparen → (
rparen → )
id → letter ( letter | digit )*
number → digit digit*
| digit* ( . digit | digit . ) digit*
comment → /* ( non-* | * non-/ )* */
| // ( non-newline )* newline
----------------------- Page 27-----------------------
Scanning
We read the characters one at a time with look-ahead
skip any initial white space (spaces, tabs, and newlines)
if cur_char ∈ {‘(’, ‘)’, ‘+’, ‘-’, ‘*’}
return the corresponding single-character token
if cur_char = ‘:’
read the next character
if it is ‘=’ then return assign else announce an error
if cur_char = ‘/’
peek at the next character
if it is ‘*’ or ‘/’
read additional characters until “*/” or newline
is seen, respectively
jump back to top of code
else return div
----------------------- Page 28-----------------------
Scanning
if cur_char = ‘.’
read the next character
if it is a digit
read any additional digits
return number
else announce an error
if cur_char is a digit
read any additional digits and at most one decimal point
return number
if cur_char is a letter
read any additional letters and digits
check to see whether the resulting string is read or
write
if so then return the corresponding token
else return id
else announce an error
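This pseudocode maps almost line for line onto real code; a sketch of just the identifier/keyword arm in Python (token names here are hypothetical):

def scan_word(src: str, i: int):
    # src[i] is known to be a letter: read letters and digits,
    # then check for the keywords read and write.
    j = i
    while j < len(src) and src[j].isalnum():
        j += 1
    text = src[i:j]
    if text in ('read', 'write'):
        return (text.upper(), text), j   # keyword token, e.g. ('READ', 'read')
    return ('ID', text), j               # ordinary identifier

print(scan_word("read A", 0))    # (('READ', 'read'), 4)
print(scan_word("sum := 1", 0))  # (('ID', 'sum'), 3)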
----------------------- Page 29-----------------------
Scanning
Pictorial representation of a scanner for calculator
tokens, in the form of a finite automaton
----------------------- Page 30-----------------------
Scanning
We run the machine over and over to get one
token after another
Nearly universal rule:
always take the longest possible token from the input
thus foobar is foobar and never f or foo or foob
more to the point, 3.14159 is a real constant and
never 3, ., and 14159
----------------------- Page 31-----------------------
Scanning
The rule about longest-possible tokens means you return only
when the next character can't be used to continue the current
token
the next character will generally need to be saved for the
next token
In some cases, you may need to peek at more than one
character of look-ahead in order to know whether to
proceed
In Pascal, for example, when you have a 3 and you see a dot,
do you proceed (in hopes of getting 3.14)? or
do you stop (in fear of getting 3..5)? (the declaration of
arrays in Pascal, e.g., “array [1..6] of Integer”)
----------------------- Page 32-----------------------
Scanning
Writing a pure DFA as a set of nested case
statements is a surprisingly useful
programming technique
used in tools such as perl, awk, and sed
Table-driven DFA is what lex and scangen produce
lex (flex) in the form of C code
scangen in the form of numeric tables
and a separate driver
----------------------- Page 33-----------------------
Perl-style Regexp
Learning by examples:
abcd - concatenation
a(b|c)d - grouping
a(b|c)*d - a number of repeats can be applied to a char or group
? = 0-1
* = 0-inf
+ = 1-inf
[bc] - character class
[a-zA-Z0-9_] - ranges
. - matches any character
[a-zA-Z] - alpha
\d - numeric
\w - word (alpha, num, _)
\s - whitespace
----------------------- Page 34-----------------------
Perl-style Regexp
Learning by examples:
How do we write a regexp that matches
floats?
digit* ( . digit | digit . ) digit*
\d*(\.\d|\d\.)\d*
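Assuming Python's re module, the escaped pattern can be checked directly:

import re
float_re = re.compile(r"\d*(\.\d|\d\.)\d*")
for s in ["3.14", ".5", "7.", "42", "."]:
    print(s, bool(float_re.fullmatch(s)))
# 3.14, .5 and 7. match; 42 and a lone . do not, because the
# pattern insists on a dot with a digit on at least one side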
----------------------- Page 35-----------------------
Parsing
The parser calls the scanner to get the tokens,
assembles the tokens together into a syntax
tree, and passes the tree (perhaps one
subroutine at a time) to the later phases of the
compiler (this process is called syntax-directed
translation).
Most use a context-free grammar (CFG)
----------------------- Page 36-----------------------
Parsing
It turns out that for any CFG we can create
a parser that runs in O(n³) time (e.g.,
Earley’s algorithm and the Cocke-Younger-
Kasami (CYK) algorithm)
O(n³) time is clearly unacceptable for a
parser in a compiler - too slow even for a
program of 100 tokens (~1,000,000
cycles)
----------------------- Page 37-----------------------
Parsing
Fortunately, there are large classes of grammars
for which we can build parsers that run in linear
time
The two most important classes are called
LL and LR
LL stands for Left-to-right, Leftmost derivation
Leftmost derivation - work on the left side of the parse tree
LR stands for Left-to-right, Rightmost derivation
Rightmost derivation - work on the right side of the tree
LL parsers are also called 'top-down', or 'predictive' parsers
LR parsers are also called 'bottom-up', or 'shift-reduce' parsers
----------------------- Page 38-----------------------
Top-down parsing (LL)
Consider a grammar for a comma
separated list of identifiers,
terminated by a semicolon:
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
• The top-down construction of a
parse tree for the string “A, B,
C;” starts from the root, applies
productions, and tries to identify
the nodes.
----------------------- Page 39-----------------------
Bottom-up parsing (LR)
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
- The bottom-up construction of a
parse tree for the same string:
“A, B, C;”
- The parser finds the left-most
leaf of the tree is an id. The next
leaf is a comma. The parser
continues in this fashion,
shifting new leaves from the
scanner into a forest of partially
completed parse tree fragments.
----------------------- Page 40-----------------------
Bottom-up parsing (LR)
- The bottom-up construction realizes
that some of those fragments
constitute a complete right-hand
side.
- In this grammar, that occurs when
the parser has seen the semicolon—
the right-hand side of id_list_tail.
With this right-hand side in hand,
the parser reduces the semicolon to
an id_list_tail.
- It then reduces ", id id_list_tail"
into another id_list_tail.
- After doing this one more time it is
able to reduce "id id_list_tail" into
the root of the parse tree, id_list.
----------------------- Page 41-----------------------
Parsing
The number in LL(1), LL(2), …, indicates how many
tokens of look-ahead are required in order to parse
Almost all real compilers use one token of look-
ahead
LL grammar requirements:
no left recursion
no common prefixes
Every LL(1) grammar is also LR(1), though right
recursion in productions tends to require very deep
stacks and complicates semantic analysis
----------------------- Page 42-----------------------
An LL(1) grammar
program → stmt_list $$ (end of file)
stmt_list → stmt stmt_list
| ε
stmt → id := expr
| read id
| write expr
expr → term term_tail
term_tail → add_op term term_tail
| ε
term → factor fact_tail
fact_tail → mult_op factor fact_tail
| ε
factor → ( expr )
| id
| number
add_op → +
| -
mult_op → *
| /
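Because the grammar is LL(1), each nonterminal can become one procedure that chooses its production from a single token of look-ahead. A minimal recursive-descent sketch of the statement and expression rules (the token list and helpers are illustrative, not from the slides):

tokens, pos = ['id', ':=', 'id', '+', 'number', '$$'], 0  # e.g. sum := A + 2

def peek():
    return tokens[pos]

def match(kind):
    global pos
    if tokens[pos] != kind:
        raise SyntaxError(f"expected {kind}, got {tokens[pos]}")
    pos += 1

def stmt():              # stmt -> id := expr | read id | write expr
    if peek() == 'read':
        match('read'); match('id')
    elif peek() == 'write':
        match('write'); expr()
    else:
        match('id'); match(':='); expr()

def expr():              # expr -> term term_tail
    term(); term_tail()

def term_tail():         # term_tail -> add_op term term_tail | epsilon
    if peek() in ('+', '-'):
        match(peek()); term(); term_tail()

def term():              # term -> factor fact_tail
    factor(); fact_tail()

def fact_tail():         # fact_tail -> mult_op factor fact_tail | epsilon
    if peek() in ('*', '/'):
        match(peek()); factor(); fact_tail()

def factor():            # factor -> ( expr ) | id | number
    if peek() == '(':
        match('('); expr(); match(')')
    elif peek() == 'number':
        match('number')
    else:
        match('id')

stmt(); match('$$')      # accepts the token stream above
print("accepted")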
----------------------- Page 43-----------------------
LL Parsing
This grammar captures associativity and precedence,
but most people don't find it as pretty
for one thing, the operands of a given operator aren't
in a Right Hand Side (RHS) together!
however, the simplicity of the parsing algorithm
makes up for this weakness
The first parsers were LL
How do we parse a string with this grammar?
by building the parse tree incrementally
----------------------- Page 44-----------------------
LL Parsing
Example (the average program):
read A
read B
sum := A + B
write sum
write sum / 2 $$
We keep a stack of non-terminals with the start
symbol inserted
We start at the top and predict needed productions on
the basis of the current "left-most" non-terminal in
the tree and the current input token
----------------------- Page 45-----------------------
LL Parsing
Table-driven LL parsing: you have a big loop in
which you repeatedly look up an action in a
two-dimensional table based on current leftmost
non-terminal and current input token
The actions are:
(1) match a terminal
(2) predict a production
OR
(3) announce a syntax error
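In outline the driver is nothing more than a stack and a table lookup. A toy sketch for the two-production grammar S → ( S ) S | ε (not the calculator table):

table = {
    ('S', '('): ['(', 'S', ')', 'S'],   # predict S -> ( S ) S
    ('S', ')'): [],                     # predict S -> epsilon
    ('S', '$'): [],
}

def ll_parse(text):
    stack, inp = ['S'], list(text) + ['$']
    while stack:
        top = stack.pop()
        if top in ('(', ')'):           # (1) match a terminal
            if inp[0] != top:
                return False            # (3) syntax error
            inp.pop(0)
        else:                           # (2) predict a production
            rhs = table.get((top, inp[0]))
            if rhs is None:
                return False            # (3) syntax error
            stack.extend(reversed(rhs))
    return inp == ['$']

print(ll_parse("(())()"), ll_parse("(()"))   # True False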
----------------------- Page 46-----------------------
LL Parsing
First, unfold the
production rules to
collect for each
production the
possible tokens that
could start it
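This step computes what is conventionally called the FIRST set of each nonterminal. A fixed-point sketch for a grammar without ε-productions (the full algorithm must also handle ε and FOLLOW sets):

def first_sets(grammar):
    # grammar: nonterminal -> list of right-hand sides (tuples of symbols)
    first = {A: set() for A in grammar}
    changed = True
    while changed:               # iterate until nothing new is added
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                X = rhs[0]       # without epsilon, only the first symbol matters
                add = first[X] if X in grammar else {X}
                if not add <= first[A]:
                    first[A] |= add
                    changed = True
    return first

g = {'expr':   [('term', 'term_tail')],
     'term':   [('factor', 'fact_tail')],
     'factor': [('(', 'expr', ')'), ('id',), ('number',)]}
print(sorted(first_sets(g)['expr']))   # ['(', 'id', 'number']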
----------------------- Page 47-----------------------
LL Parsing
Construct the prediction table:
for each possible input token and
the left-most nonterminal, what is
the possible production rule that
will be used?
The non-terminal will be "used",
while the RHS of the production is
added to the stack.
----------------------- Page 48-----------------------
LL Parsing
LL(1) parse table for the
calculator language
read A
read B
sum := A + B
write sum
write sum / 2 $$
----------------------- Page 49-----------------------
----------------------- Page 50-----------------------
----------------------- Page 51-----------------------
Parse tree for the average program
----------------------- Page 52-----------------------
LL Parsing
Problems trying to make a grammar LL(1)
left recursion
example:
id_list → id_list , id
id_list → id
we can get rid of all left recursion
mechanically in any grammar
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ε
----------------------- Page 53-----------------------
LL Parsing
Problems trying to make a grammar LL(1)
common prefixes
example:
stmt → id := expr
| id ( arg_list )
we can eliminate common prefixes mechanically:
“left-factoring”
stmt → id id_stmt_tail
id_stmt_tail → := expr
| ( arg_list )
----------------------- Page 54-----------------------
LL Parsing
Eliminating left recursion and common prefixes still does NOT
make a grammar LL
there are infinitely many non-LL LANGUAGES, and the
mechanical transformations work on them just fine
Problems trying to make a grammar LL(1)
the"dangling else" problem prevents grammars from being
LL(1) (or in fact LL(k) for any k)
the following natural (Pascal) grammar fragment is ambiguous:
stmt → if cond then_clause else_clause
| other_stuff
then_clause → then stmt
else_clause → else stmt | ε
Example String: “if C1 then if C2 then S1 else S2”
Ambiguity: the else can be paired with either then!
----------------------- Page 55-----------------------
LL Parsing
Desired effect: pair the else with the nearest then.
The less natural grammar fragment:
stmt → balanced_stmt | unbalanced_stmt
balanced_stmt → if cond then balanced_stmt
else balanced_stmt
| other_stuff
unbalanced_stmt → if cond then stmt
| if cond then balanced_stmt
else unbalanced_stmt
A balanced_stmt is one with the same number of
thens and elses.
An unbalanced_stmt has more thens.
----------------------- Page 56-----------------------
LL Parsing
The usual approach, whether top-down OR
bottom-up, is to use the ambiguous grammar
together with a disambiguating rule that says:
else goes with the closest then or
more generally, the first of two possible
productions is the one to predict (or reduce)
stmt → if cond then_clause else_clause
| other_stuff
then_clause → then stmt
else_clause → else stmt | ε
----------------------- Page 57-----------------------
LL Parsing
Better yet, languages (since Pascal) generally employ
explicit end-markers, which eliminate this
problem.
In Modula-2, for example, one says:
if A = B then
if C = D then E := F end
else
G := H
end
Ada says 'end if'; other languages say 'fi'
----------------------- Page 58-----------------------
LL Parsing
One problem with end markers is that they tend to bunch up.
In Pascal you say
if A = B then …
else if A = C then …
else if A = D then …
else if A = E then …
else ...;
With end markers this becomes
if A = B then …
else if A = C then …
else if A = D then …
else if A = E then …
else ...;
end; end; end; end; end; end; …
----------------------- Page 59-----------------------
LR Parsing
LR parsers are almost always table-driven:
like a table-driven LL parser, an LR parser uses a
big loop in which it repeatedly inspects a two-
dimensional table to find out what action to take
unlike the LL parser, however, the LR driver has
non-trivial state (like a DFA), and the table is
indexed by current input token and current
state
also the stack contains a record of what has been
seen SO FAR (NOT what is expected)
----------------------- Page 60-----------------------
LR Parsing
LR keeps the roots of its partially
completed subtrees on a stack
When it accepts a new token from the scanner, it
shifts the token into the stack
When it recognizes that the top few symbols on
the stack constitute a right-hand side, it reduces
those symbols to their left-hand side by popping
them off the stack and pushing the left-hand side
in their place
----------------------- Page 61-----------------------
LR Parsing
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
Rightmost (canonical) derivation for the
identifiers grammar:
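The figure is missing here, but the derivation follows from the grammar; for “A, B, C;” the rightmost (canonical) derivation is:

id_list ⇒ id id_list_tail
        ⇒ id , id id_list_tail
        ⇒ id , id , id id_list_tail
        ⇒ id , id , id ;

A bottom-up parser traces exactly these steps in reverse, reducing the ; first and ending at id_list.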
----------------------- Page 62-----------------------
LR Parsing
LR(1) grammar for the calculator language:
----------------------- Page 63-----------------------
LR Parsing
Example (the average program):
read A
read B
sum := A + B
write sum
write sum / 2 $$
----------------------- Page 64-----------------------
LR Parsing
When we begin execution, the parse stack is
empty and we are at the beginning of the
production for program:
program → . stmt_list $$
When augmented with a ., a production is called an LR item
This original item (program → . stmt_list $$)
is called the basis of the list.
----------------------- Page 65-----------------------
LR Parsing
Since the . in this item is immediately in front of
a nonterminal—namely stmt_list—we may
be about to see the yield of that nonterminal
coming up on the input.
program → . stmt_list $$
stmt_list → . stmt_list stmt
stmt_list → . stmt
----------------------- Page 66-----------------------
LR Parsing
Since stmt is a nonterminal, we may also be at
the beginning of any production whose left-hand
side is stmt:
program → . stmt_list $$
stmt_list → . stmt_list stmt
stmt_list → . stmt
stmt → . id := expr
stmt → . read id
stmt → . write expr
The items added to the basis constitute its closure.
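Computing the closure is mechanical; a sketch that represents an item A → α . β as a triple (A, rhs, dot):

def closure(items, grammar):
    # If the dot stands before a nonterminal B, add every item B -> . gamma.
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

g = {'stmt_list': [('stmt_list', 'stmt'), ('stmt',)],
     'stmt': [('id', ':=', 'expr'), ('read', 'id'), ('write', 'expr')]}
basis = {('program', ('stmt_list', '$$'), 0)}
print(len(closure(basis, g)))   # 6 items, matching the state above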
----------------------- Page 67-----------------------
LR Parsing
Our upcoming token is a read
Once we shift it onto the stack, we know we are
in the following state:
stmt → read . id
This state has a single basis item and an empty closure—the .
precedes a terminal.
After shifting the A, we have:
stmt → read id .
----------------------- Page 68-----------------------
LR Parsing
We now know that read id is the handle,
and we must reduce.
The reduction pops two symbols off the parse
stack and pushes a stmt in their place
Since one of the items in State 0 was
stmt_list → . stmt
we now have
stmt_list → stmt .
Again we must reduce: remove the stmt from
the stack and push a stmt_list in its place.
----------------------- Page 69-----------------------
LR Parsing
Our new state:
program → stmt_list . $$
stmt_list → stmt_list . stmt
stmt → . id := expr
stmt → . read id
stmt → . write expr
----------------------- Page 70-----------------------
----------------------- Page 71-----------------------
----------------------- Page 72-----------------------
----------------------- Page 73-----------------------
----------------------- Page 74-----------------------
Table entries indicate whether to shift (s), reduce (r), or shift and
then reduce (b). The accompanying number is the new state
when shifting, or the production that has been recognized when
(shifting and) reducing
----------------------- Page 75-----------------------
Driver for a table-driven LR(1) parser
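The driver shown in the missing figure is a short loop; a schematic sketch, assuming action and goto dictionaries built by a parser generator (table construction itself is omitted):

def lr_parse(tokens, action, goto):
    # Generic LR(1) driver: the stack records states (what has been
    # seen so far), not predictions.
    stack = [0]                               # start state
    tokens = list(tokens) + ['$$']
    i = 0
    while True:
        act = action.get((stack[-1], tokens[i]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[i]!r}")
        if act[0] == 'shift':
            stack.append(act[1])              # push the new state
            i += 1                            # consume the token
        elif act[0] == 'reduce':
            lhs, rhs_len = act[1], act[2]
            del stack[len(stack) - rhs_len:]  # pop one state per RHS symbol
            stack.append(goto[(stack[-1], lhs)])
        else:                                 # 'accept'
            return True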
----------------------- Page 76-----------------------
----------------------- Page 77-----------------------
Parsing summary
A scanner is a DFA
it can be specified with a state diagram
An LL or LR parser is a PDA (pushdown automaton)
a PDA can be specified with a state diagram and a
stack
the state diagram looks just like a DFA state diagram,
except the arcs are labeled with <input symbol,
top-of-stack symbol> pairs, and in addition to
moving to a new state the PDA has the option of pushing
or popping a finite number of symbols onto/off the stack
Earley's algorithm does NOT use PDAs, but dynamic
programming
----------------------- Page 78-----------------------
Actions
We can run actions when a rule triggers:
Often used to construct an AST for a
compiler.
For simple languages, can interpret code
directly
We can use actions to fix the Top-Down
Parsing problems
----------------------- Page 79-----------------------
Programming
A compiler-compiler (or parser generator, compiler generator) is a
programming tool that creates a parser, interpreter, or
compiler from some form of formal description of a language
and machine
the input is a grammar (usually in BNF) of a programming
language
the generated output is the source code of a parser
Examples of parser generators:
classical parsing tools: lex, Yacc, bison, flex, ANTLR
PLY: python implementation of lex and yacc
Python TPG parser
ANTLR for python
----------------------- Page 80-----------------------
Classic Parsing Tools
lex - original UNIX Lexical analysis (tokenizing) generator
creates a C function that will parse input according to a set
of regular expressions
yacc - Yet Another Compiler Compiler (parsing)
generate a C program for a parser from BNF rules
bison and flex ("fast lex") - more powerful, free versions of
yacc and lex, from GNU Software Fnd'n.
[Diagram: Lexical Rules → Lex or Flex → yylex(); Grammar Rules →
Yacc or Bison → yyparse(); Input → yylex() → yyparse() → Parsed Input]
----------------------- Page 81-----------------------
Classic Parsing Tools
Lex and Yacc generate C code for your analyzer & parser
Lexical Rules Grammar Rules
Lex or Flex Yacc or Bison
Parsed
Input yylex() token yyparse()
Input
char
C code stream C code
stream
Lexical Analyzer Parser
(Tokenizer)
----------------------- Page 82-----------------------
Lex and Yacc: the big picture
----------------------- Page 83-----------------------
Lex Example
/* lexer.l */
%{
#include "header.h"
int lineno = 1;
%}
%%
[ \t]* ; /* Ignore whitespace */
\n { lineno++; }
[0-9]+ { yylval.val = atoi(yytext);
         return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]* { yylval.name = strdup(yytext);
                         return ID; }
\+ { return PLUS; }
- { return MINUS; }
\* { return TIMES; }
\/ { return DIVIDE; }
= { return EQUALS; }
%%
----------------------- Page 84-----------------------
Yacc Example
/* parser.y */
%{
#include "header.h"
%}
%union {
    char *name;
    int val;
}
%token PLUS MINUS TIMES DIVIDE EQUALS
%token<name> ID;
%token<val> NUMBER;
%%
start : ID EQUALS expr;
expr : expr PLUS term
     | expr MINUS term
     | term
...
----------------------- Page 85-----------------------
Bison Overview
The programmer puts BNF rules and token rules for the parser
he wants in a bison source file myparser.y (BNF rules and
actions for your grammar), then runs bison to create a C
program (*.tab.c) containing a parser function. The programmer
must also supply a tokenizer named yylex().

> bison myparser.y
(produces myparser.tab.c, the parser source code; yylex.c holds
the tokenizer function in C)
> gcc -o myprog myparser.tab.c yylex.c
(produces myprog, the executable program)
----------------------- Page 86-----------------------
PLY
PLY: Python Lex-Yacc = an implementation of
lex and yacc parsing tools for Python by David
Beazley: http://www.dabeaz.com/ply/
A bit of history:
Yacc : ~1973. Stephen Johnson (AT&T)
Lex : ~1974. Eric Schmidt and Mike Lesk (AT&T)
PLY: 2001
----------------------- Page 87-----------------------
PLY
PLY is not a code generator
PLY consists of two Python modules
ply.lex = A module for writing lexers
Tokens specified using regular expressions
Provides functions for reading input text
ply.yacc = A module for writing grammars
You simply import the modules to use them
The grammar must be in a file
----------------------- Page 88-----------------------
PLY
ply.lex example:
import ply.lex as lex
tokens = [ 'NAME', 'NUMBER', 'PLUS', 'MINUS', 'TIMES',
           'DIVIDE', 'EQUALS' ]
t_ignore = ' \t'
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
lex.lex()  # Build the lexer
----------------------- Page 89-----------------------
PLY
Two functions: input() and token()
lex.lex()  # Build the lexer
...
lex.input("x = 3 * 4 + 5 * 6")
while True:
    tok = lex.token()
    if not tok: break
    # Use token
----------------------- Page 90-----------------------
PLY
import ply.yacc as yacc
import mylexer  # Import lexer information
tokens = mylexer.tokens  # Need token list
def p_assign(p):
    '''assign : NAME EQUALS expr'''
def p_expr(p):
    '''expr : expr PLUS term
            | expr MINUS term
            | term'''
def p_term(p):
    '''term : term TIMES factor
            | term DIVIDE factor
            | factor'''
def p_factor(p):
    '''factor : NUMBER'''
yacc.yacc()  # Build the parser
data = "x = 3*4+5*6"
yacc.parse(data)  # Parse some text
----------------------- Page 91-----------------------
PLY
Parameter p contains grammar symbol values
def p_factor(p):
    'factor : NUMBER'
    p[0] = p[1]
PLY does Bottom-up parsing
Rule functions process values on right hand side of grammar rule
Result is then stored in left hand side
Results propagate up through the grammar
----------------------- Page 92-----------------------
PLY Calculator Example
----------------------- Page 93-----------------------
Build a parse tree using tuples
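The listing is a figure in the original, but the technique is easy to reconstruct: each rule function packs its children into a tuple. A self-contained sketch in that spirit (a simplified grammar, not the slide's exact code):

import ply.lex as lex
import ply.yacc as yacc

tokens = ['NUMBER', 'PLUS', 'TIMES']
t_PLUS, t_TIMES = r'\+', r'\*'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

precedence = (('left', 'PLUS'), ('left', 'TIMES'))

def p_expr_binop(p):
    '''expr : expr PLUS expr
            | expr TIMES expr'''
    p[0] = (p[2], p[1], p[3])   # tuple node: (operator, left, right)

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    print("syntax error")

lex.lex()
parser = yacc.yacc()
print(parser.parse("3 + 4*5"))  # ('+', 3, ('*', 4, 5))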
----------------------- Page 94-----------------------
----------------------- Page 95-----------------------
PLY Precedence Specifiers
Precedence specifiers (highest precedence at the bottom):
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('nonassoc', 'UMINUS'),
)
def p_expr_uminus(p):
    'expr : MINUS expr %prec UMINUS'
    p[0] = -p[2]   # p[1] is the MINUS token itself
...
----------------------- Page 96-----------------------
PLY Best Documentation
Google Mailing list/group:
http://groups.google.com/group/ply-hack
----------------------- Page 97-----------------------
TPG
TPG is a lexical and syntactic parser generator
for Python
YACC is too complex to use in simple cases
(calculators, configuration files, small
programming languages, …)
You can also add Python code directly into
grammar rules and build abstract syntax trees
while parsing
----------------------- Page 98-----------------------
Python TPG Lexer
Toy Parser Generator (TPG): http://cdsoft.fr/tpg
Syntax:
token <name> <regex> <function> ;
separator <name> <regex>;
Example:
token integer '\d+' int;
token float '\d+\.\d*|\.\d+' float;
token lbrace '{';
separator space '\s+';
----------------------- Page 99-----------------------
Python TPG Lexer
Embed TPG in Python:
import tpg
class Calc:
r"""
separator spaces: '\s+' ;
token number: '\d+' ;
token add: '[+-]' ;
token mul : '[*/]' ;
"""
Try it in Python: download TPG from
http://cdsoft.fr/tpg
----------------------- Page 100-----------------------
TPG example
Defining the grammar:
Non-terminal productions:
START -> Expr ;
Expr -> Term ( add Term )* ;
Term -> Fact ( mul Fact )* ;
Fact -> number | '\(' Expr '\)' ;
----------------------- Page 101-----------------------
TPG example
import tpg
class Calc:
r"""
separator spaces: '\s+' ;
token number: '\d+' ;
token add: '[+-]' ;
token mul : '[*/]' ;
START -> Expr ;
Expr -> Term ( add Term )* ;
Term -> Fact ( mul Fact )* ;
Fact -> number | '\(' Expr '\)' ;
101 """
----------------------- Page 102-----------------------
TPG example
Reading the input and returning values:
separator spaces: '\s+' ;
token number : '\d+' int ;
token add: '[+-]' make_op ;
token mul : '[*/]' make_op ;
Transform tokens into defined operations:
def make_op (s):
return {
'+': lambda x,y: x+y,
'-': lambda x,y: x-y,
'*': lambda x,y: x*y,
'/': lambda x,y: x/y,
}[s]
----------------------- Page 103-----------------------
TPG example
After a terminal symbol is recognized we will store it
in a Python variable: for example to save a number in
a variable n: number/n.
Include Python code example:
Expr/t -> Term/t ( add/op Term/f $t=op(t,f)$ )* ;
Term/f -> Fact/f ( mul/op Fact/a $f=op(f,a)$ )* ;
Fact/a -> number/a | '\(' Expr/a '\)' ;
----------------------- Page 104-----------------------
import math # Simple calculator calc.py
import operator
import string
import tpg
def make_op (s):
return {
'+': lambda x,y : x+y,
'-': lambda x,y : x-y,
'*': lambda x,y : x*y,
'/': lambda x,y : x/y,
}[s]
class Calc (tpg.Parser):
r"""
separator spaces: '\s+' ;
token number: '\d+' int ;
token add: '[+-]' make_op ;
token mul : '[*/]' make_op ;
START/e -> Term/e ;
----------------------- Page 105-----------------------
Term/t -> Fact/t ( add/op Fact/f $ t = op(t,f) $ )* ;
Fact/f -> Atom/f ( mul/op Atom/a $ f = op(f,a) $ )* ;
Atom/a -> number/a | '\(' Term/a '\)' ;
"""
calc = Calc ()
if tpg.__python__ == 3:
operator.div = operator.truediv
raw_input = input
expr = raw_input ('Enter an expression: ')
print(expr, '=', calc (expr))
----------------------- Page 106-----------------------
#!/usr/bin/env python
# Larger example: scientific_calc.py
import math
import operator
import string
import tpg
if tpg.__python__ == 3:
operator.div = operator.truediv
raw_input = input
def make_op(op):
return {
'+' : operator.add,
'-' : operator.sub,
'*' : operator.mul,
'/' : operator.div,
'%' : operator.mod,
'^' : lambda x,y:x**y,
'**' : lambda x,y:x**y,
'cos' : math.cos,
'sin' : math.sin,
'tan' : math.tan,
'acos': math.acos,
----------------------- Page 107-----------------------
'asin': math.asin,
'atan': math.atan,
'sqr' : lambda x:x*x,
'sqrt': math.sqrt,
'abs' : abs,
'norm': lambda x,y:math.sqrt(x*x+y*y),
}[op]
class Calc(tpg.Parser, dict):
r"""
separator space '\s+' ;
token pow_op '\^|\*\*' $ make_op
token add_op '[+-]' $ make_op
token mul_op '[*/%]' $ make_op
token funct1 '(cos|sin|tan|acos|asin|atan|sqr|sqrt|abs)\b' $ make_op
token funct2 '(norm)\b' $ make_op
token real '(\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|\d+[eE][-+]?\d+'
$ float
token integer '\d+' $ int
token VarId '[a-zA-Z_]\w*'
----------------------- Page 108-----------------------
START/e ->
    'vars' $ e=self.mem()
    | VarId/v '=' Expr/e $ self[v]=e
    | Expr/e
;
Var/$self.get(v,0)$ -> VarId/v ;
Expr/e -> Term/e ( add_op/op Term/t $ e=op(e,t) $ )* ;
Term/t -> Fact/t ( mul_op/op Fact/f $ t=op(t,f) $ )* ;
Fact/f ->
    add_op/op Fact/f $ f=op(0,f)
    | Pow/f
;
Pow/f -> Atom/f ( pow_op/op Fact/e $ f=op(f,e) $ )? ;
----------------------- Page 109-----------------------
Atom/a ->
real/a
| integer/a
| Function/a
| Var/a
| '\(' Expr/a '\)'
Function/y ->
funct1/f '\(' Expr/x '\)' $ y = f(x)
| funct2/f '\(' Expr/x1 ',' Expr/x2 '\)' $ y = f(x1,x2)
"""
def mem(self):
vars = sorted(self.items())
memory = [ "%s = %s"%(var, val) for (var, val) in vars ]
return "\n\t" + "\n\t".join(memory)
----------------------- Page 110-----------------------
print("Calc (TPG example)")
calc = Calc()
while 1:
l = raw_input("\n:")
if l:
try:
print(calc(l))
except Exception:
print(tpg.exc())
else:
break
----------------------- Page 111-----------------------
AntLR
ANother Tool for Language Recognition is an LL(k)
parser and translator generator tool
which can create
lexers
parsers
abstract syntax trees (ASTs)
in which you describe the language grammatically
and in return receive a program that can recognize and
translate that language
----------------------- Page 112-----------------------
Tasks Divided
Lexical Analysis (scanning)
Syntactic Analysis (parsing)
Tree Generation
Abstract Syntax Tree (AST) is a structure
which keeps information in an easily
traversable form (such as operator at a
node, operands at children of the node)
ignores form-dependent superficial details
Code Generation
----------------------- Page 113-----------------------
The Java Code
The code to invoke the parser:
import java.io.*;
class Main {
public static void main(String[] args) {
try {
// use DataInputStream to grab bytes
MyLexer lexer = new MyLexer (
new DataInputStream (System.in));
MyParser parser = new MyParser (lexer);
int x = parser.expr ();
System.out.println (x);
} catch(Exception e) {
    System.err.println("exception: " + e);
}
}
}
----------------------- Page 114-----------------------
Abstract Syntax Trees
Abstract Syntax Tree: Like a parse tree, without
unnecessary information
Two-dimensional trees that can encode the structure of
the input as well as the input symbols
An AST for (3+4) might be represented as a + node
with children 3 and 4
No parentheses are included in the tree!