CS 480 Translators (Compilers)
HW1 overall grade
Switching to ast from compiler
• the compiler package is deprecated; replaced by ast
• similar structure, but easier to use, and faster
• compiler gives nonterminals only; ast gives nonterminals and terminals (abstract syntax tree)
>>> import compiler
>>> compiler.parse("if 3==4: print 5")
Module(None, Stmt([If([(Compare(Const(3), [('==', Const(4))]),
Stmt([Printnl([Const(5)], None)]))], None)]))
http://greentreesnakes.readthedocs.org/en/latest/
https://docs.python.org/2/library/ast.html
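The ast module's output can be inspected the same way; a quick sketch using Python 3 syntax (where print is a function, not a statement):

```python
import ast

# Parse the same statement with the ast module.
tree = ast.parse("if 3 == 4: print(5)")

# ast.dump shows the nodes (If, Compare) together with the constants.
print(ast.dump(tree))
```

The dumped tree contains an If node holding a Compare, analogous to the compiler module's output shown above.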
HW2 Grammar (P_2 subset)
program : module
module : stmt+
stmt : (simple_stmt | if_stmt | for_stmt) NEWLINE
bool_expr : bool_name
Compiled Cython Program (.cpp)
/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":3
* def fib(int n):
* l = []
* cdef int a=0, b=1 # <<<<<<<<<<<<<<
* for i in xrange(n):
* l.append(b)
*/
__pyx_v_a = 0;
__pyx_v_b = 1;
Compiled Cython Program (.cpp)
/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":4
* l = []
 * cdef int a=0, b=1
 * for i in range(n): # <<<<<<<<<<<<<<
 * l.append(b)
 * a, b = b, a+b
 */
very clever: for loop detected! (but you should always use xrange in your .pyx or .py!)
__pyx_t_2 = __pyx_v_n;
for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
__pyx_v_i = __pyx_t_3;
/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":5
* cdef int a=0, b=1
* for i in xrange(n):
* l.append(b) # <<<<<<<<<<<<<<
* a, b = b, a+b
* return l
*/
if (unlikely(__pyx_v_l == Py_None)) {
PyErr_SetString(PyExc_AttributeError, "'NoneType' object has no attribute 'append'");
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto
__pyx_L1_error;} }
__pyx_t_1 = PyInt_FromLong(__pyx_v_b); if (unlikely(!__pyx_t_1)) {__pyx_filename =
__pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_4 = PyList_Append(__pyx_v_l, __pyx_t_1); if (unlikely(__pyx_t_4 == -1))
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
Compiled Cython Program (.cpp)
/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":6
* for i in xrange(n):
* l.append(b)
* a, b = b, a+b # <<<<<<<<<<<<<<
* return l
*/
__pyx_t_4 = __pyx_v_b;
__pyx_t_5 = (__pyx_v_a + __pyx_v_b);
__pyx_v_a = __pyx_t_4;
__pyx_v_b = __pyx_t_5;
correctly handles simultaneous assignment
}
__pyx_v_n = __Pyx_PyInt_AsInt(__pyx_arg_n);
__pyx_t_1 = PyList_New(0);
__pyx_v_a = 0;
__pyx_v_b = 1;
__pyx_t_1 = PyInt_FromLong(__pyx_v_b);
__pyx_t_4 = PyList_Append(__pyx_v_l, __pyx_t_1);
__Pyx_DECREF(__pyx_t_1);
__pyx_t_4 = __pyx_v_b;
__pyx_t_5 = (__pyx_v_a + __pyx_v_b);
__pyx_v_a = __pyx_t_4;
__pyx_v_b = __pyx_t_5;
}
...
}
By Comparison... using python int
import sys
n, m = map(int, sys.argv[1:3])
for _ in xrange(m):
a, b = 0, 1
for _ in xrange(n):
a, b = b, a+b
print "fib[%d] = %d" % (n, b)
Lexer within Compiler Pipeline
regexps:
INT = \d+
CLASSNAME = [A-Z][a-zA-Z\d]*
Regular Expression Examples
• integers: (-?)\d+
• floats: (-?)((\d+\.\d*)|(\d*\.\d+))
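These patterns can be checked directly with Python's re module; a small sketch:

```python
import re

# The integer and float patterns from the slide, anchored with fullmatch.
INT = re.compile(r'(-?)\d+')
FLOAT = re.compile(r'(-?)((\d+\.\d*)|(\d*\.\d+))')

for s in ["42", "-7", "3.14", "-.5", "2."]:
    if FLOAT.fullmatch(s):
        kind = "float"
    elif INT.fullmatch(s):
        kind = "int"
    else:
        kind = "neither"
    print(s, kind)
```

Note the float pattern requires a dot but allows digits to be missing on either side of it (".5" and "2." both match), which is why the two alternatives are needed.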
How RegExps are Implemented
• regexp => NFA => DFA
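The NFA => DFA step is the classic subset construction; a compact sketch over a toy NFA (the NFA here, for (a|b)*ab, is an illustrative example, not one from the lecture):

```python
# Subset construction: each DFA state is a frozenset of NFA states.
# Toy NFA recognizing (a|b)*ab: state 0 loops on a/b, 'a' may also
# move to state 1, and 1 goes to the accepting state 2 on 'b'.
nfa = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {},
}
start, accept = 0, {2}

def move(states, sym):
    """All NFA states reachable from `states` on input `sym`."""
    out = set()
    for s in states:
        out |= nfa[s].get(sym, set())
    return frozenset(out)

def subset_construction():
    """Build the DFA transition table by exploring reachable state sets."""
    dfa, todo = {}, [frozenset({start})]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for sym in 'ab':
            T = move(S, sym)
            if T:
                dfa[S][sym] = T
                todo.append(T)
    return dfa

dfa = subset_construction()

def accepts(word):
    S = frozenset({start})
    for c in word:
        S = dfa.get(S, {}).get(c, frozenset())
    return bool(S & accept)

print(accepts("aab"))   # True: ends in "ab"
print(accepts("abb"))   # False
```

Each DFA state is a set of NFA states, so the DFA can be exponentially larger in the worst case, but is deterministic and fast to run.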
Two Examples: (a|b)*aba   (ab|ba)*|bb
CS 321 - TOC
Writing a Lexer is Boring...
• we often use a lexer-generator (lex, ~1974) and a parser-generator (yacc, ~1973)
• lexer-generator input (regexps):
INT = \d+
CLASSNAME = [A-Z][a-zA-Z\d]*
...
• parser-generator input (only allows BNF, not EBNF!):
module : stmtlist
stmtlist : stmt | stmt stmtlist
...
• lex reads the token specification and emits lexer.c
Lex/Yacc Big Picture
lexer.l (token specification) --[lex]--> lexer.c
parser.y (grammar specification) --[yacc]--> parser.c
lexer.c + parser.c --> mycompiler

/* parser.y */
%{
#include "header.h"
%}

%union {
    char *name;
    int val;
}

%token PLUS MINUS TIMES DIVIDE EQUALS
%token <name> ID;
%token <val> NUMBER;

%%
start : ID EQUALS expr;

expr : expr PLUS term
     | expr MINUS term
     | term
     ;
...
What is PLY?
• PLY = Python Lex-Yacc, a lex/yacc implementation written in Python
• lexing rules are ordinary functions whose docstrings hold the regexp:
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
lex.lex() # Build the lexer
• builds the lexer by creating a master regular expression
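The "master regular expression" idea can be approximated with named groups; this is a simplified sketch of the technique, not PLY's actual implementation:

```python
import re

# Combine per-token regexps into one alternation of named groups,
# the way a lexer generator builds its master pattern.
token_specs = {
    'NUMBER': r'\d+',
    'PLUS':   r'\+',
    'NAME':   r'[a-zA-Z_][a-zA-Z0-9_]*',
}
master = re.compile('|'.join('(?P<%s>%s)' % kv for kv in token_specs.items()))

# m.lastgroup names the rule that matched, giving the token type.
for m in master.finditer("x1 + 42"):
    print(m.lastgroup, m.group())
```

Alternation order matters: earlier groups win ties, which is how rule priority is encoded in a single pattern.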
ply.lex example
import ply.lex as lex

tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]

t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

lex.lex() # Build the lexer

Introspection is used to examine the contents of the calling module:
__dict__ = {
    'tokens'   : [ 'NAME' ...],
    't_ignore' : ' \t',
    't_PLUS'   : '\\+',
    ...
    't_NUMBER' : <function ...
}
ply.lex use
• Two functions: input() and token()
...
lex.lex() # Build the lexer
...
lex.input("x = 3 * 4 + 5 * 6")    # input() feeds a string into the lexer
while True:
    tok = lex.token()             # token() returns the next token or None
    if not tok: break
    # Use token
...
def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = keywords.get(t.value, 'NAME') # Check for reserved words
    return t
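The keywords table used above is just a dict from reserved words to token types; a sketch (the table contents here are assumed for illustration):

```python
# Hypothetical reserved-word table: the lexer matches keywords with the
# same regexp as identifiers, then reclassifies them by dict lookup.
keywords = {'if': 'IF', 'else': 'ELSE', 'while': 'WHILE', 'print': 'PRINT'}

def classify(lexeme):
    """Return the token type for an identifier-shaped lexeme."""
    return keywords.get(lexeme, 'NAME')

print(classify('while'))  # WHILE
print(classify('count'))  # NAME
```

This avoids writing a separate regexp per keyword, which would bloat the master pattern and risk prefix bugs (e.g. 'iffy' matching IF).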
Python's INDENT/DEDENT
• this is beyond context-free, and thus way beyond regular!
def t_space(t):
    r'[ ]+'
    return None # ignore white space
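A standard way to handle this, used by CPython's own tokenizer, is to keep a stack of indentation widths and emit INDENT/DEDENT tokens as the width changes; a simplified sketch:

```python
def indent_tokens(lines):
    """Yield INDENT/DEDENT pseudo-tokens by tracking a stack of indent widths."""
    stack = [0]
    for line in lines:
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:          # deeper: open one block
            stack.append(width)
            yield 'INDENT'
        while width < stack[-1]:       # shallower: close blocks until we match
            stack.pop()
            yield 'DEDENT'
        yield ('LINE', line.strip())
    while len(stack) > 1:              # close any blocks still open at EOF
        stack.pop()
        yield 'DEDENT'

src = ["if x:", "    y = 1", "    if y:", "        z = 2", "w = 3"]
for tok in indent_tokens(src):
    print(tok)
```

The stack is exactly what pushes this beyond a regular lexer: matching INDENTs to DEDENTs needs unbounded memory.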
def p_assign(p):
    '''assign : NAME EQUALS expr'''
def p_expr(p):
    '''expr : expr PLUS term
            | expr MINUS term
            | term'''
def p_term(p):
    '''term : term TIMES factor
            | term DIVIDE factor
            | factor'''
def p_factor(p):
    '''factor : NUMBER'''

(docstrings contain grammar rules from BNF)
yacc.yacc() # Build the parser
• builds the parser using introspection
ply.yacc parsing
• yacc.parse() function
yacc.yacc() # Build the parser
...
data = "x = 3*4+5*6"
yacc.parse(data) # Parse some text
Using an LR Parser
def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = p[1] + p[3]
def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = p[1] * p[3]
def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]
def p_factor(p):
    '''factor : NUMBER'''
    p[0] = p[1]
Example: Parse Tree
def p_assign(p):
    '''assign : NAME EQUALS expr'''
    p[0] = ('ASSIGN',p[1],p[3])
def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = ('+',p[1],p[3])
def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = ('*',p[1],p[3])
def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]
def p_factor(p):
    '''factor : NUMBER'''
    p[0] = ('NUM',p[1])
Example: Parse Tree
>>> t = yacc.parse("x = 3*4 + 5*6")
>>> t
('ASSIGN','x',('+',
('*',('NUM',3),('NUM',4)),
('*',('NUM',5),('NUM',6))
)
)
>>>
         ASSIGN
        /      \
      'x'      '+'
              /   \
            '*'   '*'
           /  \   /  \
          3    4 5    6
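A tuple tree like this is easy to walk; a sketch of a recursive evaluator over these nodes (assuming NUM leaves hold ints):

```python
def evaluate(node, env):
    """Recursively evaluate the tuple-based parse tree built above."""
    tag = node[0]
    if tag == 'NUM':
        return node[1]
    if tag == '+':
        return evaluate(node[1], env) + evaluate(node[2], env)
    if tag == '*':
        return evaluate(node[1], env) * evaluate(node[2], env)
    if tag == 'ASSIGN':                  # bind the NAME, return the value
        env[node[1]] = evaluate(node[2], env)
        return env[node[1]]
    raise ValueError("unknown node: %r" % (tag,))

t = ('ASSIGN', 'x', ('+',
        ('*', ('NUM', 3), ('NUM', 4)),
        ('*', ('NUM', 5), ('NUM', 6))))
env = {}
print(evaluate(t, env))  # 42
print(env)               # {'x': 42}
```

This is the interpreter half of the pipeline: the grammar actions build the tree, and a tree walk gives it meaning.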
Why use PLY?
• PLY reports grammar problems when building the parser:
% python mycompiler.py
yacc: Generating LALR parsing table...
4 shift/reduce conflicts
2 reduce/reduce conflicts
• and writes a parser.out debugging file describing every parser state, e.g.:
Parsing method: LALR

Nonterminals, with rules where they appear
expression : 1 2 3 3 4 4 5 5 6 6
statement  : 0

state 0
    (0) S' -> . statement
    (1) statement -> . NAME = expression
    (2) statement -> . expression
    (3) expression -> . expression + expression
    (4) expression -> . expression - expression
    (5) expression -> . expression * expression
    (6) expression -> . expression / expression
    (7) expression -> . NUMBER
    NAME        shift and go to state 1
    NUMBER      shift and go to state 2
    statement   shift and go to state 3
    expression  shift and go to state 4

state 1
    (1) statement -> NAME . = expression
    ...

state 11
    (4) expression -> expression - expression .
    (3) expression -> expression . + expression
    (4) expression -> expression . - expression
    (5) expression -> expression . * expression
    (6) expression -> expression . / expression
  ! shift/reduce conflict for + resolved as shift.
  ! shift/reduce conflict for - resolved as shift.
  ! shift/reduce conflict for * resolved as shift.
  ! shift/reduce conflict for / resolved as shift.
    $end        reduce using rule 4 (expression -> expression - expression .)
    +           shift and go to state 7
    -           shift and go to state 6
    *           shift and go to state 8
    /           shift and go to state 9
  ! +  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! -  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! *  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! /  [ reduce using rule 4 (expression -> expression - expression .) ]
...
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
• PLY
precedence = (
    ('left','PLUS','MINUS'),
    ('left','TIMES','DIVIDE'),
    ('nonassoc','UMINUS'),
)
def p_expr_uminus(p):
    'expr : MINUS expr %prec UMINUS'
    p[0] = -p[2]
Character Literals
• Yacc
expr : expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
;
• PLY
def p_expr(p):
    '''expr : expr '+' expr
            | expr '-' expr
            | expr '*' expr
            | expr '/' expr'''
    ...
Error Productions
• Yacc
funcall_err : ID LPAREN error RPAREN {
printf("Syntax error in arguments\n");
}
;
• PLY
def p_funcall_err(p):
    '''funcall_err : ID LPAREN error RPAREN'''
    print "Syntax error in arguments\n"
PLY is Simple
• Two pure-Python modules. That's it.
• Not part of a "parser framework"
• Use doesn't involve exotic design patterns
• Doesn't rely upon C extension modules
• Doesn't rely on third party tools
PLY is Fast
• For a parser written entirely in Python
• Underlying parser is table driven
• Parsing tables are saved and only regenerated if
the grammar changes
• Considerable work went into optimization
from the start (developed on a 200MHz PC)
PLY Performance
• Parse file with 1000 random expressions
(805KB) and build an abstract syntax tree
• PLY-2.3 : 2.95 sec, 10.2 MB (Python)
class MyParser:
    def p_assign(self,p):
        '''assign : NAME EQUALS expr'''
    def p_expr(self,p):
        '''expr : expr PLUS term
                | expr MINUS term
                | term'''
    def p_term(self,p):
        '''term : term TIMES factor
                | term DIVIDE factor
                | factor'''
    def p_factor(self,p):
        '''factor : NUMBER'''
    def build(self):
        self.parser = yacc.yacc(module=self)
Limitations
• LALR(1) parsing
• Not easy to work with very complex grammars
(e.g., C++ parsing)
• Retains all of yacc's black magic
• Not as powerful as more general parsing
algorithms (ANTLR, SPARK, etc.)
• Tradeoff : Speed vs. Generality