
CS 480

Translators (Compilers)

weeks 2-3: SDT (cont'd), Cython, lexers, and lex/yacc

Instructor: Liang Huang


(some slides courtesy of David Beazley)
HW1 caveats & coding grades

• scope: either declare all vars upfront, or right before the for-loop
• an auxiliary loop variable takes care of lots of corner cases
  • OK to change i inside the loop
  • no need for i-- after the loop
  • OK with range(0)
• also need to cache the range limit
• generous partial credit for mistakes in printing multiple items
HW1 overall grade

(grade distribution chart)
Switching to ast from compiler

• the compiler package is deprecated; replaced by ast
• similar structure, but easier to use, and faster

>>> import compiler
>>> compiler.parse("if 3==4: print 5")
Module(None, Stmt([If([(Compare(Const(3), [('==', Const(4))]),
Stmt([Printnl([Const(5)], None)]))], None)]))

>>> compiler.parse("if 3==4: print 5").node.nodes[0].__dict__
{'tests': [(Compare(Const(3), [('==', Const(4))]),
Stmt([Printnl([Const(5)], None)]))], 'else_': None, 'lineno': 1}

>>> import ast
>>> ast.dump(ast.parse("if 3==4: print 5"))
'Module(body=[If(test=Compare(left=Num(n=3), ops=[Eq()],
comparators=[Num(n=4)]), body=[Print(dest=None, values=[Num(n=5)],
nl=True)], orelse=[])])'
http://greentreesnakes.readthedocs.org/en/latest/
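To do anything with the parse beyond printing it, you walk the tree. Here is a minimal sketch (Python 2; the class name and the counting task are made up for illustration) using the standard ast.NodeVisitor:

import ast

class PrintCounter(ast.NodeVisitor):
    """Count Print nodes in a Python 2 AST (illustrative only)."""
    def __init__(self):
        self.count = 0
    def visit_Print(self, node):   # called once per Print node
        self.count += 1
        self.generic_visit(node)   # keep walking the children

tree = ast.parse("if 3==4: print 5")
counter = PrintCounter()
counter.visit(tree)
print counter.count   # => 1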
Switching to ast from compiler

• the compiler package is deprecated; replaced by ast
• similar structure, but easier to use, and faster

https://docs.python.org/2/library/ast.html

CFG for AST (abstract syntax tree; nonterminals only):

mod  = Module(stmt* body)

stmt = Assign(expr* targets, expr value)
     | Print(..., expr* values, ...)
     | For(expr target, expr iter, stmt* body, ...)
     | If(expr test, stmt* body, ...)
     ...

expr = BoolOp(boolop op, expr* values)
     | BinOp(expr left, operator op, expr right)
     | UnaryOp(unaryop op, expr operand)
     ...

CFG for Python (nonterminals and terminals):

module: (NEWLINE | stmt)* ENDMARKER
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';']
small_stmt: (expr_stmt | print_stmt ...)
expr_stmt: testlist (augassign (yield_expr|testlist) |
                     ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' |
            '<<=' | '>>=' | '**=' | '//=')
print_stmt: 'print' ( [ test (',' test)* [','] ] ... )
compound_stmt: if_stmt | while_stmt | for_stmt | ...
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
...
HW2 Grammar (P_2 subset)

program : module
module : stmt+
stmt : (simple_stmt | if_stmt | for_stmt) NEWLINE

simple_stmt : "print" expr ("," expr)*
            | int_name "=" int_expr
            | bool_name "=" bool_expr

expr : int_expr | bool_expr

if_stmt : "if" bool_expr ":" (simple_stmts | suite)

for_stmt : "for" name "in" "range" "(" int_expr ")" ":" (simple_stmts | suite)

simple_stmts : simple_stmt (";" simple_stmt)+

suite : NEWLINE INDENT stmt+ DEDENT

int_expr : int_name
         | decint
         | "-" int_expr
         | int_expr "+" int_expr
         | "(" int_expr ")"
         | int_expr "if" bool_expr "else" int_expr

bool_expr : bool_name
          | bool_expr "and" bool_expr
          | bool_expr "or" bool_expr
          | "not" bool_expr
          | "(" bool_expr ")"
          | int_expr (comp_op int_expr)+
          | "True"
          | "False"
          | "(" bool_expr "if" bool_expr "else" bool_expr ")"

comp_op : '<' | '>' | '==' | '>=' | '<=' | '<>' | '!='

• HW2: translate P_2 into C (a small flavor of this is sketched below)
  • expand hw1.py
  • no recursive type analysis
  • caution on nested loops
• HW3: use PLY (lex/yacc) to build a lexer/parser to replace compiler.parse()
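For a flavor of what HW2 asks for, here is a tiny hand-written sketch (helper names are hypothetical, not the course's reference hw1.py/hw2.py) that translates a P_2-style print of integer expressions into C via the ast module:

import ast

def gen_int_expr(node):
    """Return a C expression string for a small int-only AST subset."""
    if isinstance(node, ast.Num):
        return str(node.n)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return "(%s + %s)" % (gen_int_expr(node.left), gen_int_expr(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return "(-%s)" % gen_int_expr(node.operand)
    raise NotImplementedError(node)

def gen_print(node):   # a Python 2 ast.Print node
    exprs = ['printf("%%d ", %s);' % gen_int_expr(v) for v in node.values]
    return " ".join(exprs) + ' printf("\\n");'

tree = ast.parse("print 1 + -2, x")
print gen_print(tree.body[0])
# printf("%d ", (1 + (-2))); printf("%d ", x); printf("\n");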
HW1/2 Motivations: Cython

• Cython (2007~): compiles a large Python subset into C
• gradually being replaced by PyPy (just-in-time compilation)

cython --embed $1.py
clang -Os -I /usr/include/python2.7 -lpython2.7 -lpthread -lm -lutil -ldl $1.c -o $1.out
Python int vs C int

         C/C++ int    long    long long
32-bit      32         32        64
64-bit      32         64        64

• Python has arbitrary-precision integer arithmetic (<type 'long'>)
  • when a Python int exceeds the 'int' range, it becomes 'long' and stays 'long'
• you can turn this off in Cython with "cdef int ..."
  • much faster, but overflows

.py version (compiled as-is; ~15% faster than Python -- this is more like your HW1/2):

import sys
n, m = map(int, sys.argv[1:3])
for _ in xrange(m):
    a, b = 0, 1
    for _ in xrange(n):
        a, b = b, a+b
print "fib[%d] = %d" % (n, b)

.pyx version (with cdef int; ~95% faster than Python, but overflows!):

import sys
n, m = map(int, sys.argv[1:3])
cdef int a, b
for _ in xrange(m):
    a, b = 0, 1
    for _ in xrange(n):
        a, b = b, a+b
print "fib[%d] = %d" % (n, b)
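You can see both behaviors from plain Python (a quick check using the standard ctypes module):

import ctypes

print type(2 ** 100)                # <type 'long'>: auto-promoted, never overflows
fib47 = 2971215073                  # fib(47), just past the signed 32-bit range
print ctypes.c_int(fib47).value     # -1323752223: what a C 'int' would hold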
Compiled Cython Program (.cpp)

The .pyx source:

def fib(int n):
    l = []
    cdef int a=0, b=1
    for i in range(n):
        l.append(b)
        a, b = b, a+b
    return l

Generated C (excerpt):

/* Generated by Cython 0.14 on Wed Jun 1 15:38:37 2011 */
#include "Python.h"
...
/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":2
 * def fib(int n):
 *     l = []                  # <<<<<<<<<<<<<<
 *     cdef int a=0, b=1
 *     for i in range(n):
 */
__pyx_t_1 = PyList_New(0); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0];
__pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(((PyObject *)__pyx_t_1));
__Pyx_DECREF(((PyObject *)__pyx_v_l));
__pyx_v_l = __pyx_t_1;
__pyx_t_1 = 0;

/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":3
 * def fib(int n):
 *     l = []
 *     cdef int a=0, b=1       # <<<<<<<<<<<<<<
 *     for i in xrange(n):
 *         l.append(b)
 */
__pyx_v_a = 0;
__pyx_v_b = 1;
Compiled Cython Program (.cpp)

very clever: the for loop is detected and compiled into a C loop! (but you should always use xrange in your .pyx or .py)

/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":4
 *     l = []
 *     cdef int a=0, b=1
 *     for i in range(n):      # <<<<<<<<<<<<<<
 *         l.append(b)
 *         a, b = b, a+b
 */
__pyx_t_2 = __pyx_v_n;
for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
  __pyx_v_i = __pyx_t_3;

/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":5
 *     cdef int a=0, b=1
 *     for i in xrange(n):
 *         l.append(b)         # <<<<<<<<<<<<<<
 *         a, b = b, a+b
 *     return l
 */
  if (unlikely(__pyx_v_l == Py_None)) {
    PyErr_SetString(PyExc_AttributeError, "'NoneType' object has no attribute 'append'");
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto __pyx_L1_error;} }
  __pyx_t_1 = PyInt_FromLong(__pyx_v_b); if (unlikely(!__pyx_t_1)) {__pyx_filename =
  __pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_4 = PyList_Append(__pyx_v_l, __pyx_t_1); if (unlikely(__pyx_t_4 == -1))
  {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
Compiled Cython Program (.cpp)

correctly handles simultaneous assignment:

/* "/Users/lhuang/install/svector/tutorial/fib_cy.pyx":6
 *     for i in xrange(n):
 *         l.append(b)
 *         a, b = b, a+b       # <<<<<<<<<<<<<<
 *     return l
 */
  __pyx_t_4 = __pyx_v_b;
  __pyx_t_5 = (__pyx_v_a + __pyx_v_b);
  __pyx_v_a = __pyx_t_4;
  __pyx_v_b = __pyx_t_5;
}

The whole function, condensed:

static PyObject *__pyx_pf_6fib_cy_0fib(PyObject *__pyx_self, PyObject *__pyx_arg_n) {
  ...
  __pyx_v_n = __Pyx_PyInt_AsInt(__pyx_arg_n);
  __pyx_t_1 = PyList_New(0);
  __pyx_v_a = 0;
  __pyx_v_b = 1;

  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_t_1 = PyInt_FromLong(__pyx_v_b);
    __pyx_t_4 = PyList_Append(__pyx_v_l, __pyx_t_1);
    __Pyx_DECREF(__pyx_t_1);

    __pyx_t_4 = __pyx_v_b;
    __pyx_t_5 = (__pyx_v_a + __pyx_v_b);
    __pyx_v_a = __pyx_t_4;
    __pyx_v_b = __pyx_t_5;
  }
  ...
}
By Comparison... using python int

import sys
n, m = map(int, sys.argv[1:3])

for _ in xrange(m):
    a, b = 0, 1
    for _ in xrange(n):
        a, b = b, a+b
print "fib[%d] = %d" % (n, b)

Generated C for the single line a, b = b, a+b (every value is a boxed PyObject):

__pyx_t_2 = __Pyx_GetModuleGlobalName(__pyx_n_s_b); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0];
__pyx_lineno = 12; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_2);
__pyx_t_4 = __Pyx_GetModuleGlobalName(__pyx_n_s_a); if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0];
__pyx_lineno = 12; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_4);
__pyx_t_10 = __Pyx_GetModuleGlobalName(__pyx_n_s_b); if (unlikely(!__pyx_t_10)) {__pyx_filename =
__pyx_f[0]; __pyx_lineno = 12; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_10);
__pyx_t_11 = PyNumber_Add(__pyx_t_4, __pyx_t_10); if (unlikely(!__pyx_t_11)) {__pyx_filename = __pyx_f[0];
__pyx_lineno = 12; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_11);
__Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
__Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
if (PyDict_SetItem(__pyx_d, __pyx_n_s_a, __pyx_t_2) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 12;
__pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
if (PyDict_SetItem(__pyx_d, __pyx_n_s_b, __pyx_t_11) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 12;
__pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_DECREF(__pyx_t_11); __pyx_t_11 = 0;
Lexer within Compiler Pipeline (as used in HW1)

token specification -- regexps:
INT = \d+
CLASSNAME = [A-Z][a-zA-Z\d]*

grammar specification -- CFG in [E]BNF:
simple_stmt : "print" expr ("," expr)*
            | name "=" int_expr
Regular Expression Examples

• integers and floats
  (-?)\d+        (-?)((\d+\.\d*)|(\d*\.\d+))
• Python/Java/C variable names (start with a letter or _)
  [a-zA-Z_][a-zA-Z\d_]*
• Python variable/function naming convention: joined_lower
  [a-z_]+        (_?)([a-z]+)(_[a-z]+)*(_?)
• Java variable/function naming convention: camelCase
  [a-z]+([A-Z][a-z]+)+
• Python/Java class naming convention: StudlyCaps
  ([A-Z][a-z]+)+
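These can be checked directly with Python's re module (a quick sanity check, anchoring the end with $):

import re

tests = [(r'[a-z]+([A-Z][a-z]+)+$', 'camelCase'),    # Java style: matches
         (r'([A-Z][a-z]+)+$',       'StudlyCaps'),   # class style: matches
         (r'([A-Z][a-z]+)+$',       'camelCase')]    # wrong style: no match
for pat, s in tests:
    print pat, s, bool(re.match(pat, s))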
How RegExps are implemented
• regexp => NFA => DFA

Two Examples: (a|b)*aba and (ab|ba)*|bb
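One way to make the NFA idea concrete: simulate the machine for (a|b)*aba directly by tracking the set of states that are simultaneously alive (a small sketch; the state numbering here is made up). Determinizing into a DFA is then just precomputing these state sets:

# NFA for (a|b)*aba: states 0..3, accepting state 3.
# From state 0, 'a' and 'b' loop back to 0, and 'a' also advances to 1.
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0},
       (1, 'b'): {2},
       (2, 'a'): {3}}

def accepts(s):
    states = {0}                     # start state
    for ch in s:
        states = set().union(*[nfa.get((q, ch), set()) for q in states])
    return 3 in states               # accept iff an accepting state is live

print accepts("ababa")   # True
print accepts("abab")    # False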
Writing a Lexer is Boring...

• we often use a lexer-generator (lex, ~1974) paired with a parser-generator (yacc, ~1973)
• the parser-generator only allows BNF, not EBNF!

token specification (regexps):
INT = \d+
CLASSNAME = [A-Z][a-zA-Z\d]*
...

grammar specification (BNF):
module : stmtlist
stmtlist : stmt | stmt stmtlist
...

Writing Parsers and Compilers with PLY (David Beazley)
Lex & Yacc

• Programming tools for writing parsers
  • Lex - Lexical analysis (tokenizing)
  • Yacc - Yet Another Compiler Compiler (parsing)
• History:
  • Yacc: ~1973, Stephen Johnson (AT&T)
  • Lex: ~1974, Eric Schmidt and Mike Lesk (AT&T)
• Variations of both tools are widely known
• Covered in compilers classes and textbooks
Lex/Yacc Big Picture

lexer.l (token specification):

/* lexer.l */
%{
#include "header.h"
int lineno = 1;
%}
%%
[ \t]*  ;              /* Ignore whitespace */
\n      { lineno++; }
[0-9]+  { yylval.val = atoi(yytext);
          return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]* { yylval.name = strdup(yytext);
          return ID; }
\+      { return PLUS; }
-       { return MINUS; }
\*      { return TIMES; }
\/      { return DIVIDE; }
=       { return EQUALS; }
%%
Lex/Yacc Big Picture

lex compiles lexer.l into lexer.c; the grammar goes into parser.y:

/* parser.y */
%{
#include "header.h"
%}
%union {
    char *name;
    int val;
}
%token PLUS MINUS TIMES DIVIDE EQUALS
%token<name> ID;
%token<val> NUMBER;
%%
start : ID EQUALS expr;
expr : expr PLUS term
     | expr MINUS term
     | term
     ;
...
Lex/Yacc Big Picture

lexer.l (token specification)   --lex-->   lexer.c
parser.y (grammar specification) --yacc--> parser.c

lexer.c + parser.c + typecheck.c + codegen.c + otherstuff.c  -->  mycompiler
What is PLY?

• PLY = Python Lex-Yacc


• A Python version of the lex/yacc toolset
• Same functionality as lex/yacc
• But a different interface
• Influences : Unix yacc, SPARK (John Aycock)
ply.lex example

import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

lex.lex()  # Build the lexer

(REALLY AWFUL INTERFACE!! SHOULD BE MODERNIZED!!)
ply.lex example (annotated)

• the tokens list specifies all of the possible tokens
• each token has a matching declaration of the form t_TOKNAME; the names must match the tokens list
• tokens are defined by regular expressions; for simple tokens, raw strings are used
• functions are used when special action code must execute; the docstring holds the regular expression
• t_ignore specifies ignored characters between tokens (usually whitespace)
• lex.lex() builds the lexer by creating a master regular expression, using introspection to examine the contents of the calling module:

__dict__ = {
    'tokens'   : [ 'NAME' ...],
    't_ignore' : ' \t',
    't_PLUS'   : '\\+',
    ...
    't_NUMBER' : <function ...>
}
ply.lex use

• Two functions: input() and token()

...
lex.lex()   # Build the lexer
...
lex.input("x = 3 * 4 + 5 * 6")
while True:
    tok = lex.token()
    if not tok: break
    # Use token
...
• input() feeds a string into the lexer
• token() returns the next token, or None at end of input
• each returned token has four attributes:
  • tok.type -- the token type, e.g. NAME (from t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*')
  • tok.value -- the matching text, e.g. 'x'
  • tok.lineno -- the line number
  • tok.lexpos -- the position in the input text
Actually this doesn't work!

• keywords get mixed up with identifiers (variable names, ...): PLY sorts string rules by decreasing regular-expression length, so the longer t_NAME pattern is tried first and 'print' comes out as a NAME

tokens = ('PRINT','INT','PLUS','NAME')
t_ignore = ' \t'
t_PRINT = r'print'
t_PLUS = r'\+'
t_NAME = r'[a-zA-Z_][a-zA-Z\d_]*'

$ echo -e "print a" | ./hw2a.py
(NAME, 'print') (NAME, 'a')

• fix: match everything as a NAME, then check for reserved words

keywords = {'print' : 'PRINT', ...}
tokens = ['NAME','INT','PLUS'] + keywords.values()
t_ignore = ' \t'
t_PLUS = r'\+'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z\d_]*'
    t.type = keywords.get(t.value, 'NAME')   # Check for reserved words
    return t

$ echo -e "print a" | ./hw2b.py
(PRINT, 'print') (NAME, 'a')
Python's INDENT/DEDENT

• this is beyond context-free, and thus way beyond regular!
• use a stack of indentation levels; an INDENT/DEDENT token is produced whenever the indentation level changes

if_stmt : "if" bool_expr ":" suite
suite : NEWLINE INDENT stmt+ DEDENT

# t_ignore = ' \t'   # can't ignore spaces
indents = [0]

def t_indent(t):
    r'\n[ ]*'   # tab is not allowed here
    t.lexer.lineno += 1
    if t.lexer.lexpos >= len(t.lexer.lexdata) \
       or t.lexer.lexdata[t.lexer.lexpos] == "\n":   # empty line
        return None   # ignore empty line
    width = len(t.value) - 1   # exclude \n
    last_pos = t.lexer.lexpos - width
    if width == indents[-1]:
        return None   # same indentation level
    elif width > indents[-1]:   # one more level
        t.type = 'INDENT'
        t.value = 1
        indents.append(width)
        return t
    else:   # try one or more DEDENTs
        ded = 0
        while len(indents) > 1:
            indents.pop()
            ded += 1
            if width == indents[-1]:
                t.type = 'DEDENT'
                t.value = ded   # remember how many dedents
                return t
        raise Exception("bad indent level at line %d: %s" % (t.lexer.lineno - 1,
                        t.lexer.lines[t.lexer.lineno - 1]))

def t_space(t):
    r'[ ]+'
    return None   # ignore white space

Example input and the resulting token stream (indent stack shown in brackets):

for i in range(1):
    for j in range(2):
        print i, j, 6
print 5

(FOR, 'for') [0]  (NAME, 'i') [0]  (IN, 'in') [0]  (RANGE, 'range') [0]
(LPAREN, '(') [0]  (INT, 1) [0]  (RPAREN, ')') [0]  (COLON, ':') [0]
(INDENT, 1) [0, 2]
(FOR, 'for') [0, 2]  (NAME, 'j') [0, 2]  (IN, 'in') [0, 2]  (RANGE, 'range') [0, 2]
(LPAREN, '(') [0, 2]  (INT, 2) [0, 2]  (RPAREN, ')') [0, 2]  (COLON, ':') [0, 2]
(INDENT, 1) [0, 2, 4]
(PRINT, 'print') [0, 2, 4]  (NAME, 'i') [0, 2, 4]  (COMMA, ',') [0, 2, 4]
(NAME, 'j') [0, 2, 4]  (INT, 6) [0, 2, 4]
(DEDENT, 2) [0]  (DEDENT, 2) [0]
(PRINT, 'print') [0]  (INT, 5) [0]
Python's INDENT/DEDENT (cont'd)

• a DEDENT token may carry a value > 1 (several levels close at once); a wrapper around the PLY lexer replays it as that many separate DEDENT tokens for the parser

class MyLexer(object):
    def __init__(self):
        lex.lex()   # build regexps
        self.lexer = lex.lexer
        self.dedent_balance = 0

    def input(self, stream):
        # the initial \n is to simplify indent
        # the final \n is to simplify dedent
        stream = "\n" + stream + "\n"
        self.lexer.input(stream)
        self.lexer.lines = stream.split("\n")   # internal
        print >> logs, "now lexing..."

    def tokenstr(self, tok):
        return "(%s, %s)" % (tok.type,
               ("'%s'" if type(tok.value) is str else '%s') % tok.value)

    def token(self):
        if self.dedent_balance != 0:
            self.dedent_balance -= 1
            tok = self.last_tok   # e.g. (DEDENT, 1)
            print >> logs, self.tokenstr(tok),
        else:
            tok = self.lexer.token()   # lexer.token
            if not tok:
                print >> logs
                return None
            print >> logs, self.tokenstr(tok),
            if tok.type == 'DEDENT':
                self.dedent_balance = tok.value - 1
        self.last_tok = tok
        return tok
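Driving the wrapper (a hypothetical main; logs in the class above is assumed to be a global file object such as sys.stderr):

import sys
logs = sys.stderr

lexer = MyLexer()
lexer.input(sys.stdin.read())
while True:
    tok = lexer.token()       # yields INDENT/DEDENT like any other token
    if tok is None:
        break                 # this stream is what the parser will consume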
ply.lex Commentary

• Normally you don't use the tokenizer directly
• Instead, it's used by the parser module
ply.yacc preliminaries

• ply.yacc is a module for creating a parser
• Assumes you have defined a BNF grammar

assign : NAME EQUALS expr
expr : expr PLUS term
     | expr MINUS term
     | term
term : term TIMES factor
     | term DIVIDE factor
     | factor
factor : NUMBER

compare with (ambiguity):

expr : expr PLUS expr
     | expr TIMES expr
     | NUMBER

---- or ----

expr : expr PLUS expr
     | term
term : term TIMES term
     | NUMBER
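Feeding the ambiguous version to ply.yacc makes the problem concrete: the table construction reports conflicts (a sketch; mylexer is the hypothetical lexer module used on the next slide):

import ply.yacc as yacc
import mylexer
tokens = mylexer.tokens

def p_expr_plus(p):
    'expr : expr PLUS expr'

def p_expr_times(p):
    'expr : expr TIMES expr'

def p_expr_num(p):
    'expr : NUMBER'

yacc.yacc()   # reports shift/reduce conflicts: the grammar is ambiguous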
ply.yacc example

import ply.yacc as yacc
import mylexer            # Import lexer information
tokens = mylexer.tokens   # Need token list

def p_assign(p):
    '''assign : NAME EQUALS expr'''

def p_expr(p):
    '''expr : expr PLUS term
            | expr MINUS term
            | term'''

def p_term(p):
    '''term : term TIMES factor
            | term DIVIDE factor
            | factor'''

def p_factor(p):
    '''factor : NUMBER'''

yacc.yacc()  # Build the parser
ply.yacc example (annotated)

• token information is imported from the lexer; the tokens list is required
• grammar rules are encoded as functions with names p_rulename (the name doesn't matter as long as it starts with p_)
• docstrings contain the grammar rules from the BNF
• yacc.yacc() builds the parser using introspection
ply.yacc parsing

• yacc.parse() function

yacc.yacc()   # Build the parser
...
data = "x = 3*4+5*6"
yacc.parse(data)   # Parse some text

• This feeds the data into the lexer
• Parses the text and invokes the grammar rules
A peek inside

• PLY uses LR parsing: LALR(1)
• AKA shift-reduce parsing
• Widely used parsing technique
• Table driven
General Idea

• Input tokens are shifted onto a parsing stack

Stack              Input
                   x = 3 * 4 + 5
NAME               = 3 * 4 + 5
NAME =             3 * 4 + 5
NAME = NUM         * 4 + 5

• This continues until a complete grammar rule appears on the top of the stack

General Idea

• If a rule is found, a "reduction" occurs

Stack              Input
NAME = NUM         * 4 + 5
                       reduce factor : NUM
NAME = factor      * 4 + 5

• The RHS of the grammar rule is replaced with the LHS
Rule Functions

• During reduction, rule functions are invoked

def p_factor(p):
    'factor : NUMBER'

• Parameter p contains the grammar symbol values

def p_factor(p):
    'factor : NUMBER'
    # p[0]      p[1]
Using an LR Parser

• Rule functions generally process values on the right hand side of the grammar rule
• The result is then stored in the left hand side
• Results propagate up through the grammar
• Bottom-up parsing
Example: Calculator

vars = {}   # symbol table

def p_assign(p):
    '''assign : NAME EQUALS expr'''
    vars[p[1]] = p[3]

def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = p[1] + p[3]

def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = p[1] * p[3]

def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]

def p_factor(p):
    '''factor : NUMBER'''
    p[0] = p[1]
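Assembled end to end, the calculator runs as below (a self-contained sketch, not verbatim from the slides: the elided expr : term, minus, and divide rules are filled in, and error handlers added):

import ply.lex as lex
import ply.yacc as yacc

# ---- lexer ----
tokens = ('NAME','NUMBER','PLUS','MINUS','TIMES','DIVIDE','EQUALS')
t_ignore  = ' \t'
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_EQUALS  = r'='
t_NAME    = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print "illegal character %r" % t.value[0]
    t.lexer.skip(1)

# ---- parser ----
vars = {}   # symbol table

def p_assign(p):
    '''assign : NAME EQUALS expr'''
    vars[p[1]] = p[3]

def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = p[1] + p[3]

def p_expr_minus(p):
    '''expr : expr MINUS term'''
    p[0] = p[1] - p[3]

def p_expr_term(p):
    '''expr : term'''
    p[0] = p[1]

def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = p[1] * p[3]

def p_term_div(p):
    '''term : term DIVIDE factor'''
    p[0] = p[1] / p[3]

def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]

def p_factor(p):
    '''factor : NUMBER'''
    p[0] = p[1]

def p_error(p):
    print "syntax error at", p

lex.lex()
yacc.yacc()
yacc.parse("x = 3*4 + 5*6")
print vars['x']   # => 42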
Example: Parse Tree

def p_assign(p):
    '''assign : NAME EQUALS expr'''
    p[0] = ('ASSIGN', p[1], p[3])

def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = ('+', p[1], p[3])

def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = ('*', p[1], p[3])

def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]

def p_factor(p):
    '''factor : NUMBER'''
    p[0] = ('NUM', p[1])
Example: Parse Tree

>>> t = yacc.parse("x = 3*4 + 5*6")
>>> t
('ASSIGN', 'x', ('+',
                 ('*', ('NUM', 3), ('NUM', 4)),
                 ('*', ('NUM', 5), ('NUM', 6))
                )
)
>>>

        ASSIGN
        /    \
      'x'    '+'
            /   \
          '*'   '*'
          / \   / \
         3   4 5   6
Why use PLY?

• There are many Python parsing tools


• Some use more powerful parsing algorithms
• Isn't parsing a "solved" problem anyways?
PLY is Informative
• Compiler writing is hard
• Tools should not make it even harder
• PLY provides extensive diagnostics
• Major emphasis on error reporting
• Provides the same information as yacc
PLY Diagnostics
• PLY produces the same diagnostics as yacc
• Yacc
% yacc grammar.y
4 shift/reduce conflicts
2 reduce/reduce conflicts

• PLY
% python mycompiler.py
yacc: Generating LALR parsing table...
4 shift/reduce conflicts
2 reduce/reduce conflicts

• PLY also produces the same debugging output


Debugging Output (parser.out)

Grammar

Rule 1  statement -> NAME = expression
Rule 2  statement -> expression
Rule 3  expression -> expression + expression
Rule 4  expression -> expression - expression
Rule 5  expression -> expression * expression
Rule 6  expression -> expression / expression
Rule 7  expression -> NUMBER

Terminals, with rules where they appear

*      : 5
+      : 3
-      : 4
/      : 6
=      : 1
NAME   : 1
NUMBER : 7
error  :

Nonterminals, with rules where they appear

expression : 1 2 3 3 4 4 5 5 6 6
statement  : 0

Parsing method: LALR

state 0

    (0) S' -> . statement
    (1) statement -> . NAME = expression
    (2) statement -> . expression
    (3) expression -> . expression + expression
    (4) expression -> . expression - expression
    (5) expression -> . expression * expression
    (6) expression -> . expression / expression
    (7) expression -> . NUMBER

    NAME        shift and go to state 1
    NUMBER      shift and go to state 2
    expression  shift and go to state 4
    statement   shift and go to state 3

state 1

    (1) statement -> NAME . = expression

    =           shift and go to state 5

state 10

    (1) statement -> NAME = expression .
    (3) expression -> expression . + expression
    (4) expression -> expression . - expression
    (5) expression -> expression . * expression
    (6) expression -> expression . / expression

    $end        reduce using rule 1 (statement -> NAME = expression .)
    +           shift and go to state 7
    -           shift and go to state 6
    *           shift and go to state 8
    /           shift and go to state 9

state 11

    (4) expression -> expression - expression .
    (3) expression -> expression . + expression
    (4) expression -> expression . - expression
    (5) expression -> expression . * expression
    (6) expression -> expression . / expression

  ! shift/reduce conflict for + resolved as shift.
  ! shift/reduce conflict for - resolved as shift.
  ! shift/reduce conflict for * resolved as shift.
  ! shift/reduce conflict for / resolved as shift.

    $end        reduce using rule 4 (expression -> expression - expression .)
    +           shift and go to state 7
    -           shift and go to state 6
    *           shift and go to state 8
    /           shift and go to state 9

  ! +  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! -  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! *  [ reduce using rule 4 (expression -> expression - expression .) ]
  ! /  [ reduce using rule 4 (expression -> expression - expression .) ]
PLY Validation
• PLY validates all token/grammar specs
• Duplicate rules
• Malformed regexs and grammars
• Missing rules and tokens
• Unused tokens and rules
• Improper function declarations
• Infinite recursion
Error Example

• an intentionally buggy lexer spec:

import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_MINUS  = r'-'       # duplicate rule
t_POWER  = r'\^'      # token not in the tokens list

def t_NUMBER():       # missing the t argument
    r'\d+'
    t.value = int(t.value)
    return t

lex.lex()  # Build the lexer

• PLY catches each mistake:

example.py:12: Rule t_MINUS redefined. Previously defined on line 6
lex: Rule 't_POWER' defined for an unspecified token POWER
example.py:15: Rule 't_NUMBER' requires an argument.
PLY is Yacc
• PLY supports all of the major features of
Unix lex/yacc
• Syntax error handling and synchronization
• Precedence specifiers
• Character literals
• Start conditions
• Inherited attributes
Precedence Specifiers

• Yacc

%left PLUS MINUS
%left TIMES DIVIDE
%nonassoc UMINUS
...
expr : MINUS expr %prec UMINUS {
           $$ = -$2;
       }

• PLY

precedence = (
    ('left','PLUS','MINUS'),
    ('left','TIMES','DIVIDE'),
    ('nonassoc','UMINUS'),
)

def p_expr_uminus(p):
    'expr : MINUS expr %prec UMINUS'
    p[0] = -p[2]   # p[1] is the MINUS token; p[2] is the expr value
Character Literals
• Yacc
expr : expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
;

• PLY
def p_expr(p):
'''expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr'''
...
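One caveat from PLY's documentation: single-character literals must also be declared in the lexer module via the literals list, e.g.:

# in the lexer: characters listed here may be returned as themselves
literals = ['+', '-', '*', '/']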
Error Productions

• Yacc

funcall_err : ID LPAREN error RPAREN {
        printf("Syntax error in arguments\n");
    }
    ;

• PLY

def p_funcall_err(p):
    '''funcall_err : ID LPAREN error RPAREN'''
    print "Syntax error in arguments\n"
PLY is Simple
• Two pure-Python modules. That's it.
• Not part of a "parser framework"
• Use doesn't involve exotic design patterns
• Doesn't rely upon C extension modules
• Doesn't rely on third party tools
PLY is Fast
• For a parser written entirely in Python
• Underlying parser is table driven
• Parsing tables are saved and only regenerated if the grammar changes
• Considerable work went into optimization from the start (developed on a 200 MHz PC)
PLY Performance

• Parse a file with 1000 random expressions (805KB) and build an abstract syntax tree
  • PLY-2.3  : 2.95 sec, 10.2 MB (Python)
  • DParser  : 0.71 sec, 72 MB (Python/C)
  • BisonGen : 0.25 sec, 13 MB (Python/C)
  • Bison    : 0.063 sec, 7.9 MB (C)
• 12x slower than BisonGen (mostly C)
• 47x slower than pure C
• System: Mac Pro, 2.66 GHz Xeon, Python-2.5
Class Example

import ply.yacc as yacc

class MyParser:
    def p_assign(self, p):
        '''assign : NAME EQUALS expr'''
    def p_expr(self, p):
        '''expr : expr PLUS term
                | expr MINUS term
                | term'''
    def p_term(self, p):
        '''term : term TIMES factor
                | term DIVIDE factor
                | factor'''
    def p_factor(self, p):
        '''factor : NUMBER'''
    def build(self):
        self.parser = yacc.yacc(object=self)
Limitations

• LALR(1) parsing
• Not easy to work with very complex grammars (e.g., C++ parsing)
• Retains all of yacc's black magic
• Not as powerful as more general parsing algorithms (ANTLR, SPARK, etc.)
• Tradeoff: Speed vs. Generality
