Code Source Tokens Scanner Parser IR

The document describes scanners and how they work. It discusses: - Scanners map characters into tokens which are the basic units of syntax. They eliminate whitespace and identify tokens like numbers, identifiers, and operators. - Scanners must specify patterns to recognize tokens using notations like regular expressions and regular grammars. Regular expressions use operations like union, concatenation, and Kleene closure to describe patterns. - From regular expressions, a scanner can construct a deterministic finite automaton to recognize tokens as it reads character input.

Uploaded by

moien

Scanner

[Figure: source code → scanner → tokens → parser → IR; the scanner and the parser both report errors]
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)
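
As a rough illustration of the mapping above, the following Python sketch turns `x = x + y;` into (token, lexeme) pairs. The token class names (`ws`, `id`, `number`, `op`) and the function name `tokenize` are assumptions made for this example, and it leans on Python's `re` module for brevity rather than on a specialized hand-built recognizer.

```python
import re

# Hypothetical token classes for illustration; a real language defines its own.
TOKEN_SPEC = [
    ("ws",     r"[ \t]+"),               # white space is recognized, then dropped
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+"),
    ("op",     r"[=+\-*/;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs, skipping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("x = x + y;")` yields the identifiers `x`, `x`, `y` as `id` tokens, with `=`, `+` and `;` as `op` tokens and the blanks removed.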

Specifying patterns

A scanner must recognize the units of syntax

Some parts are easy:

white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */

Other parts are much harder:

identifiers
alphabetic followed by k alphanumerics (_, $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’

We need a powerful notation to specify these patterns

Operations on languages

Operation                   Definition
union of L and M            L ∪ M = {s | s ∈ L or s ∈ M}
(written L ∪ M)
concatenation of L and M    LM = {st | s ∈ L and t ∈ M}
(written LM)
Kleene closure of L         L∗ = ⋃_{i=0}^{∞} L^i
(written L∗)
positive closure of L       L+ = ⋃_{i=1}^{∞} L^i
(written L+)
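
The four operations can be tried out on small finite languages. The sketch below represents a language as a Python set of strings. The function names are invented for this example, and because the true closures are infinite, `closure` stops at a caller-supplied bound `max_i`.

```python
def union(L, M):
    """L ∪ M"""
    return L | M

def concat(L, M):
    """LM = {st | s in L and t in M}"""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: i-fold concatenation, with L^0 = {""} (the set containing only ε)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, max_i, positive=False):
    """Union of L^i for i = 0..max_i (i = 1..max_i for the positive closure)."""
    out = set()
    for i in range(1 if positive else 0, max_i + 1):
        out |= power(L, i)
    return out
```

For example, `closure({"a"}, 3)` contains the empty string, while `closure({"a"}, 3, positive=True)` does not.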

Regular expressions

Patterns are often specified as regular languages

Notations used to describe a regular language (or a regular set) include
both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):

1. ε is a RE denoting the set {ε}

2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) ∪ L(s)
(r)(s) is a RE denoting L(r)L(s)
(r)∗ is a RE denoting L(r)∗

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.
Examples

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically
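
The definitions above translate almost symbol for symbol into Python's `re` syntax. This is a sketch: the constant names and the helper `matches` are invented here, the ASCII hyphen stands in for the minus sign, and the patterns follow the slide definitions literally (so the exponent sign in `REAL` is mandatory and the digit runs after `.` and `E` may be empty).

```python
import re

DIGIT   = r"[0-9]"
LETTER  = r"[A-Za-z]"
ID      = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL    = rf"(?:{INTEGER}|{DECIMAL})E[+-]{DIGIT}*"
COMPLEX = rf"\({REAL},{REAL}\)"

def matches(pattern, text):
    """True if the whole of `text` belongs to the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None
```

With these patterns, `matches(INTEGER, "007")` is false because a leading zero may not be followed by further digits, and `matches(REAL, "3.14E10")` is false because the definition requires a sign after `E`.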

Algebraic properties of REs

Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent

Examples

Let Σ = {a, b}

1. a|b denotes {a, b}

2. (a|b)(a|b) denotes {aa, ab, ba, bb}
i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a∗ denotes {ε, a, aa, aaa, . . .}
4. (a|b)∗ denotes the set of all strings of a’s and b’s (including ε)
i.e., (a|b)∗ = (a∗b∗)∗
5. a|a∗b denotes {a, b, ab, aab, aaab, aaaab, . . .}

Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier:
[Figure: DFA for identifier. State 0 is the start state. 0 --letter--> 1; 1 --letter or digit--> 1; 1 --other--> 2 (accept); 0 --digit or other--> 3 (error).]

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
Code for the recognizer
char ← next_char();
state ← 0;                    /* code for state 0 */
done ← false;
token_value ← "";             /* empty string */
while (not done) {
    class ← char_class[char];
    state ← next_state[class, state];
    switch (state) {
        case 1:               /* building an id */
            token_value ← token_value + char;
            char ← next_char();
            break;
        case 2:               /* accept state */
            token_type ← identifier;
            done ← true;
            break;
        case 3:               /* error */
            token_type ← error;
            done ← true;
            break;
    }
}
return token_type;

Tables for the recognizer

Two tables control the recognizer

char_class:
            a-z      A-Z      0-9      other
value       letter   letter   digit    other

next_state:
class       0   1   2   3
letter      1   1   -   -
digit       3   1   -   -
other       3   2   -   -

To change languages, we can just change tables
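
A runnable version of the table-driven recognizer might look like the following Python sketch. The two tables carry the same entries as above. The function name `scan_identifier`, the use of an empty string to stand for end of input, and the restriction to ASCII letters and digits are assumptions made for this example.

```python
LETTER, DIGIT, OTHER = "letter", "digit", "other"

def char_class(ch):
    """The char_class table, written as a function over ASCII characters."""
    if ch.isascii() and ch.isalpha():
        return LETTER
    if ch.isascii() and ch.isdigit():
        return DIGIT
    return OTHER  # includes end of input, represented here as ""

# NEXT_STATE[class][state]; states 2 (accept) and 3 (error) have no successors.
NEXT_STATE = {
    LETTER: {0: 1, 1: 1},
    DIGIT:  {0: 3, 1: 1},
    OTHER:  {0: 3, 1: 2},
}

def scan_identifier(text, pos=0):
    """Return (token_type, lexeme) for the input starting at text[pos]."""
    state, lexeme = 0, ""
    while True:
        ch = text[pos] if pos < len(text) else ""
        state = NEXT_STATE[char_class(ch)][state]
        if state == 1:      # building an id: consume the character
            lexeme += ch
            pos += 1
        elif state == 2:    # accept: the delimiter is left unconsumed
            return "identifier", lexeme
        else:               # state 3: error
            return "error", lexeme
```

Only `char_class` and `NEXT_STATE` encode the token being recognized, which is the sense in which swapping tables retargets the recognizer; the driver loop stays the same.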

Automatic construction

Scanner generators automatically construct code from RE-like descriptions

• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code)

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX

• emits C code for scanner

• provides macro definitions for each token
(used in the parser)
Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact:

For any RE r, ∃ a grammar g such that L(r) = L(g)

Grammars that generate regular sets are called regular grammars:

They have productions in one of 2 forms:

1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol

These are also called type 3 grammars (Chomsky)

More regular languages

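Example: the set of strings containing an even number of zeros and an even number of ones

[Figure: four-state DFA over s0, s1, s2, s3. Transitions on 1: s0 ↔ s1 and s2 ↔ s3. Transitions on 0: s0 ↔ s2 and s1 ↔ s3. For this language s0 serves as both the start state and the accepting state.]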
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗
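
One way to gain confidence in this RE is to compare it with the four-state DFA on every short input. The sketch below does that in Python. The transition table is read off the diagram with s0 taken as the start and accepting state, and the names `DELTA`, `dfa_accepts` and `agree_up_to` are invented for this example.

```python
import re
from itertools import product

# Transition table for the even-zeros, even-ones DFA.
DELTA = {
    ("s0", "0"): "s2", ("s0", "1"): "s1",
    ("s1", "0"): "s3", ("s1", "1"): "s0",
    ("s2", "0"): "s0", ("s2", "1"): "s3",
    ("s3", "0"): "s1", ("s3", "1"): "s2",
}

def dfa_accepts(word):
    """Run the DFA from s0; accept if it ends in s0."""
    state = "s0"
    for ch in word:
        state = DELTA[(state, ch)]
    return state == "s0"

EVEN_EVEN = re.compile(r"(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*")

def agree_up_to(n):
    """Check that the DFA and the RE agree on every binary string of length <= n."""
    for length in range(n + 1):
        for chars in product("01", repeat=length):
            w = "".join(chars)
            if dfa_accepts(w) != (EVEN_EVEN.fullmatch(w) is not None):
                return False
    return True
```

Agreement on short strings is evidence rather than proof, but it catches most transcription slips in either the table or the RE.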

More regular expressions

What about the RE (a | b)∗abb ?

[Figure: NFA for (a | b)∗abb. s0 loops to itself on a | b; s0 --a--> s1 --b--> s2 --b--> s3 (accepting).]

State s0 has multiple transitions on a!

⇒ nondeterministic finite automaton

        a           b
s0      {s0, s1}    {s0}
s1      -           {s2}
s2      -           {s3}

Finite automata

A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = {s0, . . . , sn}

2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of
states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has an ε-transition, and

2. for each state s and input symbol a, there is at most one edge labelled
a leaving s

A DFA accepts x iff there exists a unique path through the transition graph from s0 to
a final state such that the edges spell x.
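
An NFA can be run directly by tracking the set of states it might be in after each input symbol. The Python sketch below does this for the (a | b)∗abb automaton with states s0 to s3 and s3 as the final state. The dictionary layout and the name `nfa_accepts` are choices made for this example.

```python
# move function for the (a|b)*abb NFA; a missing entry means "no transition".
MOVE = {
    ("s0", "a"): {"s0", "s1"},
    ("s0", "b"): {"s0"},
    ("s1", "b"): {"s2"},
    ("s2", "b"): {"s3"},
}
START, FINAL = "s0", {"s3"}

def nfa_accepts(word):
    """Track the set of states the NFA could be in after each symbol."""
    current = {START}
    for ch in word:
        current = set().union(*(MOVE.get((s, ch), set()) for s in current))
    return bool(current & FINAL)
```

The input is accepted when at least one of the possible current states is final, which corresponds to the existence of some accepting path.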
DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs

2. Any NFA can be converted into a DFA, by simulating sets of
simultaneous states:
• each DFA state corresponds to a set of NFA states
• possible exponential blowup

NFA to DFA using the subset construction: example 1
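[Figure: NFA for (a | b)∗abb. s0 loops to itself on a | b; s0 --a--> s1 --b--> s2 --b--> s3.]

              a            b
{s0}          {s0, s1}     {s0}
{s0, s1}      {s0, s1}     {s0, s2}
{s0, s2}      {s0, s1}     {s0, s3}
{s0, s3}      {s0, s1}     {s0}

[Figure: resulting DFA over the states {s0}, {s0, s1}, {s0, s2}, {s0, s3}, with the transitions listed in the table; {s0, s3} is the accepting state.]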

Constructing a DFA from a regular expression

[Figure: diagram relating RE, NFA with ε moves, DFA, and minimized DFA through the constructions listed here]

RE → NFA w/ε moves
    build NFA for each term
    connect them with ε moves
NFA w/ε moves → DFA
    construct the simulation
    the “subset” construction
DFA → minimized DFA
    merge compatible states
DFA → RE
    construct R^k_{ij} = R^{k−1}_{ik} (R^{k−1}_{kk})∗ R^{k−1}_{kj} ∪ R^{k−1}_{ij}

RE to NFA
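[Figure: NFA building blocks for each RE form.
N(ε): start state --ε--> accepting state.
N(a): start state --a--> accepting state.
N(A|B): a new start state with ε moves into N(A) and N(B), and ε moves from the end of N(A) and the end of N(B) into a new accepting state.
N(AB): N(A) followed by N(B), so that leaving N(A) enters N(B).
N(A∗): a new start state with ε moves into N(A) and directly to a new accepting state; from the end of N(A), ε moves lead back to the start of N(A) and on to the new accepting state.]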

RE to NFA: example

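[Figure: NFA fragments for (a | b)∗abb.
a | b: N(A) and N(B) in parallel, joined at both ends by ε moves.
(a | b)∗: states 0 to 7 with 0 --ε--> 1, 0 --ε--> 7, 1 --ε--> 2, 1 --ε--> 4, 2 --a--> 3, 4 --b--> 5, 3 --ε--> 6, 5 --ε--> 6, 6 --ε--> 1, 6 --ε--> 7.
abb: 7 --a--> 8 --b--> 9 --b--> 10.]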

NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states of N. The construction uses the following operations:

Operation        Definition
ε-closure(s)     set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)     set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)       set of NFA states to which there is a transition on input symbol a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates

while ∃ unmarked state T in Dstates
    mark T
    for each input symbol a
        U = ε-closure(move(T, a))
        if U ∉ Dstates then add U to Dstates unmarked
        Dtrans[T, a] = U
    endfor
endwhile

ε-closure(s0) is the start state of D

A state of D is final if it contains at least one final state in N
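
The algorithm can be sketched in Python as follows, applied to the eleven-state NFA for (a | b)∗abb used in this deck (states 0 to 10, final state 10). Encoding ε as the empty string, keeping unmarked states in a queue, and the function names are all choices made for this example. In general `move` can return the empty set, which then becomes a dead state of the DFA; that case does not arise for this NFA.

```python
from collections import deque

EPS = ""  # label used here for ε-transitions (an encoding choice for this sketch)

# NFA for (a|b)*abb with states 0..10.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
NFA_START, NFA_FINAL, ALPHABET = 0, {10}, "ab"

def eps_closure(states):
    """All NFA states reachable from `states` on ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()) - closure:
            closure.add(t)
            stack.append(t)
    return frozenset(closure)

def move(states, a):
    """All NFA states with a transition on `a` from some state in `states`."""
    return set().union(*(NFA.get((s, a), set()) for s in states))

def subset_construction():
    start = eps_closure({NFA_START})
    dstates, dtrans = {start}, {}
    unmarked = deque([start])
    while unmarked:
        T = unmarked.popleft()      # "mark" T by taking it off the queue
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtrans[(T, a)] = U
    finals = {T for T in dstates if T & NFA_FINAL}
    return start, dstates, dtrans, finals
```

Running `subset_construction()` on this NFA produces five DFA states, and the only final one is the set that contains NFA state 10.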

NFA to DFA using subset construction: example 2
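[Figure: NFA for (a | b)∗abb with states 0 to 10: 0 --ε--> 1, 0 --ε--> 7, 1 --ε--> 2, 1 --ε--> 4, 2 --a--> 3, 4 --b--> 5, 3 --ε--> 6, 5 --ε--> 6, 6 --ε--> 1, 6 --ε--> 7, 7 --a--> 8, 8 --b--> 9, 9 --b--> 10.]

A = {0, 1, 2, 4, 7}
B = {1, 2, 3, 4, 6, 7, 8}
C = {1, 2, 4, 5, 6, 7}
D = {1, 2, 4, 5, 6, 7, 9}
E = {1, 2, 4, 5, 6, 7, 10}

       a    b
A      B    C
B      B    D
C      B    C
D      B    E
E      B    C

[Figure: resulting DFA with states A to E and the transitions listed in the table; A is the start state and E is the accepting state.]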

Limits of regular languages

Not all languages are regular

One cannot construct DFAs to recognize these languages:

• L = {p^k q^k}
• L = {w c w^r | w ∈ Σ∗}

Note: neither of these is a regular expression!

(DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:

• alternating 0’s and 1’s

(ε | 1)(01)∗(ε | 0)
• sets of pairs of 0’s and 1’s
(01 | 10)+
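
The first of these REs can be checked against a direct definition of "alternating" (no two adjacent symbols are equal). The Python sketch below compares the two on all short binary strings; the names `ALTERNATING`, `is_alternating` and `regex_matches_alternation` are invented for this example.

```python
import re
from itertools import product

ALTERNATING = re.compile(r"1?(01)*0?")   # (ε | 1)(01)*(ε | 0)

def is_alternating(word):
    """True if no two adjacent symbols are equal."""
    return all(x != y for x, y in zip(word, word[1:]))

def regex_matches_alternation(n):
    """Compare the RE with the direct definition on all binary strings of length <= n."""
    for length in range(n + 1):
        for chars in product("01", repeat=length):
            w = "".join(chars)
            if (ALTERNATING.fullmatch(w) is not None) != is_alternating(w):
                return False
    return True
```

No counting is needed here: the recognizer only has to remember the previous symbol, which fits in a finite number of states.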

Ramification - Internet Protocol

How does your browser establish a (TCP) connection with a web server?

• The client sends a SYN message to the server.

• In response, the server replies with a SYN-ACK.

• Finally the client sends an ACK back to the server.

Each side tracks this exchange with its own DFA: one in the client and one in the server.
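
A heavily simplified sketch of the two machines in Python is given below. The state names follow the usual TCP connection states (CLOSED, LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED); the event labels and the helper `run` are invented for this example, and timeouts, retransmission, resets and connection teardown are all left out.

```python
# Each machine maps (state, event) -> next state. The event labels are
# invented for this sketch and describe what the endpoint sends or receives.
CLIENT = {
    ("CLOSED", "send SYN"): "SYN-SENT",
    ("SYN-SENT", "recv SYN-ACK, send ACK"): "ESTABLISHED",
}
SERVER = {
    ("LISTEN", "recv SYN, send SYN-ACK"): "SYN-RECEIVED",
    ("SYN-RECEIVED", "recv ACK"): "ESTABLISHED",
}

def run(machine, start, events):
    """Feed events to a DFA; return the final state, or None on an illegal event."""
    state = start
    for ev in events:
        state = machine.get((state, ev))
        if state is None:
            return None
    return state
```

Both machines end in ESTABLISHED only when the three messages arrive in the order listed; an ACK arriving at a server that is still in LISTEN has no transition and is rejected.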

Ramification - Intrusion Detection
Code                                          Operating System
FILE *f;
f = fopen("demo", "r");           ------>     SYS_OPEN
strcpy(...);                      // vulnerability
if (!f)
    printf("Fail to open\n");     ------>     SYS_WRITE
else
    fgets(buf, sizeof buf, f);    ------>     SYS_READ
...

A DFA will be exercised simultaneously with the program on the OS side to detect intrusion.
