Chapter 2: Lexical Analysis
Objective
At the end of this session students will be able to:
Understand the basic roles of the lexical analyzer (LA): Lexical Analysis Versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; and Lexical Errors.
Understand the specification of tokens: Strings and Languages, Operations on Languages, Regular Expressions, and Regular Definitions.
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program and group the characters into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces as output a token of the form {token-name, attribute-value}.
Where:
token-name is an abstract symbol that is used during syntax analysis, and
attribute-value points to an entry in the symbol table for this token.
Example
For example, suppose a source program contains the assignment statement
position = initial + rate * 60
1. position is a lexeme that would be mapped into a token {id, 1}, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position.
The symbol-table entry for an identifier holds information about the identifier,
such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this token needs no attribute-value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token {id, 2}, where 2 points to the symbol-table entry for initial.
Contd…
4. + is a lexeme that is mapped into the token {+}.
5. rate is a lexeme that is mapped into the token {id, 3}, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token {*}.
7. 60 is a lexeme that is mapped into the token {60}.
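To make the token form concrete, here is a minimal Java sketch of how such {token-name, attribute-value} pairs might be represented; the class and field names are illustrative, not taken from any particular compiler:

class Token {
    final String name;         // abstract token name used during syntax analysis, e.g. "id" or "+"
    final Integer symtabEntry; // index of the symbol-table entry, or null when no attribute is needed
    Token(String name, Integer symtabEntry) {
        this.name = name;
        this.symtabEntry = symtabEntry;
    }
    @Override public String toString() {
        return symtabEntry == null ? "{" + name + "}"
                                   : "{" + name + ", " + symtabEntry + "}";
    }
}

// The statement position = initial + rate * 60 would then be emitted as:
// {id, 1} {=} {id, 2} {+} {id, 3} {*} {60}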
Contd…
The parser uses tokens produced by the LA to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which are easy to produce and easy to translate into the target machine.
The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.
The Role of Lexical Analyzer
Fig. The lexical analyzer and the parser: the Lexical Analyzer reads the source program and returns a token each time the Parser calls getNextToken; the Parser passes its output on to semantic analysis, and both components interact with the Symbol Table.
It is common for the lexical analyzer to interact with the symbol table.
In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must return to the parser.
There are several reasons why the analysis portion of a compiler is separated into lexical analysis and parsing:
1. Simplicity of design:- The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved:- A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly (see the sketch after this list).
3. Compiler portability is enhanced:- Input-device-specific peculiarities can be restricted to the lexical analyzer.
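As an illustration of the buffering idea, here is a minimal Java sketch (a single block buffer, not the classic two-buffer scheme; the class name is hypothetical) that reads the input in large chunks so the analyzer does not issue a system read per character:

import java.io.IOException;
import java.io.Reader;

class InputBuffer {
    private final Reader in;
    private final char[] buf = new char[4096]; // refilled one block at a time
    private int len = 0, pos = 0;

    InputBuffer(Reader in) { this.in = in; }

    // Returns the next character, or -1 at end of input.
    int nextChar() throws IOException {
        if (pos == len) {               // buffer exhausted: refill it
            len = in.read(buf, 0, buf.length);
            pos = 0;
            if (len <= 0) return -1;    // end of input
        }
        return buf[pos++];
    }
}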
Tokens, Patterns, and Lexemes
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
It is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
Contd…
One efficient but complex brute-force approach is to read character by character, branching on each character to decide which token the input can still form. For example:

if ((c = nextchar()) == 'c') {
    if ((c = nextchar()) == 'l') {
        // code to handle the rest of either "class" or any identifier that starts with "cl"
    } else if (c == 'a') {
        // code to handle the rest of either "case" or any identifier that starts with "ca"
    } else {
        // code to handle any identifier that starts with c
    }
} else if ((c = nextchar()) == …) {

What we'd like is a set of tools that will allow us to easily create and modify a lexical analyzer that has the same run-time efficiency as the brute-force method.
The first tool that we use to attack this problem is the Deterministic Finite Automaton (DFA).
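For contrast with the hand-written branching above, here is a small DFA-style recognizer in Java for identifiers of the form letter (letter | digit)*; the class name and state numbering are assumptions for illustration:

class IdentDfa {
    // States: 0 = start, 1 = seen a letter (accepting), -1 = reject.
    static boolean isIdentifier(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            switch (state) {
                case 0:  state = (Character.isLetter(c) || c == '_') ? 1 : -1; break;
                case 1:  state = (Character.isLetterOrDigit(c) || c == '_') ? 1 : -1; break;
                default: return false;
            }
            if (state == -1) return false;
        }
        return state == 1;   // accept only if we ended in the accepting state
    }
    public static void main(String[] args) {
        System.out.println(isIdentifier("rate2"));  // true
        System.out.println(isIdentifier("2rate"));  // false
    }
}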
Lexical Analysis and Types of Tokens
Lexical Analyzer - Tokenization
Use tools like Lex, Flex, JavaCC, … This approach is easy to implement but not as efficient as the first approach (a hand-written lexical analyzer).
Lexical Errors
Lexical errors are primarily of two kinds.
Handling Lexical Errors
It is hard for a LA to tell, without the aid of other components, that there is a source-code error.
For instance, if the string fi is encountered for the first time in a Java program, the LA cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier; since fi is a valid lexeme for the token id, it must simply return id to the parser.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.
This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
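A sketch of panic-mode deletion in Java; matchesSomeTokenPrefix is a hypothetical helper standing in for a check against all token patterns:

class PanicMode {
    // Discard characters until some token pattern matches a prefix of what is left.
    static String recover(String input) {
        int i = 0;
        while (i < input.length() && !matchesSomeTokenPrefix(input.substring(i))) {
            i++;                        // delete one more character
        }
        return input.substring(i);      // lexing resumes from here
    }

    // Stand-in for the real pattern matcher: here, any letter, digit, or
    // operator character is assumed to start a well-formed token.
    static boolean matchesSomeTokenPrefix(String rest) {
        return !rest.isEmpty()
            && (Character.isLetterOrDigit(rest.charAt(0))
                || "<>=+-*/(){};".indexOf(rest.charAt(0)) >= 0);
    }
}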
Contd…
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
5. Pre-scanning
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns.
While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation.
ASCII is an important example of an alphabet; it is used in many software systems.
Contd…
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of
symbols in s.
For example, banana is a string of length six.
A language is any countable set of strings over some fixed alphabet.
Abstract languages like ∅, the empty set, or {Ɛ}, the set containing only the empty string, are languages under this definition.
So too are the set of all syntactically well-formed Java programs and the set of all grammatically correct English sentences, although the latter two languages are difficult to specify precisely.
Terms for Parts of String
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from
the beginning of s.
For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not Ɛ and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more symbols from not necessarily consecutive positions of s.
For example, baan is a subsequence of banana.
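These terms map directly onto Java's standard string operations; a small illustrative check using banana (the class name is hypothetical):

class StringTerms {
    public static void main(String[] args) {
        String s = "banana";
        System.out.println(s.startsWith("ban"));  // true: ban is a prefix of banana
        System.out.println(s.endsWith("nana"));   // true: nana is a suffix of banana
        System.out.println(s.contains("nan"));    // true: nan is a substring of banana
        System.out.println(s.length());           // 6:    |banana| = 6
    }
}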
Operations on Languages
In lexical analysis, the most important operations on languages are union, concatenation, and closure.
For example, let L be the set of letters {A, B, …, Z, a, b, …, z} and let D be the set of digits {0, 1, …, 9}; L and D are languages all of whose strings have length one.
Here are some other languages that can be constructed from languages L and D, using these operations:
L ∪ D is the set of letters and digits.
LD is the set of strings consisting of a letter followed by a digit.
L* is the set of all strings of letters, including Ɛ.
L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter, i.e., identifiers.
D+ is the set of all strings of one or more digits.
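A sketch of union, concatenation, and (bounded) closure on small finite languages, using Java sets; since L* is infinite, the closure here is computed only up to n concatenations, and the class name is illustrative:

import java.util.HashSet;
import java.util.Set;

class LangOps {
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> r = new HashSet<>(a);
        r.addAll(b);
        return r;
    }
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> r = new HashSet<>();
        for (String x : a)
            for (String y : b)
                r.add(x + y);       // every string of a followed by every string of b
        return r;
    }
    // L* limited to at most n concatenations; the true closure is infinite.
    static Set<String> closureUpTo(Set<String> l, int n) {
        Set<String> r = new HashSet<>();
        r.add("");                  // Ɛ is always in L*
        Set<String> cur = Set.of("");
        for (int i = 0; i < n; i++) {
            cur = concat(cur, l);
            r.addAll(cur);
        }
        return r;
    }
}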
Regular Expressions
Regular expressions are means for specifying regular languages
Example:
ID → letter (letter | digit)*
letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
Each regular expression is a pattern specifying the form of strings
The regular expressions are built recursively out of smaller regular expressions, using the operators for union (|), concatenation, and Kleene closure (*).
Table: The algebraic laws that hold for arbitrary regular expressions r, s, and t.
Regular Definitions
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
Where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, …, di-1}.
Examples:
(a) Regular Definition for Java Identifiers
ID → letter (letter | digit)*
letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9

(b) Regular Definition for Java statements
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number

(c) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optFrac → . digits | Ɛ
optExp → (E (+ | - | Ɛ) digits) | Ɛ
number → digits optFrac optExp
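Definition (c) carries over almost directly to a Java regular expression; a sketch (the class name is illustrative, and java.util.regex notation replaces the | Ɛ alternatives with ? operators):

import java.util.regex.Pattern;

class UnsignedNumber {
    // digits optFrac optExp, i.e. digits (. digits)? (E (+|-)? digits)?
    static final Pattern NUMBER =
            Pattern.compile("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

    public static void main(String[] args) {
        for (String s : new String[] {"5280", "0.01234", "6.336E4", "1.89E-4"}) {
            System.out.println(s + " matches: " + NUMBER.matcher(s).matches());
        }
    }
}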
Recognition of Tokens
1. Starting point is the language grammar to understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number
2. The next step is to formalize the patterns:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z_]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
3. We also need to handle whitespace:
ws → (blank | tab | newline)+
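A sketch of a tokenizer driven by these patterns, using Java regular expressions; reserved words are found by matching id first and then checking the lexeme, a common trick. All names here (MiniLexer, the group names) are illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    static final Pattern TOKENS = Pattern.compile(
          "(?<ws>[ \\t\\n]+)"
        + "|(?<id>[A-Za-z_][A-Za-z_0-9]*)"                  // reserved words filtered below
        + "|(?<number>[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?)"
        + "|(?<relop><=|>=|<>|<|>|=)");

    public static void main(String[] args) {
        Matcher m = TOKENS.matcher("if x1 <= 6.336E4 then y = 42");
        while (m.lookingAt()) {
            String lex = m.group();
            if (m.group("ws") == null) {                    // ws matches but is not reported
                String name = m.group("id") != null
                        ? (lex.matches("if|then|else") ? lex : "id")
                        : (m.group("number") != null ? "number" : "relop");
                System.out.println(name + ": " + lex);
            }
            m.region(m.end(), m.regionEnd());               // continue after this lexeme
        }
    }
}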
Transition Diagram
Lexemes      Token Names   Attribute Value
Any ws       -             -
if           if            -
then         then          -
else         else          -
Any id       id            Pointer to table entry
Any number   number        Pointer to table entry
<            relop         LT
<=           relop         LE
=            relop         EQ
<>           relop         NE
>            relop         GT
>=           relop         GE
Fig. Transition diagram for reserved words and identifiers
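The relop rows of this table correspond to a transition diagram that is easy to transcribe into code; a sketch (the class and method names are illustrative, not from any particular implementation):

class RelopDiagram {
    // Walk the relop transition diagram over the start of the input.
    // Returns e.g. "relop LE", or null if the input does not start with a relop.
    static String relop(String input) {
        if (input.isEmpty()) return null;
        char c = input.charAt(0);
        boolean more = input.length() > 1;
        switch (c) {
            case '<':
                if (more && input.charAt(1) == '=') return "relop LE";
                if (more && input.charAt(1) == '>') return "relop NE";
                return "relop LT";   // other lookahead: retract, token is <
            case '=':
                return "relop EQ";
            case '>':
                if (more && input.charAt(1) == '=') return "relop GE";
                return "relop GT";   // other lookahead: retract, token is >
            default:
                return null;         // not a relational operator
        }
    }
    public static void main(String[] args) {
        System.out.println(relop("<= y"));  // relop LE
    }
}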
Finite Automata
The lexical analyzer tools use finite automata, at the heart of the transition, to convert the input patterns into a lexical analyzer.
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state and for each symbol of its input alphabet, exactly one edge with that symbol leaving that state.
Both deterministic and nondeterministic finite automata are capable of recognizing the same languages; these are exactly the regular languages, the languages that regular expressions can describe.
Fig. A transition: an edge labeled a leaving state 0.
Example
Fig. Two example transition diagrams over inputs a, b, and c (states q1, q2, q4, …).
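To make the DFA definition concrete, here is a table-driven simulation in Java of a DFA for the classic language (a|b)*abb; the transition table is the standard textbook one, and the class and method names are illustrative:

class AbbDfa {
    // DFA for (a|b)*abb: start state 0, accepting state 3.
    static boolean accepts(String s) {
        int[][] move = {
            //        a  b
            /* 0 */ { 1, 0 },
            /* 1 */ { 1, 2 },
            /* 2 */ { 1, 3 },
            /* 3 */ { 1, 0 },
        };
        int state = 0;
        for (char c : s.toCharArray()) {
            if (c != 'a' && c != 'b') return false;  // symbol outside the alphabet
            state = move[state][c == 'a' ? 0 : 1];   // exactly one transition per symbol
        }
        return state == 3;
    }
    public static void main(String[] args) {
        System.out.println(accepts("aabb"));  // true
        System.out.println(accepts("abab"));  // false
    }
}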
Java Compiler Compiler (JavaCC) is the most popular lexical analyzer and parser generator for Java.
In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation, such as tree building (via a tool called JJTree included with JavaCC).
Flow for using JavaCC
When there is more than one rule that matches the input, JavaCC uses the following strategy:
1. Always match the longest possible string.
2. If two different rules match the same string, use the rule that appears first in the .jj file.
For example, the "else" string matches both the ELSE and IDENTIFIER rules; it is matched as ELSE because that rule appears earlier in the file.
Using the generated TokenManager in JavaCC
To create a Java program that uses the lexical analyzer created by JavaCC, you need to instantiate a variable of type simpleTokenManager, where simple is the name of your .jj file.
The constructor for simpleTokenManager requires an object of type SimpleCharStream.
The constructor for SimpleCharStream requires a standard java.io.InputStream.
Thus, you can use the general lexical analyzer as follows:
Token t;
simpleTokenManager tm;
java.io.InputStream infile;
infile = new java.io.FileInputStream("Inputfile.txt");
tm = new simpleTokenManager(new SimpleCharStream(infile));
t = tm.getNextToken();
while (t.kind != simpleConstants.EOF)
{
    /* Process t */
    t = tm.getNextToken();
}
Thank You ... ?