
Lexical analysis is the first phase of a compiler. It takes the modified source code produced by language preprocessors, written in the form of sentences. The lexical analyzer breaks these syntaxes into a series of tokens, removing any whitespace and comments in the source code.

If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer on demand.

Tokens
A lexeme is a sequence of (alphanumeric) characters in a token. There are predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.

For example, in the C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
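
As a concrete illustration (a sketch of our own, not from the original text), these token categories could be represented in C as follows:

#include <stdio.h>

/* Token categories for the example above (illustrative only). */
typedef enum { KEYWORD, IDENTIFIER, OPERATOR, CONSTANT, SYMBOL } TokenType;

typedef struct {
    TokenType type;
    const char *lexeme;   /* the matched character sequence */
} Token;

int main(void) {
    /* Tokens produced for: int value = 100; */
    Token tokens[] = {
        { KEYWORD,    "int"   },
        { IDENTIFIER, "value" },
        { OPERATOR,   "="     },
        { CONSTANT,   "100"   },
        { SYMBOL,     ";"     },
    };
    const char *names[] = { "keyword", "identifier", "operator", "constant", "symbol" };
    for (int i = 0; i < 5; i++)
        printf("%s (%s)\n", tokens[i].lexeme, names[tokens[i].type]);
    return 0;
}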

Specifications of Tokens
Let us understand how language theory defines the following terms:

Alphabets

Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z} is the set of English language alphabets.

Strings

Any finite sequence of alphabet symbols (characters) is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no symbols, i.e., a string of zero length, is known as the empty string and is denoted by ε (epsilon).

Special symbols

A typical high-level language contains the following symbols:

Arithmetic Symbols     Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation            Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment             =
Special Assignment     +=, /=, *=, -=
Comparison             ==, !=, <, <=, >, >=
Preprocessor           #
Location Specifier     &
Logical                &, &&, |, ||, !
Shift Operator         >>, >>>, <<, <<<

Language

A language is considered a finite set of strings over some finite alphabet. Computer languages are considered finite sets, and mathematically, set operations can be performed on them. Finite languages can be described by means of regular expressions.

Bootstrapping

- Bootstrapping is widely used in compiler development.
- Bootstrapping is used to produce a self-hosting compiler. A self-hosting compiler is a compiler that can compile its own source code.
- A bootstrap compiler is used to compile the compiler, and then this compiled compiler can be used to compile everything else, as well as future versions of itself.

A compiler can be characterized by three languages:

1. Source Language
2. Target Language
3. Implementation Language

The T-diagram shows a compiler SCIT for source S and target T, implemented in language I.

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler SCAA for a subset S of the desired language L, using language A; this compiler runs on machine A.

2. Create a compiler LCSA for language L, written in the subset S of L.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L which runs on machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.

Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the pattern defined by the language rules.

Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as a regular grammar. The language defined by a regular grammar is known as a regular language.

A regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.

There are a number of algebraic laws that are obeyed by regular expressions, which can be used
to manipulate regular expressions into equivalent forms.
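
For example (standard identities, stated here for illustration): | is commutative and associative, so r|s = s|r and r|(s|t) = (r|s)|t; concatenation is associative and distributes over |, so r(s|t) = rs|rt; ε is the identity for concatenation, so εr = rε = r; and closure is idempotent, so (r*)* = r*.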

Operations
The various operations on languages are:

- Union of two languages L and M is written as

  L U M = {s | s is in L or s is in M}

- Concatenation of two languages L and M is written as

  LM = {st | s is in L and t is in M}

- The Kleene Closure of a language L is written as

  L* = zero or more occurrences of the language L
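
As a concrete illustration (a sketch of our own), the following C program enumerates the union and concatenation of two small finite languages, and the members of L* up to length 2:

#include <stdio.h>

int main(void) {
    const char *L[] = { "a", "b" };   /* language L = {a, b} */
    const char *M[] = { "0", "1" };   /* language M = {0, 1} */
    char buf[8];

    /* Union: every string that is in L or in M. */
    printf("L U M : ");
    for (int i = 0; i < 2; i++) printf("%s ", L[i]);
    for (int i = 0; i < 2; i++) printf("%s ", M[i]);

    /* Concatenation: every string st with s in L and t in M. */
    printf("\nLM    : ");
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            snprintf(buf, sizeof buf, "%s%s", L[i], M[j]);
            printf("%s ", buf);
        }

    /* Kleene closure is infinite; list members of L* of length <= 2.
       "e" stands for the empty string ε. */
    printf("\nL*    : e ");
    for (int i = 0; i < 2; i++) printf("%s ", L[i]);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            snprintf(buf, sizeof buf, "%s%s", L[i], L[j]);
            printf("%s ", buf);
        }
    printf("\n");
    return 0;
}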

Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then

- Union : (r)|(s) is a regular expression denoting L(r) U L(s)
- Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
- Kleene closure : (r)* is a regular expression denoting (L(r))*
- (r) is a regular expression denoting L(r)

Precedence and Associativity

- *, concatenation (.), and | (pipe sign) are left associative.
- * has the highest precedence.
- Concatenation (.) has the second highest precedence.
- | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expressions

If x is a regular expression, then:

x* means zero or more occurrences of x,

i.e., it can generate { ε, x, xx, xxx, xxxx, … }

x+ means one or more occurrences of x,

i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x+ = x.x*

x? means at most one occurrence of x,

i.e., it can generate either {x} or {ε}.

[a-z] is the set of all lower-case letters of the English language.

[A-Z] is the set of all upper-case letters of the English language.

[0-9] is the set of all decimal digits used in mathematics.

Representing occurrences of symbols using regular expressions

letter = [a-z] | [A-Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [+ | -]

Representation of language tokens using regular expressions


Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*
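
These definitions carry over directly to POSIX extended regular expressions. The following sketch (our own illustration, using the C library's <regex.h>) checks candidate lexemes against the Decimal and Identifier patterns:

#include <regex.h>
#include <stdio.h>

/* Test whether a lexeme matches a POSIX extended regular expression. */
static int matches(const char *pattern, const char *lexeme) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
    int ok = (regexec(&re, lexeme, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    const char *decimal    = "^[+-]?[0-9]+$";          /* (sign)?(digit)+ */
    const char *identifier = "^[A-Za-z][A-Za-z0-9]*$"; /* (letter)(letter | digit)* */
    const char *samples[] = { "100", "-42", "value", "2bad", "x9" };
    for (int i = 0; i < 5; i++)
        printf("%-5s decimal=%d identifier=%d\n", samples[i],
               matches(decimal, samples[i]),
               matches(identifier, samples[i]));
    return 0;
}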

The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.

Finite Automata
A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. A finite automaton is a recognizer for regular expressions. When a regular expression string is fed into a finite automaton, it changes its state for each literal. If the input string is successfully processed and the automaton reaches its final state, the input is accepted, i.e., the string just fed is said to be a valid token of the language in hand.

The mathematical model of a finite automaton consists of:

- Finite set of states (Q)
- Finite set of input symbols (Σ)
- One start state (q0)
- Set of final states (qf)
- Transition function (δ)

The transition function (δ) maps a state and an input symbol to a state: δ : Q × Σ ➔ Q.

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automaton (FA).

- States : States of the FA are represented by circles. The name of the state is written inside the circle.
- Start state : The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
- Intermediate states : All intermediate states have at least two arrows; one pointing to and another pointing out from them.
- Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles. It may have any odd number of arrows pointing to it and an even number of arrows pointing out from it; the number of incoming arrows is one greater than the number of outgoing arrows, i.e., odd = even + 1.
- Transition : The transition from one state to another happens when the desired symbol is found in the input. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow from the state to itself is drawn.

Example : We assume the FA accepts any three-digit binary value ending in the digit 1. FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
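
The transition diagram itself is not reproduced here, but such a recognizer can be sketched in C as a table-driven DFA (the state numbering below is our own assumption, not taken from the original diagram):

#include <stdio.h>

/* DFA accepting three-digit binary strings that end in 1.
   States 0, 1, 2 count the digits read so far; state 3 is the final
   state (3 digits, last was 1); state 4 means 3 digits ending in 0;
   state 5 is a dead state for inputs longer than 3 digits. */
static int delta[6][2] = {
    /* on '0'  on '1' */
    {   1,       1   },   /* state 0: read first digit   */
    {   2,       2   },   /* state 1: read second digit  */
    {   4,       3   },   /* state 2: third digit decides */
    {   5,       5   },   /* state 3: too long -> dead   */
    {   5,       5   },   /* state 4: too long -> dead   */
    {   5,       5   },   /* state 5: dead               */
};

static int accepts(const char *s) {
    int state = 0;                              /* start state q0 */
    for (; *s; s++) {
        if (*s != '0' && *s != '1') return 0;   /* symbol not in Σ */
        state = delta[state][*s - '0'];
    }
    return state == 3;                          /* final state qf */
}

int main(void) {
    const char *tests[] = { "101", "110", "1", "0111" };
    for (int i = 0; i < 4; i++)
        printf("%s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}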

Longest Match Rule

When the lexical analyzer reads the source code, it scans the code letter by letter; and when it encounters a whitespace, an operator symbol, or a special symbol, it decides that a word is completed.

For example:

int intvalue;

While scanning both lexemes up to 'int', the lexical analyzer cannot determine whether it is the keyword int or the prefix of the identifier intvalue.

The Longest Match Rule states that the lexeme scanned should be determined based on the longest match among all the tokens available.

The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of a language is given priority over user input. That is, if the lexical analyzer finds a lexeme that matches an existing reserved word, it is classified as that reserved word rather than as a user-defined identifier (so an attempt to use a keyword as an identifier results in an error).
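
Both rules can be sketched together in C (our own illustration): scan the longest possible run of letters and digits first, then check the lexeme against a keyword table before classifying it as an identifier:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "int", "if", "else", "while", NULL };

static int is_keyword(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0) return 1;
    return 0;
}

/* Longest Match Rule: consume letters/digits as long as possible,
   then apply rule priority: keywords beat identifiers. */
static const char *scan_word(const char **src, char *buf, size_t cap) {
    size_t n = 0;
    while (isalnum((unsigned char)**src) && n + 1 < cap)
        buf[n++] = *(*src)++;
    buf[n] = '\0';
    return is_keyword(buf) ? "keyword" : "identifier";
}

int main(void) {
    const char *input = "intvalue";       /* from: int intvalue; */
    char lexeme[64];
    const char *kind = scan_word(&input, lexeme, sizeof lexeme);
    printf("%s (%s)\n", lexeme, kind);    /* prints: intvalue (identifier) */
    return 0;
}

Because the scanner keeps consuming past "int" to the longest match "intvalue", the keyword table is consulted only on the complete lexeme, so the keyword int is not reported by mistake.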

Optimization of DFA

To optimize (minimize) a DFA, follow these steps:

Step 1: Remove all the states that are unreachable from the initial state via any sequence of transitions of the DFA.

Step 2: Draw the transition table for all the remaining states.

Step 3: Now split the transition table into two tables T1 and T2. T1 contains all the final states and T2 contains the non-final states.

Step 4: Find the similar rows of T1, i.e., states q and r such that:

1. δ (q, a) = p
2. δ (r, a) = p

That is, find two states which transit to the same state on every input symbol (here a and b), and remove one of them.

Step 5: Repeat step 4 until no similar rows remain in the transition table T1.

Step 6: Repeat steps 4 and 5 for table T2 as well.

Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the transition table of the minimized DFA.
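
The row-merging idea of steps 4-6 can be sketched in C as follows (a simplified, single-pass illustration for a small hardcoded DFA; as step 5 notes, in general the merge pass is repeated until no similar rows remain, and this sketch assumes representative states are not themselves merged):

#include <stdio.h>

#define NSTATES 4
#define NSYMS   2

/* Example DFA over {0,1}; the transition table values are assumed. */
static int delta[NSTATES][NSYMS] = {
    { 1, 2 },   /* q0 */
    { 1, 3 },   /* q1 */
    { 1, 3 },   /* q2: identical row to q1 -> can be merged */
    { 3, 3 },   /* q3 */
};
static int is_final[NSTATES] = { 0, 0, 0, 1 };

int main(void) {
    int merged_into[NSTATES];
    for (int q = 0; q < NSTATES; q++) merged_into[q] = q;

    /* Merge q into r when both are in the same class (final/non-final)
       and their rows agree on every input symbol. */
    for (int q = 0; q < NSTATES; q++)
        for (int r = 0; r < q; r++) {
            if (is_final[q] != is_final[r]) continue;
            int same = 1;
            for (int a = 0; a < NSYMS; a++)
                if (delta[q][a] != delta[r][a]) same = 0;
            if (same) merged_into[q] = r;
        }

    /* Redirect every transition to the representative state. */
    for (int q = 0; q < NSTATES; q++)
        for (int a = 0; a < NSYMS; a++)
            delta[q][a] = merged_into[delta[q][a]];

    /* Print the combined transition table of the reduced DFA. */
    for (int q = 0; q < NSTATES; q++) {
        if (merged_into[q] != q) continue;   /* skip removed states */
        printf("q%d: on 0 -> q%d, on 1 -> q%d%s\n",
               q, delta[q][0], delta[q][1], is_final[q] ? " (final)" : "");
    }
    return 0;
}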

Example
Solution:

Step 1: In the given DFA, q2 and q4 are the unreachable states, so remove them.

Step 2: Draw the transition table for the rest of the states.

Step 3:

Now divide the rows of the transition table into two sets:

1. One set contains those rows which start from non-final states.

2. The other set contains those rows which start from final states.

Step 4: Set 1 has no similar rows, so set 1 stays the same.

Step 5: In set 2, row 1 and row 2 are similar, since q3 and q5 transit to the same state on 0 and 1. So remove q5 and replace it by q3 in the rest of the table.

Step 6: Now combine set 1 and set 2. The result is the transition table of the minimized DFA.

Transition diagram of minimized DFA:

Fig: Minimized DFA

LEX

- Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
- The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
- Lex reads a specification of the lexical analyzer and produces C source code implementing it.

The function of Lex is as follows:

- First, the lexical analyzer specification is written as a program lex.l in the Lex language. The Lex compiler then runs on the lex.l program and produces a C program lex.yy.c.
- Next, the C compiler compiles the lex.yy.c program and produces an object program a.out.
- a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

Lex file format

A Lex program is separated into three sections by %% delimiters. The format of the Lex source is as follows:

{ definitions }
%%
{ rules }
%%
{ user subroutines }

Definitions include declarations of constants, variables, and regular definitions.

Rules define statements of the form p1 {action1} p2 {action2} ... pn {actionn},

where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. The subroutines can be compiled separately and loaded with the lexical analyzer.
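
For illustration, here is a minimal Lex program in this three-section format (a sketch of our own, assuming the classic lex/flex toolchain) that recognizes the identifier and decimal patterns defined earlier:

%{
#include <stdio.h>   /* definitions section: C declarations */
%}
letter  [a-zA-Z]
digit   [0-9]
%%
{letter}({letter}|{digit})*    { printf("identifier: %s\n", yytext); }
[+-]?{digit}+                  { printf("decimal: %s\n", yytext); }
[ \t\n]+                       { /* skip whitespace */ }
.                              { printf("symbol: %s\n", yytext); }
%%
/* user subroutines section */
int yywrap(void) { return 1; }

int main(void) {
    yylex();   /* run the generated lexical analyzer on standard input */
    return 0;
}

Running lex on this file produces lex.yy.c, and compiling lex.yy.c with the C compiler produces the a.out analyzer, matching the pipeline described above.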
