Compiler 2
Compiler 2
Compiler 2
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer when it demands.
Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Strings
Any finite sequence of alphabets (characters) is called a string. Length of the string is the total
number of occurrence of alphabets, e.g., the length of the string tutorialspoint is 14 and is
denoted by |tutorialspoint| = 14. A string having no alphabets, i.e. a string of zero length is
known as an empty string and is denoted by ε (epsilon).
Special symbols
Language
A language is considered as a finite set of strings over some finite set of alphabets. Computer
languages are considered as finite sets, and mathematically set operations can be performed on
them. Finite languages can be described by means of regular expressions.
Bootstrapping
1. Source Language
2. Target Language
3. Implementation Language
1. Create a compiler SCAA for subset, S of the desired language, L using language "A" and that
compiler runs on machine A.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme that
belong to the language in hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for
finite strings of symbols. The grammar defined by regular expressions is known as regular
grammar. The language defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of
strings, so regular expressions serve as names for a set of strings. Programming language tokens
can be described by regular languages. The specification of regular expressions is an example of
a recursive definition. Regular languages are easy to understand and have efficient
implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can be used
to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
L U M = {s | s is in L or s is in M}
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.
Finite Automata
Finite automata is a state machine that takes a string of symbols as input and changes its state
accordingly. Finite automata is a recognizer for regular expressions. When a regular expression
string is fed into finite automata, it changes its state for each literal. If the input string is
successfully processed and the automata reaches its final state, it is accepted, i.e., the string just
fed was said to be a valid token of the language in hand.
The transition function (δ) maps the finite set of state (Q) to a finite set of input symbols (Σ), Q ×
Σ➔Q
States : States of FA are represented by circles. State names are of the state is written
inside the circle.
Start state : The state from where the automata starts, is known as start state. Start state
has an arrow pointed towards it.
Intermediate states : All intermediate states has at least two arrows; one pointing to and
another pointing out from them.
Final state : If the input string is successfully parsed, the automata is expected to be in
this state. Final state is represented by double circles. It may have any odd number of
arrows pointing to it and even number of arrows pointing out from it. The number of odd
arrows are one greater than even, i.e. odd = even+1.
Transition : The transition from one state to another state happens when a desired
symbol in the input is found. Upon transition, automata can either move to next state or
stay in the same state. Movement from one state to another is shown as a directed arrow,
where the arrows points to the destination state. If automata stays on the same state, an
arrow pointing from a state to itself is drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1. FA = {Q(q0, qf),
Σ(0,1), q0, qf, δ}
For example:
int intvalue;
While scanning both lexemes till ‘int’, the lexical analyzer cannot determine whether it is a
keyword int or the initials of identifier int value.
The Longest Match Rule states that the lexeme scanned should be determined based on the
longest match among all the tokens available.
The lexical analyzer also follows rule priority where a reserved word, e.g., a keyword, of a
language is given priority over user input. That is, if the lexical analyzer finds a lexeme that
matches with any existing reserved word, it should generate an error.
Optimization of DFA
To optimize the DFA you have to follow the various steps. These are as follows:
Step 1: Remove all the states that are unreachable from the initial state via any set of the
transition of DFA.
Step 3: Now split the transition table into two tables T1 and T2. T1 contains all final states and
T2 contains non-final states.
1. δ (q, a) = p
2. δ (r, a) = p
That means, find the two states which have same value of a and b and remove one of them.
Step 5: Repeat step 3 until there is no similar rows are available in the transition table T1.
Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the
transition table of minimized DFA.
Example
Solution:
Step 1: In the given DFA, q2 and q4 are the unreachable states so remove them.
Step 3:
1. One set contains those rows, which start from non-final sates:
2. Other set contains those rows, which starts from final states.
LEX
Lex is a program that generates lexical analyzer. It is used with YACC parser generator.
The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
It reads the input stream and produces the source code as output through implementing the
lexical analyzer in the C program.
Firstly lexical analyzer creates a program lex.1 in the Lex language. Then Lex compiler runs the
lex.1 program and produces a C program lex.yy.c.
Finally C compiler runs the lex.yy.c program and produces an object program a.out.
a.out is lexical analyzer that transforms an input stream into a sequence of tokens.
{ definitions }
1. %%
2. { rules }
3. %%
4. { user subroutines }
Where pi describes the regular expression and action1 describes the actions what action the
lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. The subroutine can be loaded
with the lexical analyzer and compiled separately.