
UNIT -1

INTRODUCTION TO COMPILERS

1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM

Figure 1.1.1
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros
that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text, as in the example following this list.
3. Rational preprocessor: These preprocessors augment older
languages with more modern flow-of-control and data-structuring
facilities.
4. Language extensions: These preprocessors attempt to add
capabilities to the language by what amounts to built-in macros.
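As a brief illustration of macro processing and file inclusion, here is a small, hedged C fragment (the macro names are made up for illustration); the C preprocessor expands it before the compiler proper ever runs:

    #include <stdio.h>            /* file inclusion: the header's text is copied in */

    #define PI 3.1416             /* simple macro: a shorthand for a constant       */
    #define SQUARE(x) ((x)*(x))   /* parameterized macro, expanded textually        */

    int main(void)
    {
        /* After preprocessing this line reads:
           printf("area = %f\n", 3.1416 * ((2.0)*(2.0)));               */
        printf("area = %f\n", PI * SQUARE(2.0));
        return 0;
    }

Because the substitutions are purely textual, the compiler itself never sees the names PI or SQUARE.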
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler is reporting errors in the source program to the programmer.

Source program -> Compiler -> Target program
(error messages are reported to the programmer)
Executing a program written in an HLL programming language basically consists of two parts. The source program must first be compiled, that is, translated into an object program. Then the resulting object program is loaded into memory and executed.

Source program -> Compiler -> Object program
Input -> Object program -> Output

Assembler: Programmers found it difficult to write or read programs in machine language. They began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).

Interpreter: An interpreter is a program that appears to execute a source program as if it were machine language.

Figure 1.1.2
Languages such as BASIC, COBOL, and LISP can be translated using
interpreters. Java also uses an interpreter. The process of interpretation can be
carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
 Modifications to the user program can easily be made and implemented as execution proceeds.
 Debugging a program and finding errors is a simpler task when the program is interpreted.
 The interpreter for the language makes it machine independent.
Disadvantages:
 The execution of the program is slower, and memory consumption is higher.
Loader and Link-editor:
 Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed.
 This would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate his program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
 "A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into an object form that the loader could "relocate" directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
Translator
 A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer. The important roles of a translator are:
Translating the HLL program input into an equivalent machine language (ML) program.
Providing diagnostic messages wherever the programmer violates the specification of the HLL.

1.2 STRUCTURE OF THE COMPILER


Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two phases of compilation.
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis (Machine Dependent/Language Independent)

Phases of a Compiler
The compilation process is partitioned into a number of sub-processes called 'phases'.

Lexical Analyzer: Lexical Analyzer reads the source program character by character
and returns the tokens of the source program. A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).
Figure 1.2.1
Syntax Analyzer: A Syntax Analyzer creates the syntactic structure (generally a parse
tree) of the given program.
 A syntax analyzer is also called a parser.
 A parse tree describes a syntactic structure.
 In a parse tree, all terminals are at the leaves.
 All inner nodes are non-terminals of a context-free grammar.
Semantic Analyzer
 It checks the source program for semantic errors and collects type information for the subsequent code-generation phase.
 It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.
 An important component of semantic analysis is type checking.
 Normally, semantic information cannot be represented by the context-free language used in syntax analyzers.
 When the context-free grammars used in syntax analysis are integrated with attributes (semantic rules), the result is a syntax-directed translation and attribute grammars.
Intermediate Code Generation
 A compiler may produce explicit intermediate code representing the source program.
 This intermediate code is generally machine (architecture) independent, but its level is close to the level of machine code.
 An intermediate form called "three-address code" is like the assembly language for a machine in which every memory location can act like a register. Three-address code consists of a sequence of instructions, each of which has at most three operands.
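As a small illustration (the variable names are arbitrary), the assignment a := b - c * 60 might be translated into three-address code such as:

    t1 := c * 60
    t2 := b - t1
    a  := t2

Each instruction has at most one operator on its right-hand side, and the compiler-generated temporaries t1 and t2 hold intermediate results.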
Code Optimizer
 The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space.
 The code optimization phase attempts to improve the intermediate code, so that
faster-running machine code will result.
 Optimization may involve:
a. Detection and removal of dead (unreachable) code
b. Calculation of constant expressions and terms
c. Collapsing of repeated expressions into temporary storage.
d. Loop controlling
e. Moving code outside of loops
f. Removal of unnecessary temporary variables.
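For instance, continuing the three-address code sketch given earlier, an optimizer could eliminate the unnecessary temporary t2 and produce the shorter, equivalent sequence:

    t1 := c * 60
    a  := b - t1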
Code Generator
 Produces the target language in a specific architecture.
 The target program is normally a relocatable object file containing the machine codes.
 This phase involves:
- Allocation of registers and memory
- Generation of correct references
- Generation of correct types
- Generation of machine code.

Symbol table
 An essential function of a compiler is to record the identifiers used in the source
program and collect information about various attributes of each identifier.
 A symbol table is a data structure containing a record for each identifier, with fields
for the attributes of the identifier.
 The data structure allows us to find the record for each identifier quickly and to store
or retrieve data from that record quickly.
 When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table.
 The attributes of an identifier cannot normally be determined during lexical analysis.
For example, in a Pascal declaration like
var position, initial, rate : real ;
The type real is not known when position, initial, and rate are seen by the lexical
analyzer. The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways.
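A minimal sketch of one symbol-table record in C, assuming purely illustrative field names, might look like this:

    /* One symbol-table record; the fields shown are illustrative only. */
    struct symbol {
        char name[32];   /* the identifier's lexeme                     */
        int  type;       /* e.g. a code for integer, real, ...          */
        int  scope;      /* block or procedure in which it is declared  */
        int  offset;     /* storage location, filled in by later phases */
    };

In practice the records are kept in a structure such as a hash table, so that finding, storing, and retrieving a record is fast.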
Error Detection and Reporting
 Each compiler phase can encounter errors. However, after detecting an error, a
phase must somehow deal with that error, so that compilation can proceed, allowing
further errors in the source program to be detected.
 The lexical phase can detect errors where the characters remaining in the input do
not form any token of the language.
 Errors where the token stream violates the structure rules (syntax) of the language
are determined by the syntax analysis phase.
 During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation.
Figure 1.2.1.1
1.3 COUSINS OF COMPILER
1. Preprocessor
2. Assembler
3. Loader and Link-editor
Preprocessor
A preprocessor is a program that processes its input data to produce output that is
used as input to another program. The output is said to be a preprocessed form of the
input data, which is often used by some subsequent programs like compilers.
They may perform the following functions:
1. Macro processing 3. Rational Preprocessors
2. File Inclusion 4. Language extension
1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should
be mapped to an output sequence according to a defined procedure. The mapping
process that instantiates a macro into a specific output sequence is known as macro
expansion.
2. File Inclusion:
Preprocessor includes header files into the program text. When the preprocessor
finds an #include directive it replaces it by the entire content of the specified file.
3. Rational Preprocessors:
These processors change older languages with more modern flow-of-control and
data-structuring facilities.
4. Language extension
These processors attempt to add capabilities to the language by what amounts to
built-in macros. For example, the language Equel is a database query language
embedded in C.
Assembler
Assembler creates object code by translating assembly instruction mnemonics
into machine code. There are two types of assemblers:
One-pass assemblers go through the source code once and assume that all
symbols will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code.
Linker and Loader
A linker or link editor is a program that takes one or more objects generated by
a compiler and combines them into a single executable program. Three tasks of the
linker are
1. Searches the program to find library routines used by program, e.g. printf(), math
routines.
2. Determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references
3. Resolves references among files.
A loader is the part of an operating system that is responsible for loading
programs in memory, one of the essential stages in the process of starting a program.

1.4 GROUPING OF THE PHASES


A compiler can be grouped into front and back ends:

Front end: analysis (machine independent)


These normally include lexical and syntactic analysis, the creation of the symbol
table, semantic analysis and the generation of intermediate code. It also includes error
handling that goes along with each of these phases.

Back end: synthesis (machine dependent)


It includes code optimization phase and code generation along with the
necessary error handling and symbol table operations.
Compiler passes
A collection of phases is done only once (single pass) or multiple times (multi-pass).
Single pass: usually requires everything to be defined before being used in the source program.
Multi-pass: the compiler may have to keep the entire program representation in memory.
Several phases can be grouped into one single pass and the activities of these phases
are interleaved during the pass. For example, lexical analysis, syntax analysis, semantic
analysis and intermediate code generation might be grouped into one pass.

1.5 COMPILER CONSTRUCTION TOOLS


In addition to software-development tools, other more specialized tools have been
developed for helping implement various phases of a compiler. Some general tools
have been created for the automatic design of specific compiler components.
These tools use specialized languages for specifying and implementing the
component, and many use algorithms that are quite sophisticated.
The most successful tools are those that hide the details of the generation algorithm
and produce components that can be easily integrated into the remainder of a compiler.
The following is a list of some useful compiler-construction tools:
Parser generators
These produce syntax analyzers, normally from input that is based on a context-free grammar. In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but also a large fraction of the intellectual effort of writing a compiler. This phase is now considered one of the easiest to implement.
Many parser generators utilize powerful parsing algorithms that are too complex
to be carried out by hand.
Scanner generators
These automatically generate lexical analyzers, normally from a specification
based on regular expressions. The basic organization of the resulting lexical analyzer is
in effect a finite automaton.
Syntax-directed translation engines
These produce collections of routines that walk the parse tree, generating
intermediate code.
The basic idea is that one or more "translations" are associated with each node
of the parse tree, and each translation is defined in terms of translations at its neighbor
nodes in the tree.
Automatic code generators
Such a tool takes a collection of rules that define the translation of each
operation of the intermediate language into the machine language for the target
machine.
The rules must include sufficient detail that we can handle the different possible
access methods for data; e.g. variables may be in registers, in a fixed (static) location in
memory, or may be allocated a position on a stack. The basic technique is "template
matching”.
The intermediate code statements are replaced by "templates" that represent
sequences of machine instructions, in such a way that the assumptions about storage of
variables match from template to template.
Since there are usually many options regarding where variables are to be placed
(e.g., in one of several registers or in memory), there are many possible ways to "tile"
intermediate code with a given set of templates, and it is necessary to select a good
tiling without a combinatorial explosion in running time of the compiler.
Data-flow engines
Much of the information needed to perform good code optimization involves "data-flow analysis," the gathering of information about how values are transmitted from one part of a program to each other part.
Different tasks of this nature can be performed by essentially the same routine,
with the user supplying details of the relationship between intermediate code statements
and the information being gathered.

1.6 LEXICAL ANALYSIS


A simple way to build lexical analyzer is to construct a diagram that illustrates
the structure of the tokens of the source language, and then to hand-translate the
diagram into a program for finding tokens. Efficient lexical analyzers can be produced in
this manner.
Role of Lexical Analyser
The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. As in the figure, upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Figure 1.6.1

Interaction of lexical analyzer with parser


Since the lexical analyzer is the part of the compiler that reads the source text, it
may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank,
tab, and new line character. Another is correlating error messages from the compiler
with the source program.
Issues in Lexical Analysis
There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing
 Simpler design is the most important consideration.
 The separation of lexical analysis from syntax analysis often allows us to simplify
one or the other of these phases.
 Compiler efficiency is improved.
 Compiler portability is enhanced.

Tokens Patterns and Lexemes.


There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set.
In most programming languages, the following constructs are treated as tokens:
keywords, operators, identifiers, constants, literal strings, and punctuation symbols such
as parentheses, commas, and semicolons.
Lexeme
A collection or group of characters forming a token is called a lexeme. A lexeme is a sequence of characters in the source program that is matched by the pattern for the token. For example, in the Pascal statement const pi = 3.1416; the substring pi is a lexeme for the token identifier.
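A typical table of tokens, sample lexemes, and the informal patterns that describe them (reconstructed here along the lines of the standard textbook example) is:

    Token     Sample lexemes             Informal description of pattern
    const     const                      the string const
    if        if                         the string if
    relop     <, <=, =, <>, >, >=        < or <= or = or <> or > or >=
    id        pi, count, D2              letter followed by letters and digits
    num       3.1416, 0, 6.02E23         any numeric constant
    literal   "core dumped"              any characters between " and " except "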

Patterns
A pattern is a rule describing a set of lexemes that can represent a particular
token in source program. The pattern for the token const in the above table is just the
single string const that spells out the keyword.

Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs to be in fixed positions on the input line. Thus the alignment of a lexeme may be important in determining the correctness of a source program.

Attributes of Token
The lexical analyzer returns to the parser a representation for the token it has
found. The representation is an integer code if the token is a simple construct such as a
left parenthesis, comma, or colon. The representation is a pair consisting of an integer
code and a pointer to a table if the token is a more complex element such as an
identifier or constant.
The integer code gives the token type; the pointer points to the value of that token. Pairs are also returned whenever we wish to distinguish between instances of a token.
The attributes influence the translation of tokens.
i) Constant : value of the constant
ii) Identifiers: pointer to the corresponding symbol table entry.
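For example (following the standard textbook illustration), the FORTRAN statement E = M * C ** 2 would be handed to the parser as the sequence of token-attribute pairs:

    <id, pointer to the symbol-table entry for E>
    <assign_op, >
    <id, pointer to the symbol-table entry for M>
    <mult_op, >
    <id, pointer to the symbol-table entry for C>
    <exp_op, >
    <num, integer value 2>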
1.7 LEXICAL ERRORS
A character sequence which is not possible to scan into any valid token is a lexical
error. Important facts about the lexical error:
 Lexical errors are not very common, but they should be managed by a scanner.
 Misspellings of identifiers, operators, and keywords are considered lexical errors.
 Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.
Error Recovery in Lexical Analyzer
Here, are a few most common error recovery techniques:
 Remove one character from the remaining input.
 In panic mode, successive characters are ignored until we reach a well-formed token.
 Insert a missing character into the remaining input.
 Replace a character with another character.
 Transpose two adjacent characters.

1.8 INPUT BUFFERING


The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined.
For this and other reasons, it is desirable for the lexical analyzer to read its input
from an input buffer. The figure below shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A look-ahead pointer scans ahead of the beginning point, until the token is discovered.
We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read or at the symbol it is ready to read.
The distance which the look ahead pointer may have to travel past the actual
token may be large. For example, in a PL/I program we may see:

DECLARE (ARG1, ARG2… ARG n)


Without knowing whether DECLARE is a keyword or an array name until we see the
character that follows the right parenthesis. In either case, the token itself ends at the
second E. If the look ahead pointer travels beyond the buffer half in which it began, the
other half must be loaded with the next characters from the source file.
Since the buffer shown in above figure is of limited size there is an implied
constraint on how much look ahead can be used before the next token is discovered.
In the above example, if the look ahead traveled to the left half and all the way
through the left half to the middle, we could not reload the right half, because we would
lose characters that had not yet been grouped into tokens.
While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
BUFFER PAIRS
· A buffer is divided into two N-character halves, as shown below

Figure 1.8.1

Buffer Pairs

 Each buffer is of the same size N, and N is usually the number of characters on
one block. E.g., 1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme beginning, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme.
After the lexeme is recorded as an attribute value of a token returned to the
parser, lexeme beginning is set to the character immediately after the lexeme just
found.

Advancing forward pointer


Advancing forward pointer requires that we first test whether we have reached
the end of one of the buffers, and if so, we must reload the other buffer from the input,
and move forward to the beginning of the newly loaded buffer.
If the end of second buffer is reached, we must again reload the first buffer with
input and the pointer wraps to the beginning of the buffer.
Code to advance forward pointer:
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
Sentinels
For each character read, we make two tests: one for the end of the buffer, and
one to determine what character is read. We can combine the buffer-end test with the
test for the current character if we extend each buffer to hold a sentinel character at the
end.
The sentinel is a special character that cannot be part of the source program, and
a natural choice is the character eof. The sentinel arrangement is as shown below:
Figure 1.8.2 Sentinels at end of each buffer half
Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
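The same scheme can be sketched in C. The following is only a minimal sketch: it assumes '\0' is used as the sentinel, that a hypothetical helper fill_half() reads one half from standard input, and that the first half has already been loaded (by calling fill_half(buf)) before scanning begins.

    #include <stdio.h>

    #define N 4096              /* size of each buffer half */
    #define SENTINEL '\0'       /* assumed never to occur in the source text */

    static char buf[2 * N + 2]; /* two halves, each followed by a sentinel slot */
    static long fwd = -1;       /* index of the character last read */

    /* Hypothetical helper: load up to N characters into the given half and
       place the sentinel right after the last character actually read. */
    static void fill_half(char *half)
    {
        size_t n = fread(half, 1, N, stdin);
        half[n] = SENTINEL;
    }

    /* Advance the forward pointer and return the next character, or EOF. */
    static int next_char(void)
    {
        fwd++;
        if (buf[fwd] == SENTINEL) {
            if (fwd == N) {                  /* sentinel at end of first half  */
                fill_half(buf + N + 1);      /* reload second half             */
                fwd = N + 1;
            } else if (fwd == 2 * N + 1) {   /* sentinel at end of second half */
                fill_half(buf);              /* reload first half              */
                fwd = 0;                     /* wrap to the beginning          */
            } else {
                return EOF;                  /* eof within a half: end of input */
            }
            if (buf[fwd] == SENTINEL)
                return EOF;                  /* the reload read no characters  */
        }
        return (unsigned char)buf[fwd];
    }

With this arrangement only one comparison (against the sentinel) is made per character in the common case, exactly as described above.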

1.9 SPECIFICATION OF TOKENS


There are 3 specifications of tokens:
1 Strings
2 Language
3 Regular expression

Strings and Languages


 An alphabet or character class is a finite set of symbols.
 A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
 A language is any countable set of strings over some fixed alphabet.
 In language theory, the terms "sentence" and "word" are often used as synonyms
for "string." The length of a string s, usually written |s|, is the number of
occurrences of symbols in s.
 For example, banana is a string of length six. The empty string, denoted ε, is the
string of length zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more
symbols from the end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more
symbols from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively of s that are not ε or not
equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s
6. For example, baan is a subsequence of banana.
Operations on languages
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages. Let L = {0, 1} and S = {a, b, c}.
1. Union: L ∪ S = {0, 1, a, b, c}
2. Concatenation: LS = {0a, 0b, 0c, 1a, 1b, 1c}
3. Kleene closure: L* = the set of all strings of 0's and 1's, including the empty string ε
4. Positive closure: L+ = the set of all strings of 0's and 1's, excluding ε

Regular Expressions
 Each regular expression r denotes a language L(r).
 Here are the rules that define the regular expressions over some alphabet Σ and
the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole
member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a. (r)|(s) is a regular expression denoting the language L(r) U L(s).
b. (r)(s) is a regular expression denoting the language L(r)L(s).
c. (r)* is a regular expression denoting (L(r))*.
d. (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
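For example, under these rules:
 a|b denotes the language {a, b}.
 (a|b)(a|b) denotes {aa, ab, ba, bb}.
 a* denotes {ε, a, aa, aaa, …}.
 (a|b)* denotes the set of all strings of a's and b's, including the empty string.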

Regular set
A language that can be defined by a regular expression is called a regular set. If
two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms.
For instance, r|s = s|r (| is commutative) and r|(s|t) = (r|s)|t (| is associative).

Regular Definitions
Giving names to regular expressions is referred to as a Regular definition. If Σ is
an alphabet of basic symbols, then a regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2

………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set is:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

Short hands
Certain constructs occur so frequently in regular expressions that it is convenient
to introduce notational short hands for them.
1. One or more instances (+):
 The unary postfix operator + means “ one or more instances of” .
 If r is a regular expression that denotes the language L(r), then ( r )+ is a
regular expression that denotes the language (L (r ))+
 Thus the regular expression a+ denotes the set of all strings of one or
more a’s.
 The operator + has the same precedence and associativity as the
operator *.
2. Zero or one instance (?):
 The unary postfix operator ? means "zero or one instance of".
 The notation r? is a shorthand for r | ε.
 If 'r' is a regular expression, then ( r )? is a regular expression that denotes
the language L(r) ∪ { ε }.
3. Character Classes:
 The notation [abc] where a, b and c are alphabet symbols denotes the
regular expression a | b | c.
 Character class such as [a – z] denotes the regular expression a | b | c | d
| ….|z.
 We can describe identifiers as being strings generated by the regular
expression [A-Za-z][A-Za-z0-9]*.
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set. Example: The set of all strings of balanced parentheses and the set of repeating strings cannot be described by a regular expression. These sets can be specified by a context-free grammar.

1.10 RECOGNITION OF TOKENS


Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id | num
where the terminals if , then, else, relop, id and num generate sets of strings given by
the following regular definitions:
if → if
then → then
else → else
relop → <|<=|=|<>|>|>=
id → letter(letter|digit)*
num → digit+ (.digit+)?(E(+|-)?digit+)?
For this language fragment the lexical analyzer will recognize the keywords if,
then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters,
we assume keywords are reserved; that is, they cannot be used as identifiers.
Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a
lexical analyzer is called by the parser to get the next token. It is used to keep track of
information about the characters that are seen as the forward pointer scans the input.
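As a hedged illustration (the helper names next_char and retract are assumptions made for this sketch, not part of any particular compiler), the transition diagram for relop could be hand-translated into C roughly as follows:

    /* Token codes for the relational operators; the names are illustrative. */
    enum relop { LT, LE, EQ, NE, GT, GE, NONE };

    /* Simulate the relop transition diagram.  next_char() returns the next
       input character; retract() pushes the last character back onto the input. */
    enum relop scan_relop(int (*next_char)(void), void (*retract)(void))
    {
        int c = next_char();
        if (c == '<') {
            c = next_char();
            if (c == '=') return LE;        /* "<=" */
            if (c == '>') return NE;        /* "<>" */
            retract();    return LT;        /* "<"  */
        }
        if (c == '=')     return EQ;        /* "="  */
        if (c == '>') {
            c = next_char();
            if (c == '=') return GE;        /* ">=" */
            retract();    return GT;        /* ">"  */
        }
        retract();
        return NONE;                        /* not a relational operator */
    }

Each branch corresponds to a state of the diagram, and retract() models returning to the input a character that was read only as look-ahead.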

1.11 A LANGUAGE FOR SPECIFYING LEXICAL ANALYZER


There is a wide range of tools for constructing lexical analyzers.
LEX
YACC
Lex is a computer program that generates lexical analyzers. Lex is commonly
used with the yacc parser generator.
Creating a lexical analyzer
• First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
• Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.
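On a typical UNIX system the corresponding commands might be (the input file name here is purely illustrative):

    lex lex.l              # produces the C source lex.yy.c
    cc lex.yy.c -ll        # compile and link with the Lex library to get a.out
    ./a.out < input.txt    # run the generated lexical analyzer on an input stream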

Figure 1.11
Creating a lexical analyzer with lex

Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of variables, constants, and regular definitions
Rules are statements of the form
p1 {action1}
p2 {action2}

pn {actionn}
where pi is a regular expression and actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.
YACC- Yet Another Compiler-Compiler
YACC provides a general tool for describing the input to a computer program. The Yacc
user specifies the structures of his input, together with code to be invoked as each such
structure is recognized.
YACC turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the
user's application handled by this subroutine.

1.12 FINITE AUTOMATA


Finite Automata is one of the mathematical models that consist of a number of
states and edges. It is a transition diagram that recognizes a regular expression or
grammar.
There are two types of Finite Automata :
a. Non-deterministic Finite Automata (NFA)
b. Deterministic Finite Automata (DFA)

Deterministic Finite Automata


DFA is a special case of a NFA in which
i) no state has an ε-transition.
ii) There is at most one transition from each state on any input.
A DFA is denoted by the five-tuple
M = {Qd, Ʃ, δ, q0, fd}
Qd – finite set of states
Ʃ – finite set of input symbols
δ – transition function mapping Qd × Ʃ to Qd
q0 – start state (q0 ∈ Qd)
fd – set of final (accepting) states (fd ⊆ Qd)

Construction of DFA from regular expression


The following steps are involved in the construction of a DFA from a regular expression:
 Convert the RE to an ε-NFA using Thompson's rules
 Construct the DFA from the NFA using subset construction, and minimize it if required
Converting a regular expression to a NFA - Thompson's Algorithm
We will use the rules which defined a regular expression as a basis for the
construction:
1. The NFA representing the empty string is:

2. If the regular expression is just a character, e.g. a, then the corresponding NFA is:

3. The union operator is represented by a choice of transitions from a node; thus
a|b can be represented as:

4. Concatenation simply involves connecting one NFA to the other; eg. ab is:

5. The Kleene closure must allow for taking zero or more instances of the letter
from the input; thus a* looks like:
We can convert from an NFA to a DFA using subset construction.
To perform this operation, let us define two functions:
 The ε-closure function takes a state and returns the set of states reachable from it based on (one or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
 The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalize both these functions to apply to sets of states by taking the union of
the application to individual states.
E.g., if A, B and C are states, move({A,B,C},`a') = move(A,`a') ∪ move(B,`a') ∪ move(C,`a').

The Subset Construction Algorithm


1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.
2. Perform the following for the new DFA state:
For each possible input symbol:
 Apply move to the newly created state and the input symbol; this will return a set of states.
 Apply the ε-closure to this set of states, possibly resulting in a new set.
This set of NFA states will be a single state in the DFA.
3. Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states.
4. The finish states of the DFA are those which contain any of the finish states of
the NFA.
Example:
Construct the DFA for the regular expression (a/b)*ab.

Solution:

Figure: ε-NFA for (a/b)*ab obtained by Thompson's construction, with states 1 to 10; state 1 is the start state and state 10 is the accepting state.

ε -closure (1) = {1, 2, 3, 5, 8} A


Move (A,a) = {4,9}
Move (A,b) = {6}

ε -closure ({4, 9}) = {4, 7, 8, 2, 3, 5, 9} = {2, 3, 4, 5, 7, 8, 9} B


Move (B,a) = {4,9}
Move (B,b) = {6,10}

ε -closure ({6}) = {6,7,8,2,3,5} = {2,3,5,6,7,8} C


Move (C,a) = {4,9}
Move(C,b) = {6}

ε -closure ({6, 10}) = {6,7,8,2,3,5,10} = {2,3,5,6,7,8,10} D


Move (D,a) = {4,9}
Move (D,b) = {6}
Transition Table:

State        a        b
->A          B        C
B            B        D
C            B        C
*D           B        C

Figure: Transition diagram of the resulting DFA (start state A, accepting state D).

Minimization of DFA

DFA minimization stands for converting a given DFA to its equivalent DFA with
minimum number of states.

Minimization of DFA
DFA Minimization using Equivalence Theorem

If X and Y are two states in a DFA, we can combine these two states into {X, Y} if they
are not distinguishable. Two states are distinguishable, if there is at least one string S,
such that one of δ (X, S) and δ (Y, S) is accepting and another is not accepting. Hence,
a DFA is minimal if and only if all the states are distinguishable.
Step 1 − All the states Q are divided into two partitions − final states and non-final states − denoted by P0. All the states in a partition are 0th equivalent. Take a counter k and initialize it with 0.
Step 2 − Increment k by 1. For each partition in Pk, divide the states in Pk into two
partitions if they are k-distinguishable. Two states within this partition X and Y are k-
distinguishable if there is an input S such that δ(X, S) and δ(Y, S) are (k-1)-
distinguishable.
Step 3 − If Pk ≠ Pk-1, repeat Step 2, otherwise go to Step 4.
Step 4 − Combine kth equivalent sets and make them the new states of the reduced
DFA.
Example
Let us consider the following DFA −

q δ(q,0) δ(q,1)

a b c

b a d

c e f

d e f

e e f

f f f
Let us apply the above algorithm to the above DFA −
 P0 = {(c,d,e), (a,b,f)}
 P1 = {(c,d,e), (a,b),(f)}
 P2 = {(c,d,e), (a,b),(f)}
Hence, P1 = P2.
There are three states in the reduced DFA. The reduced DFA is as follows –

Q δ(q,0) δ(q,1)

(a, b) (a, b) (c,d,e)

(c,d,e) (c,d,e) (f)

(f) (f) (f)

1.13 LEX
Lex is a program generator designed for lexical processing of character input
streams. It accepts a high-level, problem oriented specification for character string
matching, and produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the source specifications
given to Lex. The Lex written code recognizes these expressions in an input stream and
partitions the input stream into strings matching the expressions.
At the boundaries between strings program sections provided by the user are
executed. The Lex source file associates the regular expressions and the program
fragments.
As each expression appears in the input to the program written by Lex, the
corresponding fragment is executed.
The user supplies the additional code beyond expression matching needed to
complete his tasks, possibly including code written by other generators. The program
that recognizes the expressions is generated in the general purpose programming
language employed for the user's program fragments.
Thus, a high level expression language is provided to write the string
expressions to be matched while the user's freedom to write actions is unimpaired. This
avoids forcing the user who wishes to use a string manipulation language for input
analysis to write processing programs in the same and often inappropriate string
handling language.
Lex is not a complete language, but rather a generator representing a new
language feature which can be added to different programming languages, called ``host
languages.'' Just as general purpose languages can produce code to run on different
computer hardware, Lex can write code in different host languages.
The host language is used for the output code generated by Lex and also for the
program fragments added by the user. Compatible run-time libraries for the different
host languages are also provided.
This makes Lex adaptable to different environments and different users. Each
application may be directed to the combination of hardware and host language
appropriate to the task, the user's background, and the properties of local
implementations.
Lex turns the user's expressions and actions (called source in this memo) into
the host general-purpose language; the generated program is named yylex. The yylex
program will recognize expressions in a stream (called input in this memo) and perform
the specified actions for each expression as it is detected. See Figure 1.
+-------+
Source -> | Lex | -> yylex
+-------+
+-------+
Input -> | yylex | -> Output
+-------+
For a trivial example, consider a program to delete from the input all blanks or
tabs at the ends of lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the
rules, and one rule. This rule contains a regular expression which matches one or more
instances of the characters blank or tab (written \t for visibility, in accordance with the C
language convention) just prior to the end of a line.
The brackets indicate the character class made of blank and tab; the + indicates
``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified,
so the program generated by Lex (yylex) will ignore these characters. Everything else
will be copied. To change any remaining string of blanks or tabs to a single blank, add
another rule:
%%
[ \t]+$ ;
[ \t]+ printf(" ");
The finite automaton generated for this source will scan for both rules at once,
observing at the termination of the string of blanks or tabs whether or not there is a
newline character, and executing the desired rule action. The first rule matches all
strings of blanks or tabs at the end of lines, and the second rule all remaining strings of
blanks or tabs.
Lex Source.
The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is
optional, but the first is required to mark the beginning of the rules. The absolute
minimum Lex program is thus
%%
(no definitions, no rules) which translates into a program which copies the input to the
output unchanged.
In the outline of Lex programs shown above, the rules represent the user's
control decisions; they are a table, in which the left column contains regular expressions
(see section 3) and the right column contains actions, program fragments to be
executed when the expressions are recognized. Thus an individual rule might appear
integer printf("found keyword INT");
to look for the string integer in the input stream and print the message ``found keyword
INT'' whenever it appears.
In this example the host procedural language is C and the C library function
printf is used to print the string. The end of the expression is indicated by the first blank
or tab character.
If the action is merely a single C expression, it can just be given on the right side
of the line; if it is compound, or takes more than a line, it should be enclosed in braces.
As a slightly more useful example, suppose it is desired to change a number of words
from British to American spelling. Lex rules such as
colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");
would be a start.
Lex Regular Expressions
The definitions of regular expressions are very similar to those in QED [5]. A
regular expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding characters in the
strings being compared) and operator characters (which specify repetitions, choices,
and other features).
The letters of the alphabet and the digits are always text characters; thus the
regular expression integer matches the string integer wherever it appears and the
expression
a57D looks for the string a57D.
Operators
The operator characters are
"\[]^-?.*+|()$/{}%<>
and if they are to be used as text characters, an escape should be used. The quotation
mark operator (") indicates that whatever is contained between a pair of quotes is to be
taken as text characters. Thus
xyz"++"
matches the string xyz++ when it appears. Note that a part of a string may be quoted. It
is harmless but unnecessary to quote an ordinary text character; the expression
"xyz++"
is the same as the one above. Thus by quoting every non-alphanumeric character being
used as a text character, the user can avoid remembering the list above of current
operator characters, and is safe should further extensions to Lex lengthen the list.
An operator character may also be turned into a text character by preceding it with \ as
in
xyz\+\+
which is another, less readable, equivalent of the above expressions. Another use of the
quoting mechanism is to get a blank into an expression; normally, as explained above,
blanks or tabs end a rule. Any blank character not contained within [] (see below) must
be quoted.
Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is
backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be
used; it is not required to escape tab and backspace. Every character but blank, tab,
newline and the list above is always a text character.
Character classes.
Classes of characters can be specified using the operator pair []. The
construction [abc] matches a single character, which may be a, b, or c. Within square
brackets, most operator meanings are ignored. Only three characters are special: these
are \ - and ^. The - character indicates ranges. For example,
[a-z0-9<>_]
indicates the character class containing all the lower case letters, the digits, the angle
brackets, and underline.
Ranges may be given in either order. Using - between any pair of characters
which are not both upper case letters, both lower case letters, or both digits is
implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many
more characters than it is in EBCDIC).
If it is desired to include the character - in a character class, it should be first or
last; thus
[-+0-9]
matches all the digits and the two signs.
In character classes, the ^ operator must appear as the first character after the
left bracket; it indicates that the resulting string is to be complemented with respect to
the computer character set. Thus
[^abc]
matches all characters except a, b, or c, including all special or control characters; or
[^a-zA-Z]
is any character which is not a letter. The \ character provides the usual escapes within
character class brackets.
Arbitrary character.
To match almost any character, the operator character . is the class of all
characters except newline. Escaping into octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (blank) to octal
176 (tilde).
Optional expressions.
The operator ? indicates an optional element of an expression. Thus
ab?c
matches either ac or abc.
Repeated expressions.
Repetitions of classes are indicated by the operators * and +.
a*
is any number of consecutive a characters, including zero; while
a+
is one or more instances of a. For example,
[a-z]+
is all strings of lower case letters. And
[A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading alphabetic character. This is a typical
expression for recognizing identifiers in computer languages.

Alternation and Grouping


The operator | indicates alternation:
(ab|cd)
matches either ab or cd. Note that parentheses are used for grouping, although they are
not necessary on the outside level;
ab|cd
would have sufficed. Parentheses can be used for more complex expressions:
(ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
Context sensitivity
Lex will recognize a small amount of surrounding context. The two simplest
operators for this are ^ and $. If the first character of an expression is ^, the expression
will only be matched at the beginning of a line (after a newline character, or at the
beginning of the input stream).
This can never conflict with the other meaning of ^, complementation of character
classes, since that only applies within the [] operators. If the very last character is $, the
expression will only be matched at the end of a line (when immediately followed by
newline).
The latter operator is a special case of the / operator character, which indicates
trailing context. The expression
ab/cd
matches the string ab, but only if followed by cd. Thus
ab$
is the same as
ab/\n
Left context is handled in Lex by start conditions as explained in section 10. If a
rule is only to be executed when the Lex automaton interpreter is in start condition x, the
rule should be prefixed by
<x>
using the angle bracket operator characters. If we considered ``being at the beginning
of a line'' to be start condition ONE, then the ^ operator would be equivalent to
<ONE>
Start conditions are explained more fully later.
Repetitions and Definitions
The operators {} specify either repetitions (if they enclose numbers) or definition
expansion (if they enclose a name). For example
{digit}
looks for a predefined string named digit and inserts it at that point in the expression.
The definitions are given in the first part of the Lex input, before the rules. In contrast,
a{1,5}
looks for 1 to 5 occurrences of a.
Finally, initial % is special, being the separator for Lex source segments.
Lex Action
When an expression written as above is matched, Lex executes the
corresponding action. This section describes some features of Lex which aid in writing
actions. Note that there is a default action, which consists of copying the input to the
output. This is performed on all strings not otherwise matched.
Thus the Lex user who wishes to absorb the entire input, without producing any
output, must provide rules to match everything. When Lex is being used with Yacc, this
is the normal situation. One may consider that actions are what is done instead of
copying the input to the output; thus, in general, a rule which merely copies can be
omitted.
Also, a character combination which is omitted from the rules and which appears
as input is likely to be printed on the output, thus calling attention to the gap in the rules.
One of the simplest things that can be done is to ignore the input. Specifying a C null
statement, ; as an action causes this result. A frequent rule is
[ \t\n] ;
which causes the three spacing characters (blank, tab, and newline) to be ignored.
Another easy way to avoid writing actions is the action character |, which
indicates that the action for this rule is the action for the next rule. The previous example
could also have been written
" "     |
"\t"    |
"\n"    ;
with the same result, although in different style. The quotes around \n and \t are not
required.
In more complex actions, the user will often want to know the actual text that
matched some expression like [a-z]+. Lex leaves this text in an external character array
named yytext. Thus, to print the name found, a rule like
[a-z]+ printf("%s", yytext);
will print the string in yytext. The C function printf accepts a format argument and data to
be printed; in this case, the format is ``print string'' (% indicating data conversion, and s
indicating string type), and the data are the characters in yytext.
So this just places the matched string on the output. This action is so common
that it may be written as ECHO:
[a-z]+ ECHO;
is the same as the above. Since the default action is just to print the characters
found, one might ask why give a rule, like this one, which merely specifies the default
action? Such rules are often required to avoid matching some other rule which is not
desired.
For example, if there is a rule which matches read it will normally match the
instances of read contained in bread or readjust; to avoid this, a rule of the form [a-z]+ is
needed. This is explained further below.
Sometimes it is more convenient to know the end of what has been found; hence
Lex also provides a count yyleng of the number of characters matched. To count both
the number of words and the number of characters in words in the input, the user might
write [a-zA-Z]+ {words++; chars += yyleng;} which accumulates in chars the number of
characters in the words recognized.
The last character in the string matched can be accessed by
yytext[yyleng-1]
Occasionally, a Lex action may decide that a rule has not recognized the correct span
of characters.
Two routines are provided to aid with this situation. First, yymore() can be called
to indicate that the next input expression recognized is to be tacked on to the end of this
input. Normally, the next input string would overwrite the current entry in yytext. Second,
yyless (n) may be called to indicate that not all the characters matched by the currently
successful expression are wanted right now.
The argument n indicates the number of characters in yytext to be retained.
Further characters previously matched are returned to the input. This provides the same
sort of lookahead offered by the / operator, but in a different form.
Example:
Consider a language which defines a string as a set of characters between
quotation (") marks, and provides that to include a " in a string it must be preceded by a
\. The regular expression which matches that is somewhat confusing, so that it might be
preferable to write
\"[^"]* {
if (yytext[yyleng-1] == '\\')
yymore();
else
... normal user processing
}
which will, when faced with a string such as "abc\"def" first match the five characters
"abc\; then the call to yymore() will cause the next part of the string, "def, to be tacked
on the end.
Note that the final quote terminating the string should be picked up in the code labeled
``normal processing''.
The function yyless() might be used to reprocess text in various circumstances.
Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired
to treat this as ``=- a'' but print a message. A rule might be
=-[a-zA-Z] {
printf("Op (=-) ambiguous\n");
yyless(yyleng-1);
... action for =- ...
}
which prints a message, returns the letter after the operator to the input stream, and
treats the operator as ``=-''. Alternatively it might be desired to treat this as ``= -a''. To do
this, just return the minus sign as well as the letter to the input:
=-[a-zA-Z] {
printf("Op (=-) ambiguous\n");
yyless(yyleng-2);
... action for = ...
}
will perform the other interpretation. Note that the expressions for the two cases might
more easily be written
=-/[A-Za-z]
in the first case and
=/-[A-Za-z]
in the second; no backup would be required in the rule action. It is not necessary to
recognize the whole identifier to observe the ambiguity. The possibility of ``=-3'',
however, makes
=-/[^ \t\n]
a still better rule.
In addition to these routines, Lex also permits access to the I/O routines it uses.
They are:
1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to be read later by
input().
By default these routines are provided as macro definitions, but the user can
override them and supply private versions. These routines define the relationship
between external files and internal characters, and must all be retained or modified
consistently.
They may be redefined, to cause input or output to be transmitted to or from
strange places, including other programs or internal memory; but the character set used
must be consistent in all routines; a value of zero returned by input must mean end of
file; and the relationship between output and input must be retained or the Lex
lookahead will not work.
Lex does not look ahead at all if it does not have to, but every rule ending in + *
? or $ or containing / implies look ahead. Lookahead is also necessary to match an
expression that is a prefix of another expression.
See below for a discussion of the character set used by Lex. The standard Lex
library imposes a 100 character limit on backup.
Another Lex library routine that the user will sometimes want to redefine is
yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex
continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more input to arrive from a
new source. In this case, the user should provide a yywrap which arranges for new
input and returns 0. This instructs Lex to continue processing. The default yywrap
always returns 1.
IMPORTANT QUESTIONS

PART A

1. Compare compiler & interpreter.


2. What is meant by a cross - compiler?
3. How a source program is analyzed?
4. What are the cousins of compiler?
5. What is an interpreter?
6. What are the functions of preprocessors?
7. Define a symbol table.
8. List four software tools that analyze the source program.
9. What are compiler construction tools?
10. What is the need for separating analysis phase into lexical & parsing?
11. What is a preprocessor?
12. What are machine dependent & machine independent phases?
13. What are the two parts of a compilation? Explain Briefly.
14. What is the difference between compiler and assembler?
15. Describe Possible Error Recovery Actions in Lexical Analyzer.
16. What are the error recovery actions in a lexical analyzer?
17. What are sentinels? Mention its usage?
18. What are the difference between DFA and NFA?
19. Write various parts of LEX Program?
20. Define token.[APR/MAY 2023]
21. What is meant by Non-deterministic finite automata?[ APR/MAY 2023]
22. State Finite Automata.[Nov/Dec 2023]
23. Show the role of lexical analyzer.[Nov/Dec 2023]
PART B & C

1. Depict the language processing system & describe about the cousins of the
compiler.
2. Why do we need compiler construction tools? Discuss in detail about some
compiler construction tools?
3. With a neat sketch and suitable example, explain various phases of a compiler in detail.
4. Explain grouping of phases.
5. Define the following terms i) compiler ii)Interpreter iii)Translator
6. Explain in detail about input buffering.
7. Convert the following regular expression into minimized DFA (a/b)*abb
8. What are the issues in lexical analysis?
9. Write short notes on regular expression to NFA? Construct NFA for the sentence
(a/b)*a
10. How does LEX Work?
11. Demonstrate the output of each phase of the compiler by using the expression a := b - c * 530. [Nov/Dec 2023]
12. Explain the Specification and recognition of tokens.[Nov/Dec 2023]
13. Extend the concept of lexical analyzer generator using LEX.[Nov/Dec 2023]
14. Describe the phases of compiler with neat diagram.[Apr/May2023]
15. Describe input buffering techniques in detail.[Apr/May2023]
16. Minimize the given expression (a/b)*abb.[Apr/May2023]
