UNIT 1
INTRODUCTION TO COMPILERS
Figure 1.1.1
Preprocessor
A preprocessor produces input to compilers. It may perform the following
functions.
1. Macro processing: A preprocessor may allow a user to define macros
that are shorthand for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older
languages with more modern flow-of-control and data-structuring
facilities.
4. Language extensions: These preprocessors attempt to add
capabilities to the language by what amounts to built-in macros.
Compiler
A compiler is a translator program that translates a program written in a high-level
language (HLL), the source program, into an equivalent program in machine-level
language (MLL), the target program. An important part of a compiler is reporting
errors to the programmer in the form of error messages.
Executing a program written in an HLL programming language basically consists of
two parts: the source program must first be compiled (translated) into an object
program, and then the resulting object program is loaded into memory and executed.
Figure 1.1.2
Languages such as BASIC, COBOL, and LISP can be translated using
interpreters; JAVA also uses an interpreter. The process of interpretation can be
carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modifications to the user program can easily be made and implemented as
execution proceeds.
Debugging a program and finding errors is a simplified task for a program used for
interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower, and memory consumption is higher.
Loader and Link-editor:
Once the assembler produces an object program, that program must be
placed into memory and executed. The assembler could place the object
program directly in memory and transfer control to it, thereby causing the
machine-language program to be executed.
This would waste core by leaving the assembler in memory while the user's
program was being executed. Also, the programmer would have to retranslate
his program with each execution, thus wasting translation time. To overcome
the problems of wasted translation time and memory, system programmers
developed another component called the loader.
“A loader is a program that places programs into memory and prepares them
for execution.” It would be more efficient if subroutines could be translated into
object form which the loader could “relocate” directly behind the user's program.
The task of adjusting programs so they may be placed in arbitrary core locations
is called relocation. Relocating loaders perform four functions.
Translator
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program
translation, the translator performs another very important role: error detection.
Any violation of the HLL specification is detected and reported to the
programmer. The important roles of a translator are:
Translating the HLL program input into an equivalent machine-language (ML) program.
Providing diagnostic messages wherever the programmer violates the
specification of the HLL.
Phases of a Compiler
The compilation process is partitioned into a number of sub-processes called ‘phases’.
Lexical Analyzer: The lexical analyzer reads the source program character by character
and returns the tokens of the source program. A token describes a pattern of characters
having the same meaning in the source program (such as identifiers, operators,
keywords, numbers, delimiters and so on).
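For instance, given the statement newval := oldval + 12 (an assumed fragment, not from the text), the lexical analyzer would return the token stream:
newval   (identifier)
:=       (assignment operator)
oldval   (identifier)
+        (add operator)
12       (number)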
Figure 1.2.1
Syntax Analyzer: A Syntax Analyzer creates the syntactic structure (generally a parse
tree) of the given program.
A syntax analyzer is also called a parser.
A parse tree describes a syntactic structure.
In a parse tree, all terminals are at the leaves.
All inner nodes are non-terminals of a context-free grammar.
Semantic Analyzer
It checks the source program for semantic errors and collects the type information
for the subsequent code-generation phase.
It uses the hierarchical structure determined by the syntax-analysis phase to identify
the operators and operands of expressions and statements.
An important component of semantic analysis is type checking.
Normally, semantic information cannot be represented by a context-free language
used in syntax analyzers.
When the context-free grammars used in syntax analysis are augmented with
attributes (semantic rules), the result is a syntax-directed translation; such grammars
are called attribute grammars.
Intermediate Code Generation
A compiler may produce an explicit intermediate code representing the source
program.
These intermediate codes are generally machine (architecture) independent, but the
level of intermediate code is close to the level of machine code.
An intermediate form called "three-address code" is like the assembly
language for a machine in which every memory location can act like a register.
Three-address code consists of a sequence of instructions, each of which has at
most three operands.
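For example, the assignment position := initial + rate * 60 (the classic textbook example; the temporary names t1–t3 are compiler-generated) might be translated into the three-address sequence:
t1 := inttoreal(60)
t2 := rate * t1
t3 := initial + t2
position := t3
Each instruction has at most one operator on its right side, so the sequence fixes the order in which the operations are performed.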
Code Optimizer
The code optimizer optimizes the code produced by the intermediate code generator
in terms of time and space.
The code optimization phase attempts to improve the intermediate code, so that
faster-running machine code will result.
Optimization may involve:
a. Detection and removal of dead (unreachable) code
b. Calculation of constant expressions and terms
c. Collapsing of repeated expressions into temporary storage.
d. Loop controlling
e. Moving code outside of loops
f. Removal of unnecessary temporary variables.
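As a sketch, applying constant calculation (b) and the removal of unnecessary temporaries (f) to the three-address code of the previous section shrinks it to:
t1 := rate * 60.0
position := initial + t1
The conversion of the integer 60 to the real 60.0 is now done once, at compile time, and two temporaries disappear.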
Code Generator
Produces the target language in a specific architecture.
The target program is normally a relocatable object file containing the machine
code.
This phase involves:
- Allocation of registers and memory
- Generation of correct references
- Generation of correct types
- Generation of machine code.
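For instance, the optimized three-address code above might become the following target code (a sketch in an assumed two-address assembly notation; the MOVF/MULF/ADDF mnemonics and registers R1, R2 are illustrative, not from the text):
MOVF rate, R2
MULF #60.0, R2
MOVF initial, R1
ADDF R2, R1
MOVF R1, position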
Symbol table
An essential function of a compiler is to record the identifiers used in the source
program and collect information about various attributes of each identifier.
A symbol table is a data structure containing a record for each identifier, with fields
for the attributes of the identifier.
The data structure allows us to find the record for each identifier quickly and to store
or retrieve data from that record quickly.
When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table.
The attributes of an identifier cannot normally be determined during lexical analysis.
For example, in a Pascal declaration like
var position, initial, rate : real;
The type real is not known when position, initial, and rate are seen by the lexical
analyzer. The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways.
Error Detection and Reporting
Each compiler phase can encounter errors. However, after detecting an error, a
phase must somehow deal with that error, so that compilation can proceed, allowing
further errors in the source program to be detected.
The lexical phase can detect errors where the characters remaining in the input do
not form any token of the language.
Errors where the token stream violates the structure rules (syntax) of the language
are determined by the syntax analysis phase.
During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation.
Figure 1.2.1.1
1.3 COUSINS OF COMPILER
1. Preprocessor
2. Assembler
3. Loader and Link-editor
Preprocessor
A preprocessor is a program that processes its input data to produce output that is
used as input to another program. The output is said to be a preprocessed form of the
input data, which is often used by some subsequent programs like compilers.
They may perform the following functions:
1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension
1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should
be mapped to an output sequence according to a defined procedure. The mapping
process that instantiates a macro into a specific output sequence is known as macro
expansion.
2. File Inclusion:
The preprocessor includes header files into the program text. When the preprocessor
finds an #include directive, it replaces it with the entire content of the specified file.
(A C sketch of macro processing and file inclusion appears after this list.)
3. Rational Preprocessors:
These preprocessors augment older languages with more modern flow-of-control and
data-structuring facilities.
4. Language extension
These processors attempt to add capabilities to the language by what amounts to
built-in macros. For example, the language Equel is a database query language
embedded in C.
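A minimal C sketch of the first two functions (the SQUARE macro and the program itself are illustrative, not from the text):
#include <stdio.h>              /* file inclusion: the preprocessor splices in the header text */
#define SQUARE(x) ((x) * (x))   /* macro processing: shorthand expanded before compilation */
int main(void) {
    printf("%d\n", SQUARE(5));  /* the preprocessor expands this to ((5) * (5)) */
    return 0;
}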
Assembler
Assembler creates object code by translating assembly instruction mnemonics
into machine code. There are two types of assemblers:
One-pass assemblers go through the source code once and assume that all
symbols will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code.
Linker and Loader
A linker or link editor is a program that takes one or more objects generated by
a compiler and combines them into a single executable program. Three tasks of the
linker are
1. It searches the program to find library routines used by the program, e.g. printf() and
math routines.
2. It determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references.
3. It resolves references among files.
A loader is the part of an operating system that is responsible for loading
programs in memory, one of the essential stages in the process of starting a program.
Figure 1.6.1
Patterns
A pattern is a rule describing the set of lexemes that can represent a particular
token in the source program. The pattern for the token const in the above table is just
the single string const that spells out the keyword.
Attributes of Token
The lexical analyzer returns to the parser a representation for the token it has
found. The representation is an integer code if the token is a simple construct such as a
left parenthesis, comma, or colon. The representation is a pair consisting of an integer
code and a pointer to a table if the token is a more complex element such as an
identifier or constant.
The integer code gives the token type; the pointer points to the value of that
token. Pairs are also returned whenever we wish to distinguish between instances of a
token.
The attributes influence the translation of tokens.
i) Constant : value of the constant
ii) Identifiers: pointer to the corresponding symbol table entry.
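For example, for the statement position := initial + rate * 60, the lexical analyzer would return the token/attribute pairs below (the token names are illustrative):
<id, pointer to the symbol-table entry for position>
<assign_op>
<id, pointer to the symbol-table entry for initial>
<add_op>
<id, pointer to the symbol-table entry for rate>
<num, integer value 60>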
1.7 Lexical Errors
A character sequence which cannot be scanned into any valid token is a lexical
error. Important facts about lexical errors:
Lexical errors are not very common, but they should be managed by the scanner.
Misspellings of identifiers, operators, or keywords are considered lexical errors.
Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.
Error Recovery in Lexical Analyzer
Here are a few of the most common error recovery techniques:
Remove one character from the remaining input.
In panic mode, successive characters are ignored until we reach a well-formed token.
Insert a missing character into the remaining input.
Replace a character with another character.
Transpose two adjacent characters.
Figure 1.8.1
Buffer Pairs
Each buffer is of the same size N, and N is usually the number of characters in
one disk block, e.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme beginning, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme.
After the lexeme is recorded as an attribute value of a token returned to the
parser, lexeme beginning is set to the character immediately after the lexeme just
found.
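A minimal C sketch of this scheme (the names fill and next_char, the tiny buffer size, and the use of '\0' as the eof sentinel are illustrative assumptions; the input text is assumed to contain no NUL bytes):
#include <stdio.h>

#define N 8                          /* tiny half-buffer so the switch is exercised; real scanners use 1024/4096 */
static char buf[2 * N + 2];          /* two N-byte halves, each followed by a sentinel slot */
static char *lexeme_beginning;       /* a real scanner also tracks the start of the current lexeme */
static char *forward;                /* scans ahead until a pattern match is found */
static FILE *in;

/* Load up to N characters into one half and write the eof sentinel after them. */
static void fill(char *half) {
    size_t got = fread(half, 1, N, in);
    half[got] = '\0';
}

/* Return the next character; a sentinel at a half boundary triggers a reload,
 * a sentinel anywhere else marks the true end of the input. */
static int next_char(void) {
    while (*forward == '\0') {
        if (forward == buf + N)              { fill(buf + N + 1); forward = buf + N + 1; }
        else if (forward == buf + 2 * N + 1) { fill(buf);         forward = buf; }
        else return EOF;
    }
    return (unsigned char)*forward++;
}

int main(void) {
    in = stdin;
    fill(buf);
    lexeme_beginning = forward = buf;
    int c;
    while ((c = next_char()) != EOF)
        putchar(c);                  /* echo the input to show the buffering works */
    return 0;
}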
Regular Expressions
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some alphabet Σ and
the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole
member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a. (r)|(s) is a regular expression denoting the language L(r) U L(s).
b. (r)(s) is a regular expression denoting the language L(r)L(s).
c. (r)* is a regular expression denoting (L(r))*.
d. (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
Regular set
A language that can be defined by a regular expression is called a regular set. If
two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate regular expressions into equivalent forms.
For instance, r|s = s|r (| is commutative), and r|(s|t) = (r|s)|t (| is associative).
Regular Definitions
Giving names to regular expressions is referred to as a Regular definition. If Σ is
an alphabet of basic symbols, then a regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ U {d1, d2, . . . , di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter.
A regular definition for this set:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit ) *
Short hands
Certain constructs occur so frequently in regular expressions that it is convenient
to introduce notational short hands for them.
1. One or more instances (+):
The unary postfix operator + means “ one or more instances of” .
If r is a regular expression that denotes the language L(r), then ( r )+ is a
regular expression that denotes the language (L (r ))+
Thus the regular expression a+ denotes the set of all strings of one or
more a’s.
The operator + has the same precedence and associativity as the
operator *.
2. Zero or one instance (?):
The unary postfix operator ? means “zero or one instance of”.
The notation r? is a shorthand for r | ε.
If r is a regular expression, then ( r )? is a regular expression that denotes
the language L(r) U { ε }.
3. Character Classes:
The notation [abc] where a, b and c are alphabet symbols denotes the
regular expression a | b | c.
Character class such as [a – z] denotes the regular expression a | b | c | d
| ….|z.
We can describe identifiers as being strings generated by the regular
expression, [A–Za–z][A– Za–z0–9]*
Non-regular Set
A language which cannot be described by any regular expression is a non-regular
set. Example: The set of all strings of balanced parentheses and repeating
strings cannot be described by a regular expression. This set can be specified by a
context-free grammar.
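For instance, the balanced-parentheses language is generated by the standard context-free grammar (shown here for illustration):
S → ( S ) S | ε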
Figure 1.11
Creating a lexical analyzer with lex
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of variables, constants, and regular definitions
Rules are statements of the form
p1 {action1}
p2 {action2}
…
pn {actionn}
where pi is a regular expression and actioni describes what action the lexical analyzer
should take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be
compiled separately and loaded with the lexical analyzer.
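As an illustration, here is a complete specification in this format (the identifier-counting behaviour is an assumed example, not from the text):
%{
/* definitions section: C declarations used by the rules below */
#include <stdio.h>
int id_count = 0;
%}
letter [A-Za-z]
digit  [0-9]
%%
{letter}({letter}|{digit})*   { id_count++; printf("id: %s\n", yytext); }
[ \t\n]+                      ; /* skip whitespace */
.                             ; /* ignore any other character */
%%
/* user subroutines section */
int yywrap(void) { return 1; }
int main(void) { yylex(); printf("%d identifiers\n", id_count); return 0; }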
YACC- Yet Another Compiler-Compiler
YACC provides a general tool for describing the input to a computer program. The Yacc
user specifies the structures of his input, together with code to be invoked as each such
structure is recognized.
YACC turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the
user's application handled by this subroutine.
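A minimal sketch of such a specification (the single-digit sum grammar and the hand-written yylex are illustrative assumptions, not from the text):
%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%%
line : expr '\n'         { printf("= %d\n", $1); }
     ;
expr : expr '+' NUMBER   { $$ = $1 + $3; }
     | NUMBER            { $$ = $1; }
     ;
%%
/* a tiny lexer: single digits become NUMBER tokens; everything else is returned as-is */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return NUMBER; }
    if (c == EOF) return 0;
    return c;
}
int main(void) { return yyparse(); }
Processed with yacc and a C compiler, this subroutine reads a line such as 1+2+3 and prints its value, with the flow of control inside the generated yyparse.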
2. If the regular expression is just a character, e.g. a, then the corresponding NFA is:
4. Concatenation simply involves connecting one NFA to the other; e.g., ab is:
5. The Kleene closure must allow for taking zero or more instances of the letter
from the input; thus a* looks like:
We can convert from an NFA to a DFA using subset construction.
To perform this operation, let us define two functions:
The ε-closure function takes a state and returns the set of states reachable
from it based on (one or more) ε-transitions. Note that this will always include
the state itself; we should be able to get from a state to any state in its ε-closure
without consuming any input.
The function move takes a state and a character, and returns the set of states
reachable by one transition on this character.
We can generalize both these functions to apply to sets of states by taking the union of
the application to individual states.
E.g., if A, B and C are states, then move({A,B,C}, a) = move(A, a) U move(B, a) U
move(C, a).
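A minimal C sketch of these two functions over a tiny hard-coded NFA (the four-state NFA for a|b, the bitmask encoding, and all names here are illustrative assumptions, not from the text):
#include <stdio.h>

#define N 4                     /* number of NFA states (assumed) */
typedef unsigned set;           /* bit i set => state i is in the set */

/* eps[s] = states reachable from s by one epsilon move (0->1, 1->2) */
static set eps[N] = { 1u << 1, 1u << 2, 0, 0 };
/* trans[s][c] = states reachable from s on input c (0='a': 1->3; 1='b': 2->3) */
static set trans[N][2] = { {0, 0}, {1u << 3, 0}, {0, 1u << 3}, {0, 0} };

/* epsilon-closure: repeatedly add epsilon successors until nothing changes;
 * the argument states are kept, so each state is in its own closure */
set eps_closure(set S) {
    set prev;
    do {
        prev = S;
        for (int s = 0; s < N; s++)
            if (S & (1u << s)) S |= eps[s];
    } while (S != prev);
    return S;
}

/* move: union of one-step transitions on character c from every state in S */
set move(set S, int c) {
    set R = 0;
    for (int s = 0; s < N; s++)
        if (S & (1u << s)) R |= trans[s][c];
    return R;
}

int main(void) {
    set start = eps_closure(1u << 0);         /* DFA start state = closure of NFA start */
    set on_a  = eps_closure(move(start, 0));  /* DFA transition from it on 'a' */
    printf("start closure = %x, on 'a' = %x\n", start, on_a);
    return 0;
}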
Solution:
[Figure: the Thompson NFA for the expression, with ε-transitions among states 1–9 and labelled moves on a and b, together with the DFA obtained by subset construction, with states A, B, C and D and transitions on a and b.]
Minimization of DFA
DFA minimization stands for converting a given DFA to its equivalent DFA with
minimum number of states.
DFA Minimization using Equivalence Theorem
If X and Y are two states in a DFA, we can combine these two states into {X, Y} if they
are not distinguishable. Two states are distinguishable, if there is at least one string S,
such that one of δ (X, S) and δ (Y, S) is accepting and another is not accepting. Hence,
a DFA is minimal if and only if all the states are distinguishable.
Step 1 − All the states Q are divided into two partitions − final states and non-final
states − denoted by P0. All the states in a partition are 0th-equivalent. Take a
counter k and initialize it with 0.
Step 2 − Increment k by 1. For each partition in Pk-1, divide its states into two
partitions if they are k-distinguishable. Two states X and Y within a partition are k-
distinguishable if there is an input S such that δ(X, S) and δ(Y, S) are (k-1)-
distinguishable.
Step 3 − If Pk ≠ Pk-1, repeat Step 2, otherwise go to Step 4.
Step 4 − Combine kth equivalent sets and make them the new states of the reduced
DFA.
Example
Let us consider the following DFA, in which c, d and e are the final states −
q δ(q,0) δ(q,1)
a b c
b a d
c e f
d e f
e e f
f f f
Let us apply the above algorithm to the above DFA −
P0 = {(c,d,e), (a,b,f)}
P1 = {(c,d,e), (a,b),(f)}
P2 = {(c,d,e), (a,b),(f)}
Hence, P1 = P2.
There are three states in the reduced DFA. The reduced DFA is as follows −
q δ(q,0) δ(q,1)
(a,b) (a,b) (c,d,e)
(c,d,e) (c,d,e) (f)
(f) (f) (f)
1.12 LEX
Lex is a program generator designed for lexical processing of character input
streams. It accepts a high-level, problem oriented specification for character string
matching, and produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the source specifications
given to Lex. The code written by Lex recognizes these expressions in an input stream
and partitions the input stream into strings matching the expressions.
At the boundaries between strings program sections provided by the user are
executed. The Lex source file associates the regular expressions and the program
fragments.
As each expression appears in the input to the program written by Lex, the
corresponding fragment is executed.
The user supplies the additional code beyond expression matching needed to
complete his tasks, possibly including code written by other generators. The program
that recognizes the expressions is generated in the general purpose programming
language employed for the user's program fragments.
Thus, a high level expression language is provided to write the string
expressions to be matched while the user's freedom to write actions is unimpaired. This
avoids forcing the user who wishes to use a string manipulation language for input
analysis to write processing programs in the same and often inappropriate string
handling language.
Lex is not a complete language, but rather a generator representing a new
language feature which can be added to different programming languages, called ``host
languages.'' Just as general purpose languages can produce code to run on different
computer hardware, Lex can write code in different host languages.
The host language is used for the output code generated by Lex and also for the
program fragments added by the user. Compatible run-time libraries for the different
host languages are also provided.
This makes Lex adaptable to different environments and different users. Each
application may be directed to the combination of hardware and host language
appropriate to the task, the user's background, and the properties of local
implementations.
Lex turns the user's expressions and actions (called source in this memo) into
the host general-purpose language; the generated program is named yylex. The yylex
program will recognize expressions in a stream (called input in this memo) and perform
the specified actions for each expression as it is detected. See Figure 1.
+-------+
Source -> | Lex | -> yylex
+-------+
+-------+
Input -> | yylex | -> Output
+-------+
For a trivial example, consider a program to delete from the input all blanks or
tabs at the ends of lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the
rules, and one rule. This rule contains a regular expression which matches one or more
instances of the characters blank or tab (written \t for visibility, in accordance with the C
language convention) just prior to the end of a line.
The brackets indicate the character class made of blank and tab; the + indicates
``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified,
so the program generated by Lex (yylex) will ignore these characters. Everything else
will be copied. To change any remaining string of blanks or tabs to a single blank, add
another rule:
%%
[ \t]+$ ;
[ \t]+ printf(" ");
The finite automaton generated for this source will scan for both rules at once,
observing at the termination of the string of blanks or tabs whether or not there is a
newline character, and executing the desired rule action. The first rule matches all
strings of blanks or tabs at the end of lines, and the second rule all remaining strings of
blanks or tabs.
Lex Source.
The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is
optional, but the first is required to mark the beginning of the rules. The absolute
minimum Lex program is thus
%%
(no definitions, no rules) which translates into a program which copies the input to the
output unchanged.
In the outline of Lex programs shown above, the rules represent the user's
control decisions; they are a table, in which the left column contains regular expressions
(see section 3) and the right column contains actions, program fragments to be
executed when the expressions are recognized. Thus an individual rule might appear
integer printf("found keyword INT");
to look for the string integer in the input stream and print the message ``found keyword
INT'' whenever it appears.
In this example the host procedural language is C and the C library function
printf is used to print the string. The end of the expression is indicated by the first blank
or tab character.
If the action is merely a single C expression, it can just be given on the right side
of the line; if it is compound, or takes more than a line, it should be enclosed in braces.
As a slightly more useful example, suppose it is desired to change a number of words
from British to American spelling. Lex rules such as
colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");
would be a start.
Lex Regular Expressions
The definitions of regular expressions are very similar to those in QED [5]. A
regular expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding characters in the
strings being compared) and operator characters (which specify repetitions, choices,
and other features).
The letters of the alphabet and the digits are always text characters; thus the
regular expression integer matches the string integer wherever it appears and the
expression
a57D looks for the string a57D.
Operators
The operator characters are
"\[]^-?.*+|()$/{}%<>
and if they are to be used as text characters, an escape should be used. The quotation
mark operator (") indicates that whatever is contained between a pair of quotes is to be
taken as text characters. Thus
xyz"++"
matches the string xyz++ when it appears. Note that a part of a string may be quoted. It
is harmless but unnecessary to quote an ordinary text character; the expression
"xyz++"
is the same as the one above. Thus by quoting every non-alphanumeric character being
used as a text character, the user can avoid remembering the list above of current
operator characters, and is safe should further extensions to Lex lengthen the list.
An operator character may also be turned into a text character by preceding it with \ as
in
xyz\+\+
which is another, less readable, equivalent of the above expressions. Another use of the
quoting mechanism is to get a blank into an expression; normally, as explained above,
blanks or tabs end a rule. Any blank character not contained within [] (see below) must
be quoted.
Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is
backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be
used; it is not required to escape tab and backspace. Every character but blank, tab,
newline and the list above is always a text character.
Character classes.
Classes of characters can be specified using the operator pair []. The
construction [abc] matches a single character, which may be a, b, or c. Within square
brackets, most operator meanings are ignored. Only three characters are special: these
are \ - and ^. The - character indicates ranges. For example,
[a-z0-9<>_]
indicates the character class containing all the lower case letters, the digits, the angle
brackets, and underline.
Ranges may be given in either order. Using - between any pair of characters
which are not both upper case letters, both lower case letters, or both digits is
implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many
more characters than it is in EBCDIC).
If it is desired to include the character - in a character class, it should be first or
last; thus
[-+0-9]
matches all the digits and the two signs.
In character classes, the ^ operator must appear as the first character after the
left bracket; it indicates that the resulting string is to be complemented with respect to
the computer character set. Thus
[^abc]
matches all characters except a, b, or c, including all special or control characters; or
[^a-zA-Z]
is any character which is not a letter. The \ character provides the usual escapes within
character class brackets.
Arbitrary character.
To match almost any character, the operator character . is the class of all
characters except newline. Escaping into octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (blank) to octal
176 (tilde).
Optional expressions.
The operator ? indicates an optional element of an expression. Thus
ab?c
matches either ac or abc.
Repeated expressions.
Repetitions of classes are indicated by the operators * and +.
a*
is any number of consecutive a characters, including zero; while
a+
is one or more instances of a. For example,
[a-z]+
is all strings of lower case letters. And
[A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading alphabetic character. This is a typical
expression for recognizing identifiers in computer languages.
PART A
1. Depict the language processing system & describe about the cousins of the
compiler.
2. Why do we need compiler construction tools? Discuss in detail about some
compiler construction tools?
3. With a neat sketch and suitable example, explain the various phases of a compiler
in detail.
4. Explain grouping of phases.
5. Define the following terms i) compiler ii)Interpreter iii)Translator
6. Explain in detail about input buffering.
7. Convert the following regular expression into minimized DFA (a/b)*abb
8. What are the issues in lexical analysis?
9. Write short notes on converting a regular expression to an NFA. Construct an NFA
for the expression (a/b)*a.
10. How does LEX Work?
11. Demonstrate the output of each phase of the compiler using the expression
a := b - c * 530. [Nov/Dec 2023]
12. Explain the specification and recognition of tokens. [Nov/Dec 2023]
13. Extend the concept of a lexical analyzer generator using LEX. [Nov/Dec 2023]
14. Describe the phases of a compiler with a neat diagram. [Apr/May 2023]
15. Describe input buffering techniques in detail. [Apr/May 2023]
16. Minimize the given expression (a/b)*abb. [Apr/May 2023]