CD Notes-2
A pass is a single traversal the compiler makes over the source code or some other representation
of it.
When the source language to be translated is large and complex, the compiler is broken
down into multiple (relatively independent) phases. The advantage of having different phases is that the
development of the compiler can be distributed among a team of developers. Typically, most compilers have
at least two phases, called the front end and the back end, while they could be either one-pass or multi-pass.
Phase is used to classify compilers according to their construction, while pass is used to classify compilers
according to how they operate.
For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate
code generation might be grouped together into one pass. Code optimization might be an optional pass. Then
there could be a back-end pass consisting of code generation for a particular target machine.
BOOTSTRAPPING:
Bootstrapping is the process of writing a compiler (or assembler) in the source programming language that
it intends to compile. Applying this technique leads to a self-hosting compiler.
Compilers for new programming languages are often first developed in an existing language and then
rewritten in the new language and compiled by themselves; this is another example of the bootstrapping notion.
Using an existing language to bootstrap a new language is one way to resolve the "chicken or the egg"
causality dilemma.
A compiler is characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
Notation: a compiler can be described by its source language S, its target language T, and its implementation
language I. A T-diagram is commonly used to depict such a compiler.
The process illustrated by the T-diagrams is called bootstrapping and can be summarized by the equation sketched below.
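As a hedged sketch (the notation C(S, I, T) is introduced here only for illustration; textbooks draw the same idea as interlocking T-diagrams), write C(S, I, T) for a compiler from source language S to target language T written in implementation language I. Running such a compiler through an existing compiler C(I, M, M), which runs on and generates code for machine M, yields a compiler from S to T that runs on M:

C(S, I, T) compiled by C(I, M, M)  =>  C(S, M, T)

Applying this step to a compiler written in its own source language (I = S) is exactly the self-hosting, bootstrapping case described above.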
Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use modern software development
environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses,
and so on. In addition to these general software-development tools, other more specialized tools have been
created to help implement various phases of a compiler.
These tools use specialized languages for specifying and implementing specific components, and many use
quite sophisticated algorithms. The most successful tools are those that hide the details of the generation
algorithm and produce components that can be easily integrated into the remainder of the compiler. Some
commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description of a
programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a
language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and
generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for translating each
operation of the intermediate language into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted
from one part of a program to each other part. Data-flow analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of
a compiler.
Applications of Compiler Technology
Compiler design is not only about compilers: many people use the technology learned by studying
compilers even though they have never, strictly speaking, written (even part of) a compiler for a major
programming language. Compiler technology has other important uses as well. Additionally, compiler design
impacts several other areas of computer science.
Implementation of High-Level Programming Languages
A high-level programming language defines a programming abstraction: the programmer expresses an
algorithm using the language, and the compiler must translate that program to the target language. Generally,
higher-level programming languages are easier to program in, but are less efficient, that is, the target programs
run more slowly.
Optimizations for Computer Architectures
The rapid evolution of computer architectures has also led to an insatiable demand for new compiler
technology. Almost all high-performance systems take advantage of the same two basic techniques:
parallelism and memory hierarchies. Parallelism can be found at several levels: at the instruction level, where
multiple operations are executed simultaneously, and at the processor level, where different threads of the same
application are run on different processors.
Design of New Computer Architectures
In the early days of computer architecture design, compilers were developed after the machines were built.
That has changed. Since programming in high-level languages is the norm, the performance of a computer
system is determined not only by its raw speed but also by how well compilers can exploit its features.
Thus, in modern computer architecture development, compilers are developed in the processor-design stage,
and compiled code, running on simulators, is used to evaluate the proposed architectural features.
Program Translations
While we normally think of compiling as a translation from a high-level language to the machine level, the
same technology can be applied to translate between different kinds of languages. The following are some of
the important applications of program-translation techniques.
Software Productivity Tools
Programs are arguably the most complicated engineering artifacts ever produced; they consist of a great many
details, every one of which must be correct before the program will work completely. As a result, errors are
rampant in programs; errors may crash a system, produce wrong results, render a system vulnerable to security
attacks, or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating
errors in programs.
Programming Language Basics
The Static/Dynamic Distinction
Environments and States
Static Scope and Block Structure
Explicit Access Control
Parameter Passing Mechanisms
Recognition of Tokens
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract
symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the parser processes.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token,
the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the
pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified
by the lexical analyzer as an instance of that token.
Example.
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is
"not equals," because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far
as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions, as
in Fig. 3.11.
For this language, the lexical analyzer will recognize the keywords if , then, and else, as well as lexemes that
match the patterns for relop, id, and number.
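Since Fig. 3.11 is not reproduced here, the regular definitions for these tokens typically look like the following sketch (a standard reconstruction; the exact definitions may differ slightly from the original figure):

letter -> [A-Za-z]
digit  -> [0-9]
id     -> letter ( letter | digit )*
digits -> digit+
number -> digits ( . digits )? ( E ( + | - )? digits )?
relop  -> < | > | <= | >= | = | <>
if     -> if
then   -> then
else   -> else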
To simplify matters, we make the common assumption that keywords are also reserved words: that is, they
are not identifiers, even though their lexemes match the pattern for identifiers. In addition, we assign the
lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:
ws -> ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same
names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the
parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following
token that gets returned to the parser.
The Lexical-Analyzer Generator Lex
Lex, or in its more recent implementation Flex, is a tool that allows one to specify a lexical analyzer by
giving regular expressions that describe the patterns for tokens. The input notation for the Lex tool is referred
to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms
the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this
transition diagram. The mechanics of how this translation from regular expressions to transition diagrams
occurs is the subject of later sections; here we only learn the Lex language.
Use of Lex
An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be
generated. The Lex compiler transforms lex.l into a C program, in a file that is always named lex.yy.c. The
latter file is compiled by the C compiler into a file called a.out, as always. The C-compiler output is a working
lexical analyzer that can take a stream of input characters and produce a stream of tokens. The normal use of
the compiled C program, referred to as a.out in Fig. 3.22, is as a subroutine of the parser. It is a C function
that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be
another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is
shared between the lexical analyzer and the parser, thereby making it simple to return both the name and an
attribute value of a token.
The declarations section includes declarations of variables, manifest constants (identifiers declared to stand
for a constant, e.g., the name of a token), and regular definitions. The translation rules each have the form
Pattern { Action }
Each pattern is a regular expression, which may use the regular definitions of the declaration section. The
actions are fragments of code, typically written in C, although many variants of Lex using other languages
have been created. The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
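As an illustration, a Lex input file with the three sections described above might look like the following minimal sketch (the token-name constants and the exact patterns are assumptions for illustration; installID() and installNum() are the user-written helpers discussed below):

%{
/* declarations: manifest constants naming the tokens (values are assumed) */
#define IF      256
#define ID      257
#define NUMBER  258
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?
%%
{ws}      { /* no action and no return: whitespace is skipped */ }
if        { return(IF); }
{id}      { yylval = (int) installID(); return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }
%%
/* auxiliary functions such as installID() and installNum() go here */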
When the longest matching prefix matches two or more patterns, the pattern listed first in the Lex program
takes precedence. The action taken when id is matched is threefold:
1. Function installID() is called to place the lexeme found in the symbol table.
2. This function returns a pointer to the symbol-table entry, which is placed in the global variable yylval, where it can
be used by the parser or a later component of the compiler. Note that installID() has available to it two
variables that are set automatically by the lexical analyzer that Lex generates:
(a) yytext is a pointer to the beginning of the lexeme, analogous to lexemeBegin.
(b) yyleng is the length of the lexeme found.
3. The token name ID is returned to the parser. The action taken when a lexeme matching the pattern number
is found is similar, using the auxiliary function installNum().
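A minimal C sketch of what installID() might look like, assuming a hypothetical symbol-table routine symtable_insert() (the Entry structure and symtable_insert() are assumptions for illustration, not part of Lex):

/* Set automatically by the Lex-generated scanner. */
extern char *yytext;   /* pointer to the beginning of the current lexeme */
extern int yyleng;     /* length of the current lexeme */

/* Hypothetical symbol-table entry and insertion routine (assumed API). */
struct Entry { char *name; /* ... other attributes ... */ };
struct Entry *symtable_insert(const char *lexeme, int length);

/* Place the current lexeme in the symbol table and return a pointer to its entry;
   the caller stores this pointer in yylval so the parser can retrieve it. */
struct Entry *installID(void) {
    return symtable_insert(yytext, yyleng);
}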
Parse Trees and Derivations
A parse tree is a graphical representation of a derivation that filters out the order in which productions are
applied to replace nonterminals. Each interior node of a parse tree represents the application of a production.
The interior node is labeled with the nonterminal A in the head of the production; the children of the node are
labeled, from left to right, by the symbols in the body of the production by which this A was replaced during
the derivation.
For example, consider the parse tree for -(id + id) for the expression grammar sketched below.
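The grammar and derivation in question are, as a standard reconstruction (the original figure is not reproduced here):

E -> E + E | E * E | - E | ( E ) | id

A leftmost derivation of -(id + id) is

E => - E => - ( E ) => - ( E + E ) => - ( id + E ) => - ( id + id )

The parse tree mirrors this derivation: the root E has children - and E; that E has children (, E, and ); and the inner E has children E, +, and E, where each of the two E children derives id.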
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another
way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one
rightmost derivation for the same sentence.
The arithmetic expression grammar permits two distinct leftmost derivations for the sentence id + id * id:
The corresponding parse trees appear in Fig. 4.5.
Note that the parse tree of Fig. 4.5(a) reflects the commonly assumed precedence of + and *, while the tree of
Fig. 4.5(b) does not. That is, it is customary to treat operator * as having higher precedence than +,
corresponding to the fact that we would normally evaluate an expression like a + b * c as a + (b * c), rather
than as (a + b) * c.
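The two leftmost derivations in question are, as a standard reconstruction (the original derivations are not reproduced here):

(a) E => E + E => id + E => id + E * E => id + id * E => id + id * id
(b) E => E * E => E + E * E => id + E * E => id + id * E => id + id * id

Derivation (a) corresponds to the tree with + at the root, grouping the input as id + (id * id); derivation (b) corresponds to the tree with * at the root, grouping it as (id + id) * id.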
Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we shall
eliminate the ambiguity from the "dangling-else" grammar sketched below.
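A standard reconstruction of the grammar in question (the original figure is not reproduced here) is:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

This grammar is ambiguous because a sentence such as if E1 then if E2 then S1 else S2 has two parse trees: the else may attach to either if. The usual disambiguating rule, match each else with the closest unmatched then, can be built into the grammar by distinguishing matched statements (whose if's all have matching else's) from open ones:

stmt -> matched_stmt | open_stmt
matched_stmt -> if expr then matched_stmt else matched_stmt | other
open_stmt -> if expr then stmt
           | if expr then matched_stmt else open_stmt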
Elimination of Left Recursion
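In outline, a grammar is left recursive if it has a nonterminal A with a derivation A =>+ A alpha (A derives A alpha in one or more steps). Top-down parsers cannot handle left recursion directly, so immediate left recursion of the form

A -> A alpha | beta

is rewritten, by the standard transformation, as

A -> beta A'
A' -> alpha A' | epsilon

Both forms generate beta followed by zero or more repetitions of alpha, but the rewritten grammar is no longer left recursive. For example, E -> E + T | T becomes E -> T E' and E' -> + T E' | epsilon.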
Left Factoring
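In outline, left factoring rewrites productions that share a common prefix so that a top-down parser can defer its choice until enough of the input has been seen. Productions of the form

A -> alpha beta1 | alpha beta2

are rewritten, by the standard transformation, as

A -> alpha A'
A' -> beta1 | beta2

For example, the two dangling-else productions stmt -> if expr then stmt else stmt | if expr then stmt share the prefix if expr then stmt and are left-factored in exactly this way.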
Handle Pruning
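In outline, a handle of a right-sentential form is a substring matching the body of some production whose reduction represents one step of a rightmost derivation in reverse. Handle pruning repeatedly finds the handle and reduces it until only the start symbol remains. A small illustrative sketch, using the ambiguous expression grammar E -> E + E | E * E | ( E ) | id and the input id1 * id2 (a standard textbook example, not reproduced from the original notes):

Right-sentential form    Handle    Reducing production
id1 * id2                id1       E -> id
E * id2                  id2       E -> id
E * E                    E * E     E -> E * E
E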
Applications of Syntax-Directed Translation
The main application in this section is the construction of syntax trees. Since some compilers use syntax trees
as an intermediate representation, a common form of SDD turns its input string into a tree. To complete the
translation to intermediate code, the compiler may then walk the syntax tree, using another set of rules that
are in effect an SDD on the syntax tree rather than the parse tree.
We consider two SDD's for constructing syntax trees for expressions. The first, an S-attributed definition, is
suitable for use during bottom-up parsing. The second, L-attributed, is suitable for use during top-down
parsing.
Construction of Syntax Trees
We shall implement the nodes of a syntax tree by objects with a suitable number of fields. Each object will
have an op field that is the label of the node. The objects will have additional fields as follows:
If the node is a leaf, an additional field holds the lexical value for the leaf. A constructor function
Leaf(op, val) creates a leaf object. Alternatively, if nodes are viewed as records, then Leaf returns a pointer
to a new record for a leaf.
If the node is an interior node, there are as many additional fields as the node has children in the syntax
tree. A constructor function Node takes two or more arguments: Node(op, c1, c2, ..., ck) creates an
object with first field op and k additional fields for the k children c1, ..., ck.
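A minimal C sketch of such node objects, assuming a simple tagged structure with a small fixed bound on the number of children (the names Node, Leaf, and Node2 and the field layout are illustrative assumptions):

#include <stdlib.h>

#define MAX_CHILDREN 4                 /* assumed bound, enough for this sketch */

struct Node {
    int op;                            /* label of the node, e.g. '+' or a token code */
    int val;                           /* lexical value, used only by leaves */
    int nchildren;                     /* 0 for a leaf */
    struct Node *child[MAX_CHILDREN];
};

/* Leaf(op, val): create a leaf object holding a lexical value. */
struct Node *Leaf(int op, int val) {
    struct Node *n = malloc(sizeof *n);
    n->op = op; n->val = val; n->nchildren = 0;
    return n;
}

/* Node2(op, c1, c2): create an interior node; written here for the two-child case,
   the common case for binary operators. */
struct Node *Node2(int op, struct Node *c1, struct Node *c2) {
    struct Node *n = malloc(sizeof *n);
    n->op = op; n->val = 0; n->nchildren = 2;
    n->child[0] = c1; n->child[1] = c2;
    return n;
}

For instance, Node2('+', Node2('-', leaf_a, leaf_4), leaf_c) would build the syntax tree for a - 4 + c, where leaf_a, leaf_4, and leaf_c are results of earlier Leaf calls.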
Types and Declarations
Type checking uses logical rules to reason about the behavior of a program at run time. Specifically, it ensures
that the types of the operands match the type expected by an operator. For example, the && operator in Java
expects its two operands to be booleans; the result is also of type boolean.
Translation Applications. From the type of a name, a compiler can determine the storage that will be needed
for that name at run time. Type information is also needed to calculate the address denoted by an array
reference, to insert explicit type conversions, and to choose the right version of an arithmetic operator, among
other things.
Type Expressions
Types have structure, which we shall represent using type expressions: a type expression is either a basic type
or is formed by applying an operator called a type constructor to a type expression. The sets of basic types and
constructors depend on the language to be checked.
Example 6.8: The array type int[2][3] can be read as "array of 2 arrays of 3 integers each" and written as the
type expression array(2, array(3, integer)). This type is represented by the tree in Fig. 6.14. The operator array
takes two parameters, a number and a type.
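A minimal C sketch of how such type expressions might be represented as trees (the enum and struct names are illustrative assumptions):

#include <stdlib.h>

enum TypeKind { T_INTEGER, T_ARRAY };   /* a basic type and the array constructor */

struct TypeExpr {
    enum TypeKind kind;
    int size;                 /* number of elements, used when kind == T_ARRAY */
    struct TypeExpr *elem;    /* element type, used when kind == T_ARRAY */
};

/* array(size, elem): apply the array type constructor. */
struct TypeExpr *array_type(int size, struct TypeExpr *elem) {
    struct TypeExpr *t = malloc(sizeof *t);
    t->kind = T_ARRAY; t->size = size; t->elem = elem;
    return t;
}

With a basic type struct TypeExpr integer_t = { T_INTEGER, 0, NULL }, the type int[2][3] of Example 6.8 is built as array_type(2, array_type(3, &integer_t)).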
Back patching usually refers to the process of resolving forward branches that have been planted in
the code, e.g. at 'if' statements, when the value of the target becomes known, e.g. when the closing
brace or matching 'else' is encountered.
Code Generator
The final phase in our compiler model is the code generator. It takes as input the intermediate representation
(IR) produced by the front end of the compiler, along with relevant symbol table information, and produces
as output a semantically equivalent target program. The target program must preserve the semantic meaning
of the source program and be of high quality; that is, it must make effective use of the available resources of
the target machine.
A code generator has three primary tasks: instruction selection, register allocation and assignment, and
instruction ordering.
Issues in the Design of a Code Generator
1. Input to the Code Generator
The input to the code generator is the intermediate representation of the source program produced by the
front end, along with information in the symbol table that is used to determine the run-time addresses of the
data objects denoted by the names in the IR.
The many choices for the IR include three-address representations such as quadruples, triples, indirect
triples; virtual machine representations such as bytecodes and stack-machine code; linear representations
such as postfix notation; and graphical representations such as syntax trees and DAG's.
2. The Target Program
The most common target-machine architectures are RISC (reduced instruction set computer), CISC
(complex instruction set computer), and stack based. A RISC machine typically has many registers, three-
address instructions, simple addressing modes, and a relatively simple instruction-set architecture.
In contrast, a CISC machine typically has few registers, two-address instructions, a variety of addressing
modes, several register classes, variable-length instructions, and instructions with side effects.
Producing an absolute machine-language program as output has the advantage that it can be placed in a
fixed location in memory and immediately executed. Programs can be compiled and executed quickly.
Producing a relocatable machine-language program (often called an object module) as output allows
subprograms to be compiled separately. A set of relocatable object modules can be linked together and
loaded for execution by a linking loader.
3. Instruction Selection
The code generator must map the IR program into a code sequence that can be executed by the target
machine. The complexity of performing this mapping is determined by factors such as
the level of the IR, the nature of the instruction-set architecture, and the desired quality of the generated code.
If the IR is high level, the code generator may translate each IR statement into a sequence of machine
instructions using code templates. Such statement-by-statement code generation, however, often produces
poor code that needs further optimization. If the IR reflects some of the low-level details of the underlying
machine, then the code generator can use this information to generate more efficient code sequences.
If the target machine has an "increment" instruction (INC), then the three-address statement a = a + 1 may be
implemented more efficiently by the single instruction INC a, rather than by a more obvious sequence that
loads a into a register, adds one to the register, and then stores the result back into a, as sketched below.
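A hedged sketch in the register-machine notation used by many textbooks (LD, ADD, and ST are illustrative mnemonics; the exact opcodes depend on the target machine):

LD  R1, a        // R1 = a
ADD R1, R1, #1   // R1 = R1 + 1
ST  a, R1        // a = R1

versus the single instruction

INC a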
4. Register Allocation
A key problem in code generation is deciding what values to hold in what registers. Registers are the fastest
computational unit on the target machine, but we usually do not have enough of them to hold all values.
Values not held in registers need to reside in memory. Instructions involving register operands
are invariably shorter and faster than those involving operands in memory, so efficient utilization of registers
is particularly important. The use of registers is often subdivided into two subproblems:
1. Register allocation, during which we select the set of variables that will reside in registers at each point in
the program.
2. Register assignment, during which we pick the specific register that a variable will reside in.
5. Evaluation Order
The order in which computations are performed can affect the efficiency of the target code. As we shall see,
some computation orders require fewer registers to hold intermediate results than others. Initially, we shall
avoid the problem by generating code for the three-address statements in the order in which they have been
produced by the intermediate code generator.
Next-Use Information
Knowing when the value of a variable will be used next is essential for generating good code. If the value of
a variable that is currently in a register will never be referenced subsequently, then that register can be assigned
to another variable.
The use of a name in a three-address statement is defined as follows. Suppose three-address statement i assigns
a value to x. If statement j has x as an operand, and control can flow from statement i to j along a path that has
no intervening assignments to x, then we say statement j uses the value of x computed at statement i. We
further say that x is live at statement i.
We wish to determine for each three-address statement x = y + z what the next uses of x, y, and z are. For the
present, we do not concern ourselves with uses outside the basic block containing this three-address statement.
Our algorithm to determine liveness and next-use information makes a backward pass over each basic block.
We store the information in the symbol table. We can easily scan a stream of three-address statements to find
the ends of basic
blocks as in Algorithm 8.5. Since procedures can have arbitrary side effects, we assume for convenience that
each procedure call starts a new basic block.
Algorithm: Determining the liveness and next-use information for each statement in a basic block.
INPUT: A basic block B of three-address statements. We assume that the symbol table initially shows all
nontemporary variables in B as being live on exit.
OUTPUT: At each statement i: x = y + z in B, we attach to i the liveness and next-use information of x, y, and z.
METHOD: We start at the last statement in B and scan backwards to the beginning of B. At each statement i:
x = y + z in B, we do the following:
1. Attach to statement i the information currently found in the symbol table regarding the next use and liveness
of x, y, and z.
2. In the symbol table, set x to "not live" and "no next use."
3. In the symbol table, set y and z to "live" and the next uses of y and z to i.
Here we have used + as a symbol representing any operator. If the three-address statement i is of the form
x = + y or x = y, the steps are the same as above, ignoring z. Note that the order of steps (2) and (3) may not
be interchanged because x may be y or z.
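A minimal C sketch of this backward pass, assuming simple illustrative structures for three-address statements and symbol-table entries (the names Stmt, SymEntry, next_use_pass, and NO_NEXT_USE are assumptions, not the textbook's code):

#include <stdbool.h>

#define NO_NEXT_USE -1

/* Liveness and next-use information kept in the symbol table for one name. */
struct SymEntry {
    bool live;
    int next_use;    /* index of the next statement that uses the name, or NO_NEXT_USE */
};

/* Three-address statement x = y + z, with pointers into the symbol table,
   plus fields to record the information attached to the statement by the pass. */
struct Stmt {
    struct SymEntry *x, *y, *z;
    struct SymEntry x_info, y_info, z_info;
};

/* Determine liveness and next-use information for a basic block of n statements.
   The symbol table is assumed to be initialized so that nontemporary variables
   are marked live on exit from the block. */
void next_use_pass(struct Stmt *block, int n) {
    for (int i = n - 1; i >= 0; i--) {
        struct Stmt *s = &block[i];
        /* Step 1: attach the information currently in the symbol table to statement i. */
        s->x_info = *s->x;
        s->y_info = *s->y;
        s->z_info = *s->z;
        /* Step 2: x is (re)defined here, so just before this statement it is not live. */
        s->x->live = false;  s->x->next_use = NO_NEXT_USE;
        /* Step 3: y and z are used here, so just before this statement they are live,
           with next use i. Doing this after step 2 handles the case x == y or x == z. */
        s->y->live = true;   s->y->next_use = i;
        s->z->live = true;   s->z->next_use = i;
    }
}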