Additional Note CSC 409

The compilation process involves multiple phases that take the source code through lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation. An intermediate representation is generated after semantic analysis to make the code easier to optimize and generate for different target machines. Common intermediate representations include triple code and quadruples that break expressions into sub-expressions.

Uploaded by

Sadiq

Compilation Process

The compilation process is a sequence of various phases. Each phase takes input
from its previous stage, has its own representation of source program, and feeds its
output to the next phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis

The first phase of the compiler, the lexical analyzer (or scanner), works as a text
scanner. It reads the source code as a stream of characters and groups them into
meaningful lexemes. The lexical analyzer represents each lexeme as a token of the form:
<token-name, attribute-value>
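A minimal sketch of such a scanner in Python. The token names (NUM, ID, OP) and the operator set are illustrative assumptions, not part of any particular compiler:

```python
import re

# Each entry pairs a token name with the pattern for its lexemes.
# Whitespace is matched but discarded rather than emitted as a token.
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID",  r"[A-Za-z_]\w*"),
    ("OP",  r"[+\-*/=;]"),
    ("WS",  r"\s+"),
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "WS":          # discard whitespace
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r}")
    return tokens
```

For example, `tokenize("a = b + c * d;")` yields the pairs `("ID", "a")`, `("OP", "=")`, `("ID", "b")`, and so on: each `<token-name, attribute-value>` pair of the form described above.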

Syntax Analysis

The next phase is called syntax analysis or parsing. It takes the tokens produced by
lexical analysis as input and generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the grammar of the source language, i.e., the
parser checks whether the expression made by the tokens is syntactically correct.
Parse Tree
A parse tree is a graphical depiction of a derivation. It is a convenient way to see how
strings are derived from the start symbol. The start symbol of the derivation becomes
the root of the parse tree. Let us see this by an example.
We take the left-most derivation of id + id * id (the tokenized form of a + b * c).
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

The parse tree is built step by step from this derivation: Step 1 applies E → E * E at
the root, and Steps 2–5 expand the leftmost non-terminal in turn until the sentential
form id + id * id is reached. (The figures showing the tree after each step are omitted.)

In a parse tree:

 All leaf nodes are terminals.
 All interior nodes are non-terminals.
 In-order traversal gives the original input string.
A parse tree also depicts the associativity and precedence of operators. The deepest
sub-tree is traversed first, so the operator in that sub-tree takes precedence over
operators that appear higher up in the tree.
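These properties can be illustrated with a tiny tree for id + id * id. The `Node` class is a hypothetical helper, not from the note; leaves are terminals, interior nodes are non-terminals, and walking the children left to right recovers the input string:

```python
# A parse tree for id + id * id, with the * sub-tree deeper than +,
# so the multiplication is evaluated first.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []   # no children => leaf (terminal)

def frontier(node):
    # Left-to-right traversal of the leaves reproduces the input string.
    if not node.children:
        return [node.label]
    out = []
    for child in node.children:
        out.extend(frontier(child))
    return out

# E -> E + E, where the right-hand E derives id * id
tree = Node("E", [
    Node("E", [Node("id")]),
    Node("+"),
    Node("E", [Node("E", [Node("id")]),
               Node("*"),
               Node("E", [Node("id")])]),
])
```

Here `frontier(tree)` returns the terminals in order, `["id", "+", "id", "*", "id"]`, matching the derived string.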

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
Example
E → E + E
E → E – E
E → id

For the string id + id – id, the above grammar generates two parse trees: one grouping
the string as (id + id) – id and the other as id + (id – id).
A language is said to be inherently ambiguous if every grammar that generates it is
ambiguous. Ambiguity in a grammar is not good for compiler construction. No method
can detect and remove ambiguity automatically, but it can be removed either by re-
writing the whole grammar without ambiguity, or by setting and following associativity
and precedence constraints.

Associativity
If an operand has operators on both sides, the associativity of those operators decides
which side takes the operand. If the operation is left-associative, the operand is taken
by the operator on its left; if it is right-associative, the operator on its right takes it.
Example
Operations such as addition, subtraction, multiplication, and division are left-
associative. If the expression contains:
id op id op id

it will be evaluated as:


(id op id) op id

For example, (id + id) + id


Operations like Exponentiation are right associative, i.e., the order of evaluation in the
same expression will be:
id op (id op id)

For example, id ^ (id ^ id)
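The two groupings can be sketched as two folds over a list of operands. The helper names `fold_left` and `fold_right` are illustrative; the operands are chosen so the groupings give different results:

```python
# Left-associative grouping folds from the left: ((a op b) op c).
def fold_left(op, operands):
    result = operands[0]
    for x in operands[1:]:
        result = op(result, x)
    return result

# Right-associative grouping folds from the right: (a op (b op c)),
# as with exponentiation.
def fold_right(op, operands):
    result = operands[-1]
    for x in reversed(operands[:-1]):
        result = op(x, result)
    return result
```

For subtraction, `fold_left` gives (10 - 4) - 3 = 3 while `fold_right` gives 10 - (4 - 3) = 9; for exponentiation, the right-associative reading 2 ^ (3 ^ 2) = 512 differs from (2 ^ 3) ^ 2 = 64.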


Precedence
If two different operators share a common operand, the precedence of operators
decides which will take the operand. That is, 2+3*4 can have two different parse trees,
one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting
precedence among operators, this problem can be easily removed. As in the previous
example,
mathematically * (multiplication) has precedence over + (addition), so the expression
2+3*4 will always be interpreted as:
2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.
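One common way to encode precedence in a parser is precedence climbing: each operator has a binding power, and an operand is claimed by the operator with the higher power. This is a minimal sketch (the table and the in-place token list are assumptions for illustration):

```python
# Binding powers: * and / bind tighter than + and -.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def evaluate(tokens, min_prec=1):
    # tokens is a flat list of numbers and operator strings, consumed
    # in place; higher-precedence operators recurse first, so
    # 2 + 3 * 4 groups as 2 + (3 * 4).
    lhs = tokens.pop(0)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        rhs = evaluate(tokens, PREC[op] + 1)
        if op == "+":
            lhs += rhs
        elif op == "-":
            lhs -= rhs
        elif op == "*":
            lhs *= rhs
        else:
            lhs /= rhs
    return lhs
```

With this scheme, `evaluate([2, "+", 3, "*", 4])` computes 2 + (3 * 4) = 14, and `evaluate([10, "-", 4, "-", 3])` groups left-associatively as (10 - 4) - 3 = 3.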

Semantic Analysis

Semantic analysis checks whether the parse tree follows the rules of the language. For
example, it verifies that values are assigned between compatible data types and flags
errors such as adding a string to an integer. The semantic analyzer also keeps track of
identifiers, their types, and expressions, and checks whether identifiers are declared
before use. It produces an annotated syntax tree as output.

Intermediate Code Generation

After semantic analysis, the compiler generates an intermediate representation of the
source code for the target machine. It represents a program for some abstract machine
and sits between the high-level language and the machine language. This intermediate
code should be generated in such a way that it is easy to translate into the target
machine code.

Code Optimization

The next phase performs code optimization on the intermediate code. Optimization can
be viewed as removing unnecessary code lines and arranging the sequence of
statements so as to speed up program execution without wasting resources (CPU,
memory).

Code Generation

In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) re-locatable machine
code. This sequence of machine instructions performs the same task as the
intermediate code.
Symbol Table

It is a data structure maintained throughout all the phases of a compiler. All identifier
names, along with their types, are stored here. The symbol table makes it easier for
the compiler to quickly search for an identifier record and retrieve it. The symbol
table is also used for scope management.

Intermediate Code


If source code can be translated directly into its target machine code, why do we need
to translate it into an intermediate code first? Let us see the reasons why an
intermediate code is needed.

 If a compiler translates the source language directly to the target machine language,
without the option of generating intermediate code, then a full native compiler is
required for each new machine.
 Intermediate code eliminates the need for a new full compiler for every unique machine
by keeping the analysis portion the same for all compilers.
 The second part of the compiler, synthesis, is changed according to the target machine.
 It becomes easier to improve code performance by applying code optimization
techniques to the intermediate code.

Intermediate Representation
Intermediate code can be represented in a variety of ways, each with its own benefits.
 High-Level IR - A high-level intermediate representation is very close to the source
language itself. It can be generated easily from the source code, and code modifications
to enhance performance are easy to apply. It is less preferred for target-machine
optimization, however.
 Low-Level IR - This one is close to the target machine, which makes it suitable for
register and memory allocation, instruction-set selection, etc. It is good for machine-
dependent optimizations.
Intermediate code can be either language specific (e.g., Byte Code for Java) or
language independent (three-address code).

Three-Address Code
The intermediate code generator receives input from its predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree. That syntax tree can then
be converted into a linear representation, e.g., postfix notation. Intermediate code
tends to be machine-independent, so the generator assumes an unlimited number of
storage locations (registers) when generating code.
For example:
a = b + c * d;

The intermediate code generator will try to divide this expression into sub-expressions
and then generate the corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2

where r1 and r2 are temporary registers used in the target program.
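The division into sub-expressions can be sketched as a recursive walk over a syntax tree. Here the tree is a nested tuple and `gen_tac` is a hypothetical helper; each operator node gets a fresh temporary, mirroring the r1, r2 sequence above:

```python
import itertools

def gen_tac(node, code, fresh):
    # A leaf is an identifier; an interior node is (op, left, right).
    if isinstance(node, str):
        return node
    op, left, right = node
    l = gen_tac(left, code, fresh)
    r = gen_tac(right, code, fresh)
    result = f"r{next(fresh)}"            # fresh temporary for this node
    code.append(f"{result} = {l} {op} {r}")
    return result

# a = b + c * d  as the tree  ("+", "b", ("*", "c", "d"))
code = []
result = gen_tac(("+", "b", ("*", "c", "d")), code, itertools.count(1))
code.append(f"a = {result}")
```

The deepest sub-expression c * d is emitted first, so `code` ends up as the three statements r1 = c * d, r2 = b + r1, a = r2.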


A three-address instruction has at most three operands (addresses) to calculate the
expression. Three-address code can be represented in three forms: quadruples, triples,
and indirect triples.

Quadruples

Each instruction in quadruple presentation is divided into four fields: op, arg1,
arg2, and result. The above example is represented below in quadruple format:

Op arg1 arg2 result

* c d r1

+ b r1 r2

= r2 a
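Quadruples for a = b + c * d can be sketched as a list of four-field records; `quad_to_str`, a hypothetical helper, turns each record back into the three-address statement it encodes:

```python
# (op, arg1, arg2, result); plain assignment has no arg2.
quads = [
    ("*", "c",  "d",  "r1"),
    ("+", "b",  "r1", "r2"),
    ("=", "r2", None, "a"),
]

def quad_to_str(op, arg1, arg2, result):
    # Render a quadruple as its three-address statement.
    if op == "=":
        return f"{result} = {arg1}"
    return f"{result} = {arg1} {op} {arg2}"
```

Because each result has an explicit name (r1, r2), the records can be reordered or moved by an optimizer without breaking references, unlike triples.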

Triples

Each instruction in triple presentation has three fields: op, arg1, and arg2. The
results of the respective sub-expressions are denoted by the positions of those
sub-expressions in the list. Triples are similar to a DAG and a syntax tree; they are
equivalent to a DAG when representing expressions.

Op arg1 arg2

* c d

+ b (0)

= a (1)

Triples face the problem of code immovability during optimization: because results are
referred to by position, changing the order or position of an expression invalidates
those references.
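Triples for the same statement a = b + c * d can be sketched as rows without a result field, where a sub-expression is referenced by its position, written "(0)", "(1)", and so on. The `resolve` helper is a hypothetical illustration that expands a triple back into an infix string by following those references:

```python
triples = [
    ("*", "c", "d"),      # (0)
    ("+", "b", "(0)"),    # (1) arg2 refers to triple 0
    ("=", "a", "(1)"),    # (2) assign the value of (1) to a
]

def resolve(triples, i):
    # Expand triple i into an infix string, following positional refs.
    op, a1, a2 = triples[i]
    def val(x):
        if x.startswith("("):
            return "(" + resolve(triples, int(x[1:-1])) + ")"
        return x
    if op == "=":
        return f"{val(a1)} = {val(a2)}"
    return f"{val(a1)} {op} {val(a2)}"
```

Resolving the last triple reconstructs the whole statement; but note that if the rows were reordered, the references "(0)" and "(1)" would point at the wrong rows, which is exactly the immovability problem described above.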

Indirect Triples

This representation is an enhancement over the triple representation. It uses pointers
to triples, instead of positions, to refer to results. This enables the optimizer to freely
re-position sub-expressions to produce optimized code.
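A minimal sketch of the idea, with list indices standing in for pointers: the triples themselves stay put, and a separate listing fixes the execution order, so the optimizer can rearrange the listing without touching the positional references inside the triples:

```python
triples = [("*", "c", "d"), ("+", "b", "(0)"), ("=", "a", "(1)")]

# The instruction listing holds pointers (indices) into the triple
# table; reordering the listing does not disturb the triples.
listing = [0, 1, 2]
program = [triples[i] for i in listing]
```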

Declarations
A variable or procedure has to be declared before it can be used. Declaration involves
allocation of space in memory and entry of type and name in the symbol table. A
program may be coded and designed keeping the target machine structure in mind, but
it may not always be possible to accurately convert a source code to its target
language.
Taking the whole program as a collection of procedures and sub-procedures, it
becomes possible to declare all names local to a procedure. Memory is allocated
consecutively, and names are assigned memory in the sequence they are declared in
the program. We use an offset variable, initialized to zero {offset = 0}, to denote the
relative address from the base address.
The source programming language and the target machine architecture may vary in
the way names are stored, so relative addressing is used. The first name is allocated
memory starting from location 0 {offset = 0}; each name declared after it is allocated
memory next to the previous one.
Example:
We take the example of the C programming language, assuming an implementation
where an integer variable occupies 2 bytes of memory and a float variable occupies 4
bytes.
int a;
float b;

Allocation process:
{offset = 0}

int a;
id.type = int
id.width = 2

offset = offset + id.width


{offset = 2}

float b;
id.type = float
id.width = 4

offset = offset + id.width


{offset = 6}

To enter this detail in a symbol table, a procedure enter can be used. This method may
have the following structure:
enter(name, type, offset)

This procedure should create an entry in the symbol table, for variable name, having its
type set to type and relative address offset in its data area.
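A minimal sketch of this enter procedure, assuming the widths from the example above (int = 2 bytes, float = 4 bytes) and a plain dictionary as the symbol table:

```python
WIDTH = {"int": 2, "float": 4}
symbol_table = {}

def enter(name, typ, offset):
    # Record the name with its type and relative address, then return
    # the offset where the next declaration will be placed.
    symbol_table[name] = {"type": typ, "offset": offset}
    return offset + WIDTH[typ]

offset = 0
offset = enter("a", "int", offset)    # a stored at offset 0
offset = enter("b", "float", offset)  # b stored at offset 2
```

After both declarations, a sits at relative address 0, b at 2, and offset is 6, matching the allocation trace above.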

ERROR Recovery

A parser should be able to detect and report any error in the program. When an error
is encountered, the parser should be able to handle it and carry on parsing the rest of
the input. Most error checking is expected of the parser, but errors may be
encountered at various stages of the compilation process. A program may have the
following kinds of errors at the various stages:
 Lexical : name of some identifier typed incorrectly
 Syntactical : missing semicolon or unbalanced parenthesis
 Semantical : incompatible value assignment
 Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the
parser to deal with errors in the code.

Panic mode

When a parser encounters an error anywhere in a statement, it ignores the rest of the
statement, discarding input from the point of error up to a delimiter such as a semi-
colon. This is the easiest method of error recovery, and it also prevents the parser from
entering an infinite loop.
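The skip-to-delimiter step can be sketched as follows; the function name and the token list are illustrative assumptions:

```python
def skip_to_delimiter(tokens, pos, delimiter=";"):
    # Discard tokens from the error position up to and including the
    # synchronizing delimiter; return the index where parsing resumes.
    while pos < len(tokens) and tokens[pos] != delimiter:
        pos += 1
    return pos + 1
```

For example, with tokens `["x", "=", "@", "!", ";", "y"]` and the error detected at index 2, recovery skips to index 5, i.e., the token "y" that starts the next statement.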

Statement mode

When a parser encounters an error, it tries to take corrective measures so that the rest
of the statement can still be parsed: for example, inserting a missing semicolon or
replacing a comma with a semicolon. Parser designers must be careful here, because
one wrong correction may lead to an infinite loop.

Error productions

Some common errors that may occur in code are known to compiler designers. The
designers can create an augmented grammar with extra productions that generate
these erroneous constructs, so the errors are recognized when encountered.
