Program Analysis and Translation Phases
Principles of Programming Languages
Program Analysis and Translation Phases
• To execute a computer program, its source code must first be translated
into machine-understandable code.
• To achieve this, the program to be executed must first be analyzed to
determine whether it is a valid program from the point of view of its:
• lexical conventions (lexical analysis),
• syntactical form (syntactical analysis), and
• semantic validity (semantic analysis).
• During analysis, each component reports the errors it finds, allowing the
programmer to understand their location and nature.
• After the program has been confirmed to be valid, it is translated (code
generation) and possibly optimized (code optimization) before execution.
Program Analysis and Translation Phases
• Two Ends of Program Analysis
• Front end Analysis
• Back end Analysis
• Two Ends of Program translation
• Interpretation
• Compilation
Compilation and Interpretation
• Compiler and Interpreter
• Compiler
• A compiler is a computer program (or set of programs) that
transforms source code written in a computer language (the source
language) into another computer language (the target language,
often having a binary form known as object code). The most
common reason for wanting to transform source code is to create an
executable program.
• Interpreter
• An interpreted language is a programming language whose
programs are not directly executed by the host CPU but rather
executed (or said to be interpreted) by a software program known as
an interpreter. The source code of the program is often translated to
a form that is more convenient to interpret, which may be some form
of machine language for a virtual machine (such as Java's bytecode).
Compilation and Interpretation
• Classes of Compiler
• Source-to-source compiler: A source-to-source compiler is a type of
compiler that takes a high-level language as its input and outputs a
high-level language. For example, an automatic parallelizing compiler
will frequently take in a high-level language program as input and
then transform the code and annotate it with parallel code annotations
or language constructs.
• Stage compiler: A stage compiler is a compiler that compiles to the assembly
language of a theoretical machine, as in most Prolog implementations, which
target what is known as the “Warren Abstract Machine" (or WAM). Bytecode
compilers for Java (executed on a Java Virtual Machine (JVM)) and
Python (executed on the CPython virtual machine), and many more, are
also a subtype of this. In a sense, languages compiled in this manner are
evaluated in a hybrid compilation/interpretation mode, where the
source code is compiled into bytecode, which is then interpreted at
runtime.
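Python itself illustrates this hybrid mode: the standard `compile` built-in turns source text into a bytecode object for the CPython virtual machine, `exec` interprets that bytecode, and the standard `dis` module shows the generated instructions. A minimal sketch:

```python
import dis

# Compile a small source fragment to a CPython code object (the
# "compilation" half of the hybrid), then run it with exec() (the
# "interpretation" half).
source = "result = (3 + 4) * 2"
code = compile(source, "<example>", "exec")   # source -> bytecode
namespace = {}
exec(code, namespace)                          # bytecode is interpreted
print(namespace["result"])                     # 14
dis.dis(code)                                  # human-readable bytecode listing
```

The `dis` output makes the theoretical-machine character of the bytecode visible: instructions such as loads, a multiply, and a store, rather than native machine code.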
Compilation and Interpretation
• Classes of Compilation
• Dynamic compilation: Dynamic compilation is a process used by some programming
language implementations to gain performance during program execution (i.e., at run
time).
• Thus it allows code optimizations to be made at runtime.
• The best-known language that uses this technique is Java.
• Runtime environments using dynamic compilation typically run programs
slowly for the first few minutes; after that, most of the compilation and
recompilation is done, and execution speeds up.
• Due to this initial performance lag, dynamic compilation is undesirable in certain
cases.
• In most implementations of dynamic compilation, some optimizations that could be
done at the initial compile time are delayed until further compilation at runtime,
causing further unnecessary slowdowns. Just-in-time compilation is a form of
dynamic compilation.
Compiler and Interpreter
• Classes of Compilation
• Just-in-time compiler: Just-in-time compilation (JIT), also known as dynamic
translation, is a technique for improving the runtime performance of a computer
program.
• It converts code at runtime prior to executing it natively, for example translating
bytecode into native machine code.
• It has advantages over statically compiling the code at development time: the
runtime system can recompile the code if this is found to be advantageous,
and may be able to enforce security guarantees.
• Several modern runtime environments, such as Microsoft's .NET Framework and
most implementations of Java, rely on JIT compilation for high-speed code
execution. JIT techniques generally offer better performance than
interpreters.
Compiler and Interpreter
• Classes of Compilation
• Ahead-of-time compiler: Ahead-of-time compilation (AOT) refers to the act of
compiling an intermediate language, such as Java bytecode or .NET Common
Intermediate Language (CIL), into a system-dependent binary.
• Most languages that can be compiled to an intermediate language (such as
bytecode) take advantage of just-in-time compilation.
• JIT compiles intermediate code into native binary code while the
intermediate code is executing, which may decrease an application's performance.
• Ahead-of-time compilation thus eliminates the need for this step by performing
the compilation before execution rather than during it.
Compilation vs Interpretation
Compilation
• Source program → Compiler → Target program
• Input → Target program → Output
Interpretation
• Source program + Input → Interpreter → Output
Trade-Off Between Compilation and Interpretation
Advantages of compilation
• Standalone code
• Faster code
• Smaller code
Advantages of interpretation
• Faster development
• More flexibility and possibly more expressive power
• Late binding
• Dynamic features
Compilation-Interpretation Boundary
A translation program that translates source code into machine code or some
intermediate code is a compiler if it needs to perform a semantic analysis of
the source code to produce the intermediate code.
• Java is compiled: the bytecode is close to machine code, and semantic
analysis is required to produce it.
• Perl is interpreted: the intermediate code is produced only when the
program is run.
• BASIC is interpreted: early interpreters stored the program in tokenized form;
no semantic analysis is required to produce this.
Phases of Compilation
Front end:
• Source program (character stream) → Scanner (lexical analysis) → Token stream
• Token stream → Parser (syntactic analysis) → Parse tree
• Parse tree → Semantic analysis and code generation (consults the Symbol
table) → Abstract syntax tree or other intermediate form
Back end:
• Intermediate form → Machine-independent code improvement → Modified
intermediate form
• Modified intermediate form → Target code generation → Target language
(e.g., assembly)
• Target language → Machine-specific code improvement → Modified target
language
Front end and Back end analysis
Front end analysis
• The front end analyses the source code to build an internal representation
of the program, called the intermediate representation.
• It also manages the symbol table, a data structure mapping each symbol in
the source code to associated information such as location, type and scope.
• This is done over several phases, which include the following:
• Lexical analysis
• Syntactic analysis
• Semantic analysis
Back end analysis
• Back end analysis is done over the following phases:
• Analysis
• Optimization
• Code generation
Front end and Back end Analysis: Front End
Lexical Analysis
• Lexical analysis is the process of converting a sequence of characters into a
sequence of tokens.
• A program or function which performs lexical analysis is called a lexical
analyzer, lexer, or scanner.
• A lexer often exists as a single function, which is called by the parser.
• The lexical specification of a programming language is defined by a set of
rules which define the lexer and which are understood by a lexical-analyzer
generator such as Lex.
• The lexical analyzer reads in a stream of characters, identifies the
lexemes in the stream, categorizes them into tokens, and outputs a token
stream. This is called “tokenizing."
• It groups input characters into tokens (e.g., identifiers, keywords,
numbers).
• It removes extraneous characters (e.g., spaces, tabs, newline characters)
and comments.
• If the lexer finds an invalid token, it reports an error.
• The lexical analyzer may be generated automatically by a tool like Lex, or
hand-crafted.
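As a sketch of a hand-crafted scanner (not the output of a generator such as Lex), the rules can be written as regular expressions, one per token class. The token names and the subset of Pascal tokens chosen here are illustrative; a real lexer would also separate keywords from identifiers:

```python
import re

# Illustrative token specification: (token name, regex), tried in order
# at each position. NEQ must precede OP so "<>" is one token, not two.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),
    ("ASSIGN",  r":="),
    ("NEQ",     r"<>"),
    ("OP",      r"[-+*/<>]"),
    ("PUNCT",   r"[();,.:]"),
    ("IDENT",   r"[A-Za-z_]\w*"),     # keywords lex as IDENT in this sketch
    ("SKIP",    r"[ \t\n]+"),         # whitespace is discarded
    ("COMMENT", r"\{[^}]*\}"),        # Pascal-style comments are discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Convert a character stream into a token stream, dropping
    whitespace and comments; report an error on any invalid character."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"invalid token at position {pos}")
        if m.lastgroup not in ("SKIP", "COMMENT"):
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("while i <> j do")))
# [('IDENT', 'while'), ('IDENT', 'i'), ('NEQ', '<>'), ('IDENT', 'j'), ('IDENT', 'do')]
```

Running it on a line of the gcd example such as `i := i - j;` yields the token stream `IDENT ASSIGN IDENT OP IDENT PUNCT`, with the lexemes attached.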
Front end and Back end Analysis: Front End
Syntactic Analysis
• Syntax analysis involves parsing (i.e., grammatical analysis of) the token
sequence to identify the syntactic structure of the program.
• The parser's output is some form of intermediate representation of the
program's structure, typically a parse tree (i.e. a diagrammatic
representation of the parsed structure of a sentence or string).
• The parse tree replaces the linear sequence of tokens with a tree structure
built according to the rules of a formal grammar, which defines the
language's syntax. This is usually done with reference to a context-free
grammar, which recursively defines components that can make up an
expression and the order in which they must appear. The parse tree is often
analyzed, augmented, and transformed by later phases in the compiler.
• Parsing organizes tokens into a parse tree according to a grammar
describing the language, ensuring that the sequence of tokens conforms to
the syntax defined by the language grammar.
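A hedged sketch of this idea: a tiny recursive-descent parser for a two-rule expression grammar (an illustrative fragment, not the Pascal grammar used later), which turns a flat token list into a nested tree built according to the grammar rules:

```python
# Illustrative grammar fragment:
#   Expr -> Term (('+' | '-') Term)*
#   Term -> NUMBER | IDENT
# Tokens are (kind, value) pairs; the tree is built from nested tuples.

def parse_expr(tokens, pos=0):
    """Parse an Expr starting at pos; return (tree, next_position)."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][1] in ("+", "-"):
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        tree = (op, tree, right)          # builds a left-associative tree
    return tree, pos

def parse_term(tokens, pos):
    """Parse a Term: a single NUMBER or IDENT token."""
    kind, value = tokens[pos]
    if kind not in ("NUMBER", "IDENT"):
        raise SyntaxError(f"unexpected token {value!r}")
    return value, pos + 1

tokens = [("IDENT", "i"), ("OP", "-"), ("IDENT", "j"),
          ("OP", "+"), ("NUMBER", "1")]
tree, _ = parse_expr(tokens)
print(tree)  # ('+', ('-', 'i', 'j'), '1')
```

The linear token sequence `i - j + 1` becomes the tree `(+ (- i j) 1)`, whose shape, not the token order alone, records how the grammar groups the operands.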
Front end and Back end Analysis: Front End
Semantic Analysis
• Semantic analysis is the phase in which the compiler adds semantic information to
the parse tree and builds the symbol table.
• This phase performs semantic checks such as type checking (checking for type
errors), or object binding (associating variable and function references with their
definitions), or definite assignment (requiring all local variables to be initialized
before use), rejecting incorrect programs or issuing warnings.
• Semantic analysis usually requires a complete parse tree, meaning that this phase
logically follows the parsing phase, and logically precedes the code generation
phase, though it is often possible to fold multiple phases into one pass over the code
in a compiler implementation.
• Not all rules defining programming languages can be expressed by context-free
grammars alone, for example semantic validity such as type validity and proper
declaration of identifiers. These rules can be formally expressed with attribute
grammars that implement attribute migration across syntax tree nodes when
necessary.
Semantic Analysis
• Generates the symbol table and an intermediate representation of the
program (e.g., a syntax tree) from the parse tree. The symbol table maps
each identifier to the information known about it.
• Produces the abstract syntax tree, which includes only the important nodes
of the parse tree and also stores annotations of these nodes (e.g.,
pointers from identifiers to symbol table entries).
• Enforces rules not captured by the context-free grammar (e.g., use an
identifier only after it has been declared).
• Generates code for run-time checks.
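A minimal sketch of one such rule, "use an identifier only after it has been declared": walk a flat list of AST nodes, build the symbol table from declarations, and reject uses of undeclared names. The node shapes `('decl', name, type)` and `('use', name)` are illustrative, not from any particular compiler:

```python
def check(program):
    """Build a symbol table from declarations and reject any use of an
    undeclared identifier; return the table on success."""
    symbols = {}                          # maps identifier -> type info
    for node in program:
        if node[0] == "decl":
            _, name, typ = node
            symbols[name] = typ           # record the declaration
        elif node[0] == "use":
            _, name = node
            if name not in symbols:       # rule the grammar cannot express
                raise NameError(f"{name!r} used before declaration")
    return symbols

# Mirrors the gcd example: var i, j : integer; then uses of i and j.
program = [("decl", "i", "integer"), ("decl", "j", "integer"),
           ("use", "i"), ("use", "j")]
print(check(program))  # {'i': 'integer', 'j': 'integer'}
```

A context-free grammar cannot enforce this rule because whether `("use", "k")` is legal depends on declarations arbitrarily far away in the program, which is exactly the kind of context the symbol table carries.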
Front end and Back end Analysis: Back End
The back end gathers program information from the intermediate
representation derived by the front end and performs the precise analysis required
for compiler optimization. The processes involved at the back end are:
a. Code Optimization
The intermediate-language representation is transformed into functionally equivalent
but faster (or smaller) forms. Popular optimizations are inline expansion, dead-code
elimination, constant propagation, loop transformation, register allocation, and even
automatic parallelization.
b. Code generation
The transformed intermediate language is translated into the output language, usually
the native machine language of the system. This involves resource and storage
decisions, such as deciding which variables to fit into registers and memory and the
selection and scheduling of appropriate machine instructions along with their
associated addressing modes.
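One of the simpler machine-independent improvements, constant folding (a close relative of the constant propagation listed above), can be sketched over small tuple-shaped expression trees. The node shapes are illustrative, not a real compiler's IR:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(tree):
    """Recursively replace any subtree whose operands are all constants
    with its computed value; leave everything else unchanged."""
    if not isinstance(tree, tuple):
        return tree                        # leaf: a number or an identifier
    op, left, right = tree
    left, right = fold(left), fold(right)  # fold children first
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)        # both operands known: evaluate now
    return (op, left, right)               # identifier present: keep the node

# (i + (2 * 3)) - 4 folds to (i + 6) - 4: the identifier blocks full folding.
print(fold(("-", ("+", "i", ("*", 2, 3)), 4)))  # ('-', ('+', 'i', 6), 4)
print(fold(("*", ("+", 1, 2), 4)))              # 12
```

The result is functionally equivalent but cheaper at runtime: arithmetic that the compiler can finish at compile time never reaches the generated code.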
Compilation Process of a Sample Pascal Program
{ calculate greatest common divisor }
program gcd( input, output );
var i, j : integer;
begin
read( i, j );
while i <> j do
if i > j then i := i - j
else j := j - i;
writeln( i );
end.
Program After Tokenization (Scanning)
{ calculate greatest common divisor }
program gcd ( input , output ) ;
var i , j : integer ;
begin
read ( i , j ) ;
while i <> j do
if i > j then i := i - j
else j := j - i ;
writeln ( i ) ;
end .
A Context-Free Grammar for the Language in the Example
Rules
Program → program identifier ( identifier More_identifiers ) ; Block .
Block → Labels Constants Types Variables Subroutines
begin Statement More_statements end
More_identifiers → ε
More_identifiers → , identifier More_identifiers
...
Terminals
• program, var, integer, begin, end, while, do, if, then, else
• (, ), ,, ;, :, ., :=, <>, >, -
• identifier
Non-terminals
• Program, More_identifiers, Block, Labels, Constants, Types, . . .
The Parse Tree of the Program
[Figure: partial parse tree of the gcd program. The root Program derives
program identifier(gcd) ( identifier(input) More_identifiers ) ; Block . ,
where More_identifiers derives ", identifier(output) More_identifiers" and
then ε. Block derives Labels Constants Types Variables Subroutines begin
Statement More_statements end, with Labels, Constants, Types, and
Subroutines all ε and Variables holding the var declarations.]
The Abstract Syntax Tree and the Symbol Table
Symbol table:
Index  Symbol    Type
1      integer   type          (predefined)
2      textfile  type          (predefined)
3      input     2 (textfile)  (predefined)
4      output    2 (textfile)  (predefined)
5      gcd       program       (user-defined)
6      i         1 (integer)   (user-defined)
7      j         1 (integer)   (user-defined)
[Figure: abstract syntax tree of the gcd program, with identifiers
annotated by their symbol-table indices: the program node (5) contains
reads of i (3, 6) and j (3, 7), a while loop whose condition is i <> j
(6, 7) and whose body is an if comparing i > j (6, 7) and choosing between
the assignments i := i - j and j := j - i, followed by a writeln of i
(4, 6).]