Introduction To Compilers
A compiler is software that takes code written in a high-level language (such as C++ or Java) as input and translates the whole of it into a lower-level language in one go. If the input code does not follow the rules of its language, the compiler lists the errors it finds. A compiled program typically runs much faster than the same program under an interpreter, but because the errors are reported together after translation, debugging them can be harder.
A compiler is a translator program that converts the instructions of a high-level language into machine-level language. The program given as input to the compiler is called the source program. The machine-level program that the compiler produces is known as the object code.
Introduction to Compiler
A compiler is a translator that converts a high-level language into machine language.
A high-level language is written by a developer, and machine language can be understood by the processor.
The compiler also reports errors to the programmer.
The main purpose of a compiler is to translate code written in one language into another language without changing the meaning of the program.
When a program written in a high-level language (HLL) is prepared for execution, the translation happens in two parts.
In the first part, the source program is compiled and translated into the object program (low-level language).
In the second part, the object program is translated into the target program by the assembler.
Fig: Execution process of source program in Compiler
Language Processors: Assembler, Compiler and Interpreter
Computer programs are generally written in high-level languages (like C++, Python, and Java). A language processor, or language translator, is a computer program that converts source code from one programming language to another language or to machine code (also known as object code). Language processors also find errors during translation.
What are Language Processors?
Compilers and interpreters translate programs written in high-level languages into machine code that a computer understands, and assemblers translate programs written in low-level or assembly language into machine code. The compilation process has several stages, and tools are available to help programmers write error-free code.
Assembly language is machine-dependent, and the mnemonics it uses to represent instructions are not directly understandable by the machine. A high-level language is machine-independent. A computer understands instructions only in machine code, i.e. in the form of 0s and 1s, and writing a program directly in machine code is tedious. Programs are therefore written mostly in high-level languages such as Java, C++, and Python, and are called source code. Source code cannot be executed directly by the computer and must first be converted into machine language. The special system software that translates a program written in a high-level language into machine code is called a language processor, and the translated program is called the object program or object code.
Types of Language Processors
The language processors can be any of the following three types:
1. Compiler
A language processor that reads the complete source program written in a high-level language in one go and translates it into an equivalent program in machine language is called a compiler.
Examples of languages that are usually compiled: C, C++, C#.
The compiler translates the source code to object code only if the source is free of errors. If there are errors, the compiler reports them, with line numbers, at the end of compilation. The errors must be removed before the source code can be recompiled successfully. Once compiled, the object program can be executed any number of times without being translated again.
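The "translate everything first, then run" order described above can be sketched with Python's built-in compile and exec functions. This is only an illustration of the order of events: Python's compile produces bytecode rather than machine code, and it reports only the first syntax error rather than a full list. The helper name compile_then_run and the two sample programs are made up for this sketch.

```python
def compile_then_run(source):
    """Translate the whole program first; run nothing if translation fails."""
    log = []
    try:
        code_object = compile(source, "<demo>", "exec")
    except SyntaxError as error:
        # No statement of the program has been executed at this point.
        return [f"compile error on line {error.lineno}"]
    exec(code_object, {"log": log})
    return log


good = "log.append('step 1')\nlog.append('step 2')\n"
bad = "log.append('step 1')\nx = = 1\n"   # syntax error on line 2

print(compile_then_run(good))  # ['step 1', 'step 2']
print(compile_then_run(bad))   # ['compile error on line 2']
```

Note that for the faulty program even the valid first line is never executed, because translation of the whole program failed before execution could start.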
2. Assembler
An assembler translates a program written in assembly language into machine code. The input to the assembler is a source program containing assembly language instructions, and its output is object code or machine code that the computer understands. The assembler was the first interface that let humans communicate with the machine in symbolic form rather than in raw binary; it bridges the gap between the two. Code written in assembly language consists of mnemonics (instructions) such as ADD, MUL, SUB, DIV, and MOV, and the assembler converts these mnemonics into binary code. The mnemonics depend on the architecture of the machine.
For example, the Intel 8085 and the Intel 8086 have different architectures and therefore different instruction sets.
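As a minimal sketch of what an assembler does, the following Python snippet translates mnemonics into binary words. The opcode table, the register names, and the 8-bit instruction format (4-bit opcode followed by two 2-bit register fields) are hypothetical and do not correspond to the 8085, the 8086, or any real processor.

```python
# Hypothetical 4-bit opcodes for an imaginary machine (not a real instruction set).
OPCODES = {"MOV": 0b0001, "ADD": 0b0010, "SUB": 0b0011, "MUL": 0b0100, "DIV": 0b0101}
# Hypothetical 2-bit register numbers.
REGISTERS = {"R0": 0, "R1": 1, "R2": 2, "R3": 3}


def assemble_line(line):
    """Translate one line such as 'ADD R1, R2' into an 8-bit binary string."""
    mnemonic, operands = line.split(maxsplit=1)
    dst, src = [name.strip() for name in operands.split(",")]
    word = (OPCODES[mnemonic] << 4) | (REGISTERS[dst] << 2) | REGISTERS[src]
    return format(word, "08b")


def assemble(program):
    """Assemble every non-empty line of an assembly program."""
    return [assemble_line(line) for line in program.splitlines() if line.strip()]


machine_code = assemble("""
MOV R0, R1
ADD R0, R2
""")
print(machine_code)  # ['00010001', '00100010']
```

A different architecture would need a different opcode table and instruction layout, which is why assembly programs are not portable between machines.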
3. Interpreter
A language processor that translates a single statement of the source program and executes it immediately before moving on to the next line is called an interpreter. If there is an error in a statement, the interpreter stops translating at that statement and displays an error message; it moves on to the next line only after the error has been removed. An interpreter directly executes instructions written in a programming or scripting language without first converting the whole program to object code or machine code.
Examples of languages that are usually interpreted: Perl, Python, and MATLAB.
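The line-at-a-time behaviour can be sketched with a tiny interpreter for a made-up language. The three statement forms it accepts (name = integer, name = name + name, print name) and the function name interpret are assumptions for this illustration only.

```python
def interpret(source):
    """Execute a toy program one line at a time; stop at the first bad line."""
    variables = {}
    output = []
    for number, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        try:
            if len(parts) == 2 and parts[0] == "print":
                output.append(variables[parts[1]])
            elif len(parts) == 3 and parts[1] == "=":
                variables[parts[0]] = int(parts[2])
            elif len(parts) == 5 and parts[1] == "=" and parts[3] == "+":
                variables[parts[0]] = variables[parts[2]] + variables[parts[4]]
            else:
                raise ValueError("unrecognised statement")
        except (KeyError, ValueError):
            # Earlier lines have already run; later lines are never looked at.
            output.append(f"line {number}: error in {line!r}")
            break
    return output


program = "\n".join([
    "a = 2",
    "b = 3",
    "c = a + b",
    "print c",
    "print d",   # error: d was never assigned
    "print a",   # never reached
])
result = interpret(program)
print(result)  # [5, "line 5: error in 'print d'"]
```

The first four lines take effect before the error is discovered, which is the key difference from a compiler that rejects the whole program up front.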
Difference Between Compiler and Interpreter
| Compiler | Interpreter |
|---|---|
| A compiler is a program that converts the entire source code of a programming language into executable machine code for a CPU. | An interpreter takes a source program and runs it line by line, translating each line as it comes to it. |
| The compiler takes a large amount of time to analyze the entire source code, but the overall execution time of the program is comparatively faster. | An interpreter takes less time to analyze the source code, but the overall execution time of the program is slower. |
| The compiler generates error messages only after scanning the whole program, so debugging is comparatively hard because the error can be anywhere in the program. | Debugging is easier because the interpreter translates the program only until an error is met. |
| The compiler requires a lot of memory for generating object code. | It requires less memory than a compiler because no object code is generated. |
| Generates intermediate object code. | No intermediate object code is generated. |
| For security purposes, a compiler is more useful. | The interpreter is a little vulnerable in terms of security. |
| Examples: C, C++, C# | Examples: Python, Perl, JavaScript, Ruby |
Generation of Programming Languages
Programming languages have evolved significantly over time, moving from fundamental
machine-specific code to complex languages that are simpler to write and understand. Each
new generation of programming languages has improved, allowing developers to create more
efficient, human-readable, and adaptable software. The transition from the first low-level
languages to current, high-level languages offered new tools and ideas that continue to
influence how we write software today.
Generations of Programming Languages
There are five generations of programming languages. The first two generations are called low-level languages, and the next three are called high-level languages.
1. First-Generation Language :
The first-generation languages are also called machine languages (1GL). A machine language is machine-dependent. Its statements are written in binary code (0s and 1s) because the computer can understand only binary.
Advantages :
Fast and efficient, because statements are written directly in binary.
No translator is required.
Disadvantages :
Binary codes are difficult to learn.
Programs are difficult to read, and errors are difficult to locate.
2. Second Generation Language :
The second-generation languages are also called assembly languages (2GL). Assembly language contains human-readable notations that are converted to machine language using an assembler.
Assembler – converts assembly-level instructions to machine-level instructions.
Programmers write the code using symbolic instruction codes (mnemonics), which are meaningful abbreviations of the machine instructions. Assembly language is also a low-level language.
Advantages :
It is easier to understand than machine language.
Modifications are easy.
Locating and correcting errors is easier.
Disadvantages :
An assembler is required.
The language is architecture-dependent (machine-dependent), with a different instruction set for different machines.
3. Third-Generation Language :
The third-generation languages are also called procedural languages (3GL). They use a series of English-like words that humans can understand easily to write instructions. They are also called high-level programming languages. For execution, a program in such a language must be translated into machine language using a compiler or an interpreter. Examples of this type of language are C, PASCAL, FORTRAN, COBOL, etc.
Advantages :
The use of English-like words makes the language human-understandable.
Fewer lines of code are needed than in the first two generations.
The same code can be copied to another machine and executed there by using a compiler specific to that machine.
Disadvantages :
A compiler or interpreter is needed.
Different compilers are needed for different machines.
4. Fourth Generation Language :
The fourth-generation languages are also called non-procedural languages (4GL). They are typically used to access databases. Examples: SQL, FoxPro, FOCUS, etc.
These languages are also easy for humans to understand.
Advantages :
Easy to understand and learn.
Less time is required for application creation.
It is less prone to errors.
Disadvantages :
Memory consumption is high.
Has poor control over hardware.
Less flexible.
5. Fifth Generation Language :
The fifth-generation languages are also called 5GL. They are based on the concept of artificial intelligence: rather than solving a problem algorithmically, an application is built to solve it based on some constraints, i.e., the computer is made to work out the solution. Parallel processing and superconductors have been proposed as supporting technologies for this type of language.
Examples: PROLOG, LISP, etc.
Advantages :
Machines can make decisions.
Less programmer effort is needed to solve a problem.
Easier than 3GL or 4GL to learn and use.
Disadvantages :
Code can be complex and long.
More resources are required, and they are expensive.
Compiler Design - Phases of Compiler
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the next
phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis:
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
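A minimal hand-written scanner, sketched in Python, shows how a character stream becomes a list of <token-name, attribute-value> pairs. The token names ("id", "number", "op") and the set of accepted operator characters are assumptions chosen for this example, not part of any particular language definition.

```python
def tokenize(source):
    """Scan a character stream and return (token-name, attribute-value) pairs."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1                                   # whitespace separates lexemes
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("number", int(source[start:i])))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("id", source[start:i]))
        elif ch in "+-*/=();":
            tokens.append(("op", ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at position {i}")
    return tokens


tokens = tokenize("position = initial + rate * 60")
print(tokens)
```

For the sample statement the scanner yields three identifier tokens, two operator tokens between them, and a final number token whose attribute value is the integer 60.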
Syntax Analysis:
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the source code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
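One common way to write a parser by hand is recursive descent, sketched below for arithmetic expressions. The grammar (expr, term, factor), the nested-tuple tree shape, and the use of a regular expression to split the input into tokens are all choices made for this illustration; characters that match no token pattern are simply skipped here.

```python
import re


def parse(text):
    """Parse an arithmetic expression and return a syntax tree of nested tuples."""
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():                      # expr -> term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():                      # term -> factor (('*' | '/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                    # factor -> NUMBER | ID | '(' expr ')'
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {token!r}")
        return advance()

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("a + b * 2"))    # ('+', 'a', ('*', 'b', '2'))
print(parse("(a + b) * 2"))  # ('*', ('+', 'a', 'b'), '2')
```

An input such as "a + * b" is rejected with a SyntaxError, because no rule of the grammar allows an operator where a factor is expected.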
Semantic Analysis:
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that values are assigned between compatible data types and reports errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types, and expressions, and checks whether identifiers are declared before use. The semantic analyzer produces an annotated syntax tree as output.
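Two of the checks mentioned above, declaration before use and type compatibility in assignments, can be sketched as follows. The statement format (tuples of kind, name, payload) and the type names "int" and "string" are hypothetical; a real semantic analyzer would walk the syntax tree instead of a flat list.

```python
def check(statements):
    """Return a list of semantic errors for a toy statement list.

    Each statement is a tuple (hypothetical format for this sketch):
      ("declare", name, type_name)   e.g. ("declare", "count", "int")
      ("assign", name, value)        value is a Python int or str literal
    """
    declared = {}                              # identifier -> declared type
    literal_type = {int: "int", str: "string"}
    errors = []
    for index, (kind, name, payload) in enumerate(statements, start=1):
        if kind == "declare":
            if name in declared:
                errors.append(f"statement {index}: '{name}' already declared")
            else:
                declared[name] = payload
        elif kind == "assign":
            if name not in declared:
                errors.append(f"statement {index}: '{name}' used before declaration")
            else:
                value_type = literal_type[type(payload)]
                if declared[name] != value_type:
                    errors.append(
                        f"statement {index}: cannot assign {value_type} "
                        f"to {declared[name]} '{name}'"
                    )
    return errors


errors = check([
    ("declare", "count", "int"),
    ("assign", "count", 10),       # fine: int assigned to int
    ("assign", "count", "ten"),    # type mismatch
    ("assign", "total", 5),        # never declared
])
for message in errors:
    print(message)
```

Both faulty statements are syntactically valid, which is why these errors can only be caught after parsing.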
Intermediate Code Generation:
After semantic analysis, the compiler generates an intermediate code of the source code for the target machine. It represents a program for some abstract machine and lies between the high-level language and the machine language. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
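Three-address code is one widely used intermediate form: every instruction has at most one operator and stores its result in a named temporary. The sketch below flattens an expression tree into such instructions; the nested-tuple tree format and the temporary names t1, t2, ... are assumptions for this example.

```python
def to_three_address(tree):
    """Flatten a nested-tuple expression tree into three-address instructions."""
    code = []
    counter = 0

    def visit(node):
        nonlocal counter
        if not isinstance(node, tuple):
            return str(node)                 # leaf: identifier or constant
        op, left, right = node
        left_name = visit(left)              # operands are computed first
        right_name = visit(right)
        counter += 1
        temp = f"t{counter}"                 # fresh temporary for this result
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    result = visit(tree)
    return code, result


# Tree for: initial + rate * 60
code, result = to_three_address(("+", "initial", ("*", "rate", 60)))
for line in code:
    print(line)
# t1 = rate * 60
# t2 = initial + t1
```

Because each instruction is so simple, later phases can analyze and rearrange the code without having to understand nested expressions.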
Code Optimization:
The next phase optimizes the intermediate code. Optimization can be thought of as removing unnecessary lines of code and arranging the sequence of statements so as to speed up program execution without wasting resources (CPU, memory).
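Constant folding is one simple optimization of this kind: an operation whose operands are both known at compile time is computed by the compiler and its instruction is removed. The sketch below assumes intermediate code given as (dest, op, left, right) tuples in which every name is assigned only once; both the tuple format and that restriction are simplifications for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(code):
    """Constant folding and propagation over (dest, op, left, right) tuples."""
    known = {}        # names whose value is a compile-time constant
    optimized = []
    for dest, op, left, right in code:
        left = known.get(left, left)          # substitute known constants
        right = known.get(right, right)
        if isinstance(left, int) and isinstance(right, int):
            known[dest] = OPS[op](left, right)    # computed now, nothing emitted
        else:
            optimized.append((dest, op, left, right))
    return optimized, known


code = [
    ("t1", "*", 60, 60),            # both operands constant
    ("t2", "*", "rate", "t1"),
    ("t3", "+", "initial", "t2"),
]
optimized, constants = fold_constants(code)
print(optimized)   # [('t2', '*', 'rate', 3600), ('t3', '+', 'initial', 't2')]
print(constants)   # {'t1': 3600}
```

The optimized code has one instruction fewer and computes the same value in t3, so the meaning of the program is unchanged.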
Code Generation:
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code.
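A very naive code generator can be sketched as a direct mapping from each intermediate instruction to a load, an arithmetic instruction, and a store. The target instruction names (LOAD, ADD, SUB, MUL, DIV, STORE), the single register R0, and the (dest, op, left, right) tuple format are hypothetical; real code generators also perform register allocation and instruction selection.

```python
# Hypothetical target mnemonics for an imaginary one-register machine.
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def operand(value):
    """Constants become immediates (#n); names stay symbolic addresses."""
    return f"#{value}" if isinstance(value, int) else value


def generate(code):
    """Translate (dest, op, left, right) tuples into assembly-like text."""
    lines = []
    for dest, op, left, right in code:
        lines.append(f"LOAD R0, {operand(left)}")
        lines.append(f"{MNEMONICS[op]} R0, {operand(right)}")
        lines.append(f"STORE {dest}, R0")
    return lines


assembly = generate([
    ("t2", "*", "rate", 3600),
    ("t3", "+", "initial", "t2"),
])
for line in assembly:
    print(line)
```

Keeping names such as rate and t2 symbolic, rather than fixing their addresses, is what makes the output relocatable in the sense described above.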
Symbol Table:
It is a data structure maintained throughout all the phases of a compiler. All identifier names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for scope management.
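One simple way to combine fast lookup with scope management is a stack of dictionaries, one per open scope. The class and method names below are invented for this sketch; production compilers usually store more per identifier than just a type name.

```python
class SymbolTable:
    """A stack of dictionaries: one dictionary per open scope."""

    def __init__(self):
        self.scopes = [{}]                  # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_name):
        if name in self.scopes[-1]:
            raise NameError(f"'{name}' already declared in this scope")
        self.scopes[-1][name] = type_name

    def lookup(self, name):
        """Search from the innermost scope outwards."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"'{name}' is not declared")


table = SymbolTable()
table.declare("x", "int")
table.enter_scope()
table.declare("x", "float")         # shadows the outer x
inner = table.lookup("x")
table.exit_scope()
outer = table.lookup("x")
print(inner, outer)  # float int
```

Leaving a scope simply discards its dictionary, so names declared inside a block stop being visible as soon as the block ends.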
Structure of Optimizing Compiler
Optimization is a program transformation technique that tries to improve the code by making it consume fewer resources (i.e. CPU, memory) and run faster.
In optimization, high-level general programming constructs are replaced by very efficient low-level programming code. A code optimizing process must follow the three rules given below:
The output code must not, in any way, change the meaning of the program.
Optimization should increase the speed of the program and, if possible, the program should demand fewer resources.
Optimization should itself be fast and should not delay the overall compiling process.
Efforts to optimize the code can be made at various levels of the compilation process.
At the beginning, users can change or rearrange the code or use better algorithms to write the code.
After generating intermediate code, the compiler can modify the intermediate code by improving address calculations and loops.
While producing the target machine code, the compiler can make use of the memory hierarchy and CPU registers.
Optimization can be categorized broadly into two types: machine independent and machine
dependent.
Machine-independent Optimization:
In this optimization, the compiler takes in the intermediate code and transforms a part of the
code that does not involve any CPU registers and/or absolute memory locations.
For example:
do
{
    item = 10;
    value = value + item;
} while (value < 100);
This code involves repeated assignment of the identifier item. If we move the assignment out of the loop, like this:
item = 10;
do
{
    value = value + item;
} while (value < 100);
it should not only save CPU cycles but can also be used on any processor.
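The effect of moving the assignment out of the loop can be checked with a small Python experiment. Python has no do-while loop, so it is emulated here with while True and break; the assignment counter and the function names are additions made only to show the saving.

```python
def original(value):
    """Loop with the invariant assignment inside the body."""
    assignments = 0
    while True:                       # emulates do { ... } while (value < 100)
        item = 10
        assignments += 1              # counts how often item is assigned
        value = value + item
        if not value < 100:
            break
    return value, assignments


def hoisted(value):
    """Same loop after moving the invariant assignment in front of it."""
    item = 10
    assignments = 1
    while True:
        value = value + item
        if not value < 100:
            break
    return value, assignments


print(original(0))  # (100, 10)
print(hoisted(0))   # (100, 1)
```

Both versions return the same final value, so the meaning of the program is preserved, while the hoisted version assigns item once instead of once per iteration.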
Machine-dependent Optimization :
Machine-dependent optimization is done after the target code has been generated, when the code is transformed according to the target machine architecture. It involves CPU registers and may use absolute memory references rather than relative references. Machine-dependent optimizers try to take maximum advantage of the memory hierarchy.
Compiler construction tools
The compiler writer can use some specialized tools that help in implementing various phases of
a compiler. These tools assist in the creation of an entire compiler or its parts. Some commonly
used compiler construction tools include:
1. Parser Generator – It produces syntax analyzers (parsers) from input that is based on a grammatical description of a programming language or on a context-free grammar. It is useful because the syntax analysis phase is highly complex and takes more manual effort and compilation time. Example: PIC, EQM
2. Scanner Generator – It generates lexical analyzers from input that consists of regular expressions describing the tokens of a language. It generates a finite automaton to recognize the regular expressions. Example: Lex
3. Syntax-directed translation engines – They generate intermediate code in three-address format from input that consists of a parse tree. These engines have routines that traverse the parse tree and then produce the intermediate code. Each node of the parse tree is associated with one or more translations.
4. Automatic code generators – These generators take input in the form of intermediate code and convert it into machine language.
5. Data-flow analysis engines – They gather information about how values flow through the program, which is used to generate optimized code. These tools are used in code optimization.
6. Compiler construction toolkits – They provide an integrated set of routines that aid in building compiler components or in the construction of various phases of a compiler.
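The idea behind a scanner generator (item 2 above) can be sketched as a function that takes a token specification and returns a ready-made scanner. This sketch relies on Python's re module to do the matching, so unlike Lex it does not build its own finite automaton; the function name make_scanner, the token names, and the convention that a token called SKIP is discarded are assumptions for this example.

```python
import re


def make_scanner(token_spec):
    """Build a scanner function from (token-name, regular-expression) pairs."""
    # Combine all patterns into one alternation of named groups.
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in token_spec))

    def scan(text):
        tokens = []
        pos = 0
        while pos < len(text):
            match = pattern.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
            if match.lastgroup != "SKIP":          # drop whitespace tokens
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens

    return scan


scan = make_scanner([
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=]"),
    ("SKIP", r"\s+"),
])
print(scan("x = y + 42"))
# [('ID', 'x'), ('OP', '='), ('ID', 'y'), ('OP', '+'), ('NUMBER', '42')]
```

Supporting a new token only requires adding a line to the specification, which is the main attraction of generating scanners instead of writing them by hand.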
Features of compiler construction tools :
Lexical Analyzer Generator: This tool helps in generating the lexical analyzer or scanner of the compiler. It takes as input a set of regular expressions that define the tokens of the language being compiled and produces a program that reads the input source code and tokenizes it based on these regular expressions.
Parser Generator: This tool helps in generating the parser of the compiler. It takes as input a
context-free grammar that defines the syntax of the language being compiled and produces a
program that parses the input tokens and builds an abstract syntax tree.
Code Generation Tools: These tools help in generating the target code for the compiler. They
take as input the abstract syntax tree produced by the parser and produce code that can be
executed on the target machine.
Optimization Tools: These tools help in optimizing the generated code for efficiency and
performance. They can perform various optimizations such as dead code elimination, loop
optimization, and register allocation.
Debugging Tools: These tools help in debugging the compiler itself or the programs that are
being compiled. They can provide debugging information such as symbol tables, call stacks, and
runtime errors.
Profiling Tools: These tools help in profiling the compiler or the compiled code to identify
performance bottlenecks and optimize the code accordingly.
Documentation Tools: These tools help in generating documentation for the compiler and the
programming language being compiled. They can generate documentation for the syntax,
semantics, and usage of the language.
Language Support: Compiler construction tools are designed to support a wide range of
programming languages, including high-level languages such as C++, Java, and Python, as well as
low-level languages such as assembly language.
Cross-Platform Support: Compiler construction tools may be designed to work on multiple
platforms, such as Windows, Mac, and Linux.
User Interface: Some compiler construction tools come with a user interface that makes it
easier for developers to work with the compiler and its associated tools.
Compiler Bootstrapping
Bootstrapping is an approach for making a self-compiling compiler, that is, a compiler written in the source programming language that it is intended to compile. A bootstrap compiler can compile the compiler, and the compiled compiler can then be used to compile everything else, including future versions of itself.
Uses of Bootstrapping
There are various uses of bootstrapping which are as follows −
It allows new programming languages and compilers to be developed starting from existing ones.
It allows new features to be added to a programming language and its compiler.
It also allows new optimizations to be added to compilers.
It allows languages and compilers to be transferred between processors with different instruction sets.
Advantages of Bootstrapping
There are various advantages of bootstrapping which are as follows −
Compiler development can be performed in the higher-level language being compiled.
It is a non-trivial test of the language being compiled.
It is a comprehensive consistency check, as the compiler must be capable of reproducing its own object code.
For bootstrapping, a compiler is defined by three languages −
S → Source language it compiles.
T → Target language it generates.
I → Implementation language that it is written in.
These languages can be represented using a T-diagram.
Cross Compiler
A compiler is characterized by three languages: its source language, its object language, and the language in which it is written. These languages may be quite different. A compiler can run on one machine and produce target code for another machine. Such a compiler is known as a cross-compiler.
In the notation below, X_Y Z denotes a compiler for source language X, written in language Y, that generates code for Z.
Suppose we write a cross-compiler for a new language L in implementation language S to generate code for machine N, i.e. L_S N.
If an existing compiler for S runs on machine M and generates code for M, it is denoted S_M M.
If L_S N is compiled by S_M M, we get a compiler L_M N, i.e., a compiler from L to N that runs on M.
Example − Create a cross compiler using bootstrapping when S_S M runs on S_A A.
Solution − First, represent the two compilers with T-diagrams.
When S_S M is compiled by S_A A, S_A M is generated.
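The composition rule used in this section can be sketched in Python by describing each compiler with its three languages (source, implementation, target) and modelling what happens when one compiler is compiled by another. The Compiler tuple and the function compile_compiler are hypothetical names introduced for this illustration.

```python
from collections import namedtuple

# source: language compiled; implementation: language the compiler is written in
# (or machine it runs on); target: language or machine it generates code for.
Compiler = namedtuple("Compiler", ["source", "implementation", "target"])


def compile_compiler(program, host):
    """Run compiler `program` (itself a program) through compiler `host`."""
    if program.implementation != host.source:
        raise ValueError("host compiler cannot read the implementation language")
    # Same source and target as before, but now expressed in host.target form.
    return Compiler(program.source, host.target, program.target)


# A compiler for L written in S targeting N, compiled by an S compiler on M.
cross = compile_compiler(Compiler("L", "S", "N"), Compiler("S", "M", "M"))
print(cross)     # Compiler(source='L', implementation='M', target='N')

# The worked example: a compiler for S written in S targeting M, compiled by
# a compiler for S that runs on A and generates code for A.
example = compile_compiler(Compiler("S", "S", "M"), Compiler("S", "A", "A"))
print(example)   # Compiler(source='S', implementation='A', target='M')
```

The result of the worked example is a compiler for S that runs on machine A but produces code for machine M, which is exactly a cross-compiler.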