Compiler_unit1
Compiler_unit1
Compiler_unit1
Preprocessor
Interpreter
Assembler
Linker
Linker is a computer program that links and merges various object files together in
order to make an executable file. All these files might have been compiled by
separate assemblers. The major task of a linker is to search and locate referenced
module/routines in a program and to determine the memory location where these
codes will be loaded, making the program instruction to have absolute references.
Loader
Loader is a part of operating system and is responsible for loading executable files
into memory and execute them. It calculates the size of a program (instructions
and data) and creates memory space for it. It initializes various registers to initiate
execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for
platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and
translates it into the source code of another programming language is called a
source-to-source compiler.
Phases of Compiler
The compilation process is a sequence of various phases. Each phase takes input
from its previous stage, has its own representation of source program, and feeds its
output to the next phase of the compiler. Let us understand the phases of a
compiler.
Lexical Analysis: The first phase of scanner works as a text scanner. This phase
scans the source code as a stream of characters and converts it into meaningful
lexemes. Lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
Syntax Analysis: The next phase is called the syntax analysis or parsing. It takes
the token produced by lexical analysis as input and generates a parse tree (or syntax
tree). In this phase, token arrangements are checked against the source code
grammar, i.e. the parser checks if the expression made by the tokens is
syntactically correct.
Semantic Analysis: Semantic analysis checks whether the parse tree constructed
follows the rules of language. eg, assignment of values is between compatible data
types, and adding string to an integer. Also, the semantic analyzer keeps track of
identifiers, their types and expressions; whether identifiers are declared before use
or not etc. The semantic analyzer produces an annotated syntax tree as an output.
Code Optimization: The next phase does code optimization of the intermediate
code. Optimization can be assumed as something that removes unnecessary code
lines, and arranges the sequence of statements in order to speed up the program
execution without wasting resources (CPU, memory).
Code Generation: In this phase, the code generator takes the optimized
representation of the intermediate code and maps it to the target machine language.
The code generator translates the intermediate code into a sequence of (generally)
re-locatable machine code. Sequence of instructions of machine code performs the
task as the intermediate code would do.
Lexical analysis is the first phase of a compiler. It takes the modified source code
from language preprocessors that are written in the form of sentences. The lexical
analyzer breaks these syntaxes into a series of tokens, by removing any whitespace
or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical
analyzer works closely with the syntax analyzer. It reads character streams from the
source code, checks for legal tokens, and passes the data to the syntax analyzer
when it demands.
Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There
are some predefined rules for every lexeme to be identified as a valid token. These
rules are defined by grammar rules, by means of a pattern. A pattern explains what
can be a token, and these patterns are defined by means of regular expressions.
In programming language, keywords, constants, identifiers, strings, numbers,
operators and punctuations symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Strings
Any finite sequence of alphabets is called a string. Length of the string is the total
number of occurrence of alphabets, e.g., the length of the string tutorialspoint is 14
and is denoted by |tutorialspoint| = 14. A string having no alphabets, i.e. a string of
zero length is known as an empty string and is denoted by ε (epsilon).
Special Symbols
Assignment =
Preprocessor #
Examples:
1. Write the regular expression for the language accepting all the string which are starting
with 1 and ending with 0, over ∑ = {0, 1}.
R.E=1(1+0)* 0
2. Write the regular expression for the language starting and ending with a and
having any having any combination of b's in between.
RE.= a b*a
3. Write the regular expression for the language starting with a but not
having consecutive b's.
String= {aa,ababa,aba,aab,aaab,aababab,aaaaab}
R.E= {a+ab}* or a*+(ab)*
4. Write the regular expression for the language accepting all the string in which
any number of a's is followed by any number of b's is followed by any number of
c's.
R.E= a*b*c*
String= {{},aaaaa,bbb.cccc,abc,aaabbcc…..}
5. Write the regular expression for the language over ∑ = {0} having even length of the
string.
String={{},00,0000,000000,00000000} R.E=(00)*
0 1
q0 q0 qf
qf q0 qf
DFA cannot use Empty String NFA can use Empty String
3 (lambda) transition. transition.
DFA rejects the string in case it NFA rejects the string in the
terminates in a state that is event of all branches dying or
7 different from the accepting state. refusing the string.
The above Transition System is NDFA as state b is moving to two different State
i.e. (b,c) by using transition value 0.
0 1
a {} b
b {b,c} b
c {} {}
0 1
a {} b
b {b,c} b
{b,c} {b,c} b
1
b
a
1
1 0 0
{b,c}
Q) Compute the first and follow of the following
1) E->E+B 2) E->E+T
E->T
E->E*B T->T*F
E->B T->F
B->0 F->(E)
B->1 F->id
FIRST and FOLLOW are two functions associated with grammar that help us fill in the entries of
an M-table.
FIRST ()− It is a function that gives the set of terminals that begin the strings derived from the
production rule.
A symbol c is in FIRST (α) if and only if α ⇒ cβ for some sequence β of grammar symbols.
A terminal symbol a is in FOLLOW (N) if and only if there is a derivation from the start symbol S
of the grammar such that S ⇒ αNαβ, where α and β are a (possible empty) sequence of grammar
symbols. In other words, a terminal c is in FOLLOW (N) if c can follow N at some point in a
derivation.
Computation of FOLLOW
Follow (A) is defined as the collection of terminal symbols that occur directly to the right of A.
FOLLOW(A) = {a|S ⇒* αAaβ where α, β can be any strings}
If S is the start symbol, FOLLOW (S) ={$}
If production is of form A → α B β, β ≠ ε.
(a) If FIRST (β) does not contain ε then, FOLLOW (B) = {FIRST (β)}
Or
E->E+B
E->E*B
E->B
B->0
B->1
First(E)={0,1}
First(B)={0,1}
Follow(E)={$,+,0,1,*}
For finding follow of E we will have productions which are having E at the right side
So here two productions are there:
1) E->E+B compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= ε, β=+B, as β value have + and B and + is terminal so it will be added and fist
of B does not consist of ε so first(B) will be added.
2) E->E*B compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= ε, β=*B, as β value have * and B and * is terminal so it will be added and fist
of B does not consist of ε so first(B) will be added.
Follow(B)= )={$,+,0,1,*}
Consider
1) E->E+B compare this production with A → α B β, β ≠ ε.
A=E, B=B, α= E+, β= ε, as β value is ε the production is of form A → αB, then according
to 3rd rule follow(B)=follow(E)
2) E->E*B – same as above
3) E->B -same as above
First(E)={(,id}
First(T)={(,id}
First(F)= {(,id}
Follow(E)={$,(,id ,) }
Consider:
i)E->E+T compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= €, β= +T, as β value have + and T and + is terminal so it will be added and fist of T
does not consist of ε so first(T) will be added.
What is a Parser?
The parser is one of the phases of the compiler which takes a token of string as input and converts
it into the corresponding Intermediate Representation (IR) with the help of an existing grammar.
The parser is also known as Syntax Analyzer.
Classification of Parser
Types of Parsers
The parser is mainly classified into two categories.
1. Top-down Parser
2. Bottom-up Parser
Top-Down Parser
Top-down parser is the parser that generates parse tree for the given input string with the help of
grammar productions by expanding the non-terminals. It starts from the start symbol and ends
down on the terminals. It uses left most derivation.
Further Top-down parser is classified into 2 types: 1) Recursive descent parser and 2) non-
recursive descent parser.
1. Recursive descent parser is also known as the Brute force parser or the backtracking parser.
It basically generates the parse tree by using brute force and backtracking techniques.
2. Non-recursive descent parser is also known as LL(1) parser or predictive parser or without
backtracking parser or dynamic parser. It uses a parsing table to generate the parse tree
instead of backtracking.
Bottom-Up Parser
Bottom-up Parser is the parser that generates the parse tree for the given input string with the help
of grammar productions by compressing the terminals. It starts from terminals and ends upon the
start symbol. It uses the rightmost derivation in reverse order.
Bottom-up parser is classified into two types:
LR parser: This is a bottom-up parser that generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation. LR parser is classified
into four types:
LR(0)
SLR(1)
LALR(1)
Production:
E->E+T (1)
E->T (2)
T->T*F (3)
T->F (4)
F->(E) (5)
F->id (6)
0 id*id+id$
0id5 *id+id$
0F3 *id+id$
0T2 *id+id$
0T2*7 id+id$
0T2*7id5 +id$
0T2*7F10 +id$
0T2 +id$
0E1 +id$
0E1+6 id$
0E1+6id5 $
0E1+6F3 $
0E1+6T9 $
0E1 $
String: id*id+id$
Production:
1) F->id
2) T->F
3) T->T*F
4) E->T
5) E->E+T
State Action Goto
id * + $ E F T
0 S5 1 3 2
1 S6 accept
2 S7 R4
3 R2 R2
4
5 R1 R1 R1
6 S5 3 9
7 S5 10
8
9 R5
10 R3