
UNIT- 01 Introduction to Compiler Design

Language Processing System


We have learnt that any computer system is made of hardware and software. The
hardware understands a language, which humans cannot understand. So we write
programs in high-level language, which is easier for us to understand and
remember. These programs are then fed into a series of tools and OS components
to get the desired code that can be used by the machine. This is known as
Language Processing System.
The high-level language is converted into binary language in various phases. A
compiler is a program that converts high-level language to assembly language.
Similarly, an assembler is a program that converts the assembly language to
machine-level language.
Let us first understand how a program, using the C compiler, is executed on a host machine.
 The user writes a program in C language (high-level language).
 The C compiler compiles the program and translates it to an assembly program (low-level language).
 An assembler then translates the assembly program into machine code (object code).
 A linker tool is used to link all the parts of the program together for execution (executable machine code).
 A loader loads all of them into memory and then the program is executed.
Before diving straight into the concepts of compilers, we should understand a few
other tools that work closely with compilers.

Preprocessor

A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.

Interpreter

An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, and translates the whole program, possibly in many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.

Assembler

An assembler translates assembly language programs into machine code. The output of an assembler is called an object file, which contains a combination of machine instructions as well as the data required to place these instructions in memory.

Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, making the program instructions have absolute references.
Loader

The loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.

Cross-compiler

A compiler that runs on platform (A) and is capable of generating executable code for platform (B) is called a cross-compiler.

Source-to-source Compiler

A compiler that takes the source code of one programming language and
translates it into the source code of another programming language is called a
source-to-source compiler.

Phases of Compiler
The compilation process is a sequence of various phases. Each phase takes input
from its previous stage, has its own representation of source program, and feeds its
output to the next phase of the compiler. Let us understand the phases of a
compiler.
Lexical Analysis: The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:

<token-name, attribute-value>

Syntax Analysis: The next phase is called the syntax analysis or parsing. It takes
the token produced by lexical analysis as input and generates a parse tree (or syntax
tree). In this phase, token arrangements are checked against the source code
grammar, i.e. the parser checks if the expression made by the tokens is
syntactically correct.

Semantic Analysis: Semantic analysis checks whether the parse tree constructed follows the rules of the language, e.g., that assignment of values is between compatible data types and that a string is not added to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and whether identifiers are declared before use. The semantic analyzer produces an annotated syntax tree as an output.

Intermediate Code Generation: After semantic analysis, the compiler generates an intermediate code of the source code for the target machine. It represents a program for some abstract machine and lies between the high-level language and the machine language. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
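As an illustration, one common intermediate representation is three-address code; a Python sketch that emits it from an expression tree using the standard ast module (the t0, t1, … temporary names are a textbook convention, not the output of any particular compiler):

```python
import ast

def emit_tac(node, code, temps):
    """Walk the expression AST bottom-up, emitting one three-address instruction per operator."""
    if isinstance(node, ast.BinOp):
        left = emit_tac(node.left, code, temps)
        right = emit_tac(node.right, code, temps)
        op = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}[type(node.op)]
        temp = f"t{len(temps)}"          # fresh temporary name
        temps.append(temp)
        code.append(f"{temp} = {left} {op} {right}")
        return temp
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise NotImplementedError(ast.dump(node))

tree = ast.parse("b + c * d", mode="eval").body
code, temps = [], []
emit_tac(tree, code, temps)
print("\n".join(code))   # t0 = c * d, then t1 = b + t0
```

Note how the multiplication is emitted first: the instruction order reflects the evaluation order the parse tree dictates.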

Code Optimization: The next phase does code optimization of the intermediate
code. Optimization can be assumed as something that removes unnecessary code
lines, and arranges the sequence of statements in order to speed up the program
execution without wasting resources (CPU, memory).

Code Generation: In this phase, the code generator takes the optimized
representation of the intermediate code and maps it to the target machine language.
The code generator translates the intermediate code into a sequence of (generally)
re-locatable machine code. Sequence of instructions of machine code performs the
task as the intermediate code would do.

Symbol Table: It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for scope management.
Chap-02 Lexical analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical
analyzer works closely with the syntax analyzer. It reads character streams from the
source code, checks for legal tokens, and passes the data to the syntax analyzer
when it demands.

Tokens
A lexeme is a sequence of characters (alphanumeric) that forms a token. There are some predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
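The way the lexical analyzer emits such <token-name, attribute-value> pairs can be sketched with regular expressions in Python (the token names and patterns here are illustrative, not those of a real C compiler):

```python
import re

# Ordered token specification: the first alternative that matches wins (illustrative names).
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-name, attribute-value> pairs, skipping whitespace."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int value = 100;")))
```

Running it on the declaration above yields exactly the five tokens listed: KEYWORD int, IDENTIFIER value, OPERATOR =, CONSTANT 100, SYMBOL ;.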

Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets

Any finite set of symbols is called an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z} is the set of English language alphabets.

Strings

Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no symbols, i.e., a string of zero length, is known as an empty string and is denoted by ε (epsilon).

Special Symbols

A typical high-level language contains the following symbols:

Arithmetic symbols    Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Location specifier    &
Logical               &, &&, |, ||, !
Shift operators       >>, >>>, <<, <<<


Regular Expression: The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a
pattern for finite strings of symbols. The grammar defined by regular
expressions is known as regular grammar. The language defined by regular
grammar is known as regular language.
Operations
The various operations on languages are:
 Union of two languages L and M is written as L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M}
 The Kleene closure of a language L is written as L* = zero or more occurrences of the language L.
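For small finite languages these operations can be demonstrated directly with Python sets (an illustrative sketch; the two-string languages L and M are made up, and L* is approximated up to a length bound since the full closure is infinite):

```python
L = {"a", "ab"}
M = {"b", ""}

union = L | M                              # {s | s is in L or s is in M}
concat = {s + t for s in L for t in M}     # {st | s is in L and t is in M}

def kleene(lang, max_len):
    """Approximate L* by all concatenations of strings from lang up to max_len."""
    result = {""}                          # zero occurrences: the empty string
    frontier = {""}
    while frontier:
        frontier = {s + t for s in frontier for t in lang
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result

print(sorted(union), sorted(concat), sorted(kleene({"a"}, 3)))
```

For example, kleene({"a"}, 3) gives {ε, a, aa, aaa}, the strings of a* up to length 3.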
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
 Union : (r)|(s) is a regular expression denoting L(r) U L(s)
 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
 Kleene closure : (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)

Examples:
1. Write the regular expression for the language accepting all strings which start with 1 and end with 0, over ∑ = {0, 1}.
R.E. = 1(0 + 1)*0
2. Write the regular expression for the language starting and ending with a and having any combination of b's in between.
R.E. = ab*a

3. Write the regular expression for the language starting with a but not having consecutive b's.
Strings = {aa, ababa, aba, aab, aaab, aababab, aaaaab}
R.E. = (a + ab)*  (note that this also generates the empty string; to exclude it, write (a + ab)(a + ab)*)

4. Write the regular expression for the language accepting all strings in which any number of a's is followed by any number of b's, followed by any number of c's.
R.E. = a*b*c*
Strings = {ε, aaaaa, bbb, cccc, abc, aaabbcc, …}
5. Write the regular expression for the language over ∑ = {0} having even length of string.
Strings = {ε, 00, 0000, 000000, 00000000}
R.E. = (00)*
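These answers can be sanity-checked with Python's re module, writing the textbook union operator + as | (a quick check on sample strings, not a proof of equivalence):

```python
import re

# Each textbook answer, with sample strings that should be accepted and rejected.
cases = {
    r"1(0|1)*0": (["10", "110", "1010"], ["01", "1", "11"]),       # starts with 1, ends with 0
    r"ab*a":     (["aa", "aba", "abba"], ["a", "ab", "baa"]),      # a, then b's, then a
    r"a*b*c*":   (["", "abc", "aaabbcc"], ["cba", "ba"]),          # a's, then b's, then c's
    r"(00)*":    (["", "00", "0000"], ["0", "000"]),               # even length over {0}
}
for pattern, (accept, reject) in cases.items():
    for s in accept:
        assert re.fullmatch(pattern, s), (pattern, s)
    for s in reject:
        assert not re.fullmatch(pattern, s), (pattern, s)
print("all regular-expression answers check out")
```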

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).
 States : States of FA are represented by circles. State names are written
inside circles.
 Start state : The state from where the automata starts, is known as the start
state. Start state has an arrow pointed towards it.
 Intermediate states : All intermediate states have at least two arrows; one
pointing to and another pointing out from them.
 Final state : If the input string is successfully parsed, the automata is
expected to be in this state. A final state is represented by double circles.
 Transition : The transition from one state to another state happens when a
desired symbol in the input is found. Upon transition, automata can either
move to the next state or stay in the same state. Movement from one state to
another is shown as a directed arrow, where the arrows points to the
destination state. If automata stays on the same state, an arrow pointing from
a state to itself is drawn.
Example : We assume the FA accepts any binary value ending in the digit 1. FA = ({q0, qf}, {0, 1}, δ, q0, {qf})

Transition Table (Mapping Function)

0 1
q0 q0 qf
qf q0 qf
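This mapping function can be executed directly; a minimal Python sketch that simulates the FA from its transition table:

```python
# Transition table of the FA: DELTA[state][symbol] -> next state.
DELTA = {
    "q0": {"0": "q0", "1": "qf"},
    "qf": {"0": "q0", "1": "qf"},
}

def accepts(string, start="q0", finals=frozenset({"qf"})):
    """Simulate the DFA: follow exactly one transition per input symbol."""
    state = start
    for symbol in string:
        state = DELTA[state][symbol]
    return state in finals

# Binary strings ending in 1 are accepted; all others are rejected.
print(accepts("101"), accepts("110"))
```

Tracing "101" by hand through the table gives q0 → qf → q0 → qf, a final state, matching the simulation.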

Types of Finite Automata

1) Deterministic Finite Automata

2) Non- Deterministic Finite Automata


Difference between DFA and NDFA :

1. DFA stands for Deterministic Finite Automata; NFA stands for Nondeterministic Finite Automata.
2. For each symbol of the alphabet there is exactly one state transition in a DFA; for an NFA there is no need to specify how it reacts to each symbol.
3. A DFA cannot use empty string (lambda) transitions; an NFA can use empty string transitions.
4. A DFA can be understood as one machine; an NFA can be understood as multiple little machines computing at the same time.
5. In a DFA the next possible state is distinctly set; in an NFA each pair of state and input symbol can have many possible next states.
6. A DFA is more difficult to construct; an NFA is easier to construct.
7. A DFA rejects a string if it terminates in a state different from an accepting state; an NFA rejects a string only if all branches die or refuse the string.
8. The time needed for executing an input string is less for a DFA; it is more for an NFA.
9. All DFA are NFA; not all NFA are DFA.
10. A DFA requires more space; an NFA requires less space than a DFA.
11. A dead state may be required in a DFA; a dead state is not required in an NFA.
Conversion from NDFA to DFA

This transition system is an NDFA because state b moves to two different states, i.e. {b, c}, on input 0 (see the transition table below).

Steps to convert NDFA to DFA

1) Draw Transition table

0 1
a {} b
b {b,c} b
c {} {}

2) Draw Sub-Transition table

0 1
a {} b
b {b,c} b
{b,c} {b,c} b

3) Draw the transition diagram by using Sub-transition table

[Transition diagram: states a, b and {b,c}; a goes to b on 1, b loops on 1, b goes to {b,c} on 0, and {b,c} loops on 0 and returns to b on 1.]
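The steps above can be automated; a Python sketch of the subset construction for this NDFA (transitions taken from the transition table in step 1):

```python
# NDFA transitions: NFA[state][symbol] -> set of next states.
NFA = {
    "a": {"0": set(), "1": {"b"}},
    "b": {"0": {"b", "c"}, "1": {"b"}},
    "c": {"0": set(), "1": set()},
}

def subset_construction(start):
    """Build DFA states as frozensets of NDFA states (the sub-transition table)."""
    start_set = frozenset([start])
    table, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for symbol in "01":
            nxt = frozenset(s for q in current for s in NFA[q][symbol])
            table[current][symbol] = nxt
            if nxt and nxt not in table:
                todo.append(nxt)    # new composite state discovered
    return table

dfa = subset_construction("a")
print(dfa)
```

The result contains exactly the rows of the sub-transition table: a, b, and the composite state {b,c}, which loops on 0 and returns to b on 1.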
Q) Compute the FIRST and FOLLOW of the following grammars:

1) E -> E+B        2) E -> E+T
   E -> E*B           E -> T
   E -> B             T -> T*F
   B -> 0             T -> F
   B -> 1             F -> (E)
                      F -> id
FIRST and FOLLOW are two functions associated with grammar that help us fill in the entries of
an M-table.
FIRST ()− It is a function that gives the set of terminals that begin the strings derived from the
production rule.
A symbol c is in FIRST (α) if and only if α ⇒* cβ for some sequence β of grammar symbols.
A terminal symbol a is in FOLLOW (N) if and only if there is a derivation from the start symbol S
of the grammar such that S ⇒* αNaβ, where α and β are (possibly empty) sequences of grammar
symbols. In other words, a terminal a is in FOLLOW (N) if a can follow N at some point in a
derivation.
Computation of FOLLOW
Follow (A) is defined as the collection of terminal symbols that occur directly to the right of A.
FOLLOW(A) = {a|S ⇒* αAaβ where α, β can be any strings}
 If S is the start symbol, FOLLOW (S) ={$}
 If production is of form A → α B β, β ≠ ε.

(a) If FIRST (β) does not contain ε then, FOLLOW (B) = {FIRST (β)}

Or

(b) If FIRST (β) contains ε (i. e. , β ⇒* ε), then

FOLLOW (B) = FIRST (β) − {ε} ∪ FOLLOW (A)

∵ when β derives ε, then terminal after A will follow B.

 If production is of form A → αB, then Follow (B) ={FOLLOW (A)}.

Ex 1) find first and follow

E->E+B
E->E*B
E->B
B->0
B->1

First(E)={0,1}

First(B)={0,1}

Follow(E) = {$, +, *}
To find FOLLOW of E we look at the productions that have E on the right-hand side.
Here there are two such productions:
1) E->E+B: compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= ε, β=+B. FIRST(β) = FIRST(+B) = {+}, which does not contain ε, so + is added to FOLLOW(E).
2) E->E*B: compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= ε, β=*B. FIRST(β) = FIRST(*B) = {*}, which does not contain ε, so * is added to FOLLOW(E).
Since E is the start symbol, $ is also in FOLLOW(E).

Follow(B) = {$, +, *}
Consider:
1) E->E+B: compare this production with A → α B β.
A=E, B=B, α=E+, β= ε. As β is ε, the production is of the form A → αB, so by the 3rd rule FOLLOW(B) = FOLLOW(E).
2) E->E*B: same as above.
3) E->B: same as above.

Ex: 2) find first and follow


E->E+T
E->T
T->T*F
T->F
F->(E)
F->id

First(E)={(,id}
First(T)={(,id}
First(F)= {(,id}

Follow(E) = {$, +, )}
Consider:
i) E->E+T: compare this production with A → α B β, β ≠ ε.
A=E, B=E, α= ε, β=+T. FIRST(β) = FIRST(+T) = {+}, which does not contain ε, so + is added to FOLLOW(E).
ii) F->(E): compare this production with A → α B β, β ≠ ε.
A=F, B=E, α=(, β=). As β is the terminal ), it is added to FOLLOW(E).
Since E is the start symbol, $ is also in FOLLOW(E).

Follow(T) = {$, +, ), *}
i) E->E+T: compare this production with A → α B β.
A=E, B=T, α=E+, β= ε. As β is ε, the production is of the form A → αB, so by the 3rd rule FOLLOW(E) is added to FOLLOW(T).
ii) E->T: the production is of the form A → αB, so by the 3rd rule FOLLOW(E) is added to FOLLOW(T).
iii) T->T*F: compare this production with A → α B β, β ≠ ε.
A=T, B=T, α= ε, β=*F. FIRST(β) = FIRST(*F) = {*}, which does not contain ε, so * is added to FOLLOW(T).

Follow(F) = {$, +, ), *}
i) T->T*F: compare this production with A → α B β.
A=T, B=F, α=T*, β= ε. As β is ε, the production is of the form A → αB, so by the 3rd rule FOLLOW(T) is added to FOLLOW(F).
ii) T->F: the production is of the form A → αB, so by the 3rd rule FOLLOW(T) is added to FOLLOW(F).
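The FIRST and FOLLOW sets of this grammar can be cross-checked with a small fixed-point computation in Python (a generic sketch; since no production here derives ε, FIRST of a string is simply the FIRST of its first symbol):

```python
# Grammar of Ex 2; capital letters are nonterminals, everything else is terminal.
PRODUCTIONS = [
    ("E", ["E", "+", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["F"]),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]
NONTERMINALS = {"E", "T", "F"}
first = {n: set() for n in NONTERMINALS}
follow = {n: set() for n in NONTERMINALS}
follow["E"].add("$")                       # rule: $ is in FOLLOW of the start symbol

changed = True
while changed:                             # iterate until no set grows
    changed = False
    for head, body in PRODUCTIONS:
        # FIRST: no ε-productions, so FIRST comes from the body's first symbol.
        x = body[0]
        new = first[x] if x in NONTERMINALS else {x}
        if not new <= first[head]:
            first[head] |= new; changed = True
        # FOLLOW: the symbol after a nonterminal contributes its FIRST;
        # a nonterminal at the end of the body inherits FOLLOW(head).
        for i, sym in enumerate(body):
            if sym not in NONTERMINALS:
                continue
            if i + 1 < len(body):
                nxt = body[i + 1]
                new = first[nxt] if nxt in NONTERMINALS else {nxt}
            else:
                new = follow[head]
            if not new <= follow[sym]:
                follow[sym] |= new; changed = True

print(first, follow)
```

Running it reproduces the sets above: FIRST of all three nonterminals is {(, id}, FOLLOW(E) = {+, ), $}, and FOLLOW(T) = FOLLOW(F) = {+, *, ), $}.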

What is a Parser?
The parser is one of the phases of the compiler. It takes the stream of tokens as input and converts
it into the corresponding Intermediate Representation (IR) with the help of the grammar.
The parser is also known as the Syntax Analyzer.
Classification of Parser
Types of Parsers
The parser is mainly classified into two categories.
1. Top-down Parser
2. Bottom-up Parser
Top-Down Parser
A top-down parser generates the parse tree for the given input string with the help of grammar
productions by expanding the non-terminals. It starts from the start symbol and works down to the
terminals. It uses leftmost derivation.
Top-down parsers are further classified into 2 types: 1) recursive descent parsers and 2) non-recursive
descent parsers.
1. Recursive descent parser is also known as the Brute force parser or the backtracking parser.
It basically generates the parse tree by using brute force and backtracking techniques.
2. A non-recursive descent parser is also known as an LL(1) parser or predictive parser. It works
   without backtracking, using a parsing table to generate the parse tree.
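A recursive descent parser for the expression grammar of Ex 2 (E → E+T | T, T → T*F | F, F → (E) | id) can be sketched in Python; backtracking is avoided here by rewriting the left-recursive rules as loops (an illustrative sketch, not a production parser):

```python
import re

def parse(source):
    """Recursive descent for E -> E+T | T, T -> T*F | F, F -> (E) | id,
    with left recursion turned into iteration."""
    tokens = re.findall(r"id|[+*()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r}")
        pos += 1

    def E():
        node = T()
        while peek() == "+":          # E -> T { + T }
            eat("+"); node = ("+", node, T())
        return node

    def T():
        node = F()
        while peek() == "*":          # T -> F { * F }
            eat("*"); node = ("*", node, F())
        return node

    def F():
        if peek() == "(":             # F -> ( E )
            eat("("); node = E(); eat(")")
            return node
        eat("id")                     # F -> id
        return "id"

    tree = E()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

print(parse("id*id+id"))
```

Because T is called before the + loop in E, the tree for id*id+id groups the multiplication first, reflecting the usual operator precedence encoded in this grammar.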
Bottom-Up Parser
A bottom-up parser generates the parse tree for the given input string with the help
of grammar productions by reducing (compressing) the terminals. It starts from the terminals and ends at the
start symbol. It uses the rightmost derivation in reverse order.
Bottom-up parsers are classified into two types: operator precedence parsers and LR parsers.
LR parser: This is a bottom-up parser that generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation. LR parsers are classified
into four types:
 LR(0)
 SLR(1)
 CLR(1)
 LALR(1)
Production:
E->E+T (1)
E->T (2)
T->T*F (3)
T->F (4)
F->(E) (5)
F->id (6)

Stack        Input String   Action
0            id*id+id$      -
0id5         *id+id$        Shift (shift the current symbol from input to stack)
0F3          *id+id$        Reduce (the stack value is replaced by the L.H.S. of the production; check the goto value)
0T2          *id+id$        Reduce
0T2*7        id+id$         Shift
0T2*7id5     +id$           Shift
0T2*7F10     +id$           Reduce
0T2          +id$           Reduce
0E1          +id$           Reduce
0E1+6        id$            Shift
0E1+6id5     $              Shift
0E1+6F3      $              Reduce
0E1+6T9      $              Reduce
0E1          $              Accept (string parsed successfully)

Q) By using given moves generate parsing table.

0 id*id+id$
0id5 *id+id$
0F3 *id+id$
0T2 *id+id$
0T2*7 id+id$
0T2*7id5 +id$
0T2*7F10 +id$
0T2 +id$
0E1 +id$
0E1+6 id$
0E1+6id5 $
0E1+6F3 $
0E1+6T9 $
0E1 $

String: id*id+id$
Production:
1) F->id
2) T->F
3) T->T*F
4) E->T
5) E->E+T
State |        Action         |   Goto
      |  id    *    +    $    |  E   F   T
  0   |  S5                   |  1   3   2
  1   |            S6   acc   |
  2   |       S7   R4         |
  3   |       R2        R2    |
  4   |                       |
  5   |       R1   R1   R1    |
  6   |  S5                   |      3   9
  7   |  S5                   |      10
  8   |                       |
  9   |                 R5    |
 10   |            R3         |

(acc = accept; only the entries exercised by the given moves are filled in.)
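The moves above can be reproduced by a table-driven LR driver; a minimal Python sketch, encoding only the ACTION/GOTO entries exercised by the trace for id*id+id (a teaching sketch, not a full SLR table generator):

```python
# Productions, numbered as above: production number -> (head, body length).
PRODS = {1: ("F", 1), 2: ("T", 1), 3: ("T", 3), 4: ("E", 1), 5: ("E", 3)}

# ("s", n) = shift to state n, ("r", k) = reduce by production k, ("acc", 0) = accept.
ACTION = {
    (0, "id"): ("s", 5), (1, "+"): ("s", 6), (1, "$"): ("acc", 0),
    (2, "*"): ("s", 7), (2, "+"): ("r", 4),
    (3, "*"): ("r", 2), (3, "$"): ("r", 2),
    (5, "*"): ("r", 1), (5, "+"): ("r", 1), (5, "$"): ("r", 1),
    (6, "id"): ("s", 5), (7, "id"): ("s", 5),
    (9, "$"): ("r", 5), (10, "+"): ("r", 3),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    """Run the shift-reduce loop over the state stack; returns True on accept."""
    stack = [0]                                # state stack, as in the trace above
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False                       # blank table entry: syntax error
        kind, arg = act
        if kind == "acc":
            return True
        if kind == "s":                        # shift: push state, advance input
            stack.append(arg); i += 1
        else:                                  # reduce: pop |body| states, push GOTO
            head, size = PRODS[arg]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], head)])

print(lr_parse(["id", "*", "id", "+", "id"]))
```

Stepping through it by hand reproduces the stack contents 0, 0id5, 0F3, 0T2, … from the move list, ending in accept at state 1 on $.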

Question paper pattern (30 M)

Q1) Attempt any 2 (10 m) 5 marks each


a) from unit 1
b) from unit 1
c) from unit 1
d) from unit 1

Q2) Attempt any 2 (10 m) 5 marks each


a) from unit 2
b) from unit 2
c) from unit 2
d) from unit 2

Q3) Attempt any 2 (10 m) 5 marks each


a) from unit 1
b) from unit 1
c) from unit 2
d) from unit 2
