Department of Computer Science & Engineering
Course File: Compiler Design
UNIT-I
INTRODUCTION TO LANGUAGE PROCESSING:
As computers became an inevitable and indigenous part of human life, and several languages with different and more advanced features evolved to make communication with the machine easier and more comfortable, the development of translator (mediator) software became essential to bridge the huge gap between human and machine understanding. This process is called Language Processing, a name that reflects the goal and intent of the process. To understand this process better, we have to be familiar with some key terms and concepts explained in the following lines.
LANGUAGE TRANSLATORS:
A language translator is a computer program which translates a program written in one (source) language into its equivalent program in another (target) language. The source program is a high-level language program, whereas the target language can be anything from the machine language of a target machine (from microprocessor to supercomputer) to another high-level language.
Two commonly used translators are the Compiler and the Interpreter.
1. Compiler: A compiler is a program that reads a program in one language, called the source language, and translates it into its equivalent program in another language, called the target language. In addition, it presents the error information to the user.
Source Program ──→ COMPILER ──→ Equivalent program in the target language
                      │          (relocatable object / target program)
                      ↓
               Error Information
If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.

Input ──→ Target Program ──→ Output

Figure 1.1: Running the target program.
2. Interpreter: An interpreter is another commonly used language processor. Instead of producing a target program as a single translation unit, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.

Source Program + Input ──→ Interpreter ──→ Output

Figure 1.2: Running the interpreter.
LANGUAGE PROCESSING SYSTEM:
Based on the input a translator takes and the output it produces, a language translator can be called any one of the following.
Preprocessor: A preprocessor takes a skeletal source program as input and produces an extended version of it, which is the result of expanding macros and manifest constants (if any) and including header files in the source file. For example, the C preprocessor is a macro processor that is used automatically by the C compiler to transform our source before actual compilation. In addition, a preprocessor performs the following activities:
1. Collects all the modules and files, in case the source program is divided into different modules stored in different files.
2. Expands shorthands / macros into source language statements.
Compiler: A compiler is a translator that takes as input a source program written in a high-level language and converts it into its equivalent target program in machine language. In addition to the above, the compiler also:
1. Reports to its user the presence of errors in the source program.
2. Facilitates the user in rectifying the errors and executing the code.
Assembler: An assembler is a program that takes as input an assembly language program and converts it into equivalent machine language code.
Loader / Linker: This is a program that takes as input relocatable code, collects the library functions and relocatable object files, and produces the equivalent absolute machine code. Specifically:
1. Loading consists of taking the relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations.
2. Linking allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library routines provided by the system, available to any program that needs them.
In addition to these translators, programs like interpreters, text formatters etc., may be used in a language processing system. To translate a high-level language program into an executable one, the compiler performs by default the compile and linking functions.
Normally, the steps in a language processing system include preprocessing the skeletal source program, which produces an extended or expanded source program (a ready-to-compile unit of the source program), followed by compiling the result, then linking / loading, until finally its equivalent executable code is produced. As said earlier, not all these steps are mandatory; in some cases the compiler performs the linking and loading functions implicitly.
The steps involved in a typical language processing system can be understood with the following diagram.
Source Program            [ Example: filename.c ]
        │
   Preprocessor
        │
Modified Source Program   [ Example: filename.c ]
        │
     Compiler
        │
Target Assembly Program
        │
    Assembler
        │
Relocatable Machine Code  [ Example: filename.obj ]
        │
  Loader / Linker  ←── Library files, Relocatable object files
        │
Target Machine Code       [ Example: filename.exe ]

Figure 1.3: Context of a Compiler in a Language Processing System
TYPES OF COMPILERS:
Based on the specific input a compiler takes and the output it produces, compilers can be classified into the following types:
Traditional Compilers (C, C++, Pascal): These compilers convert a source program in a HLL into its equivalent in native machine code or object code.
Interpreters (LISP, SNOBOL, Java 1.0): These first convert the source code into an intermediate code, and then interpret (emulate) it to its equivalent machine code.
Cross-Compilers: These are compilers that run on one machine and produce code for another machine.
Incremental Compilers: These compilers separate the source into user-defined steps, compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g. COBOL to C++): These programs compile from one high-level language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an intermediate language (byte code, MSIL) to executable code or native machine code. They perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native code for Java and .NET.
Binary Compilation: These compilers compile the object code of one platform into the object code of another platform.
PHASES OF A COMPILER:
Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of compilation phases. The phases communicate with each other via clearly defined interfaces. Generally an interface contains a data structure (e.g., a tree) and a set of exported functions. Each phase works on an abstract intermediate representation of the source program, not the source program text itself (except the first phase).
Compiler phases are the individual modules which are chronologically executed to perform their respective sub-activities, and finally integrate the solutions to give the target code.
It is desirable to have relatively few phases, since it takes time to read and write intermediate files. The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes during compilation. Therefore a typical compiler has the following phases:
1. Lexical Analyzer (Scanner), 2. Syntax Analyzer (Parser), 3. Semantic Analyzer, 4. Intermediate Code Generator (ICG), 5. Code Optimizer (CO), and 6. Code Generator (CG)
In addition to these, it also has Symbol table management and Error handler phases. Not all the phases are mandatory in every compiler; e.g., the Code Optimizer phase is optional in some cases. The description is given in the next section.
The phases of a compiler are divided into two parts: the first three phases are called the Analysis part, and the remaining three are called the Synthesis part.
Source program
      │
Lexical analyzer
      │
Syntax analyzer
      │
Semantic analyzer              (Symbol-table manager and Error handler
      │                         interact with all of the phases)
Intermediate code generator
      │
Code optimizer
      │
Code generator
      │
Target program

Figure 1.4: Phases of a Compiler
PHASES AND PASSES OF A COMPILER:
In some applications we can have a compiler that is organized into what are called passes, where a pass is a collection of phases that convert the input from one representation to a completely different representation. Each pass makes a complete scan of the input and produces output to be processed by the subsequent pass. An example is a two-pass assembler.
THE FRONT-END & BACK-END OF A COMPILER
All of the phases of a general compiler are conceptually divided into the Front-end and the Back-end. This division is due to their dependence on either the source language or the target machine. This model is called the Analysis & Synthesis model of a compiler.
The Front-end of the compiler consists of phases that depend primarily on the source language and are largely independent of the target machine. For example, the front-end of the compiler includes the Scanner, the Parser, creation of the Symbol table, the Semantic Analyzer, and the Intermediate Code Generator.
The Back-end of the compiler consists of phases that depend on the target machine; those portions do not depend on the source language, just on the intermediate language. Here we have different aspects of the Code Optimization phase and code generation, along with the necessary error handling and symbol table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase; it works as the interface between the compiler and the source language program and performs the following functions:
1. Reads the characters in the source program and groups them into a stream of tokens, in which each token specifies a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator.
2. The character sequence forming a token is called a lexeme of the token.
3. The Scanner generates a token-id, and also enters the identifier's name in the Symbol table if it does not already exist.
4. It also removes comments and unnecessary spaces.
The format of the token is <Token name, Attribute value>.
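The <Token name, Attribute value> pair above can be sketched as a small C structure. This is an illustrative sketch only; the type and names here (Token, TOK_ID, classify, and so on) are hypothetical, not from any particular compiler.

```c
#include <string.h>

/* Hypothetical token names -- illustrative only. */
enum TokenName { TOK_ID, TOK_IF, TOK_NUMBER };

/* A token as the pair <Token name, Attribute value>.  For an identifier
 * the attribute is typically an index into the symbol table. */
struct Token {
    enum TokenName name;
    int attribute;
};

/* Classify a lexeme: check for a keyword first, otherwise treat it as
 * an identifier whose attribute is its symbol-table index. */
struct Token classify(const char *lexeme, int symtab_index) {
    struct Token t;
    if (strcmp(lexeme, "if") == 0) {
        t.name = TOK_IF;
        t.attribute = 0;    /* keywords carry no attribute */
    } else {
        t.name = TOK_ID;
        t.attribute = symtab_index;
    }
    return t;
}
```

For example, classify("rate", 3) would produce the pair <id, 3>, where 3 is the symbol-table index of rate.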
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner and with its subsequent phase, the Semantic Analyzer, and performs the following functions:
1. Groups the received and recorded token stream into syntactic structures, usually into a structure called a parse tree, whose leaves are tokens.
2. The interior nodes of this tree represent the streams of tokens that logically belong together.
3. In other words, it checks the syntax of the program elements.
SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the semantic correctness of the program. Though the tokens may be valid and syntactically correct, it may happen that they are not correct semantically. Therefore the semantic analyzer checks the semantics (meaning) of the statements formed.
1. The syntactically and semantically correct structures are produced here in the form of a syntax tree or DAG or some other sequential representation like a matrix.
INTERMEDIATE CODE GENERATOR (ICG): This phase takes the syntactically and semantically correct structure as input and produces its equivalent intermediate notation of the source program. The intermediate code should have two important properties: it should be easy to produce, and easy to translate into the target program. Example intermediate code forms are:
1. Three-address code,
2. Polish (postfix) notation, etc.
CODE OPTIMIZER: This phase is optional in some compilers, but it is very useful and beneficial in terms of saving development time, effort, and cost. This phase performs the following specific functions:
1. Attempts to improve the intermediate code so as to obtain faster machine code. Typical functions include loop optimization, removal of redundant computations, strength reduction, frequency reduction, etc.
2. Sometimes the data structures used in representing the intermediate forms may also be changed.
CODE GENERATOR: This is the final phase of the compiler; it generates the target code, normally consisting of relocatable machine code, assembly code, or absolute machine code.
1. Memory locations are selected for each variable used, and assignment of variables to registers is done.
2. Intermediate instructions are translated into a sequence of machine instructions.
The compiler also performs Symbol table management and Error handling throughout the compilation process. The Symbol table is nothing but a data structure that stores the different source language constructs and tokens generated during compilation. These two components interact with all phases of the compiler.
For example, if the source program is the assignment statement
position = initial + rate * 60
the following shows how the phases of the compiler process the program.

Intermediate code:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Target code:
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

Figure 1.5: Translation of an assignment statement
LEXICAL ANALYSIS:
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output tokens for each lexeme in the source program. This stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well: when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. This process is shown in the following figure.
source program → [ Lexical Analyzer ] ──token──→ [ Parser ] → to semantic analysis
                         ↑ ───── getNextToken ──────┘
                (both phases interact with the Symbol Table)

Figure 1.6: Lexical Analyzer
When the lexical analyzer identifies the first token it sends it to the parser; the parser receives the token and calls the lexical analyzer to send the next token by issuing the getNextToken() command. This process continues until the lexical analyzer identifies all the tokens. During this process the lexical analyzer will neglect or discard the white spaces and comment lines.
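The request-driven interaction in Figure 1.6 can be sketched as a loop on the parser's side. The scanner here is only a stub handing out a fixed token stream; the names (getNextToken, TOK_EOF, the token codes) are illustrative assumptions, not a real API.

```c
/* Hypothetical token codes; TOK_EOF signals end of input. */
enum { TOK_EOF = 0, TOK_ID, TOK_PLUS };

/* Stub scanner: returns the tokens of "id + id" one per call, then
 * TOK_EOF -- standing in for a real lexical analyzer. */
static const int stream[] = { TOK_ID, TOK_PLUS, TOK_ID, TOK_EOF };
static int pos = 0;

int getNextToken(void) {
    return stream[pos++];
}

/* Parser side: keep issuing getNextToken() until the scanner reports
 * end of input, as described above. */
int count_tokens(void) {
    int n = 0;
    while (getNextToken() != TOK_EOF)
        n++;
    return n;
}
```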
TOKENS, PATTERNS AND LEXEMES:
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface, and we will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take (or match). In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Example: In the following C language statement,
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal (or string).
TOKEN        INFORMAL DESCRIPTION                    SAMPLE LEXEMES
if           characters i, f                         if
else         characters e, l, s, e                   else
comparison   < or > or <= or >= or == or !=          <=, !=

INPUT BUFFERING:
We often cannot tell that we have seen the end of a lexeme until we look ahead: on seeing the character <, for example, the lexeme may be <, or the start of <=. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.

E | = | M | * | C | * | * | 2 | eof
  ↑ lexemeBegin            ↑ forward

Figure 1.7: Using a Pair of Input Buffers
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Two pointers to the input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure, we see that forward has passed the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its buffer before determining it.
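The buffer-switching logic just described can be sketched as follows. This is a small in-memory simulation rather than real disk reads, and all names (N, advance, reload, the buffer layout) are illustrative assumptions; note the separate end-of-buffer test on every call, which is exactly the extra test the sentinel scheme below removes.

```c
#include <stddef.h>

#define N 4   /* tiny buffer size, just for illustration */

/* Two buffers loaded alternately; a real scanner would refill them
 * with one system read of N characters at a time. */
static char buf[2][N];
static int cur;          /* which buffer 'forward' is in    */
static int fwd;          /* offset of 'forward' in buf[cur] */

/* Simulated input source. */
static const char *src;
static size_t src_pos;

/* Refill buffer b with the next N characters of the input;
 * '\0' plays the role of the eof marker here. */
static void reload(int b) {
    size_t i;
    for (i = 0; i < N; i++)
        buf[b][i] = src[src_pos] ? src[src_pos++] : '\0';
}

void init(const char *s) {
    src = s; src_pos = 0;
    reload(0); reload(1);
    cur = 0; fwd = 0;
}

/* Advance 'forward' one character.  Every call pays for a separate
 * end-of-buffer test in addition to the caller's test on the
 * character returned. */
char advance(void) {
    char c = buf[cur][fwd++];
    if (fwd == N) {          /* end of current buffer: switch */
        cur = 1 - cur;
        fwd = 0;
        /* for input longer than 2N, the buffer just vacated
         * would be reloaded here */
    }
    return c;
}
```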
Sentinels To Improve Scanner Performance:
If we use the above scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.
E | = | M | * | eof | C | * | * | 2 | eof | ... | eof
  ↑ lexemeBegin               ↑ forward

Figure 1.8: Sentinels at the end of each buffer
Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 1.9 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.
switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}

Figure 1.9: Use of switch-case for the sentinel
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
LEX the Lexical Analyzer generator:
Lex is a tool used to generate a lexical analyzer. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c; this is a C program that is given to the C compiler, which produces the object code. Here we need to know how to write in the Lex language. The structure of a Lex program is given below.
Structure of a LEX Program: A Lex program has the following form:

Declarations
%%
Translation rules
%%
Auxiliary function definitions
The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions. It appears between %{ and %}.
In the translation rules section, we place pattern-action pairs, where each pair has the form
Pattern { Action }
The auxiliary function definitions section includes the definitions of functions used to install identifiers and numbers in the symbol table.
LEX Program Example:

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
CONTEXT FREE GRAMMARS:
A Context Free Grammar (CFG) consists of a set of terminals (T), a set of non terminals (V), a start symbol, and a set of productions of the form α → β, where α is a single non terminal and β is in (V ∪ T)*. Each production consists of:
(a) A non terminal called the head or left side of the production; this production defines some of the strings denoted by the head.
(b) The symbol →. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and non terminals. The components of the body describe one way in which strings of the non terminal at the head can be constructed.
Conventionally, the productions for the start symbol are listed first.
Example: Context Free Grammar to accept arithmetic expressions.
The terminals are +, -, *, /, (, ), id.
The non terminal symbols are expression, term, and factor, and expression is the start symbol.

expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id

Figure 2.3: Grammar for Simple Arithmetic Expressions
Notational Conventions Used in Writing CFGs:
To avoid always having to state that "these are the terminals," "these are the non terminals," and so on, the following notational conventions for grammars will be used throughout our discussions.
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ..., 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are non terminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent non terminals for the constructs. For example, the non terminals for expressions, terms, and factors are often represented by E, T, and F, respectively.
Using these conventions, the grammar for the arithmetic expressions can be written as

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id
DERIVATIONS:
The construction of a parse tree can be made precise by taking a derivational view, in which productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step replaces a non terminal by the body of one of its productions. This derivational view corresponds to the top-down construction of a parse tree as well as to its bottom-up construction.
Derivations are classified into Left Most Derivations and Right Most Derivations.
Left Most Derivation (LMD):
This is the process of constructing the parse tree, or accepting the given input string, in which every time we need to rewrite by a production rule, it is done with the leftmost non terminal only.
Example: if the grammar is E → E+E | E*E | -E | (E) | id and the input string is id + id * id.
The production E → -E signifies that if E denotes an expression, then -E must also denote an expression. The replacement of a single E by -E is described by writing
E ⇒ -E, which is read as "E derives -E".
For a general definition of derivation, consider a non terminal A in the middle of a sequence of grammar symbols, as in αAβ, where α and β are arbitrary strings of grammar symbols. Suppose A → γ is a production. Then we write αAβ ⇒ αγβ. The symbol ⇒ means "derives in one step". Often we wish to say "derives in zero or more steps"; for this purpose we can use the symbol ⇒*. If we wish to say "derives in one or more steps" we can use the symbol ⇒+. If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form of G.
The leftmost derivation for the given input string id + id * id is
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
NOTE: Every time we start from the root production only; the underline at a non terminal indicates that it is the (leftmost) non terminal we are choosing to rewrite in order to accept the string.
Right Most Derivation (RMD):
This is the process of constructing the parse tree, or accepting the given input string, in which every time we need to rewrite by a production rule, it is done with the rightmost non terminal only.
The rightmost derivation for the given input string id + id * id is
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
NOTE: Every time we start from the root production only; the underline at a non terminal indicates that it is the (rightmost) non terminal we are choosing to rewrite in order to accept the string.
What is a Parse Tree?
A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace non terminals.
1. Each interior node of a parse tree represents the application of a production.
2. All the interior nodes are non terminals and all the leaf nodes are terminals.
3. All the leaf nodes read from left to right form the output (yield) of the parse tree.
4. If a node n is labeled X and has children n1, n2, ..., nk with labels X1, X2, ..., Xk respectively, then there must be a production X → X1X2...Xk in the grammar.
Example 1: The parse tree for the input string - (id + id) using the above context free grammar is

        E
       / \
      -   E
        / | \
       (  E  )
        / | \
       E  +  E
       |     |
       id    id

Figure 2.4: Parse tree for the input string - (id + id)
The following figure shows the step-by-step construction of the parse tree using the CFG for the input string - (id + id).

E  ⇒    E    ⇒     E      ⇒      E
       / \        / \           / \
      -   E      -   E         -   E
               / | \         / | \
              (  E  )       (  E  )
                             / | \
                            E  +  E
                            |     |
                            id    id

Figure 2.5: Sequence of outputs of the parse-tree construction process for the input string -(id+id)
Example 2: The parse tree for the input string id + id * id using the above context free grammar is

        E
      / | \
     E  +  E
     |   / | \
     id E  *  E
        |     |
        id    id

Figure 2.6: Parse tree for the input string id + id * id.
AMBIGUITY in CFGs:
Definition: A grammar that produces more than one parse tree for some sentence (input string) is said to be ambiguous.
In other words, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.
Informally, a production whose right-hand side contains two occurrences of the head non terminal (as in E → E + E) is a common source of such ambiguity.
Example: if the grammar is E → E+E | E*E | -E | (E) | id and the input string is id + id * id,
two parse trees for the given input string are:

 (a)    E              (b)     E
      / | \                  / | \
     E  +  E                E  *  E
     |   / | \            / | \   |
     id E  *  E          E  +  E  id
        |     |          |     |
        id    id         id    id
The two leftmost derivations for the given input string are:

(a) E ⇒ E + E            (b) E ⇒ E * E
      ⇒ id + E                 ⇒ E + E * E
      ⇒ id + E * E             ⇒ id + E * E
      ⇒ id + id * E            ⇒ id + id * E
      ⇒ id + id * id           ⇒ id + id * id
The above grammar gives two parse trees (two derivations) for the given input string, so it is an ambiguous grammar.
Note: An LL(1) parser will not accept an ambiguous grammar; we cannot construct an LL(1) parser for ambiguous grammars, because such grammars may cause the top-down parser to go into an infinite loop or make it consume more time for parsing. If necessary, we must remove all types of ambiguity from the grammar and then construct the parser.
ELIMINATING AMBIGUITY: Since ambiguous grammars may cause the top-down parser to go into an infinite loop, or to consume more time during parsing, an ambiguous grammar can sometimes be rewritten to eliminate the ambiguity. The general form of the productions that cause ambiguity in grammars is
A → Aα | β
This can be written as (introducing one new non terminal):
A → βA'
A' → αA' | ε
Example: Let the grammar be E → E+E | E*E | -E | (E) | id. It was shown above that it is ambiguous; it can be rewritten as follows. The grammar is
E → E+E
E → E*E
E → -E
E → (E)
E → id
In the above grammar the 1st and 2nd productions carry the ambiguity. They can be written as
E → E+E | B, where B stands for E*E.
This matches the general form above, so it can be written as
E → E + T | T
The value of B is E*E, so the grammar can now be written as
1) E → E + T | T
2) T → E * E
The first production is free from ambiguity; substituting E → T in the 2nd production, it can be written as
T → T * T | -E | (E) | id
This production again can be written as
T → T * T | B, where B is -E | (E) | id. Introducing a new non terminal on the right-hand side, it becomes
T → T * F | F
F → -E | (E) | id
Now the entire grammar has turned into its equivalent unambiguous form. The unambiguous grammar equivalent to the given ambiguous one is
1) E → E + T | T
2) T → T * F | F
3) F → -E | (E) | id
LEFT RECURSION:
Another feature of CFGs which is not desirable in top-down parsers is left recursion. A grammar is left recursive if it has a non terminal A such that there is a derivation A ⇒+ Aα for some string α in (T ∪ V)*. LL(1) or top-down parsers cannot handle left recursive grammars, so we need to remove the left recursion from a grammar before it is used in top-down parsing.
The general form of left recursion is
A → Aα | β
The above left recursive production can be written in its non-left-recursive equivalent form as:
A → βA'
A' → αA' | ε
Example: Is the following grammar left recursive? If so, find a non-left-recursive grammar equivalent to it.
E → E + T | T
T → T * F | F
F → -E | (E) | id
Yes, the grammar is left recursive: the first two productions satisfy the general form of left recursion. Rewriting E → E + T and T → T * F after removing the left recursion gives
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → -E | (E) | id
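Because the rewritten grammar is no longer left recursive, each production can be turned directly into a mutually recursive procedure. The sketch below assumes single-character tokens, with 'i' standing for id; the function names simply mirror the non terminals and are illustrative, not from any standard library.

```c
#include <stdbool.h>

static const char *input;   /* next unread character */

static bool match(char c) {
    if (*input == c) { input++; return true; }
    return false;
}

static bool E(void);        /* forward declaration: F refers back to E */

static bool F(void) {
    if (match('i')) return true;                 /* F  -> id      */
    if (match('-')) return E();                  /* F  -> -E      */
    return match('(') && E() && match(')');      /* F  -> (E)     */
}

static bool Tprime(void) {
    if (match('*')) return F() && Tprime();      /* T' -> *FT'    */
    return true;                                 /* T' -> epsilon */
}

static bool T(void) {                            /* T  -> FT'     */
    return F() && Tprime();
}

static bool Eprime(void) {
    if (match('+')) return T() && Eprime();      /* E' -> +TE'    */
    return true;                                 /* E' -> epsilon */
}

static bool E(void) {                            /* E  -> TE'     */
    return T() && Eprime();
}

/* Accept iff E derives the whole input string. */
bool parse(const char *s) {
    input = s;
    return E() && *input == '\0';
}
```

For example, parse("i+i*i") and parse("-(i+i)") succeed, while parse("i+") and parse("+i") fail.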
LEFT FACTORING:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing. A grammar in which more than one production has a common prefix is rewritten by factoring out the prefixes.
For example, in the following grammar there are n A-productions with the common prefix a, which should be factored out without changing the language defined for A.
A → aA1 | aA2 | aA3 | aA4 | ... | aAn
We can factor out the a from all n productions by adding a new non terminal A' and rewriting the productions as
A → aA'
A' → A1 | A2 | A3 | A4 | ... | An
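This transformation can also be sketched in code. The snippet below is an illustrative sketch (the function name and grammar encoding are our own): it handles the simple case above, where every alternative of A shares the same one-symbol prefix.

```python
# A minimal sketch of left factoring: alternatives that all share a common
# first symbol 'a' are replaced by A -> a A', with A' holding the leftovers.

def left_factor(alternatives, nonterminal):
    """alternatives: list of symbol lists for the non terminal A."""
    prefix = alternatives[0][0]
    if not all(alt and alt[0] == prefix for alt in alternatives):
        return {nonterminal: alternatives}    # no common prefix to factor
    prime = nonterminal + "'"
    # drop the shared prefix; an emptied alternative becomes ε
    leftovers = [alt[1:] or ["ε"] for alt in alternatives]
    return {nonterminal: [[prefix, prime]], prime: leftovers}

# A -> aA1 | aA2 | aA3  becomes  A -> aA',  A' -> A1 | A2 | A3
g = left_factor([["a", "A1"], ["a", "A2"], ["a", "A3"]], "A")
```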
FIRST AND FOLLOW:
The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G. During top down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input (look ahead) symbol.
Computation of FIRST:
The FIRST function computes the set of terminal symbols with which the right hand sides of the productions can begin. To compute FIRST (A) for all grammar symbols, apply the following rules until no more terminals or ε can be added to any FIRST set.
1. If A is a terminal, then FIRST (A) = {A}.
2. If A is a non terminal and A → X1X2...Xn is a production, then FIRST (A) = FIRST (X1) if X1 cannot derive ε. If X1 is a non terminal and X1 → ε, add FIRST (X2) to FIRST (A); if X2 → ε, add FIRST (X3) to FIRST (A); and so on. If all Xi for i = 1..n derive ε, add ε to FIRST (A).
3. If A → ε is a production, then add ε to FIRST (A).
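As a concrete sketch, the rules above can be run as a fixed-point iteration. The dictionary encoding and names below are our own; the grammar is the expression grammar used in the example that follows.

```python
# A sketch of the FIRST computation as a fixed-point iteration.
# "ε" stands for the empty string.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMINALS = set(GRAMMAR)

def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                       # repeat until no FIRST set grows
        changed = False
        for a, alternatives in grammar.items():
            for alt in alternatives:
                for x in alt:
                    if x == "ε":
                        add = {"ε"}
                    elif x not in NONTERMINALS:   # terminal: FIRST(x) = {x}
                        add = {x}
                    else:
                        add = first[x] - {"ε"}
                    if not add <= first[a]:
                        first[a] |= add
                        changed = True
                    # look at the next symbol only if x can derive ε
                    if x in NONTERMINALS and "ε" in first[x]:
                        continue
                    break
                else:
                    # every symbol of the alternative derives ε
                    if "ε" not in first[a]:
                        first[a].add("ε")
                        changed = True
    return first

FIRST = compute_first(GRAMMAR)
```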
Computation Of FOLLOW:
FOLLOW (A) is the set of terminal symbols of the grammar that can appear immediately to the right of the non terminal A in some sentential form. If a is to the immediate right of the non terminal A, then a is in FOLLOW (A). To compute FOLLOW (A) for all non terminals A, apply the following rules until no more symbols can be added to any FOLLOW set.
1. Place $ in FOLLOW (S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST (β) except ε is in FOLLOW (B).
3. If there is a production A → αB, or a production A → αBβ where FIRST (β) contains ε, then everything in FOLLOW (A) is in FOLLOW (B).
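The FOLLOW rules can be sketched the same way, as a fixed-point iteration. In this illustrative sketch (encoding and names our own) the FIRST sets are taken as given; they match the values computed for the expression grammar in the example below.

```python
# A sketch of the FOLLOW computation over the expression grammar.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "E'": {"+", "ε"},
    "T": {"(", "id"}, "T'": {"*", "ε"}, "F": {"(", "id"},
}

def first_of_string(symbols):
    """FIRST of a sentential form X1 X2 ... Xn."""
    result = set()
    for x in symbols:
        f = FIRST.get(x, {x})          # terminal (or ε): FIRST(x) = {x}
        result |= f - {"ε"}
        if "ε" not in f:
            return result
    return result | {"ε"}              # every Xi can derive ε

def compute_follow(grammar, start):
    follow = {a: set() for a in grammar}
    follow[start].add("$")             # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for alt in alternatives:
                for i, b in enumerate(alt):
                    if b not in grammar:
                        continue       # FOLLOW is defined for non terminals only
                    beta = first_of_string(alt[i + 1:])
                    add = beta - {"ε"}             # rule 2
                    if "ε" in beta:                # rule 3
                        add |= follow[a]
                    if not add <= follow[b]:
                        follow[b] |= add
                        changed = True
    return follow

FOLLOW = compute_follow(GRAMMAR, "E")
```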
Example: Compute the FIRST and FOLLOW values of the expression grammar
1. E → TE'
2. E' → +TE' | ε
3. T → FT'
4. T' → *FT' | ε
5. F → (E) | id
Computing FIRST Values:
FIRST (E) = FIRST (T) = FIRST (F) = {(, id}
FIRST (E') = {+, ε}
FIRST (T') = {*, ε}
Computing FOLLOW Values:
FOLLOW (E) = {$, )}, because E is the start symbol ($ by rule 1) and E appears inside F → (E), which puts ) in FOLLOW (E) by rule 2.
FOLLOW (E') = FOLLOW (E) = {$, )}, satisfying the 3rd rule of FOLLOW.
FOLLOW (T) = {FIRST (E') - ε} ∪ FOLLOW (E'), satisfying the 2nd and 3rd rules
           = {+} ∪ {$, )}
           = {+, $, )}
FOLLOW (T') = FOLLOW (T) = {+, $, )}, satisfying the 3rd rule.
FOLLOW (F) = {FIRST (T') - ε} ∪ FOLLOW (T'), satisfying the 2nd and 3rd rules
           = {*} ∪ {+, $, )}
           = {*, +, $, )}
NON TERMINAL    FIRST        FOLLOW
E               {(, id}      {$, )}
E'              {+, ε}       {$, )}
T               {(, id}      {+, $, )}
T'              {*, ε}       {+, $, )}
F               {(, id}      {*, +, $, )}
Table 2.1: FIRST and FOLLOW values
Constructing Predictive Or LL (1) Parse Table:
It is the process of placing all the productions of the grammar in the parse table based on the FIRST and FOLLOW values of the productions.
The rules to be followed to construct the parsing table (M) are:
For each production A → α of the grammar, do the below steps.
i. For each terminal symbol a in FIRST (α), add the production A → α to M [A, a].
ii. If ε is in FIRST (α), add the production A → α to M [A, b] for every terminal b in FOLLOW (A).
iii. If ε is in FIRST (α) and $ is in FOLLOW (A), then add the production A → α to M [A, $].
iv. Mark all other entries in the parsing table as error.
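The three placement rules can be sketched directly in code. This illustrative sketch (encoding and names our own) uses the FIRST and FOLLOW values of Table 2.1 and builds the table as a dictionary; any pair missing from the dictionary is an error entry.

```python
# A sketch of LL(1) parse table construction from FIRST and FOLLOW.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "E'": {"+", "ε"},
    "T": {"(", "id"}, "T'": {"*", "ε"}, "F": {"(", "id"},
}
FOLLOW = {
    "E": {"$", ")"}, "E'": {"$", ")"},
    "T": {"+", "$", ")"}, "T'": {"+", "$", ")"},
    "F": {"*", "+", "$", ")"},
}

def first_of_string(symbols):
    result = set()
    for x in symbols:
        f = FIRST.get(x, {x})            # terminal (or ε): FIRST(x) = {x}
        result |= f - {"ε"}
        if "ε" not in f:
            return result
    return result | {"ε"}

def build_table(grammar):
    table = {}
    for a, alternatives in grammar.items():
        for alt in alternatives:
            f = first_of_string(alt)
            for terminal in f - {"ε"}:       # rule i
                table[(a, terminal)] = alt
            if "ε" in f:                     # rules ii and iii
                for terminal in FOLLOW[a]:   # FOLLOW already contains $ where needed
                    table[(a, terminal)] = alt
    return table                             # entries not present are errors

M = build_table(GRAMMAR)
```

A grammar is LL(1) exactly when no `table[(a, terminal)]` slot would receive two different productions during this construction.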
                                   INPUT SYMBOLS
NON-TERMINAL   id         +           *           (          )          $
E              E → TE'                            E → TE'
E'                        E' → +TE'                          E' → ε     E' → ε
T              T → FT'                            T → FT'
T'                        T' → ε      T' → *FT'              T' → ε     T' → ε
F              F → id                             F → (E)
Table 2.2: LL (1) Parsing Table for the Expression Grammar
Note: if there are no multiple entries in the table for a single terminal, then the grammar is accepted by the LL (1) parser.
LL (1) Parsing Algorithm:
The parser acts on the basis of two symbols:
i. A, the symbol on the top of the stack
ii. a, the current input symbol
There are three conditions on A and a that are used by the parsing program.
1. If A = a = $, then parsing is successful.
2. If A = a ≠ $, then the parser pops A off the stack and advances the current input pointer to the next input symbol.
3. If A is a non terminal, the parser consults the entry M [A, a] in the parsing table. If M [A, a] is a production A → X1X2...Xn, then the parser replaces the A on the top of the stack by X1X2...Xn in such a way that X1 comes on top. If M [A, a] is empty, an error is reported.
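The three conditions above can be sketched as a short table-driven loop. This is an illustrative sketch (our own encoding): the table is Table 2.2 written as a dictionary, with an ε-production stored as an empty list so that expanding it pushes nothing.

```python
# A sketch of the three-condition LL(1) parsing program.

TABLE = {
    ("E", "id"): ["T", "E'"],        ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],   ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],        ("T", "("): ["F", "T'"],
    ("T'", "+"): [],                 ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [],                 ("T'", "$"): [],
    ("F", "id"): ["id"],             ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                     # start symbol on top
    pos = 0
    while True:
        top, current = stack[-1], tokens[pos]
        if top == current == "$":
            return True                    # condition 1: success
        if top == current:                 # condition 2: match a terminal
            stack.pop()
            pos += 1
        elif top in NONTERMINALS and (top, current) in TABLE:
            stack.pop()                    # condition 3: expand by M[A, a]
            stack.extend(reversed(TABLE[(top, current)]))
        else:
            return False                   # empty table entry: error
```

For example, `parse(["id", "+", "id", "*", "id"])` goes through exactly the stack configurations shown in Table 2.3.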
STRING ACCEPTANCE BY THE PARSER:
If the input string for the parser is id + id * id, the table below shows how the parser accepts the string with the help of the stack.

Stack      Input           Action               Comments
$E         id + id * id $  E → TE'              E on top of the stack is replaced by TE'
$E'T       id + id * id $  T → FT'              T on top of the stack is replaced by FT'
$E'T'F     id + id * id $  F → id               F on top of the stack is replaced by id
$E'T'id    id + id * id $  pop and remove id    Condition 2 is satisfied
$E'T'      + id * id $     T' → ε               T' on top of the stack is replaced by ε
$E'        + id * id $     E' → +TE'            E' on top of the stack is replaced by +TE'
$E'T+      + id * id $     pop and remove +     Condition 2 is satisfied
$E'T       id * id $       T → FT'              T on top of the stack is replaced by FT'
$E'T'F     id * id $       F → id               F on top of the stack is replaced by id
$E'T'id    id * id $       pop and remove id    Condition 2 is satisfied
$E'T'      * id $          T' → *FT'            T' on top of the stack is replaced by *FT'
$E'T'F*    * id $          pop and remove *     Condition 2 is satisfied
$E'T'F     id $            F → id               F on top of the stack is replaced by id
$E'T'id    id $            pop and remove id    Condition 2 is satisfied
$E'T'      $               T' → ε               T' on top of the stack is replaced by ε
$E'        $               E' → ε               E' on top of the stack is replaced by ε
$          $               Parsing successful   Condition 1 is satisfied
Table 2.3: Sequence of steps taken by the parser in parsing the input token stream id + id * id.
[Figure 2.7: Parse tree for the input id + id * id]
ERROR HANDLING (RECOVERY) IN PREDICTIVE PARSING:
In table driven predictive parsing, it is clear which terminals and non terminals the parser expects from the rest of the input. An error can be detected in the following situations:
1. When the terminal on top of the stack does not match the current input symbol.
2. When a non terminal A is on top of the stack, a is the current input symbol, and M [A, a] is empty or error.
The parser recovers from the error and continues its process. The following error recovery schemes are used in predictive parsing:
Panic Mode Error Recovery:
It is based on the idea that when an error is detected, the parser skips the remaining input until a synchronizing token is encountered. Some examples are listed below:
1. For a non terminal A, all symbols in FOLLOW (A) are added into the synchronizing set of the non terminal A. For example, consider the assignment statement “c = ;”. Here, the expression on the right hand side is missing, so the FOLLOW of the expression is considered. It is “;”, which is taken as the synchronizing token. On encountering it, the parser emits the error message “Missing Expression”.
2. For a non terminal A, all symbols in FIRST (A) are added into the synchronizing set of the non terminal A. For example, consider the assignment statement “22 c = a + b;”. Here, 22 is not in FIRST of the statement, so the parser skips it until a token in FIRST appears and reports the error “extraneous token”.
Phrase Level Recovery :
It can be implemented in the predictive parsing by filling up the blank entries in
the predictive parsing table with pointers to error Handling routines. These routines can
insert, modify or delete symbols in the input.
RECURSIVE DESCENT PARSING:
A recursive-descent parsing program consists of a set of recursive procedures, one for each non terminal. Each procedure is responsible for parsing the constructs defined by its non terminal. Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string.
If the given grammar is
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Recursive procedures of the recursive descent parser for the given grammar are given below.
procedure E( )
{
    T( );
    E'( );
}

procedure E'( )
{
    if input = '+'
    {
        advance( );
        T( );
        E'( );
    }
    return true;    /* E' → ε, so a missing '+' is not an error */
}

procedure T( )
{
    F( );
    T'( );
}

procedure T'( )
{
    if input = '*'
    {
        advance( );
        F( );
        T'( );
    }
    return true;    /* T' → ε */
}

procedure F( )
{
    if input = '('
    {
        advance( );
        E( );
        if input = ')'
        {
            advance( );
            return true;
        }
        else return error;
    }
    else if input = 'id'
    {
        advance( );
        return true;
    }
    else return error;
}
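Rendered as runnable Python, the procedures look as follows. This is an illustrative sketch (the class, method, and token names are our own): one method per non terminal, with `advance` moving the input pointer.

```python
# A runnable sketch of the recursive descent parser for
# E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]     # $ marks the end of the input
        self.pos = 0

    @property
    def input(self):
        return self.tokens[self.pos]     # current input symbol

    def advance(self):
        self.pos += 1                    # input = next token

    def E(self):
        return self.T() and self.E_prime()

    def E_prime(self):
        if self.input == "+":            # E' -> + T E'
            self.advance()
            return self.T() and self.E_prime()
        return True                      # E' -> ε

    def T(self):
        return self.F() and self.T_prime()

    def T_prime(self):
        if self.input == "*":            # T' -> * F T'
            self.advance()
            return self.F() and self.T_prime()
        return True                      # T' -> ε

    def F(self):
        if self.input == "(":            # F -> ( E )
            self.advance()
            if self.E() and self.input == ")":
                self.advance()
                return True
            return False
        if self.input == "id":           # F -> id
            self.advance()
            return True
        return False                     # error

def accepts(tokens):
    p = Parser(tokens)
    return p.E() and p.input == "$"      # success only if all input is scanned
```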
advance( )
{
    input = next token;
}

BACK TRACKING: This parsing method uses the technique called the Brute Force method during the parse tree construction process. It allows the process to go back (backtrack) and redo the steps by undoing the work done so far up to the point of processing.
Brute force method: It is a top down parsing technique used when there is more than one alternative in the productions to be tried while parsing the input string. The parser selects the alternatives in the order they appear, and when it realizes that something has gone wrong, it tries the next alternative.
For example, consider the grammar below.
S → cAd
A → ab | a
To generate the input string “cad”, initially the first parse tree given below is generated. As the string generated is not “cad”, the input pointer is backtracked to the position of “A”, to examine the next alternative of “A”. Now a match to the input string occurs, as shown in the 2nd parse tree given below.
[Figure: parse trees for the input “cad”: (a) using A → ab, which fails to match; (b) after backtracking, using A → a, which succeeds]
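The backtracking behaviour for this grammar can be sketched in a few lines of Python (an illustrative sketch; the function names are our own). The alternatives of A are tried in order, and on a mismatch the input pointer is reset before the next alternative is tried.

```python
# A sketch of brute-force backtracking for S -> cAd, A -> ab | a.

def match_S(tokens):
    # S -> c A d
    if tokens[:1] != ["c"]:
        return False
    for alternative in (["a", "b"], ["a"]):   # A -> ab | a, tried in order
        p = 1 + len(alternative)
        if tokens[1:p] == alternative and tokens[p:] == ["d"]:
            return True
        # mismatch: back track the input pointer to just after 'c' and retry
    return False
```

On the input "cad" the first alternative A → ab fails, the pointer is reset, and the second alternative A → a succeeds, mirroring the two parse trees above.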
IMPORTANT AND EXPECTED QUESTIONS
1. Explain the components and working of a Predictive Parser with an example?
2. What do the FIRST and FOLLOW values represent? Give the algorithm for computing FIRST and FOLLOW of grammar symbols with an example?
3. Construct the LL (1) parsing table for the following grammar?
E → E+T | T
T → T*F | F
F → (E) | id
4. For the above grammar, construct and explain the Recursive Descent Parser?
5. What happens if multiple entries occur in your LL (1) parsing table? Justify your answer. How does the parser behave in that case?
ASSIGNMENT QUESTIONS
1. Eliminate the left recursion from the below grammar?
A → Aab | AcB | b
B → Ba | d
2. Explain the procedure to remove the ambiguity from a given grammar with your own example?
3. Write the grammar for the if-else statement in the C programming language and check it for left factoring?
4. Will the predictive parser accept an ambiguous grammar? Justify your answer?
5. Is the grammar G = { S → L=R, S → R, R → L, L → *R | id } an LL (1) grammar?
BOTTOM-UP PARSING
Bottom-up parsing corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom nodes) and working up towards the root (the top node). It involves reducing an input string w to the start symbol of the grammar. In each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production; the process traces out a rightmost derivation in reverse. For example, consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
Bottom up parsing of the input string “id * id” is as follows:

INPUT STRING    SUB STRING    REDUCING PRODUCTION
id * id         id            F → id
F * id          F             T → F
T * id          id            F → id
T * F           T * F         T → T*F
T               T             E → T
E                             Start symbol reached; hence, the input string is accepted
Parse tree representation is as follows:
[Figure 3.1: A bottom-up parse tree for the input string “id*id”, built from the leaves up through the sentential forms id*id, F*id, T*id, T*F, T, and finally E]
Bottom up parsing is classified into: 1. Shift-Reduce parsing, 2. Operator Precedence parsing, and 3. [Table Driven] LR parsing:
i. SLR (1)
ii. CALR (1)
iii. LALR (1)
SHIFT-REDUCE PARSING:
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. We use $ to mark the bottom of the stack and also the right end of the input. The parser uses shift and reduce actions to accept the input string; the parse tree is constructed bottom up from the leaf nodes towards the root node.
When parsing the given input string, if the top of the stack matches the right side of a production, the parser takes the reduce action; otherwise it goes for the shift action. Shift-reduce parsing can also accept ambiguous grammars.
For example, consider the below grammar and the input string “id * id”, parsed using the shift-reduce parser.
E → E+T | T
T → T*F | F
F → (E) | id
Actions of the shift-reduce parser using the stack implementation:

STACK      INPUT        ACTION
$          id * id $    Shift
$id        * id $       Reduce with F → id
$F         * id $       Reduce with T → F
$T         * id $       Shift
$T*        id $         Shift
$T*id      $            Reduce with F → id
$T*F       $            Reduce with T → T*F
$T         $            Reduce with E → T
$E         $            Accept
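The shift-reduce loop above can be sketched in a few lines. This is an illustrative sketch only (encoding and names our own): it reduces greedily whenever the top of the stack matches a production body, with one hard-coded check that defers E-reductions while a '*' is pending. That check stands in for the handle-finding that a real LR table performs; it happens to suffice for this grammar.

```python
# A sketch of shift-reduce parsing for E -> E+T | T, T -> T*F | F, F -> (E) | id.

PRODUCTIONS = [                       # longer bodies listed first
    ("E", ["E", "+", "T"]),
    ("T", ["T", "*", "F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["id"]),
    ("T", ["F"]),
    ("E", ["T"]),
]

def parse(tokens):
    stack, rest = [], tokens + ["$"]
    while True:
        reduced = True
        while reduced:                # reduce as long as a handle is on top
            reduced = False
            for head, body in PRODUCTIONS:
                if stack[-len(body):] != body:
                    continue
                if rest[0] == "*" and head == "E":
                    continue          # defer: '*' binds tighter than '+'
                del stack[len(stack) - len(body):]
                stack.append(head)    # replace the handle by the left side
                reduced = True
                break
        if rest[0] == "$":
            return stack == ["E"]     # accept only if reduced to the start symbol
        stack.append(rest.pop(0))     # shift the next input symbol
```

On the input `["id", "*", "id"]` the stack passes through exactly the configurations shown in the table above.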