Lexical Analysis in Compiler Design

Lexical Analysis is the initial phase of a compiler that converts source code into tokens by reading characters and removing whitespace and comments. It utilizes regular expressions and finite automata (NFA and DFA) for token recognition, while also managing a symbol table for identifiers. The document covers various aspects of lexical analysis, including error reporting, the use of tools like LEX, and the importance of regular expressions in defining token patterns.


⭐ UNIT – 2 : LEXICAL ANALYSIS

Lexical Analysis is the first phase of the compiler front-end.

It reads the source code character by character and groups them into tokens.

⭐ 1. What is Lexical Analysis?

Lexical Analysis (also called Scanning) is the process of converting a sequence of characters from the source
program into a sequence of tokens.

Example:

Input:

int a = b + 5;

Tokens:

• int (keyword)

• a (identifier)

• = (operator)

• b (identifier)

• + (operator)

• 5 (number)

• ; (delimiter)
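A minimal sketch of such a scanner in Python, using the `re` module (the token names and patterns here are illustrative, not taken from any particular compiler):

```python
import re

# Token specification as (name, regex) pairs; order matters —
# keywords must be tried before the general identifier pattern.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("DELIMITER",  r"[;(){},]"),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Convert a character stream into a list of (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind != "WHITESPACE":          # whitespace is discarded
            tokens.append((kind, m.group()))
    return tokens

print(tokenize("int a = b + 5;"))
```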

⭐ 2. Role of the Lexical Analyzer

Functions:

1. Converts character stream → token stream

2. Removes whitespace, tabs, newlines

3. Removes comments

4. Applies regular expressions to recognize lexemes

5. Communicates with symbol table

6. Reports lexical errors

7. Performs buffering for fast input reading

⭐ 3. Tokens, Lexemes, Patterns (VERY IMPORTANT — Short Question)

Token:

A token is a category of lexemes such as identifier, keyword, number, operator.

Lexeme:

A lexeme is the actual character sequence that matches a pattern.

Example:

Token = identifier

Lexeme = "count"
Pattern:

Rule describing how a lexeme should look (regular expression).

Example pattern for identifier:

letter (letter | digit)*

⭐ 4. Regular Expressions (RE)

REs describe patterns for tokens.

Lexical analyzer uses REs to recognize lexemes.

Examples:

Digit → 0|1|…|9

Identifier → Letter (Letter|Digit)*

Number → Digit+
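These three patterns can be checked directly with Python's `re` module; a quick illustrative sketch:

```python
import re

# The token patterns listed above, written as Python regexes.
IDENTIFIER = r"[A-Za-z][A-Za-z0-9]*"   # Letter (Letter|Digit)*
NUMBER = r"[0-9]+"                      # Digit+

def matches(pattern, lexeme):
    # fullmatch requires the whole lexeme to fit the pattern
    return re.fullmatch(pattern, lexeme) is not None

print(matches(IDENTIFIER, "count"))   # True
print(matches(IDENTIFIER, "9lives"))  # False: may not start with a digit
print(matches(NUMBER, "42"))          # True
```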

⭐ 5. Finite Automata (NFA & DFA)

Lexical analyzer converts regular expressions to finite automata.

➤ NFA (Nondeterministic Finite Automata)

• Multiple transitions for same input

• Easy to construct

• Not efficient for scanning

➤ DFA (Deterministic Finite Automata)

• Only one transition for each input symbol

• Fast and efficient

• Used by compilers

RE → NFA → DFA → Minimized DFA

⭐ 6. Recognition of Tokens using DFA

Lexical analyzer uses minimized DFA to recognize tokens.

Example:

Recognize numbers (digit+) using a DFA:

(start) --digit--> ((final))     the final state loops on digit
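The recognition step can be simulated with a tiny table-driven DFA; the state names and the `classify` helper below are hypothetical, chosen for illustration:

```python
# A minimal DFA that accepts numbers (digit+): one start state and one
# accepting state that loops on further digits.
DFA = {
    ("start", "digit"): "final",
    ("final", "digit"): "final",
}
ACCEPTING = {"final"}

def classify(ch):
    return "digit" if ch.isdigit() else "other"

def accepts(string):
    state = "start"
    for ch in string:
        state = DFA.get((state, classify(ch)))
        if state is None:          # no valid transition: reject
            return False
    return state in ACCEPTING

print(accepts("123"), accepts(""), accepts("12a"))
```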

⭐ 7. Finite Automata Construction Steps (Exam question)

1. Write regular expression

2. Convert RE to NFA (Thompson’s Construction)

3. Convert NFA to DFA (Subset Construction)

4. Minimize DFA

5. Use the DFA in lexical analyzer

⭐ 8. Lexical Errors

Examples:
• Invalid characters

• Unclosed string "hello

• Illegal identifier: 3abc

Lexical analyzer reports errors and tries error recovery.

⭐ 9. Symbol Table Management

Lexical analyzer interacts with the symbol table.

Stores:

• Identifier name

• Type

• Scope

• Memory location

Every time a new identifier appears → symbol table entry is created.

⭐ 10. Lex Tool (IMPORTANT – PYQ)

LEX is a tool that automatically generates a lexical analyzer.

Parts of Lex program:

1. Definitions section

2. Rules section

3. User functions section

How Lex works:

• You write rules using RE

• Lex generates a C program

• The C program scans input & returns tokens

⭐ Regular Expression (RE)

A Regular Expression (RE) is a formula or pattern used to describe a set of strings in a language.

It is widely used in lexical analysis, pattern matching, and token recognition in compilers.

Regular expressions describe the lexical structure of tokens such as identifiers, keywords, numbers, operators, etc.

⭐ Definition

A Regular Expression over an alphabet Σ is a pattern that represents a regular language.

Example alphabet:

Σ = {a, b}

Then REs like a*, ab+, (a|b)* describe different languages.

⭐ Basic Regular Expressions (Building Blocks)

1. Empty string (ε)

Represents a string of length 0.

2. Single character
a represents the string “a”.

3. Concatenation (XY)

If X = a, Y = b

Then XY = ab

4. Union / Alternation (X | Y)

Either X or Y.

Example:

a | b → {a, b}

5. Kleene Star (X*)

Zero or more repetitions of X.

Example:

a* → {ε, a, aa, aaa, …}

6. Kleene Plus (X+)

One or more repetitions of X.

Example:

a+ → {a, aa, aaa, …}

7. Parentheses ( )

Used to group expressions.

Example:

(a|b)*c

⭐ Examples of Regular Expressions

✔ Example 1: RE for all strings of a’s and b’s

(a | b)*

✔ Example 2: RE for identifiers

Identifiers usually start with a letter, followed by letters or digits:

letter (letter | digit)*

✔ Example 3: RE for numbers

digit+

✔ Example 4: RE for even number of a’s

b* (a b* a b*)*

✔ Example 5: RE for strings ending in "ab"

(a|b)* ab

⭐ Properties of Regular Expressions

1. Describe regular languages

2. Can be converted to finite automata (NFA, DFA)


3. Used in lexical analyzers to identify tokens

4. Simple and efficient to implement

5. Cannot represent languages requiring memory (e.g., balanced parentheses)

⭐ Applications of Regular Expressions

• Lexical analysis to identify keywords, identifiers, operators

• Search and replace (grep, sed, awk)

• Text editors for pattern matching

• Data validation (email, phone number)

• Token generation in languages like Python, C, Java

⭐ Regular Expression → NFA → DFA → Minimized DFA

This is a major process in compiler design:

Regular Expression

Thompson Construction

NFA

Subset Construction

DFA

DFA Minimization

⭐ NFA and DFA (Very Important Topic)

Finite Automata are machines used to recognize patterns in input strings.

They are used in Lexical Analysis for token recognition.

Finite automata are of two types:

1. NFA – Nondeterministic Finite Automata

2. DFA – Deterministic Finite Automata

⭐ 1. NFA (Nondeterministic Finite Automaton)

✔ Definition

An NFA is a finite automaton where:

• From a state, multiple transitions for the same input symbol may exist

• ε-moves (epsilon moves) are allowed — i.e., move without consuming input

• It is simpler to construct

• Not efficient for scanning actual input


✔ Formal Definition

An NFA is a 5-tuple (Q, Σ, δ, q₀, F)

where

• Q = set of states

• Σ = input alphabet

• δ = transition function (Q × Σ → 2^Q)

• q₀ = start state

• F = set of final states

✔ Example NFA for RE: a(b|c)*

(q0) --a--> (q1)

(q1) --b--> (q1)   loop on b

(q1) --c--> (q1)   loop on c

q1 is the final state.

⭐ 2. DFA (Deterministic Finite Automaton)

✔ Definition

A DFA is a finite automaton where:

• For each state, exactly one transition exists for each input symbol

• No ε-moves

• Fast and efficient for lexical analysis

• Used by compilers

✔ Formal Definition

A DFA is also a 5-tuple (Q, Σ, δ, q₀, F)

but

δ: Q × Σ → Q (not a set)

✔ Example DFA for strings ending with ab

(q0)--a-->(q1)   (q0)--b-->(q0)

(q1)--a-->(q1)   (q1)--b-->(q2)

(q2)--a-->(q1)   (q2)--b-->(q0)

q2 is the final state.

⭐ 3. NFA vs DFA (Exam Table)

NFA | DFA

ε-moves allowed | No ε-moves

Multiple transitions for the same input allowed | Exactly one transition per input symbol

Easy to construct | Harder to construct

Fast to design, slow to execute | Slow to design, fast to execute

May have many possible current states | Only one current state

Used for theoretical design | Used in actual compilers

⭐ 4. Conversion: NFA → DFA (Subset Construction Algorithm)

(Important 7 marks)

Steps:

1. Start with ε-closure(start state of NFA) → this is DFA start state

2. From each DFA state, find transitions for each input symbol

3. Take union of all resulting NFA states

4. Each unique set = one DFA state

5. Continue until no new states formed

6. Mark states that contain an NFA final state → DFA final states
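The subset-construction steps above can be sketched as follows (encoding the NFA as a dict of `(state, symbol) → set of states` is an assumption made for illustration, with `""` standing for ε):

```python
from collections import deque

def epsilon_closure(states, nfa):
    """All NFA states reachable via ε-moves from `states`."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), set()):   # "" denotes an ε-move
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    d_start = epsilon_closure({start}, nfa)
    dfa, queue, seen = {}, deque([d_start]), {d_start}
    while queue:
        S = queue.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= nfa.get((s, a), set())
            T = epsilon_closure(moved, nfa)
            if T:
                dfa[(S, a)] = T
                if T not in seen:           # new DFA state discovered
                    seen.add(T)
                    queue.append(T)
    d_finals = {S for S in seen if S & finals}   # contains an NFA final state
    return dfa, d_start, d_finals

# Hypothetical NFA for a(b|c)*: q0 --a--> q1, q1 loops on b and c.
nfa = {("q0", "a"): {"q1"}, ("q1", "b"): {"q1"}, ("q1", "c"): {"q1"}}
dfa, d0, dF = nfa_to_dfa(nfa, "q0", {"q1"}, "abc")
print(len(dF))
```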

⭐ 5. DFA Minimization (Short Note)

Steps:

1. Partition states into final and non-final sets

2. Split further if transitions lead to different groups

3. Continue until no more splitting

4. Each group = one minimized DFA state
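A minimal sketch of this partition-refinement idea, applied to a small hypothetical DFA that has one redundant state:

```python
def minimize(states, alphabet, delta, finals):
    """Partition refinement: start with {final, non-final}, then split
    any group whose members transition into different groups."""
    parts = [p for p in (set(finals), set(states) - set(finals)) if p]
    changed = True
    while changed:
        changed = False
        new_parts = []
        for p in parts:
            def sig(s):
                # signature: which group each input symbol leads to
                return tuple(
                    next((i for i, q in enumerate(parts)
                          if delta.get((s, a)) in q), None)
                    for a in alphabet)
            buckets = {}
            for s in p:
                buckets.setdefault(sig(s), set()).add(s)
            new_parts.extend(buckets.values())
            if len(buckets) > 1:
                changed = True
        parts = new_parts
    return parts

# Hypothetical DFA for a+ with a redundant state: q1 and q2 behave alike.
delta = {("q0", "a"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q2"}
parts = minimize({"q0", "q1", "q2"}, "a", delta, {"q1", "q2"})
print(parts)
```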

⭐ 6. Applications of NFA/DFA

✔ Used in Lexical Analyzer

✔ Token recognition (identifier, keyword, number)

✔ Pattern matching (grep, regex engines)

✔ Scanner generation tools (LEX)

⭐ NFA to DFA Conversion (Simple & Short)

NFA is converted to DFA using the Subset Construction Method.

✔ Steps:

1. Start State of DFA = ε-closure of NFA start state.

2. For each DFA state (which is a set of NFA states),

find transitions for each input symbol.

3. The union of all reachable NFA states becomes a new DFA state.

4. Repeat until no new states are produced.

5. Any DFA state containing an NFA final state becomes a final DFA state.

Context-Free Grammar (CFG)

A Context-Free Grammar is a formal grammar used to describe the syntax of programming languages and
languages in automata theory.
Definition

A CFG is a 4-tuple G = (V, T, S, P) where:

• V = set of variables (non-terminals)

• T = set of terminals (alphabet)

• S = start symbol

• P = set of production rules of the form A → α

(A is a non-terminal, α is any combination of terminals & non-terminals)

Key Points

• Every production rule must have one non-terminal on LHS.

• CFG generates Context-Free Languages (CFL).

• Accepted by Pushdown Automata (PDA).

• Used for syntax analysis in compilers.

Example of CFG

For language L = { aⁿ bⁿ | n ≥ 1 }

S → aSb

S → ab

Derivation Example

Generate “aaabbb”:

S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb
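Membership in this language can be checked mechanically by undoing one production per step, mirroring the derivation above; a small sketch:

```python
def in_anbn(s):
    """Check membership in L = { a^n b^n | n >= 1 } by mirroring the
    grammar S → aSb | ab: strip one 'a' and one 'b' per step."""
    while s.startswith("a") and s.endswith("b"):
        if s == "ab":          # base production S → ab
            return True
        s = s[1:-1]            # undo S → aSb
    return False

print(in_anbn("aaabbb"), in_anbn("aab"), in_anbn(""))
```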

Parse Tree

A tree showing how the string is derived using rules.

Uses of CFG

• Designing programming language syntax

• Parsing

• Compilers

• Natural language processing

Parsing (Short, Simple Notes)

Parsing is the process of analyzing a string (program) using a grammar to check whether it is syntactically correct.

It converts source code → parse tree.

What Parsing Does

• Checks syntax of the input

• Builds a parse tree

• Helps detect syntax errors

• Used in syntax analysis phase of a compiler

Types of Parsing
Parsing is mainly of two types:

1. Top-Down Parsing

Starts from Start symbol (S) and tries to derive the input string.

Examples:

• Recursive Descent Parser

• LL(1) Parser

Characteristics:

• Expands from top (root) to bottom (leaves)

• Leftmost derivation

• Simple but struggles with left recursion

2. Bottom-Up Parsing

Starts from the input string and reduces it to the start symbol.

Examples:

• Shift-Reduce Parsing

• LR Parsing (LR, SLR, LALR)

Characteristics:

• Builds parse tree from bottom to top

• Rightmost derivation in reverse

• More powerful, used in real compilers

Parse Tree

A hierarchical tree showing how the grammar produces the input.

Root → Start Symbol

Leaves → Input Tokens

Parsing Errors

• Syntax error

• Missing symbols

• Operator precedence error

Parser tries error recovery methods like:

• Panic mode

• Phrase level recovery

Short Example

Grammar:

S → aS | b

Input: aaab

Parse:
S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaab
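The derivation above corresponds to a one-function recursive-descent recognizer; a minimal sketch:

```python
def parse(s):
    """Recursive-descent recognizer for S → aS | b.
    Returns True iff the whole input derives from S."""
    def S(i):
        if i < len(s) and s[i] == "a":   # production S → aS
            return S(i + 1)
        if i < len(s) and s[i] == "b":   # production S → b
            return i + 1
        return None                      # no production applies
    end = S(0)
    return end == len(s)

print(parse("aaab"), parse("b"), parse("aaba"))
```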

LL(1) Parsing (Short Notes)

What is LL(1)?

LL(1) is a Top-Down Parsing technique.

• L = Left-to-right scanning of input

• L = Leftmost derivation

• 1 = Uses 1 look-ahead symbol

LL(1) parsers are predictive parsers (no backtracking).

Conditions for LL(1) Grammar

A grammar is LL(1) if:

1. No left recursion

2. Left factoring is done

3. FIRST and FOLLOW sets do not conflict

LL(1) Parsing Steps

To create LL(1) parser:

1. Compute FIRST set

FIRST(X) = terminals that can appear first in strings derived from X.

2. Compute FOLLOW set

FOLLOW(A) = terminals that can appear immediately after A.

3. Build LL(1) Parsing Table

For each production A → α:

• Put A → α in Table[A, a] for each terminal a ∈ FIRST(α)

• If ε ∈ FIRST(α), put A → α in Table[A, b] for each b ∈ FOLLOW(A)
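Step 1 (FIRST sets) can be computed iteratively; a sketch for a tiny left-factored expression grammar, where `i` stands for `id` and `""` denotes ε (the grammar encoding is an assumption for illustration):

```python
def compute_first(grammar, terminals):
    """Iteratively compute FIRST sets for a grammar given as
    {nonterminal: [right-hand sides as symbol strings]}; "" is ε."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                if prod == "":                       # A → ε
                    add = {""}
                else:
                    add = set()
                    for sym in prod:
                        f = {sym} if sym in terminals else first[sym]
                        add |= f - {""}
                        if "" not in f:              # sym not nullable: stop
                            break
                    else:
                        add.add("")                  # every symbol nullable
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

# E → T R,  R → + T R | ε,  T → id   (id written as 'i')
grammar = {"E": ["TR"], "R": ["+TR", ""], "T": ["i"]}
first = compute_first(grammar, set("+i"))
print(first)
```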

Advantages

• No backtracking

• Fast and simple

• Good for small grammars

Disadvantages

• Cannot handle left recursion

• Cannot handle ambiguous grammars

LR Parsing (Short Notes)

LR parsing is a bottom-up parsing technique used by most real compilers.

Meaning of LR:

• L → Left-to-right scan of input

• R → Rightmost derivation in reverse


LR parsers are powerful, efficient, and handle larger grammars than LL(1).

Why LR Parsing?

• Detects syntax errors early

• Handles left recursion

• More powerful than LL parsers

• No backtracking

Types of LR Parsers

1. LR(0)

2. SLR(1) (Simple LR)

3. LALR(1) (Look-Ahead LR)

4. Canonical LR(1)

Main Components

LR parser uses two tables:

1. ACTION table

Tells parser to:

• Shift (read symbol)

• Reduce (apply a production)

• Accept

• Error

2. GOTO table

Tells which state to go after reduction.

LR Parsing Steps

1. Read input left to right

2. Use stack to store states

3. Use ACTION table to decide:

o Shift → push symbol & state

o Reduce → replace RHS with LHS

4. Use GOTO table after reduction

5. Continue until Accept

Example Grammar

S → aA

A → b

(Not full table here—just concept)
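To make the stack mechanics concrete anyway, here is a hand-built ACTION/GOTO table for this grammar with a small driver loop — the state numbering is an illustrative construction, not derived on the page:

```python
# Hand-built LR-style tables for the grammar S → aA, A → b.
# States: 0 = start, 1 = accepted S, 2 = after 'a', 3 = after 'b', 4 = after A.
ACTION = {
    (0, "a"): ("shift", 2),
    (2, "b"): ("shift", 3),
    (3, "$"): ("reduce", "A", 1),   # A → b  (pop 1 state)
    (4, "$"): ("reduce", "S", 2),   # S → aA (pop 2 states)
    (1, "$"): ("accept",),
}
GOTO = {(0, "S"): 1, (2, "A"): 4}

def lr_parse(tokens):
    stack = [0]                      # stack of states
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False             # error entry
        if act[0] == "accept":
            return True
        if act[0] == "shift":
            stack.append(act[1]); i += 1
        else:                        # reduce: pop RHS states, GOTO on LHS
            _, lhs, n = act
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])

print(lr_parse(["a", "b"]), lr_parse(["b"]))
```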

Conflicts in LR Parsing

1. Shift-Reduce Conflict
2. Reduce-Reduce Conflict

If a grammar has no conflicts → It is LR.

Advantages

• Very powerful

• Detects errors earlier

• Can parse almost all programming languages

Disadvantages

• Tables are large

• Hard to construct manually

⭐ Types of LR Parsers (Short Notes)

There are 4 types of LR parsers:

1. LR(0)

2. SLR(1) (Simple LR)

3. LR(1) (Canonical LR)

4. LALR(1) (Look Ahead LR)

1. LR(0) Parser

• Simplest LR parser

• Uses LR(0) items (no lookahead)

• Smallest parsing tables

• Accepts very few grammars

• Often has shift-reduce conflicts

✅ Easy to construct

❌ Very weak power

2. SLR(1) Parser (Simple LR)

• Uses FOLLOW sets to reduce conflicts

• More powerful than LR(0)

• Still simple and table size small

• Accepts more grammars than LR(0)

✅ Easy & widely used

❌ Still may have conflicts

3. Canonical LR(1) Parser

• Uses LR(1) items (item + 1 lookahead symbol)

• Most powerful LR parser

• Almost all programming language grammars can be parsed

• But table size is very large


✅ Most powerful

❌ Very large tables

4. LALR(1) Parser (Look-Ahead LR)

• Combines states with same LR(0) core

• Table size small like SLR

• Power almost equal to LR(1)

• Used in many compilers (e.g., YACC)

✅ Good balance: small table + strong power

❌ Slightly weaker than full LR(1)

⭐ Difference Between SLR, LALR, and CLR

1. SLR(1) (Simple LR)

• Uses FOLLOW sets for lookahead

• Based on LR(0) items

• Parsing table is small

• Less powerful (more conflicts)

• Easier to construct

2. LALR(1) (Look-Ahead LR)

• Uses LR(1) items, but merges states with same LR(0) core

• Parsing table is medium-sized

• More powerful than SLR

• Used in real compilers (like YACC)

3. CLR / LR(1) (Canonical LR)

• Uses full LR(1) items (item + 1 lookahead)

• Parsing table is very large

• Most powerful LR parser

• Almost no conflicts

• Hardest to construct

⭐ Precedence Parsing (Short Notes)

Precedence parsing is a bottom-up parsing technique used for parsing expressions using operator precedence (like
+, -, *, /, ^).

It works by using precedence relations between terminals.

⭐ Types of Precedence Parsing

1. Operator Precedence Parsing

2. Priority/Precedence Grammar Parsing

⭐ When We Use Precedence Parsing?


• Used for arithmetic expressions

• Grammar must be operator precedence grammar

• No ambiguous or left-recursive rules

• No ε-productions allowed

⭐ Precedence Relations

Parser uses three relations between terminals:

1. a <· b (a yields precedence to b)

2. a =· b (a has equal precedence with b)

3. a ·> b (a takes precedence over b)

These relations guide shifting or reducing.

⭐ Operator Precedence Table

It uses a precedence matrix/table to show these relations.

Example (rows = topmost stack terminal, columns = lookahead symbol):

        +    *    id   $

  +     ·>   <·   <·   ·>

  *     ·>   ·>   <·   ·>

  id    ·>   ·>        ·>

  $     <·   <·   <·

⭐ How Precedence Parsing Works (Simple Steps)

1. Scan input from left to right

2. Use precedence table to decide:

o Shift if a <· b or a =· b

o Reduce if a ·> b

3. Keep performing shift/reduce until input ends

4. Accept if reduced to start symbol
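The shift/reduce loop above can be sketched directly from the precedence table; the handle-popping logic below is a simplified illustration (nonterminals are collapsed to a single placeholder `E`):

```python
# Precedence relations: rows are the topmost terminal on the stack,
# columns the lookahead symbol ('<' is <·, '>' is ·>).
PREC = {
    "+":  {"+": ">", "*": "<", "id": "<", "$": ">"},
    "*":  {"+": ">", "*": ">", "id": "<", "$": ">"},
    "id": {"+": ">", "*": ">", "$": ">"},
    "$":  {"+": "<", "*": "<", "id": "<"},
}

def op_precedence_parse(tokens):
    """Shift/reduce using precedence relations; returns the handles
    in the order they are reduced."""
    stack, reductions = ["$"], []
    tokens = tokens + ["$"]
    i = 0
    while True:
        top = next(s for s in reversed(stack) if s in PREC)
        a = tokens[i]
        if top == "$" and a == "$":
            return reductions                     # accept
        rel = PREC[top].get(a)
        if rel in ("<", "="):
            stack.append(a); i += 1               # shift
        elif rel == ">":                          # reduce a handle
            popped = []
            while True:
                sym = stack.pop()
                popped.append(sym)
                if sym in PREC:
                    below = next(s for s in reversed(stack) if s in PREC)
                    if PREC[below].get(sym) == "<":
                        break                     # handle boundary found
            if stack[-1] not in PREC:             # nonterminal left of handle
                popped.append(stack.pop())
            stack.append("E")
            reductions.append(" ".join(reversed(popped)))
        else:
            raise SyntaxError(f"no relation between {top} and {a}")

print(op_precedence_parse(["id", "+", "id", "*", "id"]))
```

Note that `E * E` is reduced before `E + E`, which is exactly how the table encodes the higher precedence of `*`.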

⭐ Advantages

• Good for parsing expressions

• Simple implementation

• No backtracking needed

⭐ Disadvantages

• Works only for restricted grammars

• Cannot handle general programming language syntax

• No ε-productions permitted

⭐ TYPES OF PRECEDENCE PARSING

Precedence parsing mainly has two types:


1. Operator Precedence Parsing

2. Weak Precedence Parsing (or Simple Precedence Parsing)

Some books also mention Strong Precedence Parsing, but the main two used in exams are the ones above.

⭐ 1. Operator Precedence Parsing

This is the most common precedence parsing technique.

Key Features:

• Works only for Operator Precedence Grammars

• Uses operator precedence table

• Uses three relations:

o <· (yields precedence)

o =· (same precedence)

o ·> (takes precedence)

How it Works:

• If the stack-top terminal yields precedence (<·) to the input symbol → shift

• If it takes precedence (·>) → reduce

Use Case:

• Arithmetic expressions like:

id + id * id

Advantages:

• Simple and fast

• Good for expression evaluation

Disadvantages:

• Grammar must have no left recursion, no ambiguity, no ε-productions

⭐ 2. Weak (Simple) Precedence Parsing

This is a generalization of operator precedence parsing.

Key Features:

• Uses precedence functions (f and g functions)

• Works for slightly more grammars than operator precedence

• No need for =· relation

• Reduces using weakest precedence

How it Works:

• Assign numerical precedence to terminals using two functions:

o f(a) = precedence of symbol on stack

o g(a) = precedence of input symbol

• If f(a) < g(b) → Shift


• If f(a) > g(b) → Reduce

Use Case:

• Parsing expressions with well-defined precedence

• Slightly more powerful than operator precedence

⭐ 3. (Extra) Strong Precedence Parsing (if needed)

Some books mention this for theory only.

Features:

• Stronger conditions

• More powerful than weak precedence

• Still not suitable for general programming languages

⭐ Syntax-Directed Definition (SDD)

A Syntax-Directed Definition is a formal way to attach meaning, actions, or computations to a grammar.

It is used in the syntax analysis + semantic analysis phase of a compiler.

In SDD:

• Each grammar production has semantic rules

• Each grammar symbol has attributes

⭐ Definition (Exam Line)

A Syntax-Directed Definition (SDD) is a context-free grammar where each grammar symbol is associated with
attributes and each production is associated with semantic rules that define how to compute those attributes.

⭐ Components of SDD

1. Attributes

o Values or information attached to grammar symbols

o Example: type, value, place, code

2. Semantic Rules

o Specify how attributes are computed

o Attached to grammar productions

⭐ Types of Attributes

1. Synthesized Attributes

o Computed from children nodes

o Flow bottom-up

2. Inherited Attributes

o Obtained from parent or siblings

o Flow top-down or sideways

⭐ Example of SDD

Grammar:
E → E1 + T

E→T

T → digit

Semantic rules:

E.val = E1.val + T.val

E.val = T.val

T.val = digit.lexval

Here, E.val and T.val are attributes (the numeric values of the subexpressions).

⭐ Uses of SDD

• Type checking

• Expression evaluation

• Intermediate code generation

• Building syntax trees

• Checking declarations and scope rules

⭐ Attributes (Short Notes)

Attributes are values or information associated with grammar symbols in a Syntax-Directed Definition (SDD).

They help to add meaning to a grammar, beyond just syntax.

⭐ Example with SDD

Grammar:

E→E+T

E→T

T → digit

Attributes:

• E.val, T.val

Semantic rules:

E.val = E1.val + T.val

E.val = T.val

T.val = digit.lexval

⭐ Syntax Tree (Parse Tree / Abstract Syntax Tree)

A Syntax Tree is a tree representation of the structure of a program according to a grammar.

It shows how an expression or statement is derived from the grammar rules.

⭐ Definition (Short Answer)

A Syntax Tree is a hierarchical tree structure that represents the syntactic structure of a source program.

Each internal node is an operator, and each leaf node is an operand or identifier.

⭐ Characteristics

• Shows the operator at the parent node.

• Shows operands as child nodes.


• Removes unnecessary grammar details (more compact than parse tree).

• Used for syntax analysis, type checking, and intermediate code generation.

⭐ Example: Syntax Tree for a + b * c

        +

       / \

      a   *

         / \

        b   c

Explanation:

• * is evaluated before +, so * becomes a child of +.

⭐ Uses of Syntax Tree

• Type checking

• Intermediate code generation

• Optimization

• Semantic analysis

• Detecting operator precedence and associativity

⭐ 1. Activation Record (Short Notes)

An Activation Record (AR) or Stack Frame is a memory block created on the runtime stack when a
function/procedure is called.

It stores all the information needed to manage a function call.

⭐ Contents of an Activation Record

Field Purpose

Return Address Where to return after function ends

Parameters Values passed to function

Local Variables Variables declared inside function

Temporary Values Intermediate computation values

Control Link Pointer to caller's activation record

Access Link Used to access non-local variables (static scoping)

Saved Registers Register values to restore later

⭐ Diagram (Simple)

--------------------

| Temporaries |

| Local Variables |

| Parameters |

| Return Address |
| Control Link |

--------------------

⭐ 2. Static and Dynamic Scoping

✔ Static Scoping (Lexical Scoping)

• Scope is determined at compile time.

• Compiler uses the program text to find the variable’s declaration.

• Uses static chain / access link to access non-local variables.

• Most languages: C, C++, Java, Python use static scoping.

Example:

a=5

function f():

print(a) → prints 5 (determined at compile time)

✔ Dynamic Scoping

• Scope determined at runtime, based on calling sequence.

• Uses dynamic chain to find variables.

• Value depends on which function called which.

Example:

a=5

function f():

print(a)

function g():

a = 10

f()

Call g() → output is 10 (variable found dynamically in caller).

⭐ Differences (Exam Table)

Static Scoping Dynamic Scoping

Determined at compile time Determined at runtime

Uses static chain Uses dynamic chain

Predictable Hard to predict

Based on program structure Based on calling order

⭐ 3. Parameter Passing Techniques (Short Notes)

✔ 1. Call by Value

• Actual value is copied to the function.

• Changes inside function do not affect the caller.

✔ 2. Call by Reference
• Address of actual parameter is passed.

• Changes inside function affect original variable.

✔ 3. Call by Value-Result (Copy-In Copy-Out)

• Value copied to function at start & copied back to caller at end.

✔ 4. Call by Name

• Parameter is substituted like macro expansion.

• Expression evaluated each time it is used.

✔ 5. Call by Need

• Like call-by-name, but evaluates the expression once (lazy evaluation).

✔ 6. Call by Result

• No initial value; output is copied back after function ends.

⭐ 1. DAG (Directed Acyclic Graph)

A DAG is used in Intermediate Code Generation to represent expressions and detect common subexpressions.

⭐ Definition (Short)

A DAG is a graph with no cycles where:

• Leaves = operands (constants, variables)

• Interior nodes = operators

• If two subexpressions are the same → they share the same node (removes duplicates)

⭐ Use of DAG

• Eliminates common subexpressions

• Helps in code optimization

• Detects dead code

• Determines evaluation order

⭐ Example

Expression:

a=b+c

d=b+c

DAG:

    a   d

     \ /

      +

     / \

    b   c

Both a and d point to the same b+c node → duplicate removed.
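A sketch of DAG construction via value numbering, where identical (op, operand, operand) triples share one node (the encoding is an assumption for illustration):

```python
def build_dag(statements):
    """Value-numbering DAG construction for statements (dest, op, l, r):
    identical (op, left, right) triples share a single node."""
    nodes = {}            # (op, left-node, right-node) -> shared node
    var_node = {}         # variable name -> node it currently labels
    def leaf(x):
        if x not in var_node:
            var_node[x] = ("leaf", x)
        return var_node[x]
    for dest, op, l, r in statements:
        key = (op, leaf(l), leaf(r))
        if key not in nodes:               # new subexpression: new node
            nodes[key] = key
        var_node[dest] = nodes[key]        # dest labels the shared node
    return nodes, var_node

nodes, labels = build_dag([("a", "+", "b", "c"),
                           ("d", "+", "b", "c")])
print(len(nodes), labels["a"] == labels["d"])
```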

⭐ 2. Quadruple Representation
Quadruple = 4 fields:

| op | arg1 | arg2 | result |

Used in Three-Address Code (TAC).

⭐ Example

Expression:

x=a+b*c

Step 1: t1 = b * c

Step 2: x = a + t1

Quadruples:

op arg1 arg2 result

* b c t1

+ a t1 x
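Quadruples like these can be emitted by a post-order walk of the expression tree; a sketch (the nested-tuple tree encoding and the temporary naming are assumptions — it emits an extra temporary `t2` plus a copy, rather than writing to `x` directly):

```python
import itertools

def gen_quads(root):
    """Post-order walk emitting (op, arg1, arg2, result) quadruples."""
    quads = []
    counter = itertools.count(1)
    def walk(node):
        if isinstance(node, str):          # leaf: variable name
            return node
        op, l, r = node
        a1, a2 = walk(l), walk(r)
        result = f"t{next(counter)}"       # fresh temporary
        quads.append((op, a1, a2, result))
        return result
    return walk(root), quads

res, quads = gen_quads(("+", "a", ("*", "b", "c")))   # a + b * c
quads.append(("=", res, "", "x"))                     # x = a + b * c
for q in quads:
    print(q)
```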

⭐ 3. Triple Representation

Triple = 3 fields:

| op | arg1 | arg2 |

Results are not named — they use index numbers.

⭐ Example

For the same expression:

Step 1: b*c → (0)

Step 2: a + (0) → (1)

Triples:

Index op arg1 arg2

(0) * b c

(1) + a (0)

⭐ Peephole Optimization

✔ Definition (Short Answer)

Peephole Optimization is a local optimization technique that improves code by looking at a small window (peephole)
of instructions and replacing inefficient code with more efficient code.

The compiler examines 2–5 instructions at a time and optimizes them.

⭐ Main Types of Peephole Optimizations

1. Constant Folding

Evaluate constant expressions at compile time.

Example:

t = 3 * 4 → t = 12

2. Strength Reduction
Replace expensive operations with cheaper ones.

Example:

x = y * 2 → x = y << 1

3. Algebraic Simplification

Use algebraic identities to simplify code.

Example:

x = x + 0 → remove

y = y * 1 → remove

4. Redundant Load/Store Elimination

Remove repeated loads/stores.

Example:

LOAD A

LOAD A → Remove second LOAD

5. Unreachable Code Removal

Remove instructions that are never executed.
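A few of these rules can be combined into one illustrative pass over three-address tuples (the tuple encoding `(dest, op, arg1, arg2)` is an assumption; real peephole optimizers work on machine code):

```python
def peephole(code):
    """One pass of simple peephole rules over (dest, op, a1, a2) tuples;
    numbers are ints, variable names are strings."""
    out = []
    for dest, op, a1, a2 in code:
        # constant folding: both operands known at compile time
        if op == "*" and isinstance(a1, int) and isinstance(a2, int):
            out.append((dest, "=", a1 * a2, None))
        elif op == "+" and isinstance(a1, int) and isinstance(a2, int):
            out.append((dest, "=", a1 + a2, None))
        # algebraic simplification: x + 0 and x * 1 are just copies
        elif (op == "+" and a2 == 0) or (op == "*" and a2 == 1):
            out.append((dest, "=", a1, None))
        # strength reduction: multiply by 2 becomes a left shift
        elif op == "*" and a2 == 2:
            out.append((dest, "<<", a1, 1))
        else:
            out.append((dest, op, a1, a2))
    return out

code = [("t", "*", 3, 4), ("x", "+", "y", 0), ("z", "*", "y", 2)]
for line in peephole(code):
    print(line)
```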

⭐ Advantages

• Simple and fast

• Works with final machine code

• Removes common inefficiencies

• Improves performance and reduces code size

⭐ Basic Block

✔ Definition (Short Answer)

A Basic Block is a sequence of consecutive statements in a program that:

• Always executes from start to end without any jump (branch) inside.

• Has only one entry point and one exit point.

⭐ Characteristics of a Basic Block

1. Single Entry

Control enters at the beginning only.

2. Single Exit

Control leaves at the end only.

3. No branching inside

No jump or label appears in the middle.

4. Straight-line code

Instructions execute sequentially.

⭐ Uses of Basic Block


• Code optimization

• Control flow analysis

• DAG construction

• Local optimization (peephole, constant folding)

⭐ How to Identify Basic Blocks?

Find leaders first:

A statement is a leader if:

1. The first statement is always a leader.

2. Any statement that is the target of a jump.

3. Any statement immediately after a jump.

After identifying leaders → form blocks until the next leader.

⭐ Loop Optimization

✔ Definition (Short Answer)

Loop optimization refers to techniques used by the compiler to improve the performance of loops, because loops
execute many times and consume a large portion of program time.

⭐ Types of Loop Optimization (Very Important)

1. Loop Invariant Code Motion

• Move statements that do not change inside a loop to outside the loop.

• Reduces unnecessary repeated computations.

Example:

Before:

for i:

x = a + b ← same every iteration

y=x+i

After:

x=a+b ← moved out

for i:

y=x+i

2. Loop Unrolling

• Expands loop body to reduce overhead of increment, comparison, and jump.

• Makes code faster by reducing loop control instructions.

Example:

Before:

for i = 1 to 4:

A[i] = A[i] + 1
After unrolling:

A[1] = A[1] + 1

A[2] = A[2] + 1

A[3] = A[3] + 1

A[4] = A[4] + 1

3. Loop Fusion (Loop Jamming)

• Combine two adjacent loops that iterate over the same range.

Before:

for i: A[i]=0

for i: B[i]=1

After:

for i:

A[i]=0

B[i]=1

4. Loop Fission (Loop Splitting)

• Split one loop into multiple loops to improve cache locality or parallelism.

5. Loop Strength Reduction

• Replace expensive operations in loops with cheaper ones.

Example:

x = i * 2 → replaced by x = x + 2

6. Loop Interchange

• Swap nested loops to improve memory access efficiency.

Example:

for i:

for j:

A[i][j]

→ interchange loops if beneficial for cache.

⭐ Why Loop Optimization is Important?

• Loops run many times → optimizing them gives major performance gain

• Reduces execution time

• Reduces unnecessary calculations

• Improves cache performance

⭐ Flow Graph (Control Flow Graph – CFG)

✔ Definition (Short Answer)

A Flow Graph is a directed graph that represents the flow of control in a program.
Each node represents a basic block, and each edge represents a possible flow of execution from one block to
another.

⭐ Components of a Flow Graph

1. Node / Vertex

o Represents a basic block (straight-line code).

2. Edge / Arrow

o Shows the possible control flow (jump, branch, next statement).

3. Entry Node

o The starting block.

4. Exit Node

o The ending block.

⭐ Example Program

1. a = b + c

2. if a > 10 goto L1

3. d = a - 5

4. goto L2

L1: e = a * 2

L2: print(e)

Step 1: Identify Basic Blocks

B1: statement 1 (a = b + c)

B2: statement 2 (the conditional jump)

B3: statements 3, 4

B4: L1 (e = a * 2)

B5: L2 (print(e))

⭐ Flow Graph for the Program

      B1

      │

      B2

     /  \

   B3    B4

     \  /

      B5

Edges: B1 → B2; B2 → B3 (condition false), B2 → B4 (a > 10, goto L1); B3 → B5 (goto L2); B4 → B5 (falls through).

⭐ Purpose of Flow Graph

✔ Used in code optimization

✔ Used in loop detection

✔ Used in control flow analysis

✔ Shows branching, merging paths

✔ Used to find dominators, natural loops, etc.

⭐ 1. Code Generation (Short Notes)

Code Generation is the final phase of the compiler where intermediate code (TAC) is converted into machine code
or assembly code.

✔ Tasks in Code Generation

1. Instruction selection

2. Register allocation

3. Addressing modes selection

4. Handling jumps, labels, procedures

5. Generating machine instructions

✔ Requirements of a Good Code Generator

• Generated code must be correct

• Efficient (fast, uses fewer registers)

• Uses less memory

• Should match architecture (RISC/CISC)

✔ Inputs to Code Generator

• Intermediate code

• Symbol table

• Type information

• Register information

✔ Output

• Target machine code / assembly code.

⭐ 2. Register Allocation (Short Notes)

Register Allocation decides which program variables or temporaries should be stored in registers instead of
memory.

Registers are limited → good allocation makes code faster.

✔ Two Steps:

1. Register Allocation
• Assigns variables to registers.

2. Register Assignment

• Decides which register stores which variable.

✔ Goal of Register Allocation

• Reduce load/store instructions

• Improve execution speed

• Minimize register spills (values stored to memory when registers are full)

✔ Methods

• Graph coloring (most common)

• Linear scan allocation

• Top-down allocation

⭐ 3. Instruction Selection (Short Notes)

Instruction Selection chooses the best machine instructions to implement the intermediate code.

✔ Goals

• Generate efficient target instructions

• Minimize number of machine instructions

• Use powerful instructions if architecture supports them

✔ Factors Considered

1. Target machine architecture

(RISC or CISC instruction set)

2. Addressing modes

3. Cost of instructions

4. Operator precedence and expression tree

✔ Example

Intermediate Code:

t1 = a * b

t2 = t1 + c

Possible machine code:

MUL R1, a, b

ADD R2, R1, c

If the machine supports a combined multiply-add instruction:

MADD R2, a, b, c ; R2 = a * b + c

→ Instruction selection chooses the better one.
