⭐ UNIT – 2 : LEXICAL ANALYSIS
Lexical Analysis is the first phase of the compiler front-end.
It reads the source code character by character and groups them into tokens.
⭐ 1. What is Lexical Analysis?
Lexical Analysis (also called Scanning) is the process of converting a sequence of characters from the source
program into a sequence of tokens.
Example:
Input:
int a = b + 5;
Tokens:
• int (keyword)
• a (identifier)
• = (operator)
• b (identifier)
• + (operator)
• 5 (number)
• ; (delimiter)
⭐ 2. Role of the Lexical Analyzer
Functions:
1. Converts character stream → token stream
2. Removes whitespace, tabs, newlines
3. Removes comments
4. Applies regular expressions to recognize lexemes
5. Communicates with symbol table
6. Reports lexical errors
7. Performs buffering for fast input reading
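The character-stream-to-token-stream job described above can be sketched with Python's `re` module. The token names (KEYWORD, IDENT, etc.) and the tiny keyword list are illustrative choices, not from any standard tool:

```python
import re

# Token categories and their regular-expression patterns.
# Order matters: KEYWORD is tried before IDENT.
TOKEN_SPEC = [
    ("KEYWORD",   r"\b(?:int|if|else|while|return)\b"),
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("OPERATOR",  r"[+\-*/=]"),
    ("DELIMITER", r"[;(){},]"),
    ("SKIP",      r"\s+"),           # whitespace is discarded, as in step 2
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Convert a character stream into a list of (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int a = b + 5;"))
```

Running it on the example from section 1 reproduces the token list shown there.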
⭐ 3. Tokens, Lexemes, Patterns (VERY IMPORTANT — Short Question)
Token:
A token is a category of lexemes such as identifier, keyword, number, operator.
Lexeme:
A lexeme is the actual character sequence that matches a pattern.
Example:
Token = identifier
Lexeme = "count"
Pattern:
Rule describing how a lexeme should look (regular expression).
Example pattern for identifier:
letter (letter | digit)*
⭐ 5. Regular Expressions (RE)
REs describe patterns for tokens.
Lexical analyzer uses REs to recognize lexemes.
Examples:
Digit → 0|1|…|9
Identifier → Letter (Letter|Digit)*
Number → Digit+
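The identifier and number patterns above translate directly into Python regexes; a quick sketch to check lexemes against them:

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter|digit)*
number     = re.compile(r"[0-9]+")                # digit+

print(bool(identifier.fullmatch("count")))  # True
print(bool(identifier.fullmatch("3abc")))   # False: starts with a digit
print(bool(number.fullmatch("42")))         # True
```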
⭐ 6. Finite Automata (NFA & DFA)
Lexical analyzer converts regular expressions to finite automata.
➤ NFA (Nondeterministic Finite Automata)
• Multiple transitions for same input
• Easy to construct
• Not efficient for scanning
➤ DFA (Deterministic Finite Automata)
• Only one transition for each input symbol
• Fast and efficient
• Used by compilers
RE → NFA → DFA → Minimized DFA
⭐ 7. Recognition of Tokens using DFA
Lexical analyzer uses minimized DFA to recognize tokens.
Example:
Recognize numbers using DFA:
(start) --digit--> (final), with a self-loop on digit at (final)
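A minimal sketch of that two-state DFA for digit+ (states named here for illustration):

```python
def is_number(s):
    """Simulate the DFA: start --digit--> accept, accept loops on digit."""
    state = "start"
    for ch in s:
        if ch.isdigit():
            state = "accept"   # any digit moves us to (or keeps us in) accept
        else:
            return False       # any non-digit rejects (dead state)
    return state == "accept"   # empty input never reaches accept

print(is_number("123"))   # True
print(is_number(""))      # False
print(is_number("12a"))   # False
```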
⭐ 8. Finite Automata Construction Steps (Exam question)
1. Write regular expression
2. Convert RE to NFA (Thompson’s Construction)
3. Convert NFA to DFA (Subset Construction)
4. Minimize DFA
5. Use the DFA in lexical analyzer
⭐ 9. Lexical Errors
Examples:
• Invalid characters
• Unclosed string "hello
• Illegal identifier: 3abc
Lexical analyzer reports errors and tries error recovery.
⭐ 10. Symbol Table Management
Lexical analyzer interacts with the symbol table.
Stores:
• Identifier name
• Type
• Scope
• Memory location
Every time a new identifier appears → symbol table entry is created.
⭐ 11. Lex Tool (IMPORTANT – PYQ)
LEX is a tool that automatically generates a lexical analyzer.
Parts of Lex program:
1. Definitions section
2. Rules section
3. User functions section
How Lex works:
• You write rules using RE
• Lex generates a C program
• The C program scans input & returns tokens
⭐ Regular Expression (RE)
A Regular Expression (RE) is a formula or pattern used to describe a set of strings in a language.
It is widely used in lexical analysis, pattern matching, and token recognition in compilers.
Regular expressions describe the lexical structure of tokens such as identifiers, keywords, numbers, operators, etc.
⭐ Definition
A Regular Expression over an alphabet Σ is a pattern that represents a regular language.
Example alphabet:
Σ = {a, b}
Then REs like a*, ab+, (a|b)* describe different languages.
⭐ Basic Regular Expressions (Building Blocks)
1. Empty string (ε)
Represents a string of length 0.
2. Single character
a represents the string “a”.
3. Concatenation (XY)
If X = a, Y = b
Then XY = ab
4. Union / Alternation (X | Y)
Either X or Y.
Example:
a | b → {a, b}
5. Kleene Star (X*)
Zero or more repetitions of X.
Example:
a* → {ε, a, aa, aaa, …}
6. Kleene Plus (X+)
One or more repetitions of X.
Example:
a+ → {a, aa, aaa, …}
7. Parentheses ( )
Used to group expressions.
Example:
(a|b)*c
⭐ Examples of Regular Expressions
✔ Example 1: RE for all strings of a’s and b’s
(a | b)*
✔ Example 2: RE for identifiers
Identifiers usually start with a letter, followed by letters or digits:
letter (letter | digit)*
✔ Example 3: RE for numbers
digit+
✔ Example 4: RE for even number of a’s
b* (a b* a b*)*
✔ Example 5: RE for strings ending in "ab"
(a|b)* ab
⭐ Properties of Regular Expressions
1. Describe regular languages
2. Can be converted to finite automata (NFA, DFA)
3. Used in lexical analyzers to identify tokens
4. Simple and efficient to implement
5. Cannot represent languages requiring memory (e.g., balanced parentheses)
⭐ Applications of Regular Expressions
• Lexical analysis to identify keywords, identifiers, operators
• Search and replace (grep, sed, awk)
• Text editors for pattern matching
• Data validation (email, phone number)
• Token generation in languages like Python, C, Java
⭐ Regular Expression → NFA → DFA → Minimized DFA
This is a major process in compiler design:
Regular Expression
Thompson Construction
NFA
Subset Construction
DFA
DFA Minimization
⭐ NFA and DFA (Very Important Topic)
Finite Automata are machines used to recognize patterns in input strings.
They are used in Lexical Analysis for token recognition.
Finite automata are of two types:
1. NFA – Nondeterministic Finite Automata
2. DFA – Deterministic Finite Automata
⭐ 1. NFA (Nondeterministic Finite Automaton)
✔ Definition
An NFA is a finite automaton where:
• From a state, multiple transitions for the same input symbol may exist
• ε-moves (epsilon moves) are allowed — i.e., move without consuming input
• It is simpler to construct
• Not efficient for scanning actual input
✔ Formal Definition
An NFA is a 5-tuple (Q, Σ, δ, q₀, F)
where
• Q = set of states
• Σ = input alphabet
• δ = transition function (Q × Σ → 2^Q)
• q₀ = start state
• F = set of final states
✔ Example NFA for RE: a(b|c)*
(q0) --a--> (q1)
(q1) --b--> (q1)
(q1) --c--> (q1)
(q1 is the final state; it loops on b or c)
⭐ 2. DFA (Deterministic Finite Automaton)
✔ Definition
A DFA is a finite automaton where:
• For each state, exactly one transition exists for each input symbol
• No ε-moves
• Fast and efficient for lexical analysis
• Used by compilers
✔ Formal Definition
A DFA is also a 5-tuple (Q, Σ, δ, q₀, F)
but
δ: Q × Σ → Q (not a set)
✔ Example DFA for strings ending with ab
(q0)--a-->(q1)--b-->(q2)
(q0)--b-->(q0)
(q1)--a-->(q1)
(q2)--a-->(q1)
(q2)--b-->(q0)
(q2 is the final state)
⭐ 3. NFA vs DFA (Exam Table)
NFA                                  | DFA
ε-moves allowed                      | No ε-moves
Multiple transitions for one symbol  | Exactly one transition per symbol
Easy to construct                    | Harder to construct
Fast to design, slow to execute      | Slow to design, fast to execute
May be in many current states        | Only one current state
Used for theoretical design          | Used in actual compilers
⭐ 4. Conversion: NFA → DFA (Subset Construction Algorithm)
(Important 7 marks)
Steps:
1. Start with ε-closure(start state of NFA) → this is DFA start state
2. From each DFA state, find transitions for each input symbol
3. Take union of all resulting NFA states
4. Each unique set = one DFA state
5. Continue until no new states formed
6. Mark states that contain an NFA final state → DFA final states
⭐ 6. DFA Minimization (Short Note)
Steps:
1. Partition states into final and non-final sets
2. Split further if transitions lead to different groups
3. Continue until no more splitting
4. Each group = one minimized DFA state
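The four steps above can be sketched as partition refinement in Python. The DFA below is a made-up example over {a, b} with two equivalent final states (1 and 2), so minimization should merge them:

```python
# DFA: accepts strings ending in 'a'; state 2 duplicates state 1.
states = {0, 1, 2}
accept = {1, 2}
delta = {(0, "a"): 1, (0, "b"): 0,
         (1, "a"): 2, (1, "b"): 0,
         (2, "a"): 2, (2, "b"): 0}
alphabet = "ab"

# Step 1: split into final and non-final groups.
partition = [accept, states - accept]

changed = True
while changed:                     # Steps 2-3: refine until stable
    changed = False
    new_partition = []
    for group in partition:
        # States stay together only if every symbol sends them
        # to the same group of the current partition.
        buckets = {}
        for s in group:
            key = tuple(next(i for i, g in enumerate(partition)
                             if delta[(s, c)] in g) for c in alphabet)
            buckets.setdefault(key, set()).add(s)
        new_partition.extend(buckets.values())
        if len(buckets) > 1:
            changed = True
    partition = new_partition

# Step 4: each surviving group is one state of the minimized DFA.
print(sorted(sorted(g) for g in partition))   # [[0], [1, 2]]
```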
⭐ 7. Applications of NFA/DFA
✔ Used in Lexical Analyzer
✔ Token recognition (identifier, keyword, number)
✔ Pattern matching (grep, regex engines)
✔ Scanner generation tools (LEX)
⭐ NFA to DFA Conversion (Simple & Short)
NFA is converted to DFA using the Subset Construction Method.
✔ Steps:
1. Start State of DFA = ε-closure of NFA start state.
2. For each DFA state (which is a set of NFA states),
find transitions for each input symbol.
3. The union of all reachable NFA states becomes a new DFA state.
4. Repeat until no new states are produced.
5. Any DFA state containing an NFA final state becomes a final DFA state.
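The steps above can be sketched as a worklist loop. The NFA here is a made-up one for strings ending in "ab" (state 0 loops on a/b, 0 --a--> 1, 1 --b--> 2, final 2); it has no ε-moves, so the ε-closure of a state is just the state itself:

```python
from collections import deque

nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
nfa_start, nfa_final, alphabet = 0, {2}, "ab"

start = frozenset({nfa_start})        # Step 1 (ε-closure = itself here)
dfa, worklist = {}, deque([start])
while worklist:                       # Step 4: until no new states appear
    S = worklist.popleft()
    if S in dfa:
        continue
    dfa[S] = {}
    for c in alphabet:                # Steps 2-3: union of reachable states
        T = frozenset(t for s in S for t in nfa.get((s, c), set()))
        dfa[S][c] = T
        if T not in dfa:
            worklist.append(T)

# Step 5: subsets containing an NFA final state are DFA final states.
finals = {S for S in dfa if S & nfa_final}
print(len(dfa), len(finals))          # 3 1
```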
Context-Free Grammar (CFG)
A Context-Free Grammar is a formal grammar used to describe the syntax of programming languages and
languages in automata theory.
Definition
A CFG is a 4-tuple G = (V, T, S, P) where:
• V = set of variables (non-terminals)
• T = set of terminals (alphabet)
• S = start symbol
• P = set of production rules of the form A → α
(A is a non-terminal, α is any combination of terminals & non-terminals)
Key Points
• Every production rule must have one non-terminal on LHS.
• CFG generates Context-Free Languages (CFL).
• Accepted by Pushdown Automata (PDA).
• Used for syntax analysis in compilers.
Example of CFG
For language L = { aⁿ bⁿ | n ≥ 1 }
S → aSb
S → ab
Derivation Example
Generate “aaabbb”:
S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb
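A quick membership check for L = { aⁿ bⁿ | n ≥ 1 }, mirroring what the grammar S → aSb | ab generates (a sketch, not a general CFG parser):

```python
def in_language(s):
    """True iff s is a^n b^n with n >= 1."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

print(in_language("aaabbb"))  # True, matching the derivation above
print(in_language("aabbb"))   # False: unbalanced
print(in_language(""))        # False: n must be at least 1
```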
Parse Tree
A tree showing how the string is derived using rules.
Uses of CFG
• Designing programming language syntax
• Parsing
• Compilers
• Natural language processing
Parsing (Short, Simple Notes)
Parsing is the process of analyzing a string (program) using a grammar to check whether it is syntactically correct.
It converts source code → parse tree.
What Parsing Does
• Checks syntax of the input
• Builds a parse tree
• Helps detect syntax errors
• Used in syntax analysis phase of a compiler
Types of Parsing
Parsing is mainly of two types:
1. Top-Down Parsing
Starts from Start symbol (S) and tries to derive the input string.
Examples:
• Recursive Descent Parser
• LL(1) Parser
Characteristics:
• Expands from top (root) to bottom (leaves)
• Leftmost derivation
• Simple but struggles with left recursion
2. Bottom-Up Parsing
Starts from the input string and reduces it to the start symbol.
Examples:
• Shift-Reduce Parsing
• LR Parsing (LR, SLR, LALR)
Characteristics:
• Builds parse tree from bottom to top
• Rightmost derivation in reverse
• More powerful, used in real compilers
Parse Tree
A hierarchical tree showing how the grammar produces the input.
Root → Start Symbol
Leaves → Input Tokens
Parsing Errors
• Syntax error
• Missing symbols
• Operator precedence error
Parser tries error recovery methods like:
• Panic mode
• Phrase level recovery
Short Example
Grammar:
S → aS | b
Input: aaab
Parse:
S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaab
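The grammar S → aS | b in the example is simple enough for a two-line recursive-descent parser (the top-down style described above):

```python
def parse(s, i=0):
    """Return the index just past a valid S starting at i, or -1 on failure."""
    if i < len(s) and s[i] == "a":
        return parse(s, i + 1)        # S -> aS
    if i < len(s) and s[i] == "b":
        return i + 1                  # S -> b
    return -1

def accepts(s):
    return parse(s) == len(s)         # whole input must be consumed

print(accepts("aaab"))  # True, matching the derivation above
print(accepts("b"))     # True
print(accepts("aa"))    # False: no terminating b
```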
LL(1) Parsing (Short Notes)
What is LL(1)?
LL(1) is a Top-Down Parsing technique.
• L = Left-to-right scanning of input
• L = Leftmost derivation
• 1 = Uses 1 look-ahead symbol
LL(1) parsers are predictive parsers (no backtracking).
Conditions for LL(1) Grammar
A grammar is LL(1) if:
1. No left recursion
2. Left factoring is done
3. FIRST and FOLLOW sets do not conflict
LL(1) Parsing Steps
To create LL(1) parser:
1. Compute FIRST set
FIRST(X) = terminals that can appear first in strings derived from X.
2. Compute FOLLOW set
FOLLOW(A) = terminals that can appear immediately after A.
3. Build LL(1) Parsing Table
For each production A → α:
• Put α in Table[A, a] where a ∈ FIRST(α)
• If FIRST(α) contains ε, then use FOLLOW(A)
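Step 1 (FIRST sets) can be sketched as a fixed-point computation. The grammar below is a made-up ε-free example, and ε handling is deliberately omitted to keep the sketch short:

```python
# S -> A B ; A -> a | c ; B -> b
grammar = {"S": [["A", "B"]], "A": [["a"], ["c"]], "B": [["b"]]}

def first_sets(grammar):
    """FIRST for an ε-free grammar: iterate until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                head = prod[0]
                # terminal contributes itself; nonterminal contributes its FIRST
                add = {head} if head not in grammar else first[head]
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

print(first_sets(grammar))  # FIRST(S) = FIRST(A) = {a, c}, FIRST(B) = {b}
```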
Advantages
• No backtracking
• Fast and simple
• Good for small grammars
Disadvantages
• Cannot handle left recursion
• Cannot handle ambiguous grammars
LR Parsing (Short Notes)
LR parsing is a bottom-up parsing technique used by most real compilers.
Meaning of LR:
• L → Left-to-right scan of input
• R → Rightmost derivation in reverse
LR parsers are powerful, efficient, and handle larger grammars than LL(1).
Why LR Parsing?
• Detects syntax errors early
• Handles left recursion
• More powerful than LL parsers
• No backtracking
Types of LR Parsers
1. LR(0)
2. SLR(1) (Simple LR)
3. LALR(1) (Look-Ahead LR)
4. Canonical LR(1)
Main Components
LR parser uses two tables:
1. ACTION table
Tells parser to:
• Shift (read symbol)
• Reduce (apply a production)
• Accept
• Error
2. GOTO table
Tells which state to go after reduction.
LR Parsing Steps
1. Read input left to right
2. Use stack to store states
3. Use ACTION table to decide:
o Shift → push symbol & state
o Reduce → replace RHS with LHS
4. Use GOTO table after reduction
5. Continue until Accept
Example Grammar
S → aA
A → b
(Not full table here—just concept)
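Without building the full ACTION/GOTO tables, the shift-reduce idea for this grammar can still be sketched with a stack (hand-written decisions, not a generated LR table):

```python
def parse(tokens):
    """Toy shift-reduce recognizer for S -> aA, A -> b."""
    stack = []
    for tok in tokens + ["$"]:        # $ marks end of input
        if tok in ("a", "b"):
            stack.append(tok)          # shift
        while True:                    # reduce while a handle is on top
            if stack[-1:] == ["b"]:
                stack[-1:] = ["A"]     # reduce A -> b
            elif stack[-2:] == ["a", "A"]:
                stack[-2:] = ["S"]     # reduce S -> aA
            else:
                break
    return stack == ["S"]              # accept iff only S remains

print(parse(list("ab")))   # True
print(parse(list("aa")))   # False
```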
Conflicts in LR Parsing
1. Shift-Reduce Conflict
2. Reduce-Reduce Conflict
If a grammar has no conflicts → It is LR.
Advantages
• Very powerful
• Detects errors earlier
• Can parse almost all programming languages
Disadvantages
• Tables are large
• Hard to construct manually
⭐ Types of LR Parsers (Short Notes)
There are 4 types of LR parsers:
1. LR(0)
2. SLR(1) (Simple LR)
3. LR(1) (Canonical LR)
4. LALR(1) (Look Ahead LR)
1. LR(0) Parser
• Simplest LR parser
• Uses LR(0) items (no lookahead)
• Smallest parsing tables
• Accepts very few grammars
• Often has shift-reduce conflicts
✅ Easy to construct
❌ Very weak power
2. SLR(1) Parser (Simple LR)
• Uses FOLLOW sets to reduce conflicts
• More powerful than LR(0)
• Still simple and table size small
• Accepts more grammars than LR(0)
✅ Easy & widely used
❌ Still may have conflicts
3. Canonical LR(1) Parser
• Uses LR(1) items (item + 1 lookahead symbol)
• Most powerful LR parser
• Almost all programming language grammars can be parsed
• But table size is very large
✅ Most powerful
❌ Very large tables
4. LALR(1) Parser (Look-Ahead LR)
• Combines states with same LR(0) core
• Table size small like SLR
• Power almost equal to LR(1)
• Used in many compilers (e.g., YACC)
✅ Good balance: small table + strong power
❌ Slightly weaker than full LR(1)
⭐ Difference Between SLR, LALR, and CLR
1. SLR(1) (Simple LR)
• Uses FOLLOW sets for lookahead
• Based on LR(0) items
• Parsing table is small
• Less powerful (more conflicts)
• Easier to construct
2. LALR(1) (Look-Ahead LR)
• Uses LR(1) items, but merges states with same LR(0) core
• Parsing table is medium-sized
• More powerful than SLR
• Used in real compilers (like YACC)
3. CLR / LR(1) (Canonical LR)
• Uses full LR(1) items (item + 1 lookahead)
• Parsing table is very large
• Most powerful LR parser
• Almost no conflicts
• Hardest to construct
⭐ Precedence Parsing (Short Notes)
Precedence parsing is a bottom-up parsing technique used for parsing expressions using operator precedence (like
+, -, *, /, ^).
It works by using precedence relations between terminals.
⭐ Types of Precedence Parsing
1. Operator Precedence Parsing
2. Priority/Precedence Grammar Parsing
⭐ When We Use Precedence Parsing?
• Used for arithmetic expressions
• Grammar must be operator precedence grammar
• No ambiguous or left-recursive rules
• No ε-productions (the grammar must be ε-free)
⭐ Precedence Relations
Parser uses three relations between terminals:
1. a <· b (a yields precedence to b)
2. a =· b (a has equal precedence with b)
3. a ·> b (a takes precedence over b)
These relations guide shifting or reducing.
⭐ Operator Precedence Table
It uses a precedence matrix/table to show these relations.
Example:
      +     *     id    $
+     ·>    <·    <·    ·>
*     ·>    ·>    <·    ·>
id    ·>    ·>          ·>
$     <·    <·    <·
(blank entries mean error)
⭐ How Precedence Parsing Works (Simple Steps)
1. Scan input from left to right
2. Use precedence table to decide:
o Shift if a <· b or =· b
o Reduce if a ·> b
3. Keep performing shift/reduce until input ends
4. Accept if reduced to start symbol
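The shift/reduce loop above can be sketched in Python, driven by the relations from the table (the handle-popping logic is simplified to the expression grammar E → E + E | E * E | id, an illustrative choice):

```python
prec = {
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
    ("+", "id"): "<", ("+", "*"): "<", ("+", "+"): ">", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
}

def parse(tokens):
    """Return the list of shift/reduce actions, or None on error."""
    stack, actions, i = ["$"], [], 0
    tokens = tokens + ["$"]
    while not (stack == ["$", "E"] and tokens[i] == "$"):
        top = next(s for s in reversed(stack) if s != "E")  # topmost terminal
        rel = prec.get((top, tokens[i]))
        if rel == "<":                      # a <. b : shift
            stack.append(tokens[i]); i += 1
            actions.append("shift")
        elif rel == ">":                    # a .> b : reduce the handle
            if stack[-1] == "id":
                stack[-1:] = ["E"]          # E -> id
            else:
                stack[-3:] = ["E"]          # E -> E + E  or  E -> E * E
            actions.append("reduce")
        else:
            return None                     # no relation: syntax error
    return actions

acts = parse(["id", "+", "id", "*", "id"])
print(acts.count("shift"), acts.count("reduce"))   # 5 5
```

For id + id * id the parser shifts past the * before reducing, so the multiplication is reduced first, exactly as the precedence relations dictate.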
⭐ Advantages
• Good for parsing expressions
• Simple implementation
• No backtracking needed
⭐ Disadvantages
• Works only for restricted grammars
• Cannot handle general programming language syntax
• No ε-productions permitted
⭐ TYPES OF PRECEDENCE PARSING
Precedence parsing mainly has two types:
1. Operator Precedence Parsing
2. Weak Precedence Parsing (or Simple Precedence Parsing)
Some books also mention Strong Precedence Parsing, but the main two used in exams are the ones above.
⭐ 1. Operator Precedence Parsing
This is the most common precedence parsing technique.
Key Features:
• Works only for Operator Precedence Grammars
• Uses operator precedence table
• Uses three relations:
o <· (yields precedence)
o =· (same precedence)
o ·> (takes precedence)
How it Works:
• If current operator has higher precedence, shift
• If lower precedence, reduce
Use Case:
• Arithmetic expressions like:
id + id * id
Advantages:
• Simple and fast
• Good for expression evaluation
Disadvantages:
• Grammar must have no left recursion, no ambiguity, no ε-productions
⭐ 2. Weak (Simple) Precedence Parsing
This is a generalization of operator precedence parsing.
Key Features:
• Uses precedence functions (f and g functions)
• Works for slightly more grammars than operator precedence
• No need for =· relation
• Reduces using weakest precedence
How it Works:
• Assign numerical precedence to terminals using two functions:
o f(a) = precedence of symbol on stack
o g(a) = precedence of input symbol
• If f(a) < g(b) → Shift
• If f(a) > g(b) → Reduce
Use Case:
• Parsing expressions with well-defined precedence
• Slightly more powerful than operator precedence
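The f/g decision rule above is easy to sketch numerically. The values below are the usual textbook precedence functions for +, *, id, $ and are illustrative only:

```python
f = {"+": 2, "*": 4, "id": 4, "$": 0}   # precedence of symbol on stack
g = {"+": 1, "*": 3, "id": 5, "$": 0}   # precedence of input symbol

def decision(stack_top, lookahead):
    """Shift when f(stack_top) < g(lookahead), otherwise reduce."""
    return "shift" if f[stack_top] < g[lookahead] else "reduce"

print(decision("+", "*"))   # shift:  f(+)=2 < g(*)=3
print(decision("*", "+"))   # reduce: f(*)=4 > g(+)=1
```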
⭐ 3. (Extra) Strong Precedence Parsing (if needed)
Some books mention this for theory only.
Features:
• Stronger conditions
• More powerful than weak precedence
• Still not suitable for general programming languages
⭐ Syntax-Directed Definition (SDD)
A Syntax-Directed Definition is a formal way to attach meaning, actions, or computations to a grammar.
It is used in the syntax analysis + semantic analysis phase of a compiler.
In SDD:
• Each grammar production has semantic rules
• Each grammar symbol has attributes
⭐ Definition (Exam Line)
A Syntax-Directed Definition (SDD) is a context-free grammar where each grammar symbol is associated with
attributes and each production is associated with semantic rules that define how to compute those attributes.
⭐ Components of SDD
1. Attributes
o Values or information attached to grammar symbols
o Example: type, value, place, code
2. Semantic Rules
o Specify how attributes are computed
o Attached to grammar productions
⭐ Types of Attributes
1. Synthesized Attributes
o Computed from children nodes
o Flow bottom-up
2. Inherited Attributes
o Obtained from parent or siblings
o Flow top-down or sideways
⭐ Example of SDD
Grammar:
E → E1 + T
E → T
T → digit
Semantic rules:
E.val = E1.val + T.val
E.val = T.val
Here, E.val and T.val are synthesized attributes.
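The semantic rules above compute val bottom-up, which can be sketched as a post-order walk of the parse tree (tree shape chosen here for illustration):

```python
def eval_node(node):
    """node is either an int leaf (T -> digit) or ('+', left, right)."""
    if isinstance(node, int):
        return node                               # T.val = digit
    op, left, right = node
    return eval_node(left) + eval_node(right)     # E.val = E1.val + T.val

# Parse tree for 3 + 4 + 5 (left-associative):
tree = ("+", ("+", 3, 4), 5)
print(eval_node(tree))   # 12
```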
⭐ Uses of SDD
• Type checking
• Expression evaluation
• Intermediate code generation
• Building syntax trees
• Checking declarations and scope rules
⭐ Attributes (Short Notes)
Attributes are values or information associated with grammar symbols in a Syntax-Directed Definition (SDD).
They help to add meaning to a grammar, beyond just syntax.
⭐ Example with SDD
Grammar:
E → E1 + T
E → T
T → digit
Attributes:
• E.val, T.val
Semantic rules:
E.val = E1.val + T.val
E.val = T.val
⭐ Syntax Tree (Parse Tree / Abstract Syntax Tree)
A Syntax Tree is a tree representation of the structure of a program according to a grammar.
It shows how an expression or statement is derived from the grammar rules.
⭐ Definition (Short Answer)
A Syntax Tree is a hierarchical tree structure that represents the syntactic structure of a source program.
Each internal node is an operator, and each leaf node is an operand or identifier.
⭐ Characteristics
• Shows the operator at the parent node.
• Shows operands as child nodes.
• Removes unnecessary grammar details (more compact than parse tree).
• Used for syntax analysis, type checking, and intermediate code generation.
⭐ Example: Syntax Tree for a + b * c
    +
   / \
  a   *
     / \
    b   c
Explanation:
• * is evaluated before +, so * becomes a child of +.
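The same tree can be built by hand with a small node class (a sketch; the s-expression printing is just for readability):

```python
class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

    def __repr__(self):
        if self.left is None:
            return self.label                     # leaf: operand
        return f"({self.label} {self.left!r} {self.right!r})"

# a + b * c: * binds tighter, so it becomes the right child of +.
tree = Node("+", Node("a"), Node("*", Node("b"), Node("c")))
print(tree)   # (+ a (* b c))
```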
⭐ Uses of Syntax Tree
• Type checking
• Intermediate code generation
• Optimization
• Semantic analysis
• Detecting operator precedence and associativity
⭐ 1. Activation Record (Short Notes)
An Activation Record (AR) or Stack Frame is a memory block created on the runtime stack when a
function/procedure is called.
It stores all the information needed to manage a function call.
⭐ Contents of an Activation Record
Field            | Purpose
Return Address   | Where to return after function ends
Parameters       | Values passed to function
Local Variables  | Variables declared inside function
Temporary Values | Intermediate computation values
Control Link     | Pointer to caller's activation record
Access Link      | Used to access non-local variables (static scoping)
Saved Registers  | Register values to restore later
⭐ Diagram (Simple)
--------------------
| Temporaries |
| Local Variables |
| Parameters |
| Return Address |
| Control Link |
--------------------
⭐ 2. Static and Dynamic Scoping
✔ Static Scoping (Lexical Scoping)
• Scope is determined at compile time.
• Compiler uses the program text to find the variable’s declaration.
• Uses static chain / access link to access non-local variables.
• Most languages: C, C++, Java, Python use static scoping.
Example:
a = 5
function f():
    print(a)    → prints 5 (declaration found at compile time)
✔ Dynamic Scoping
• Scope determined at runtime, based on calling sequence.
• Uses dynamic chain to find variables.
• Value depends on which function called which.
Example:
a = 5
function f():
    print(a)
function g():
    a = 10
    f()
Call g() → output is 10 (variable found dynamically in the caller).
⭐ Differences (Exam Table)
Static Scoping             | Dynamic Scoping
Determined at compile time | Determined at runtime
Uses static chain          | Uses dynamic chain
Predictable                | Hard to predict
Based on program structure | Based on calling order
⭐ 3. Parameter Passing Techniques (Short Notes)
✔ 1. Call by Value
• Actual value is copied to the function.
• Changes inside function do not affect the caller.
✔ 2. Call by Reference
• Address of actual parameter is passed.
• Changes inside function affect original variable.
✔ 3. Call by Value-Result (Copy-In Copy-Out)
• Value copied to function at start & copied back to caller at end.
✔ 4. Call by Name
• Parameter is substituted like macro expansion.
• Expression evaluated each time it is used.
✔ 5. Call by Need
• Like call-by-name, but evaluates the expression once (lazy evaluation).
✔ 6. Call by Result
• No initial value; output is copied back after function ends.
⭐ 1. DAG (Directed Acyclic Graph)
A DAG is used in Intermediate Code Generation to represent expressions and detect common subexpressions.
⭐ Definition (Short)
A DAG is a graph with no cycles where:
• Leaves = operands (constants, variables)
• Interior nodes = operators
• If two subexpressions are the same → they share the same node (removes duplicates)
⭐ Use of DAG
• Eliminates common subexpressions
• Helps in code optimization
• Detects dead code
• Determines evaluation order
⭐ Example
Expression:
a=b+c
d=b+c
DAG:
  a   d
   \ /
    +
   / \
  b   c
Both a and d point to the same b+c node → duplicate removed.
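The node sharing shown above is usually implemented with value numbering: identical (op, left, right) keys map to one node. A minimal sketch (names and representation are illustrative):

```python
nodes, index = [], {}

def get_node(key):
    """Return the id of an existing node with this key, or create one."""
    if key not in index:
        index[key] = len(nodes)
        nodes.append(key)
    return index[key]

def leaf(name):
    return get_node(("leaf", name, None))

n = get_node(("+", leaf("b"), leaf("c")))   # for a = b + c
m = get_node(("+", leaf("b"), leaf("c")))   # for d = b + c
print(n == m, len(nodes))   # True 3: leaves b, c, and one shared + node
```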
⭐ 2. Quadruple Representation
Quadruple = 4 fields:
| op | arg1 | arg2 | result |
Used in Three-Address Code (TAC).
⭐ Example
Expression:
x=a+b*c
Step 1: t1 = b * c
Step 2: x = a + t1
Quadruples:
op   arg1   arg2   result
*    b      c      t1
+    a      t1     x
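Emitting the quadruples above can be sketched with a temporary-name generator (helper names like new_temp are our own, not a standard API):

```python
quads, temp_count = [], 0

def new_temp():
    """Generate a fresh temporary name: t1, t2, ..."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def emit(op, arg1, arg2, result):
    quads.append((op, arg1, arg2, result))

# x = a + b * c
t1 = new_temp()
emit("*", "b", "c", t1)       # t1 = b * c
emit("+", "a", t1, "x")       # x  = a + t1

for q in quads:
    print(q)
```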
⭐ 3. Triple Representation
Triple = 3 fields:
| op | arg1 | arg2 |
Results are not named — they use index numbers.
⭐ Example
For the same expression:
Step 1: b*c → (0)
Step 2: a + (0) → (1)
Triples:
Index   op   arg1   arg2
(0)     *    b      c
(1)     +    a      (0)
⭐ Peephole Optimization
✔ Definition (Short Answer)
Peephole Optimization is a local optimization technique that improves code by looking at a small window (peephole)
of instructions and replacing inefficient code with more efficient code.
The compiler examines 2–5 instructions at a time and optimizes them.
⭐ Main Types of Peephole Optimizations
1. Constant Folding
Evaluate constant expressions at compile time.
Example:
t = 3 * 4 → t = 12
2. Strength Reduction
Replace expensive operations with cheaper ones.
Example:
x = y * 2 → x = y << 1
3. Algebraic Simplification
Use algebraic identities to simplify code.
Example:
x = x + 0 → remove
y = y * 1 → remove
4. Redundant Load/Store Elimination
Remove repeated loads/stores.
Example:
LOAD A
LOAD A → Remove second LOAD
5. Unreachable Code Removal
Remove instructions that are never executed.
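Two of the optimizations above (algebraic simplification of x = x + 0 and constant folding) can be sketched as one pass over a list of three-address-style instructions; the tuple encoding is an assumption for illustration:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def peephole(code):
    """code: list of (dest, op, arg1, arg2) instructions."""
    out = []
    for dest, op, a, b in code:
        if op == "+" and b == 0 and a == dest:
            continue                                    # x = x + 0 -> removed
        if isinstance(a, int) and isinstance(b, int):
            out.append((dest, "=", OPS[op](a, b), None))  # constant folding
            continue
        out.append((dest, op, a, b))
    return out

code = [("t", "*", 3, 4), ("x", "+", "x", 0), ("y", "+", "a", "b")]
print(peephole(code))   # t = 12; the x = x + 0 line is gone
```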
⭐ Advantages
• Simple and fast
• Works with final machine code
• Removes common inefficiencies
• Improves performance and reduces code size
⭐ Basic Block
✔ Definition (Short Answer)
A Basic Block is a sequence of consecutive statements in a program that:
• Always executes from start to end without any jump (branch) inside.
• Has only one entry point and one exit point.
⭐ Characteristics of a Basic Block
1. Single Entry
Control enters at the beginning only.
2. Single Exit
Control leaves at the end only.
3. No branching inside
No jump or label appears in the middle.
4. Straight-line code
Instructions execute sequentially.
⭐ Uses of Basic Block
• Code optimization
• Control flow analysis
• DAG construction
• Local optimization (peephole, constant folding)
⭐ How to Identify Basic Blocks?
Find leaders first:
A statement is a leader if:
1. The first statement is always a leader.
2. Any statement that is the target of a jump.
3. Any statement immediately after a jump.
After identifying leaders → form blocks until the next leader.
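The three leader rules above can be sketched directly; instructions here are (text, jump_target) pairs with 0-based indices, an encoding chosen for illustration:

```python
def find_leaders(code):
    leaders = {0}                          # rule 1: first statement
    for i, (_, target) in enumerate(code):
        if target is not None:
            leaders.add(target)            # rule 2: target of a jump
            if i + 1 < len(code):
                leaders.add(i + 1)         # rule 3: statement after a jump
    return sorted(leaders)

code = [("a = b + c", None),        # 0
        ("if a > 10 goto 4", 4),    # 1
        ("d = a - 5", None),        # 2
        ("goto 5", 5),              # 3
        ("e = a * 2", None),        # 4
        ("print e", None)]          # 5
print(find_leaders(code))   # [0, 2, 4, 5]
```

Each basic block then runs from one leader up to (but not including) the next.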
⭐ Loop Optimization
✔ Definition (Short Answer)
Loop optimization refers to techniques used by the compiler to improve the performance of loops, because loops
execute many times and consume a large portion of program time.
⭐ Types of Loop Optimization (Very Important)
1. Loop Invariant Code Motion
• Move statements that do not change inside a loop to outside the loop.
• Reduces unnecessary repeated computations.
Example:
Before:
for i:
    x = a + b    ← same every iteration
    y = x + i
After:
x = a + b        ← moved out
for i:
    y = x + i
2. Loop Unrolling
• Expands loop body to reduce overhead of increment, comparison, and jump.
• Makes code faster by reducing loop control instructions.
Example:
Before:
for i = 1 to 4:
A[i] = A[i] + 1
After unrolling:
A[1] = A[1] + 1
A[2] = A[2] + 1
A[3] = A[3] + 1
A[4] = A[4] + 1
3. Loop Fusion (Loop Jamming)
• Combine two adjacent loops that iterate over the same range.
Before:
for i: A[i]=0
for i: B[i]=1
After:
for i:
    A[i] = 0
    B[i] = 1
4. Loop Fission (Loop Splitting)
• Split one loop into multiple loops to improve cache locality or parallelism.
5. Loop Strength Reduction
• Replace expensive operations in loops with cheaper ones.
Example:
x = i * 2 → replaced by x = x + 2
6. Loop Interchange
• Swap nested loops to improve memory access efficiency.
Example:
for i:
for j:
A[i][j]
→ interchange loops if beneficial for cache.
⭐ Why Loop Optimization is Important?
• Loops run many times → optimizing them gives major performance gain
• Reduces execution time
• Reduces unnecessary calculations
• Improves cache performance
⭐ Flow Graph (Control Flow Graph – CFG)
✔ Definition (Short Answer)
A Flow Graph is a directed graph that represents the flow of control in a program.
Each node represents a basic block, and each edge represents a possible flow of execution from one block to
another.
⭐ Components of a Flow Graph
1. Node / Vertex
o Represents a basic block (straight-line code).
2. Edge / Arrow
o Shows the possible control flow (jump, branch, next statement).
3. Entry Node
o The starting block.
4. Exit Node
o The ending block.
⭐ Example Program
1. a = b + c
2. if a > 10 goto L1
3. d = a - 5
4. goto L2
L1: e = a * 2
L2: print(e)
Step 1: Identify Basic Blocks
B1: statement 1
B2: statement 2
B3: statements 3 and 4
B4: L1 (e = a * 2)
B5: L2 (print(e))
⭐ Flow Graph for the Program
B1 → B2
B2 → B4   (branch taken: a > 10, jump to L1)
B2 → B3   (branch not taken: fall through)
B3 → B5   (goto L2)
B4 → B5
⭐ Purpose of Flow Graph
✔ Used in code optimization
✔ Used in loop detection
✔ Used in control flow analysis
✔ Shows branching, merging paths
✔ Used to find dominators, natural loops, etc.
⭐ 1. Code Generation (Short Notes)
Code Generation is the final phase of the compiler where intermediate code (TAC) is converted into machine code
or assembly code.
✔ Tasks in Code Generation
1. Instruction selection
2. Register allocation
3. Addressing modes selection
4. Handling jumps, labels, procedures
5. Generating machine instructions
✔ Requirements of a Good Code Generator
• Generated code must be correct
• Efficient (fast, uses fewer registers)
• Uses less memory
• Should match architecture (RISC/CISC)
✔ Inputs to Code Generator
• Intermediate code
• Symbol table
• Type information
• Register information
✔ Output
• Target machine code / assembly code.
⭐ 2. Register Allocation (Short Notes)
Register Allocation decides which program variables or temporaries should be stored in registers instead of
memory.
Registers are limited → good allocation makes code faster.
✔ Two Steps:
1. Register Allocation
• Assigns variables to registers.
2. Register Assignment
• Decides which register stores which variable.
✔ Goal of Register Allocation
• Reduce load/store instructions
• Improve execution speed
• Minimize register spills (values stored to memory when registers are full)
✔ Methods
• Graph coloring (most common)
• Linear scan allocation
• Top-down allocation
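A greedy sketch of the graph-coloring idea (heavily simplified; real allocators use Chaitin-style simplification, coalescing, and spill-cost heuristics). Variables that are live at the same time interfere and must receive different registers:

```python
def color(interference, k):
    """interference: var -> set of conflicting vars; k = register count."""
    assignment = {}
    for v in sorted(interference):         # fixed order for determinism
        used = {assignment[u] for u in interference[v] if u in assignment}
        free = [r for r in range(k) if r not in used]
        # no free register left -> the value spills to memory
        assignment[v] = free[0] if free else "spill"
    return assignment

# a, b, c all interfere pairwise; d interferes with nothing.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
print(color(graph, 3))   # {'a': 0, 'b': 1, 'c': 2, 'd': 0}
```

With only two registers the same graph forces a spill, which is exactly what the allocator tries to minimize.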
⭐ 3. Instruction Selection (Short Notes)
Instruction Selection chooses the best machine instructions to implement the intermediate code.
✔ Goals
• Generate efficient target instructions
• Minimize number of machine instructions
• Use powerful instructions if architecture supports them
✔ Factors Considered
1. Target machine architecture
(RISC or CISC instruction set)
2. Addressing modes
3. Cost of instructions
4. Operator precedence and expression tree
✔ Example
Intermediate Code:
t1 = a + b
t2 = t1 * c
Possible machine code:
ADD R1, a, b
MUL R2, R1, c
If machine supports a combined instruction:
MADD R2, a, b, c ; Multiply-add
→ Instruction selection chooses the better one.