UNIT I: INTRODUCTION TO COMPILERS (9 Hours)
● Structure of a Compiler
● Lexical Analysis
● Role of Lexical Analyzer
● Input Buffering
● Specification of Tokens
● Recognition of Tokens
● Lex
● Finite Automata
● Regular Expressions to Automata
● Minimizing Deterministic Finite Automaton (DFA)
📘 Structure of a Compiler
A compiler is divided into major components, broadly categorized as the Front End, Middle
End, and Back End. Each part handles a specific stage of converting source code into target
machine code.
🔹 1. High-Level Overview
Source Program
↓
[Lexical Analyzer]
↓
[Syntax Analyzer]
↓
[Semantic Analyzer]
↓
[Intermediate Code Generator]
↓
[Code Optimizer]
↓
[Code Generator]
↓
Target Program
🔹 2. Phases of a Compiler
| Phase | Function |
|---|---|
| 1. Lexical Analysis | Converts source code into tokens (the "words" of the programming language) |
| 2. Syntax Analysis | Parses tokens into a parse tree (checks syntax/grammar) |
| 3. Semantic Analysis | Ensures meaning is correct (e.g., type checking, undeclared variables) |
| 4. Intermediate Code Generation | Produces an intermediate representation (IR), close to machine code |
| 5. Code Optimization | Improves the IR for performance/efficiency (removes dead code, etc.) |
| 6. Code Generation | Generates machine/assembly code |
| 7. Symbol Table Management | Keeps track of identifiers and their attributes |
| 8. Error Handling | Detects and reports errors in each phase |
🔹 3. Compiler Components (Detailed View)
📌 1. Lexical Analyzer (Scanner)
● Reads source code and breaks it into tokens
● Removes whitespace and comments
● Uses Regular Expressions and Finite Automata
● Example: int x = 10; → int, id(x), =, num(10), ;
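As a rough sketch of what that token stream looks like in code (the enum values, struct fields, and sizes below are illustrative assumptions, not part of any particular compiler), the scanner's output for int x = 10; can be modelled in C as:
/* Hypothetical token kinds and token record for the example above. */
typedef enum { TOK_INT, TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_SEMI } TokenType;

typedef struct {
    TokenType type;        /* token category                    */
    char      lexeme[32];  /* matched text, e.g. "x" or "10"    */
} Token;

/* Token stream for "int x = 10;":
 *   {TOK_INT,"int"} {TOK_ID,"x"} {TOK_ASSIGN,"="} {TOK_NUM,"10"} {TOK_SEMI,";"} */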
📌 2. Syntax Analyzer (Parser)
● Builds a parse tree / syntax tree using CFG (Context-Free Grammar)
● Checks for grammar violations
● Example: Ensures parentheses are balanced, correct statement structure
📌 3. Semantic Analyzer
● Checks for semantic consistency
● Type checking (e.g., assigning float to int), undeclared variables
● Builds symbol table and does type inference
📌 4. Intermediate Code Generator
● Translates AST into Intermediate Representation (IR)
● Example: Three-address code, postfix notation
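As a minimal sketch of such a translation (illustrative only; the temporaries t1 and t2 are assumed names, and a real IR generator works from the AST rather than C source), the comments show one possible three-address code for a single assignment:
/* Source statement and, in comments, one possible three-address
 * code (TAC) translation an intermediate code generator might emit. */
int lower_example(int a, int b, int c) {
    int x = a + b * c;   /* TAC:  t1 = b * c   */
                         /*       t2 = a + t1  */
                         /*       x  = t2      */
    return x;
}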
📌 5. Code Optimizer
● Improves the IR without changing meaning
● Removes unnecessary code, simplifies expressions
● Example: x = x + 0 → removed
📌 6. Code Generator
● Converts IR into target machine code or assembly
● Allocates registers, memory
📌 7. Symbol Table
● Central data structure used throughout compilation
● Stores identifiers, types, scope, and location info
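A minimal sketch of one symbol-table entry in C (the field set and sizes are assumptions for illustration; production compilers store many more attributes):
/* Hypothetical symbol-table entry, chained for a hash-table bucket. */
typedef struct Symbol {
    char           name[64];  /* identifier as written in the source   */
    char           type[16];  /* e.g. "int", "float"                   */
    int            scope;     /* nesting level where it was declared   */
    int            offset;    /* location info, e.g. frame offset      */
    struct Symbol *next;      /* next entry in the same bucket         */
} Symbol;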
📌 8. Error Handler
● Detects, reports, and recovers from errors
● Phases may perform error recovery or stop compilation
🔹 4. Compiler Structure – Block Diagram
[Source Code]
↓
┌────────────────────────────┐
│ Lexical Analyzer │
└────────────────────────────┘
↓ Tokens
┌────────────────────────────┐
│ Syntax Analyzer │
└────────────────────────────┘
↓ Parse Tree
┌────────────────────────────┐
│ Semantic Analyzer │
└────────────────────────────┘
↓ Annotated Tree
┌────────────────────────────┐
│ Intermediate Code Generator│
└────────────────────────────┘
↓ IR Code
┌────────────────────────────┐
│ Code Optimizer │
└────────────────────────────┘
↓ Optimized IR
┌────────────────────────────┐
│ Code Generator │
└────────────────────────────┘
↓ Target Code
Side Modules:
● 🧾 Symbol Table
● ⚠️ Error Handling Unit
📝 Summary
● The compiler is divided into phases to simplify the complex process of translation.
● Each phase has a specific task and passes its output to the next phase.
● Tools like Lex and YACC automate Lexical and Syntax Analysis.
📘 LEXICAL ANALYSIS
Lexical analysis is the first phase of a compiler, where the source code is read and converted
into tokens.
🔹 1. Role of Lexical Analyzer
The Lexical Analyzer, also called a scanner, performs the following:
| Task | Explanation |
|---|---|
| Reads the source program | Character by character |
| Groups characters into tokens | Tokens are strings with meaning (e.g., keywords, identifiers, literals, operators) |
| Removes whitespace/comments | These are not passed to further stages |
| Provides the token stream to the parser | The parser uses these tokens for syntax analysis |
| Maintains the symbol table | Keeps track of variable/function names and their attributes |
| Performs error detection | Detects invalid tokens |
🔹 2. Tokens, Lexemes, and Patterns
| Term | Definition |
|---|---|
| Token | A category of lexical unit (e.g., if, +, id, num) |
| Lexeme | The actual string in the source code that matches a pattern (e.g., count, x) |
| Pattern | A rule describing the structure of lexemes, usually written as a regular expression |
🧠 Example:
int count = 10;
| Lexeme | Token | Pattern |
|---|---|---|
| int | keyword | int |
| count | id | [a-zA-Z_][a-zA-Z0-9_]* |
| = | assign | = |
| 10 | num | [0-9]+ |
| ; | semicolon | ; |
🔹 3. Specification of Tokens
Tokens are specified using Regular Expressions (RE).
| Token Type | Regular Expression Example |
|---|---|
| Identifier | [a-zA-Z_][a-zA-Z0-9_]* |
| Number | [0-9]+(\.[0-9]+)? |
| Keyword | if, else, while, etc. |
| Operators | \+, \-, \*, \/, ==, !=, etc. |
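As a small hedged illustration of testing such a pattern outside a lexer generator, the sketch below uses the POSIX <regex.h> API to check the Identifier expression from the table (is_identifier is a made-up helper name):
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string s matches the identifier pattern. */
int is_identifier(const char *s) {
    regex_t re;
    /* Anchored version of the table's identifier RE. */
    if (regcomp(&re, "^[a-zA-Z_][a-zA-Z0-9_]*$", REG_EXTENDED) != 0)
        return 0;                                 /* compilation failed */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    printf("%d %d\n", is_identifier("count1"), is_identifier("9lives"));  /* prints: 1 0 */
    return 0;
}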
🔹 4. Input Buffering (Brief)
To speed up reading the input, lexical analyzers use buffer pairs with lookahead pointers.
● Buffer Pair: Two halves – while one is being processed, the other is filled.
● Lexeme Begin & Forward Pointers:
○ lexeme_begin: points to start of current lexeme
○ forward: moves ahead until a match is found
(Full explanation under Input Buffering topic.)
🔹 5. Finite Automata (FA) for Lexical Analysis
Used to recognize tokens defined by regular expressions.
● NFA (Nondeterministic FA): Easy to construct from RE
● DFA (Deterministic FA): Faster to simulate; used in practice
● Conversion: RE → NFA → DFA → Minimized DFA
🔹 6. Lexical Errors and Recovery
Common lexical errors:
● Invalid characters (e.g., @, #)
● Unrecognized identifiers
● Unterminated strings
Recovery techniques:
● Panic mode: Skip until a delimiter (e.g., ;) is found
● Error messages: Inform about the issue and continue
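A minimal sketch of panic-mode recovery at the lexical level (the choice of ; as the delimiter and the helper name are assumptions for illustration):
#include <stdio.h>

/* Panic mode: on an invalid character, skip input until a delimiter
 * such as ';' (or end of string) and resume scanning from there.    */
static const char *skip_to_delimiter(const char *p) {
    while (*p != '\0' && *p != ';')
        p++;
    return p;                 /* points at ';' or the terminating '\0' */
}

int main(void) {
    const char *after = skip_to_delimiter("@#bad token; x = 1;");
    printf("resume at: \"%s\"\n", after);   /* resume at: "; x = 1;" */
    return 0;
}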
🔹 7. Lexical Analyzer vs Parser
| Aspect | Lexical Analyzer | Syntax Analyzer (Parser) |
|---|---|---|
| Input | Source code (characters) | Token stream from the lexer |
| Output | Tokens | Parse tree or syntax tree |
| Checks | Spelling and structure of tokens | Grammar and syntax of the language |
| Example task | Identify count as an identifier | Ensure if (x) is syntactically valid |
🔹 8. Lex / Flex Tool (Intro)
● A tool to generate lexical analyzers
● Programmer writes token patterns (in RE)
● Lex tool converts them to C code
%% <-- Section separator
[a-z]+ printf("Identifier");
[0-9]+ printf("Number");
%% <-- Section separator
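A slightly fuller, self-contained Flex specification along the same lines (the decision to simply print each token, and the messages themselves, are illustrative assumptions):
%option noyywrap
%{
#include <stdio.h>
%}
%%
[ \t\n]+                 { /* skip whitespace */ }
[0-9]+                   { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("IDENTIFIER: %s\n", yytext); }
.                        { printf("UNKNOWN: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
It can be built with, for example, flex scanner.l && cc lex.yy.c -o scanner (hypothetical file names).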
📘 INPUT BUFFERING
Input buffering is used in lexical analysis to read source code efficiently, especially when
tokens need lookahead (i.e., peeking ahead without consuming characters immediately).
🔹 1. Why Input Buffering Is Needed
● Lexical analyzers process one character at a time.
● Some tokens require multiple characters to decide (e.g., ==, >=, <=, !=).
● Reading character-by-character from disk or file is slow.
● Buffering helps read larger blocks at once to optimize performance.
🔹 2. Buffer Pair Scheme
To manage input efficiently, compilers use a two-buffer scheme (also called buffer pair):
+----------------------+----------------------+
| BUFFER 1 | BUFFER 2 |
+----------------------+----------------------+
^ ^
lexeme_begin forward
● Each buffer holds n characters.
● When forward reaches the end of BUFFER 1, BUFFER 2 is filled.
● The two buffer halves are used in a circular (alternating) fashion.
🔹 3. Pointers in Buffering
| Pointer | Purpose |
|---|---|
| lexeme_begin | Points to the start of the current lexeme |
| forward | Moves forward to find the end of the lexeme |
✅ Once the token is found:
→ The characters between lexeme_begin and forward are returned as the lexeme.
→ lexeme_begin is reset to the new token's start.
🔹 4. Special Symbol: Sentinel
To detect the end of a buffer half, we place a sentinel character after its last valid character (typically eof or a special marker such as '\0' that cannot appear in the source program).
This avoids an explicit end-of-buffer check on every character read during scanning.
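A minimal C sketch combining the buffer pair and the sentinel (the buffer size N, the choice of '\0' as sentinel, and the function names are assumptions made for this illustration):
#include <stdio.h>

#define N        4096    /* characters per buffer half                    */
#define SENTINEL '\0'    /* marks the end of valid data in a half         */

char  buf[2 * (N + 1)];    /* two halves, each ending in a sentinel */
char *lexeme_begin = buf;  /* start of the current lexeme           */
char *forward      = buf;  /* scanning pointer                      */

/* Load one half (0 or 1) from the source file and terminate it with the
 * sentinel. Call fill_half(src, 0) once before scanning begins.        */
void fill_half(FILE *src, int half) {
    char  *start = buf + half * (N + 1);
    size_t n     = fread(start, 1, N, src);
    start[n] = SENTINEL;
}

/* Advance forward by one character; reload the other half when a sentinel
 * marks the end of the current half. Returns the new character or EOF.   */
int advance(FILE *src) {
    forward++;
    if (*forward == SENTINEL) {
        if (forward == buf + N) {                 /* end of first half  */
            fill_half(src, 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* end of second half */
            fill_half(src, 0);
            forward = buf;
        } else {
            return EOF;                           /* sentinel mid-half: true end of input */
        }
        if (*forward == SENTINEL)                 /* freshly loaded half is empty */
            return EOF;
    }
    return (unsigned char)*forward;
}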
🔹 5. Operations Performed
| Operation | When It Happens |
|---|---|
| Advance forward | Until a valid token is matched |
| Retract forward | If a longer token is rejected (e.g., == vs =) |
| Fill the next buffer | When forward reaches the sentinel |
Synchronized Order of Major Operations in Input
Buffering
1. Set lexeme_begin and forward to the start of buffer
○ Begin scanning a new token from current position.
2. Advance forward pointer
○ Move forward character by character to match a token pattern.
3. Skip whitespace and comments (if any)
○ Ignore non-token characters like spaces, tabs, or // comments.
4. Check for end of buffer (sentinel reached)
○ If the sentinel (eof marker) is found, load the next buffer half and continue.
5. Match token pattern
○ When a valid token pattern is matched, stop advancing.
6. Retract forward (if needed)
○ If an extra character was read, move forward back to correct position.
7. Mark and extract lexeme
○ Take the substring from lexeme_begin to forward - 1.
8. Return the corresponding token to the parser
○ Example: return TOKEN = IDENTIFIER, LEXEME = "count"
9. Set lexeme_begin = forward
○ Reset pointers to scan the next lexeme.
10. Repeat steps for the next token
○ Continue until the end of the file is reached.
🔁 Flow Summary:
Start → Advance → Skip → Match → (Maybe Retract) → Extract → Return Token → Reset → Repeat
🔍 Example:
Consider the input:
a == b;
● lexeme_begin starts at a, forward scans till it reaches space → identifies a as
an identifier
● Then == is scanned, which requires one character of lookahead
● After reading the first =, the scanner looks ahead for a second =
● If only one = exists, it retracts forward and returns a single = token
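A tiny self-contained sketch of that decision in C, using an in-memory string in place of the real buffer pair (the token names and the pretend set-up in main are assumptions):
#include <stdio.h>

enum { TOK_ASSIGN, TOK_EQ };

static const char *forward;   /* stands in for the forward pointer */

/* Called after the first '=' has been read: look ahead one character. */
static int scan_after_equals(void) {
    if (*forward == '=') {    /* second '=' present                           */
        forward++;            /* consume it: the lexeme is "=="               */
        return TOK_EQ;
    }
    return TOK_ASSIGN;        /* no second '=': retract, i.e. consume nothing */
}

int main(void) {
    const char *input = "== b;";
    forward = input + 1;      /* pretend the first '=' was just read          */
    printf("%s\n", scan_after_equals() == TOK_EQ ? "EQ (==)" : "ASSIGN (=)");
    return 0;
}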
🧠 Benefits of Input Buffering
| Advantage | Explanation |
|---|---|
| Reduces file I/O calls | Loads larger chunks of input at once |
| Supports lookahead | Enables multi-character tokens like ==, <=, etc. |
| Efficient scanning | Uses two pointers and a sentinel |
📘 Specification of Tokens
In lexical analysis, tokens are the basic units of meaning (keywords, identifiers, operators, etc.)
that are recognized from the input source code.
To detect and classify them, each token type is defined using patterns, most commonly using
regular expressions (REs).
🔹 1. Token, Lexeme, and Pattern – Quick Recap
| Term | Meaning |
|---|---|
| Token | The category/type of a string (e.g., NUMBER, ID) |
| Lexeme | The actual string in the code that matches a token (x, if) |
| Pattern | A rule (often a regular expression) that defines the structure of valid lexemes for a token |
🔹 2. Specifying Tokens Using Regular Expressions
Each token type is described by a regular expression (RE) pattern that defines which strings
(lexemes) are valid for that token.
✨ Example Table:
| Token Type | Regular Expression (RE) | Example Matches |
|---|---|---|
| ID (Identifier) | [a-zA-Z_][a-zA-Z0-9_]* | x, count1, _temp |
| NUMBER | [0-9]+ | 10, 4567 |
| FLOAT | [0-9]+\.[0-9]+ | 3.14, 0.25 |
| KEYWORD | if, else, while, etc. | if, else |
| RELOP (Relational op) | ==, !=, <, <=, >, >= | ==, != |
| ASSIGN | = | = |
| PLUS | \+ | + |
| MULT | \* | * |
| LPAREN | \( | ( |
| RPAREN | \) | ) |
🔎 These regular expressions define what the lexer should look for when
scanning input.
🔹 3. How Tokens Are Specified in Practice
● In tools like Lex/Flex, you write rules like this:
%%
[0-9]+ { return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]* { return ID; }
if { return IF; }
== { return EQ_OP; }
Each pattern is associated with code to execute or a token to return.
🔹 4. Types of Tokens Common in Programming Languages
| Category | Examples |
|---|---|
| Keywords | int, if, while, return |
| Identifiers | x, my_var, sum1 |
| Constants | 10, 3.14, 'a', "hello" |
| Operators | +, -, *, /, ==, != |
| Separators | ; , ( ) |
| Literals | Numbers, characters, strings |
🔹 5. Lexical Errors (Optional Extra)
If a lexeme doesn’t match any pattern:
● It’s a lexical error
● Lexer reports error or skips invalid character
✅ Summary
● Token specification is done using regular expressions
● Each token type has a pattern to match valid lexemes
● Lexical analyzers use these rules to scan and classify input text
● Tools like Lex/Flex automate this process
📘 Recognition of Tokens
Once tokens are specified using regular expressions, the next task of the lexical analyzer is
to recognize them in the source code. This means reading the character stream and deciding
which part matches which token type.
🔹 1. Goal of Token Recognition
● To scan the source program and identify lexemes that match token patterns.
● For each valid lexeme, return the corresponding token to the parser.
🔹 2. How Are Tokens Recognized?
Token recognition is done using Finite Automata (FA) constructed from the tokens' regular expressions.
✅ Recognition Process Flow:
1. Regular expressions define token patterns.
2. Convert REs into Non-deterministic Finite Automata (NFA).
3. Convert NFA to Deterministic Finite Automata (DFA).
4. Use DFA to scan characters and recognize valid lexemes.
5. When a valid token is found, return it to the parser.
🔸 Example:
Let’s say we have these token patterns:
| Token | Pattern (RE) |
|---|---|
| ID | [a-zA-Z_][a-zA-Z0-9_]* |
| NUM | [0-9]+ |
| ASSIGN | = |
| PLUS | \+ |
Source code:
sum = sum + 10;
Token recognition happens like this:
1. Read characters s, u, m → match pattern for ID
2. Read = → match pattern for ASSIGN
3. Read sum → match ID
4. Read + → match PLUS
5. Read 10 → match NUM
6. Read ; → match SEMICOLON
🔹 3. Longest Match Rule
● The lexical analyzer always chooses the longest possible lexeme that matches a
token.
● This avoids ambiguity.
📌 Example:
Input: >=
Possible matches: > (relational op), >= (relational op)
✅ Lexer returns >= because it's the longest match.
🔹 4. Priority Rule (when patterns overlap)
If multiple token patterns match the same lexeme, priority is given to the pattern defined first
(or with higher precedence).
📌 Example:
if                      { return IF; }
[a-zA-Z_][a-zA-Z0-9_]*  { return ID; }
Here the lexeme if matches both rules with the same length; the lexer returns IF because that rule appears first in the list.
🔹 5. Recognition Using DFA
Once RE is converted to DFA:
● The DFA scans the input starting from lexeme_begin.
● forward advances as long as the DFA remains in valid states.
● When no further move is possible, the DFA falls back to the last accepting (final) state it passed through and returns the corresponding token.
This is fast and efficient.
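A small sketch of this behaviour for the NUM pattern [0-9]+ (hand-coded; a generated lexer would drive a transition table instead, and match_num is a made-up name):
#include <ctype.h>
#include <stdio.h>

/* Simulate the NUM DFA: consume the longest prefix of s made of digits.
 * The return value is where forward would stop; 0 means no NUM starts here. */
static int match_num(const char *s) {
    int i = 0;
    while (isdigit((unsigned char)s[i]))   /* stay in the accepting state */
        i++;
    return i;
}

int main(void) {
    const char *input = "10;";
    printf("NUM lexeme length: %d\n", match_num(input));   /* prints 2 */
    return 0;
}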
🔹 6. Token Recognition Output
Each successful recognition returns a pair:
<Token Type, Lexeme>
Example:
<ID, "sum">
<ASSIGN, "=">
<NUM, "10">
📝 Summary
● Token recognition uses FA (DFA) built from regular expressions.
● Lexer scans input using longest match and priority rules.
● Each match returns a token and its lexeme.
● Efficient recognition is key to fast compilation.
📘 Recognition of Tokens (Recap)
🔹 What is Token Recognition?
Recognition of tokens refers to the process by which the lexical analyzer (lexer) reads the
input characters and identifies valid tokens using predefined patterns (usually regular
expressions).
🔹 How Are Tokens Recognized?
Token recognition involves these key steps:
| Step | Description |
|---|---|
| 1️⃣ Scanning | The lexer reads characters from the source code, starting from the current position (lexeme_begin) |
| 2️⃣ Pattern Matching | The characters are matched against regular expressions for known token types |
| 3️⃣ Longest Match Rule | If multiple patterns match, the longest valid lexeme is selected (e.g., == over =) |
| 4️⃣ Token Generation | The recognized lexeme is categorized as a token and returned to the parser |
| 5️⃣ Advance Pointers | lexeme_begin is updated to the start of the next token |
🔹 Example
Input:
if (count == 10)
Lexer process:
| Lexeme Scanned | Matched Token | Pattern |
|---|---|---|
| if | KEYWORD | if |
| ( | LPAREN | \( |
| count | IDENTIFIER (ID) | [a-zA-Z_][a-zA-Z0-9_]* |
| == | RELOP | == |
| 10 | NUMBER | [0-9]+ |
| ) | RPAREN | \) |
🔹 Longest Match Rule
When more than one pattern matches the input, the lexer always selects the longest possible
valid lexeme.
📌 Example:
Input: >=
| Possible Matches | Selected |
|---|---|
| > (relational operator) | ❌ |
| >= (relational operator) | ✅ (longest match) |
🔹 Prioritized Matching
If two patterns match the same string of the same length, the priority/order of the rules in the list decides which token is returned.
📌 Example:
For if, both these patterns could match:
● if → keyword
● [a-zA-Z_][a-zA-Z0-9_]* → identifier
But the lexer chooses if as keyword, because it’s listed earlier.
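In hand-written lexers the same effect is often achieved by matching the general identifier pattern first and then consulting a small keyword table, as in this minimal sketch (the token names and the two-entry table are assumptions):
#include <stdio.h>
#include <string.h>

enum { TOK_IF, TOK_WHILE, TOK_ID };

/* After an identifier-shaped lexeme is matched, keywords take priority:
 * return TOK_ID only if the lexeme is not in the keyword list.          */
static int classify(const char *lexeme) {
    if (strcmp(lexeme, "if") == 0)    return TOK_IF;
    if (strcmp(lexeme, "while") == 0) return TOK_WHILE;
    return TOK_ID;
}

int main(void) {
    printf("%d %d\n", classify("if"), classify("count"));   /* prints: 0 2 */
    return 0;
}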
🔹 Tools Used for Token Recognition
● Lexers use finite automata (DFA) for fast pattern recognition.
● Regular Expressions → NFA → DFA → Token table.
✅ Summary
| Key Point | Explanation |
|---|---|
| Token recognition = matching input to patterns | Each token has a regular expression pattern |
| Longest match rule | Picks the longest possible lexeme that matches a pattern |
| DFA used in practice | Finite automata are used to efficiently recognize tokens |
| Order of rules matters | Earlier rules take priority if lengths are equal |