
UNIT I: INTRODUCTION TO COMPILERS (9 Hours)

● Structure of a Compiler

● Lexical Analysis

● Role of Lexical Analyzer

● Input Buffering

● Specification of Tokens

● Recognition of Tokens

● Lex

● Finite Automata

● Regular Expressions to Automata

● Minimizing Deterministic Finite Automaton (DFA)

📘 Structure of a Compiler
A compiler is divided into major components, broadly categorized as the Front End, Middle
End, and Back End. Each part handles a specific stage of converting source code into target
machine code.

🔹 1. High-Level Overview
Source Program
        ↓
[Lexical Analyzer]
        ↓
[Syntax Analyzer]
        ↓
[Semantic Analyzer]
        ↓
[Intermediate Code Generator]
        ↓
[Code Optimizer]
        ↓
[Code Generator]
        ↓
Target Program

🔹 2. Phases of a Compiler
Phase                              Function

1. Lexical Analysis                Converts source code into tokens (the "words" of the programming language)
2. Syntax Analysis                 Parses tokens into a parse tree (checks syntax/grammar)
3. Semantic Analysis               Ensures the meaning is correct (e.g., type checking, undeclared variables)
4. Intermediate Code Generation    Produces an intermediate representation (IR), close to machine code
5. Code Optimization               Improves the IR for performance/efficiency (removes dead code, etc.)
6. Code Generation                 Generates machine/assembly code
7. Symbol Table Management         Keeps track of identifiers and their attributes
8. Error Handling                  Detects and reports errors in each phase

🔹 3. Compiler Components (Detailed View)


📌 1. Lexical Analyzer (Scanner)

● Reads source code and breaks it into tokens

● Removes white spaces and comments

● Uses Regular Expressions and Finite Automata

● Example: int x = 10; → int, id(x), =, num(10), ;

📌 2. Syntax Analyzer (Parser)

● Builds a parse tree / syntax tree using CFG (Context-Free Grammar)

● Checks for grammar violations


● Example: Ensures parentheses are balanced, correct statement structure

📌 3. Semantic Analyzer

● Checks for semantic consistency

● Type checking (e.g., assigning float to int), undeclared variables

● Builds symbol table and does type inference

📌 4. Intermediate Code Generator

● Translates AST into Intermediate Representation (IR)

● Example: Three-address code, postfix notation (see the short illustration below)
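
For instance, the assignment a = b + c * d could be translated into three-address code roughly as follows (the temporary names t1 and t2 are illustrative):

t1 = c * d
t2 = b + t1
a  = t2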

📌 5. Code Optimizer

● Improves the IR without changing meaning

● Removes unnecessary code, simplifies expressions

● Example: x = x + 0 → removed

📌 6. Code Generator

● Converts IR into target machine code or assembly

● Allocates registers, memory

📌 7. Symbol Table

● Central data structure used throughout compilation

● Stores identifiers, types, scope, and location info

📌 8. Error Handler

● Detects, reports, and recovers from errors

● Phases may perform error recovery or stop compilation


🔹 4. Compiler Structure – Block Diagram
[Source Code]

┌────────────────────────────┐
│ Lexical Analyzer │
└────────────────────────────┘
↓ Tokens
┌────────────────────────────┐
│ Syntax Analyzer │
└────────────────────────────┘
↓ Parse Tree
┌────────────────────────────┐
│ Semantic Analyzer │
└────────────────────────────┘
↓ Annotated Tree
┌────────────────────────────┐
│ Intermediate Code Generator│
└────────────────────────────┘
↓ IR Code
┌────────────────────────────┐
│ Code Optimizer │
└────────────────────────────┘
↓ Optimized IR
┌────────────────────────────┐
│ Code Generator │
└────────────────────────────┘
↓ Target Code

Side Modules:

● 🧾 Symbol Table

● ⚠️ Error Handling Unit

📝 Summary
● The compiler is divided into phases to simplify the complex process of translation.

● Each phase has a specific task and passes its output to the next phase.

● Tools like Lex and YACC automate Lexical and Syntax Analysis.

📘 LEXICAL ANALYSIS
Lexical analysis is the first phase of a compiler, where the source code is read and converted
into tokens.

🔹 1. Role of Lexical Analyzer


The Lexical Analyzer, also called a scanner, performs the following:

Task                                    Explanation

Reads the source program                Character by character
Groups characters into tokens           Tokens are meaningful strings (e.g., keywords, identifiers, literals, operators)
Removes whitespace/comments             These are not passed to later stages
Provides the token stream to parser     The parser uses these tokens for syntax analysis
Maintains the symbol table              Keeps track of variable/function names and their attributes
Performs error detection                Detects invalid tokens

🔹 2. Tokens, Lexemes, and Patterns

Term      Definition

Token     A category of lexical unit (e.g., if, +, id, num)
Lexeme    The actual string in the source code that matches a pattern (e.g., count, x)
Pattern   A rule describing the structure of lexemes, usually given as a regular expression

🧠 Example:

int count = 10;

Lexeme    Token        Pattern

int       keyword      int
count     id           [a-zA-Z_][a-zA-Z0-9_]*
=         assign       =
10        num          [0-9]+
;         semicolon    ;

🔹 3. Specification of Tokens
Tokens are specified using Regular Expressions (RE).
Token Type    Regular Expression Example

Identifier    [a-zA-Z_][a-zA-Z0-9_]*
Number        [0-9]+(\.[0-9]+)?
Keyword       if, else, while, etc.
Operators     \+, \-, \*, \/, ==, !=, etc.

🔹 4. Input Buffering (Brief)


To speed up reading the input, lexical analyzers use buffer pairs with lookahead pointers.

● Buffer Pair: Two halves – while one is being processed, the other is filled.

● Lexeme Begin & Forward Pointers:

○ lexeme_begin: points to start of current lexeme

○ forward: moves ahead until a match is found

(Full explanation under Input Buffering topic.)

🔹 5. Finite Automata (FA) for Lexical Analysis


Used to recognize tokens defined by regular expressions.

● NFA (Nondeterministic FA): Easy to construct from RE

● DFA (Deterministic FA): Faster to simulate; used in practice

● Conversion: RE → NFA → DFA → Minimized DFA (a small table-driven DFA sketch in C follows)
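
As a concrete illustration, here is a hedged sketch in C of a table-driven DFA for the classic regular expression (a|b)*abb. The state numbering and table layout are one possible encoding, not a fixed convention:

#include <stdio.h>

/* Transition table of a DFA for (a|b)*abb.
 * Rows are states 0..3, columns are the inputs 'a' and 'b'.
 * State 3 is the only accepting state. */
static const int dfa[4][2] = {
    /*            a  b */
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {1, 3},
    /* state 3 */ {1, 0},
};

/* Returns 1 if the whole string is accepted by the DFA. */
int matches(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s == 'a')      state = dfa[state][0];
        else if (*s == 'b') state = dfa[state][1];
        else return 0;                 /* symbol outside the alphabet */
    }
    return state == 3;                 /* accept only in the final state */
}

int main(void) {
    printf("%d\n", matches("abb"));    /* 1 */
    printf("%d\n", matches("aabb"));   /* 1 */
    printf("%d\n", matches("ab"));     /* 0 */
    return 0;
}

The same table-driven idea scales to the DFAs that lexer generators produce; the tables are simply much larger and are built automatically.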

🔹 6. Lexical Errors and Recovery


Common lexical errors:

● Invalid characters (e.g., @, #)

● Unrecognized identifiers

● Unterminated strings

Recovery techniques:

● Panic mode: Skip until a delimiter (e.g., ;) is found (sketched below)

● Error messages: Inform about the issue and continue
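
A minimal sketch of panic-mode recovery in C, assuming ';' and newline are the synchronizing characters (an assumption for illustration; real compilers choose their own synchronizing set):

#include <stdio.h>

/* Skip input until a synchronizing delimiter is found, then let the
 * scanner resume normally from the next character. */
void panic_mode_recover(FILE *src) {
    int c;
    while ((c = fgetc(src)) != EOF) {
        if (c == ';' || c == '\n')   /* assumed synchronizing characters */
            break;
    }
}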

🔹 7. Lexical Analyzer vs Parser

Aspect         Lexical Analyzer                   Syntax Analyzer (Parser)

Input          Source code (characters)           Token stream from the lexer
Output         Tokens                             Parse tree or syntax tree
Checks         Spelling and structure of tokens   Grammar and syntax of the language
Example task   Identify count as an identifier    Ensure if (x) is syntactically valid

🔹 8. Lex / Flex Tool (Intro)


● A tool to generate lexical analyzers

● Programmer writes token patterns (in RE)

● Lex tool converts them to C code


%% <-- Section separator

[a-z]+ printf("Identifier");

[0-9]+ printf("Number");

%% <-- Section separator
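
For completeness, a small self-contained Flex specification might look like the sketch below. The file name, token set, and actions are illustrative assumptions; printing is used instead of returning tokens so that it runs on its own:

%{
#include <stdio.h>
%}

%%
[ \t\n]+                 ;   /* skip whitespace */
[0-9]+                   printf("NUMBER(%s)\n", yytext);
[a-zA-Z_][a-zA-Z0-9_]*   printf("IDENTIFIER(%s)\n", yytext);
.                        printf("UNKNOWN(%s)\n", yytext);
%%

int yywrap(void) { return 1; }

int main(void) { yylex(); return 0; }

A typical build would be: flex scan.l && cc lex.yy.c -o scanner (scan.l is a hypothetical file name).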

📘 INPUT BUFFERING
Input buffering is used in lexical analysis to read source code efficiently, especially when
tokens need lookahead (i.e., peeking ahead without consuming characters immediately).

🔹 1. Why Input Buffering Is Needed


● Lexical analyzers process one character at a time.

● Some tokens require multiple characters to decide (e.g., ==, >=, <=, !=).

● Reading character-by-character from disk or file is slow.

● Buffering helps read larger blocks at once to optimize performance.

🔹 2. Buffer Pair Scheme


To manage input efficiently, compilers use a two-buffer scheme (also called buffer pair):


+----------------------+----------------------+

| BUFFER 1 | BUFFER 2 |

+----------------------+----------------------+

^ ^

lexeme_begin forward
● Each buffer holds n characters.

● When forward reaches the end of BUFFER 1, BUFFER 2 is filled.

● Used in circular fashion.

🔹 3. Pointers in Buffering

Pointer        Purpose

lexeme_begin   Points to the start of the current lexeme
forward        Moves ahead to find the end of the current lexeme

✅ Once the token is found:


→ The characters between lexeme_begin and forward are returned as the lexeme.
→ lexeme_begin is reset to the new token's start.

🔹 4. Special Symbol: Sentinel

To detect the end of a buffer half, we place a sentinel character at the end of each half (usually EOF or a unique marker such as '\0' that cannot occur in the source program).

This avoids explicit checks for buffer end during scanning.
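
The classic buffer-pair algorithm can be sketched in C as below. The buffer size, the reload() helper, the use of '\0' as the sentinel, and reading from stdin are assumptions for illustration:

#include <stdio.h>

#define N        4096      /* size of each buffer half (assumed)         */
#define SENTINEL '\0'      /* assumed sentinel written after loaded data */

/* Layout: [ half 1: N chars ][SENTINEL][ half 2: N chars ][SENTINEL] */
static char buf[2 * (N + 1)];
static char *forward = buf;

/* Reads up to N characters from stdin into the given half and writes
 * SENTINEL immediately after the last character read.
 * reload(buf) must be called once before scanning starts. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, stdin);
    half[n] = SENTINEL;
}

/* Returns the next character, or EOF at the true end of input.
 * The sentinel makes the common case a single comparison; buffer
 * refilling happens only when a half is exhausted. */
int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;               /* ordinary character          */
        if (forward == buf + N + 1) {              /* end of first half           */
            reload(buf + N + 1);                   /* forward now starts half 2   */
        } else if (forward == buf + 2 * (N + 1)) { /* end of second half          */
            reload(buf);
            forward = buf;                         /* wrap around to half 1       */
        } else {
            return EOF;                            /* sentinel inside a half means
                                                      the real end of input       */
        }
    }
}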

🔹 5. Operations Performed

Operation          When It Happens

Advance forward    Until a valid token is matched
Retract forward    If a longer match is rejected (e.g., == vs =)
Fill next buffer   When forward reaches the sentinel

Synchronized Order of Major Operations in Input Buffering

1. Set lexeme_begin and forward to the start of buffer

○ Begin scanning a new token from current position.

2. Advance forward pointer

○ Move forward character by character to match a token pattern.

3. Skip whitespace and comments (if any)

○ Ignore non-token characters like spaces, tabs, or // comments.

4. Check for end of buffer (sentinel reached)

○ If sentinel (␚ or EOF) is found, load the next buffer half and continue.

5. Match token pattern

○ When a valid token pattern is matched, stop advancing.

6. Retract forward (if needed)

○ If an extra character was read, move forward back to correct position.

7. Mark and extract lexeme

○ Take the substring from lexeme_begin to forward - 1.

8. Return the corresponding token to the parser

○ Example: return TOKEN = IDENTIFIER, LEXEME = "count"

9. Set lexeme_begin = forward


○ Reset pointers to scan the next lexeme.

10. Repeat steps for the next token

● Continue until end of file.

🔁 Flow Summary:

Start → Advance → Skip → Match → (Maybe Retract) → Extract → Return Token → Reset → Repeat

🔍 Example:
Consider the input:

a == b;

● lexeme_begin starts at a; forward scans until it reaches the space → a is identified as an identifier

● Then == is scanned – this needs one character of lookahead

● After seeing the first =, the scanner reads the next character to check for a second =

● If only one = exists, it retracts forward and returns a single = token

A sketch of this scanning loop in C appears below.
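
Putting the pointers together, a hand-written scanner's main loop might look roughly like this C sketch. The token names and the single-string interface are simplifying assumptions; a real scanner would sit on top of the buffered next_char machinery sketched earlier:

#include <ctype.h>

typedef enum { TOK_ID, TOK_ASSIGN, TOK_EQ, TOK_OTHER, TOK_EOF } TokenType;

typedef struct {
    TokenType   type;
    const char *start;    /* lexeme_begin            */
    int         length;   /* forward - lexeme_begin  */
} Token;

/* Scans one token starting at *p and leaves *p just past it. */
Token scan_token(const char **p) {
    const char *lexeme_begin, *forward;
    Token t;

    while (**p == ' ' || **p == '\t' || **p == '\n')   /* skip whitespace */
        (*p)++;
    lexeme_begin = forward = *p;

    t.type = TOK_EOF; t.start = lexeme_begin; t.length = 0;
    if (*forward == '\0') { *p = forward; return t; }

    if (isalpha((unsigned char)*forward) || *forward == '_') {
        while (isalnum((unsigned char)*forward) || *forward == '_')
            forward++;                                     /* longest match: identifier  */
        t.type = TOK_ID;
    } else if (*forward == '=') {
        forward++;
        if (*forward == '=') { forward++; t.type = TOK_EQ; }  /* lookahead found "=="    */
        else                 { t.type = TOK_ASSIGN; }         /* retract: keep only "="  */
    } else {
        forward++;                                         /* any other single character */
        t.type = TOK_OTHER;
    }
    t.length = (int)(forward - lexeme_begin);
    *p = forward;                                          /* next call's lexeme_begin   */
    return t;
}

On the input a == b; successive calls would return TOK_ID ("a"), TOK_EQ ("=="), TOK_ID ("b"), TOK_OTHER (";"), and then TOK_EOF.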

🧠 Benefits of Input Buffering

Advantage                Explanation

Reduces file I/O calls   Loads larger chunks of input at once
Supports lookahead       Enables multi-character tokens like ==, <=, etc.
Efficient scanning       Uses two pointers and a sentinel

📘 Specification of Tokens
In lexical analysis, tokens are the basic units of meaning (keywords, identifiers, operators, etc.)
that are recognized from the input source code.

To detect and classify them, each token type is defined using patterns, most commonly using
regular expressions (REs).

🔹 1. Token, Lexeme, and Pattern – Quick Recap

Term      Meaning

Token     The category/type of a string (e.g., NUMBER, ID)
Lexeme    The actual string in the code that matches a token (x, if)
Pattern   A rule (often a regular expression) that defines the structure of valid lexemes for a token

🔹 2. Specifying Tokens Using Regular Expressions


Each token type is described by a regular expression (RE) pattern that defines which strings
(lexemes) are valid for that token.

✨ Example Table:

Token Type               Regular Expression (RE)        Matches / Examples

ID (Identifier)          [a-zA-Z_][a-zA-Z0-9_]*         x, count1, _temp
NUMBER                   [0-9]+                         10, 4567
FLOAT                    [0-9]+\.[0-9]+                 3.14, 0.25
KEYWORD                  if | else | while | ...        if, while
RELOP (Relational op)    == | != | < | > | <= | >=      ==, <=
ASSIGN                   =                              =
PLUS                     \+                             +
MULT                     \*                             *
LPAREN                   \(                             (
RPAREN                   \)                             )
🔎 These regular expressions define what the lexer should look for when
scanning input.

🔹 3. How Tokens Are Specified in Practice


● In tools like Lex/Flex, you write rules like this:

%%

[0-9]+                   { return NUMBER; }

if                       { return IF; }

[a-zA-Z_][a-zA-Z0-9_]*   { return ID; }

==                       { return EQ_OP; }

Each pattern is associated with code to execute or a token to return.

🔹 4. Types of Tokens Common in Programming Languages

Category      Examples

Keywords      int, if, while, return
Identifiers   x, my_var, sum1
Constants     10, 3.14, 'a', "hello"
Operators     +, -, *, /, ==, !=
Separators    ;  ,  (  )
Literals      numbers, characters, strings

🔹 5. Lexical Errors (Optional Extra)


If a lexeme doesn’t match any pattern:

● It’s a lexical error

● Lexer reports error or skips invalid character

✅ Summary
● Token specification is done using regular expressions

● Each token type has a pattern to match valid lexemes

● Lexical analyzers use these rules to scan and classify input text

● Tools like Lex/Flex automate this process

📘 Recognition of Tokens
Once tokens are specified using regular expressions, the next task of the lexical analyzer is
to recognize them in the source code. This means reading the character stream and deciding
which part matches which token type.

🔹 1. Goal of Token Recognition


● To scan the source program and identify lexemes that match token patterns.

● For each valid lexeme, return the corresponding token to the parser.
🔹 2. How Are Tokens Recognized?
Token recognition is done using Finite Automata (FA), which operate on regular expressions.

✅ Recognition Process Flow:


1. Regular expressions define token patterns.

2. Convert REs into Non-deterministic Finite Automata (NFA).

3. Convert NFA to Deterministic Finite Automata (DFA).

4. Use DFA to scan characters and recognize valid lexemes.

5. When a valid token is found, return it to the parser.

🔸 Example:
Let’s say we have these token patterns:

Token     Pattern (RE)

ID        [a-zA-Z_][a-zA-Z0-9_]*
NUM       [0-9]+
ASSIGN    =
PLUS      +

Source code:


sum = sum + 10;

Token recognition happens like this:

1. Read characters s, u, m → match pattern for ID

2. Read = → match pattern for ASSIGN

3. Read sum → match ID

4. Read + → match PLUS

5. Read 10 → match NUM

6. Read ; → match SEMICOLON

🔹 3. Longest Match Rule


● The lexical analyzer always chooses the longest possible lexeme that matches a
token.

● This avoids ambiguity.

📌 Example:


Input: >=

Possible matches: > (relational op), >= (relational op)

✅ Lexer returns >= because it's the longest match.

🔹 4. Priority Rule (when patterns overlap)


If multiple token patterns match the same lexeme, priority is given to the pattern defined first
(or with higher precedence).

📌 Example:


[0-9]+        { return NUM; }

[0-9]+[a-z]+  { return INVALID_ID; /* chosen only when it yields a longer match, e.g. 12ab */ }

🔹 5. Recognition Using DFA


Once RE is converted to DFA:

● The DFA scans input from current lexeme_begin.

● Moves forward as long as it stays in valid states.

● If it reaches a final state and no further match, it returns the token.

This is fast and efficient.
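
To make the "advance while the match can still grow, then fall back to the last accepting state" idea concrete, here is a hedged C sketch for just two token types, NUM ([0-9]+) and FLOAT ([0-9]+\.[0-9]+); the enum names and interface are illustrative:

#include <ctype.h>
#include <stddef.h>

typedef enum { T_NONE, T_NUM, T_FLOAT } Tok;

/* Recognizes NUM or FLOAT at the start of s using the longest-match rule.
 * 'forward' advances optimistically; 'last_accept' remembers the most
 * recent accepting position so the scanner can retract if the longer
 * match fails.  The matched length is stored in *len. */
Tok scan_number(const char *s, size_t *len) {
    size_t forward = 0, last_accept = 0;
    Tok    last_tok = T_NONE;

    while (isdigit((unsigned char)s[forward])) {
        forward++;
        last_accept = forward;          /* accepting state for NUM */
        last_tok = T_NUM;
    }
    if (last_tok == T_NUM && s[forward] == '.') {
        size_t probe = forward + 1;     /* try to extend to a FLOAT */
        while (isdigit((unsigned char)s[probe])) {
            probe++;
            last_accept = probe;        /* accepting state for FLOAT */
            last_tok = T_FLOAT;
        }
        /* if no digit followed '.', last_accept still marks the NUM,
         * so we effectively retract the forward pointer */
    }
    *len = last_accept;
    return last_tok;
}

For example, on the input 12.5+ it returns T_FLOAT with length 4, while on 12.x the attempted FLOAT match dies at x, the scanner retracts, and it returns T_NUM with length 2.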

🔹 6. Token Recognition Output


Each successful recognition returns a pair:


<Token Type, Lexeme>

Example:

<ID, "sum">

<ASSIGN, "=">

<NUM, "10">

📝 Summary
● Token recognition uses FA (DFA) built from regular expressions.

● Lexer scans input using longest match and priority rules.

● Each match returns a token and its lexeme.

● Efficient recognition is key to fast compilation.

📘 Recognition of Tokens
🔹 What is Token Recognition?
Recognition of tokens refers to the process by which the lexical analyzer (lexer) reads the
input characters and identifies valid tokens using predefined patterns (usually regular
expressions).

🔹 How Are Tokens Recognized?


Token recognition involves these key steps:

Step                     Description

1️⃣ Scanning              The lexer reads characters from the source code, starting at the current position (lexeme_begin)
2️⃣ Pattern Matching      The characters are matched against the regular expressions for known token types
3️⃣ Longest Match Rule    If multiple patterns match, the longest valid lexeme is selected (e.g., == over =)
4️⃣ Token Generation      The recognized lexeme is categorized as a token and returned to the parser
5️⃣ Advance Pointers      lexeme_begin is updated to the start of the next token

🔹 Example
Input:


if (count == 10)

Lexer process:

Lexeme Scanned    Matched Token       Pattern

if                KEYWORD             if
(                 LPAREN              \(
count             IDENTIFIER (ID)     [a-zA-Z_][a-zA-Z0-9_]*
==                RELOP               ==
10                NUMBER              [0-9]+
)                 RPAREN              \)

🔹 Longest Match Rule


When more than one pattern matches the input, the lexer always selects the longest possible
valid lexeme.

📌 Example:
Input: >=

Possible Matches             Selected

>  (relational operator)     ❌
>= (relational operator)     ✅ (longest match)

🔹 Prioritized Matching
If two patterns match the same string of same length, priority/order in the rule list decides.

📌 Example:
For if, both these patterns could match:

● if → keyword

● [a-zA-Z_][a-zA-Z0-9_]* → identifier
But the lexer chooses if as a keyword, because that rule is listed earlier (a common keyword-table implementation of this is sketched below).
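
In practice, many hand-written lexers get the same effect by matching the general identifier pattern first and then consulting a keyword table; the table contents and token names below are illustrative:

#include <string.h>

typedef enum { TOK_ID, TOK_IF, TOK_ELSE, TOK_WHILE } TokenType;

/* Small keyword table; any lexeme not found here is an ordinary identifier. */
static const struct { const char *text; TokenType tok; } keywords[] = {
    { "if", TOK_IF }, { "else", TOK_ELSE }, { "while", TOK_WHILE },
};

TokenType classify_identifier(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].tok;      /* reserved word wins */
    return TOK_ID;                       /* plain identifier   */
}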

🔹 Tools Used for Token Recognition


● Lexers use finite automata (DFA) for fast pattern recognition.

● Regular Expressions → NFA → DFA → Token table.

✅ Summary

Key Point                                        Explanation

Token recognition = matching input to patterns   Each token type has a regular expression pattern
Longest match rule                               Picks the longest possible lexeme that matches a pattern
DFA used in practice                             Finite automata are used to recognize tokens efficiently
Order of rules matters                           Earlier rules take priority when lengths are equal
