FIRST Set
The FIRST set of a grammar symbol (terminal or non-terminal) is the set of terminals that can appear at the beginning of some string derived from that symbol. If the symbol can derive the empty string, ε is included as well.
Why FIRST Set Is Important
Used in LL(1) and LR(1) parsing tables.
Helps decide which production rule to use.
Assists in resolving ambiguity.
Helps compute the FOLLOW set.
Essential for recursive descent parsing.
Rules for Computing FIRST Set
Let’s compute FIRST(X):
1. If X is a terminal:
FIRST(X) = {X}
2. If X → ε is a production:
Add ε to FIRST(X)
3. If X → Y1 Y2 ... Yn:
Add FIRST(Y1) to FIRST(X), excluding ε.
If FIRST(Y1) contains ε, also add FIRST(Y2) (excluding ε), and so on.
If all of Y1 ... Yn can derive ε, then add ε to FIRST(X).
Example 1
Grammar:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
FIRST Sets:
FIRST(E) = { (, id }
FIRST(E') = { +, ε }
FIRST(T) = { (, id }
FIRST(T') = { *, ε }
FIRST(F) = { (, id }
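These rules can be implemented as a small fixpoint computation. Below is a minimal sketch in Python for the grammar of Example 1, assuming productions are stored as a dict of symbol lists and using 'ε' as the marker for the empty string (the function name compute_first is my own):
EPSILON = 'ε'

# Grammar of Example 1: each non-terminal maps to its list of productions,
# and each production is a list of grammar symbols.
grammar = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], [EPSILON]],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], [EPSILON]],
    'F':  [['(', 'E', ')'], ['id']],
}

def compute_first(grammar):
    first = {nt: set() for nt in grammar}

    def first_of(symbol):
        # Rule 1: FIRST of a terminal is the terminal itself.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                       # repeat until no FIRST set grows
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                if prod == [EPSILON]:    # Rule 2: X -> ε
                    changed |= EPSILON not in first[nt]
                    first[nt].add(EPSILON)
                    continue
                for sym in prod:         # Rule 3: X -> Y1 Y2 ... Yn
                    before = len(first[nt])
                    first[nt] |= first_of(sym) - {EPSILON}
                    changed |= len(first[nt]) != before
                    if EPSILON not in first_of(sym):
                        break
                else:                    # every Yi derives ε
                    changed |= EPSILON not in first[nt]
                    first[nt].add(EPSILON)
    return first

for nt, f in sorted(compute_first(grammar).items()):
    print(nt, '=', f)                    # reproduces the sets listed above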
Features of FIRST Set
Feature | Description
Purpose | To identify which terminal symbols can begin strings derived from a non-terminal.
Used in | LL(1) parsing, recursive descent parsers, predictive parsers.
Helps with | Rule selection, ambiguity resolution, and error detection.
Advantages
Helps make parsing decisions fast.
Supports error detection and recovery.
Reduces ambiguity in grammar.
Disadvantages
Complex computation for large grammars.
Cannot handle ambiguous or left-recursive grammars on its own; such grammars must be transformed first (e.g., by removing left recursion).
Tip for Learning
Always start from terminals and work upward. Pay special attention to rules that include ε
because they affect whether we need to consider more symbols in a production.
Left Recursion
A grammar is left-recursive if a non-terminal appears as the leftmost symbol on the right-hand side of one of its own productions.
Why it's a problem:
Top-down parsers like recursive descent and LL(1) parsers cannot handle left recursion directly
— it can lead to infinite recursion.
Example of Left Recursion:
A → Aα | β
Here, A appears as the leftmost symbol of its own right-hand side, so the grammar is left-recursive.
Example with expressions:
E → E + T | T
Removing Left Recursion
To remove left recursion, rewrite the productions using the following method:
Convert:
A → Aα | β
Into:
A → β A'
A' → α A' | ε
Example:
Original:
E → E + T | T
After eliminating left recursion:
E → T E'
E' → + T E' | ε
Now it's suitable for top-down parsing!
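This rewrite can also be done programmatically. A minimal sketch, assuming productions are given as lists of symbols and handling only immediate left recursion (the function name remove_left_recursion is my own):
# Remove *immediate* left recursion: A -> A α | β  becomes
#   A  -> β A'
#   A' -> α A' | ε
def remove_left_recursion(nonterminal, productions):
    recursive = []      # the α parts of A -> A α
    others = []         # the β parts
    for prod in productions:
        if prod and prod[0] == nonterminal:
            recursive.append(prod[1:])          # strip the leading A
        else:
            others.append(prod)

    if not recursive:                           # nothing to do
        return {nonterminal: productions}

    new_nt = nonterminal + "'"
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [['ε']],
    }

# Example: E -> E + T | T
print(remove_left_recursion('E', [['E', '+', 'T'], ['T']]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], ['ε']]}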
Left Factoring
Left factoring is a grammar transformation used when two or more productions share a
common prefix. It helps remove ambiguity and make decisions easier during parsing.
Why it's useful:
LL(1) parsers need to choose the correct production based on a single lookahead symbol. Left
factoring helps in such decisions.
Example of Left Factoring:
Before:
A → if E then S else S
A → if E then S
The common prefix is: if E then S
After Left Factoring:
A → if E then S A'
A' → else S | ε
Now, the parser doesn’t have to guess which rule to use—it looks ahead and knows exactly
what to do.
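Left factoring can likewise be automated for a single non-terminal by extracting the longest prefix shared by all of its alternatives. A minimal sketch (the helper names common_prefix and left_factor are my own):
def common_prefix(productions):
    # longest sequence of symbols common to the front of every alternative
    prefix = []
    for symbols in zip(*productions):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(nonterminal, productions):
    prefix = common_prefix(productions)
    if not prefix:                       # nothing shared, nothing to factor
        return {nonterminal: productions}
    new_nt = nonterminal + "'"
    # whatever follows the common prefix; an empty suffix becomes ε
    suffixes = [prod[len(prefix):] or ['ε'] for prod in productions]
    return {
        nonterminal: [prefix + [new_nt]],
        new_nt: suffixes,
    }

stmt = [
    ['if', 'E', 'then', 'S', 'else', 'S'],
    ['if', 'E', 'then', 'S'],
]
print(left_factor('A', stmt))
# {'A': [['if', 'E', 'then', 'S', "A'"]], "A'": [['else', 'S'], ['ε']]}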
Comparison Table
Feature | Left Recursion | Left Factoring
Purpose | Eliminate infinite recursion | Remove ambiguity in rule selection
Problem it solves | Recursive descent parser failure | Multiple rules starting the same way
Changes | Converts recursion into iteration | Extracts common prefixes
Helps in | Making grammar LL(1)-compatible | Making parser decisions easier
Translator for Simple Expressions
A translator for simple expressions is a component of a compiler that converts arithmetic expressions (like a + b * c) into a different form, such as:
Postfix (Reverse Polish Notation)
Intermediate representation (IR)
Machine code
Simple Expression
Simple expressions include basic arithmetic using identifiers and operators like:
a + b
a + b * c
(a + b) * c
Translator Goals
Parse the expression
Apply correct precedence and associativity
Generate intermediate code or postfix notation
Example 1: Infix to Postfix Translation
Input:
a + b * c
Output:
abc*+
Grammar for Simple Arithmetic Expressions
Let’s define a grammar that captures precedence and associativity:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
Translation Scheme: Postfix Notation Generator
Here’s a Syntax-Directed Translation scheme using the above grammar:
Production Rules with Postfix Translation:
E  → T E'      { E.code = T.code || E'.code }
E' → + T E'1   { E'.code = T.code || "+" || E'1.code }
E' → ε         { E'.code = "" }
T  → F T'      { T.code = F.code || T'.code }
T' → * F T'1   { T'.code = F.code || "*" || T'1.code }
T' → ε         { T'.code = "" }
F  → ( E )     { F.code = E.code }
F  → id        { F.code = id.lexeme }
Here E'1 and T'1 denote the occurrences of E' and T' on the right-hand side, and || denotes string concatenation.
Example Trace (Input: a + b * c)
F → id gives F.code = "a", so the first T.code = "a".
In E' → + T E'1, the second T derives b * c: F.code = "b", then T' → * F T'1 with F.code = "c", so T.code = "bc*".
E'1 → ε, so E'.code = T.code || "+" || E'1.code = "bc*" || "+" || "" = "bc*+".
E.code = T.code || E'.code = "a" || "bc*+" = "abc*+".
Postfix Output:
abc*+
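This translation scheme maps directly onto a recursive-descent translator: each non-terminal becomes a function that returns its .code attribute. A minimal sketch, restricted to single-character identifiers so that no separate lexer is needed (the function and helper names are my own):
# Recursive-descent translator for the grammar
#   E -> T E'      E' -> + T E' | ε
#   T -> F T'      T' -> * F T' | ε
#   F -> ( E ) | id
# Each function returns the postfix code of the phrase it derives.
def translate(expr):
    s = expr.replace(' ', '')        # single-character ids, no lexer needed
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ''

    def eat(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r}, got {peek()!r}")
        pos += 1

    def E():                         # E -> T E'
        return T() + E_prime()

    def E_prime():                   # E' -> + T E' | ε
        if peek() == '+':
            eat('+')
            return T() + '+' + E_prime()
        return ''                    # ε

    def T():                         # T -> F T'
        return F() + T_prime()

    def T_prime():                   # T' -> * F T' | ε
        if peek() == '*':
            eat('*')
            return F() + '*' + T_prime()
        return ''                    # ε

    def F():                         # F -> ( E ) | id
        nonlocal pos
        if peek() == '(':
            eat('(')
            code = E()
            eat(')')
            return code
        ident = peek()
        if not ident.isalpha():
            raise SyntaxError(f"unexpected {ident!r}")
        pos += 1
        return ident

    code = E()
    if pos != len(s):
        raise SyntaxError("trailing input")
    return code

print(translate("a + b * c"))        # abc*+
print(translate("(a + b) * c"))      # ab+c*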
Adding a Lexical Analyzer for Simple Expressions
What is a Lexical Analyzer?
It reads raw input (a string like "a + b * c")
Splits it into tokens: identifiers, operators, parentheses
Each token has a type and possibly a value
Tokens We Need for Simple Expressions
Token Type | Examples
ID | a, b, c, var1
PLUS | +
MULT | *
LPAREN | (
RPAREN | )
EOF | End-of-input marker
Example Lexer in Python
import re
# Token specification using regex
token_specification = [
    ('ID',     r'[a-zA-Z_][a-zA-Z_0-9]*'),  # Identifiers
    ('PLUS',   r'\+'),                      # Plus
    ('MULT',   r'\*'),                      # Multiply
    ('LPAREN', r'\('),                      # Left Parenthesis
    ('RPAREN', r'\)'),                      # Right Parenthesis
    ('SKIP',   r'[ \t]+'),                  # Skip spaces and tabs
]

def lexer(code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = None
        for tok_type, tok_regex in token_specification:
            pattern = re.compile(tok_regex)
            match = pattern.match(code, pos)
            if match:
                if tok_type != 'SKIP':  # Ignore whitespace
                    tokens.append((tok_type, match.group(0)))
                pos = match.end()
                break
        if not match:
            raise SyntaxError(f'Illegal character: {code[pos]} at position {pos}')
    tokens.append(('EOF', ''))
    return tokens

# Example usage:
input_code = "a + b * (c + d)"
tokens = lexer(input_code)
for t in tokens:
    print(t)
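For the sample input "a + b * (c + d)", this example usage prints:
('ID', 'a')
('PLUS', '+')
('ID', 'b')
('MULT', '*')
('LPAREN', '(')
('ID', 'c')
('PLUS', '+')
('ID', 'd')
('RPAREN', ')')
('EOF', '')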
Integrating Lexer with Parser
Modify the parser to consume tokens as (token_type, token_value) instead of raw strings.
Example snippet of parser token consumption:
tokens = lexer(input_code)
pos = 0

def next_token():
    return tokens[pos][0] if pos < len(tokens) else 'EOF'

def match(expected_type):
    global pos
    if next_token() == expected_type:
        pos += 1
    else:
        raise SyntaxError(f'Expected {expected_type}, got {next_token()}')
Use tokens[pos][1] whenever you want the actual lexeme (like the variable name).
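Building on the lexer and the next_token/match helpers above, the parser itself can be written as one function per non-terminal, each returning the postfix code for the phrase it derives. A minimal sketch (the parse_* function names are my own):
# Recursive-descent parsing over (token_type, lexeme) pairs, emitting postfix.
# Assumes lexer(), tokens, pos, next_token() and match() as defined above.
def parse_E():                      # E -> T E'
    return parse_T() + parse_E_prime()

def parse_E_prime():                # E' -> + T E' | ε
    if next_token() == 'PLUS':
        match('PLUS')
        return parse_T() + ['+'] + parse_E_prime()
    return []                       # ε

def parse_T():                      # T -> F T'
    return parse_F() + parse_T_prime()

def parse_T_prime():                # T' -> * F T' | ε
    if next_token() == 'MULT':
        match('MULT')
        return parse_F() + ['*'] + parse_T_prime()
    return []                       # ε

def parse_F():                      # F -> ( E ) | id
    if next_token() == 'LPAREN':
        match('LPAREN')
        code = parse_E()
        match('RPAREN')
        return code
    lexeme = tokens[pos][1]         # the identifier's lexeme
    match('ID')
    return [lexeme]

postfix = parse_E()
match('EOF')
print(' '.join(postfix))            # a b c d + * +  for "a + b * (c + d)"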
Now you have a complete mini-compiler front-end for simple expressions:
Lexer to tokenize input
Parser to validate syntax and parse structure
Translator to convert infix to postfix
Token Attributes:
When a lexical analyzer (lexer) scans the source code, it produces tokens. A token is a pair
consisting of a token type (like ID, NUMBER, PLUS) and its attributes, which hold additional
information about that token.
Examples of Token Attributes:
Token Type | Possible Attributes
ID | The actual variable name (lexeme), e.g., "count"
NUMBER | Numeric value (int, float), e.g., 42 or 3.14
STRING | The string literal value, e.g., "hello"
PLUS | Usually no attribute (just the symbol)
KEYWORD | The keyword string or an internal code
Why are Token Attributes Important?
They carry semantic info for later stages (parsing, semantic analysis, code gen).
For example, the parser needs the lexeme of an ID to build symbol tables.
The code generator may use numeric values in NUMBER tokens directly.
Example of Token with Attributes
Token = {
    'type': 'ID',
    'lexeme': 'variableName',
    'line': 10,
    'column': 5
}
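As a sketch of how such attributes get filled in, a tokenizer can record the lexeme, a converted value for numbers, and a line/column position as it scans. The Token dataclass and tokenize function below are my own and separate from the lexer shown earlier; in this sketch, illegal characters are simply skipped.
import re
from dataclasses import dataclass

@dataclass
class Token:
    type: str             # e.g. 'ID', 'NUMBER', 'PLUS'
    lexeme: str           # the exact characters matched
    value: object = None  # converted attribute, e.g. int(lexeme) for NUMBER
    line: int = 1
    column: int = 0

def tokenize(line_text, line_no=1):
    spec = [('NUMBER', r'\d+'), ('ID', r'[A-Za-z_]\w*'),
            ('PLUS', r'\+'), ('SKIP', r'\s+')]
    pattern = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in spec))
    for m in pattern.finditer(line_text):
        kind = m.lastgroup
        if kind == 'SKIP':
            continue
        value = int(m.group()) if kind == 'NUMBER' else None
        yield Token(kind, m.group(), value, line_no, m.start())

print(list(tokenize("count + 42")))
# [Token(type='ID', lexeme='count', value=None, line=1, column=0),
#  Token(type='PLUS', lexeme='+', value=None, line=1, column=6),
#  Token(type='NUMBER', lexeme='42', value=42, line=1, column=8)]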
Instructions for Stack Manipulation (in Compilation)
Stack manipulation instructions are central in runtime environments for expression evaluation, function calls, and intermediate code generation (especially for postfix evaluation or stack machine code).
Common Stack Operations:
Instruction | Description
PUSH <value> | Push a constant, variable value, or intermediate result onto the stack
POP <dest> | Pop the top value from the stack into a variable or register
LOAD <var> | Push the value of a variable onto the stack
STORE <var> | Pop the top value from the stack and store it into a variable
DUP | Duplicate the top value of the stack
SWAP | Swap the top two values on the stack
Arithmetic Operations (Assuming operands on stack):
Instruction | Operation
ADD | Pop two values, add, push result
SUB | Pop two values, subtract, push result
MUL | Pop two values, multiply, push result
DIV | Pop two values, divide, push result
Example: Evaluating a + b * c in a Stack Machine
For postfix expression a b c * +, instructions might be:
LOAD a    # push the value of a
LOAD b    # push the value of b
LOAD c    # push the value of c
MUL       # pop c and b, push b * c
ADD       # pop (b * c) and a, push the final result a + (b * c)
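Such an instruction sequence can be executed by a very small interpreter. A minimal sketch, with made-up variable bindings for the demo (the function name run is my own):
# Tiny stack-machine interpreter for the instruction set above.
def run(program, variables):
    stack = []
    for instr in program:
        op, *args = instr.split()
        if op == 'LOAD':            # push the value of a variable
            stack.append(variables[args[0]])
        elif op == 'PUSH':          # push a constant
            stack.append(int(args[0]))
        elif op == 'STORE':         # pop into a variable
            variables[args[0]] = stack.pop()
        elif op in ('ADD', 'SUB', 'MUL', 'DIV'):
            right = stack.pop()     # second operand is on top
            left = stack.pop()
            stack.append({'ADD': left + right,
                          'SUB': left - right,
                          'MUL': left * right,
                          'DIV': left / right}[op])
        else:
            raise ValueError(f'unknown instruction: {instr}')
    return stack[-1]

program = ['LOAD a', 'LOAD b', 'LOAD c', 'MUL', 'ADD']
print(run(program, {'a': 2, 'b': 3, 'c': 4}))   # 2 + 3 * 4 = 14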