Lexical Analysis
• Lexical analysis is the process of converting the sequence of characters of the source program into a sequence of
tokens.
• A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer, or scanner.
• The sequence of tokens produced by lexical analyzer helps the parser in analyzing the syntax of programming
languages.
• The input to a lexical analyzer is the pure high-level code from the preprocessor. It identifies valid lexemes
from the program and returns tokens to the syntax analyzer, one after the other, corresponding to
the getNextToken command from the syntax analyzer.
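As a rough illustration (not from the slides), the parser-side view of this interface can be sketched in C; the Token struct and the getNextToken() stub below are hypothetical stand-ins for whatever a real lexer provides.
#include <stdio.h>

/* Hypothetical token kinds and token record (illustrative only). */
typedef enum { TOK_ID, TOK_NUM, TOK_PUNCT, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *lexeme;   /* the matched characters */
} Token;

/* Stub lexer: returns a fixed token stream for the statement  x = 42; */
Token getNextToken(void) {
    static const Token stream[] = {
        { TOK_ID, "x" }, { TOK_PUNCT, "=" }, { TOK_NUM, "42" },
        { TOK_PUNCT, ";" }, { TOK_EOF, "" }
    };
    static int i = 0;
    return stream[i].kind == TOK_EOF ? stream[i] : stream[i++];
}

int main(void) {
    /* The parser pulls tokens one at a time, as described above. */
    for (Token t = getNextToken(); t.kind != TOK_EOF; t = getNextToken())
        printf("token kind=%d  lexeme=%s\n", t.kind, t.lexeme);
    return 0;
}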
Main tasks of lexical analysis
• Scans the pure HLL code line by line.
• Takes lexemes as input and produces tokens.
• Removes comments and whitespace (blank space (' '), newline ('\n'), horizontal tab ('\t'), carriage
return ('\r'), form feed ('\f'), or vertical tab ('\v')) from the pure HLL code.
• Finds lexical errors.
• Returns the sequence of valid tokens to the syntax analyzer.
• When it finds an identifier, it makes an entry into the symbol table.
Contd…
• There are three important terms:
1. Tokens: A Token is a pre-defined sequence of characters that cannot be broken down further. It is like an
abstract symbol that represents a unit. A token can have an optional attribute value. There are different types of
tokens:
1. Identifiers (user-defined)
2. Punctuations (;, ,, {}, etc.)
3. Operators (+, -, *, /, etc.)
4. Special symbols
5. Keywords
6. Constants
7. Literals
2. Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern of a
token.
For example: (, ) are lexemes of type punctuation where punctuation is the token.
3. Patterns: A pattern is a set of rules a scanner follows to match a lexeme in the input program to identify a valid
token. It is like the lexical analyzer's description of a token to validate a lexeme.
For example, the exact characters of a keyword form the pattern that identifies that keyword. An identifier is
identified by the pre-defined set of rules for forming identifiers, i.e., the pattern of an identifier is a letter followed by letters or digits.
Contd….
printf("HELLO WORLD");
For example, this statement contains five lexemes: printf (an identifier token), ( and ) and ; (punctuation tokens), and "HELLO WORLD" (a literal token).
• Length of String
• The length of a string is the number of symbols in the string. A string is
represented by the letter 's', and |s| denotes the length of the
string. Consider the string:
• s = banana
|s| = 6
• Note: The empty string, the string of length 0, is represented by 'ε'.
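In C, for instance, |s| corresponds to strlen(s):
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "banana";
    printf("|s| = %zu\n", strlen(s));   /* prints: |s| = 6 */
    return 0;
}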
Terms related to strings
1. Prefix of String
• A prefix of a string s is any leading sequence of symbols of s, including ε and the
string s itself (a small C check of these relations appears after item 8).
• For example:
• s = abcd
• The prefixes of the string abcd: ε, a, ab, abc, abcd
2. Suffix of String
• A suffix of a string s is any trailing sequence of symbols of s, including ε and the string s itself.
• For example:
• s = abcd
• The suffixes of the string abcd: ε, d, cd, bcd, abcd
Contd….
3. Proper Prefix of String
• The proper prefixes of a string are all its prefixes excluding ε and the
string s itself.
• Proper prefixes of the string abcd: a, ab, abc
4. Proper Suffix of String
• The proper suffixes of a string are all its suffixes excluding ε and the string s itself.
• Proper suffixes of the string abcd: d, cd, bcd
5. Substring of String
• A substring of a string s is obtained by deleting any prefix and any suffix from the string.
• Substrings of the string abcd: ε, abcd, bcd, abc, …
Contd…
6. Proper Substring of String
• The proper substrings of a string s are all its substrings excluding ε and
the string s itself.
• Proper substrings of the string abcd: bcd, abc, cd, ab, …
7. Subsequence of String
• A subsequence of a string is obtained by deleting zero or more (not
necessarily consecutive) symbols from the string.
• Subsequences of the string abcd: abd, bcd, bd, …
8. Concatenation of String
• If s and t are two strings, then st denotes their concatenation, as shown in the sketch below.
• s = abc, t = def
• Concatenation of strings s and t, i.e. st = abcdef
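The prefix, suffix, and concatenation relations above can be tested with a minimal C sketch; the helper names is_prefix and is_suffix are illustrative, not standard library functions:
#include <stdio.h>
#include <string.h>

/* does string s begin with p? (p is a prefix of s) */
int is_prefix(const char *p, const char *s) {
    return strncmp(p, s, strlen(p)) == 0;
}

/* does string s end with t? (t is a suffix of s) */
int is_suffix(const char *t, const char *s) {
    size_t lt = strlen(t), ls = strlen(s);
    return lt <= ls && strcmp(s + ls - lt, t) == 0;
}

int main(void) {
    printf("%d %d\n", is_prefix("ab", "abcd"), is_suffix("cd", "abcd")); /* 1 1 */

    char st[8] = "abc";
    strcat(st, "def");            /* concatenation: st = abcdef */
    printf("%s\n", st);
    return 0;
}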
3. Language
• A language is a set of strings over some fixed alphabet, just as the English language
is a set of strings over the fixed alphabet 'a' to 'z'.
Operations on languages
1. Union
• Union is the most common set operation. Consider two languages L and M. The union of
these two languages is denoted by:
• L ∪ M = { s | s is in L or s is in M }
• That means a string s in the union of the two languages comes either from language L or from
language M.
• If L = {a, b} and M = {c, d}, then L ∪ M = {a, b, c, d}
2. Concatenation
• Concatenation links each string of one language to each string of another language in all
possible ways. The concatenation of two languages is denoted by:
• L ⋅ M = { st | s is in L and t is in M }
• If L = {a, b} and M = {c, d}, then L ⋅ M = {ac, ad, bc, bd}
Contd….
3. Kleene Closure
• The Kleene closure of a language L is the set of strings
obtained by concatenating L zero or more times. The Kleene closure of
the language L is denoted by:
• L* = L⁰ ∪ L¹ ∪ L² ∪ … (where L⁰ = {ε})
• If L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, …}
4. Positive Closure
• The positive closure of a language L is the set of strings
obtained by concatenating L one or more times. It is denoted by:
• L⁺ = L¹ ∪ L² ∪ L³ ∪ …
• It is the same as the Kleene closure except for the term L⁰, i.e. L⁺ excludes ε unless
ε is in L itself.
• If L = {a, b}, then L⁺ = {a, b, aa, ab, ba, bb, …}
• So, these are the four operations that can be performed on languages in the
lexical analysis phase.
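As an illustrative sketch (not from the slides), the concatenation L ⋅ M and the strings of L* up to length 2 can be enumerated with nested loops in C:
#include <stdio.h>

int main(void) {
    const char *L[] = { "a", "b" };
    const char *M[] = { "c", "d" };

    /* L . M = { st | s in L and t in M } = { ac, ad, bc, bd } */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("%s%s ", L[i], M[j]);
    printf("\n");

    /* L* up to length 2: L0 (just the empty string), then L1, then L2 = L . L */
    printf("(empty) ");
    for (int i = 0; i < 2; i++)
        printf("%s ", L[i]);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("%s%s ", L[i], L[j]);
    printf("\n");   /* (empty) a b aa ab ba bb */
    return 0;
}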
5. Regular Expression
1. If a regular expression R is epsilon (ε), then the language of R is the set containing only the
empty string, i.e. L(R) = L(ε) = {ε}.
2. If a regular expression R is Φ, then the language of R is the
empty set, i.e. L(R) = L(Φ) = { }.
3. If a regular expression R is an input symbol a belonging to Σ, then the language of R is the
set containing the single string a, i.e. L(R) = L(a) = {a}.
Consider regular expressions r, s, and t. The algebraic laws that hold for these regular expressions include:
• r | s = s | r (union is commutative)
• r | (s | t) = (r | s) | t (union is associative)
• r(st) = (rs)t (concatenation is associative)
• r(s | t) = rs | rt and (s | t)r = sr | tr (concatenation distributes over union)
• εr = rε = r (ε is the identity for concatenation)
• r* = (r | ε)* (ε is guaranteed in a closure)
• r** = r* (closure is idempotent)
Notations
1. The `+` symbol denotes one or more occurrences
2. The `*` symbol denotes zero or more occurrences
3. The `?` symbol denotes zero or one occurrence
4. The `[ ]` symbol denotes character classes
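For example, these operators might appear in Lex-style definitions as follows (the names are illustrative):
/* [0-9] is a character class: any one digit */
digit    [0-9]
/* + : one or more digits */
digits   {digit}+
/* ? : zero or one sign character */
sign     [+-]?
/* * : a letter, then zero or more letters or digits */
id       [a-zA-Z][a-zA-Z0-9]*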
Regular Definition
Giving names to regular expressions is referred to as a regular definition.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular
definition for this set:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Or
id → [a-zA-Z_][a-zA-Z0-9]*
Contd….
The regular expression number can be written using regular definitions as:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
This matches strings such as:
123 , 456
123.45
123.45E23
123.45E-23
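In Lex notation the same set of numbers can be matched with a single pattern; the definition name number below is illustrative:
number   [0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?
This matches 123 and 456 (no fraction or exponent), 123.45 (fraction only), and 123.45E23 or 123.45E-23 (fraction and exponent).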
Recognition of tokens
• Tokens are recognized using transition diagrams.
• The Lex tool generates such recognizers automatically from patterns. A Lex source program has the following structure:
DECLARATIONS
%%
TRANSLATION RULES
%%
AUXILIARY FUNCTIONS
1. Declarations
This section includes declarations of variables, constants, and regular definitions. The declarations section can be
empty.
%{
Declarations
%}
Example:
%{
int a, b;
float count = 0;
%}
digit [0-9]
letter [a-zA-Z]
2. Translation Rules:
• Starts with %% and ends with %%.
• Every rule is specified in the form of a pattern followed by an action:
pattern {action}
• Each pattern is a regular expression, which may use the regular definitions of the declarations section.
• Each action is a fragment of C code.
• Pattern and action are separated by one or more whitespace characters, such as blanks or tabs.
3. Auxiliary Functions:
• All necessary functions are defined here.
• The following are the Lex predefined functions and variables:
• yyin :- the input stream pointer (it points to the input file to be scanned or tokenized);
however, the default input of the default main() is stdin.
• yylex() :- the main entry point for Lex; it reads the input stream, generates tokens, and returns zero
at the end of the input stream. It is called to invoke the lexer (or scanner), and each time yylex() is
called, the scanner continues processing the input from where it last left off.
• yytext :- a buffer that holds the input characters that actually match the pattern (i.e. the lexeme), or, equivalently, a
pointer to the matched string.
• yyleng :- the length of the lexeme.
• yylval :- contains the token value.
• yyval :- a local variable.
• yyout :- the output stream pointer (it points to the file where output is written); however,
the default output of the default main() is stdout.
• yywrap() :- called by Lex when input is exhausted (at EOF); the default yywrap always returns 1.
• yymore() :- tells Lex to append the next match to the current contents of yytext instead of replacing it.
• yyless(k) :- retains the first k characters of yytext and pushes the rest back onto the input to be rescanned.
• yyparse() :- the parser entry point generated by Yacc; it builds the parse tree, calling yylex() to obtain tokens.
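For instance, a translation rule can combine yytext and yyleng (an illustrative fragment, not from the slides):
[a-zA-Z]+ { printf("matched \"%s\" of length %d\n", yytext, yyleng); }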
Example: a Lex program that prints Hello when the input is Hi.
%{
#include <stdio.h>
%}
%%
"Hi" { printf("Hello"); /* greet back when the input is Hi */ }
.* { printf("Wrong"); /* anything else on the line */ }
%%
int main()
{
    printf("Enter the input:\n");
    yylex();    /* invoke the scanner */
    return 0;
}
int yywrap() { return 1; }
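To build and run this program (assuming the source file is named, say, hello.l; the filename is illustrative):
lex hello.l          (or: flex hello.l; either produces lex.yy.c)
cc lex.yy.c -o hello
./hello
Because main() and yywrap() are defined in the program itself, no extra Lex library (-ll) is needed when linking.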
Example: a Lex program that reads a number and reports whether it is even or odd.
%{
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
int i;
%}
%%
[0-9]+ { i = atoi(yytext);   /* convert the matched lexeme to an integer */
         if (i % 2 == 0)
             printf("The given no. is even.\n");
         else
             printf("The given no. is odd.\n"); }
%%
int yywrap() { return 1; }   /* returning 1 stops scanning at EOF */
int main()
{
    printf("Enter the number:\n");
    yylex();
    return 0;
}
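A sample run (with the source saved as, say, evenodd.l and built the same way as the previous example) would look like:
Enter the number:
42
The given no. is even.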
Compiler construction toolkit
• Compiler Construction Tools are specialized tools that help in the implementation of various phases of a
compiler. These tools help in the creation of an entire compiler or its parts.
• Some of the commonly used compiler construction tools are:
1. Parser Generator
2. Scanner Generator
3. Syntax Directed Translation Engines
4. Automatic Code Generators
5. Data-Flow Analysis Engines
Contd…
1. Scanner Generators
• Input: Regular-expression description of the tokens of a language.
Output: Lexical analyzers.
A scanner generator generates lexical analyzers from a regular-expression description of the tokens of a
language.
2. Parser Generators
• Input: Grammatical description of a programming language
Output: Syntax analyzers.
• A parser generator takes the grammatical description of a programming language and produces a syntax
analyzer.
Contd….
3. Syntax-directed Translation Engines
• Input: Parse tree.
Output: Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree and generate
intermediate code.
4. Automatic Code Generators
• Input: Intermediate language.
Output: Machine language.
An automatic code generator takes a collection of rules that define the translation of each operation of the
intermediate language into the machine language of a target machine.
5. Data-flow Analysis Engines
• A data-flow analysis engine gathers information about how values are transmitted from one part of a
program to each of the other parts. Data-flow analysis is a key part of code optimization.
6. Compiler Construction Toolkits
• These toolkits provide an integrated set of routines for constructing the various phases of a compiler.