

Unit - 4 Regex

The document provides an overview of regular expressions (regex) in Python, detailing their definition, applications, and how to create them using the re module. It covers various functions such as re.match(), re.search(), re.findall(), and re.finditer(), explaining their purposes and differences. Additionally, it discusses character classes, predefined character classes, and regex metacharacters, along with practical examples of usage.

Uploaded by

SHADOW GAMING

REGULAR EXPRESSIONS AND OOP CONCEPTS

UNIT – 4 | I SEM | MCA |2024-26 BATCH | RIT


REGEX IN PYTHON

• What is a regular expression?
• Applications of regular expressions
• How to create regular expressions in Python?
REGULAR EXPRESSION

• A regular expression (regex) is a sequence of characters that defines a search pattern.
• It is commonly used for string matching, searching, and replacing text.
• It describes what kind of text to look for inside a larger body of text.
• Python provides the re module to work with regular expressions.


APPLICATIONS OF REGULAR EXPRESSIONS

• Data validation
    • Ex: mobile number validation, email validation, etc.
• Data extraction
    • Specific information can be extracted from data
• Data cleaning, web scraping
• Search functionality similar to
    • Ctrl+F and replace, grep commands (UNIX), LIKE operator in SQL
• Building translators: compilers, interpreters, assemblers
    • For syntax analysis and lexical analysis
• Password policies
• Used in NLP to identify specific patterns in data.
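As an illustration of the data validation use case, here is a minimal sketch. It assumes a hypothetical rule that a mobile number is exactly 10 digits; the rule, the name MOBILE_PATTERN, and the helper is_valid_mobile are assumptions made for this example, not part of the original notes.

```python
import re

# Assumed rule for illustration only: a mobile number is exactly 10 digits.
MOBILE_PATTERN = re.compile(r"[0-9]{10}")

def is_valid_mobile(number: str) -> bool:
    """Return True when the whole string consists of exactly 10 digits."""
    return MOBILE_PATTERN.fullmatch(number) is not None

print(is_valid_mobile("9876543210"))   # True
print(is_valid_mobile("98765-43210"))  # False
```

fullmatch() is used so that extra characters before or after the digits make the validation fail.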
BASIC SEARCH FUNCTIONS

 search()
 match()
 finditer()
 findall()
re.match()

• Purpose: checks for a pattern only at the beginning of a string.
• Syntax: re.match(pattern, string, flags=0)
    • pattern: the regular expression pattern to search for
    • string: the input string to search in
    • flags: optional modifiers such as re.IGNORECASE (default 0, meaning none)
• Returns: a match object if the pattern matches at the beginning of the string; otherwise None.
Using the re Module in Python

Python’s re module provides powerful tools for regex operations.

re.match() – Matches the Beginning of a String. It only checks the start of the string.

import re

pattern = r"Hello"
text = "Hello, world!"

match = re.match(pattern, text)

if match:
    print("Match found!")
else:
    print("No match")
# Output: Match found!

.group() → Returns the actual match.
.span() → Returns the start and end positions of the match.

if match:
    print("Matched text:", match.group())  # Returns matched text ("Hello")
    print("Start and End positions:", match.span())  # Returns (0, 5)
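The flags argument in the re.match() syntax can be sketched as follows. The lowercase sample text is an assumption chosen to show the difference between the default case-sensitive behaviour and re.IGNORECASE.

```python
import re

text = "hello, world!"  # sample text chosen for this sketch

# Without flags the match is case-sensitive, so "Hello" does not match "hello".
strict = re.match(r"Hello", text)

# re.IGNORECASE makes the same pattern match regardless of letter case.
relaxed = re.match(r"Hello", text, flags=re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # hello
print(relaxed.span())   # (0, 5)
```

Because re.match() returns None on failure, check the result before calling .group() or .span().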
What is a Raw String (r"")?

• In Python, the r before a string (like r"^\d$") makes it a raw string literal.

• In a normal string, backslashes (\) are treated as escape characters (e.g., "\n" for a newline, "\t" for a tab).

• A raw string (r"") tells Python not to interpret backslashes as escape sequences.

• In regex, we often use \d, \s, \b, etc., where \ has a special meaning.
Using r"" prevents Python from treating \ as an escape character.

• Always use r"" for regex patterns to avoid unexpected errors.
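A short sketch of why the r prefix matters, using \b as the example. In a normal string "\b" is the backspace character, so the regex engine never sees a word boundary; the sample text is an assumption for illustration.

```python
import re

normal = "\bcat\b"   # "\b" becomes a single backspace character
raw = r"\bcat\b"     # backslashes are kept, so the regex sees \b (word boundary)

print(len(normal))   # 5
print(len(raw))      # 7

text = "The cat is here"
print(re.search(normal, text))       # None
print(re.search(raw, text).group())  # cat
```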


re.search()

• Purpose: the search() function in the re module scans a string for the first occurrence of a pattern.
• Syntax: re.search(pattern, data, flags=0)
    • pattern: the regular expression pattern to search for
    • data: the input string to search in
• Returns: a match object if a match is found, or None otherwise.
re.search() – Finds the First Match Anywhere

Unlike match(), search() checks the entire string.

import re

pattern = r"world"
text = "Hello, world!"

match = re.search(pattern, text)


if match:
    print("Match found!")

# Output: Match found!
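The difference between match() and search() can be seen by running both with the same pattern and text. This is a minimal sketch reusing the example string from above.

```python
import re

text = "Hello, world!"

at_start = re.match(r"world", text)   # "world" is not at position 0
anywhere = re.search(r"world", text)  # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (7, 12)
```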


In Python, you can use regular expressions in two ways:
1. Directly as a string pattern

You pass a raw string directly to functions like re.search(), re.match(), etc.

2. Using Regular Expression Objects

You first compile the pattern using re.compile(), creating a reusable regex
object. This is useful for repeated searches.
Using Regular Expression Objects

import re

pattern = re.compile(r"World") # Compile the regex pattern

text = "Hello, World!"

match = pattern.search(text) # Using the compiled object

if match:
    print("Found:", match.group())
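The benefit of a compiled regex object shows up when the same pattern is applied to several strings. The following sketch assumes a small list of sample lines; the names digits, lines, and found are illustrative only.

```python
import re

# Compile once, then reuse the same pattern object for every string.
digits = re.compile(r"[0-9]+")

lines = ["order 42", "no number here", "room 101"]  # assumed sample data

found = []
for line in lines:
    m = digits.search(line)
    found.append(m.group() if m else None)

print(found)  # ['42', None, '101']
```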
re.finditer()

• Purpose: re.finditer() returns an iterator yielding match objects for all non-overlapping occurrences of a pattern in a string.
• Syntax: re.finditer(pattern, data, flags=0)
    • pattern: the regular expression pattern to search for
    • data: the input string to search in
• Returns: an iterator of match objects.
re.finditer() – Returns Matches as an Iterator

Example 1:

import re

pattern = r"Hello"
text = "Hello, world!"

matches = re.finditer(pattern, text)
for match in matches:
    print(match.group())

# Output: Hello

Example 2:

import re

pattern = re.compile('ab', re.IGNORECASE)
data = 'abaababa'

match_iter = re.finditer(pattern, data)
count = 0
for match in match_iter:
    count += 1
    print(f"start:{match.start()}, end:{match.end()}, element:{match.group()}")
print("total:", count)

Useful when handling large data, as it yields results lazily.
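The lazy behaviour can be observed by pulling matches one at a time with next(). This is a small sketch on the same sample data; the variable names are illustrative.

```python
import re

data = "abaababa"
match_iter = re.finditer(r"ab", data)

first = next(match_iter)                    # only the first match is computed here
print(first.span())                         # (0, 2)

remaining = [m.span() for m in match_iter]  # consumes the rest of the iterator
print(remaining)                            # [(3, 5), (5, 7)]

leftover = list(match_iter)                 # the iterator is now exhausted
print(leftover)                             # []
```

An iterator can be consumed only once, so call re.finditer() again if the matches are needed a second time.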


re.findall()

• Purpose: re.findall() returns a list of all non-overlapping matches of a pattern in a string.
• Syntax: re.findall(pattern, data, flags=0)
    • pattern: the regular expression pattern to search for
    • data: the input string to search in
• Returns: a list containing all matching substrings.
re.findall() – Returns All Matches in a List

import re

pattern = r"[0-9]"  # Find every individual digit
text = "My number is 123 and my friend's is 456"

matches = re.findall(pattern, text)
print(matches)  # Output: ['1', '2', '3', '4', '5', '6']

import re

pattern = re.compile('ab', re.IGNORECASE)
data = 'abaababa'

match_list = re.findall(pattern, data)
print(match_list)  # Output: ['ab', 'ab', 'ab']
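The pattern [0-9] matches one digit at a time, which is why the first example returns six separate strings. As a sketch, adding the + quantifier (one or more occurrences) groups consecutive digits into whole numbers.

```python
import re

text = "My number is 123 and my friend's is 456"

single_digits = re.findall(r"[0-9]", text)   # one digit per match
whole_numbers = re.findall(r"[0-9]+", text)  # runs of one or more digits

print(single_digits)  # ['1', '2', '3', '4', '5', '6']
print(whole_numbers)  # ['123', '456']
```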
DIFFERENCE BETWEEN findall() AND finditer()

Both re.findall() and re.finditer() are used to search for all occurrences of a
pattern in a string, but they differ in how they return results.

Feature | re.findall() | re.finditer()
Return Type | Returns a list of matching substrings. | Returns an iterator yielding match objects.
Memory Usage | Stores all matches in a list (higher memory usage for large data). | Uses an iterator (more memory-efficient).
Accessing Match Info | Only returns matched substrings, no details like position. | Provides full match details (start, end, groups).
Use Case | When only matched strings are needed. | When additional match details (index, groups) are needed.
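The comparison can be sketched by running both functions over the same data. The sample sentence and variable names below are assumptions for illustration.

```python
import re

data = "The price is $100 and the tax is $18."  # assumed sample text
pattern = r"[0-9]+"

# findall(): just the matched strings
as_list = re.findall(pattern, data)

# finditer(): match objects, so positions are available too
with_positions = [(m.group(), m.start(), m.end())
                  for m in re.finditer(pattern, data)]

print(as_list)         # ['100', '18']
print(with_positions)  # [('100', 14, 17), ('18', 34, 36)]
```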
Understanding Non-Overlapping Matches in re.findall() and finditer()

In re.findall() and finditer(), matches are non-overlapping, meaning once a match is found, the search continues after the match, rather than inside it.

import re

data = "ababab"
matches = re.findall(r"aba", data)
print(matches)

#Output: ['aba']
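To make the "search continues after the match" rule concrete, the sketch below compares a string where the second candidate overlaps the first with a string where two matches sit back to back. The second sample string is an assumption added for contrast.

```python
import re

pattern = r"aba"

# In "ababab" the second "aba" would start at index 2, inside the first match,
# so it is skipped.
overlapping = [m.span() for m in re.finditer(pattern, "ababab")]

# In "abaaba" the second "aba" starts at index 3, right after the first match ends.
back_to_back = [m.span() for m in re.finditer(pattern, "abaaba")]

print(overlapping)   # [(0, 3)]
print(back_to_back)  # [(0, 3), (3, 6)]
```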
CHARACTER CLASS IN PYTHON REGEX

• A character class is a set of characters that you define inside a regular expression.
• Character classes are used to specify a range or group of characters to search for in data.
• These classes help in defining flexible patterns for text searching and validation.
Character Classes in Python Regex

Square Brackets [ ]
•Used to define a set of characters.
•Example: [abc] matches 'a', 'b', or 'c'.
Range of Characters
•[a-z] → Matches any lowercase letter (a to z).
•[A-Z] → Matches any uppercase letter (A to Z).
•[0-9] → Matches any digit (0 to 9).
Negation [^ ] (Caret Inside Brackets)
•Matches anything except the characters inside the brackets.
•Example: [^0-9] matches anything except digits.
Predefined Character Classes
•\d → Matches any digit (equivalent to [0-9] for ASCII text).
•\D → Matches any non-digit character (opposite of \d).
•\w → Matches any word character (letters, digits, underscore); equivalent to [a-zA-Z0-9_] for ASCII text.
•\W → Matches any non-word character (opposite of \w).
•\s → Matches any whitespace character (space, tab, newline).
•\S → Matches any non-whitespace character.
•Note: with str patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is used.
Special Character Classes
•[aeiou] → Matches any vowel.
•[13579] → Matches any odd digit.
•[02468] → Matches any even digit.
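A quick way to check these classes is to run each one through re.findall() on the same text. The sample string below is an assumption chosen to contain letters, digits, and punctuation.

```python
import re

text = "Room 42, Block B-7"  # assumed sample text

digits = re.findall(r"[0-9]", text)       # range of digits
uppercase = re.findall(r"[A-Z]", text)    # range of uppercase letters
vowels = re.findall(r"[aeiou]", text)     # explicit set of lowercase vowels
words = re.findall(r"\w+", text)          # runs of word characters
symbols = re.findall(r"[^\w\s]", text)    # negation: not a word char, not whitespace

print(digits)     # ['4', '2', '7']
print(uppercase)  # ['R', 'B', 'B']
print(vowels)     # ['o', 'o', 'o']
print(words)      # ['Room', '42', 'Block', 'B', '7']
print(symbols)    # [',', '-']
```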
Regex Pattern | Meaning | Example | Matches | Does Not Match
\b | Word boundary (start or end of a word) | \bcat\b | "The cat is here" → "cat" | "caterpillar", "wildcat"
\A | Matches only at the start of a string | \AHello | "Hello world" → "Hello" | "world Hello"
\Z | Matches only at the end of a string | tutorial\Z | "This is a tutorial" → "tutorial" | "tutorial on regex"
. | Matches any character except a newline
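The rows for \b, \A, and \Z can be verified with a short sketch that reuses the example strings from the table; the variable names are illustrative.

```python
import re

# \b: "cat" only as a whole word
whole_word = re.findall(r"\bcat\b", "The cat is here")
inside_word = re.findall(r"\bcat\b", "caterpillar wildcat")

# \A: only at the very start of the string
starts = bool(re.search(r"\AHello", "Hello world"))
not_at_start = bool(re.search(r"\AHello", "world Hello"))

# \Z: only at the very end of the string
ends = bool(re.search(r"tutorial\Z", "This is a tutorial"))
not_at_end = bool(re.search(r"tutorial\Z", "tutorial on regex"))

print(whole_word, inside_word)  # ['cat'] []
print(starts, not_at_start)     # True False
print(ends, not_at_end)         # True False
```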


Find digits in given data

Using findall():

import re

pattern = r'[0-9]'
data = "The price is $."

match_list = re.findall(pattern, data)

if match_list:
    print("digits present")
else:
    print("not present")

Using finditer():

import re

pattern = r'[0-9]'
data = "The price is $100."

match_iter = re.finditer(pattern, data)

for match in match_iter:
    print(match)
Table 1: Basic Regex Metacharacters

Symbol Description
. Matches any character except a newline
^ Matches the start of a string
$ Matches the end of a string
* Matches 0 or more occurrences of the preceding character
+ Matches 1 or more occurrences of the preceding character
? Matches 0 or 1 occurrence of the preceding character
{n} Matches exactly n occurrences
{n,} Matches n or more occurrences
{n,m} Matches between n and m occurrences
\ Escape character (e.g., \. matches a literal dot .)
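The quantifiers in Table 1 can be compared side by side with a small sketch. The sample words and the helper matching() are assumptions for illustration; re.fullmatch() is used so that each pattern must cover the entire word.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]  # assumed sample words

def matching(pattern):
    """Return the sample words that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"ca*t"))      # ['ct', 'cat', 'caat', 'caaat']  (0 or more a)
print(matching(r"ca+t"))      # ['cat', 'caat', 'caaat']        (1 or more a)
print(matching(r"ca?t"))      # ['ct', 'cat']                   (0 or 1 a)
print(matching(r"ca{2}t"))    # ['caat']                        (exactly 2 a)
print(matching(r"ca{2,}t"))   # ['caat', 'caaat']               (2 or more a)
print(matching(r"ca{1,2}t"))  # ['cat', 'caat']                 (between 1 and 2 a)
```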
Table 2: Character Classes and Groups

Pattern Description
\d Matches any digit (0-9)
\D Matches any non-digit character
\w Matches any word character (a-z, A-Z, 0-9, _)
\W Matches any non-word character
\s Matches any whitespace (space, tab, newline)
\S Matches any non-whitespace character
[abc] Matches any one of a, b, or c
[^abc] Matches anything except a, b, or c
\b Matches word boundaries (e.g., \bword\b matches "word" only as a whole word)
re.sub() – Replaces Text in a String

import re

text = "Python is fun!"

new_text = re.sub(r"Python", "Java", text)

print(new_text)

# Output: Java is fun!
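re.sub() also accepts a regex pattern rather than a fixed word, and an optional count argument that limits the number of replacements. A minimal sketch with an assumed sample sentence:

```python
import re

text = "Call 98765 or 43210 today"  # assumed sample text

# Replace every digit with "#".
masked_all = re.sub(r"[0-9]", "#", text)

# Replace only the first run of digits (count=1).
masked_first = re.sub(r"[0-9]+", "<number>", text, count=1)

print(masked_all)    # Call ##### or ##### today
print(masked_first)  # Call <number> or 43210 today
```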


https://regexr.com/

https://www.kaggle.com/code/albeffe/regex-exercises-solutions/notebook
Character classes
. any character except newline
\w\d\s word, digit, whitespace
\W\D\S not word, digit, whitespace
[abc] any of a, b, or c
[^abc] not a, b, or c
[a-g] character between a & g
Anchors
^abc$ start / end of the string
\b word boundary
Escaped characters
\. \* \\ escaped special characters
\t \n \r tab, linefeed, carriage return
Groups
(abc) capture group
Quantifiers & Alternation
a* a+ a? 0 or more, 1 or more, 0 or 1
a{5} a{2,} exactly five, two or more
a{1,3} between one & three
a+? a{2,}? match as few as possible
ab|cd match ab or cd
