Unit - 4 Regex
A regular expression (regex) is a pattern that describes the kind of text to look for within a larger body of text.
Common uses:
• Data validation, e.g. mobile number validation, email validation
• Data extraction: pulling specific information out of data
• Data cleaning and web scraping
• Search features such as Ctrl+F find and replace, the grep command (UNIX), and pattern matching similar to the LIKE operator in SQL
• Building translators (compilers, interpreters, assemblers), particularly for lexical analysis and syntax analysis
• Password policies
• NLP, to identify specific patterns in data
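As an illustration of the data validation use case, here is a minimal sketch. The rule (exactly 10 digits, first digit 6 to 9) and the helper name is_valid_mobile are assumptions made for this example, not a rule stated in these notes; the pattern syntax itself is covered in the later sections.

```python
import re

# Assumed rule for illustration: exactly 10 digits, first digit 6-9.
MOBILE_PATTERN = re.compile(r"[6-9]\d{9}")

def is_valid_mobile(number: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    # fullmatch() requires the pattern to cover the entire string.
    return MOBILE_PATTERN.fullmatch(number) is not None

print(is_valid_mobile("9876543210"))  # True
print(is_valid_mobile("12345"))       # False
```

Using fullmatch() rather than search() avoids accepting a valid number embedded in a longer invalid string.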
BASIC SEARCH FUNCTIONS
search()
match()
finditer()
findall()
re.match()
re.match() – Matches the Beginning of a String. It only checks the start of the string.
• In Python, the r before a string (like r"^\d$") makes it a raw string literal.
• In regex, we often use \d, \s, \b, etc., where \ has a special meaning.
Using r"" prevents Python from treating \ as an escape character.
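The two points above can be sketched together: re.match() succeeds only when the pattern matches at index 0, and the r prefix changes what the string literal contains. The sample text is chosen only for illustration.

```python
import re

text = "Hello, world!"

at_start = re.match(r"Hello", text)        # pattern occurs at index 0
not_at_start = re.match(r"world", text)    # pattern occurs later, so no match
found_anywhere = re.search(r"world", text) # search() scans the whole string

print(at_start.group())        # Hello
print(not_at_start)            # None
print(found_anywhere.group())  # world

# Without r, "\b" is a single backspace character.
# With r, r"\b" is a backslash followed by b, which regex reads as a word boundary.
print(len("\b"), len(r"\b"))   # 1 2
```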
re.search() – Finds the First Match Anywhere
Purpose: The search() function in the re module scans a string for the first occurrence of a pattern.
Syntax: re.search(pattern, data)
pattern: the regular expression pattern to search for
data: the input string to search in
Returns: a match object if a match is found, or None if no match is found
import re
pattern = r"world"
text = "Hello, world!"
match = re.search(pattern, text)  # scans the whole string, not just the start
if match:
    print("Found:", match.group())  # Found: world
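Using Regular Expression Objects
A pattern can be used in two ways:
• Directly: pass the raw string pattern to module-level functions such as re.search() and re.match().
• Compiled: first compile the pattern with re.compile(), which creates a reusable regex object. This is useful for repeated searches.
import re
regex = re.compile(r"world")  # compile once, reuse the object
match = regex.search("Hello, world!")
if match:
    print("Found:", match.group())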
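To show why a compiled object helps with repeated searches, here is a small sketch that reuses one compiled pattern across several strings. The sample strings and the variable names are made up for illustration.

```python
import re

# Compile once; the sample strings below are made up for illustration.
word_pattern = re.compile(r"world")
lines = ["Hello, world!", "Goodbye", "world peace"]

positions = []
for line in lines:
    # The compiled object offers search(), match(), findall(), finditer().
    match = word_pattern.search(line)
    positions.append(match.start() if match else None)

print(positions)  # [7, None, 0]
```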
re.finditer()
Both re.findall() and re.finditer() are used to search for all occurrences of a
pattern in a string, but they differ in how they return results.
Feature               re.findall()                        re.finditer()
Accessing Match Info  Only returns matched substrings,    Provides full match details
                      no details like position.           (start, end, groups).
Use Case              When only matched strings           When additional match details
                      are needed.                         (index, groups) are needed.
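The difference in what the two functions return can be sketched as follows. The sample text is an assumption chosen for illustration.

```python
import re

text = "Order 12 shipped, order 345 pending"

# findall() returns a plain list of the matched substrings.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['12', '345']

# finditer() yields match objects, so positions are available too.
details = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(details)  # [('12', 6, 8), ('345', 24, 27)]
```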
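Understanding Non-Overlapping Matches in re.findall() and finditer()
Both functions return non-overlapping matches: once a match is found, scanning resumes at the end of that match. In "ababab", the first "aba" covers indexes 0 to 2, so the "aba" starting at index 2 overlaps it and is not reported.
import re
data = "ababab"
matches = re.findall(r"aba", data)
print(matches)
# Output: ['aba']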
CHARACTER CLASS IN PYTHON REGEX
Square Brackets [ ]
• Used to define a set of characters.
• Example: [abc] matches 'a', 'b', or 'c'.
Range of Characters
• [a-z] → Matches any lowercase letter (a to z).
• [A-Z] → Matches any uppercase letter (A to Z).
• [0-9] → Matches any digit (0 to 9).
Negation [^ ] (Caret Inside Brackets)
• Matches any single character except the characters inside the brackets.
• Example: [^0-9] matches any character that is not a digit.
Predefined Character Classes
• \d → Matches any digit (equivalent to [0-9] for ASCII text).
• \D → Matches any non-digit character (equivalent to [^0-9] for ASCII text).
• \w → Matches any word character (letters, digits, underscore); [a-zA-Z0-9_] for ASCII text.
• \W → Matches any non-word character (opposite of \w).
• \s → Matches any whitespace character (space, tab, newline).
• \S → Matches any non-whitespace character.
• Note: with str patterns in Python 3, \d, \w, and \s also match Unicode digits, letters, and whitespace unless the re.ASCII flag is used.
Special Character Classes
• [aeiou] → Matches any lowercase vowel.
• [13579] → Matches any odd digit.
• [02468] → Matches any even digit.
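The character classes above can be tried out with re.findall(). This is a minimal sketch; the sample string is made up for illustration.

```python
import re

sample = "Room 42, Floor 7!"

digits = re.findall(r"\d", sample)        # predefined class for digits
uppercase = re.findall(r"[A-Z]", sample)  # range of characters
vowels = re.findall(r"[aeiou]", sample)   # custom set (lowercase vowels only)
non_word = re.findall(r"\W", sample)      # anything that is not a word character
not_digits = re.findall(r"[^0-9]", "a1b2")  # negated set

print(digits)      # ['4', '2', '7']
print(uppercase)   # ['R', 'F']
print(vowels)      # ['o', 'o', 'o', 'o']
print(non_word)    # [' ', ',', ' ', ' ', '!']
print(not_digits)  # ['a', 'b']
```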
Regex Pattern  Meaning                      Example   Matches                     Does Not Match
\b             Word boundary (start or      \bcat\b   "The cat is here" → "cat"   "caterpillar", "wildcat"
               end of a word)
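The word boundary row can be checked with a short sketch that uses the same three sample strings as the table. The variable names are chosen for this example.

```python
import re

whole_word_cat = re.compile(r"\bcat\b")

results = {}
for text in ["The cat is here", "caterpillar", "wildcat"]:
    # \b requires a non-word character (or the string edge) on each side of "cat".
    results[text] = bool(whole_word_cat.search(text))

print(results)
# {'The cat is here': True, 'caterpillar': False, 'wildcat': False}
```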
Symbol  Description
.       Matches any character except a newline
^       Matches the start of a string
$       Matches the end of a string
*       Matches 0 or more occurrences of the preceding character
+       Matches 1 or more occurrences of the preceding character
?       Matches 0 or 1 occurrence of the preceding character
{n}     Matches exactly n occurrences
{n,}    Matches n or more occurrences
{n,m}   Matches between n and m occurrences
\       Escape character (e.g., \. matches a literal dot .)
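A short sketch of the symbols in the table above; the input strings are assumptions chosen to make each difference visible.

```python
import re

star = re.findall(r"ab*", "a ab abb")            # b may appear 0 or more times
plus = re.findall(r"ab+", "a ab abb")            # b must appear at least once
optional = re.findall(r"colou?r", "color colour")  # u is optional
counted = re.findall(r"\d{2,3}", "1 22 333 4444")  # 2 to 3 digits per match

# ^ and $ anchor the pattern to the start and end of the string.
all_digits = bool(re.search(r"^\d+$", "2024"))
not_all_digits = bool(re.search(r"^\d+$", "2024a"))

# An unescaped dot matches any character; \. matches only a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")

print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['color', 'colour']
print(counted)   # ['22', '333', '444']
print(all_digits, not_all_digits)  # True False
print(any_char, literal_dot)       # ['abc', 'a.c'] ['a.c']
```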
Table 2: Character Classes and Groups
Pattern  Description
\d       Matches any digit (0-9)
\D       Matches any non-digit character
\w       Matches any word character (a-z, A-Z, 0-9, _)
\W       Matches any non-word character
\s       Matches any whitespace (space, tab, newline)
\S       Matches any non-whitespace character
[abc]    Matches any one of a, b, or c
[^abc]   Matches anything except a, b, or c
\b       Matches word boundaries (e.g., \bword\b matches the word "word" exactly)
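re.sub() – Replaces Text in a String
Syntax: re.sub(pattern, replacement, data)
Returns: a new string in which every match of the pattern is replaced; the original string is not modified.
import re
text = "Call 98765 now"  # example input, assumed for illustration
new_text = re.sub(r"\d", "#", text)  # replace every digit with #
print(new_text)  # Call ##### now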
https://www.kaggle.com/code/albeffe/regex-exercises-solutions/notebook
Character classes
. any character except newline
\w\d\s word, digit, whitespace
\W\D\S not word, digit, whitespace
[abc] any of a, b, or c
[^abc] not a, b, or c
[a-g] character between a & g
Anchors
^abc$ start / end of the string
\b word boundary
Escaped characters
\. \* \\ escaped special characters
\t \n \r tab, linefeed, carriage return
Groups
(abc) capture group
Quantifiers & Alternation
a* a+ a? 0 or more, 1 or more, 0 or 1
a{5} a{2,} exactly five, two or more
a{1,3} between one & three
a+? a{2,}? match as few as possible
ab|cd match ab or cd
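The group, lazy quantifier, and alternation entries in the summary above can be sketched as follows. The sample strings and variable names are assumptions made for illustration.

```python
import re

# Capture groups: group(0) is the whole match, group(1), group(2) are the parts.
m = re.search(r"(\d{4})-(\d{2})", "Released 2024-05")
print(m.group(0), m.group(1), m.group(2))  # 2024-05 2024 05

# Greedy quantifiers take as much as possible; adding ? takes as little as possible.
greedy = re.findall(r"a+", "aaaa")
lazy = re.findall(r"a+?", "aaaa")
lazy_min_two = re.findall(r"a{2,}?", "aaaa")
print(greedy)        # ['aaaa']
print(lazy)          # ['a', 'a', 'a', 'a']
print(lazy_min_two)  # ['aa', 'aa']

# Alternation: match either ab or cd.
either = re.findall(r"ab|cd", "abcdab")
print(either)  # ['ab', 'cd', 'ab']
```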