Module 1 NLP

The document covers various concepts in Natural Language Processing (NLP), including regular expressions, tokenization, text normalization, stemming, lemmatization, and byte pair encoding. It discusses the importance of these techniques in improving model performance and handling text variations. Additionally, it explains vocabulary dynamics through Herdan’s Law and differentiates between wordforms and wordtypes.

Regular Expressions
• Pattern-matching tools used to search, extract, or manipulate text based on specific rules.
• A sequence of characters defining a search pattern:
  i) Literals: exact matches (e.g., cat matches "cat").
  ii) Metacharacters: special symbols with wildcard behaviors:
    . → any single character (e.g., c.t matches "cat", "cot").
    * → 0 or more repetitions (e.g., a* matches "", "a", "aa").
    + → 1 or more repetitions (e.g., a+ matches "a", "aa").
    \d → digit (0–9); \w → word character (a–z, A–Z, 0–9, _).
    [ ] → character set (e.g., [aeiou] matches any vowel).
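The metacharacters above can be checked directly with Python's standard `re` module (a minimal illustration, not from the slides):

```python
import re

# Each pattern below exercises one metacharacter from the list above.
assert re.fullmatch(r"c.t", "cat")       # . matches any single character
assert re.fullmatch(r"c.t", "cot")
assert re.fullmatch(r"a*", "")           # * allows zero repetitions
assert re.fullmatch(r"a+", "aa")         # + requires at least one
assert not re.fullmatch(r"a+", "")
assert re.fullmatch(r"\d\d", "42")       # \d matches a digit
assert re.fullmatch(r"\w+", "word_2")    # \w: letters, digits, underscore
assert re.fullmatch(r"[aeiou]", "e")     # character set: any vowel
```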
Regex Use Cases
• Tokenization
• Cleaning text
• Extracting patterns
• Rule-based matching
• Replacing text

Strengths & Limitations
Negation in Regular Expressions
• Used to exclude specific characters, words, or patterns from matches.
• Used in NLP for tasks like:
  i) Sentiment analysis (e.g., ignoring negated phrases)
  ii) Data cleaning (e.g., removing unwanted symbols)

Types of Negation
• Negating character sets: [^ ]
• Negating word boundaries: \B
• Negative lookahead: (?! )
• Negative lookbehind: (?<! )
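One small example per negation type, using Python's `re` (illustrative strings are made up):

```python
import re

# [^ ]   - negated character set: any character that is NOT a vowel
assert re.findall(r"[^aeiou]", "cat") == ["c", "t"]

# \B     - non-word-boundary: "ing" only as a suffix, not as its own word
assert re.findall(r"\Bing\b", "sing ing") == ["ing"]

# (?! )  - negative lookahead: "bank" not followed by " account"
assert re.findall(r"bank(?! account)", "bank account vs river bank") == ["bank"]

# (?<! ) - negative lookbehind: "happy" not preceded by "un"
assert re.findall(r"(?<!un)happy", "unhappy but happy") == ["happy"]
```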
Disjunction
• Allows you to match one pattern OR another using the pipe symbol |.
• It's a way to specify multiple alternative patterns in a single expression.
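A quick sketch of disjunction with `re`; note that parentheses limit the scope of `|`:

```python
import re

# | matches either alternative
assert re.findall(r"cat|dog", "cat and dog") == ["cat", "dog"]

# group with (...) so the alternation applies only to one letter
assert re.fullmatch(r"gr(a|e)y", "grey") is not None
assert re.fullmatch(r"gr(a|e)y", "gray") is not None
```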
Text Normalization
• Process of transforming text into a consistent, standardized format to improve the performance of NLP models.
• It prepares raw text for analysis by handling variations in spelling, grammar, and formatting.
• Normalization ensures uniformity, reducing complexity for NLP tasks like:
  • Tokenization
  • Sentiment analysis
  • Machine translation
  • Search engines
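A minimal normalization sketch (lowercasing, punctuation removal, whitespace collapsing — one possible pipeline, not a complete one):

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)     # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace

assert normalize("The CAT,  sat!!") == "the cat sat"
```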
Tokenization
• Split text into individual words (tokens).
• Example:
  • Sentence: "The cat sat on the mat."
  • Tokens: ["the", "cat", "sat", "on", "the", "mat"]
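The example sentence can be tokenized with a one-line regex (a simple word-character tokenizer; real tokenizers handle clitics, hyphens, etc.):

```python
import re

sentence = "The cat sat on the mat."
# \w+ pulls out runs of word characters, dropping the final period;
# lowercasing first reproduces the token list shown above.
tokens = re.findall(r"\w+", sentence.lower())
assert tokens == ["the", "cat", "sat", "on", "the", "mat"]
```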
Tokenization With vs. Without Spaces
• Tokenization in space-delimited languages
  • Examples: English, French, Spanish
  • Rule: words are separated by whitespace.
• Tokenization in non-space-delimited languages
  • Examples: Chinese, Japanese, Thai
  • Rule: no spaces between words → requires advanced methods.
Word Normalization
• Process of transforming words into a standardized format to reduce variability and improve computational analysis.
• Key goals:
  • Reduce noise: handle misspellings, abbreviations, and formatting inconsistencies.
  • Improve consistency: treat similar words (e.g., "run" vs. "running") as equivalent.
  • Enhance model performance: simplify patterns for ML models by reducing vocabulary size.
Stemming
• Crudely chops off word endings (prefixes/suffixes), e.g., "running" → "run".
• Fast but inaccurate ("flies" → "fli").
• Ignores word meaning/grammar.
• Works for many languages with simple rules.
• Rule-based (heuristic).
• Use cases: search engines, quick preprocessing.
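A deliberately crude suffix-stripping stemmer (NOT the Porter algorithm — just a sketch of why rule-based stemming is fast but inaccurate):

```python
# Hypothetical suffix list; real stemmers apply ordered, conditioned rules.
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def crude_stem(word: str) -> str:
    """Chop the first matching suffix if enough of the word remains."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

assert crude_stem("flies") == "fli"      # over-stemming, as noted above
assert crude_stem("running") == "runn"   # blind chopping, not "run"
```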
Lemmatization
• Linguistically reduces words to their base form (lemma) using dictionaries, e.g., "better" → "good".
• Slower but more accurate.
• Considers POS (e.g., verb/noun distinction).
• Requires language-specific dictionaries.
• Use cases: sentiment analysis, machine translation.
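Since lemmatization is dictionary lookup keyed by word and POS, a toy lookup table can stand in for a real lexicon such as WordNet (the table entries here are illustrative):

```python
# Toy (word, POS) -> lemma table; a real lemmatizer consults a full lexicon.
LEMMA_DICT = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary lemma, falling back to the word itself."""
    return LEMMA_DICT.get((word, pos), word)

assert lemmatize("better", "ADJ") == "good"   # POS-aware, unlike stemming
assert lemmatize("ran", "VERB") == "run"
```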
Byte Pair Encoding
• Subword tokenization algorithm widely used in NLP (e.g., GPT, BERT) to
handle rare/unknown words.

• Balances vocabulary size and coverage by merging frequent character


pairs iteratively.

• Goal: Compress text by replacing frequent pairs of bytes (or characters)


with a new symbol.

• NLP Adaptation: Split words into subword units (e.g., "unhappiness"


→ ["un", "happiness"]).
Byte Pair Algorithm

Byte Pair Example
Minimum Edit Distance (MED)
• Measures the smallest number of operations required to transform one tokenized sequence into another. These operations include:
  • Insertion
  • Deletion
  • Substitution
MED-Cost Of Operations
MED Example
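The standard dynamic-programming solution can be sketched as follows (substitution cost defaults to 2, the Levenshtein variant commonly taught with this algorithm; set it to 1 for plain edit distance):

```python
def min_edit_distance(src: str, tgt: str, sub_cost: int = 2) -> int:
    """Fill a (len(src)+1) x (len(tgt)+1) table of prefix distances."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                     # delete all of src's prefix
    for j in range(1, m + 1):
        d[0][j] = j                     # insert all of tgt's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = src[i - 1] == tgt[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost) # substitution
            )
    return d[n][m]

assert min_edit_distance("intention", "execution") == 8  # textbook example
assert min_edit_distance("graffe", "giraffe", sub_cost=1) == 1
```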
MED-Applications
• Spell correction (e.g., "graffe" → "giraffe").
• Noisy text normalization (e.g., "New Yrok" → "New York").
• Token-level alignment: compare tokenized outputs from different models.
Herdan’s Law (Heaps’ Law)
• Relationship between vocabulary size (unique words) and corpus size (total words) in a language.
• The number of unique words (V) in a text grows polynomially with the total number of words (N), following:

  V = K · N^β    (K and β are constants, with 0 < β < 1)

Key Implications
• Sublinear growth: vocabulary grows more slowly than corpus size (e.g., doubling N does not double V).
• Diminishing returns: new unique words keep appearing, but at an ever-slower rate (related to Zipf’s Law: most additional tokens are repeats of frequent words).
• Language/genre-dependent:
  • English: K ≈ 30, β ≈ 0.5 (empirically).
  • Twitter data: higher β (more neologisms/typos).
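Plugging in the English-like values quoted above (K ≈ 30, β ≈ 0.5) shows the sublinear growth numerically — doubling the corpus grows the vocabulary by only √2 ≈ 1.41×:

```python
K, beta = 30, 0.5  # illustrative constants from the slide, not universal

def vocab_size(n_tokens: int) -> float:
    """Heaps'/Herdan's law estimate: V(N) = K * N**beta."""
    return K * n_tokens ** beta

v1 = vocab_size(1_000_000)   # about 30,000 types at 1M tokens
v2 = vocab_size(2_000_000)   # doubling N...
assert round(v1) == 30000
assert round(v2 / v1, 2) == 1.41   # ...grows V by only ~41%
```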
Implications
• Resource allocation: predicts memory needs for vocabularies (e.g., in search engines).
• Tokenization: explains why subword methods (BPE) outperform word-level models (vocabulary grows more slowly).
• Dataset curation: guides how much text is needed to cover a language’s lexicon.
Wordforms and WordTypes
• Wordform: a specific surface form of a word as it appears in text (including inflections).
  • Example: "running", "ran", "runs" (all forms of the lemma "run")
• WordType: a unique lexical entry representing a distinct meaning + part-of-speech (POS).
  • Example: "bank" (noun: financial) vs. "bank" (noun: river)
Key Differences

Aspect       | Wordform                                       | WordType
-------------|------------------------------------------------|-----------------------------------------------
Focus        | Form: how a word is written/spoken.            | Meaning + POS: a unique lexical identity.
Variability  | Changes with inflection (e.g., tense, number). | Invariant (groups all forms of a word sense).
Example      | "goes", "went", "going" (of "go")              | "lead" (verb) vs. "lead" (noun: metal)
NLP Use Case | Tokenization, spelling correction.             | Word sense disambiguation (WSD), translation.