Module 1 NLP

The document covers various concepts in Natural Language Processing (NLP), including regular expressions, tokenization, text normalization, stemming, lemmatization, and byte pair encoding. It discusses the importance of these techniques in improving model performance and handling text variations. Additionally, it explains vocabulary dynamics through Herdan’s Law and differentiates between wordforms and wordtypes.

Regular Expressions
• Pattern-matching tools used to search, extract, or manipulate text based on specific rules.
• A sequence of characters defining a search pattern:
  i) Literals: exact matches (e.g., cat matches "cat").
  ii) Metacharacters: special symbols with wildcard behaviors:
    . → any single character (e.g., c.t matches "cat", "cot").
    * → 0 or more repetitions (e.g., a* matches "", "a", "aa").
    + → 1 or more repetitions (e.g., a+ matches "a", "aa").
    \d → digit (0–9); \w → word character (a–z, A–Z, 0–9, _).
    [ ] → character set (e.g., [aeiou] matches any vowel).
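The metacharacters above can be checked directly with Python's standard `re` module (a minimal illustration, not from the slides):

```python
import re

# Each pattern below exercises one metacharacter from the list above.
assert re.fullmatch(r"c.t", "cat")       # . matches any single character
assert re.fullmatch(r"c.t", "cot")
assert re.fullmatch(r"a*", "")           # * allows zero repetitions
assert re.fullmatch(r"a+", "aa")         # + requires at least one
assert not re.fullmatch(r"a+", "")
assert re.fullmatch(r"\d\d", "42")       # \d matches a digit
assert re.fullmatch(r"\w+", "word_2")    # \w: letters, digits, underscore
assert re.fullmatch(r"[aeiou]", "e")     # character set: any vowel
```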
Regex Use Cases
• Tokenization
• Cleaning text
• Extracting patterns
• Rule-based matching
• Replacing text

Strengths & Limitations
Negation in Regular Expressions
• Used to exclude specific characters, words, or patterns from matches.
• Used in NLP for tasks like:
  i) Sentiment analysis (e.g., ignoring negated phrases)
  ii) Data cleaning (e.g., removing unwanted symbols)

Types of Negation
• Negating character sets: [^ ]
• Negating word boundaries: \B
• Negative lookahead: (?! )
• Negative lookbehind: (?<! )
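One small example per negation type, using Python's `re` (illustrative strings are made up):

```python
import re

# [^ ]   - negated character set: any character that is NOT a vowel
assert re.findall(r"[^aeiou]", "cat") == ["c", "t"]

# \B     - non-word-boundary: "ing" only as a suffix, not as its own word
assert re.findall(r"\Bing\b", "sing ing") == ["ing"]

# (?! )  - negative lookahead: "bank" not followed by " account"
assert re.findall(r"bank(?! account)", "bank account vs river bank") == ["bank"]

# (?<! ) - negative lookbehind: "happy" not preceded by "un"
assert re.findall(r"(?<!un)happy", "unhappy but happy") == ["happy"]
```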
Disjunction
• Allows you to match one pattern OR another using the pipe symbol |.
• It's a way to specify multiple alternative patterns in a single expression.
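A quick sketch of disjunction with `re`; note that parentheses limit the scope of `|`:

```python
import re

# | matches either alternative
assert re.findall(r"cat|dog", "cat and dog") == ["cat", "dog"]

# group with (...) so the alternation applies only to one letter
assert re.fullmatch(r"gr(a|e)y", "grey") is not None
assert re.fullmatch(r"gr(a|e)y", "gray") is not None
```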
Text Normalization
• Process of transforming text into a consistent, standardized format to improve the performance of NLP models.
• It prepares raw text for analysis by handling variations in spelling, grammar, and formatting.
• Normalization ensures uniformity, reducing complexity for NLP tasks like:
  • Tokenization
  • Sentiment analysis
  • Machine translation
  • Search engines
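A minimal normalization sketch (lowercasing, punctuation removal, whitespace collapsing — one possible pipeline, not a complete one):

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)     # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace

assert normalize("The CAT,  sat!!") == "the cat sat"
```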
Tokenization
• Split text into individual words (tokens).
• Example:
  • Sentence: "The cat sat on the mat."
  • Tokens: ["the", "cat", "sat", "on", "the", "mat"]
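The example sentence can be tokenized with a one-line regex (a simple word-character tokenizer; real tokenizers handle clitics, hyphens, etc.):

```python
import re

sentence = "The cat sat on the mat."
# \w+ pulls out runs of word characters, dropping the final period;
# lowercasing first reproduces the token list shown above.
tokens = re.findall(r"\w+", sentence.lower())
assert tokens == ["the", "cat", "sat", "on", "the", "mat"]
```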
Tokenization With vs. Without Spaces
• Tokenization in space-delimited languages
  • Examples: English, French, Spanish
  • Rule: words are separated by whitespace.
• Tokenization in non-space-delimited languages
  • Examples: Chinese, Japanese, Thai
  • Rule: no spaces between words → requires advanced methods.
Word Normalization
• Process of transforming words into a standardized format to reduce variability and improve computational analysis.
• Key goals:
  • Reduce noise: handle misspellings, abbreviations, and formatting inconsistencies.
  • Improve consistency: treat similar words (e.g., "run" vs. "running") as equivalent.
  • Enhance model performance: simplify patterns for ML models by reducing vocabulary size.
Stemming
• Crudely chops off word endings (prefixes/suffixes), e.g., "running" → "run".
• Fast but inaccurate ("flies" → "fli").
• Ignores word meaning/grammar.
• Works for many languages with simple rules.
• Rule-based (heuristic).
• Use cases: search engines, quick preprocessing.
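A deliberately crude suffix-stripping stemmer (NOT the Porter algorithm — just a sketch of why rule-based stemming is fast but inaccurate):

```python
# Hypothetical suffix list; real stemmers apply ordered, conditioned rules.
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def crude_stem(word: str) -> str:
    """Chop the first matching suffix if enough of the word remains."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

assert crude_stem("flies") == "fli"      # over-stemming, as noted above
assert crude_stem("running") == "runn"   # blind chopping, not "run"
```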
Lemmatization
• Linguistically reduces words to their base form (lemma) using dictionaries, e.g., "better" → "good".
• Slower but more accurate.
• Considers POS (e.g., verb/noun distinction).
• Requires language-specific dictionaries.
• Use cases: sentiment analysis, machine translation.
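Since lemmatization is dictionary lookup keyed by word and POS, a toy lookup table can stand in for a real lexicon such as WordNet (the table entries here are illustrative):

```python
# Toy (word, POS) -> lemma table; a real lemmatizer consults a full lexicon.
LEMMA_DICT = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary lemma, falling back to the word itself."""
    return LEMMA_DICT.get((word, pos), word)

assert lemmatize("better", "ADJ") == "good"   # POS-aware, unlike stemming
assert lemmatize("ran", "VERB") == "run"
```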
Byte Pair Encoding
• Subword tokenization algorithm widely used in NLP (e.g., GPT, BERT) to
handle rare/unknown words.

• Balances vocabulary size and coverage by merging frequent character


pairs iteratively.

• Goal: Compress text by replacing frequent pairs of bytes (or characters)


with a new symbol.

• NLP Adaptation: Split words into subword units (e.g., "unhappiness"


→ ["un", "happiness"]).
Byte Pair Algorithm

Byte Pair Example
Minimum Edit Distance (MED)
• Measures the smallest number of operations required to transform one tokenized sequence into another. These operations include:
  • Insertion
  • Deletion
  • Substitution
MED-Cost Of Operations
MED Example
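The standard dynamic-programming solution can be sketched as follows (substitution cost defaults to 2, the Levenshtein variant commonly taught with this algorithm; set it to 1 for plain edit distance):

```python
def min_edit_distance(src: str, tgt: str, sub_cost: int = 2) -> int:
    """Fill a (len(src)+1) x (len(tgt)+1) table of prefix distances."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                     # delete all of src's prefix
    for j in range(1, m + 1):
        d[0][j] = j                     # insert all of tgt's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = src[i - 1] == tgt[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost) # substitution
            )
    return d[n][m]

assert min_edit_distance("intention", "execution") == 8  # textbook example
assert min_edit_distance("graffe", "giraffe", sub_cost=1) == 1
```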
MED-Applications
• Spell correction (e.g., "graffe" → "giraffe").
• Noisy text normalization (e.g., "New Yrok" → "New York").
• Token-level alignment: compare tokenized outputs from different models.
Herdan’s Law (Heaps’ Law)
• Relationship between vocabulary size (unique words) and corpus size (total words) in a language.
• The number of unique words (V) in a text grows polynomially with the total number of words (N), following:

  V = K · N^β    (K and β are constants, with 0 < β < 1)

Key Implications
• Sublinear growth: vocabulary grows more slowly than corpus size (e.g., doubling N does not double V).
• Diminishing returns: new unique words keep appearing, but at an ever-slower rate (related to Zipf’s Law: most additional tokens are repeats of frequent words).
• Language/genre-dependent:
  • English: K ≈ 30, β ≈ 0.5 (empirically).
  • Twitter data: higher β (more neologisms/typos).
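Plugging in the English-like values quoted above (K ≈ 30, β ≈ 0.5) shows the sublinear growth numerically — doubling the corpus grows the vocabulary by only √2 ≈ 1.41×:

```python
K, beta = 30, 0.5  # illustrative constants from the slide, not universal

def vocab_size(n_tokens: int) -> float:
    """Heaps'/Herdan's law estimate: V(N) = K * N**beta."""
    return K * n_tokens ** beta

v1 = vocab_size(1_000_000)   # about 30,000 types at 1M tokens
v2 = vocab_size(2_000_000)   # doubling N...
assert round(v1) == 30000
assert round(v2 / v1, 2) == 1.41   # ...grows V by only ~41%
```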
Implications
• Resource allocation: predicts memory needs for vocabularies (e.g., in search engines).
• Tokenization: explains why subword methods (BPE) outperform word-level models (vocabulary grows more slowly).
• Dataset curation: guides how much text is needed to cover a language’s lexicon.
Wordforms and WordTypes
• Wordform: a specific surface form of a word as it appears in text (including inflections).
  • Example: "running", "ran", "runs" (all forms of the lemma "run")
• WordType: a unique lexical entry representing a distinct meaning + part-of-speech (POS).
  • Example: "bank" (noun: financial) vs. "bank" (noun: river)
Key Differences

Aspect       | Wordform                                       | WordType
-------------|------------------------------------------------|-----------------------------------------------
Focus        | Form: how a word is written/spoken.            | Meaning + POS: a unique lexical identity.
Variability  | Changes with inflection (e.g., tense, number). | Invariant (groups all forms of a word sense).
Example      | "goes", "went", "going" (of "go")              | "lead" (verb) vs. "lead" (noun: metal)
NLP Use Case | Tokenization, spelling correction.             | Word sense disambiguation (WSD), translation.