
Natural Language Processing (NLP) Notes For Unit 2 & Unit 3

1. Morphology
Definition: Morphology is the study of the internal structure of words and how they
are formed. It examines how words are created from smaller meaningful units called
morphemes.

Types of Morphology:

1) Inflectional Morphology:

• Marks grammatical distinctions such as tense, number, and person.
• Does not change the core meaning or the word class.

Examples:

"Walk" -> "Walks" (third person singular).

"Dog" -> "Dogs" (plural).

2) Derivational Morphology:

• Creates new words by adding prefixes or suffixes.
• Often changes the meaning and sometimes the word class.

Examples:

"Happy" -> "Happiness" (adjective to noun).

"Teach" -> "Teacher" (verb to noun).

2. Morphological Analysis and Generation using Finite State Transducers (FSTs)

Morphological Analysis:

The process of breaking down a word into its root form and identifying its affixes
(prefixes, suffixes, infixes, etc.).

Example:
Input: "unhappiness"

Analysis: Root: "happy", Prefix: "un", Suffix: "ness".

Morphological Generation:

The reverse process of creating a valid word by combining a root with affixes.

Example:

Root: "happy", Prefix: "un", Suffix: "ness" -> Output:


"unhappiness".

Finite State Transducers (FSTs):

A computational model using states and transitions to map input strings to output
strings.

Key Features:

Enables bidirectional processing for analysis and generation.

Efficient for handling regular morphological rules.

Example:

Transition: "walk + ed" -> "walked".

3. Part-of-Speech (POS) Tagging


Part of Speech (POS) Tagging is the process of assigning a part of speech to each
word in a given text based on its definition and context. POS tagging is essential for
natural language understanding and serves as a foundation for many NLP tasks, such
as parsing, information extraction, and sentiment analysis.

Parts of Speech

These are grammatical categories that words are assigned to, such as:

1. Noun (NN): Person, place, thing, or idea (e.g., "dog", "India").


2. Verb (VB): Action or state (e.g., "run", "is").
3. Adjective (JJ): Describes a noun (e.g., "beautiful", "tall").
4. Adverb (RB): Modifies a verb, adjective, or another adverb (e.g., "quickly", "very").
5. Pronoun (PRP): Replaces a noun (e.g., "he", "it").
6. Preposition (IN): Links nouns to other words (e.g., "on", "under").
7. Conjunction (CC): Connects words, phrases, or clauses (e.g., "and", "but").
8. Determiner (DT): Specifies a noun (e.g., "the", "a").
9. Interjection (UH): Expresses emotion (e.g., "wow", "oh").

Approaches to POS Tagging

Rule-Based Tagging:

1. Uses a set of predefined linguistic rules to assign tags.
2. Example rule: If a word ends in "-ly", tag it as an adverb (see the sketch below).
3. Advantages: Transparent and easy to understand.
4. Disadvantages: Requires extensive manual rule creation and is less robust to variation.
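
As a rough illustration of the rule-based approach, the sketch below encodes a handful of such rules in Python; the rule set, tag choices, and default are invented for illustration and are far cruder than a production rule-based tagger.

def rule_based_tag(word):
    if word.lower() in {"the", "a", "an"}:
        return "DT"              # closed-class determiners
    if word.endswith("ly"):
        return "RB"              # the "-ly -> adverb" rule from above
    if word.endswith("ing") or word.endswith("ed"):
        return "VB"              # crude verb heuristic
    if word[0].isupper():
        return "NNP"             # crude proper-noun heuristic
    return "NN"                  # default: common noun

sentence = ["The", "dog", "barked", "loudly"]
print([(w, rule_based_tag(w)) for w in sentence])
# [('The', 'DT'), ('dog', 'NN'), ('barked', 'VB'), ('loudly', 'RB')]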

Statistical Tagging:

1. Assigns tags based on probabilities derived from a tagged corpus.
2. Example: Hidden Markov Model (HMM)-based tagging.
   1. HMM assumes:
      1. Each word depends only on its own part-of-speech tag.
      2. Each POS tag depends only on the previous POS tags (Markov assumption).
   2. Calculates the tag sequence T that maximizes the probability P(T | W), where W is the sequence of words (a tiny scoring sketch follows below).
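
To make the HMM idea concrete, the sketch below scores one candidate tag sequence with made-up transition and emission probabilities; a real tagger estimates these from a tagged corpus and searches over all tag sequences with the Viterbi algorithm.

# Invented probabilities, for illustration only.
transition = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VBZ"): 0.4}
emission = {("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VBZ", "runs"): 0.02}

def score(words, tags):
    """P(tags, words) ~ product of P(tag | previous tag) * P(word | tag)."""
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= transition.get((prev, tag), 1e-6) * emission.get((tag, word), 1e-6)
        prev = tag
    return prob

print(score(["the", "dog", "runs"], ["DT", "NN", "VBZ"]))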

Machine Learning-Based Tagging:

1. Uses supervised machine learning to train models on labeled datasets.


2. Popular algorithms:
1. Maximum Entropy Models
2. Support Vector Machines (SVMs)
3. Conditional Random Fields (CRFs)
3. These models consider features like:
1. Current word.
2. Surrounding words (context).
3. Word suffixes/prefixes.
4. Capitalization or punctuation.

Neural Network-Based Tagging:

1. Employs deep learning models like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers.
2. Automatically learns features from raw text.
3. Example: BERT and other pre-trained language models.
4. Advantages: State-of-the-art accuracy and ability to capture long-range
dependencies.
5. Disadvantages: Requires significant computational resources and large labeled
datasets.

Steps in POS Tagging


Tokenization:

1. Split the text into words or tokens.


2. Example: "The cat sleeps." → ["The", "cat", "sleeps"]

Feature Extraction:

1. Extract features like the word itself, its suffix/prefix, capitalization, etc.

Tag Assignment:

1. Assign tags to tokens using one of the above methods.

Evaluation:

1. Compare the predicted tags to a gold-standard tagged dataset using metrics like
accuracy.

Applications of POS Tagging

1. Syntactic Parsing: Helps in building parse trees for sentences.


2. Information Retrieval: Improves keyword matching by understanding word categories.
3. Named Entity Recognition (NER): Tags words as entities like names, places, or dates.
4. Sentiment Analysis: Identifies adjectives or adverbs for understanding sentiments.
5. Speech Recognition: Assists in understanding sentence structure for transcription.

Example: POS Tagging Sentence

Input Sentence:
"The quick brown fox jumps over the lazy dog."

Tagged Output:

• The/DT
• quick/JJ
• brown/JJ
• fox/NN
• jumps/VBZ
• over/IN
• the/DT
• lazy/JJ
• dog/NN

Tools for POS Tagging

NLTK (Python): Easy-to-use library for POS tagging.


import nltk

nltk.download('averaged_perceptron_tagger')   # tagger model, needed once
print(nltk.pos_tag(["The", "quick", "brown", "fox"]))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')]

spaCy: Provides fast and accurate POS tagging (see the sketch after this list).

1. Stanford POS Tagger: Java-based, statistical tagger.


2. Hugging Face Transformers: Neural network-based tagging with pre-trained models.
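
For the spaCy entry above, a short usage sketch is shown below; it assumes spaCy is installed and that the small English model has been downloaded with "python -m spacy download en_core_web_sm".

import spacy

nlp = spacy.load("en_core_web_sm")        # small English pipeline
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.pos_, token.tag_)   # coarse POS and Penn Treebank tag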

By understanding and implementing POS tagging, you can enhance text processing
capabilities for a wide variety of applications.

4. Maximum Entropy Model for POS Tagging


A Maximum Entropy (MaxEnt) Model is a probabilistic model that predicts
outcomes (e.g., part-of-speech tags) based on features of the input data. It is widely
used for POS tagging because it can handle overlapping and interdependent features
efficiently without making strong independence assumptions.

Key Concepts in Maximum Entropy

Maximum Entropy Principle:

1. Among all probability distributions consistent with known constraints, the MaxEnt
model chooses the one with the highest entropy (i.e., the least biased distribution).
2. This ensures no additional assumptions are made about the data beyond what is
supported by evidence.

Conditional Probability:

1. For POS tagging, the model predicts the conditional probability of a tag t given a word w and its context C:

   P(t | C) = (1 / Z(C)) * exp( Σ_i λ_i f_i(t, C) )

   where:
   1. f_i(t, C): binary feature functions that indicate relationships between t, w, and C.
   2. λ_i: weights (parameters) associated with the features.
   3. Z(C): normalization factor ensuring the probabilities sum to 1:

      Z(C) = Σ_{t'} exp( Σ_i λ_i f_i(t', C) )

Feature Functions:

1. These are manually defined or automatically extracted functions indicating specific conditions. Examples:
   1. Does the word end with "-ing"? f(t, C) = 1 if t = VB (verb).
   2. Is the previous word a determiner (e.g., "the")?
   3. Is the current word capitalized?

Steps in Maximum Entropy POS Tagging

Feature Extraction:

1. Define features based on the current word and its context.


2. Example features:
   1. f1: The word itself (e.g., "dog").
   2. f2: Prefixes or suffixes (e.g., "-ing").
   3. f3: Previous or next words in the sequence.
   4. f4: Word shape (e.g., capitalization, digits).

Training the Model:

1. Estimate the weights λ_i for each feature using a tagged training dataset.
2. The weights are learned by maximizing the likelihood of the training data using optimization algorithms like Iterative Scaling or Gradient Descent.

POS Tagging (Inference):

1. For a given word and its context, compute the conditional probability of each possible tag.
2. Assign the tag with the highest probability:

   t̂ = argmax_t P(t | C)

Evaluation:

1. Test the model on a separate dataset and evaluate performance using metrics like
accuracy or F1-score.

Advantages of Maximum Entropy Models

Feature Flexibility:

1. Can incorporate arbitrary, overlapping, and non-independent features.


2. Allows the use of rich linguistic information like prefixes, suffixes, and neighboring
words.

No Independence Assumptions:

1. Unlike models like Hidden Markov Models (HMMs), MaxEnt does not assume
conditional independence between features.

Probabilistic Outputs:

1. Provides probabilities for each tag, which can be useful for downstream tasks
requiring confidence scores.
Challenges

Computational Cost:

1. Training can be slow due to the need to compute the normalization factor Z(C) over all possible tags.

Feature Design:

1. The performance depends heavily on the quality of feature engineering.

Sparse Data:

1. Requires sufficient training data to estimate parameters accurately.

Example

Sentence: "The quick brown fox jumps."

Features for the word "quick":

 Current word = "quick".


 Previous word = "The".
 Next word = "brown".
 Word shape = lowercase.
 Suffix = "-ick".

Feature Function Examples:

• f1(t, C) = 1 if t = JJ and the current word is "quick".
• f2(t, C) = 1 if t = JJ and the previous word is "The".

Practical Tools

1. NLTK (Python):

o Can implement MaxEnt for POS tagging.

2. Stanford NLP Library:

o Includes MaxEnt-based POS taggers.

3. scikit-learn:

o Supports logistic regression (a form of MaxEnt) for classification tasks; a small tagging sketch follows below.
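
Building on the scikit-learn entry above, here is a minimal sketch that treats MaxEnt POS tagging as multinomial logistic regression over hand-written features; the tiny training set, feature templates, and tags are invented purely for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(words, i):
    """Context features for the i-th word, in the spirit of the example above."""
    return {
        "word": words[i].lower(),
        "suffix2": words[i][-2:],
        "prev": words[i - 1].lower() if i > 0 else "<s>",
        "capitalized": words[i][0].isupper(),
    }

train = [
    (["The", "quick", "fox", "jumps"], ["DT", "JJ", "NN", "VBZ"]),
    (["A", "lazy", "dog", "sleeps"], ["DT", "JJ", "NN", "VBZ"]),
]

X, y = [], []
for words, tags in train:
    for i, tag in enumerate(tags):
        X.append(features(words, i))
        y.append(tag)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["The", "brown", "cat", "runs"]
X_test = vec.transform([features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(X_test))))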


Conclusion

The Maximum Entropy model for POS tagging is a powerful method due to its
flexibility in incorporating diverse features and providing robust probabilistic outputs.
While computationally intensive, it remains a popular choice in NLP pipelines.

5. Multi-Word Expressions (MWEs)


Multi-Word Expressions (MWEs) are phrases or word combinations that exhibit
unique properties not entirely predictable from the meanings of their individual
components. They are common in natural language and present challenges for
computational linguistics because their interpretation often requires understanding
them as a single semantic or syntactic unit.

Types of Multi-Word Expressions

Idiomatic Expressions:

1. The meaning of the expression is not literal or compositional.


2. Example: "Kick the bucket" (means "to die").

Collocations:

1. Frequently co-occurring words where the combination is more common than expected.
2. Example: "Strong tea" (not "powerful tea").

Phrasal Verbs:

1. Verbs combined with particles or prepositions, often with idiomatic meanings.


2. Example: "Give up" (means "to surrender" or "stop trying").

Compound Nouns:

1. Two or more words functioning as a single noun.


2. Example: "Toothbrush", "data science".

Light Verb Constructions:

1. A verb paired with a noun to create a meaning different from the verb alone.
2. Example: "Take a walk" (means "to walk").
Named Entities:

1. Names of people, places, organizations, etc., that act as a single unit.


2. Example: "New York City", "United Nations".

Institutionalized Expressions:

1. Fixed phrases or clichés used in formal or informal contexts.


2. Example: "On the other hand", "at the end of the day".

Challenges with MWEs

Non-Compositionality:

1. The meaning of MWEs cannot always be deduced from their parts (e.g., "spill the
beans").

Ambiguity:

1. MWEs can sometimes be literal or idiomatic depending on context (e.g., "break the
ice" in a conversation vs. literally breaking ice).

Flexibility:

1. Some MWEs allow word reordering or substitution, while others are rigid.
2. Example: "Make a decision" vs. "Decision was made".

Sparsity:

1. MWEs are often rare in corpora, making it difficult for models to learn their
properties.

Language-Specificity:

1. MWEs vary greatly between languages, posing challenges for machine translation.

Approaches to Handling MWEs

Lexicon-Based Methods:

1. Maintain a predefined list of MWEs for identification and tagging.


2. Example: Dictionaries of idioms or named entities.

Statistical Methods:

1. Identify MWEs based on co-occurrence patterns in a corpus.


2. Example metrics:
   1. Pointwise Mutual Information (PMI):

      PMI(w1, w2) = log [ P(w1, w2) / ( P(w1) · P(w2) ) ]

   2. T-score: Measures the strength of co-occurrence.
3. These methods are useful for collocations like "strong tea" (a small PMI sketch follows below).
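
A small PMI sketch over a toy corpus is given below; the corpus and counts are invented for illustration, and real collocation extraction would use a much larger corpus together with frequency cutoffs.

import math
from collections import Counter

tokens = "strong tea is strong and strong tea is popular".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """log2 of how much more often w1 w2 co-occur than chance would predict."""
    p_w1, p_w2 = unigrams[w1] / N, unigrams[w2] / N
    p_pair = bigrams[(w1, w2)] / (N - 1)
    return math.log2(p_pair / (p_w1 * p_w2))

print(round(pmi("strong", "tea"), 2))   # clearly positive: a likely collocation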

Parsing-Based Methods:

1. Use syntactic parsers to identify MWEs as single units in parse trees.

Machine Learning Models:

1. Train models to recognize MWEs using annotated data.


2. Features include:
1. Word context.
2. POS tags of components.
3. Dependency relations.

Neural Network Approaches:

1. Use embeddings and sequence models (LSTMs, Transformers) to detect MWEs.


2. Pre-trained models like BERT often learn representations for MWEs during training.

Hybrid Approaches:

1. Combine lexicons, statistical measures, and machine learning for better performance.

Applications of MWE Recognition

Machine Translation:

1. Ensure idiomatic expressions are translated correctly (e.g., "spill the beans" →
"reveal the secret").

Information Retrieval:

1. Improve search accuracy by treating MWEs as single units (e.g., "data science").

Text Summarization:

1. Capture meaningful phrases for concise summaries.

Sentiment Analysis:

1. Handle expressions like "not bad" (positive sentiment despite the negative word
"not").
Speech Recognition:

1. Recognize MWEs as a single concept to improve transcription quality.

Example: Handling MWEs

Input Sentence: "John kicked the bucket last night."

MWE Recognition:

 Identify "kicked the bucket" as a single idiomatic expression.


 Assign the meaning "died" instead of interpreting "kicked" and "bucket" literally.

Conclusion

Multi-Word Expressions are an essential aspect of natural language that adds richness
and complexity. Effectively recognizing and processing MWEs is critical for
applications like machine translation, sentiment analysis, and conversational AI.
Combining linguistic, statistical, and neural approaches can help address the
challenges they present.

6. Role of Language Models


Language Models (LMs) are essential in the field of Natural Language Processing
(NLP) because they help machines understand, process, and generate human language
in a way that mimics how humans use language. Their core function is to predict the
probability of a sequence of words or tokens, enabling a variety of applications that
require comprehension, generation, or transformation of text.

Here are the key roles of language models:

1. Word Prediction and Text Generation

Language models are primarily used for predicting the next word in a sequence,
which is a fundamental task in many NLP applications, such as text completion and
auto-correction. They are also used in text generation, where the model generates
coherent and contextually relevant sentences based on a given prompt.

Example:

o Given the input "The sun is", a language model might predict the next word as
"shining", "bright", or "setting" based on the context.

Applications:
o Autocompletion (in search engines, email, or messaging apps).
o Creative Text Generation (writing stories, generating poetry, etc.).

2. Language Understanding

Language models also facilitate language understanding by enabling machines to interpret text or speech accurately. They can recognize the syntax, semantics, and context in sentences to solve tasks such as:

Sentiment Analysis: Determining whether a sentence is positive, negative, or neutral.

Named Entity Recognition (NER): Identifying people, organizations, locations, and other entities in a text.

Part-of-Speech Tagging: Assigning grammatical labels (e.g., noun, verb, adjective) to each word in a sentence.

Example:

o In the sentence "Apple is looking to buy a startup", an LM helps identify that


"Apple" is a company, not the fruit.

Applications:

o Sentiment analysis in social media, customer reviews.


o Text categorization (e.g., news articles or product classifications).

3. Text Classification

Language models are used in text classification tasks, where the goal is to categorize
text into predefined categories. This includes applications like spam detection, topic
categorization, and intent recognition.

Example:

o A language model can classify a tweet into categories like "politics", "technology",
or "entertainment".

Applications:

o Email spam filters.


o Sentiment analysis in customer feedback.

4. Machine Translation
Language models are central to machine translation, where the goal is to translate
text from one language to another while maintaining meaning and fluency. These
models help ensure that translations are accurate, contextually appropriate, and
grammatically correct.

Example:

o Translate "How are you?" from English to French as "Comment ça va?" using a
language model trained on bilingual data.

Applications:

o Google Translate, DeepL, and other translation tools.

5. Question Answering

Language models also power question answering systems, where they can extract or
generate answers to user queries from a given context, such as a passage of text or a
database.

Example:

o Given the input "What is the capital of France?", a language model can provide the
output "Paris".

Applications:

o Virtual assistants (e.g., Siri, Alexa).


o Search engines (Google, Bing) providing direct answers to queries.

6. Conversational Agents (Chatbots)

Language models are used to build chatbots and virtual assistants that engage in
human-like conversation. These systems process user input, generate meaningful and
contextually appropriate responses, and can carry on multi-turn dialogues.

Example:

o A customer service chatbot powered by a language model can understand and respond to queries like "Where is my order?" or "How can I return an item?"

Applications:

o Customer support bots.


o Personal assistants (e.g., Google Assistant, Apple Siri).

7. Text Summarization
Language models also assist in text summarization, where the goal is to generate a
concise and coherent summary of a longer document while preserving the main ideas.
There are two types of summarization:

Extractive Summarization: The model selects key sentences directly from


the text.

Abstractive Summarization: The model generates new sentences that


paraphrase the original content.

Example:

o A model can summarize a lengthy article about climate change into a few key points,
such as "Climate change is causing sea levels to rise and extreme weather events to
increase."

Applications:

o News article summarization.


o Legal or academic document summarization.

8. Speech Recognition

Language models play a crucial role in speech recognition, which converts spoken
language into text. The LM helps disambiguate similar-sounding words and corrects
possible errors based on context.

Example:

o In speech-to-text systems, a model might correct "I scream" to "ice cream" based on
context.

Applications:

o Virtual assistants like Google Assistant or Apple Siri.


o Voice typing and transcription services.

9. Information Retrieval and Search

Language models help in information retrieval, where they can improve search
engines by ranking documents based on their relevance to a query. The model
understands the query context and retrieves the documents that best match the intent of the user.

Example:

o A search engine query like "Best Italian restaurants in New York" will return a list of
relevant restaurants based on the query’s semantics
Applications:

o Web search engines (Google, Bing).


o Document or database retrieval systems.

10. Text-to-Speech (TTS) Systems

Language models can also be used in text-to-speech systems, where they help
generate natural-sounding speech from text. This process involves understanding the
linguistic structures and converting them into phonetic transcriptions, followed by
speech synthesis.

Example:

o A language model helps convert the sentence "Hello, how are you?" into fluent and
natural speech.

Applications:

o Assistive technologies for the visually impaired.


o Voice-based assistants.

Challenges in Language Models

Bias and Fairness:

o Language models may reflect biases present in their training data, leading to biased
outputs. Addressing this is a major challenge in developing responsible AI systems.

Context Understanding:

o While language models can understand context, they still struggle with long-term
dependencies or highly ambiguous situations.

Data and Resource Intensity:

o Training state-of-the-art language models requires massive amounts of data and computational resources, making them expensive and inaccessible to smaller organizations.

Ethical Concerns:

o Language models may be misused for generating misleading information (e.g., fake
news) or malicious activities (e.g., phishing).

Conclusion
Language models serve as the backbone for many modern AI systems, enabling
machines to process, understand, and generate human language in a wide range of
applications, from chatbots and virtual assistants to machine translation and text
summarization. As NLP and AI technology evolve, the role of language models will
continue to expand, making interactions with machines more natural and intuitive.

7. Simple N-Gram Models


An N-gram model is a probabilistic language model used to predict the next word in a sequence based on the previous N−1 words. It is one of the simplest methods for modeling sequences of text or speech.

N-Gram Definition:

1. An N-gram is a contiguous sequence of N items (words, characters, etc.) in text or speech. For example:
   1. Unigram (N = 1): "I", "love", "coding"
   2. Bigram (N = 2): "I love", "love coding"
   3. Trigram (N = 3): "I love coding"

Conditional Probability:

1. The probability of a word w_i given the previous N−1 words is written as:

   P(w_i | w_{i−N+1}, …, w_{i−1})

Markov Assumption:

1. The N-gram model simplifies language modeling by assuming that the probability of a word depends only on the previous N−1 words:

   P(w_1, w_2, …, w_T) ≈ Π_{i=1}^{T} P(w_i | w_{i−N+1}, …, w_{i−1})

   For example, in a bigram model (N = 2):

   P(w_1, w_2, …, w_T) ≈ Π_{i=1}^{T} P(w_i | w_{i−1})

Training N-Gram Models:

1. The probabilities are estimated from a corpus by counting the occurrences of word sequences:

   P(w_i | w_{i−N+1}, …, w_{i−1}) = Count(w_{i−N+1}, …, w_i) / Count(w_{i−N+1}, …, w_{i−1})

Advantages

1. Simplicity:

Easy to implement and understand.

2. Efficiency:

Computation is straightforward, especially for small N.

3. Interpretability:
Counts and probabilities can be easily understood.

Challenges

Data Sparsity:

1. Many valid N-grams never appear in the training data, leading to zero probabilities for unseen sequences.

Scalability:

1. For large N, the number of possible N-grams grows exponentially, requiring large amounts of data.

Short Context:

1. N-grams rely on a fixed-length context, limiting the ability to capture long-term dependencies in text.

Smoothing in N-Gram Models

To address the issue of zero probabilities, smoothing techniques are applied:

1. Laplace Smoothing: Adds a small constant (e.g., 1) to all counts.


2. Good-Turing Smoothing: Adjusts counts based on the frequency of frequencies.
3. Kneser-Ney Smoothing: Redistributes probabilities more effectively by considering lower-
order models.

Applications

 Predictive Text: Suggesting the next word in typing.


 Speech Recognition: Transcribing spoken language.
 Machine Translation: Translating sequences of text.
 Spam Filtering: Analyzing email content.

Example: Bigram Model

Training Corpus:

I love coding. I love AI.
Bigram Probabilities:

 P("love"∣ "I")=Count("I love")Count("I")=22=1P("love"∣ "I")=Count("I")Count("I love")=22=1


 P("coding"∣ "love")=Count("love coding")Count("love")=12P("coding"∣ "love")=Count("love")
Count("love coding")=21

Sentence Probability:

For the sentence "I love coding":

P("I love coding")=P("I")⋅P("love"∣"I")⋅P("coding"∣"love")P("I lo


ve coding")=P("I")⋅P("love"∣"I")⋅P("coding"∣"love")

If P("I")P("I") is assumed uniform:

P("I")=13, P("love"∣"I")=1, P("coding"∣"love")=0.5P("I")=31


,P("love"∣"I")=1,P("coding"∣"love")=0.5P("I love coding")=13⋅1⋅0.5
=0.1667P("I love coding")=31⋅1⋅0.5=0.1667

This simple model can predict text probabilities, detect patterns, or suggest
completions.
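
A minimal sketch of this bigram calculation in Python is shown below. Note that P("I") is taken here as a relative frequency over all tokens (including punctuation), so the final number differs slightly from the uniform 1/3 assumption used in the worked example above.

from collections import Counter

tokens = "I love coding . I love AI .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Maximum likelihood estimate: Count(prev, word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("love", "I"))       # 1.0
print(p_bigram("coding", "love"))  # 0.5

p_sentence = (unigrams["I"] / len(tokens)) * p_bigram("love", "I") * p_bigram("coding", "love")
print(round(p_sentence, 4))        # 0.125 with the relative-frequency P("I")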

8. Estimating Parameters and Smoothing


Estimating Parameters

In the context of probabilistic models (e.g., in Natural Language Processing or Machine Learning), parameter estimation refers to determining the values of parameters that best represent the data. These parameters define the model's behavior, such as the probabilities in probabilistic models.

Key Methods for Parameter Estimation

Maximum Likelihood Estimation (MLE):

1. The goal is to find the parameter values that maximize the likelihood of the observed data.
2. For a dataset D, the likelihood L(θ | D) of parameter θ is:

   L(θ | D) = Π_{i=1}^{N} P(x_i | θ)

   where the x_i are the data points.
3. The log-likelihood is often used for easier computation:

   ℓ(θ | D) = Σ_{i=1}^{N} log P(x_i | θ)

Bayesian Estimation (Maximum A Posteriori - MAP):

1. Incorporates prior beliefs about the parameters using Bayes' Theorem:

   P(θ | D) = P(D | θ) P(θ) / P(D)

2. The objective is to maximize P(θ | D), combining prior knowledge P(θ) with the observed data likelihood P(D | θ).
Method of Moments:

1. Matches moments (e.g., mean, variance) of the distribution to those of the data for
parameter estimation.

Smoothing

Smoothing addresses the issue of assigning non-zero probabilities to unseen events in probabilistic models, particularly language models. Without smoothing, unseen events have a probability of zero, which can disrupt calculations and lead to poor generalization.

Common Smoothing Techniques

Laplace Smoothing (Add-One Smoothing):

1. Adds 1 to each count to ensure no probability is zero (a small sketch follows after this list).
2. Adjusted probability:

   P(w_i | w_{i−1}) = ( C(w_{i−1}, w_i) + 1 ) / ( C(w_{i−1}) + V )

   where C(w_{i−1}, w_i) is the bigram count, C(w_{i−1}) is the unigram count, and V is the vocabulary size.

Additive Smoothing (Generalization of Laplace):

1. Adds a constant α > 0 (instead of 1) to each count:

   P(w_i | w_{i−1}) = ( C(w_{i−1}, w_i) + α ) / ( C(w_{i−1}) + αV )

Good-Turing Smoothing:

1. Adjusts probabilities based on the frequency of frequencies.
2. Probability mass from rare events is redistributed to unseen events.
3. Adjusted count:

   C* = (r + 1) · N_{r+1} / N_r

   where r is the observed frequency and N_r is the number of events with frequency r.

Kneser-Ney Smoothing:

1. A more advanced smoothing method for language models.


2. Combines a back-off mechanism with adjusted probabilities, ensuring better
handling of rare and unseen n-grams.

Jelinek-Mercer Smoothing:

1. A linear interpolation approach:

   P(w_i | w_{i−1}) = λ · P_MLE(w_i | w_{i−1}) + (1 − λ) · P_MLE(w_i)

   where λ is a weighting parameter.
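
As referenced above, here is a minimal sketch of Laplace (add-one) smoothing applied to the toy bigram counts from the N-gram section; the corpus and vocabulary are the same small illustrative example.

from collections import Counter

tokens = "I love coding . I love AI .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)   # vocabulary size

def p_laplace(word, prev):
    """Add-one smoothed bigram probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(round(p_laplace("coding", "love"), 3))  # seen bigram: 2/7 ~ 0.286
print(round(p_laplace("pizza", "love"), 3))   # unseen bigram: 1/7 ~ 0.143, not zero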

Why Estimation and Smoothing Are Crucial


1. Parameter Estimation determines the structure of the model, ensuring it aligns with
observed data.
2. Smoothing prevents overfitting to seen data and enhances generalization to unseen data,
critical for robust probabilistic models like language models.

9. Evaluating Language Models


Evaluating a language model involves assessing its ability to perform various
language-related tasks effectively and accurately. This evaluation can be performed
using different methodologies, depending on the intended purpose of the model.
Below is a breakdown of common aspects of language model evaluation:

1. Intrinsic Evaluation

Measures the quality of the model's linguistic capabilities directly, often on predefined tasks.

• Perplexity: Evaluates how well a model predicts a sample; lower perplexity indicates better predictive accuracy (a small sketch follows after this list).
• BLEU/ROUGE/METEOR Scores: Measure how closely the model-generated text matches reference text. Commonly used in machine translation and summarization tasks.
• Grammaticality: Evaluates the grammatical correctness of generated sentences.
• Language Understanding: Tasks like word similarity, analogy completion, and syntactic parsing.
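
As mentioned in the perplexity bullet above, here is a minimal sketch of how perplexity is computed from per-word probabilities; the probability values are made up for illustration.

import math

# P(w_i | history) assigned by some language model to each word (made-up values).
word_probs = [0.25, 1.0, 0.5]

log_prob = sum(math.log2(p) for p in word_probs)
perplexity = 2 ** (-log_prob / len(word_probs))
print(perplexity)   # 2.0 here; lower perplexity means better prediction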

2. Extrinsic Evaluation

Focuses on how well the model performs in downstream tasks:

• Text Classification: E.g., sentiment analysis, spam detection.
• Named Entity Recognition (NER): Identifying entities like names, places, or dates.
• Machine Translation: Assessing translation quality across languages.
• Question Answering: Evaluating accuracy and relevance in responses to queries.

3. Human Evaluation

Uses human judgment to assess aspects not easily measured by automated metrics:

• Fluency: How natural the generated text sounds.
• Relevance: Appropriateness of the model's output to a prompt.
• Creativity: Ability to produce novel and meaningful text.
4. Ethical and Fairness Considerations

Evaluates the model's social impact and fairness:

• Bias and Fairness: Testing for discriminatory or biased outputs.
• Toxicity: Ensuring the model doesn't generate harmful or offensive text.
• Inclusivity: Checking for language inclusiveness and diversity.

5. Robustness and Generalization

Tests how well the model performs under challenging conditions:

• Adversarial Testing: Inputs designed to trick or confuse the model.
• Domain Generalization: Evaluating the model on data from domains it wasn't trained on.

6. Scalability and Efficiency

Measures practical usability and resource constraints:

• Latency: Time taken to generate responses.
• Memory Usage: Computational and memory efficiency.
• Energy Consumption: Evaluating the carbon footprint of running the model.

Example Evaluation Metrics for Popular Tasks

Task                 | Metric            | Explanation
Machine Translation  | BLEU              | Matches n-grams in the generated and reference translations.
Summarization        | ROUGE             | Measures overlap of words/phrases with reference summaries.
Language Modeling    | Perplexity        | Measures the predictive quality of the model's probabilities.
Text Generation      | Human Evaluation  | Judges fluency and relevance.
Sentiment Analysis   | Accuracy/F1-Score | Measures classification correctness.

Frameworks and Tools for Evaluation

• GLUE (General Language Understanding Evaluation): A benchmark suite for NLP tasks.
• SuperGLUE: More challenging tasks for advanced models.
• HumanEval: Human-based tasks for assessing language generation.
• LAMBADA: Tests understanding of long-range dependencies.

Evaluation of a language model depends heavily on its intended application, balancing automated metrics, human feedback, and ethical concerns for a comprehensive assessment.
