Unit-3 (NLP)
The roots of language modeling as it exists today can be traced back to 1948. That
year, Claude Shannon published a paper titled "A Mathematical Theory of
Communication." In it, he detailed the use of a stochastic model called the Markov
chain to create a statistical model for the sequences of letters in English text. This
paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. Markov models are still used today, and n-grams in particular are closely tied to this concept.
An n-gram is a contiguous sequence of n items extracted from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a large text or speech corpus.
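To make the definition concrete, here is a minimal Python sketch of extracting word n-grams from a sentence; the function name extract_ngrams and the example sentence are our own illustrative choices, not part of any standard library.

def extract_ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "there was heavy rain".split()
print(extract_ngrams(tokens, 1))  # unigrams: ('there',), ('was',), ('heavy',), ('rain',)
print(extract_ngrams(tokens, 2))  # bigrams: ('there', 'was'), ('was', 'heavy'), ('heavy', 'rain')
print(extract_ngrams(tokens, 3))  # trigrams: ('there', 'was', 'heavy'), ('was', 'heavy', 'rain')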
But how do we calculate the probability of a word sequence? The answer lies in the chain rule of probability:
P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, …, wn−1)
Generalizing, an n-gram model approximates each conditional probability using only the previous n−1 words: P(wi | w1, …, wi−1) ≈ P(wi | wi−n+1, …, wi−1).
• For a unigram model: P(wi | w1, …, wi−1) ≈ P(wi)
• For a bigram model: P(wi | w1, …, wi−1) ≈ P(wi | wi−1)
• For a trigram model: P(wi | w1, …, wi−1) ≈ P(wi | wi−2, wi−1)
EXAMPLES:
Consider two sentences: "There was heavy rain" vs. "There was heavy flood".
From experience, we know that the former sentence sounds better. An N-gram
model will tell us that "heavy rain" occurs much more often than "heavy flood" in
the training corpus. Thus, the first sentence is more probable and will be
selected by the model.
An n-gram model (here, a bigram model) for the above example would calculate the following probability:
P(There was heavy rain) ≈ P(There) · P(was | There) · P(heavy | was) · P(rain | heavy)
A model that simply relies on how often a word occurs without looking at
previous words is called unigram. If a model considers only the previous word
to predict the current word, then it's called bigram. If two previous words are
considered, then it's a trigram model.
• What are some limitations of N-gram models? A model trained on the works of Shakespeare will not give good predictions when applied to another genre. We therefore need to ensure that the training corpus looks similar to the test corpus.
Estimating Parameters:
In an N-gram model, the parameters to be estimated are the probabilities of
each n-gram (i.e., a sequence of n words) occurring in the training corpus. For
example, to estimate the probability of a bigram (i.e., a sequence of two
words), we count the number of times that bigram appears in the corpus and
divide it by the number of times the first word of that bigram appears in the
corpus. Similarly, to estimate the probability of a trigram (i.e., a sequence of
three words), we count the number of times that trigram appears in the corpus
and divide it by the number of times the first two words of that trigram appear
in the corpus.
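As a hedged illustration of this counting procedure, the sketch below estimates bigram probabilities from a toy corpus using exactly the count ratio described above; the function name train_bigram_model, the sentence-boundary markers, and the toy corpus are our own assumptions.

from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1) from tokenized sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]           # sentence boundary markers
        unigram_counts.update(tokens[:-1])           # denominators: counts of the first word
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1] for (w1, w2), c in bigram_counts.items()}

corpus = [["there", "was", "heavy", "rain"],
          ["there", "was", "heavy", "snow"]]
model = train_bigram_model(corpus)
print(model[("heavy", "rain")])  # 0.5: "rain" follows "heavy" in 1 of the 2 occurrences of "heavy"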
Smoothing:
N-gram models suffer from the problem of sparsity, where many possible n-
grams have not been observed in the training data. To address this issue,
smoothing techniques are used to assign non-zero probabilities to unseen n-
grams. One of the most commonly used smoothing techniques in N-gram
models is the add-k smoothing method, where a small constant k is added to
the count of each n-gram. Another popular smoothing technique is the Good-
Turing smoothing, which estimates the probability of unseen n-grams by using
the frequencies of n-grams that occur only once in the training corpus.
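The following sketch shows how add-k smoothing changes the bigram estimate; here V is the vocabulary size, k is the smoothing constant (k = 1 gives add-one/Laplace smoothing), and the count dictionaries are assumed to be like the Counters built inside the previous sketch. The function name addk_bigram_prob is our own.

def addk_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2, k=1.0):
    """Add-k smoothed estimate: P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)."""
    return (bigram_counts.get((w1, w2), 0) + k) / (unigram_counts.get(w1, 0) + k * vocab_size)

# An unseen bigram such as ("heavy", "flood") now receives a small non-zero probability
# instead of zero, so a sentence containing it is no longer assigned probability 0.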
• One of the main steps in working with language models is to evaluate their performance before using them in further tasks.
• This lets us build confidence in how the language model will behave in NLP applications and also lets us know if there are any places where the model may behave uncharacteristically.
In practice, we need to decide on the dataset to use, the method to evaluate, and
also select a metric to evaluate language models. Let us learn about each of the
elements further.
Extrinsic Evaluation: the language model is embedded in a downstream task (such as machine translation or speech recognition) and judged by how much it improves that task's performance.
Intrinsic Evaluation: the language model is evaluated on its own by measuring how well it predicts a held-out test set; the most common intrinsic metric is perplexity.
Perplexity
The Intuition
• The basic intuition is that the lower the perplexity measure is, the better the language model is at modeling unseen sentences.
• Perplexity can also be seen as a simple monotonic function of entropy.
But perplexity is often used instead of entropy due to the fact that it
is arguably more intuitive to our human minds than entropy.
Calculating Perplexity
• Perplexity of a probability model like a language model in NLP: for an unknown probability distribution p, a proposed probability model q, and a test sample x1, …, xN, we can evaluate the perplexity measure mathematically as
PP(q) = b^( −(1/N) · Σ_{i=1..N} log_b q(xi) )
o We can choose b as 2.
o In general, better models assign higher probabilities to the test
events, hence good models will have lower perplexity values and
are less surprised by the test sample.
o If all the probabilities were 1, then the perplexity would be 1 and
the model would perfectly predict the text. Conversely, the
perplexity will be higher for poorer language models.
PP(p) := 2^H(p) = 2^( −Σ_x p(x) log2 p(x) ) = Π_x p(x)^( −p(x) )
o Where H(p) is the entropy (in bits) of the distribution and x ranges
over events which we will learn about further.
o Perplexity of a random variable X may be defined as the perplexity
of the distribution over its possible values x.
• One other formulation for Perplexity from the perspective of language
models in NLP: It is the multiplicative inverse of the probability assigned
to the test set by the language model normalized by the number of
words in the test set.
o We can define perplexity mathematically as:
o PP(W) = P(w1 w2 … wN)^( −1/N )
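The sketch below computes the perplexity PP(W) = P(w1 w2 … wN)^(−1/N) of a test sentence under a bigram model, working in log space to avoid numerical underflow. The helper name bigram_perplexity is ours, and prob is assumed to be any function returning a non-zero estimate of P(w2 | w1), e.g. the add-k smoothed estimate from the earlier sketch.

import math

def bigram_perplexity(test_tokens, prob):
    """Perplexity of a token sequence under a bigram model.

    prob(w1, w2) must return a non-zero estimate of P(w2 | w1).
    """
    tokens = ["<s>"] + test_tokens + ["</s>"]
    log_prob = sum(math.log2(prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
    n = len(tokens) - 1                  # number of predicted tokens
    return 2 ** (-log_prob / n)          # lower perplexity = better model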
POS Tagging:
Part-of-speech (POS) tagging is the process of assigning a grammatical category, or part of speech, to each word in a text. POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition, and machine translation. It can also be used to identify the grammatical structure of a sentence and to disambiguate words that have multiple meanings.
POS tagging is typically performed using machine learning algorithms, which are trained on
a large annotated corpus of text. The algorithm learns to predict the correct POS tag for a
given word based on the context in which it appears.
There are various POS tagging schemes that have been developed, each with its own set of
tags and rules. Some common POS tagging schemes include the Penn Treebank tag set and
the Universal Dependencies tag set.
For example, consider the sentence “The cat sat on the mat.” Its POS tags are:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
In this example, each word in the sentence has been labelled with its corresponding part of
speech. The determiner “the” is used to identify specific nouns, while the noun “cat” refers
to a specific animal. The verb “sat” describes an action, and the preposition “on” describes
the relationship between the cat and the mat.
Identifying the part of speech of a word is not just a matter of mapping each word to a fixed POS tag: the same word might have different POS tags in different contexts, so it is not possible to rely on a single, fixed mapping from words to POS tags.
When you have a huge corpus, manually finding the part of speech of each word is not a scalable solution, as the tagging itself might take days. This is why we rely on tool-based POS tagging.
But why are we tagging these words with their parts of speech?
• To improve the accuracy of NLP tasks: POS tagging can help improve the
performance of various NLP tasks, such as named entity recognition and text
classification. By providing additional context and information about the words in
a text, we can build more accurate and sophisticated algorithms.
• To facilitate research in linguistics: POS tagging can also be used to study the
patterns and characteristics of language use and to gain insights into the structure
and function of different parts of speech.
Developing a POS tagger typically involves the following steps (a minimal end-to-end sketch follows this list):
• Collect a dataset of annotated text: This dataset will be used to train and test the POS tagger. The text should be annotated with the correct POS tags for each word.
• Pre-process the text: This may include tasks such as tokenization (splitting the text
into individual words), lowercasing, and removing punctuation.
• Divide the dataset into training and testing sets: The training set will be used to
train the POS tagger, and the testing set will be used to evaluate its performance.
• Train the POS tagger: This may involve building a statistical model, such as a
hidden Markov model (HMM), or defining a set of rules for a rule-based or
transformation-based tagger. The model or rules will be trained on the annotated
text in the training set.
• Test the POS tagger: Use the trained model or rules to predict the POS tags of the
words in the testing set. Compare the predicted tags to the true tags and calculate
metrics such as precision and recall to evaluate the performance of the tagger.
• Fine-tune the POS tagger: If the performance of the tagger is not satisfactory,
adjust the model or rules and repeat the training and testing process until the
desired level of accuracy is achieved.
• Use the POS tagger: Once the tagger is trained and tested, it can be used to perform
POS tagging on new, unseen text. This may involve preprocessing the text and
inputting it into the trained model or applying the rules to the text. The output
will be the predicted POS tags for each word in the text.
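To make these steps concrete, here is a hedged end-to-end sketch using a tiny hand-annotated toy corpus and a most-frequent-tag baseline; the data, tag names, and function names are our own illustrative assumptions, and a real system would use a large annotated corpus (such as the Penn Treebank) and a stronger model.

from collections import Counter, defaultdict

# Steps 1-2: a tiny annotated, pre-tokenized dataset (toy data, for illustration only)
tagged_sents = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
    [("the", "DET"), ("dog", "NOUN"), ("sat", "VERB"), ("on", "ADP"), ("a", "DET"), ("rug", "NOUN")],
    [("a", "DET"), ("cat", "NOUN"), ("ran", "VERB")],
]

# Step 3: divide the dataset into training and testing sets
train, test = tagged_sents[:2], tagged_sents[2:]

# Step 4: "train" a most-frequent-tag baseline tagger
tag_counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        tag_counts[word][tag] += 1
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag(words, default="NOUN"):
    """Tag each word with its most frequent training tag, falling back to a default."""
    return [(w, most_frequent_tag.get(w, default)) for w in words]

# Step 5: evaluate tagging accuracy on the test set
gold = [pair for sent in test for pair in sent]
pred = tag([w for w, _ in gold])
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(accuracy)  # 2/3 here: the unseen word "ran" gets the default tag and is wrong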
Common applications of POS tagging include:
• Named entity recognition: POS tagging can be used to identify and classify named
entities in a text, such as people, places, and organizations. This is useful for tasks
such as building customer profiles or identifying key figures in a news story.
• Text classification: POS tagging can be used to help classify texts into different
categories, such as spam emails or sentiment analysis. By analysing the POS tags
of the words in a text, algorithms can better understand the content and tone of
the text.
• Machine translation: POS tagging can be used to help translate texts from one
language to another by identifying the grammatical structure and relationships
between words in the source language and mapping them to the target language.
In a rule-based POS tagging system, words are assigned POS tags based on their
characteristics and the context in which they appear. For example, a rule-based POS tagger
might assign the tag “noun” to any word that ends in “-tion” or “-ment,” as these suffixes are
often used to form nouns.
Rule-based POS taggers can be relatively simple to implement and are often used as a
starting point for more complex machine learning-based taggers. However, they can be less
accurate and less efficient than machine learning-based taggers, especially for tasks with
large or complex datasets.
• Define a set of rules for assigning POS tags to words. For example, a rule might tag any word ending in “-tion” or “-ment” as a noun, and tag words such as “the” and “a” as determiners.
• Iterate through the words in the text and apply the rules to each word in turn, assigning the tag of the first matching rule and a default tag when no rule matches (a small sketch follows the next paragraph).
This is a very basic example of a rule-based POS tagger, and more complex systems can
include additional rules and logic to handle more varied and nuanced text.
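A hedged sketch of such a rule-based tagger follows, using the suffix rule mentioned above plus a small closed-class word list; the rule set, the tag names (DET, ADP, NOUN, VERB), and the default tag are illustrative assumptions, not a standard rule inventory.

def rule_based_tag(words):
    """Assign a POS tag to each word using simple hand-written rules."""
    determiners = {"the", "a", "an"}
    prepositions = {"on", "in", "at", "by"}
    tags = []
    for w in words:
        lw = w.lower()
        if lw in determiners:
            tags.append((w, "DET"))
        elif lw in prepositions:
            tags.append((w, "ADP"))
        elif lw.endswith(("tion", "ment")):
            tags.append((w, "NOUN"))      # suffix rule from the text above
        elif lw.endswith(("ed", "ing")):
            tags.append((w, "VERB"))      # illustrative extra suffix rule
        else:
            tags.append((w, "NOUN"))      # fall back to a default tag
    return tags

print(rule_based_tag("The government announced a new taxation policy".split()))
# [('The', 'DET'), ('government', 'NOUN'), ('announced', 'VERB'), ('a', 'DET'),
#  ('new', 'NOUN'), ('taxation', 'NOUN'), ('policy', 'NOUN')]
# Note that the adjective "new" is mis-tagged, illustrating the limits of simple rules.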
In statistical POS tagging, a model is trained on a large annotated corpus of text to learn the
patterns and characteristics of different parts of speech. The model uses this training data
to predict the POS tag of a given word based on the context in which it appears and the
probability of different POS tags occurring in that context.
Statistical POS taggers can be more accurate and efficient than rule-based taggers,
especially for tasks with large or complex datasets. However, they require a large amount of
annotated training data and can be computationally intensive to train.
• Collect a large annotated corpus of text and divide it into training and testing sets.
• Train a statistical model on the training data, using techniques such as maximum
likelihood estimation or hidden Markov models.
• Use the trained model to predict the POS tags of the words in the testing data.
• Evaluate the performance of the model by comparing the predicted tags to the true
tags in the testing data and calculating metrics such as precision and recall.
• Fine-tune the model and repeat the process until the desired level of accuracy is
achieved.
• Use the trained model to perform POS tagging on new, unseen text.
There are various statistical techniques that can be used for POS tagging, and the choice of
technique will depend on the specific characteristics of the dataset and the desired level of
accuracy.
In TBT, a set of rules is defined to transform the tags of words in a text based on the context
in which they appear. For example, a rule might change the tag of a verb to a noun if it
appears after a determiner such as “the.” The rules are applied to the text in a specific order,
and the tags are updated after each transformation.
TBT can be more accurate than rule-based tagging, especially for tasks with complex
grammatical structures. However, it can be more computationally intensive and requires a
larger set of rules to achieve good performance.
• If the word is a verb and appears after a determiner, change the tag to “noun.”
• If the word is a noun and appears after an adjective, change the tag to “adjective.”
• Iterate through the words in the text and apply the rules in a specific order. For
example:
• In the sentence “The walk was long,” the word “walk,” initially tagged as a verb, follows the determiner “The,” so the first rule would change its tag to “noun.”
• In the sentence “The red cat sat on the mat,” the word “cat” follows the adjective “red,” so the second rule would change its tag to “adjective.”
This is a very basic example of a TBT system, and more complex systems can include
additional rules and logic to handle more varied and nuanced text.
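Below is a minimal sketch of a single transformation pass that applies the two example rules above to an already-tagged sentence. In a real Brill-style tagger the transformation rules are learned from annotated data; here they are hard-coded for illustration, and the tag names are our own.

def apply_transformations(tagged):
    """Apply context-sensitive transformation rules to an initial tagging."""
    rules = [
        ("VERB", "DET", "NOUN"),   # a verb after a determiner  -> noun
        ("NOUN", "ADJ", "ADJ"),    # a noun after an adjective  -> adjective
    ]
    tagged = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        prev_tag = tagged[i - 1][1]
        for current, previous, new in rules:
            if tag == current and prev_tag == previous:
                tagged[i] = (word, new)
                break
    return tagged

# Initial (possibly wrong) tags produced by a simpler tagger:
initial = [("The", "DET"), ("walk", "VERB"), ("was", "VERB"), ("long", "ADJ")]
print(apply_transformations(initial))  # "walk" is re-tagged as NOUN by the first rule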
Hidden Markov models (HMMs) are a type of statistical model that can be used for part-of-
speech (POS) tagging in natural language processing (NLP). In an HMM-based POS tagger, a
model is trained on a large annotated corpus of text to learn the patterns and characteristics
of different parts of speech. The model uses this training data to predict the POS tag of a
given word based on the probability of different tags occurring in the context of the word.
An HMM-based POS tagger consists of a set of states, each corresponding to a possible POS
tag, and a set of transitions between the states. The model is trained on the training data to
learn the probabilities of transitioning from one state to another and the probabilities of
observing different words given a particular state.
To perform POS tagging on a new text using an HMM-based tagger, the model uses the
probabilities learned during training to compute the most likely sequence of POS tags for
the words in the text. This is typically done using the Viterbi algorithm, which calculates the
probability of each possible sequence of tags and selects the most likely one.
HMMs are widely used for POS tagging and other tasks in NLP due to their ability to model
complex sequential data and their efficiency in computation. However, they can be sensitive
to the quality of the training data and may require a large amount of annotated data to
achieve good performance.
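As a hedged sketch of how the transition and emission probabilities described above can be estimated by counting over a tagged corpus (the toy data, the "<s>"/"</s>" boundary symbols, and the function name train_hmm are our own conventions); decoding with these tables is shown in the Viterbi sketch at the end of this unit.

from collections import Counter, defaultdict

def train_hmm(tagged_sents):
    """Estimate HMM transition P(tag_i | tag_{i-1}) and emission P(word | tag) tables."""
    transition_counts = defaultdict(Counter)   # previous tag -> Counter of next tags
    emission_counts = defaultdict(Counter)     # tag -> Counter of emitted words
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            prev = tag
        transition_counts[prev]["</s>"] += 1
    transition = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
                  for p, cs in transition_counts.items()}
    emission = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
                for t, cs in emission_counts.items()}
    return transition, emission

tagged_sents = [[("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
                [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB")]]
transition, emission = train_hmm(tagged_sents)
print(transition["DET"]["NOUN"])   # 1.0: a determiner is always followed by a noun here
print(emission["NOUN"]["cat"])     # 0.5: half of the NOUN emissions are "cat"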
Challenges in POS Tagging
Some common challenges in part-of-speech (POS) tagging include:
• Ambiguity: Some words can have multiple POS tags depending on the context in
which they appear, making it difficult to determine their correct tag. For example,
the word “bass” can be a noun (a type of fish) or an adjective (having a low
frequency or pitch).
• Out-of-vocabulary (OOV) words: Words that are not present in the training data of
a POS tagger can be difficult to tag accurately, especially if they are rare or specific
to a particular domain.
• Complex grammatical structures: Languages with complex grammatical
structures, such as languages with many inflections or free word order, can be
more challenging to tag accurately.
• Lack of annotated training data: Some languages or domains may have limited
annotated training data, making it difficult to train a high-performing POS tagger.
• Inconsistencies in annotated data: Annotated data can sometimes contain errors
or inconsistencies, which can negatively impact the performance of a POS tagger.
Sequence Labeling
Sequence labeling is a typical NLP task which assigns a class or label to each token in a given
input sequence. In this context, a single word will be referred to as a “token”. These tags or
labels can be used in further downstream models as features of the token, or to enhance
search quality by naming spans of tokens. In question answering and search tasks, we can use these spans as entities to specify our search query (e.g., for the query “Play a movie by Tom Hanks” we would like to label the words [Play, movie, Tom Hanks]). With these parts extracted, we can use the verb “play” to specify the wanted action, the word “movie” to specify the intent of the action, and “Tom Hanks” as the single subject for our search. To do this, we need a way of labeling tokens, which generally takes one of two forms:
• Token Labeling: each token gets an individual label, for example a Part of Speech (POS) tag.
• Span Labeling: labeling segments or groups of words that share one tag (e.g., Named Entity Recognition, syntactic chunks).
Raw labeling is a common task which involves labeling a single word unit with its respective
tag. One common application of this is part-of-speech (POS) tagging. Given a dataset of tokens
and their POS tags within their given context, it is possible to train a model that will learn
from the context and generalize to other unseen texts and predict their POS.
Segmentation labeling is another form of sequence tagging, where we have a single entity such as a
name that spans multiple tokens. To simplify this task, we write it as a raw labeling task with modified
labels to represent tokens as members of a span. These spans are labeled with a BIO tag representing
the Beginning, Inside, and Outside of entities:
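For instance (a small illustrative example; the sentence and the LOC entity type are our own choices), the tokens of a sentence containing “Arc de Triomphe” would be labeled as:

# BIO labels for the tokens of "I visited the Arc de Triomphe"
tokens = ["I", "visited", "the", "Arc",   "de",    "Triomphe"]
labels = ["O", "O",       "O",   "B-LOC", "I-LOC", "I-LOC"]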
By further breaking down multiword entities into groups of BIO tags that represent the span
of a single entity, we can train a model to tag where a single entity begins and ends. Here, “Arc
de Triomphe” consists of three tokens that represent a single entity. By identifying the beginning and inside of the entity, the tokens can be joined to form a single representation. This is important in tasks such as question answering, where we want to know that the tokens “Tom” and “Hanks” refer to the same person rather than treating them separately, which allows us to generate a more accurate query. For more details, please refer to Hal Daumé III's blog post on getting started with sequence labeling.
HMMs are “a statistical Markov model in which the system being modeled is assumed to be a
Markov process with unobservable (i.e. hidden) states”. They are designed to model the joint
distribution P(H, O), where H is the hidden state (the tag) and O is the observed state (the word). For example, in the context of POS tagging, the objective would be to build an HMM to model P(word | tag) and compute the label probabilities given the observations using Bayes’ Rule:
P(tag | word) = P(word | tag) · P(tag) / P(word)
HMM graphs consist of a Hidden Space and Observed Space, where the hidden space
consists of the labels and the observed space is the input. These spaces are connected via
transition matrices {T,A} to represent the probability of transitioning from one state to
another following their connections. Each connection represents a distribution over possible
options; given our tags, this results in a large search space of the probability of all words given
the tag.
The main idea behind HMMs is that of making observations and traveling along connections
based on a probability distribution. In the context of sequence tagging, there is a changing hidden state (the tag) which changes as the observed state (the tokens in the source text) also changes.
However, there are two problems with HMMs. First, HMMs are limited to only discrete states
and only take into account the last known state. Furthermore, it is hard to create a state as a
function of multiple others, and the features allowed are limited. These problems limit the
utilization of our context, where it would be preferable to consider our sequence as a whole
rather than strictly assume independence as in HMMs. Hence, a new model was needed to make better use of the context.
Given an observation space, Maximum Entropy Markov Models (MEMMs) predict the state sequence. MEMMs use a maximum entropy framework for features and apply local normalization at each state.
MEMMs also have a well-known issue known as label bias. The label bias problem arises because MEMMs apply local normalization, which often leads to the model getting stuck in local minima during decoding: the model favors nodes with the fewest outgoing transitions. To solve this, Conditional Random Fields (CRFs) were introduced, which normalize globally over the entire label sequence.
Lexical syntax:
Lexical syntax is a subfield of natural language processing (NLP) that deals with
the study of the syntax or grammatical structure of words or lexical items in a
language. It focuses on the relationships between the surface forms of words
and their syntactic properties, including the grammatical categories,
morphological properties, and syntactic dependencies.
Lexical syntax plays a crucial role in many NLP applications, such as part-of-
speech tagging, parsing, and semantic analysis. For instance, in part-of-speech
tagging, lexical syntax is used to determine the grammatical category or part of
speech of each word in a sentence. In parsing, it helps to identify the syntactic
structure of a sentence, such as the subject and object of a sentence. In
semantic analysis, it helps to determine the meaning of words and their
relationships in a sentence.
One of the main challenges of lexical syntax in NLP is dealing with the
ambiguity of natural language. Many words can have multiple syntactic and
semantic interpretations depending on the context in which they appear. To
overcome this challenge, lexical syntax relies on various techniques such as
rule-based parsing, statistical parsing, and machine learning algorithms.
In summary, lexical syntax is a critical area of study in NLP that focuses on the
analysis of the grammatical structure of words in a language. It plays a crucial
role in many NLP applications and helps to address the challenge of ambiguity
in natural language.
In natural language processing (NLP), lexical syntax refers to the rules and
patterns that govern how words are used in a language. These rules include the
structure of words, such as their inflection, tense, and grammatical category, as
well as their relationships to other words in a sentence.
Morphology: This refers to the study of the structure and formation of words,
including inflection, derivation, and compounding. For example, in English, the
word "walk" can be inflected to form "walked" (past tense), "walking" (present
participle), and "walks" (third-person singular).
Parts of speech: This refers to the grammatical categories that words can be
classified into, such as nouns, verbs, adjectives, and adverbs. Parts of speech
play a critical role in determining the structure and meaning of a sentence.
Syntax: This refers to the rules that govern the arrangement of words in a
sentence. For example, in English, the subject typically precedes the verb in a
sentence ("the cat chased the mouse").
Forward Algorithm
The forward algorithm is a dynamic programming algorithm for an HMM that computes the probability of an observed sequence by summing over all possible hidden-state paths.
Applications of the algorithm
The forward algorithm is mostly used in applications that need us to determine
the probability of being in a specific state when we know about the sequence
of observations. We first calculate the probabilities over the states computed
for the previous observation and use them for the current observations, and
then extend it out for the next step using the transition probability table. The
approach basically caches all the intermediate state probabilities so they are
computed only once. The algorithm computes the probability much more efficiently than the naive approach, which very quickly ends up in a combinatorial explosion. Together with the backward algorithm, it can provide the probability of a given emission/observation at each position in the sequence of observations, and it is from this information that a version of the most likely state path can be computed ("posterior decoding"). The algorithm can be applied wherever we can train a
model as we receive data using Baum-Welch[2] or any general EM algorithm.
The Forward algorithm will then tell us about the probability of data with
respect to what is expected from our model. One of the applications can be in
the domain of Finance, where it can help decide on when to buy or sell
tangible assets. It can have applications in all fields where we apply Hidden
Markov Models. The popular ones include Natural language processing
domains like part-of-speech tagging and speech recognition.[1] Recently it has also been used in the domain of bioinformatics. The forward algorithm can also be applied to weather prediction: we can have an HMM describing the weather and its relation to the observations over a few consecutive days (some examples could be dry, damp, soggy, sunny, cloudy, rainy, etc.). We can
consider calculating the probability of observing any sequence of observations
recursively given the HMM. We can then calculate the probability of reaching
an intermediate state as the sum of all possible paths to that state. Thus the
partial probabilities for the final observation will hold the probability of
reaching those states going through all possible paths.
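A hedged sketch of the forward algorithm follows, reusing the transition and emission tables from the HMM sketch given earlier in this unit; the "<s>" start symbol, the dictionary layout, and the function name forward_probability are our own conventions. It computes the total probability of an observation sequence by summing over all hidden-state paths, caching the intermediate probabilities in alpha.

def forward_probability(observations, states, transition, emission):
    """Total probability of the observation sequence, summed over all hidden-state paths."""
    # alpha[t][s]: probability of the first t+1 observations with the t-th hidden state = s
    alpha = [{s: transition.get("<s>", {}).get(s, 0.0)
                 * emission.get(s, {}).get(observations[0], 0.0)
              for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[r] * transition.get(r, {}).get(s, 0.0) for r in states)
                         * emission.get(s, {}).get(obs, 0.0)
                      for s in states})
    return sum(alpha[-1].values())

# With the toy tables from the HMM sketch above:
# forward_probability(["the", "cat", "sat"], ["DET", "NOUN", "VERB"], transition, emission) -> 0.25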
Viterbi algorithm
The Viterbi algorithm is a dynamic programming algorithm for obtaining
the maximum a posteriori probability estimate of the most likely sequence of
hidden states—called the Viterbi path—that results in a sequence of observed
events, especially in the context of Markov information sources and hidden
Markov models (HMM).
The algorithm has found universal application in decoding the convolutional
codes used in both CDMA and GSM digital cellular, dial-up modems, satellite,
deep-space communications, and 802.11 wireless LANs. It is now also
commonly used in speech recognition, speech synthesis, diarization,[1] keyword
spotting, computational linguistics, and bioinformatics. For example, in speech-
to-text (speech recognition), the acoustic signal is treated as the observed
sequence of events, and a string of text is considered to be the "hidden cause"
of the acoustic signal. The Viterbi algorithm finds the most likely string of text
given the acoustic signal.
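Below is a minimal Viterbi decoding sketch in the same toy-HMM setting as the earlier sketches (same dictionary layout and "<s>" start symbol, which are our own assumptions). It differs from the forward algorithm only in taking the maximum over predecessor states instead of the sum, and in keeping backpointers so the best tag sequence can be recovered.

def viterbi(observations, states, transition, emission):
    """Most likely hidden-state (tag) sequence for an observed word sequence."""
    # best[t][s] = (probability of the best path ending in state s at step t, predecessor state)
    best = [{s: (transition.get("<s>", {}).get(s, 0.0)
                 * emission.get(s, {}).get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            p, r = max(((prev[r][0] * transition.get(r, {}).get(s, 0.0), r) for r in states),
                       key=lambda x: x[0])
            layer[s] = (p * emission.get(s, {}).get(obs, 0.0), r)
        best.append(layer)
    # Trace back from the most probable final state using the stored predecessors
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for layer in reversed(best[1:]):
        state = layer[state][1]
        path.append(state)
    return list(reversed(path))

# With the toy tables from the HMM sketch earlier:
# viterbi(["the", "cat", "sat"], ["DET", "NOUN", "VERB"], transition, emission) -> ['DET', 'NOUN', 'VERB']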