N-gram Language Models
Dr. Wafa Zaal Almaaitah
Statistical Language Processing
• Statistical language processing techniques can also be used to solve many problems in natural language processing:
– optical character recognition
– spelling correction
– speech recognition
– machine translation
– part of speech tagging
– parsing
• Statistical techniques can be used to disambiguate the input.
• They can be used to select the most probable solution.
• Statistical techniques depend on the probability theory.
• To be able to use statistical techniques, we need corpora from which to collect statistics.
• Corpora should be big enough to capture the required knowledge.
Basic Probability
• Probability Theory: predicting how likely it is that something will happen.
• Probabilities: numbers between 0 and 1.
• Probability Function:
– P(A) denotes how likely it is that event A happens.
– P(A) is a number between 0 and 1
– P(A)=1 => a certain event
– P(A)=0 => an impossible event
• Example: a coin is tossed three times. What is the probability of 3 heads?
– 1/8
– uniform distribution
Basic Probability
Since each coin toss is independent of the others, we can multiply the probabilities
together:
Probability of getting heads on the 1st toss = 0.5
Probability of getting heads on the 2nd toss = 0.5
Probability of getting heads on the 3rd toss = 0.5
To find the probability of all three events happening (getting heads on all three
tosses), we multiply the individual probabilities:
0.5×0.5×0.5=0.125
So, the probability of getting 3 heads when a coin is tossed three times is 0.125 or
12.5%.
Probability Spaces
• There is a sample space and the subsets of this sample space describe the events.
• Ω is a sample space.
– Ω is the certain event
– the empty set ∅ is the impossible event.
P(A) is between 0 and 1
P(Ω) = 1
Figure: the sample space Ω, with the event A drawn as a circle inside it (a subset of Ω).
Unconditional and Conditional Probability
• Unconditional Probability or Prior Probability
– P(A)
– the probability of event A, not conditioned on any other event.
• Conditional Probability -- Posterior Probability -- Likelihood
– P(A|B)
– this is read as the probability of A given that we know B.
• Example:
– P(put) is the probability of seeing the word put in a text
– P(on|put) is the probability of seeing the word on after seeing the word put.
Unconditional and Conditional Probability
P(A|B) = P(AB) / P(B)
P(B|A) = P(AB) / P(A)
Bayes’ Theorem
• Bayes’ theorem is used to calculate P(A|B) when P(B|A) is known.
• We know that:
P(AB) = P(A|B) P(B)
P(AB) = P(B|A) P(A)
• So, we will have:
P(A|B) = P(B|A) P(A) / P(B)
P(B|A) = P(A|B) P(B) / P(A)
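• As a quick numeric illustration (numbers chosen only for this example): if P(B|A) = 0.8, P(A) = 0.1, and P(B) = 0.2, then P(A|B) = 0.8 × 0.1 / 0.2 = 0.4.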
Language Model
• Models that assign probabilities to sequences of words are called language models
(LMs).
• The simplest language model that assigns probabilities to sentences and sequences of
words is the n-gram.
• An n-gram is a sequence of N words:
– A 1-gram (unigram) is a one-word sequence like “please” or “turn”.
– A 2-gram (bigram) is a two-word sequence like “please turn”, “turn your”, or “your homework”.
– A 3-gram (trigram) is a three-word sequence like “please turn your” or “turn your homework”.
• We can use n-gram models to estimate the probability of the last word of an n-gram
given the previous words, and also to assign probabilities to entire word sequences.
Probabilistic Language Models
• Probabilistic language models can be used to assign a probability to a sentence in
many NLP tasks.
• Machine Translation:
– P(high winds tonight) > P(large winds tonight)
• Spell Correction:
– Thek office is about ten minutes from here
– P(The office is) > P(Then office is)
• Speech Recognition:
– P(I saw a van) >> P(eyes awe of an)
• Summarization, question-answering, …
Probabilistic Language Models
• Our goal is to compute the probability of a sentence or sequence of words W
(=w1,w2,…wn):
– P(W) = P(w1,w2,w3,w4,w5…wn)
• What is the probability of an upcoming word?:
– P(w5|w1,w2,w3,w4)
• A model that computes either of these:
– P(W) or P(wn|w1,w2…wn-1) is called a language model.
Chain Rule of Probability
• How can we compute probabilities of entire word sequences like w1,w2,…wn?
– The probability of the word sequence w1,w2,…wn is P(w1,w2,…wn).
• We can use the chain rule of the probability to decompose this probability:
P(w1n) = P(w1) P(w2|w1) P(w3|w12) … P(wn|w1n-1)
= ∏k=1…n P(wk|w1k-1)
Example:
P(the man from jupiter) =
P(the) P(man|the) P(from|the man) P(jupiter|the man from)
Chain Rule of Probability
and Conditional Probabilities
• The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words.
• Definition of Conditional Probabilities:
P(B|A) = P(A,B) / P(A) ➔ P(A,B) = P(A) P(B|A)
• Conditional Probabilities with More Variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• Chain Rule:
P(w1… wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)
Computing Conditional Probabilities
• To compute the exact probability of a word given a long sequence of preceding words
is difficult (sometimes impossible).
• We are trying to compute P(wn|w1…wn-1) which is the probability of seeing wn after
seeing w1n-1.
• We may try to compute P(wn|w1…wn-1) exactly as follows:
P(wn|w1…wn-1) = count(w1…wn-1wn) / count(w1…wn-1)
• There are too many possible sentences, and we may never see enough data to estimate these probability values.
• So, we need to compute P(wn|w1…wn-1) approximately.
N-Grams
• The intuition of the n-gram model (simplifying assumption):
– instead of computing the probability of a word given its entire history, we can
approximate the history by just the last few words.
P(wn|w1…wn-1) ≈ P(wn) unigram
P(wn|w1…wn-1) ≈ P(wn|wn-1) bigram
P(wn|w1…wn-1) ≈ P(wn|wn-1wn-2) trigram
P(wn|w1…wn-1) ≈ P(wn|wn-1wn-2wn-3) 4-gram
P(wn|w1…wn-1) ≈ P(wn|wn-1wn-2wn-3wn-4) 5-gram
• In general, the N-gram approximation is
P(wn|w1…wn-1) ≈ P(wn|wn-N+1…wn-1)
N-Grams
computing probabilities of word sequences
Unigrams -- P(w1n) ≈ ∏k=1…n P(wk)
Bigrams -- P(w1n) ≈ ∏k=1…n P(wk|wk-1)
Trigrams -- P(w1n) ≈ ∏k=1…n P(wk|wk-1wk-2)
4-grams -- P(w1n) ≈ ∏k=1…n P(wk|wk-1wk-2wk-3)
N-Grams
computing probabilities of word sequences (Sentences)
Unigram
P(<s> the man from jupiter came </s>)
P(the) P(man) P(from) P(jupiter) P(came)
Bigram
P(<s> the man from jupiter came </s>)
P(the|<s>) P(man|the) P(from|man) P(jupiter|from) P(came|jupiter) P(</s>|came)
Trigram
P(<s> the man from jupiter came </s>)
P(the|<s> <s>) P(man|<s> the) P(from|the man) P(jupiter|man from)
P(came|from jupiter) P(</s>|jupiter came) P(</s>|came </s>)
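A minimal Python sketch (an illustrative helper, not from the slides) that generates these factorizations for any n, using the same <s>/</s> padding convention:

```python
def ngram_factors(sentence, n):
    # Illustrative helper: pad with n-1 start and n-1 end symbols, then list
    # the conditional-probability terms of the n-gram approximation.
    words = sentence.split()
    tokens = ["<s>"] * (n - 1) + words + ["</s>"] * (n - 1)
    factors = []
    for i in range(n - 1, len(tokens)):
        history = " ".join(tokens[i - n + 1:i])
        factors.append(f"P({tokens[i]}|{history})" if history else f"P({tokens[i]})")
    return factors

print(ngram_factors("the man from jupiter came", 2))
# ['P(the|<s>)', 'P(man|the)', 'P(from|man)', 'P(jupiter|from)',
#  'P(came|jupiter)', 'P(</s>|came)']
print(ngram_factors("the man from jupiter came", 3)[:2])
# ['P(the|<s> <s>)', 'P(man|<s> the)']
```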
Example: using cosine similarity
Corpus
•Document 1: "Python is a popular programming language."
•Document 2: "Java and Python are both programming languages."
Query
•Query: "Python programming language"
Steps
1. Tokenization: Tokenize the documents and the query into words.
•Document 1: ["Python", "is", "a", "popular", "programming", "language"]
•Document 2: ["Java", "and", "Python", "are", "both", "programming", "languages"]
•Query: ["Python", "programming", "language"]
2. Generate Bigrams:
•Document 1: [("Python", "is"), ("is", "a"), ("a", "popular"), ("popular", "programming"),
("programming", "language")]
•Document 2: [("Java", "and"), ("and", "Python"), ("Python", "are"), ("are", "both"), ("both",
"programming"), ("programming", "languages")]
•Query: [("Python", "programming"), ("programming", "language")]
3. Vector Representation: Represent each document and the query as a vector of bigram counts over the shared bigram vocabulary (the union of all distinct bigrams above: Document 1's five, Document 2's six, and the query-only bigram ("Python", "programming"), i.e. 12 dimensions in that order).
• Document 1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
• Document 2: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
• Query: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
(The query shares only the bigram ("programming", "language") with Document 1 and no bigram with Document 2.)
Steps
4. Cosine Similarity Calculation: Calculate the cosine similarity between the query vector and each document vector (see the sketch after this step).
With the vectors above, cos(query, Document 1) = 1 / (√2 · √5) ≈ 0.32 and cos(query, Document 2) = 0, so Document 1 is more relevant to the query.
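A minimal Python sketch of this bigram-vector cosine comparison (the lowercased whitespace tokenization and punctuation stripping are simplifying assumptions, not stated in the slides):

```python
from collections import Counter
from math import sqrt

def bigram_vector(text):
    # Simplified tokenization (assumption): lowercase, strip punctuation, split on spaces,
    # then count bigrams as a sparse vector.
    tokens = [t.strip(".,") for t in text.lower().split()]
    return Counter(zip(tokens, tokens[1:]))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = bigram_vector("Python programming language")
for name, doc in [("Document 1", "Python is a popular programming language."),
                  ("Document 2", "Java and Python are both programming languages.")]:
    print(name, round(cosine(query, bigram_vector(doc)), 2))
# Document 1 0.32
# Document 2 0.0
```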
Example: using conditional probabilities
Corpus
•Document 1: "Python is a popular programming language."
•Document 2: "Java and Python are both programming languages."
Query
•Query: "Python programming language"
Steps
1.–2. Tokenization and bigram generation are identical to steps 1 and 2 of the cosine-similarity example above (same documents, same query, same bigrams).
Steps
3. Calculate Conditional Probabilities:
• For each document, estimate the conditional probability of each query word given the previous query word, using bigram counts from that document.
Conditional probability formula: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
Steps
Calculation for Document 1:
•P("Python" | <start>) = 1 / 1 = 1
•P("programming" | "Python") = 1 / 1 = 1
•P("language" | "programming") = 1 / 1 = 1
•Overall Probability:
Calculation for Document 2:
1. Perform similar calculations as for Document 1.
4. Ranking:
1. Calculate the overall probability for each document.
2. Rank the documents based on these probabilities, with higher probabilities indicating
higher relevance to the query.
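A minimal Python sketch of this per-document scoring with unsmoothed maximum-likelihood bigram estimates (the <start> marker and the simplified tokenization are assumptions carried over from the example):

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenization (assumption): lowercase, strip punctuation, prepend a start marker.
    return ["<start>"] + [t.strip(".,") for t in text.lower().split()]

def score(document, query):
    # Product of P(w_i | w_{i-1}) estimated from the document's bigram counts.
    tokens = tokenize(document)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    prob = 1.0
    q = tokenize(query)
    for prev, word in zip(q, q[1:]):
        denom = unigram_counts[prev]
        prob *= (bigram_counts[(prev, word)] / denom) if denom else 0.0
    return prob

for doc in ["Python is a popular programming language.",
            "Java and Python are both programming languages."]:
    print(score(doc, "Python programming language"))
# Both scores come out 0.0 because some query bigrams are unseen --
# exactly the zero-probability problem that Laplace smoothing addresses next.
```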
Laplace smoothing
Laplace smoothing, also known as add-one smoothing, is a technique used to handle
the issue of zero probabilities for unseen events in probabilistic models, such as
language models.
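In bigram form this is the standard add-one estimate (V is the vocabulary size, i.e. the number of distinct word types):
P_Laplace(wi | wi-1) = (count(wi-1 wi) + 1) / (count(wi-1) + V)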
Laplace smoothing
By adding 1 to the numerator and V to the denominator, Laplace smoothing ensures
that even unseen N-grams have a non-zero probability.
This helps in making the model more robust and prevents it from assigning zero
probabilities to unseen events, which could lead to overly sparse or inaccurate
probability estimates.
Calculation with Laplace Smoothing
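As an illustrative recalculation of the previous example with add-one smoothing (assuming V is the number of distinct word types in each document, 6 for Document 1 and 7 for Document 2, and keeping the <start> marker):
Document 1:
• P(Python | <start>) = (1 + 1) / (1 + 6) = 2/7
• P(programming | Python) = (0 + 1) / (1 + 6) = 1/7
• P(language | programming) = (1 + 1) / (1 + 6) = 2/7
• Overall Probability ≈ 2/7 × 1/7 × 2/7 ≈ 0.012
Document 2:
• P(Python | <start>) = (0 + 1) / (1 + 7) = 1/8
• P(programming | Python) = (0 + 1) / (1 + 7) = 1/8
• P(language | programming) = (0 + 1) / (1 + 7) = 1/8
• Overall Probability = 1/8 × 1/8 × 1/8 ≈ 0.002
Both documents now receive non-zero scores, and Document 1 still ranks higher.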
Estimating N-Gram Probabilities
A Bigram Example
• A mini-corpus: we augment each sentence with a special symbol <s> at the beginning, to give us the bigram context of the first word, and a special end symbol </s> at the end.
<s> I am Sam </s>
<s> Sam I am </s>
<s> I fly </s>
• Unique words: I, am, Sam, fly
• Bigrams: <s> and </s> are also tokens. There are 6 token types (4 words + 2 boundary symbols), so 6*6=36 possible bigrams.
P(I|<s>)=2/3 P(Sam|<s>)=1/3 P(am|<s>)=0 P(fly|<s>)=0 P(<s>|<s>)=0 P(</s>|<s>)=0
P(I|I)=0 P(Sam|I)=0 P(am|I)=2/3 P(fly|I)=1/3 P(<s>|I)=0 P(</s>|I)=0
P(I|am)=0 P(Sam|am)=1/2 P(am|am)=0 P(fly|am)=0 P(<s>|am)=0 P(</s>|am)=1/2
P(I|Sam)=1/2 P(Sam|Sam)=0 P(am|Sam)=0 P(fly|Sam)=0 P(<s>|Sam)=0 P(</s>|Sam)=1/2
P(I|fly)=0 P(Sam|fly)=0 P(am|fly)=0 P(fly|fly)=0 P(<s>|fly)=0 P(</s>|fly)=1
P(I|</s>)=0 P(Sam|</s>)=0 P(am|</s>)=0 P(fly|</s>)=0 P(<s>|</s>)=0 P(</s>|</s>)=0 (nothing follows </s> within a sentence)
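A minimal Python sketch that reproduces these bigram estimates from the mini-corpus (counts are computed per sentence, so no bigram ever starts with </s>):

```python
from collections import Counter

# The mini-corpus from this slide.
corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I fly </s>"]

sentences = [s.split() for s in corpus]
unigram_counts = Counter(t for sent in sentences for t in sent)
bigram_counts = Counter(b for sent in sentences for b in zip(sent, sent[1:]))

def p(word, prev):
    # Maximum-likelihood bigram estimate P(word | prev) = count(prev word) / count(prev).
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<s>"))     # 0.666... = 2/3
print(p("am", "I"))      # 0.666... = 2/3
print(p("Sam", "am"))    # 0.5
print(p("</s>", "fly"))  # 1.0
```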
Estimating N-Gram Probabilities
Example
• Unigrams: I, am, Sam, fly
P(I)=3/8 P(am)=2/8 P(Sam)=2/8 P(fly)=1/8
• Trigrams: There are 6*6*6=216 trigrams.
– Assume there are two tokens <s> <s> at the beginning, and two tokens </s> </s> at the end.
P(I|<s> <s>)=2/3 P(Sam|<s> <s>)=1/3
P(am|<s> I)=1/2 P(fly|<s> I)=1/2
P(I|<s> Sam)=1
P(Sam|I am)=1/2 P(</s>|I am)=1/2
P(</s>|am Sam)=1
P(</s>|Sam </s>)=1
Estimating N-Gram Probabilities
Corpus: Berkeley Restaurant Project Sentences
• There are 9222 sentences in the corpus.
• Raw bigram counts of 8 words (out of 1446 word types)
Estimating N-Gram Probabilities
Corpus: Berkeley Restaurant Project Sentences
• Unigram counts:
• Normalize bigrams by unigram counts:
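• The normalization is the maximum-likelihood estimate P(wn|wn-1) = C(wn-1 wn) / C(wn-1), i.e. each bigram count is divided by the unigram count of its first word.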
Bigram Estimates of Sentence Probabilities
• Some other bigrams:
P(i|<s>)=0.25 P(english|want)=0.0011
P(food|english)=0.5 P(</s>|food)=0.68
• Compute the probability of the sentence I want English food:
P(<s> i want english food </s>)
= P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
= 0.25*0.33*0.0011*0.5*0.68
= 0.000031
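Since these per-word probabilities multiply into very small numbers, implementations usually sum log probabilities instead; a minimal sketch reproducing this product (the five bigram estimates are the ones used in the product above):

```python
import math

# Bigram estimates from this slide for "<s> i want english food </s>".
bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

prob = math.prod(bigram_probs)                      # direct product
log_prob = sum(math.log(p) for p in bigram_probs)   # log-space sum (avoids underflow)

print(f"{prob:.6f}")                # 0.000031
print(f"{math.exp(log_prob):.6f}")  # 0.000031, recovered from log space
```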