Week 5 Lab: Language Models
Task 1: Estimation
Table 1 gives the bigram counts for a small corpus of Harry Potter text (100 types, 2,000 tokens).
a) Which of the following three sentences is the most probable according to the bigram probabilities derived from the counts in Table 1?
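Hint: the chain-rule computation can be scripted. Below is a minimal Python sketch; the counts here are hypothetical placeholders, not the actual values from Table 1:

# Score a sentence under an MLE bigram model:
#   P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
# Hypothetical counts for illustration only, not Table 1's values.
bigram_counts = {("<s>", "harry"): 30, ("harry", "said"): 12, ("said", "</s>"): 8}
unigram_counts = {"<s>": 100, "harry": 40, "said": 25}

def sentence_prob(tokens):
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_counts.get((prev, cur), 0) / unigram_counts[prev]
    return prob

print(sentence_prob(["<s>", "harry", "said", "</s>"]))  # 0.30 * 0.30 * 0.32 = 0.0288

The sentence with the highest product of bigram probabilities is the most probable one.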
b) Write out the equation for trigram probability estimation and all non-zero trigram probabilities for the modified ‘I am Sam’ corpus below:
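Hint: the maximum-likelihood trigram estimate has the same form as the bigram estimate from the lecture, conditioning on the two preceding words:

\[
P(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-2}\, w_{n-1})}
\]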
c) When building an n-gram language model, what are the pros and cons of using very
short (e.g., 2-grams) or very long (e.g., 10-grams) sequences?
Task 2: Smoothing
a) Using a trigram language model with add-one smoothing, what are $P_{\text{Laplace}}(\text{am} \mid \text{Sam I})$ and $P_{\text{Laplace}}(\text{Sam} \mid \text{I am})$ in the modified corpus above? Include <s> and </s> in your counts just like any other token.
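Recall the add-one (Laplace) estimate for trigrams, where $V$ is the vocabulary size (here counting <s> and </s> as tokens, per the task):

\[
P_{\text{Laplace}}(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n) + 1}{C(w_{n-2}\, w_{n-1}) + V}
\]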
Task 3: Perplexity
a) Go through the lecture's perplexity calculation for the ten digits of English (zero through nine), assuming each digit occurs with equal probability.
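Recall the definition of perplexity from the lecture:

\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
\]

With ten equally likely digits, every factor is $1/10$, so the perplexity is 10.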
b) Now suppose that the digit zero is far more frequent than the other digits. Let's say that 0 occurs 91 times in the training set and each of the other nine digits occurs once. Now we see the following test set: 0 0 0 0 0 3 0 0 0 0. What is the perplexity of this test set?
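A minimal Python sketch of this computation, assuming a unigram model with maximum-likelihood estimates taken directly from the training counts:

import math

# Training set: "0" occurs 91 times, each of the other nine digits once
# (100 tokens total), giving MLE unigram probabilities C(w) / 100.
train_counts = {"0": 91, **{str(d): 1 for d in range(1, 10)}}
total = sum(train_counts.values())

test = "0 0 0 0 0 3 0 0 0 0".split()

# PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space for stability.
log_prob = sum(math.log(train_counts[w] / total) for w in test)
print(math.exp(-log_prob / len(test)))

Note how the single low-probability digit 3 raises the perplexity well above the $1/0.91 \approx 1.1$ it would be for a test set of all zeros.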
Task 4: Miscellaneous
Indicate if the following statements about language models are true (T) or false (F).
• We can use n-gram language models to generate the next word in a sentence.