
Computational Linguistics, University of Passau
Lab session, Summer 2022
Prof. Dr. Annette Hautli-Janisz

Week 5: Language models

Task 1: Estimation
Table 1 gives the bigram probabilities estimated from a small Harry Potter corpus (100 types, 2,000 tokens).

<s> Hermione       0.005
Hermione likes     0.22
likes Ron          0.13
likes Harry        0.18
likes chocolate    0.01
Ron </s>           0.11
Harry </s>         0.05

Table 1: Bigram probabilities

a) Which of the following three sentences is the most probable one according to the
bigram probabilities given above? (A small computational sketch follows the list.)

• <s> Hermione likes Ron </s>

• <s> Hermione likes Harry </s>

• <s> Hermione likes chocolate </s>
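
Under a bigram model, the probability of a sentence is the product of its bigram probabilities (the chain rule with a one-word history). The following minimal Python sketch applies this to the three candidates, using only the probabilities listed in Table 1; any bigram missing from the table (e.g. chocolate </s>) is treated as probability 0, which is what a maximum-likelihood estimate assigns to an unseen bigram.

# Bigram probabilities copied from Table 1; anything not listed is
# assumed to be 0, the MLE value for an unseen bigram.
P = {
    ("<s>", "Hermione"): 0.005,
    ("Hermione", "likes"): 0.22,
    ("likes", "Ron"): 0.13,
    ("likes", "Harry"): 0.18,
    ("likes", "chocolate"): 0.01,
    ("Ron", "</s>"): 0.11,
    ("Harry", "</s>"): 0.05,
}

def sentence_prob(tokens):
    # Chain rule: multiply the probability of each adjacent word pair.
    prob = 1.0
    for pair in zip(tokens, tokens[1:]):
        prob *= P.get(pair, 0.0)
    return prob

for obj in ("Ron", "Harry", "chocolate"):
    sentence = ["<s>", "Hermione", "likes", obj, "</s>"]
    print(obj, sentence_prob(sentence))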

b) Write out the equation for trigram probability estimation and all non-zero trigram
probabilities for the modified ‘I am Sam’ corpus below (a counting sketch follows the corpus):

<s> I am Sam </s>


<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and ham </s>
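
The maximum-likelihood estimate for a trigram is P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}). A minimal sketch that enumerates the non-zero probabilities, assuming each line above is one sentence and the single <s>/</s> markers are kept as ordinary tokens:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and ham </s>",
]

trigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    trigrams.update(zip(toks, toks[1:], toks[2:]))
    bigrams.update(zip(toks, toks[1:]))

# P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2); only observed trigrams
# have non-zero probability.
for (w1, w2, w3), c in sorted(trigrams.items()):
    print(f"P({w3} | {w1} {w2}) = {c}/{bigrams[(w1, w2)]}")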

c) When building an n-gram language model, what are the pros and cons of using very
short (e.g., 2-grams) or very long (e.g., 10-grams) sequences?

Task 2: Smoothing
a) Using a trigram language model with add-one smoothing, what are P_Laplace(am | Sam I)
and P_Laplace(Sam | I am) in the modified corpus above? Include <s> and </s> in your
counts just like any other token.
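
Add-one (Laplace) smoothing adds one to every trigram count and compensates by adding the vocabulary size V to the denominator: P_Laplace(w3 | w1 w2) = (C(w1 w2 w3) + 1) / (C(w1 w2) + V). A minimal self-contained sketch, counting <s> and </s> as ordinary vocabulary items:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and ham </s>",
]

trigrams, bigrams, vocab = Counter(), Counter(), set()
for line in corpus:
    toks = line.split()
    vocab.update(toks)
    trigrams.update(zip(toks, toks[1:], toks[2:]))
    bigrams.update(zip(toks, toks[1:]))

V = len(vocab)  # 12 types, counting <s> and </s>

def p_laplace(w1, w2, w3):
    # (C(w1 w2 w3) + 1) / (C(w1 w2) + V)
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)

print(p_laplace("Sam", "I", "am"))  # (1 + 1) / (1 + 12)
print(p_laplace("I", "am", "Sam"))  # (2 + 1) / (3 + 12)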

b) Read up on backoff and linear interpolation in Section 3.5.3 of https://web.stanford.edu/~jurafsky/slp3/3.pdf.

c) If we use linear interpolation smoothing between a maximum-likelihood bigram model
and a maximum-likelihood unigram model with λ2 = 0.5 and λ1 = 0.5, respectively, what
is P(Sam|am) in the Sam corpus above? Include <s> and </s> in your counts just like
any other token.
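
Linear interpolation mixes the two maximum-likelihood estimates: P_interp(Sam | am) = λ2 · P(Sam | am) + λ1 · P(Sam). A minimal sketch over the same corpus, counting <s> and </s> like any other token:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

N = sum(unigrams.values())  # 25 tokens, counting <s> and </s>
lam2, lam1 = 0.5, 0.5

p_bigram = bigrams[("am", "Sam")] / unigrams["am"]  # C(am Sam)/C(am) = 2/3
p_unigram = unigrams["Sam"] / N                     # C(Sam)/N = 3/25
print(lam2 * p_bigram + lam1 * p_unigram)           # ≈ 0.393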

Task 3: Perplexity
a) Work through the perplexity calculation from the lecture using the digits of English.

b) Now suppose that the digit zero is very frequent and occurs far more often than the
other digits. Let's say that 0 occurs 91 times in the training set and each of the other
nine digits occurs once. Now we see the following test set: 0 0 0 0 0 3 0 0 0 0. What
is the perplexity of this test set?
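
Perplexity is the inverse probability of the test set, normalised by the number of tokens: PP(W) = P(w_1 ... w_N)^(-1/N). A minimal sketch under a unigram model, assuming the MLE probabilities from the training counts (91 + 9 × 1 = 100 tokens, so P(0) = 0.91 and P(d) = 0.01 for every other digit):

# Unigram MLE probabilities from the training counts.
p = {str(d): 1 / 100 for d in range(10)}
p["0"] = 91 / 100

test = "0 0 0 0 0 3 0 0 0 0".split()
N = len(test)

# PP(W) = P(w_1 ... w_N) ** (-1/N)
prob = 1.0
for w in test:
    prob *= p[w]
print(prob ** (-1 / N))  # ≈ 1.73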

Task 4: Miscellaneous
Indicate if the following statements about language models are true (T) or false (F).

• We can use n-gram language models to generate the next word in a sentence.

• n-gram language models assign negative probabilities to impossible n-gram combinations.

• Smoothing is a possible solution to sparse data.

• We can evaluate the performance of a language model by calculating its perplexity.
