Week 5 Lab: Language Models
Task 1: Estimation
Table 1 gives the bigram counts for a small corpus of Harry Potter text (100 types, 2,000 tokens).
a) Which of the following three sentences is the most probable according to the bigram probabilities derived from the counts in Table 1?
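Hint: the chain-rule computation can be scripted. Below is a minimal Python sketch; the counts here are hypothetical placeholders, not the actual values from Table 1:

# Score a sentence under an MLE bigram model:
#   P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
# Hypothetical counts for illustration only, not Table 1's values.
bigram_counts = {("<s>", "harry"): 30, ("harry", "said"): 12, ("said", "</s>"): 8}
unigram_counts = {"<s>": 100, "harry": 40, "said": 25}

def sentence_prob(tokens):
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_counts.get((prev, cur), 0) / unigram_counts[prev]
    return prob

print(sentence_prob(["<s>", "harry", "said", "</s>"]))  # 0.30 * 0.30 * 0.32 = 0.0288

The sentence with the highest product of bigram probabilities is the most probable one.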
b) Write out the equation for trigram probability estimation and all non-zero trigram probabilities for the modified ‘I am Sam’ corpus below:
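Hint: the maximum-likelihood trigram estimate has the same form as the bigram estimate from the lecture, conditioning on the two preceding words:

\[
P(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-2}\, w_{n-1})}
\]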
c) When building an n-gram language model, what are the pros and cons of using very
short (e.g., 2-grams) or very long (e.g., 10-grams) sequences?
Task 2: Smoothing
a) Using a trigram language model with add-one smoothing, what are $P_{\text{Laplace}}(\text{am} \mid \text{Sam I})$ and $P_{\text{Laplace}}(\text{Sam} \mid \text{I am})$ in the modified corpus above? Include <s> and </s> in your counts just like any other token.
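Recall the add-one (Laplace) estimate for trigrams, where $V$ is the vocabulary size (here counting <s> and </s> as tokens, per the task):

\[
P_{\text{Laplace}}(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n) + 1}{C(w_{n-2}\, w_{n-1}) + V}
\]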
Task 3: Perplexity
a) Go through the lecture's perplexity calculation for the ten digits of English (zero through nine), assuming each digit occurs with equal probability.
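Recall the definition of perplexity from the lecture:

\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
\]

With ten equally likely digits, every factor is $1/10$, so the perplexity is 10.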
b) Now suppose that the digit zero is far more frequent than the other digits. Let's say that 0 occurs 91 times in the training set and each of the other nine digits occurs once. Now we see the following test set: 0 0 0 0 0 3 0 0 0 0. What is the perplexity of this test set?
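A minimal Python sketch of this computation, assuming a unigram model with maximum-likelihood estimates taken directly from the training counts:

import math

# Training set: "0" occurs 91 times, each of the other nine digits once
# (100 tokens total), giving MLE unigram probabilities C(w) / 100.
train_counts = {"0": 91, **{str(d): 1 for d in range(1, 10)}}
total = sum(train_counts.values())

test = "0 0 0 0 0 3 0 0 0 0".split()

# PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space for stability.
log_prob = sum(math.log(train_counts[w] / total) for w in test)
print(math.exp(-log_prob / len(test)))

Note how the single low-probability digit 3 raises the perplexity well above the $1/0.91 \approx 1.1$ it would be for a test set of all zeros.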
Task 4: Miscellaneous
Indicate if the following statements about language models are true (T) or false (F).
• We can use n-gram language models to generate the next word in a sentence.