Noah A. Smith
© 2017
March 2017
Abstract
We introduce and explain language models. These notes are inspired in part by notes by Collins
[2011a], who provides more detail about some topics. See also Jurafsky and Martin [2016] for a textbook
treatment.
What makes a good English sentence? A human might tell you that a good sentence is grammatical,
meaningful, and “fits” into the context in which it is uttered or written. In NLP, all of these considerations
eventually come to play a role, but we begin with a much simpler observation: the vast majority of sequences of words are not good sentences, but almost anything is possible. Probabilistic language modeling—
assigning probabilities to pieces of language—is a flexible framework for capturing a notion of plausibility
that allows anything to happen but still tries to minimize surprise.
1 The Problem
Formally, the language modeling problem is as follows. Let V be the vocabulary: a (for now, finite) set of
discrete symbols. Let ○, called the “stop” symbol, be one element of V. Let V denote the size of V (also written as |V|). Let V† be the (infinite) set of sequences of symbols from V whose final symbol is ○.
A language model is a probability distribution for random variable X, which takes values in V† (i.e.,
sequences in the vocabulary that end in 8). Therefore, a language model defines p : V† → R such that:
p(x) ≥ 0, for all x ∈ V†   (1)

Σ_{x∈V†} p(x) = 1   (2)
2 Why Language Modeling?
One motivation for language modeling comes from the noisy channel paradigm, a metaphor and modeling
pattern that has inspired considerable research in NLP.
Suppose we have two random variables, O (which is the observed input to our system) and D (which is
the desired output of our system). We might view O’s value as a ciphertext and D as a plaintext into which
we would like to decode the value of O. To do this, we must solve:

d* = argmax_d p(D = d | O = o) = argmax_d p(O = o | D = d) · p(D = d)   (4)

where the second equality applies Bayes rule and drops the factor p(O = o), which does not depend on d. Pictorially:

source −→ D −→ channel −→ O
First, the source distribution randomly generates a value for the plaintext random variable D. Second, this
plaintext passes through a channel that corrupts it (or “adds noise to it”), resulting in a value for the observed
random variable O. Decoding works in the reverse direction, applying Bayes rule (Equation 4) to solve for
the most likely value of D given the observation that O = o.
Classic examples of the noisy channel model include speech recognition and machine translation; in both
cases, the source model is a language model over word sequences in the output language; D ranges over
V† . In speech recognition, the acoustic model gives a distribution over sounds given words. In machine
translation, the translation model gives a distribution over input-language sentences given output-language
sentences. Both of these are channel models; note that they counterintuitively model a process of transforming an output into an input.
The other motivation for studying language models first in a natural language processing course is that
the techniques we use in language modeling provide a kind of simplest first case for methods that are used
repeatedly later on.
3 Evaluation
A high-quality language model will assign high probability to real sentences it has never observed (i.e., that
were not part of the data used to train the model). The standard way to measure the quality of a language
model is to consider a dataset that is separate from the training data, called a test dataset, x̄1:m , and calculate
the model’s perplexity on this dataset:
perplexity(p; x̄_{1:m}) = 2^{−(1/M) log₂ ∏_{i=1}^{m} p(x̄_i)} = 2^{−(1/M) Σ_{i=1}^{m} log₂ p(x̄_i)}   (6)

where M denotes the total number of words in the test dataset.
Do not be thrown by the “two to the power . . . ” at the beginning of this expression, or by the loga-
rithm! Remember that exponentiation and logarithms are inverses; what’s going on in this formula is kind of
intuitive if we take it step by step. We’ll consider the expression on the right of Equation 6.
1. First, calculate the probability of each element of the test data x̄i , for i ∈ {1, . . . , m}.
2. Taking the (base 2) logarithm of each of those values gives us a nonpositive number that could have
arbitrarily large magnitude. Big negative values go with tiny probabilities; values close to zero go with
larger probabilities. Recall that the logarithm function is monotonic; as a increases (decreases), log2 a
also increases (decreases).
3. Negating those logarithms gives us something known in machine learning as the log loss. Now, the values
are nonnegative; large values mean low probabilities (bad), and values close to zero mean higher proba-
bilities. This is sometimes described as a measure of surprise (or, in information theory, “information”)
that the model experiences when it observes a test sentence.
4. Take a per-word average of those sentence losses, by summing them up and dividing by the number of
words. This converts to a per-word surprise score.
5. Raise 2 to this power. Again, this is a monotonic transformation (as a increases, so does 2^a, only much faster). (A small code sketch of the whole computation follows this list.)
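To make the five steps concrete, here is a minimal Python sketch of Equation 6. The function name and the toy probabilities are our own illustrations, not part of the notes; each sentence length should count the stop symbol.

```python
import math

def perplexity(sentence_probs, sentence_lengths):
    """Equation 6: 2 to the power of the average per-word negative log2-probability."""
    M = sum(sentence_lengths)                                # total number of words
    log_loss = -sum(math.log2(p) for p in sentence_probs)    # steps 1-3
    return 2.0 ** (log_loss / M)                             # steps 4-5

# Toy example: three test sentences of lengths 4, 3, and 5 words.
print(perplexity([1e-4, 2e-3, 5e-7], [4, 3, 5]))
```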
Perplexity can therefore be understood as a kind of branching factor: “in general,” how many choices
must the model make among the possible next words from V? Consider a simpler case where we have only
one test sentence, x̄. Perplexity is then 2^{−(1/|x̄|) log₂ p(x̄)}. A few boundary cases are interesting to consider:
• If the model (miraculously) assigns probability 1 to x̄, then perplexity will be 1, since log₂ 1 = 0 and 2⁰ = 1. Here, there is always one choice, and it’s always correct! This never happens, because in reality anyone could say anything at any time.
• A model that assigns p(x̄) = (1/V)^{|x̄|}—essentially rolling an evenly weighted V-sided die for each word in x̄—will have perplexity V. Every word is uniformly possible at every step and is therefore pretty surprising. It doesn’t get more “random” than this, as there are always V choices.
• A model that assigns p(x̄) = 0 will have infinite perplexity, because log2 0 = −∞.
Perplexity is not a perfect measure of the quality of a language model. It is sometimes the case that improvements in perplexity don’t correspond to improvements in the quality of the output of the system that uses the language model. Nonetheless, some researchers see it as a “pure,” system-agnostic way to test the quality of a model (or model family, or estimation procedure). When you see perplexity scores in the literature, take them with a grain of salt.
Two important caveats to remember about perplexity are (i) that you cannot directly compare two models’
perplexity if they do not use identical vocabularies, and (ii) perplexity is not meaningful if the sum-to-one
constraint (Equation 2) is not satisfied by your language model.
4 Vocabulary
Our starting point for defining V, the vocabulary for our language model, is to include every word that occurs
in the training dataset x1:n . It should be immediately obvious that a finite training dataset is not going to
include all of the possible words in a natural language, because:
• New words show up all the time, either because they are borrowed from other languages or because they
are invented (“neologisms”).
• Many languages form words productively through their rules of morphology, so that the number of possible words results from a combinatorial explosion.
• Real-world data often shows variation in how words are written or spelled, either because there aren’t
standards or because people sometimes do not adhere to them.
A common solution is to introduce into V a special symbol called the “unknown word,” denoted “UNK,” that stands in for any out-of-vocabulary word. In the training dataset, some words—usually some or all instances of some or all of the rarest word types—are replaced by UNK.
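A minimal sketch of this recipe in Python; the count threshold, the UNK spelling, and the helper names are illustrative choices, not prescribed by the notes.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(training_sentences, min_count=2):
    """Keep word types seen at least min_count times; everything else will map to UNK."""
    counts = Counter(w for sent in training_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {UNK}

def apply_vocab(sentence, vocab):
    """Replace out-of-vocabulary tokens with the UNK symbol."""
    return [w if w in vocab else UNK for w in sentence]

train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "platypus", "yawned"]]
vocab = build_vocab(train)
print(apply_vocab(["the", "axolotl", "sat"], vocab))   # ['the', '<UNK>', 'sat']
```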
Another solution, for morphologically rich languages, is to build a computational model that maps between the “surface forms” of words (as they are observed in text) and their “underlying” morpheme sequences. For example, in English, the word starting might map to ⟨start, -ing⟩, the word vocabularies might map to ⟨vocabulary, -s⟩, and the word morphologically might map to ⟨morph, -ology, -ical, -ly⟩. If every word in a text dataset can be decomposed into its morphemes, we could build a language model with V as the set of morphemes, rather than surface words. Morphological ambiguity, where a word can be broken down in multiple ways,1 complicates this approach.
A third solution is to build language models at the level of characters, or to combine character- and
word-level language models. This powerful idea has become more popular recently, but in order to read
those papers, you first need a basic understanding of how language models work with the simpler “UNK”
solution.
We define a special “start” symbol, x0 = ○, assumed to appear just before every sentence; the random variable X0 always takes the value x0.2
p(X = x) = p(X1 = x1 | X0 = x0)
           · p(X2 = x2 | X0:1 = x0:1)
           · p(X3 = x3 | X0:2 = x0:2)
           ⋮
           · p(Xℓ = ○ | X0:ℓ−1 = x0:ℓ−1)   (8)

         = ∏_{j=1}^{ℓ} p(Xj = xj | X0:j−1 = x0:j−1)   (9)
Note that this decomposition does not involve any assumptions. Each word is generated conditioned on the
history of words that came before, and the entire history is considered.
How would we estimate the component distributions? The starting point for estimating distributions
like these is the same as in the creepy language model: probabilities should be proportional to counts (or
frequencies). For conditional probabilities like those in Equation 9, the relative frequency estimate is:
p(Xj = xj | X0:j−1 = x0:j−1) = c_{x1:n}(x0:j) / c_{x1:n}(x0:j−1)   (10)
That is, the probability is estimated as the frequency of observing a history/word pair, divided by the fre-
quency of observing just the history (in the training dataset). Plugging in to Equation 9, we get:
p(X = x) = ∏_{j=1}^{ℓ} c_{x1:n}(x0:j) / c_{x1:n}(x0:j−1)   (11)

         = [∏_{j=1}^{ℓ} c_{x1:n}(x0:j)] / [∏_{j=1}^{ℓ} c_{x1:n}(x0:j−1)]   (12)

         = [c_{x1:n}(x0:1) · c_{x1:n}(x0:2) · c_{x1:n}(x0:3) · · · c_{x1:n}(x0:ℓ−1) · c_{x1:n}(x0:ℓ)] / [c_{x1:n}(x0) · c_{x1:n}(x0:1) · c_{x1:n}(x0:2) · c_{x1:n}(x0:3) · · · c_{x1:n}(x0:ℓ−1)]   (13)

         = c_{x1:n}(x0:ℓ) / c_{x1:n}(x0)   (14)

         = c_{x1:n}(x) / n   (15)
This is identical to Equation 7, so it should be clear that nothing has changed yet. (Note that cx1:n (x0 )
is exactly n, the number of sentences in the training dataset, since the start symbol occurs before the
beginning of each sentence, and only there.)
Mathematically, the (unigram) assumption that each word is generated independently of its history is:

p(Xj = xj | X0:j−1 = x0:j−1) = pθ(Xj = xj)   (assumption)   (16)

so that

pθ(X = x) = ∏_{j=1}^{ℓ} pθ(Xj = xj)   (17)
A model that instead conditions each word only on the single preceding word is called a bigram model. This model is also known as a (first-order) Markov model, because the assumption it makes (Equation 20) is a (first-order) Markov assumption.
The parameters of this model, θ, include a value for the probability of every word v in V, conditioned on every word v′ in V, plus the start symbol ○. The relative frequency estimate is:

θ̂_{v|v′} = c_{x1:n}(v′v) / Σ_{u∈V} c_{x1:n}(v′u)

It’s important to notice that the denominator here is the count of the history word, occurring as a history word with any vocabulary item u ∈ V. That means that when we estimate (for example) θ_{The|○}, the denominator will count the number of times some word occurs after ○ (which will be the number of sentences n, since every sentence’s first word follows an implied ○).
The number of parameters that need to be estimated for the bigram model is given by the number of histories (V + 1, accounting for ○) times the vocabulary size (V): V(V + 1).
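Here is a minimal Python sketch of bigram relative frequency estimation as described above. The start/stop token spellings (“<s>”, “</s>”) stand in for ○ and the stop symbol; they are our own placeholders.

```python
from collections import Counter, defaultdict

START, STOP = "<s>", "</s>"   # stand-ins for the start and stop symbols

def estimate_bigram(training_sentences):
    """Relative frequency estimate: theta[v'][v] = c(v' v) / sum_u c(v' u)."""
    bigram_counts, history_counts = Counter(), Counter()
    for sent in training_sentences:
        words = [START] + sent + [STOP]
        for prev, curr in zip(words, words[1:]):
            bigram_counts[(prev, curr)] += 1
            history_counts[prev] += 1
    theta = defaultdict(dict)
    for (prev, curr), c in bigram_counts.items():
        theta[prev][curr] = c / history_counts[prev]
    return theta

theta = estimate_bigram([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(theta["the"])    # {'cat': 1.0}
print(theta[START])    # {'the': 1.0}; the denominator here is n, the number of sentences
```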
The parameters of a general n-gram model, θ, include a value for the probability of every word v in V, conditioned on every history. Histories are now a bit more complicated, because we need an (n − 1)-length history for every word, including the first one (which up until now was only preceded by x0 = ○). To keep notation simple, we assume that every sentence is preceded by as many copies of ○ as we need to ensure that x1 has an (n − 1)-length history. This means that x−(n−2) = x−(n−3) = · · · = x−1 = x0 = ○. We use h to denote a history of length (n − 1). The relative frequency estimate is:

θ̂_{v|h} = c_{x1:n}(hv) / Σ_{u∈V} c_{x1:n}(hu)

As before, the denominator here is the count of the history, occurring as a history with any vocabulary item.
Notice that, as we increase n, these models approach the creepy language model, remembering more
and more of each word’s history. With relative frequency estimation, they will never allow a word to occur
following a history it wasn’t observed with in the training dataset. As we decrease n, we approach a unigram
model that simply ignores the history. The best n value for your problem will depend on the training dataset,
especially its size (n and N ), and on the size of the vocabulary (V ). When making a high-level design choice
like n, it is advisable to reserve a portion of data, separate from your training dataset, and use it to make
the choice. High-level choices like n’s value are sometimes called model selection or hyperparameter
selection. Here’s a simple procedure for choosing n:
1. When constructing your training dataset, hold some data aside (i.e., do not include it in x1:n ). Call this
your development dataset; we’ll refer to it as ẋ1:d .
2. Let N be the set of values for n you are willing to consider.
3. For each g ∈ N:
(a) Estimate the parameters θ̂^{(g)} for a g-gram model from the training dataset x1:n.
(b) Calculate perplexity(p_{θ̂^{(g)}}; ẋ1:d), the perplexity of the g-gram model on the development dataset.
4. Choose the g* with the lowest development-dataset perplexity, and let n = g*. (A sketch of this loop follows.)
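A sketch of this loop in Python. The two helpers are caller-supplied stand-ins (the notes do not fix an API): estimate(train, g) should return a g-gram model, and perplexity(model, dev) should score it on the development data.

```python
def select_order(train_data, dev_data, candidate_orders, estimate, perplexity):
    """Steps 3-4: pick the order g* with the lowest development-set perplexity."""
    return min(candidate_orders,
               key=lambda g: perplexity(estimate(train_data, g), dev_data))
```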
n-gram models are fairly easy to build, requiring only a pass over the data to gather all of the appropriate
counts, then simple operations to obtain relative frequency estimates. When working with very large training
datasets, some engineering is required to work with them efficiently. Many of the key tricks are implemented
in open-source implementations like KenLM3 and SRILM.4
It is not difficult to prove that relative frequency estimation gives the maximum likelihood estimate (MLE)
for conditional categorical distributions like those our n-gram models are built out of.
One of the reasons statisticians like the MLE in general is theoretical. In the limit as n → ∞, the MLE
will converge on the true distribution that generated the data, provided that the model family (here, an n-gram model for a particular value of n) includes the true distribution. In NLP, we almost never believe that the
model family includes the true distribution; we accept that our models are imperfect and simply try to make
them as useful as possible.
And MLE n-gram models are not very useful, because, for history h, the probability of any word not observed
immediately after that history will be 0. Further, for any history that wasn’t observed, the distribution over
words that can immediately follow it is completely undefined (since its frequency is 0). This spells disaster
for any perplexity calculations!
If you’ve taken a class in machine learning, you’ve likely learned about the problem of overfitting,
that is, when a model’s performance on training data is very good, but it doesn’t generalize well to test (or
3 https://kheafield.com/code/kenlm/
4 http://www.speech.sri.com/projects/srilm/
development) data. The inability of MLE n-gram models to deal well with data sparsity can be seen as an
example of overfitting.
5.6 Smoothing
In order to make n-gram models useful, we must apply a smoothing transformation that eliminates zero counts, both for histories and for n-grams. There are many methods available for smoothing, and some of them are quite complicated. To understand why, consider the original definition of a language model, which requires that the sum of probabilities over all sequences in V† be one (Equation 2). It is straightforward to
show that this constraint is met by MLE with relative frequency estimation; it is harder to guarantee when
we start manipulating the estimation procedure.
Instead of explaining specific smoothing algorithms in detail, we will consider some “safe” transforma-
tions that provide building blocks for smoothing.
5.6.1 Interpolation
Suppose we interpolate a bigram MLE with a unigram MLE:

θ̂^{(interp)}_{v|v′} = (1 − α) · θ̂^{(MLE)}_{v|v′} + α · θ̂^{(MLE)}_v

where α ∈ [0, 1]. This is a bigram model whose estimate is given partly by MLE and partly by a unigram MLE estimate. Unlike the unsmoothed bigram estimate, this interpolated estimate will never assign zero to a word that has a nonzero unigram count, as long as α > 0. Unlike the unsmoothed unigram estimate, the interpolated estimate pays attention to the context, as long as α < 1.
The above idea generalizes to any number (K) of models. Let α ∈ R^K_+ be a vector of positive interpolation coefficients that sums to one. Then let

θ̂^{(interp)}_{v|h} = Σ_{k=1}^{K} α_k θ̂^{(k)}_{v|h}   (30)

(We abuse notation slightly; some models may not define θ^{(k)}_{v|h} for the full history h. In such cases, we truncate the history as needed to fit the model.)
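A small Python sketch of Equation 30 for a single history; each component estimate is represented as a word-to-probability dictionary (an assumption of this sketch), with missing entries treated as zero.

```python
def interpolate(component_estimates, alphas):
    """Equation 30: theta_interp(v|h) = sum_k alpha_k * theta_k(v|h)."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "coefficients must sum to one"
    vocab = set().union(*component_estimates)
    return {v: sum(a * est.get(v, 0.0)
                   for a, est in zip(alphas, component_estimates))
            for v in vocab}

# Toy mixture of a bigram estimate (for the history "the") with a unigram estimate.
bigram_given_the = {"cat": 0.6, "dog": 0.4}
unigram = {"cat": 0.2, "dog": 0.1, "the": 0.7}
print(interpolate([bigram_given_the, unigram], [0.9, 0.1]))
```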
This idea leads naturally to the question of how to choose α. For the unigram/bigram example above,
it’s straightforward to show that if we follow the maximum likelihood principle and choose α to make the
training dataset as likely as possible, we will pick α = 0; a bigram model has more flexibility than a unigram
model and can always make the training dataset more likely.
We therefore use development data to choose α, simulating the test-time experiment where the model
is exposed to new data. α is an example of a hyperparameter, so a model selection method like the one
presented in Section 5.4 (for choosing n) can be applied.
5.6.2 Adding λ
Consider the relative frequency (MLE) estimate for a trigram model:

θ̂^{(MLE)}_{v|v″v′} = c_{x1:n}(v″v′v) / Σ_{u∈V} c_{x1:n}(v″v′u)   (31)
If, prior to this calculation, we augmented every count by a fixed quantity λ, we’d have:
θ̂^{(add-λ)}_{v|v″v′} = (λ + c_{x1:n}(v″v′v)) / Σ_{u∈V} (λ + c_{x1:n}(v″v′u)) = (λ + c_{x1:n}(v″v′v)) / (Vλ + Σ_{u∈V} c_{x1:n}(v″v′u))   (32)
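A minimal sketch of Equation 32; the data structures are assumptions of this sketch (a Counter over trigram tuples and an explicit vocabulary set).

```python
from collections import Counter

def add_lambda_estimate(trigram_counts, vocab, history, word, lam=0.1):
    """Equation 32: (lambda + c(h v)) / (V*lambda + sum_u c(h u)), with h = (v'', v')."""
    numerator = lam + trigram_counts[history + (word,)]
    denominator = len(vocab) * lam + sum(trigram_counts[history + (u,)] for u in vocab)
    return numerator / denominator

counts = Counter({("do", "not", "disturb"): 3, ("do", "not", "worry"): 1})
vocab = {"disturb", "worry", "panic"}
print(add_lambda_estimate(counts, vocab, ("do", "not"), "panic"))   # nonzero despite a zero count
```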
5.6.3 Discounting
A third tool, similar in spirit to adding λ, is discounting, where we subtract away from each count before
normalizing into probabilities. Let δ ∈ (0, 1) be the discount value. For a given history h, we divide the
vocabulary V into two disjoint sets:
A(h) = {v ∈ V : cx1:n (hv) > 0} (33)
B(h) = {v ∈ V : cx1:n (hv) = 0} (34)
For v ∈ A(h), define:
θ̂^{(disc)}_{v|h} = (c_{x1:n}(hv) − δ) / Σ_{u∈V} c_{x1:n}(hu)   (35)
The result of this transformation will be that the probabilities sum up to a value less than one. Let the
“missing mass” be denoted by
q(h) = 1 − Σ_{v∈A(h)} θ̂^{(disc)}_{v|h}   (36)
This mass, q(h), will be divided up among all of the words in B(h). One common way to do it is to divide
up proportionally according to a shorter-n n-gram model. This is known as backoff. A simpler way is to
simply divide it up uniformly across B(h). The value of δ can be chosen using model selection.
Note that, in general, δ need not be a fixed value for every history length, every history frequency, or even
every history. One famous method for discounting, Good-Turing discounting, takes the total count mass for
words with observed count c and redistributes it among words with observed count c − 1. This means that,
for a given history, zero-frequency words are given a probability similar to what we would normally assign to
words that occurred once, which are given a probability similar to what we would normally assign to words
that occurred twice, and so on.
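The sketch below implements Equations 33–36 for a single history, redistributing the missing mass uniformly over B(h) (the simpler option mentioned above); backoff would replace the uniform split. The names and the discount value are our own.

```python
from collections import Counter

def discounted_estimate(counts_after_h, vocab, delta=0.5):
    """Absolute discounting for one history h: subtract delta from observed counts,
    then give the missing mass q(h) uniformly to the unseen words B(h)."""
    total = sum(counts_after_h.values())
    A = {v for v in vocab if counts_after_h.get(v, 0) > 0}       # Equation 33
    B = vocab - A                                                # Equation 34
    theta = {v: (counts_after_h[v] - delta) / total for v in A}  # Equation 35
    q = 1.0 - sum(theta.values())                                # Equation 36
    for v in B:
        theta[v] = q / len(B)
    return theta

est = discounted_estimate(Counter({"disturb": 3, "worry": 1}),
                          {"disturb", "worry", "panic", "stop"})
print(est, sum(est.values()))    # probabilities sum to 1.0
```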
6 Algorithms
There are three useful language model-related algorithms you should be able to figure out:
1. Given a language model p and a sentence x, calculate p(x), or its logarithm.5 You should be able to do
this in O(`) runtime, where ` is the length of the sentence. Because language model probabilities are tiny
and can lead to underflow, the following equation is extremely useful:
log p(x) = Σ_{j=1}^{ℓ} log p(xj | x0:j−1)   (37)
5 The logarithm is sometimes preferred to avoid underflow, and it’s what you need for perplexity (Equation 6).
That is, summing logarithms of probabilities (log-probabilities) is a smart way to avoid the underflow that
would likely result if we multiplied together probabilities. It doesn’t really matter which logarithm base
you use, as long as you exponentiate with the same base (the default in most programming languages is
base e).
2. Estimate a language model’s parameters from a training dataset. You should be able to do this in O(N )
runtime, where N is the number of words in the training data, and also O(N ) space.6 Importantly, you
should not require O(V^n) space, which is what you would need if you naïvely stored every n-gram’s
probability! Depending on your smoothing method, many n-grams will have the same probability, and
you will want to store that value just once if you can. In particular, an n-gram that had a training dataset
frequency of zero should probably take a “default” probability value, depending on some properties of the
history it contains.
3. Randomly sample a sentence x from a language model p. You should be able to sample each word in O(V) time. (A sketch of this and of the first algorithm, for a bigram model, appears after this list.)
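The sketch below illustrates the first and third algorithms for a bigram model stored as nested dictionaries (the data structure and token spellings are our own choices; a smoothed model would avoid missing entries).

```python
import math
import random

START, STOP = "<s>", "</s>"

def sentence_log_prob(theta, sentence):
    """Algorithm 1 (Equation 37): sum the log conditional probabilities, O(length) time."""
    words = [START] + sentence + [STOP]
    return sum(math.log(theta[prev][curr]) for prev, curr in zip(words, words[1:]))

def sample_sentence(theta, max_len=50):
    """Algorithm 3: draw one word at a time until the stop symbol is generated."""
    sentence, prev = [], START
    for _ in range(max_len):
        words = list(theta[prev])
        weights = [theta[prev][w] for w in words]
        prev = random.choices(words, weights=weights, k=1)[0]   # O(V) per word
        if prev == STOP:
            break
        sentence.append(prev)
    return sentence

theta = {START: {"the": 1.0},
         "the": {"cat": 0.5, "dog": 0.5},
         "cat": {STOP: 1.0},
         "dog": {STOP: 1.0}}
print(sentence_log_prob(theta, ["the", "cat"]))   # log(1.0) + log(0.5) + log(1.0)
print(sample_sentence(theta))                     # ['the', 'cat'] or ['the', 'dog']
```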
7 Log-Linear Language Models
Another solution to the data sparseness problem is to rethink the way we parameterize language models. Instead of associating a parameter θ_{v|h} with every n-gram, the approach we consider next converts the n-gram into a collection of attributes (known as features) and uses those to assign it a probability.
See Collins [2011b] or Smith [2004] for general introductions to log-linear models. Here we present
them for the language modeling case, specifically.
To define a log-linear language model, we first introduce a function φ that maps a history and a word (in
V) to a real-valued vector in Rd . Associated with the kth dimension of this vector is a coefficient or weight
wk , and together these are stacked into w ∈ Rd , which are the parameters of the model. The form of the
language model is given by:
pw(X = x) = ∏_{j=1}^{ℓ} pw(Xj = xj | X0:j−1 = x0:j−1)   (38)

          = ∏_{j=1}^{ℓ} exp(w · φ(x0:j−1, xj)) / Zw(x0:j−1)   (39)

          = ∏_{j=1}^{ℓ} exp(w · φ(xj−n+1:j−1, xj)) / Zw(xj−n+1:j−1)   (assumption)   (40)

          = ∏_{j=1}^{ℓ} exp(w · φ(hj, xj)) / Zw(hj)   (41)
The assumption in Equation 40 is the standard (n − 1)-order Markov assumption, and the result is that the
feature vector φ maps n-grams (hv) to vectors. The Zw function in the denominator is shorthand for a
summation that guarantees that the distribution over words given history h will sum to one. Its full form is:
Zw(h) = Σ_{v∈V} exp(w · φ(h, v))   (42)
The form of the log-linear language model is a bit daunting at first. It’s helpful to consider how it’s built
from the inside out, which we’ll do in the next few sections.
6 If you use interpolation or backoff, you might need O(N n) runtime and space.
7.1 Features
When we consider an n-gram hv, we start by mapping it to its feature vector φ(hv). Some kinds of features
conventionally used include:
• Indicators for n-grams. For example, a feature for the trigram do not disturb would return 1 if and only if the last two words in h are do not and v is disturb; otherwise it returns 0. There are V^3 trigram features (and V^n features for n-grams of length n).
• Indicators for gappy n-grams. This allows, for example, the old man, the healthy man, the bearded man,
and so on, to all have probabilities that rise or fall together, through a feature that returns 1 if and only if
the second-to-last word in h is the and v is man.
• Spelling features, such as whether a word begins with a vowel or a capital letter, or whether a word
contains a digit. These can be conjoined with a history word; for example, a feature that returns 1 if and
only if the last word of h is an and v starts with a vowel could let the language model learn to prefer an
(to a) before words that start with vowels.
• Class features, which use some external resource to define sets of words that share membership in some
class like “protein names” or “geographic place names” or “transitive verbs.” Such a feature returns 1 if v
is in the class, and 0 if it isn’t; as with the spelling features, it can be conjoined with some feature of the
history.
You can define any features you’d like, as long as they can be calculated by considering only h and v.7
It’s common in NLP for features to be binary indicators, like those above, and to form new features by
conjoining the values of simpler binary indicator features.
The first challenge of log-linear language models is in choosing good features. If you choose too many,
your model will be prone to overfitting the training dataset. If you don’t choose enough features that capture
the properties of the language, your model will not learn to generalize.
7.2 Coefficients
Once we’ve mapped hv to its feature vector representation φ(h, v), we take an inner product with the
coefficients w:
w · φ(h, v) = Σ_{k=1}^{d} wk φk(h, v)   (43)
This linear mapping can be understood as projecting our n-gram to a linear score. For a binary feature φk ,
the value of the corresponding coefficient wk tells us immediately how the presence of the feature changes
the score:
• If wk > 0, then the score of any n-gram that “has” this feature (i.e., where φk (h, v) = 1) increases by wk .
This will make the n-gram more likely.
• If wk < 0, then the score of any n-gram that “has” this feature (i.e., where φk (h, v) = 1) decreases by
−wk . This will make the n-gram less likely.
• If wk = 0, then the score of any n-gram that “has” this feature (i.e., where φk (h, v) = 1) is unaffected.
7 It can be shown that a feature that depends only on h and not v will have no effect on the probability distribution for history h. This is left as an exercise.
7.3 Softmax: Exponentiate and Normalize
Once we map n-grams to scores, we can transform the scores into probability distributions for each history
h, by applying a transformation often called the softmax:
softmax(⟨a1, a2, . . . , aV⟩) = ⟨ e^{a1} / Σ_{k=1}^{V} e^{ak}, e^{a2} / Σ_{k=1}^{V} e^{ak}, . . . , e^{aV} / Σ_{k=1}^{V} e^{ak} ⟩   (44)
The softmax takes every value in a vector—here, the collection of scores for words in V paired with a single
history h—and exponentiates them before renormalizing them. This transformation has some desirable
properties:
• It preserves monotonicity (i.e., if ai > aj then the ith value of the new vector will be larger than the jth
value of the new vector). This means that the higher an n-gram’s score is, the higher its probability will
be.
• It gives a proper probability distribution over V, i.e., one comprised of nonnegative values that sum to one.
• If the scores are all finite, then every n-gram will have positive probability; there will be no zeroes. (A small sketch of the computation follows this list.)
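To see how features, coefficients, and the softmax fit together (Equations 41–44), here is a small Python sketch. The sparse dictionary representation, the toy feature function, and the example weights are our own illustrations.

```python
import math

def softmax(scores):
    """Equation 44: exponentiate and normalize (shifting by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def conditional_probs(w, phi, history, vocab):
    """p_w(v | h) for every v in V: linear scores w . phi(h, v), then softmax."""
    scores = [sum(w.get(k, 0.0) * value for k, value in phi(history, v).items())
              for v in vocab]
    return dict(zip(vocab, softmax(scores)))

def phi(history, v):
    """Toy feature function: a bigram indicator plus a 'starts with a vowel' spelling feature."""
    feats = {("bigram", history[-1], v): 1.0}
    if v[0].lower() in "aeiou":
        feats[("starts_with_vowel",)] = 1.0
    return feats

w = {("bigram", "an", "apple"): 1.5, ("starts_with_vowel",): 0.3}   # sparse coefficients
print(conditional_probs(w, phi, ["an"], ["apple", "pear", "orange"]))
```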
The coefficients w are typically chosen by maximizing the log-likelihood8 of the training data:

max_{w∈R^d} Σ_{i=1}^{N} (w · φ(hi, xi) − log Zw(hi))   (47)

This problem is concave in w and differentiable with respect to w, which means that we can negate it and apply standard methods for convex optimization to solve it with high precision. Two popular methods are:
• L-BFGS, a “batch” method that iteratively calculates the gradient of the total log-likelihood with respect
to w and updates in the gradient direction, with clever math for deciding how far to go in that direction;
and
• stochastic gradient descent, a method that calculates the gradient of the log-likelihood of one (or a small number of) instances i, often chosen by random shuffling of the data, and updates w after each such calculation (a small sketch follows this list).
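The sketch below (reusing conditional_probs and phi from the previous sketch) takes one stochastic gradient ascent step for a single training instance. It relies on the standard fact that the gradient of w · φ(h, x) − log Zw(h) is φ(h, x) minus the expected feature vector under pw(· | h); the names and the learning rate are illustrative.

```python
def sgd_step(w, phi, history, observed_word, vocab, learning_rate=0.1):
    """One stochastic gradient ascent step on w . phi(h, x) - log Z_w(h)."""
    probs = conditional_probs(w, phi, history, vocab)
    # Push up the features of the word that was actually observed.
    for k, value in phi(history, observed_word).items():
        w[k] = w.get(k, 0.0) + learning_rate * value
    # Push down every word's features in proportion to its current probability.
    for v in vocab:
        for k, value in phi(history, v).items():
            w[k] = w.get(k, 0.0) - learning_rate * probs[v] * value
    return w
```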
There is an intuitive way to understand Equation 47. For each word xi in the training set, the goal is to
increase the score that it receives with its observed history h (the first term inside the summation over i).
8 Since the logarithm function is monotonic, the parameters that maximize log-likelihood will be identical to those that maximize likelihood.
We can’t stop there, though; if we did, then the trivial solution would drive every wk to +∞. We must give
the optimization method something to decrease, and that’s where the second term, the logarithm of Zw (h)
comes in. We sometimes refer to these two components as the “hope” and “fear” parts of the objective.
Letting s(v) denote the linear score of a word with history h (suppressed for clarity), we have:
increase s(xi)  [hope]    while decreasing    log Σ_{v∈V} exp s(v)  [fear]   (48)
Note that the “log Σ exp” transformation is a smooth upper bound on the max function. It arises frequently
in machine learning. Here, the intuition comes from imagining that it (approximately) picks out the score
of the currently-most probable v given history h, and tries to push it down. In fact, the scores of all v get
pushed down, in aggregate. The only way to achieve the maximum likelihood solution is to achieve balance
between hope and fear, so that the score of xi is as close as possible to the (approximately) most-probable
next word.
Of course, in real data, many words may share this history, so really what we’re trying to achieve for
history h is:
increase Σ_{i:hi=h} s(xi)  [hope]    while decreasing    |{i : hi = h}| · log Σ_{v∈V} exp s(v)  [fear]   (49)
So, in aggregate, we want the scores of words that actually follow h to come close to the (approximately)
most probable word’s scores. Words that occur with h more often will be counted more times in the “hope”
term, so we’ll have more to gain by increasing their probabilities—just as in relative frequency estimation
for n-gram models.
Of course, because the coefficients w are shared across all histories, we can’t solve the above problem
separately for each history by itself. The larger maximum likelihood problem considers all of the data
together, balancing between hope and fear across all histories.
Notice that calculating Zw requires summing scores over the entire vocabulary; simply calculating the objective function will require O(V N d) runtime. This implies that solving Equation 47 will be a fairly
expensive operation, especially compared to classical n-gram models. This is one reason that log-linear
models never became truly mainstream.
A second challenge with log-linear language models is that they tend to overfit. The overfitting problem
here has less to do with data sparsity and more with the rich expressive power of models that use many
features. It’s straightforward to show that, under certain conditions, a feature’s coefficient might tend toward
positive or negative infinity as we approach the maximum likelihood solution. One way to prevent this is to
regularize the log-likelihood objective by adding a penalty that grows as coefficients take on more extreme
values. The simplest version of this is a quadratic penalty:
max_{w∈R^d} ( Σ_{i=1}^{N} (w · φ(hi, xi) − log Zw(hi)) − λ Σ_{k=1}^{d} wk^2 )   (50)
where λ is a hyperparameter chosen via model selection. This technique is sometimes called (squared) ℓ2 regularization, because the penalty is proportional to the squared ℓ2-norm of the w vector, written as ‖w‖₂².
The `1 norm can also be used; it has the interesting and arguably desirable property of driving many of the
wk to 0, so that they can be safely eliminated from the model.
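If the objective is optimized with stochastic gradient steps, the penalty in Equation 50 contributes its own gradient term: each coefficient is shrunk toward zero by 2λwk per step. A tiny sketch, complementing the earlier sgd_step sketch (λ and the learning rate are illustrative):

```python
def l2_penalty_step(w, lam=0.01, learning_rate=0.1):
    """Gradient ascent contribution of the -lambda * sum_k w_k^2 penalty in Equation 50."""
    for k in list(w):
        w[k] -= learning_rate * 2.0 * lam * w[k]   # shrink each coefficient toward zero
    return w
```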
References
Michael Collins. Course notes for COMS W4705: Language modeling, 2011a. URL http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lm.pdf.

Michael Collins. Log-linear models, MEMMs, and CRFs, 2011b. URL http://www.cs.columbia.edu/~mcollins/crf.pdf.

Daniel Jurafsky and James H. Martin. N-grams (draft chapter), 2016. URL https://web.stanford.edu/~jurafsky/slp3/4.pdf.

Noah A. Smith. Log-linear models, 2004. URL http://homes.cs.washington.edu/~nasmith/papers/smith.tut04.pdf.