Model With One-Word Context
Another thing that's important to understand is that word2vec is not deep learning; both CBOW and skip-gram are “shallow” neural models. Tomas Mikolov even said that the whole idea behind word2vec was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets. In his papers, Mikolov recommends using the skip-gram model with negative sampling, as it outperformed the other variants on analogy tasks.
We start from the simplest version. We assume that there is only one word considered per context. The CBOW model will predict one target word given one context word, and the skip-gram model will predict one context word given one target word. In the one-word context, the CBOW model and the skip-gram model are actually exactly the same. Figure 1 shows the model under this simplified one-word context.
Model definition
The vocabulary size is $V$, which means there are $V$ unique words in the vocabulary. Each input word can be mapped with a function $\mathrm{onehot}(v_k)$ to a $V$-dimensional one-hot encoded vector $x$. For example, when the input word is $v_k$ ($1 \leq k \leq V$), the $k$-th word in the vocabulary, we have
$$
x = \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = \mathrm{onehot}(v_k) = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
$$
where only the $k$-th element of $x$ satisfies $x_k = 1$, and all other elements $x_{k'} = 0$ for $k' \neq k$.
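As a quick illustration, here is a minimal Python sketch of the $\mathrm{onehot}$ mapping described above; the toy vocabulary is made up for the example:

```python
import numpy as np

# Toy vocabulary; in a real corpus V can be hundreds of thousands of words.
vocab = ["the", "quick", "brown", "fox", "jumps"]          # V = 5
word_to_index = {w: k for k, w in enumerate(vocab)}

def onehot(word):
    """Map the word v_k to its V-dimensional one-hot vector x."""
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(onehot("brown"))   # [0. 0. 1. 0. 0.] -- only the k-th element is 1
```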
From the input layer to the hidden layer, the model uses a $V \times N$ weight matrix $W$, where $N$ is the dimension of the hidden layer. Each hidden unit $h_i$ is a linear combination of the input:

$$
h_i = \sum_{k=1}^{V} w_{k,i} \times x_k
$$
Note: we use the subscript $w_{k,i}$ instead of $w_{i,k}$ because it is more convenient for the vectorized representations shown later.
$$
h_i = \begin{bmatrix} w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \end{bmatrix} \times \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = w_i^T \times x
$$

where
$$
w_i = \begin{bmatrix} w_{1,i} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{V,i} \end{bmatrix}
$$
We have
$$
h = W^T \times x = \begin{bmatrix} w_{1,1} & \cdots & w_{k,1} & \cdots & w_{V,1} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{1,N} & \cdots & w_{k,N} & \cdots & w_{V,N} \end{bmatrix} \times \begin{bmatrix} 0 \\ \vdots \\ x_k \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix}
$$
Note: As mentioned before, the input $x$ is a $V$-dimensional one-hot vector with only $x_k = 1$.
From the above equation, we can see that $h$ is actually the same as the $k$-th column of $W^T$, i.e. the $k$-th row of the matrix $W$. We define the $k$-th row of the matrix $W$ as $W_{k,*}$, and we can get
$$
h = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix} = W_{k,*}
$$
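In practice this means the product $W^T \times x$ never has to be computed explicitly: it is just a row lookup. A small sketch, with made-up sizes $V$ and $N$, confirming the equivalence:

```python
import numpy as np

V, N, k = 5, 3, 2                      # vocabulary size, hidden size, input word index
W = np.random.rand(V, N)               # input->hidden weights; row k is W_{k,*}

x = np.zeros(V)
x[k] = 1.0                             # one-hot input for the k-th word

h_matmul = W.T @ x                     # hidden layer computed as W^T x
h_lookup = W[k, :]                     # the k-th row of W

assert np.allclose(h_matmul, h_lookup)  # h = W_{k,*}
```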
The output of the model is a $V$-dimensional vector $y$:

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_j \\ \vdots \\ y_V \end{bmatrix}
$$
Each element $y_j$ of $y$ is the predicted probability that the output word is $v_j$, the $j$-th word in the vocabulary, given that the input word's one-hot vector representation is $x = \mathrm{onehot}(v_k)$.
From the hidden layer to the output layer, there is another $N \times V$ weight matrix $W'$. Each score $u_j$ is computed as

$$
u_j = \begin{bmatrix} w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix} = (w'_j)^T \times h
$$

where
$$
w'_j = \begin{bmatrix} w'_{1,j} \\ \vdots \\ w'_{i,j} \\ \vdots \\ w'_{N,j} \end{bmatrix}
$$
$$
W' = \begin{bmatrix} w'_1 & \cdots & w'_j & \cdots & w'_V \end{bmatrix} = \begin{bmatrix} w'_{1,1} & \cdots & w'_{1,j} & \cdots & w'_{1,V} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{i,1} & \cdots & w'_{i,j} & \cdots & w'_{i,V} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{N,1} & \cdots & w'_{N,j} & \cdots & w'_{N,V} \end{bmatrix}
$$
$$
u = \begin{bmatrix} u_1 \\ \vdots \\ u_j \\ \vdots \\ u_V \end{bmatrix}
$$
We have
$$
u = (W')^T \times h = \begin{bmatrix} w'_{1,1} & \cdots & w'_{i,1} & \cdots & w'_{N,1} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{1,V} & \cdots & w'_{i,V} & \cdots & w'_{N,V} \end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix}
$$
$$
u_j = \sum_{i=1}^{N} w'_{i,j} \times h_i = (w'_j)^T \times h = (W'_{*,j})^T \times W_{k,*}
$$
Applying the softmax function to the scores $u$, we can get
$$
y_j = p(v_j \mid v_k) = \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}
$$
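Putting the pieces together, here is a minimal NumPy sketch of the one-word-context forward pass; the sizes are made-up toy values, and the names follow the notation above:

```python
import numpy as np

V, N = 5, 3                        # vocabulary size, hidden (embedding) size
W = np.random.rand(V, N)           # input -> hidden weights
W_prime = np.random.rand(N, V)     # hidden -> output weights

def forward(k):
    """Return y_j = p(v_j | v_k) for all j, given the k-th input word."""
    h = W[k, :]                    # hidden layer: the k-th row of W
    u = W_prime.T @ h              # scores u_j = (W'_{*,j})^T W_{k,*}
    exp_u = np.exp(u - np.max(u))  # softmax, shifted for numerical stability
    return exp_u / exp_u.sum()

y = forward(k=2)
print(y, y.sum())                  # probabilities over the vocabulary, summing to 1
```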
Parameter fitting
In the CBOW model, each input $x^{(m)}$ is the context word and each output $y^{(m)}$ is the target word. In the skip-gram model, each input $x^{(m)}$ is the target word and each output $y^{(m)}$ is the context word. As shown before, $x^{(m)}$ is a one-hot $V$-dimensional vector, and $y^{(m)}$ is also a one-hot $V$-dimensional vector. $M$ training examples mean we extract $M$ context/target word pairs from the whole corpus.
When $x^{(m)}_k = 1$, i.e. $x^{(m)} = \mathrm{onehot}(v_k)$, and $y^{(m)}_j = 1$, i.e. $y^{(m)} = \mathrm{onehot}(v_j)$, it means that the current input word is $v_k$, the $k$-th word in the vocabulary, and the corresponding output word is $v_j$, the $j$-th word in the vocabulary.

Note: $y^{(m)}_j$ and $y_j$ are different.
Assuming that the M training examples were generated independently, we can then
write down the likelihood of the parameters as
$$
L(\theta) = P(Y \mid X; \theta) = \prod_{m=1}^{M} p(y^{(m)} \mid x^{(m)}; \theta)
$$

$$
\ell(\theta) = \log L(\theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta)
$$
where $\theta = \{W'_{*,j}, W_{k,*}\}$.
We know that $p(y^{(m)} \mid x^{(m)}; \theta)$ is defined by the softmax function; when $x^{(m)} = \mathrm{onehot}(v_k)$, we can get

$$
\log p(y^{(m)} \mid x^{(m)}; \theta) = \log \prod_{l=1}^{V} p(v_l \mid v_k)^{1\{y^{(m)}_l = 1\}} = \sum_{l=1}^{V} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k)
$$
Here, we introduce one more very useful piece of notation. An indicator function $1\{\cdot\}$ takes on a value of 1 if its argument is true, and 0 otherwise ($1\{\mathrm{True}\} = 1$, $1\{\mathrm{False}\} = 0$).
If the current output $y^{(m)}$ corresponds to the $j$-th word $v_j$ in the vocabulary, then $y^{(m)}_j = 1$ and $1\{y^{(m)}_j = 1\} = 1$, and for any $l \neq j$ we get $y^{(m)}_l = 0$ and $1\{y^{(m)}_l = 1\} = 0$. Therefore

$$
\log p(y^{(m)} \mid x^{(m)}; \theta) = 1\{y^{(m)}_j = 1\} \log p(v_j \mid v_k) + \sum_{l \neq j} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k) = \log p(v_j \mid v_k)
$$
To compute the gradients, we will use the vector derivative identity

$$
\frac{\partial\, a^T b}{\partial a} = \frac{\partial\, b^T a}{\partial a} = b
$$
So we can get
$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}} = \frac{\partial\, (W'_{*,j})^T \times W_{k,*}}{\partial W'_{*,j}} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}} = W_{k,*} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
$$
We first calculate
$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial \exp((W'_{*,j})^T \times W_{k,*})} \times \frac{\partial e^{(W'_{*,j})^T W_{k,*}}}{\partial ((W'_{*,j})^T \times W_{k,*})} \times \frac{\partial\, ((W'_{*,j})^T \times W_{k,*})}{\partial W'_{*,j}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times 1 \times e^{(W'_{*,j})^T W_{k,*}} \times W_{k,*} \\
&= \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times W_{k,*} \\
&= p(v_j \mid v_k) \times W_{k,*}
\end{aligned}
$$
We can get

$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}} = W_{k,*} - p(v_j \mid v_k) \times W_{k,*} = \left(1 - p(v_j \mid v_k)\right) \times W_{k,*}
$$

Next, we calculate
$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} \frac{\partial e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}} \times W'_{*,l} \\
&= \sum_{l=1}^{V} \left( \frac{e^{(W'_{*,l})^T W_{k,*}}}{\sum_{l'=1}^{V} e^{(W'_{*,l'})^T W_{k,*}}} \times W'_{*,l} \right) \\
&= \sum_{l=1}^{V} \left( p(v_l \mid v_k) \times W'_{*,l} \right)
\end{aligned}
$$
We can get
$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k,*}} = W'_{*,j} - \sum_{l=1}^{V} \left( p(v_l \mid v_k) \times W'_{*,l} \right)
$$
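For a single training pair (input word $v_k$, output word $v_j$), the two gradients above translate directly into a stochastic gradient ascent update. Below is a minimal sketch under made-up sizes and learning rate; it is an illustration of the derived formulas, not the exact word2vec implementation:

```python
import numpy as np

V, N, lr = 5, 3, 0.1                 # made-up sizes and learning rate
W = np.random.rand(V, N)             # rows are the input vectors  W_{k,*}
W_prime = np.random.rand(N, V)       # columns are the output vectors W'_{*,j}

def sgd_step(k, j):
    """One full-softmax gradient-ascent step for the pair (input v_k, output v_j)."""
    h = W[k, :]                                  # h = W_{k,*}
    u = W_prime.T @ h                            # u_l = (W'_{*,l})^T W_{k,*}
    y = np.exp(u - u.max())
    y /= y.sum()                                 # y_l = p(v_l | v_k)

    e = y.copy()
    e[j] -= 1.0                                  # e_l = p(v_l | v_k) - 1{l = j}
    # d log p / dW'_{*,l} = -e_l * W_{k,*}   and   d log p / dW_{k,*} = -(W' e)
    grad_h = W_prime @ e                         # computed before W' is modified
    W_prime[:, :] -= lr * np.outer(h, e)         # ascend the log-likelihood
    W[k, :] -= lr * grad_h

sgd_step(k=2, j=4)
```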
Note:
Because from the input $x$ to the hidden layer $h$ only a linear function is used, and $h$ is just the $k$-th row of the weight matrix $W$, we can treat the model as a “dynamic” softmax and use the above method to calculate the partial derivatives, as illustrated in the lectures of Richard Socher and Ali Ghodsi. The original papers of Tomas Mikolov also used this method. In that literature, the word embedding vector of the input word is generally denoted $w_{in}$ and the word embedding vector of the output word is denoted $w_{out}$. In our notation, $w_{in}$ is the same as $W_{k,*}$ and $w_{out}$ is the same as $W'_{*,l}$.
But we can also treat the above model as a conventional neural network and use the back-propagation method to calculate the partial derivatives. The partial derivative with respect to $W'_{*,j}$ concerns the weights between the hidden layer $h$ and the output layer $y$. Because $W'_{*,j}$ is the $j$-th column of $W'$, if we go through all $j$ from 1 to $V$, one column vector $W'_{*,j}$ at a time, we cover every element $w'_{i,j}$ of the matrix $W'$. For $W'_{*,j}$, the back-propagation method gives the same result as the “dynamic” softmax calculation shown above. The partial derivative with respect to $W_{k,*}$ concerns the weights between the input layer $x$ and the hidden layer $h$: because $h$ is a linear combination of $x$ based on the weight matrix $W$, and due to the one-hot nature of $x$, $h$ is the same as $W_{k,*}$ when $x_k = 1$. The partial derivative with respect to $W_{k,*}$ is therefore the same as the partial derivative with respect to $h$. For any other $W_{k',*}$ with $k' \neq k$, we have $\frac{\partial h}{\partial W_{k',*}} = 0$, so by the chain rule $\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k',*}} = 0$ as well. Even with the back-propagation method, we only need to consider the partial derivative with respect to $W_{k,*}$ and can ignore the other rows $W_{k',*}$, which again matches the “dynamic” softmax calculation shown above.
In practice, however, the above method for parameter fitting is not practical, because $y_j = p(v_j \mid v_k)$ is very expensive to compute due to the summation $\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}$ over all $V$ context words (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax, but we will not elaborate on this direction. Instead, we will focus on the recommended skip-gram model with negative sampling. Other methods are very similar in terms of mathematical derivation.
The target words in the corpus are represented as $v_{in}$, and each target word has $c$ context words $v_{out}$. In the skip-gram model, the input is $x = \mathrm{onehot}(v_{in})$ and the output is $y = \mathrm{onehot}(v_{out})$. We want to maximize the corpus probability

$$
\arg\max_{\theta} \prod_{(v_{in}, v_{out}) \in D} p(v_{out} \mid v_{in}; \theta)
$$

where $D$ is the set of all target (input) and context (output) word pairs we extract from the corpus.
Negative sampling
As we discussed before, using the softmax to calculate $p(v_{out} \mid v_{in}; \theta)$ is impractical. So Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word vectors. The justification of the negative-sampling approach is based on Noise Contrastive Estimation (NCE); please refer to the paper of Gutmann et al.
Consider a pair $(v_{in}, v_{out})$ of target and context. Did this pair come from the training data? Let's denote by $p(D = 1 \mid v_{in}, v_{out})$ the probability that $(v_{in}, v_{out})$ came from the corpus data. Correspondingly, $p(D = 0 \mid v_{in}, v_{out}) = 1 - p(D = 1 \mid v_{in}, v_{out})$ will be the probability that $(v_{in}, v_{out})$ did not come from the corpus data. As before, assume there are parameters $\theta$ controlling the distribution: $p(D = 1 \mid v_{in}, v_{out}; \theta)$.
Our goal is now to find parameters that maximize the probability that all of the observations indeed came from the data:

$$
\arg\max_{\theta} \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta)
$$
Note: For simplicity, we will use $w_{in}$ and $w_{out}$ instead of $W_{k,*}$ and $W'_{*,l}$ to represent the word vectors of $v_{in}$ and $v_{out}$ respectively.
$$
p(D = 1 \mid v_{in}, v_{out}; \theta) = \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$
So
$$
\ell(\theta) = \sum_{(v_{in}, v_{out}) \in D} \log \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$
This objective has a trivial solution if we set $\theta$ such that $p(D = 1 \mid v_{in}, v_{out}; \theta) = 1$ for every pair $(v_{in}, v_{out})$. This can easily be achieved by setting $\theta$ such that $w_{in} = w_{out}$ and $w_{in}^T w_{out} = K$ for all $w_{in}, w_{out}$, where $K$ is a large enough number (practically, we get a probability of 1 as soon as $K \approx 40$).
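The claim $K \approx 40$ is easy to check numerically; in double precision, $\sigma(40)$ is already indistinguishable from 1:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigma(40.0))         # 1.0 (the true gap is about 4e-18, below double precision)
print(1.0 - sigma(40.0))   # 0.0
```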
We need a mechanism that prevents all the vectors from having the same value, by disallowing some $(v_{in}, v_{out})$ combinations. One way to do so is to present the model with some $(v_{in}, v_{out})$ pairs for which $p(D = 1 \mid v_{in}, v_{out}; \theta)$ must be low, i.e. pairs which are not in the data. This is achieved by generating a set $D'$ of random $(v_{in}, v_{out})$ pairs, assuming they are all incorrect (the name “negative sampling” stems from the set $D'$ of randomly sampled negative examples). The optimization objective now becomes:
$$
\ell(\theta) = \sum_{(v_{in}, v_{out}) \in D} \log \sigma(w_{in}^T w_{out}) + \sum_{(v_{in}, v_{out}) \in D'} \log \sigma(-w_{in}^T w_{out})
$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
The above equation is almost the same as the one in the paper of Mikolov et al. The difference is that here we present the objective for the entire corpus $D \cup D'$, while they present it for one example $(v_{in}, v_{out}) \in D$ and $n$ examples $(v_{in}, v_{out,j}) \in D'$, following a particular way of constructing $D'$.
This is equivalent to drawing the samples $(v_{in}, v_{out,j}) \in D'$ from the distribution

$$
(v_{in}, v_{out,j}) \sim \frac{p_{uni}(v_{in}) \times (p_{uni}(v_{out}))^{3/4}}{Z}
$$

where $p_{uni}(v_{in})$ and $p_{uni}(v_{out})$ are the unigram distributions of targets and contexts respectively, and $Z$ is a normalization constant:

$$
p_{uni}(v_{in}) = \frac{\mathrm{count}(v_{in})}{|\mathrm{Corpus}|}, \qquad p_{uni}(v_{out}) = \frac{\mathrm{count}(v_{out})}{|\mathrm{Corpus}|}
$$
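Here is a minimal sketch of drawing negative context words from the smoothed unigram distribution $(p_{uni}(v_{out}))^{3/4}/Z$; the counts are made-up toy values, and real implementations typically precompute a large sampling table instead:

```python
import numpy as np

# Made-up unigram counts for a toy vocabulary.
counts = {"the": 100, "quick": 5, "brown": 7, "fox": 3, "jumps": 2}
words = list(counts)

unigram = np.array([counts[w] for w in words], dtype=float)
unigram /= unigram.sum()            # p_uni(v_out) = count(v_out) / |Corpus|

noise = unigram ** 0.75             # raise to the 3/4 power ...
noise /= noise.sum()                # ... and divide by the normalization constant Z

rng = np.random.default_rng(0)

def sample_negatives(n):
    """Draw n negative (noise) context words v_neg."""
    return list(rng.choice(words, size=n, p=noise))

print(sample_negatives(5))
```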
Note: Unlike the original skip-gram model described at the beginning, the formulation in this negative-sampling section does not model $p(v_{out} \mid v_{in})$ but instead models a quantity related to the joint distribution of $v_{out}$ and $v_{in}$. If we fix the target representations and learn only the context representations, or fix the context representations and learn only the target representations, the model reduces to logistic regression and is convex. However, in this model the target and context representations are learned jointly, making the model non-convex.
Parameter fitting
$$
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial w_{out}}
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \Bigg( 1\{(v_{in}, v_{out}) \in D\} \times \frac{1}{\sigma(w_{in}^T w_{out})} \times \sigma(w_{in}^T w_{out})\left(1 - \sigma(w_{in}^T w_{out})\right) \times w_{in} \\
&\qquad\qquad + \left(1 - 1\{(v_{in}, v_{out}) \in D\}\right) \times \frac{1}{1 - \sigma(w_{in}^T w_{out})} \times (-1) \times \sigma(w_{in}^T w_{out})\left(1 - \sigma(w_{in}^T w_{out})\right) \times w_{in} \Bigg) \\
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \left( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \right) \times w_{in}
\end{aligned}
$$
Because $w_{in}$ and $w_{out}$ are completely symmetric, we can obtain $\frac{\partial \ell(\theta)}{\partial w_{in}}$ from the same calculation as $\frac{\partial \ell(\theta)}{\partial w_{out}}$:
$$
\frac{\partial \ell(\theta)}{\partial w_{in}} = \sum_{(v_{in}, v_{out}) \in D \cup D'} \left( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \right) \times w_{out}
$$
In our calculation above, we first finish sampling and then mix all generated (positive and negative) pairs $(v_{in}, v_{out})$ into one unified training set $D \cup D'$. But in some papers or notes, the calculation of the partial derivatives looks a little different, because they use a different indexing scheme. While scanning the whole corpus, there are $c$ positive output (context) words $v_{pos}$ for each input (target) word $v_{in}$, i.e. $c$ positive pairs $(v_{in}, v_{pos})$ for each scanned $v_{in}$. Further, for each positive pair $(v_{in}, v_{pos})$, $n$ negative pairs $(v_{in}, v_{neg})$ are generated. As mentioned before, the number of negative pairs is $n$ times the number of positive pairs.
Here we only consider one positive pair $(v_{in}, v_{pos})$ of the current input $v_{in}$, and the $n$ negative pairs $(v_{in}, v_{neg,s})$ with $1 \leq s \leq n$ corresponding to $(v_{in}, v_{pos})$. The log-likelihood is

$$
\ell_{v_{in}}(\theta) = \log\left(\sigma(w_{in}^T w_{pos})\right) + \sum_{s=1}^{n} \log\left(1 - \sigma(w_{in}^T w_{neg,s})\right)
$$
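This per-example objective, together with the gradient $(1\{\cdot \in D\} - \sigma(w_{in}^T w_{out}))$ derived above, leads to the update used in practice. Below is a minimal NumPy sketch with made-up sizes, learning rate, and indices; it illustrates the math above rather than reproducing the reference word2vec code:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

V, N, lr = 5, 3, 0.05             # made-up sizes and learning rate
W_in = np.random.rand(V, N)       # target (input) word vectors  w_in
W_out = np.random.rand(V, N)      # context (output) word vectors w_out

def train_pair(i, pos, negs):
    """One gradient-ascent update for input word i, positive context `pos`,
    and negative samples `negs`; returns the per-example log-likelihood."""
    w_in = W_in[i].copy()          # keep the old value for all updates below
    grad_in = np.zeros(N)
    loglik = 0.0

    for j, label in [(pos, 1.0)] + [(s, 0.0) for s in negs]:
        score = sigma(w_in @ W_out[j])
        loglik += np.log(score) if label else np.log(1.0 - score)
        g = label - score          # 1{(v_in, v_out) in D} - sigma(w_in^T w_out)
        grad_in += g * W_out[j]    # accumulate d l / d w_in
        W_out[j] += lr * g * w_in  # d l / d w_out = g * w_in

    W_in[i] += lr * grad_in
    return loglik

print(train_pair(i=2, pos=4, negs=[0, 1, 3]))
```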
Further Reading
Omer Levy and Yoav Goldberg later found out that the skip-gram model with negative sampling is implicitly factorizing a word-context matrix whose cells are the pointwise mutual information (PMI) of the respective target and context pairs, shifted by a global constant. PMI matrices are commonly used by the traditional approach to representing words (often dubbed “distributional semantics”). What is really striking about this discovery is that word2vec (specifically, the skip-gram model with negative sampling) is doing something very similar to what the NLP community has been doing for about 20 years; it's just doing it really well.
References
Omer Levy and Yoav Goldberg, Neural Word Embedding as Implicit Matrix Factorization, NIPS 2014.
http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf