Model With One-Word Context
Another thing that's important to understand is that word2vec is not deep learning; both CBOW and skip-gram are “shallow” neural models. Tomas Mikolov even said that the whole idea behind word2vec was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets. In his papers, Mikolov recommends using the skip-gram model with negative sampling, as it outperformed the other variants on analogy tasks.
We start from the simplest version. We assume that there is only one word considered per context. The CBOW model will predict one target word given one context word, and the skip-gram model will predict one context word given one target word. In the one-word context, the CBOW model and the skip-gram model are actually exactly the same. Figure 1 shows the model under this simplified one-word context.
Model definition
The vocabulary size is $V$, which means there are $V$ unique words in the vocabulary. Each input word can be mapped with a function $\mathrm{onehot}(v_k)$ to a $V$-dimensional one-hot encoded vector $x$. For example, when the input word is $v_k$ ($1 \leq k \leq V$), the $k$-th word in the vocabulary, we have
$$
x = \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = \mathrm{onehot}(v_k) = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
$$
where only the $k$-th element of $x$ satisfies $x_k = 1$, and all other elements $x_{k'} = 0$ for $k' \neq k$.
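As a quick illustration, here is a minimal Python sketch of the $\mathrm{onehot}$ mapping described above; the toy vocabulary is made up for the example:

```python
import numpy as np

# Toy vocabulary; in a real corpus V can be hundreds of thousands of words.
vocab = ["the", "quick", "brown", "fox", "jumps"]          # V = 5
word_to_index = {w: k for k, w in enumerate(vocab)}

def onehot(word):
    """Map the word v_k to its V-dimensional one-hot vector x."""
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(onehot("brown"))   # [0. 0. 1. 0. 0.] -- only the k-th element is 1
```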
From the input layer to the hidden layer, the model uses a $V \times N$ weight matrix $W$, where $N$ is the dimension of the hidden layer. Each hidden unit $h_i$ is a linear combination of the input:

$$
h_i = \sum_{k=1}^{V} w_{k,i} \times x_k
$$
Note: we use the subscript $w_{k,i}$ instead of $w_{i,k}$ because it is more convenient for the vectorized representations shown later.
$$
h_i = \begin{bmatrix} w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \end{bmatrix} \times \begin{bmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_V \end{bmatrix} = w_i^T \times x
$$

where
$$
w_i = \begin{bmatrix} w_{1,i} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{V,i} \end{bmatrix}
$$
We have
$$
h = W^T \times x = \begin{bmatrix} w_{1,1} & \cdots & w_{k,1} & \cdots & w_{V,1} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{1,i} & \cdots & w_{k,i} & \cdots & w_{V,i} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{1,N} & \cdots & w_{k,N} & \cdots & w_{V,N} \end{bmatrix} \times \begin{bmatrix} 0 \\ \vdots \\ x_k \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix}
$$
Note: As mentioned before, the input $x$ is a $V$-dimensional one-hot vector with only $x_k = 1$.
From the above equation, we can see that $h$ is actually the same as the $k$-th column of $W^T$, i.e. the $k$-th row of the matrix $W$. We define the $k$-th row of the matrix $W$ as $W_{k,*}$, and we can get
$$
h = \begin{bmatrix} w_{k,1} \\ \vdots \\ w_{k,i} \\ \vdots \\ w_{k,N} \end{bmatrix} = W_{k,*}
$$
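In practice this means the product $W^T \times x$ never has to be computed explicitly: it is just a row lookup. A small sketch, with made-up sizes $V$ and $N$, confirming the equivalence:

```python
import numpy as np

V, N, k = 5, 3, 2                      # vocabulary size, hidden size, input word index
W = np.random.rand(V, N)               # input->hidden weights; row k is W_{k,*}

x = np.zeros(V)
x[k] = 1.0                             # one-hot input for the k-th word

h_matmul = W.T @ x                     # hidden layer computed as W^T x
h_lookup = W[k, :]                     # the k-th row of W

assert np.allclose(h_matmul, h_lookup)  # h = W_{k,*}
```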
The output of the model is a $V$-dimensional vector $y$:

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_j \\ \vdots \\ y_V \end{bmatrix}
$$
Each element $y_j$ of $y$ is the predicted probability that the output word is $v_j$, the $j$-th word in the vocabulary, given that the input word's one-hot vector representation is $x = \mathrm{onehot}(v_k)$.
From the hidden layer to the output layer, there is another $N \times V$ weight matrix $W'$. Each score $u_j$ is computed as

$$
u_j = \begin{bmatrix} w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix} = (w'_j)^T \times h
$$

where
$$
w'_j = \begin{bmatrix} w'_{1,j} \\ \vdots \\ w'_{i,j} \\ \vdots \\ w'_{N,j} \end{bmatrix}
$$
$$
W' = \begin{bmatrix} w'_1 & \cdots & w'_j & \cdots & w'_V \end{bmatrix} = \begin{bmatrix} w'_{1,1} & \cdots & w'_{1,j} & \cdots & w'_{1,V} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{i,1} & \cdots & w'_{i,j} & \cdots & w'_{i,V} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{N,1} & \cdots & w'_{N,j} & \cdots & w'_{N,V} \end{bmatrix}
$$
$$
u = \begin{bmatrix} u_1 \\ \vdots \\ u_j \\ \vdots \\ u_V \end{bmatrix}
$$
We have
$$
u = (W')^T \times h = \begin{bmatrix} w'_{1,1} & \cdots & w'_{i,1} & \cdots & w'_{N,1} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{1,j} & \cdots & w'_{i,j} & \cdots & w'_{N,j} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ w'_{1,V} & \cdots & w'_{i,V} & \cdots & w'_{N,V} \end{bmatrix} \times \begin{bmatrix} h_1 \\ \vdots \\ h_i \\ \vdots \\ h_N \end{bmatrix}
$$
$$
u_j = \sum_{i=1}^{N} w'_{i,j} \times h_i = (w'_j)^T \times h = (W'_{*,j})^T \times W_{k,*}
$$
Applying the softmax function to the scores $u$, we can get
$$
y_j = p(v_j \mid v_k) = \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}
$$
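Putting the pieces together, here is a minimal NumPy sketch of the one-word-context forward pass; the sizes are made-up toy values, and the names follow the notation above:

```python
import numpy as np

V, N = 5, 3                        # vocabulary size, hidden (embedding) size
W = np.random.rand(V, N)           # input -> hidden weights
W_prime = np.random.rand(N, V)     # hidden -> output weights

def forward(k):
    """Return y_j = p(v_j | v_k) for all j, given the k-th input word."""
    h = W[k, :]                    # hidden layer: the k-th row of W
    u = W_prime.T @ h              # scores u_j = (W'_{*,j})^T W_{k,*}
    exp_u = np.exp(u - np.max(u))  # softmax, shifted for numerical stability
    return exp_u / exp_u.sum()

y = forward(k=2)
print(y, y.sum())                  # probabilities over the vocabulary, summing to 1
```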
Parameter fitting
In the CBOW model, each input $x^{(m)}$ is the context word and each output $y^{(m)}$ is the target word. In the skip-gram model, each input $x^{(m)}$ is the target word and each output $y^{(m)}$ is the context word. As shown before, $x^{(m)}$ is a one-hot $V$-dimensional vector, and $y^{(m)}$ is also a one-hot $V$-dimensional vector. $M$ training examples mean we extract $M$ context/target word pairs from the whole corpus.
When $x^{(m)}_k = 1$, i.e. $x^{(m)} = \mathrm{onehot}(v_k)$, and $y^{(m)}_j = 1$, i.e. $y^{(m)} = \mathrm{onehot}(v_j)$, it means that the current input word is $v_k$, the $k$-th word in the vocabulary, and the corresponding output word is $v_j$, the $j$-th word in the vocabulary.

Note: $y^{(m)}_j$ and $y_j$ are different.
Assuming that the M training examples were generated independently, we can then
write down the likelihood of the parameters as
$$
L(\theta) = P(Y \mid X; \theta) = \prod_{m=1}^{M} p(y^{(m)} \mid x^{(m)}; \theta)
$$

$$
\ell(\theta) = \log L(\theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta)
$$
where $\theta = \{W'_{*,j}, W_{k,*}\}$.
We know that $p(y^{(m)} \mid x^{(m)}; \theta)$ is defined by the softmax function; when $x^{(m)} = \mathrm{onehot}(v_k)$, we can get

$$
\log p(y^{(m)} \mid x^{(m)}; \theta) = \log \prod_{l=1}^{V} p(v_l \mid v_k)^{1\{y^{(m)}_l = 1\}} = \sum_{l=1}^{V} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k)
$$
Here, we introduce one more very useful piece of notation. An indicator function $1\{\cdot\}$ takes on a value of 1 if its argument is true, and 0 otherwise ($1\{\mathrm{True}\} = 1$, $1\{\mathrm{False}\} = 0$).
If the current output $y^{(m)}$ corresponds to the $j$-th word $v_j$ in the vocabulary, then $y^{(m)}_j = 1$ and $1\{y^{(m)}_j = 1\} = 1$, and for any $l \neq j$ we get $y^{(m)}_l = 0$ and $1\{y^{(m)}_l = 1\} = 0$. Therefore

$$
\log p(y^{(m)} \mid x^{(m)}; \theta) = 1\{y^{(m)}_j = 1\} \log p(v_j \mid v_k) + \sum_{l \neq j} 1\{y^{(m)}_l = 1\} \log p(v_l \mid v_k) = \log p(v_j \mid v_k)
$$
To compute the gradients, we will use the vector derivative identity

$$
\frac{\partial\, a^T b}{\partial a} = \frac{\partial\, b^T a}{\partial a} = b
$$
So we can get
$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}} = \frac{\partial\, (W'_{*,j})^T \times W_{k,*}}{\partial W'_{*,j}} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}} = W_{k,*} - \frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
$$
We first calculate
$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W'_{*,j}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial \exp((W'_{*,j})^T \times W_{k,*})} \times \frac{\partial e^{(W'_{*,j})^T W_{k,*}}}{\partial ((W'_{*,j})^T \times W_{k,*})} \times \frac{\partial\, ((W'_{*,j})^T \times W_{k,*})}{\partial W'_{*,j}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times 1 \times e^{(W'_{*,j})^T W_{k,*}} \times W_{k,*} \\
&= \frac{e^{(W'_{*,j})^T W_{k,*}}}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times W_{k,*} \\
&= p(v_j \mid v_k) \times W_{k,*}
\end{aligned}
$$
We can get

$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W'_{*,j}} = W_{k,*} - p(v_j \mid v_k) \times W_{k,*} = \left(1 - p(v_j \mid v_k)\right) \times W_{k,*}
$$

Next, we calculate
$$
\begin{aligned}
\frac{\partial \log \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}}
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \frac{\partial \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} \frac{\partial e^{(W'_{*,l})^T W_{k,*}}}{\partial W_{k,*}} \\
&= \frac{1}{\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}} \times \sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}} \times W'_{*,l} \\
&= \sum_{l=1}^{V} \left( \frac{e^{(W'_{*,l})^T W_{k,*}}}{\sum_{l'=1}^{V} e^{(W'_{*,l'})^T W_{k,*}}} \times W'_{*,l} \right) \\
&= \sum_{l=1}^{V} \left( p(v_l \mid v_k) \times W'_{*,l} \right)
\end{aligned}
$$
We can get
$$
\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k,*}} = W'_{*,j} - \sum_{l=1}^{V} \left( p(v_l \mid v_k) \times W'_{*,l} \right)
$$
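For a single training pair (input word $v_k$, output word $v_j$), the two gradients above translate directly into a stochastic gradient ascent update. Below is a minimal sketch under made-up sizes and learning rate; it is an illustration of the derived formulas, not the exact word2vec implementation:

```python
import numpy as np

V, N, lr = 5, 3, 0.1                 # made-up sizes and learning rate
W = np.random.rand(V, N)             # rows are the input vectors  W_{k,*}
W_prime = np.random.rand(N, V)       # columns are the output vectors W'_{*,j}

def sgd_step(k, j):
    """One full-softmax gradient-ascent step for the pair (input v_k, output v_j)."""
    h = W[k, :]                                  # h = W_{k,*}
    u = W_prime.T @ h                            # u_l = (W'_{*,l})^T W_{k,*}
    y = np.exp(u - u.max())
    y /= y.sum()                                 # y_l = p(v_l | v_k)

    e = y.copy()
    e[j] -= 1.0                                  # e_l = p(v_l | v_k) - 1{l = j}
    # d log p / dW'_{*,l} = -e_l * W_{k,*}   and   d log p / dW_{k,*} = -(W' e)
    grad_h = W_prime @ e                         # computed before W' is modified
    W_prime[:, :] -= lr * np.outer(h, e)         # ascend the log-likelihood
    W[k, :] -= lr * grad_h

sgd_step(k=2, j=4)
```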
Note:
Because from the input $x$ to the hidden layer $h$ only a linear function is used, and $h$ is just the $k$-th row of the weight matrix $W$, we can treat the model as a “dynamic” softmax and use the above method to calculate the partial derivatives, as illustrated in the lectures of Richard Socher and Ali Ghodsi. The original papers of Tomas Mikolov also used this method. In that literature, the word embedding vector of the input word is generally denoted $w_{in}$ and the word embedding vector of the output word is denoted $w_{out}$. In our notation, $w_{in}$ is the same as $W_{k,*}$ and $w_{out}$ is the same as $W'_{*,l}$.
But we can also treat the above model as a conventional neural network and use the back-propagation method to calculate the partial derivatives. The partial derivative with respect to $W'_{*,j}$ concerns the weights between the hidden layer $h$ and the output layer $y$. Because $W'_{*,j}$ is the $j$-th column of $W'$, if we go through all $j$ from 1 to $V$, one column vector $W'_{*,j}$ at a time, we cover every element $w'_{i,j}$ of the matrix $W'$. For $W'_{*,j}$, the back-propagation method gives the same result as the “dynamic” softmax calculation shown above. The partial derivative with respect to $W_{k,*}$ concerns the weights between the input layer $x$ and the hidden layer $h$: because $h$ is a linear combination of $x$ based on the weight matrix $W$, and due to the one-hot nature of $x$, $h$ is the same as $W_{k,*}$ when $x_k = 1$. The partial derivative with respect to $W_{k,*}$ is therefore the same as the partial derivative with respect to $h$. For any other $W_{k',*}$ with $k' \neq k$, we have $\frac{\partial h}{\partial W_{k',*}} = 0$, so by the chain rule $\frac{\partial \log p(y^{(m)} \mid x^{(m)}; \theta)}{\partial W_{k',*}} = 0$ as well. Even with the back-propagation method, we only need to consider the partial derivative with respect to $W_{k,*}$ and can ignore the other rows $W_{k',*}$, which again matches the “dynamic” softmax calculation shown above.
In practice, however, the above method for parameter fitting is not practical, because $y_j = p(v_j \mid v_k)$ is very expensive to compute due to the summation $\sum_{l=1}^{V} e^{(W'_{*,l})^T W_{k,*}}$ over all $V$ context words (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax, but we will not elaborate on this direction. Instead, we will focus on the recommended skip-gram model with negative sampling. Other methods are very similar in terms of mathematical derivation.
The target words in the corpus are represented as $v_{in}$, and each target word has $c$ context words $v_{out}$. In the skip-gram model, the input is $x = \mathrm{onehot}(v_{in})$ and the output is $y = \mathrm{onehot}(v_{out})$. We want to maximize the corpus probability

$$
\arg\max_{\theta} \prod_{(v_{in}, v_{out}) \in D} p(v_{out} \mid v_{in}; \theta)
$$

where $D$ is the set of all target (input) and context (output) word pairs we extract from the corpus.
Negative sampling
As we discussed before, using the softmax to calculate $p(v_{out} \mid v_{in}; \theta)$ is impractical. So Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word vectors. The justification of the negative-sampling approach is based on Noise Contrastive Estimation (NCE); please refer to the paper of Gutmann et al.
Consider a pair $(v_{in}, v_{out})$ of target and context. Did this pair come from the training data? Let's denote by $p(D = 1 \mid v_{in}, v_{out})$ the probability that $(v_{in}, v_{out})$ came from the corpus data. Correspondingly, $p(D = 0 \mid v_{in}, v_{out}) = 1 - p(D = 1 \mid v_{in}, v_{out})$ will be the probability that $(v_{in}, v_{out})$ did not come from the corpus data. As before, assume there are parameters $\theta$ controlling the distribution: $p(D = 1 \mid v_{in}, v_{out}; \theta)$.
Our goal is now to find parameters that maximize the probability that all of the observations indeed came from the data:

$$
\arg\max_{\theta} \prod_{(v_{in}, v_{out}) \in D} p(D = 1 \mid v_{in}, v_{out}; \theta)
$$
Note: For simplicity, we will use $w_{in}$ and $w_{out}$ instead of $W_{k,*}$ and $W'_{*,l}$ to represent the word vectors of $v_{in}$ and $v_{out}$ respectively.
$$
p(D = 1 \mid v_{in}, v_{out}; \theta) = \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$
So
$$
\ell(\theta) = \sum_{(v_{in}, v_{out}) \in D} \log \frac{1}{1 + e^{-w_{in}^T w_{out}}}
$$
This objective has a trivial solution if we set $\theta$ such that $p(D = 1 \mid v_{in}, v_{out}; \theta) = 1$ for every pair $(v_{in}, v_{out})$. This can easily be achieved by setting $\theta$ such that $w_{in} = w_{out}$ and $w_{in}^T w_{out} = K$ for all $w_{in}, w_{out}$, where $K$ is a large enough number (practically, we get a probability of 1 as soon as $K \approx 40$).
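The claim $K \approx 40$ is easy to check numerically; in double precision, $\sigma(40)$ is already indistinguishable from 1:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigma(40.0))         # 1.0 (the true gap is about 4e-18, below double precision)
print(1.0 - sigma(40.0))   # 0.0
```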
We need a mechanism that prevents all the vectors from having the same value, by disallowing some $(v_{in}, v_{out})$ combinations. One way to do so is to present the model with some $(v_{in}, v_{out})$ pairs for which $p(D = 1 \mid v_{in}, v_{out}; \theta)$ must be low, i.e. pairs which are not in the data. This is achieved by generating a set $D'$ of random $(v_{in}, v_{out})$ pairs, assuming they are all incorrect (the name “negative sampling” stems from the set $D'$ of randomly sampled negative examples). The optimization objective now becomes:
$$
\ell(\theta) = \sum_{(v_{in}, v_{out}) \in D} \log \sigma(w_{in}^T w_{out}) + \sum_{(v_{in}, v_{out}) \in D'} \log \sigma(-w_{in}^T w_{out})
$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
The above equation is almost the same as the one in the paper of Mikolov et al. The difference is that here we present the objective for the entire corpus $D \cup D'$, while they present it for one example $(v_{in}, v_{out}) \in D$ and $n$ examples $(v_{in}, v_{out,j}) \in D'$, following a particular way of constructing $D'$.
This is equivalent to drawing the samples $(v_{in}, v_{out,j}) \in D'$ from the distribution

$$
(v_{in}, v_{out,j}) \sim \frac{p_{uni}(v_{in}) \times (p_{uni}(v_{out}))^{3/4}}{Z}
$$

where $p_{uni}(v_{in})$ and $p_{uni}(v_{out})$ are the unigram distributions of targets and contexts respectively, and $Z$ is a normalization constant:

$$
p_{uni}(v_{in}) = \frac{\mathrm{count}(v_{in})}{|\mathrm{Corpus}|}, \qquad p_{uni}(v_{out}) = \frac{\mathrm{count}(v_{out})}{|\mathrm{Corpus}|}
$$
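Here is a minimal sketch of drawing negative context words from the smoothed unigram distribution $(p_{uni}(v_{out}))^{3/4}/Z$; the counts are made-up toy values, and real implementations typically precompute a large sampling table instead:

```python
import numpy as np

# Made-up unigram counts for a toy vocabulary.
counts = {"the": 100, "quick": 5, "brown": 7, "fox": 3, "jumps": 2}
words = list(counts)

unigram = np.array([counts[w] for w in words], dtype=float)
unigram /= unigram.sum()            # p_uni(v_out) = count(v_out) / |Corpus|

noise = unigram ** 0.75             # raise to the 3/4 power ...
noise /= noise.sum()                # ... and divide by the normalization constant Z

rng = np.random.default_rng(0)

def sample_negatives(n):
    """Draw n negative (noise) context words v_neg."""
    return list(rng.choice(words, size=n, p=noise))

print(sample_negatives(5))
```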
Note: Unlike the original skip-gram model described at the beginning, the formulation in this negative-sampling section does not model $p(v_{out} \mid v_{in})$ but instead models a quantity related to the joint distribution of $v_{out}$ and $v_{in}$. If we fix the target representations and learn only the context representations, or fix the context representations and learn only the target representations, the model reduces to logistic regression and is convex. However, in this model the target and context representations are learned jointly, making the model non-convex.
Parameter fitting
$$
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial w_{out}}
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \Bigg( 1\{(v_{in}, v_{out}) \in D\} \times \frac{1}{\sigma(w_{in}^T w_{out})} \times \sigma(w_{in}^T w_{out})\left(1 - \sigma(w_{in}^T w_{out})\right) \times w_{in} \\
&\qquad\qquad + \left(1 - 1\{(v_{in}, v_{out}) \in D\}\right) \times \frac{1}{1 - \sigma(w_{in}^T w_{out})} \times (-1) \times \sigma(w_{in}^T w_{out})\left(1 - \sigma(w_{in}^T w_{out})\right) \times w_{in} \Bigg) \\
&= \sum_{(v_{in}, v_{out}) \in D \cup D'} \left( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \right) \times w_{in}
\end{aligned}
$$
Because $w_{in}$ and $w_{out}$ are completely symmetric, we can obtain $\frac{\partial \ell(\theta)}{\partial w_{in}}$ from the same calculation as $\frac{\partial \ell(\theta)}{\partial w_{out}}$:
$$
\frac{\partial \ell(\theta)}{\partial w_{in}} = \sum_{(v_{in}, v_{out}) \in D \cup D'} \left( 1\{(v_{in}, v_{out}) \in D\} - \sigma(w_{in}^T w_{out}) \right) \times w_{out}
$$
In our calculation above, we first finish sampling and then mix all generated (positive and negative) pairs $(v_{in}, v_{out})$ into one unified training set $D \cup D'$. But in some papers or notes, the calculation of the partial derivatives looks a little different, because they use a different indexing scheme. While scanning the whole corpus, there are $c$ positive output (context) words $v_{pos}$ for each input (target) word $v_{in}$, i.e. $c$ positive pairs $(v_{in}, v_{pos})$ for each scanned $v_{in}$. Further, for each positive pair $(v_{in}, v_{pos})$, $n$ negative pairs $(v_{in}, v_{neg})$ are generated. As mentioned before, the number of negative pairs is $n$ times the number of positive pairs.
Here we only consider one positive pair $(v_{in}, v_{pos})$ of the current input $v_{in}$, and the $n$ negative pairs $(v_{in}, v_{neg,s})$ with $1 \leq s \leq n$ corresponding to $(v_{in}, v_{pos})$. The log-likelihood is

$$
\ell_{v_{in}}(\theta) = \log\left(\sigma(w_{in}^T w_{pos})\right) + \sum_{s=1}^{n} \log\left(1 - \sigma(w_{in}^T w_{neg,s})\right)
$$
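This per-example objective, together with the gradient $(1\{\cdot \in D\} - \sigma(w_{in}^T w_{out}))$ derived above, leads to the update used in practice. Below is a minimal NumPy sketch with made-up sizes, learning rate, and indices; it illustrates the math above rather than reproducing the reference word2vec code:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

V, N, lr = 5, 3, 0.05             # made-up sizes and learning rate
W_in = np.random.rand(V, N)       # target (input) word vectors  w_in
W_out = np.random.rand(V, N)      # context (output) word vectors w_out

def train_pair(i, pos, negs):
    """One gradient-ascent update for input word i, positive context `pos`,
    and negative samples `negs`; returns the per-example log-likelihood."""
    w_in = W_in[i].copy()          # keep the old value for all updates below
    grad_in = np.zeros(N)
    loglik = 0.0

    for j, label in [(pos, 1.0)] + [(s, 0.0) for s in negs]:
        score = sigma(w_in @ W_out[j])
        loglik += np.log(score) if label else np.log(1.0 - score)
        g = label - score          # 1{(v_in, v_out) in D} - sigma(w_in^T w_out)
        grad_in += g * W_out[j]    # accumulate d l / d w_in
        W_out[j] += lr * g * w_in  # d l / d w_out = g * w_in

    W_in[i] += lr * grad_in
    return loglik

print(train_pair(i=2, pos=4, negs=[0, 1, 3]))
```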
Further Reading
Omer Levy and Yoav Goldberg later found out that the skip-gram model with negative sampling is implicitly factorizing a word-context matrix whose cells are the pointwise mutual information (PMI) of the respective target and context pairs, shifted by a global constant. PMI matrices are commonly used by the traditional approach to representing words (often dubbed “distributional semantics”). What is really striking about this discovery is that word2vec (specifically, the skip-gram model with negative sampling) is doing something very similar to what the NLP community has been doing for about 20 years; it's just doing it really well.
References
Omer Levy and Yoav Goldberg, Neural Word Embedding as Implicit Matrix Factorization, NIPS 2014.
http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf