[Image: David Packard, A Concordance to Livy (1968)]
Natural Language Processing
Info 159/259
Lecture 2: Vector semantics and static word embeddings
(Jan 20, 2022)
David Bamman, UC Berkeley
“TOM!” No answer. “TOM!” No answer. “What's gone with that boy, I wonder?
You TOM!” No answer. The old lady pulled her spectacles down and looked
over them about the room; then she put them up and looked out under them.
She seldom or never looked through them for so small a thing as a boy; they
were her state pair, the pride of her heart, and were built for “style,” not
service--she could have seen through a pair of stove-lids just as well. She
looked perplexed for a moment, and then said, not fiercely, but still loud
enough for the furniture to hear: “Well, I lay if I get hold of you I'll--” She did not
finish, for by this time she was bending down and punching under the bed
with the broom, and so she needed breath to punctuate the punches with. She
resurrected nothing but the cat. “I never did see the beat of that boy!” She
went to the open door and stood in it and looked out among the tomato vines
and “jimpson” weeds that constituted the garden. No Tom. So she lifted up her
voice at an angle calculated for distance and shouted: “Y-o-u-u TOM!” There
was a slight noise behind her and she turned just in time to seize a small boy
by the slack of his roundabout and arrest his flight. “There! I might 'a' thought
of that closet. What you been doing in there?” “Nothing.” “Nothing! Look at
Lexical semantics
“You shall know a word by the company it keeps”
[Firth 1957]
Zellig Harris, “Distributional Structure” (1954)
Ludwig Wittgenstein, Philosophical Investigations (1953)
everyone likes ______________
a bottle of ______________ is on the table
______________ makes you drunk
a cocktail with ______________ and seltzer
context
Distributed representation
• Vector representation that encodes information about the distribution of
contexts a word appears in
• Words that appear in similar contexts have similar representations (and similar
meanings, by the distributional hypothesis).
• We have several different ways we can encode the notion of “context.”
Term-document matrix

           Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL
knife           1        1    4    2    0        2        0   10
dog             0        0    0    6   12        2        0    0
sword           2        2    7    5    0        5        0   17
love           64        0  135   63    0       12        0   48
like           75       38   34   36   34       41       27   44

(R&J = Romeo & Juliet, R3 = Richard III, JC = Julius Caesar, KL = King Lear)

Context = appearing in the same document.
Vectors

knife → (1, 1, 4, 2, 0, 2, 0, 10)
sword → (2, 2, 7, 5, 0, 5, 0, 17)

Vector representation of the term; vector size = number of documents.
Cosine Similarity

cos(x, y) = (x · y) / (|x| |y|) = Σᵢ xᵢ yᵢ / ( √(Σᵢ xᵢ²) · √(Σᵢ yᵢ²) )

• We can calculate the cosine similarity of two vectors to judge the degree of their similarity [Salton 1971]
• Euclidean distance measures the magnitude of the distance between two points
• Cosine similarity measures their orientation
           Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL
knife           1        1    4    2    0        2        0   10
dog             0        0    0    6   12        2        0    0
sword           2        2    7    5    0        5        0   17
love           64        0  135   63    0       12        0   48
like           75       38   34   36   34       41       27   44

cos(knife, knife) = 1
cos(knife, dog)   = 0.11
cos(knife, sword) = 0.99
cos(knife, love)  = 0.65
cos(knife, like)  = 0.61
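A minimal numpy sketch (not from the slides) that reproduces the cosine values above from the rows of the reconstructed term-document matrix:

```python
import numpy as np

# Rows of the term-document matrix above
# (columns: Hamlet, Macbeth, R&J, R3, JC, Tempest, Othello, KL)
knife = np.array([1, 1, 4, 2, 0, 2, 0, 10])
dog   = np.array([0, 0, 0, 6, 12, 2, 0, 0])
sword = np.array([2, 2, 7, 5, 0, 5, 0, 17])
love  = np.array([64, 0, 135, 63, 0, 12, 0, 48])
like  = np.array([75, 38, 34, 36, 34, 41, 27, 44])

def cosine(x, y):
    # dot product, scaled by the norms of both vectors
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

for name, v in [("dog", dog), ("sword", sword), ("love", love), ("like", like)]:
    print(f"cos(knife, {name}) = {cosine(knife, v):.2f}")
```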
Weighting dimensions
• Not all dimensions are equally informative
TF-IDF
• Term frequency-inverse document frequency
• A weighting that scores a feature by how frequently it appears in a given data point, discounted by how frequently it appears in the overall collection
• IDF for a given term = the (log of the) number of documents in the collection / the number of documents that contain the term
TF-IDF

• Term frequency (tf_{t,d}) = the number of times term t occurs in document d; several variants (e.g., passing through a log function).
• Inverse document frequency = inverse fraction of the number of documents containing the term (D_t) among the total number of documents N:

tf-idf(t, d) = tf_{t,d} × log(N / D_t)
IDF

           Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL    IDF
knife           1        1    4    2    0        2        0    2   0.12
dog             2        0    6    6    0        2        0   12   0.20
sword          17        2    7   12    0        2        0   17   0.12
love           64        0  135   63    0       12        0   48   0.20
like           75       38   34   36   34       41       27   44   0
IDF weights terms by their informativeness when comparing document vectors.
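As a sketch of how the IDF column above can be computed, assuming a base-10 log (which matches the 0.12/0.20/0 values in the table):

```python
import numpy as np

# Term-document counts from the table above
# (rows: knife, dog, sword, love, like)
X = np.array([[ 1,  1,   4,  2,  0,  2,  0,  2],
              [ 2,  0,   6,  6,  0,  2,  0, 12],
              [17,  2,   7, 12,  0,  2,  0, 17],
              [64,  0, 135, 63,  0, 12,  0, 48],
              [75, 38,  34, 36, 34, 41, 27, 44]])

N = X.shape[1]              # total number of documents
D = (X > 0).sum(axis=1)     # number of documents containing each term
idf = np.log10(N / D)       # base-10 log matches the slide's values
tf_idf = X * idf[:, None]   # scale each term's counts by its IDF

print(idf.round(2))         # [0.12 0.2  0.12 0.2  0.  ]
```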
PMI
• Mutual information provides a measure of how independent two variables
(X and Y) are.
• Pointwise mutual information measures the independence of two
outcomes (x and y)
PMI

PMI(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]

For w = word and c = context:

PMI(w, c) = log2 [ P(w, c) / ( P(w) P(c) ) ]

What's this value for a w and c that never occur together? (log2 0 = −∞, which motivates the positive variant:)

PPMI(w, c) = max( log2 [ P(w, c) / ( P(w) P(c) ) ], 0 )
           Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL   total
knife           1        1    4    2    0        2        0    2      12
dog             2        0    6    6    0        2        0   12      28
sword          17        2    7   12    0        2        0   17      57
love           64        0  135   63    0       12        0   48     322
like           75       38   34   36   34       41       27   44     329
total         159       41  186  119   34       59       27  123     748

PMI(love, R&J) = log2 [ (135/748) / ( (186/748) × (322/748) ) ] ≈ 0.75
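A small numpy sketch computing PMI/PPMI from the counts above; the index [3, 2] (row "love", column "R&J") reproduces the worked example:

```python
import numpy as np

# Same counts as above (rows: knife, dog, sword, love, like)
X = np.array([[ 1,  1,   4,  2,  0,  2,  0,  2],
              [ 2,  0,   6,  6,  0,  2,  0, 12],
              [17,  2,   7, 12,  0,  2,  0, 17],
              [64,  0, 135, 63,  0, 12,  0, 48],
              [75, 38,  34, 36, 34, 41, 27, 44]])

total = X.sum()                              # 748
p_wc = X / total                             # joint P(w, c)
p_w = X.sum(axis=1, keepdims=True) / total   # P(w): row marginals
p_c = X.sum(axis=0, keepdims=True) / total   # P(c): column marginals

with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))        # -inf wherever a count is 0
ppmi = np.maximum(pmi, 0)                    # clip negatives (and -inf) to 0

print(round(pmi[3, 2], 2))                   # PMI(love, R&J) ≈ 0.75
```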
Term-context matrix
• Rows and columns are both words; cell counts = the number of times words w_i and w_j show up in the same context (e.g., a window of 2 tokens).
Dataset

• the big dog ate dinner
• the small cat ate dinner
• the white dog ran down the street
• the yellow cat ran inside

DOG terms (window = 2): the, big, ate, dinner, the, white, ran, down
CAT terms (window = 2): the, small, ate, dinner, the, yellow, ran, inside
Term-context matrix

contexts:   the  big  ate  dinner  …
dog           2    1    1       1  …
cat           2    0    1       1  …

• Each cell enumerates the number of times a context word appeared in a window of 2 words around the term.
• How big is each representation for a word here?
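One way (a sketch, not the slides' code) to build these window-of-2 counts from the toy dataset:

```python
from collections import Counter, defaultdict

docs = ["the big dog ate dinner",
        "the small cat ate dinner",
        "the white dog ran down the street",
        "the yellow cat ran inside"]

window = 2
counts = defaultdict(Counter)   # counts[term][context word]
for doc in docs:
    toks = doc.split()
    for i, term in enumerate(toks):
        # every token within `window` positions of the term is a context
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                counts[term][toks[j]] += 1

print(counts["dog"])   # {'the': 2, 'big': 1, 'ate': 1, 'dinner': 1, 'white': 1, 'ran': 1, 'down': 1}
print(counts["cat"])
```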
We can also define “context” to be directional ngrams (i.e., ngrams of
a defined order occurring to the left or right of the term)
Dataset

• the big dog ate dinner
• the small cat ate dinner
• the white dog ran down the street
• the yellow cat ran inside

DOG terms (window = 2): L: the big, R: ate dinner; L: the white, R: ran down
CAT terms (window = 2): L: the small, R: ate dinner; L: the yellow, R: ran inside
Term-context matrix

contexts:   L: the big  R: ate dinner  L: the small  L: the yellow  …
dog                  1              1             0              0  …
cat                  0              1             1              1  …

• Each cell enumerates the number of times a directional context phrase appeared in a specific position around the term.
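Continuing the sketch above (reusing `docs` and `window`), the directional variant records the left and right ngrams separately:

```python
from collections import Counter

# Directional contexts: the bigram immediately to the left and to the right
dir_counts = {"dog": Counter(), "cat": Counter()}
for doc in docs:
    toks = doc.split()
    for i, term in enumerate(toks):
        if term in dir_counts:
            dir_counts[term]["L: " + " ".join(toks[max(0, i - window):i])] += 1
            dir_counts[term]["R: " + " ".join(toks[i + 1:i + 1 + window])] += 1

print(dir_counts["dog"])  # {'L: the big': 1, 'R: ate dinner': 1, 'L: the white': 1, 'R: ran down': 1}
```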
write a book
write a poem
• First-order co-occurrence (syntagmatic association): write co-occurs with
book in the same sentence.
• Second-order co-occurrence (paradigmatic association): book co-occurs with poem (since each co-occurs with write)
Syntactic context
Lin 1998; Levy and Goldberg 2014
Evaluation
Intrinsic Evaluation

• Relatedness: correlation (Spearman/Pearson) between vector similarity of pairs of words and human judgments

word 1      word 2       human score
midday      noon                9.29
journey     voyage              9.29
car         automobile          8.94
…           …                      …
professor   cucumber            0.31
king        cabbage             0.23

WordSim-353 (Finkelstein et al. 2002)
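A short sketch of the evaluation itself, using scipy's spearmanr; the model similarities here are hypothetical values for the five pairs shown above:

```python
from scipy.stats import spearmanr

# Hypothetical model cosine similarities for the five WordSim-353 pairs above
model_sims   = [0.81, 0.78, 0.83, 0.10, 0.05]
human_scores = [9.29, 9.29, 8.94, 0.31, 0.23]

rho, _ = spearmanr(model_sims, human_scores)   # rank correlation
print(rho)
```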
Intrinsic Evaluation
• Analogical reasoning (Mikolov et al. 2013). For analogy Germany : Berlin ::
France : ???, find closest vector to v(“Berlin”) - v(“Germany”) + v(“France”)
possibly : impossibly :: certain : uncertain
generating : generated :: shrinking : shrank
think : thinking :: look : looking
Baltimore : Maryland :: Oakland : California
shrinking : shrank :: slowing : slowed
Rabat : Morocco :: Astana : Kazakhstan

(the last term in each row is the target to be predicted)
Sparse vectors

“aardvark” → a V-dimensional vector, with a single 1 for the identity of the element:

A           0
a           0
aa          0
aal         0
aalii       0
aam         0
Aani        0
aardvark    1
aardwolf    0
...         0
zymotoxic   0
zymurgy     0
Zyrenian    0
Zyrian      0
Zyryan      0
zythem      0
Zythia      0
zythum      0
Zyzomys     0
Zyzzogeton  0
Dense vectors

1 → [0.7, 1.3, -4.5]

(the single 1 of the one-hot mapped to a low-dimensional dense vector)
Singular value decomposition

• Any n⨉p matrix X can be decomposed into the product of three matrices (where m = the number of linearly independent rows):

X = [n⨉m] ⨉ [m⨉m diagonal] ⨉ [m⨉p]

with, e.g., diag(9, 4, 3, 1, 2, 7, 9, 8, 1) as the diagonal matrix of singular values.
Singular value decomposition

• We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values), zeroing out the rest:

X ≈ [n⨉m] ⨉ diag(9, 4, 0, 0, 0, 0, 0, 0, 0) ⨉ [m⨉p]
           Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL
knife           1        1    4    2    0        2        0    2
dog             2        0    6    6    0        2        0   12
sword          17        2    7   12    0        2        0   17
love           64        0  135   63    0       12        0   48
like           75       38   34   36   34       41       27   44
[Figure: after SVD, each term (knife, dog, sword, love, like) has a low-dimensional representation (here 2-dim), and each document (Hamlet … King Lear) has a low-dimensional representation (here 2-dim)]
Latent semantic analysis

• Latent Semantic Analysis/Indexing (Deerwester et al. 1990) is this process of applying SVD to the term-document co-occurrence matrix
• Terms are typically weighted by tf-idf
• This is a form of dimensionality reduction (for terms, from a D-dimensional sparse vector to a K-dimensional dense one, K << D).
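A minimal numpy sketch of this pipeline on the raw counts above (LSA would typically tf-idf-weight the matrix first); the variable names term_vectors and doc_vectors are illustrative:

```python
import numpy as np

# Term-document matrix from above (rows: knife, dog, sword, love, like)
X = np.array([[ 1,  1,   4,  2,  0,  2,  0,  2],
              [ 2,  0,   6,  6,  0,  2,  0, 12],
              [17,  2,   7, 12,  0,  2,  0, 17],
              [64,  0, 135, 63,  0, 12,  0, 48],
              [75, 38,  34, 36, 34, 41, 27, 44]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
term_vectors = U[:, :k] * S[:k]             # one k-dim vector per term
doc_vectors = (S[:k, None] * Vt[:k]).T      # one k-dim vector per document
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]    # rank-k approximation of X
```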
Dense vectors from prediction
• Learning low-dimensional representations of words by framing a prediction task: using words to predict the words in their surrounding context window
• Transform this into a supervised prediction problem; similar to language modeling, but we're ignoring order within the context window
Dense vectors from prediction

Word2vec skipgram model (Mikolov et al. 2013): given a single word in a sentence, predict the words in a context window around it.

Sentence: “a cocktail with gin and seltzer”; window size = 3

x     y
gin   a
gin   cocktail
gin   with
gin   and
gin   seltzer
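A small sketch generating these (x, y) training pairs; the function name skipgram_pairs is illustrative:

```python
def skipgram_pairs(sentence, window=3):
    """Generate (x, y) pairs: each word predicts every word in its window."""
    toks = sentence.split()
    pairs = []
    for i, x in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                pairs.append((x, toks[j]))
    return pairs

pairs = skipgram_pairs("a cocktail with gin and seltzer")
print([y for x, y in pairs if x == "gin"])
# ['a', 'cocktail', 'with', 'and', 'seltzer']
```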
Dimensionality reduction

the → (1, 0, 0, 0, 0, 0, 0, 0, …)   over the vocabulary (the, a, an, for, in, on, dog, cat, …)
the → (4.1, -0.9)

“the” is a point in V-dimensional space; “the” is a point in 2-dimensional space.
[Network diagram: one-hot input x = (x1: gin, x2: cocktail, x3: globe) → hidden layer h = (h1, h2) via weight matrix W → output y = (gin, cocktail, globe) via weight matrix V]

x = (0, 1, 0)   y = (1, 0, 0)

W = [ -0.5  1.3
       0.4  0.08
       1.7  3.1 ]

V = [ 4.1  0.7  0.1
     -0.9  1.3  0.3 ]

Only one of the inputs is nonzero, so the inputs are really Wcocktail: x W = (0.4, 0.08), the row of W for “cocktail”.
x is a one-hot vector over the vocabulary (a single 1) and W is a V⨉2 matrix of weights; multiplying them selects a single row of W:

x W = (-1.01, -2.52)

This is the embedding of the word.
Word embeddings
• Can you predict the output word from a vector representation of the
input word?
• Rather than seeing the input as a one-hot encoded vector specifying
the word in the vocabulary we’re conditioning on, we can see it as
indexing into the appropriate row in the weight matrix W
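A toy illustration of that equivalence (the W values follow the earlier gin/cocktail/globe slide):

```python
import numpy as np

vocab = {"gin": 0, "cocktail": 1, "globe": 2}
W = np.array([[-0.5, 1.3],
              [ 0.4, 0.08],
              [ 1.7, 3.1]])

x = np.zeros(len(vocab))
x[vocab["cocktail"]] = 1      # one-hot encoding of "cocktail"

print(x @ W)                  # [0.4  0.08]: multiplying by a one-hot vector...
print(W[vocab["cocktail"]])   # ...just selects the corresponding row of W
```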
Word embeddings

• Similarly, V has one H-dimensional vector for each element in the vocabulary (for the words that are being predicted)

        gin  cocktail  cat  globe
V = [   4.1    0.7     0.1   1.3
       -0.9    1.3     0.3  -3.4 ]

Each column is the embedding of the word being predicted.
             1          2          3          4         …    50
the          0.418      0.24968   -0.41242    0.1217    …   -0.17862
,            0.013441   0.23682   -0.16899    0.40951   …   -0.55641
.            0.15164    0.30177   -0.16763    0.17684   …   -0.31086
of           0.70853    0.57088   -0.4716     0.18048   …   -0.52393
to           0.68047   -0.039263   0.30186   -0.17792   …    0.13228
…            …          …          …          …         …    …
chanty       0.23204    0.025672  -0.70699   -0.04547   …    0.34108
kronik      -0.60921   -0.67218    0.23521   -0.11195   …    0.85632
rolonda     -0.51181    0.058706   1.0913    -0.55163   …    0.079711
zsombor     -0.75898   -0.47426    0.4737     0.7725    …    0.84014
sandberger   0.072617  -0.51393    0.4728    -0.52202   …    0.23096
https://nlp.stanford.edu/projects/glove/
[2-dimensional plot: dog, cat, and puppy appear near each other; wrench and screwdriver cluster elsewhere]
• Why this behavior? dog, cat show up in similar positions
the black cat jumped on the table
the black dog jumped on the table
the black puppy jumped on the table
the black skunk jumped on the table
the black shoe jumped on the table
• Why this behavior? dog, cat show up in similar positions
the black [0.4, 0.08] jumped on the table
the black [0.4, 0.07] jumped on the table
the black puppy jumped on the table
the black skunk jumped on the table
the black shoe jumped on the table
To make the same predictions, these numbers need to be close to each other.
“Word embedding” in NLP papers
[Plot: fraction of ACL papers mentioning “word embedding” per year, 2001–2019 (y-axis 0 to 0.7); near zero until the early 2010s, then rising steeply]
Data from ACL papers in the ACL Anthology (https://www.aclweb.org/anthology/)
Analogical inference
• Mikolov et al. 2013 show that vector representations have some potential for
analogical reasoning through vector arithmetic.
apple - apples ≈ car - cars
king - man + woman ≈ queen
Mikolov et al., (2013), “Linguistic Regularities in Continuous Space Word Representations” (NAACL)
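A sketch of this lookup, assuming vecs is a dict mapping words to numpy vectors (e.g., loaded from pretrained GloVe or word2vec files); excluding the query words from the search, as is standard:

```python
import numpy as np

def analogy(a, b, c, vecs):
    """a : b :: c : ?  -- return the word closest to v(b) - v(a) + v(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, v in vecs.items():
        if word in (a, b, c):        # exclude the query words themselves
            continue
        sim = np.dot(target, v) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g., with pretrained vectors loaded into `vecs`:
# analogy("Germany", "Berlin", "France", vecs)   # ideally "Paris"
```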
Bias
• Allocational harms: automated systems allocate resources unfairly to
different groups (access to housing, credit, parole).
• Representational harms: automated systems represent one group less
favorably than another (including demeaning them or erasing their
existence).
Blodgett et al. (2020), “Language (Technology) is Power: A Critical Survey of “Bias” in NLP”
Representations
• Embeddings for African-American first names are closer to “unpleasant” words than embeddings for European-American first names (Caliskan et al. 2017)
• Sentiment analysis systems score sentences containing African-American first names as more negative than identical sentences with European-American names
Kiritchenko and Mohammad (2018), "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems"
Interrogating “bias”

• Kozlowski et al. (2019), “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings,” American Sociological Review.
• An et al. (2018), “SemAxis: A Lightweight Framework to Characterize Domain-Specific Word Semantics Beyond Sentiment”
Low-dimensional distributed representations

• Low-dimensional, dense word representations are extraordinarily powerful (and are arguably responsible for much of the gains that neural network models have brought to NLP).
• They let your representation of the input share statistical strength with words that behave similarly in terms of their distributional properties (often synonyms or words that belong to the same class).
Two kinds of “training” data
• The labeled data for a specific task (e.g., labeled sentiment for movie reviews): ~2K labels/reviews, ~1.5M words → used to train a supervised model
• General text (Wikipedia, the web, books, etc.): ~trillions of words → used to train distributed word representations
Using dense vectors
• In neural models (CNNs, RNNs, LMs), replace the V-dimensional sparse vector with the much smaller K-dimensional dense one.
• Can also take the derivative of the loss function with respect to those
representations to optimize for a particular task.
emoji2vec
Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”
node2vec
Grover and Leskovec (2016), “node2vec: Scalable Feature Learning for Networks”
Trained embeddings
• Word2vec
https://code.google.com/archive/p/word2vec/
• GloVe
http://nlp.stanford.edu/projects/glove/
HW1 out today
• Due Wed Jan 26 @ 11:59pm