Term document scoring and vector space model
Ayan Das
Outline
1 tf − idf scoring
2 Vector space model
tf − idf scoring
Ranked retrieval
For Boolean queries, documents either match or don’t match.
Not possible to judge the degree of relevance of a document with
respect to a query.
Good for expert users with precise understanding of their needs and the
collection
Not good for the majority of users as they are incapable of writing
Boolean queries
Thus, Boolean retrieval not suitable for web search
Boolean queries often result in either too few (=0) or too many
(1000s) results.
“standard user dlink 650” → 200,000 hits
“standard user dlink 650 no card found” → 0 hits
AND gives too few; OR gives too many results
Ranked retrieval models
Boolean retrieval: Returns set of documents satisfying a query
expression
Ranked retrieval: The system returns an ordering over the (top)
documents in the collection for a query
When a system produces a ranked result set, the size of the result set is not an issue:
only the top k most relevant (highest-ranking) documents need to be returned,
so the user is not overwhelmed.
Ranked retrieval models
Return the documents in an order most likely to be useful to the
searcher
How can we rank-order the documents in the collection with respect
to a query?
Assign a score to each document which estimates how well the
document “matches” the query
The score may be in the range [0, 1]
Query-document matching scores
We need a way of assigning a score to a query/document pair
For a given one-term query
If the query term does not occur in the document: the score should be 0
The more frequent the query term in the document, the higher the
score
The rest of the discussion is based on this idea and we explore several
alternatives and extensions
Query-document matching scores
Term frequency (tf): The number of times a term occurs in the
document.
Rare terms in a collection are more informative than frequent terms.
The documents themselves may vary in length
We need a more sophisticated way of normalizing for document length
Term-document incidence matrix
Each document is represented by a binary vector ∈ {0, 1}^|V|
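A minimal sketch of building such an incidence matrix in Python; the toy documents and whitespace tokenization are assumptions for illustration, not part of the original slide.

```python
# Build a binary term-document incidence matrix: one {0,1}^|V| vector per document.
docs = {
    "d1": "Antony and Cleopatra",   # toy documents (assumed for illustration)
    "d2": "Julius Caesar",
    "d3": "The Tempest",
}
vocab = sorted({t for text in docs.values() for t in text.lower().split()})

incidence = {
    name: [1 if term in set(text.lower().split()) else 0 for term in vocab]
    for name, text in docs.items()
}
print(vocab)
for name, vec in incidence.items():
    print(name, vec)
```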
Term-document count matrices
Consider the number of occurrences of a term in a document:
Each document is a count vector
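A small sketch along the same lines, replacing the 0/1 entries with term counts; the toy documents are again assumptions for illustration.

```python
# Term-document count matrix: each document becomes a vector of term counts.
from collections import Counter

docs = {
    "d1": "to be or not to be",   # toy documents (assumed for illustration)
    "d2": "let it be",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

count_matrix = {
    name: [Counter(text.split())[term] for term in vocab]
    for name, text in docs.items()
}
print(vocab)             # ['be', 'it', 'let', 'not', 'or', 'to']
for name, vec in count_matrix.items():
    print(name, vec)     # d1 -> [2, 0, 0, 1, 1, 2], d2 -> [1, 1, 1, 0, 0, 0]
```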
Bag of words model
Considers a document or a query as a multiset (bag) of terms
Does not take the order of the words in the document into account
The vector space representation likewise does not preserve word order (see the sketch after the table below)
John runs faster than Mary and Mary runs faster than John have the
same vector representation
Step back from positional indexing
                               John  Mary  runs  faster  than
John runs faster than Mary       1     1     1       1     1
Mary runs faster than John       1     1     1       1     1
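A small sketch of the bag-of-words view: the two sentences from the table yield identical term counts, so their vector representations are equal.

```python
# Bag of words ignores word order: both sentences produce the same count vector.
from collections import Counter

s1 = "John runs faster than Mary"
s2 = "Mary runs faster than John"
print(Counter(s1.split()) == Counter(s2.split()))  # True
```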
Term frequency - tf
The term frequency tft,d of term t in document d is defined as the
number of times that t occurs in d.
We want to use tf when computing query-document match scores.
Using raw term frequency has some disadvantages
Relevance is not directly proportional to the term frequency
A document with 10 occurrences of a term may be more relevant than
a document containing the term once but not 10 times more relevant
Term frequency - tf
A common normalization scheme is the log-frequency weight of term t in d:
w_{t,d} = 1 + log_{10}(tf_{t,d}) if tf_{t,d} > 0, and w_{t,d} = 0 otherwise
tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
Score for a document-query pair: sum over terms t in both q and d:
tf_matching_score(q, d) = ∑_{t ∈ q ∩ d} (1 + log_{10} tf_{t,d})
The score is 0 if none of the query terms is present in the document.
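To make the weighting and score concrete, here is a minimal sketch following the formulas above; the toy query/document and whitespace tokenization are assumptions for illustration.

```python
# Log-frequency weighting and the tf matching score.
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query: str, doc: str) -> float:
    """Sum of log-frequency weights over terms occurring in both query and document."""
    doc_counts = Counter(doc.lower().split())
    return sum(log_tf_weight(doc_counts[t]) for t in set(query.lower().split()))

print(tf_matching_score("best car insurance", "car insurance is the best insurance"))
```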
Document Frequency
Document frequency (dft ): Number of documents in the collection
in which term t occurs.
Use how widespread a term is across the documents of the collection for weighting and
ranking.
Rare terms are more informative than frequent terms
Consider a term in the query that is rare in the collection e.g.
arachnocentric
A document containing this term is very likely to be relevant.
We want high weights for rare terms like arachnocentric
Document Frequency
Frequent terms are less informative than rare terms.
Consider a term in the query that is frequent in the collection (e.g.,
good, increase, line).
A document containing this term is more likely to be relevant than a
document that doesn’t.
These frequent terms are not sure indicators of relevance.
For these frequent terms we want positive weights, but lower weights
than the rare terms
Need high weights for rare terms
We can use the document frequency to factor this phenomenon into
computing the matching score.
Inverse document frequency (idf) score
dft is an inverse measure of the informativeness of term t
Inverse document frequency (idf): a measure of the informativeness of term t.
We define the idf weight of term t as
idf_t = log_{10}(N / df_t)
where N is the number of documents in the collection.
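A minimal sketch of the idf weight, assuming a toy collection of documents represented as sets of terms.

```python
# idf_t = log10(N / df_t): rare terms get high weights, frequent terms low weights.
import math

collection = [                      # toy collection (assumed for illustration)
    {"car", "insurance", "auto"},
    {"car", "best", "auto"},
    {"insurance", "best"},
    {"arachnocentric"},
]

def idf(term: str, docs) -> float:
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

print(idf("car", collection))             # frequent term -> low idf (~0.30)
print(idf("arachnocentric", collection))  # rare term -> high idf (~0.60)
```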
Effect of idf on ranking
Does idf have an effect on ranking for one-term queries, e.g., iPhone?
idf has no effect on ranking one term queries
idf used to measure the relative importance of terms
idf affects the ranking of documents for queries with at least two terms
For the query capricious person, idf weighting makes occurrences of
capricious count for much more in the final document ranking than
occurrences of person
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the
collection, counting multiple occurrences within a document
Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760
tf − idf weighting
The tf − idf weight of a term is the product of its tf weight and its idf
weight.
w_{t,d} = (1 + log_{10} tf_{t,d}) · log_{10}(N / df_t)
Best known weighting scheme in information retrieval
Increases with the
number of occurrences within a document (term frequency)
rarity of the term in the collection (inverse document frequency)
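A minimal sketch combining the two components into the tf-idf weight defined above; the toy collection is an assumption for illustration.

```python
# tf-idf weight: w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t).
import math
from collections import Counter

docs = [                                   # toy collection (assumed for illustration)
    "car insurance auto insurance",
    "best car auto",
    "best insurance",
]
N = len(docs)
doc_counts = [Counter(d.split()) for d in docs]
df = Counter(t for counts in doc_counts for t in counts)

def tf_idf(term: str, doc_index: int) -> float:
    tf = doc_counts[doc_index][term]
    if tf == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(tf_idf("insurance", 0))  # occurs twice in doc 0, in 2 of the 3 docs
```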
Vector space model
tf − idf score matrix
Each document is now represented by a real-valued vector of tf-idf
weights ∈ R^|V|
Documents as Vectors
Each document is now represented by a real-valued vector of tf-idf
weights ∈ R^|V|
It is a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply
this to web search engines
Each vector is very sparse - most entries are zero.
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the
space
Key idea 2: Rank documents according to their proximity to the query
in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
How to quantify the similarity between the vectors?
Similarity using distance (difference)
The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms in the
document d2 are very similar
[Figure: vector representation of the query and documents]
Use angle instead of distance
Take a document d and append it to itself. Call this document d̂.
Semantically, d and d̂ have the same content
The Euclidean distance between the two documents can be quite large
The angle between the two documents is 0, corresponding to maximal
similarity
Key idea: Rank documents according to angle with query
From angles to cosines
The following two notions are equivalent
Rank documents in increasing order of the angle between query and
document
Rank documents in decreasing order of the cosine of the angle between
query and document
Cosine is a monotonically decreasing function for the interval [0◦ ,
180◦ ]
Advantages of cosine similarity
The cosine score increases with the similarity of the vectors
The score lies in the range [0, 1] (tf-idf weights are non-negative)
Cosine(query, document)
qi is the tf-idf weight of term i in the query
di is the tf-idf weight of term i in the document
cos(q, d) is the cosine of the angle between q and d, i.e. the cosine similarity of q and d.
If q and d are length-normalized, the cosine similarity is simply the scalar (dot) product:
cos(q, d) = q · d = ∑_{i=1}^{|V|} q_i d_i
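A minimal sketch of the formula above: cosine similarity as a dot product once both vectors are length-normalized; the toy weight vectors are assumptions for illustration, not real tf-idf weights.

```python
# Cosine similarity = dot product of the two unit (length-normalized) vectors.
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    return sum(qi * di for qi, di in zip(normalize(q), normalize(d)))

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors -> 1.0
```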
Length normalization
A vector can be (length-) normalized by dividing each of its
components by its length.
For this we use the L2 norm
∥x∥_2 = √( ∑_i x_i² )
Dividing a vector by its L2 norm makes it a unit (length) vector
Effect on the two documents d and d̂ (d appended to itself) from
earlier slide: they have identical vectors after length-normalization
Long and short documents now have comparable weights
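A small sketch of the point above: d and d̂ (d concatenated with itself) point in the same direction, so they coincide after L2 normalization; the toy count vector is an assumption for illustration.

```python
# Appending d to itself doubles every count, but the unit vector is unchanged.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d = [3.0, 1.0, 2.0]            # toy count vector
d_hat = [2 * x for x in d]     # d appended to itself
print(l2_normalize(d))
print(l2_normalize(d_hat))     # same unit vector as d
```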
Cosine similarity
cos(q, d) = (q · d) / (∥q∥ ∥d∥) = ∑_{i=1}^{|V|} q_i d_i / ( √(∑_{i=1}^{|V|} q_i²) · √(∑_{i=1}^{|V|} d_i²) )
Cosine similarity among 3 documents
How similar are the novels?
SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH:
Wuthering Heights
term SaS PaP WH
affection 115 58 20
jealous 10 7 11
gossip 2 0 6
wuthering 0 0 38
Table: Term frequencies (counts)
Note: To simplify this example, we don’t do idf weighting
Cosine similarity among 3 documents
Table: Log frequency weighting
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Table: After length normalization
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588
cos(SaS,PaP) ≈
0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0
≈0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
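A small sketch reproducing the worked example from the tables above: log-frequency weighting, L2 normalization, then pairwise cosines via dot products (counts taken from the term-frequency table; no idf weighting, as noted).

```python
# Reproduce cos(SaS,PaP) ~ 0.94, cos(SaS,WH) ~ 0.79, cos(PaP,WH) ~ 0.69.
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_weights(doc):
    return [1 + math.log10(c) if c > 0 else 0.0 for c in (counts[doc][t] for t in terms)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vecs = {doc: normalize(log_weights(doc)) for doc in counts}
dot = lambda a, b: sum(x * y for x, y in zip(a, b))

print(round(dot(vecs["SaS"], vecs["PaP"]), 2))  # 0.94
print(round(dot(vecs["SaS"], vecs["WH"]), 2))   # 0.79
print(round(dot(vecs["PaP"], vecs["WH"]), 2))   # 0.69
```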
Computing cosine scores
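A hedged sketch of one way to compute cosine scores for a query over a small in-memory collection: tf-idf weighting, length normalization, then the top-k documents by score. The toy collection and brute-force scoring are assumptions for illustration; a real system would score term-at-a-time over an inverted index.

```python
# Compute cosine(query, d) for every document and return the k best.
import heapq
import math
from collections import Counter

docs = {                               # toy collection (assumed for illustration)
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
N = len(docs)
doc_tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for tf in doc_tf.values() for t in tf)

def weight(tf: int, term: str) -> float:
    """tf-idf weight; terms unseen in the collection get weight 0."""
    return (1 + math.log10(tf)) * math.log10(N / df[term]) if tf > 0 and df.get(term) else 0.0

def tfidf_vector(tf: Counter) -> dict:
    vec = {t: weight(c, t) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}   # length-normalized

doc_vecs = {d: tfidf_vector(tf) for d, tf in doc_tf.items()}

def cosine_scores(query: str, k: int = 2):
    q_vec = tfidf_vector(Counter(query.split()))
    scores = {d: sum(q_vec.get(t, 0.0) * w for t, w in vec.items())
              for d, vec in doc_vecs.items()}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(cosine_scores("new times"))
```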