Vector Space Model for IR Students

The document discusses the vector space model, which is a technique used in information retrieval systems. In the vector space model, documents and queries are represented as vectors of identifiers such as index terms. Similarities between documents and queries are calculated by measuring the similarity between their vector representations, such as using the inner product of the vectors. Term weighting methods such as TF-IDF are used to assign weights to terms in the vectors based on factors like term frequency and inverse document frequency.


CS444: Information Retrieval and Web Search
Fall 2021

CHAPTER 5: VECTOR SPACE MODEL
Abstraction of search engine architecture

[Diagram: a crawler builds the indexed corpus; a doc analyzer turns documents into doc representations, which the indexer stores in the index; the user's query is analyzed into a query representation; the ranker matches the two representations to produce results, with evaluation and feedback feeding back into the ranking procedure.]
CS444 Information Retrieval & Web Search Engine, by Zainab Ahmed Mohammed
Exact (Boolean) Retrieval Model:

[Diagram: each document either MATCHES the query or is UNMATCHED; the retrieval result is exactly the set of matched documents.]

Weighted Retrieval Model:
A formal method that predicts the degree of relevance of a document to a query. It assigns a weight to each term and takes the length of a document into account.

(The Best-Match) Retrieval Models
Best-match models predict the degree to which a document is relevant to a query.
Ideally, this would be expressed as RELEVANT(q, d).
In practice, it is expressed as SIMILAR(q, d).
How might you compute the similarity (relevance) between q and d?

Classification of IR Models

Vector Space Model
Formally, a vector space is defined by a set of linearly independent basis vectors.
The basis vectors correspond to the dimensions or directions of the vector space.

What Is a Vector?
A vector is a point in a vector space; it has a length (from the origin to the point) and a direction.
• A 2-dimensional vector can be written as [x, y]
• A 3-dimensional vector can be written as [x, y, z]

The Vector-Space Model
Assume t distinct terms remain after preprocessing.
These "orthogonal" terms form a vector space.
 Dimensionality = t = |vocabulary|
Both documents and queries are expressed as t-dimensional vectors.
So a 50-term dictionary gives a 50-dimensional space.
The vector space model ranks documents (the top-k documents) by the vector-space similarity between the query vector and each document vector.
There are many ways to compute the similarity between two vectors.
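The representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides: the sample texts and the tokenizer (lowercase plus whitespace split) are assumptions standing in for real preprocessing.

```python
# Illustrative sketch: documents and queries as t-dimensional term vectors.
# The sample texts and the tokenizer (lowercase + whitespace split) are
# assumptions standing in for real preprocessing.
docs = [
    "information retrieval ranks documents",
    "vector space model for retrieval",
]

# Vocabulary = the t distinct terms remaining after preprocessing.
vocab = sorted({term for doc in docs for term in doc.lower().split()})

def to_vector(text):
    """Map a text to a t-dimensional vector of raw term counts."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return [counts.get(term, 0) for term in vocab]

doc_vectors = [to_vector(d) for d in docs]
query_vector = to_vector("retrieval model")  # same t dimensions as the documents
```

Every document and the query live in the same t-dimensional space, which is what makes the similarity computations on the following slides well defined.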

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted in the 3-dimensional term space with axes T1, T2, T3.]
Issues for the Vector Space Model
How to determine important words in a document?
How to determine the degree of importance of a term within a document and within the entire collection?
How to determine the degree of similarity between a document and the query?
In the case of the web, what is the collection, and what are the effects of links, formatting information, etc.?

Term Weighting:
(What are the most important terms?)
The terms of a document are not equally useful for describing the document's contents.
There are properties of an index term that are useful for evaluating the importance of the term in a document.
For instance, a term that appears in every document of a collection is completely useless for retrieval tasks.
Term importance can be characterized by a weight w_ij:
  w_ij > 0 if term i is found in document j
  w_ij = 0 if term i is not found in document j
These weights are used to compute a rank for each document in the collection.
The ranked query assigns each document a number in the interval [0, 1]; the closer the number is to 1, the better the match with the query.
Term Weights: Term Frequency (TF)
More frequent terms in a document are more important, i.e., more indicative of its topic.
  f_ij = frequency of term i in document j
The weights w_ij can be computed using the frequencies of occurrence of the terms within documents.
The total frequency of occurrence F_i of term k_i in the collection is defined as:

  F_i = Σ_{j=1..N} f_ij

where N is the number of documents in the collection.


TF Normalization
Two views of document length:
◦ A doc is long because it is verbose
◦ A doc is long because it has more content
Raw TF is inaccurate:
◦ Document length varies
◦ Repeated occurrences are less informative than the first occurrence
◦ Relevance does not increase proportionally with the number of term occurrences
We may want to normalize term frequency (tf) by:
◦ tf_ij = f_ij / max_i{f_ij}  (maximum TF scaling: normalize by the most frequent term in this doc)
◦ tf_ij = 1 + log f_ij  (sublinear TF scaling)
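The two scalings can be written directly. Only the formulas come from the slide; `counts` (raw frequencies f_ij for one document) and its contents are illustrative stand-ins.

```python
import math

def max_tf(counts):
    """Maximum TF scaling: divide each f_ij by the largest frequency in the doc."""
    peak = max(counts.values())
    return {term: f / peak for term, f in counts.items()}

def sublinear_tf(counts):
    """Sublinear TF scaling: 1 + log(f_ij) for terms that occur (f_ij > 0)."""
    return {term: 1 + math.log(f) for term, f in counts.items() if f > 0}

counts = {"retrieval": 3, "vector": 2, "model": 1}  # hypothetical document
print(max_tf(counts)["retrieval"])  # the most frequent term scales to 1.0
```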

Term Weights: Inverse Document Frequency (IDF)
Terms that appear in many different documents are less indicative of the overall topic.
  df_i = document frequency of term i = number of documents containing term i
  idf_i = inverse document frequency of term i = log2(N / df_i), where N is the total number of documents
IDF assigns higher weights to rare terms; the log dampens the effect relative to tf.

Term Weights: TF-IDF
Combining TF and IDF:
◦ Common in the doc  high tf  high weight
◦ Rare in the collection  high idf  high weight
◦ w(t, d) = TF(t, d) × IDF(t)
◦ w(t, d) = TF(t, d) × log2(N / df(t))
We can use a normalized TF:
  w(t, d) = (1 + log f(t, d)) × log2(N / df(t))  (sublinear scaling)
  w(t, d) = (f(t, d) / max{f(t, d)}) × log2(N / df(t))  (maximum scaling)

A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
Many other ways of determining term weights have been proposed.
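Putting TF and IDF together, a minimal sketch of w(t, d) using the sublinear TF variant above; the argument names and the example values are illustrative.

```python
import math

def tf_idf(f_td, df_t, N):
    """w(t, d) = (1 + log f(t, d)) * log2(N / df(t)); 0 if the term is absent."""
    if f_td == 0:
        return 0.0
    return (1 + math.log(f_td)) * math.log2(N / df_t)

# A term that is frequent in the doc but rare in the collection scores high:
print(tf_idf(10, 5, 10_000))      # frequent here, rare overall -> high weight
print(tf_idf(10, 9_000, 10_000))  # frequent here, common overall -> low weight
```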
TF-IDF
The most well-known document representation scheme in IR! (G. Salton et al., 1983)

"Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." - Wikipedia

The Gerard Salton Award is the highest achievement award in IR.
Computing TF-IDF: An Example
Given a document containing terms with the following frequencies:
  A(3), B(2), C(1)
Assume the collection contains N = 10,000 documents, and the document frequencies of these terms are:
  A(50), B(1300), C(250)
Then, using maximum scaling for tf:
  A: tf = 3/3; idf = log2(10000/50) = 7.6;  tf-idf = 7.6
  B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
  C: tf = 1/3; idf = log2(10000/250) = 5.3;  tf-idf = 1.8
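The arithmetic above can be checked mechanically (maximum TF scaling, same numbers as the example):

```python
import math

N = 10_000
freqs = {"A": 3, "B": 2, "C": 1}       # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the collection

max_f = max(freqs.values())
weights = {t: (f / max_f) * math.log2(N / dfs[t]) for t, f in freqs.items()}
for t, w in weights.items():
    print(t, round(w, 1))  # A 7.6, B 2.0, C 1.8 -- matching the slide
```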

Exercise: Try it using Sublinear scaling

Similarity Measure
A similarity measure is a function that computes the degree of similarity between two vectors.
We use a similarity measure between the query and each document.
Similarity can be computed in many ways.

A Good Similarity Measure: the Inner Product
The similarity between a document d_j and a query q can be computed as the vector inner product:

  sim(d_j, q) = d_j • q = Σ_{i=1..t} (w_ij × w_iq)

where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.

For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Inner Product: Examples
Binary:
  D = [1, 1, 1, 0, 1, 1, 0]
  Q = [1, 0, 1, 0, 0, 1, 1]
  (vector size = vocabulary size = 7; a 0 means the corresponding term does not occur in the document or query)
  sim(D, Q) = 3

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q  = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2×0 + 3×0 + 5×2 = 10
  sim(D2, Q) = 3×0 + 7×0 + 1×2 = 2
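Both examples above can be reproduced with a one-line inner product:

```python
def inner_product(d, q):
    """sim(d, q) = sum of w_ij * w_iq over all terms."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Weighted example from the slide:
D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2

# Binary example: the inner product counts the matched query terms.
D_bin = [1, 1, 1, 0, 1, 1, 0]
Q_bin = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D_bin, Q_bin))  # 3
```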

Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors:

  CosSim(d_j, q) = (d_j • q) / (|d_j| × |q|)
                 = Σ_{i=1..t} (w_ij × w_iq) / ( √(Σ_{i=1..t} w_ij²) × √(Σ_{i=1..t} w_iq²) )

[Figure: D1, D2, and Q in the term space spanned by t1, t2, t3, with the angle between each document vector and the query.]

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is about 6 times better than D2 using cosine similarity.
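The slide's numbers follow directly from the formula:

```python
import math

def cosine(d, q):
    """CosSim(d, q): inner product divided by the product of the vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cosine(D1, Q), 2))  # 0.81
print(round(cosine(D2, Q), 2))  # 0.13
```

Unlike the raw inner product, the cosine is length-normalized, so a long document cannot win simply by having larger weights overall.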

Problems with the Vector Space Model
Missing semantic information (e.g., word sense).
Missing syntactic information (e.g., phrase structure, word order, proximity information).
Assumes term independence (e.g., ignores synonymy).
Lacks the control of a Boolean model (e.g., requiring a term to appear in a document):
  Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but each less frequently.

