Vector Space Model for IR Students

The document discusses the vector space model, which is a technique used in information retrieval systems. In the vector space model, documents and queries are represented as vectors of identifiers such as index terms. Similarities between documents and queries are calculated by measuring the similarity between their vector representations, such as using the inner product of the vectors. Term weighting methods such as TF-IDF are used to assign weights to terms in the vectors based on factors like term frequency and inverse document frequency.


CS444: Information Retrieval and Web Search
Fall 2021

CHAPTER 5: VECTOR SPACE MODEL
Abstraction of search engine architecture

[Diagram: a crawler builds the indexed corpus; a doc analyzer turns documents into doc representations, which the indexer stores in the index; the user's query is analyzed into a query representation; the ranker matches the two representations to produce results, with evaluation and feedback feeding back into the ranking procedure.]
CS444 Information Retrieval & Web Search Engine, by Zainab Ahmed Mohammed
Exact (Boolean) Retrieval Model:

[Diagram: each document either MATCHES the query or is UNMATCHED; the retrieval result is exactly the set of matched documents.]

Weighted Retrieval Model:
A formal method that predicts the degree of relevance of a document to a query. It assigns a weight to each term and takes the length of a document into account.

(The Best-Match) Retrieval Models
Best-match models predict the degree to which a document is relevant to a query.
Ideally, this would be expressed as RELEVANT(q, d).
In practice, it is expressed as SIMILAR(q, d).
How might you compute the similarity (relevance) between q and d?

Classification of IR Models

Vector Space Model
Formally, a vector space is defined by a set of linearly independent basis vectors.
The basis vectors correspond to the dimensions or directions of the vector space.

What Is a Vector?
A vector is a point in a vector space; it has a length (from the origin to the point) and a direction.
• A 2-dimensional vector can be written as [x, y]
• A 3-dimensional vector can be written as [x, y, z]

The Vector-Space Model
Assume t distinct terms remain after preprocessing.
These "orthogonal" terms form a vector space.
 Dimensionality = t = |vocabulary|
Both documents and queries are expressed as t-dimensional vectors.
So a 50-term dictionary gives a 50-dimensional space.
The vector space model ranks documents (the top-k documents) by the vector-space similarity between the query vector and each document vector.
There are many ways to compute the similarity between two vectors.
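The representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides: the sample texts and the tokenizer (lowercase plus whitespace split) are assumptions standing in for real preprocessing.

```python
# Illustrative sketch: documents and queries as t-dimensional term vectors.
# The sample texts and the tokenizer (lowercase + whitespace split) are
# assumptions standing in for real preprocessing.
docs = [
    "information retrieval ranks documents",
    "vector space model for retrieval",
]

# Vocabulary = the t distinct terms remaining after preprocessing.
vocab = sorted({term for doc in docs for term in doc.lower().split()})

def to_vector(text):
    """Map a text to a t-dimensional vector of raw term counts."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return [counts.get(term, 0) for term in vocab]

doc_vectors = [to_vector(d) for d in docs]
query_vector = to_vector("retrieval model")  # same t dimensions as the documents
```

Every document and the query live in the same t-dimensional space, which is what makes the similarity computations on the following slides well defined.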

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted in the 3-dimensional term space with axes T1, T2, T3.]
Issues for the Vector Space Model
How to determine important words in a document?
How to determine the degree of importance of a term within a document and within the entire collection?
How to determine the degree of similarity between a document and the query?
In the case of the web, what is the collection, and what are the effects of links, formatting information, etc.?

Term Weighting:
(What are the most important terms?)
The terms of a document are not equally useful for describing the document's contents.
There are properties of an index term that are useful for evaluating the importance of the term in a document.
For instance, a term that appears in every document of a collection is completely useless for retrieval tasks.
Term importance can be characterized by a weight w_ij:
  w_ij > 0 if term i is found in document j
  w_ij = 0 if term i is not found in document j
These weights are used to compute a rank for each document in the collection.
The ranked query assigns each document a number in the interval [0, 1]; the closer the number is to 1, the better the match with the query.
Term Weights: Term Frequency (TF)
More frequent terms in a document are more important, i.e., more indicative of its topic.
  f_ij = frequency of term i in document j
The weights w_ij can be computed using the frequencies of occurrence of the terms within documents.
The total frequency of occurrence F_i of term k_i in the collection is defined as:

  F_i = Σ_{j=1..N} f_ij

where N is the number of documents in the collection.


TF Normalization
Two views of document length:
◦ A doc is long because it is verbose
◦ A doc is long because it has more content
Raw TF is inaccurate:
◦ Document length varies
◦ Repeated occurrences are less informative than the first occurrence
◦ Relevance does not increase proportionally with the number of term occurrences
We may want to normalize term frequency (tf) by:
◦ tf_ij = f_ij / max_i{f_ij}  (maximum TF scaling: normalize by the most frequent term in this doc)
◦ tf_ij = 1 + log f_ij  (sublinear TF scaling)
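The two scalings can be written directly. Only the formulas come from the slide; `counts` (raw frequencies f_ij for one document) and its contents are illustrative stand-ins.

```python
import math

def max_tf(counts):
    """Maximum TF scaling: divide each f_ij by the largest frequency in the doc."""
    peak = max(counts.values())
    return {term: f / peak for term, f in counts.items()}

def sublinear_tf(counts):
    """Sublinear TF scaling: 1 + log(f_ij) for terms that occur (f_ij > 0)."""
    return {term: 1 + math.log(f) for term, f in counts.items() if f > 0}

counts = {"retrieval": 3, "vector": 2, "model": 1}  # hypothetical document
print(max_tf(counts)["retrieval"])  # the most frequent term scales to 1.0
```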

Term Weights: Inverse Document Frequency (IDF)
Terms that appear in many different documents are less indicative of the overall topic.
  df_i = document frequency of term i = number of documents containing term i
  idf_i = inverse document frequency of term i = log2(N / df_i), where N is the total number of documents
IDF assigns higher weights to rare terms; the log dampens the effect relative to tf.

Term Weights: TF-IDF
Combining TF and IDF:
◦ Common in the doc  high tf  high weight
◦ Rare in the collection  high idf  high weight
◦ w(t, d) = TF(t, d) × IDF(t)
◦ w(t, d) = TF(t, d) × log2(N / df(t))
We can use a normalized TF:
  w(t, d) = (1 + log f(t, d)) × log2(N / df(t))  (sublinear scaling)
  w(t, d) = (f(t, d) / max{f(t, d)}) × log2(N / df(t))  (maximum scaling)

A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
Many other ways of determining term weights have been proposed.
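Putting TF and IDF together, a minimal sketch of w(t, d) using the sublinear TF variant above; the argument names and the example values are illustrative.

```python
import math

def tf_idf(f_td, df_t, N):
    """w(t, d) = (1 + log f(t, d)) * log2(N / df(t)); 0 if the term is absent."""
    if f_td == 0:
        return 0.0
    return (1 + math.log(f_td)) * math.log2(N / df_t)

# A term that is frequent in the doc but rare in the collection scores high:
print(tf_idf(10, 5, 10_000))      # frequent here, rare overall -> high weight
print(tf_idf(10, 9_000, 10_000))  # frequent here, common overall -> low weight
```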
TF-IDF
The most well-known document representation scheme in IR! (G. Salton et al., 1983)

"Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." - Wikipedia

The Gerard Salton Award is the highest achievement award in IR.
Computing TF-IDF: An Example
Given a document containing terms with the following frequencies:
  A(3), B(2), C(1)
Assume the collection contains N = 10,000 documents, and the document frequencies of these terms are:
  A(50), B(1300), C(250)
Then, using maximum scaling for tf:
  A: tf = 3/3; idf = log2(10000/50) = 7.6;  tf-idf = 7.6
  B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
  C: tf = 1/3; idf = log2(10000/250) = 5.3;  tf-idf = 1.8
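The arithmetic above can be checked mechanically (maximum TF scaling, same numbers as the example):

```python
import math

N = 10_000
freqs = {"A": 3, "B": 2, "C": 1}       # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the collection

max_f = max(freqs.values())
weights = {t: (f / max_f) * math.log2(N / dfs[t]) for t, f in freqs.items()}
for t, w in weights.items():
    print(t, round(w, 1))  # A 7.6, B 2.0, C 1.8 -- matching the slide
```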

Exercise: Try it using Sublinear scaling

Similarity Measure
A similarity measure is a function that computes the degree of similarity between two vectors.
We use a similarity measure between the query and each document.
Similarity can be computed in many ways.

A Good Similarity Measure: the Inner Product
The similarity between a document d_j and a query q can be computed as the vector inner product:

  sim(d_j, q) = d_j • q = Σ_{i=1..t} (w_ij × w_iq)

where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.

For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Inner Product: Examples
Binary:
  D = [1, 1, 1, 0, 1, 1, 0]
  Q = [1, 0, 1, 0, 0, 1, 1]
  (vector size = vocabulary size = 7; a 0 means the corresponding term does not occur in the document or query)
  sim(D, Q) = 3

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q  = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2×0 + 3×0 + 5×2 = 10
  sim(D2, Q) = 3×0 + 7×0 + 1×2 = 2
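Both examples above can be reproduced with a one-line inner product:

```python
def inner_product(d, q):
    """sim(d, q) = sum of w_ij * w_iq over all terms."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Weighted example from the slide:
D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2

# Binary example: the inner product counts the matched query terms.
D_bin = [1, 1, 1, 0, 1, 1, 0]
Q_bin = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D_bin, Q_bin))  # 3
```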

Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors:

  CosSim(d_j, q) = (d_j • q) / (|d_j| × |q|)
                 = Σ_{i=1..t} (w_ij × w_iq) / ( √(Σ_{i=1..t} w_ij²) × √(Σ_{i=1..t} w_iq²) )

[Figure: D1, D2, and Q in the term space spanned by t1, t2, t3, with the angle between each document vector and the query.]

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is about 6 times better than D2 using cosine similarity.
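The slide's numbers follow directly from the formula:

```python
import math

def cosine(d, q):
    """CosSim(d, q): inner product divided by the product of the vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cosine(D1, Q), 2))  # 0.81
print(round(cosine(D2, Q), 2))  # 0.13
```

Unlike the raw inner product, the cosine is length-normalized, so a long document cannot win simply by having larger weights overall.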

Problems with the Vector Space Model
Missing semantic information (e.g., word sense).
Missing syntactic information (e.g., phrase structure, word order, proximity information).
Assumes term independence (e.g., ignores synonymy).
Lacks the control of a Boolean model (e.g., requiring a term to appear in a document):
  Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but each less frequently.

