Term document scoring and vector space model
Ayan Das
Outline
1 tf − idf scoring
2 Vector space model
tf − idf scoring
Ranked retrieval
For Boolean queries, documents either match or don’t match.
Not possible to judge the degree of relevance of a document with
respect to a query.
Good for expert users with precise understanding of their needs and the
collection
Not good for the majority of users as they are incapable of writing
Boolean queries
Thus, Boolean retrieval not suitable for web search
Boolean queries often result in either too few (=0) or too many
(1000s) results.
“standard user dlink 650” → 200,000 hits
“standard user dlink 650 no card found” → 0 hits
AND gives too few; OR gives too many results
Ranked retrieval models
Boolean retrieval: Returns set of documents satisfying a query
expression
Ranked retrieval: The system returns an ordering over the (top)
documents in the collection for a query
When a system produces a ranked result set, the size of the result set is not an issue:
only the top k most relevant (highest-ranking) documents need to be returned,
so the user is not overwhelmed.
Ranked retrieval models
Return the documents in an order most likely to be useful to the
searcher
How can we rank-order the documents in the collection with respect
to a query?
Assign a score to each document which estimates how well the
document “matches” the query
The score may be in the range [0, 1]
Query-document matching scores
We need a way of assigning a score to a query/document pair
For a given one-term query
If the query term does not occur in the document: the score should be 0
The more frequent the query term in the document, the higher the
score
The rest of the discussion is based on this idea and we explore several
alternatives and extensions
Query-document matching scores
Term frequency (tf): The number of times a term occurs in the
document.
Rare terms in a collection are more informative than frequent terms.
The documents themselves may vary in length
We need a more sophisticated way of normalizing for document length
Term-document incidence matrix
Each document is represented by a binary vector ∈ {0, 1}^|V|
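A minimal sketch of building such an incidence matrix in Python; the toy documents and whitespace tokenization are assumptions for illustration, not part of the original slide.

```python
# Build a binary term-document incidence matrix: one {0,1}^|V| vector per document.
docs = {
    "d1": "Antony and Cleopatra",   # toy documents (assumed for illustration)
    "d2": "Julius Caesar",
    "d3": "The Tempest",
}
vocab = sorted({t for text in docs.values() for t in text.lower().split()})

incidence = {
    name: [1 if term in set(text.lower().split()) else 0 for term in vocab]
    for name, text in docs.items()
}
print(vocab)
for name, vec in incidence.items():
    print(name, vec)
```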
Term-document count matrices
Consider the number of occurrences of a term in a document:
Each document is a count vector
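A small sketch along the same lines, replacing the 0/1 entries with term counts; the toy documents are again assumptions for illustration.

```python
# Term-document count matrix: each document becomes a vector of term counts.
from collections import Counter

docs = {
    "d1": "to be or not to be",   # toy documents (assumed for illustration)
    "d2": "let it be",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

count_matrix = {
    name: [Counter(text.split())[term] for term in vocab]
    for name, text in docs.items()
}
print(vocab)             # ['be', 'it', 'let', 'not', 'or', 'to']
for name, vec in count_matrix.items():
    print(name, vec)     # d1 -> [2, 0, 0, 1, 1, 2], d2 -> [1, 1, 1, 0, 0, 0]
```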
Bag of words model
Considers a document or a query as a multiset (bag) of terms
Does not take the order of the words in the document into account
The vector space representation likewise does not preserve word order (see the sketch after the table below)
John runs faster than Mary and Mary runs faster than John have the
same vector representation
Step back from positional indexing
                               John  Mary  runs  faster  than
John runs faster than Mary       1     1     1       1     1
Mary runs faster than John       1     1     1       1     1
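A small sketch of the bag-of-words view: the two sentences from the table yield identical term counts, so their vector representations are equal.

```python
# Bag of words ignores word order: both sentences produce the same count vector.
from collections import Counter

s1 = "John runs faster than Mary"
s2 = "Mary runs faster than John"
print(Counter(s1.split()) == Counter(s2.split()))  # True
```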
Term frequency - tf
The term frequency tft,d of term t in document d is defined as the
number of times that t occurs in d.
We want to use tf when computing query-document match scores.
Using raw term frequency has some disadvantages
Relevance is not directly proportional to the term frequency
A document with 10 occurrences of a term may be more relevant than
a document containing the term once but not 10 times more relevant
Term frequency - tf
A common normalization scheme is the log-frequency weight of term t in d:
w_{t,d} = 1 + log_{10}(tf_{t,d}) if tf_{t,d} > 0, and w_{t,d} = 0 otherwise
tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
Score for a document-query pair: sum over terms t in both q and d:
tf_matching_score(q, d) = ∑_{t ∈ q ∩ d} (1 + log_{10} tf_{t,d})
The score is 0 if none of the query terms is present in the document.
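To make the weighting and score concrete, here is a minimal sketch following the formulas above; the toy query/document and whitespace tokenization are assumptions for illustration.

```python
# Log-frequency weighting and the tf matching score.
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query: str, doc: str) -> float:
    """Sum of log-frequency weights over terms occurring in both query and document."""
    doc_counts = Counter(doc.lower().split())
    return sum(log_tf_weight(doc_counts[t]) for t in set(query.lower().split()))

print(tf_matching_score("best car insurance", "car insurance is the best insurance"))
```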
Document Frequency
Document frequency (dft ): Number of documents in the collection
in which term t occurs.
Use how widespread a term is across the documents of the collection for weighting and
ranking.
Rare terms are more informative than frequent terms
Consider a term in the query that is rare in the collection e.g.
arachnocentric
A document containing this term is very likely to be relevant.
We want high weights for rare terms like arachnocentric
Document Frequency
Frequent terms are less informative than rare terms.
Consider a term in the query that is frequent in the collection (e.g.,
good, increase, line).
A document containing this term is more likely to be relevant than a
document that doesn’t.
These frequent terms are not sure indicators of relevance.
For these frequent terms we want positive weights, but lower weights
than the rare terms
Need high weights for rare terms
We can use the document frequency to factor this phenomenon into
computing the matching score.
Inverse document frequency (idf) score
dft is an inverse measure of the informativeness of term t
Inverse document frequency (idf): a measure of the informativeness of term t.
We define the idf weight of term t as
idf_t = log_{10}(N / df_t)
where N is the number of documents in the collection.
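A minimal sketch of the idf weight, assuming a toy collection of documents represented as sets of terms.

```python
# idf_t = log10(N / df_t): rare terms get high weights, frequent terms low weights.
import math

collection = [                      # toy collection (assumed for illustration)
    {"car", "insurance", "auto"},
    {"car", "best", "auto"},
    {"insurance", "best"},
    {"arachnocentric"},
]

def idf(term: str, docs) -> float:
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

print(idf("car", collection))             # frequent term -> low idf (~0.30)
print(idf("arachnocentric", collection))  # rare term -> high idf (~0.60)
```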
Effect of idf on ranking
Does idf have an effect on ranking for one-term queries, e.g., iPhone?
idf has no effect on ranking one term queries
idf used to measure the relative importance of terms
idf affects the ranking of documents for queries with at least two terms
For the query capricious person, idf weighting makes occurrences of
capricious count for much more in the final document ranking than
occurrences of person
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the
collection, counting multiple occurrences within a document
Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760
tf − idf weighting
The tf − idf weight of a term is the product of its tf weight and its idf
weight.
w_{t,d} = (1 + log_{10} tf_{t,d}) · log_{10}(N / df_t)
Best known weighting scheme in information retrieval
Increases with the
number of occurrences within a document (term frequency)
rarity of the term in the collection (inverse document frequency)
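A minimal sketch combining the two components into the tf-idf weight defined above; the toy collection is an assumption for illustration.

```python
# tf-idf weight: w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t).
import math
from collections import Counter

docs = [                                   # toy collection (assumed for illustration)
    "car insurance auto insurance",
    "best car auto",
    "best insurance",
]
N = len(docs)
doc_counts = [Counter(d.split()) for d in docs]
df = Counter(t for counts in doc_counts for t in counts)

def tf_idf(term: str, doc_index: int) -> float:
    tf = doc_counts[doc_index][term]
    if tf == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(tf_idf("insurance", 0))  # occurs twice in doc 0, in 2 of the 3 docs
```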
Vector space model
tf − idf score matrix
Each document is now represented by a real-valued vector of tf-idf
weights ∈ R^|V|
Documents as Vectors
Each document is now represented by a real-valued vector of tf-idf
weights ∈ R^|V|
It is a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply
this to web search engines
Each vector is very sparse - most entries are zero.
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the
space
Key idea 2: Rank documents according to their proximity to the query
in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
How to quantify the similarity between the vectors?
Similarity using distance (difference)
The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms in the
document d2 are very similar
[Figure: vector representation of the query and documents]
Use angle instead of distance
Take a document d and append it to itself. Call this document d̂.
Semantically, d and d̂ have the same content
The Euclidean distance between the two documents can be quite large
The angle between the two documents is 0, corresponding to maximal
similarity
Key idea: Rank documents according to angle with query
From angles to cosines
The following two notions are equivalent
Rank documents in increasing order of the angle between query and
document
Rank documents in decreasing order of the cosine of the angle between
query and document
Cosine is a monotonically decreasing function for the interval [0◦ ,
180◦ ]
Advantages of cosine similarity
The cosine score increases with the similarity of the vectors
The score lies in the range [0, 1] (tf-idf weights are non-negative)
Cosine(query, document)
qi is the tf-idf weight of term i in the query
di is the tf-idf weight of term i in the document
cos(q, d) is the cosine of the angle between q and d, i.e. the cosine similarity of q and d.
If q and d are length-normalized, the cosine similarity is simply the scalar (dot) product:
cos(q, d) = q · d = ∑_{i=1}^{|V|} q_i d_i
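A minimal sketch of the formula above: cosine similarity as a dot product once both vectors are length-normalized; the toy weight vectors are assumptions for illustration, not real tf-idf weights.

```python
# Cosine similarity = dot product of the two unit (length-normalized) vectors.
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    return sum(qi * di for qi, di in zip(normalize(q), normalize(d)))

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors -> 1.0
```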
Length normalization
A vector can be (length-) normalized by dividing each of its
components by its length.
For this we use the L2 norm
∥x∥_2 = √( ∑_i x_i² )
Dividing a vector by its L2 norm makes it a unit (length) vector
Effect on the two documents d and d̂ (d appended to itself) from
earlier slide: they have identical vectors after length-normalization
Long and short documents now have comparable weights
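A small sketch of the point above: d and d̂ (d concatenated with itself) point in the same direction, so they coincide after L2 normalization; the toy count vector is an assumption for illustration.

```python
# Appending d to itself doubles every count, but the unit vector is unchanged.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d = [3.0, 1.0, 2.0]            # toy count vector
d_hat = [2 * x for x in d]     # d appended to itself
print(l2_normalize(d))
print(l2_normalize(d_hat))     # same unit vector as d
```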
Cosine similarity
cos(q, d) = (q · d) / (∥q∥ ∥d∥) = ∑_{i=1}^{|V|} q_i d_i / ( √(∑_{i=1}^{|V|} q_i²) · √(∑_{i=1}^{|V|} d_i²) )
Cosine similarity among 3 documents
How similar are the novels?
SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH:
Wuthering Heights
term SaS PaP WH
affection 115 58 20
jealous 10 7 11
gossip 2 0 6
wuthering 0 0 38
Table: Term frequencies (counts)
Note: To simplify this example, we don’t do idf weighting
Cosine similarity among 3 documents
Table: Log frequency weighting
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Table: After length normalization
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588
cos(SaS,PaP) ≈
0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0
≈0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
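A small sketch reproducing the worked example from the tables above: log-frequency weighting, L2 normalization, then pairwise cosines via dot products (counts taken from the term-frequency table; no idf weighting, as noted).

```python
# Reproduce cos(SaS,PaP) ~ 0.94, cos(SaS,WH) ~ 0.79, cos(PaP,WH) ~ 0.69.
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_weights(doc):
    return [1 + math.log10(c) if c > 0 else 0.0 for c in (counts[doc][t] for t in terms)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vecs = {doc: normalize(log_weights(doc)) for doc in counts}
dot = lambda a, b: sum(x * y for x, y in zip(a, b))

print(round(dot(vecs["SaS"], vecs["PaP"]), 2))  # 0.94
print(round(dot(vecs["SaS"], vecs["WH"]), 2))   # 0.79
print(round(dot(vecs["PaP"], vecs["WH"]), 2))   # 0.69
```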
Computing cosine scores
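A hedged sketch of one way to compute cosine scores for a query over a small in-memory collection: tf-idf weighting, length normalization, then the top-k documents by score. The toy collection and brute-force scoring are assumptions for illustration; a real system would score term-at-a-time over an inverted index.

```python
# Compute cosine(query, d) for every document and return the k best.
import heapq
import math
from collections import Counter

docs = {                               # toy collection (assumed for illustration)
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
N = len(docs)
doc_tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for tf in doc_tf.values() for t in tf)

def weight(tf: int, term: str) -> float:
    """tf-idf weight; terms unseen in the collection get weight 0."""
    return (1 + math.log10(tf)) * math.log10(N / df[term]) if tf > 0 and df.get(term) else 0.0

def tfidf_vector(tf: Counter) -> dict:
    vec = {t: weight(c, t) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}   # length-normalized

doc_vecs = {d: tfidf_vector(tf) for d, tf in doc_tf.items()}

def cosine_scores(query: str, k: int = 2):
    q_vec = tfidf_vector(Counter(query.split()))
    scores = {d: sum(q_vec.get(t, 0.0) * w for t, w in vec.items())
              for d, vec in doc_vecs.items()}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(cosine_scores("new times"))
```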