
Information Storage and Retrieval 2011

CHAPTER 3
INFORMATION RETRIEVAL MODELS
Example one (Boolean Model)
Given
Doc 1: Information storage and retrieval.
Doc 2: Expert system and information retrieval system
Doc 3: Information processing and management
Doc 4: Information retrieval on Archive

Suppose our query consists of "Information AND Retrieval" which document will be
retrieved based on Boolean model?

Answer: {1, 2, 4}

Inverted index:

Index Term      Documents
Information     1, 2, 3, 4
Storage         1
Retrieval       1, 2, 4
System          2
Processing      3
Management      3
Archive         4
Expert          2

Information AND Retrieval
= {1, 2, 3, 4} ∩ {1, 2, 4}
= {1, 2, 4}
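The same retrieval can be checked programmatically. Below is a minimal Python sketch (the index is hard-coded from the table above; names such as `inverted_index` and `boolean_and` are illustrative, not part of the original example):

```python
# Boolean model: an inverted index maps each term to the set of document IDs
# containing it; the AND operator is just set intersection.
inverted_index = {
    "information": {1, 2, 3, 4},
    "storage":     {1},
    "retrieval":   {1, 2, 4},
    "system":      {2},
    "processing":  {3},
    "management":  {3},
    "archive":     {4},
    "expert":      {2},
}

def boolean_and(*terms):
    """Return the set of documents containing every one of the given terms."""
    result = inverted_index.get(terms[0].lower(), set())
    for term in terms[1:]:
        result = result & inverted_index.get(term.lower(), set())
    return result

print(boolean_and("Information", "Retrieval"))  # {1, 2, 4}
```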
Homework: Determine the rank of the above documents with respect to the
given query using the vector space model.

Example Two (Vector Space Model)
THE COLLECTION
Here’s the collection – 6 documents, Doc A through Doc F, with term-occurrences as
follows:
Doc A   care, cat, persian
Doc B   care, care, care, cat, cat, cat, persian, persian, persian
Doc C   cat, cat, cat, cat, cat, cat, cat, cat, cat
Doc D   care, cat, dog, dog, dog, dog, dog, dog, persian
Doc E   care, cat, dog
Doc F   care

TF weights
From this initial specification, we can make a number of observations:
1. The length of a document, l_d, is the total number of term-occurrences in it. The
length of Doc A is 3, the length of Doc B is 9, and so on. In other words,
l_docA = 3
l_docB = 9
l_docC = 9
l_docD = 9
l_docE = 3
l_docF = 1
2. The total number of term-occurrences in the collection, f_D, is the sum of the
document lengths: $f_D = \sum_{j=1}^{N} l_{d_j}$. In other words, for this collection,

f_D = 34
3. The total number of term-types in the collection is 4: these term-types, in
alphabetical order, are “care”, “cat”, “dog” and “persian”. Each document is made
up of a different number of occurrences of each of these term-types. Doc A, for
instance, is made up of 1 occurrence of the term-type “care”, 1 occurrence
of the term-type “cat”, 0 occurrences of the term-type “dog”, and 1 occurrence
of the term-type “persian”. Doc B, on the other hand, is made up of 3
occurrences of the term-type “care”, 3 occurrences of the term-type “cat”, 0
occurrences of the term-type “dog”, and 3 occurrences of the term-type “persian”.
From now on, for brevity, we’ll use the word “term” to mean “term-type”, and
“occurrence” to mean “term-occurrence”.

4. We can present this information in the form of a vector for each document.
Each document vector consists of an ordered list of values, each value
indicating the number of occurrences of a particular term. So, for Doc A, the
vector is <1, 1, 0, 1>, and for Doc B, the vector is <3, 3, 0, 3>. Note that the
order of terms represented by the values is the same in each vector: this is
essential if the vectors are to be compared, either with each other or with query
vectors. We can write:
Doc A = <1, 1, 0, 1>
Doc B = <3, 3, 0, 3>
Doc C = <0, 9, 0, 0>
Doc D = <1, 1, 6, 1>
Doc E = <1, 1, 1, 0>
Doc F = <1, 0, 0, 0>

5. The values making up these vectors are actually values of f_{d_i,t_j}, the within-
document frequency. These values are the ones that are commonly used as
TF weights, where TF stands for term frequency, and

$TF_{d_i,t_j} = f_{d_i,t_j}$
So we could call the vectors term-frequency vectors:
TFdocA = <1, 1, 0, 1>
TFdocB = <3, 3, 0, 3>
TFdocC = <0, 9, 0, 0>
TFdocD = <1, 1, 6, 1>
TFdocE = <1, 1, 1, 0>
TFdocF = <1, 0, 0, 0>

6. In fact, putting the vectors for every document in the collection together in this
way forms a term-frequency matrix, where the rows represent documents, the
columns represent terms, and the individual values represent individual term
frequencies:

        care  cat  dog  persian

Doc A    1     1    0     1
Doc B    3     3    0     3
Doc C    0     9    0     0
Doc D    1     1    6     1
Doc E    1     1    1     0
Doc F    1     0    0     0
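As a quick check, this term-frequency matrix can be built directly from the collection. A small Python sketch (the collection is hard-coded to match the listing above; variable names are illustrative):

```python
from collections import Counter

# The collection, copied from the listing above.
docs = {
    "A": ["care", "cat", "persian"],
    "B": ["care"] * 3 + ["cat"] * 3 + ["persian"] * 3,
    "C": ["cat"] * 9,
    "D": ["care", "cat"] + ["dog"] * 6 + ["persian"],
    "E": ["care", "cat", "dog"],
    "F": ["care"],
}

terms = ["care", "cat", "dog", "persian"]  # fixed term order used by every vector

# Term-frequency vector for each document: occurrence counts in the fixed term order.
tf = {name: [Counter(words)[t] for t in terms] for name, words in docs.items()}

for name, vector in tf.items():
    print(f"TFdoc{name} = {vector}")
# TFdocA = [1, 1, 0, 1], TFdocB = [3, 3, 0, 3], ..., TFdocF = [1, 0, 0, 0]
```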

IDF weights
The next thing we can do is calculate the IDF weights for each term. IDF stands for
inverse document frequency, but can also be conceptualized as a within-collection
frequency weight, in contrast with TF, which is a within-document frequency weight.
The IDF weight for a particular term does not vary from document to document,
whereas the TF weight for a particular term may well be very different for different
documents (as we saw earlier).

The formula you need to use to calculate the IDF weight for a term t_j is this:

$IDF_{D,t_j} = \log_2(N_D / n_{D,t_j})$

where

N_D = the total number of documents in the collection D

and

n_{D,t_j} = the number of documents in the collection D that contain at least one
occurrence of the term t_j.
So, to calculate a value for IDF, you need firstly to divide N by n, then take the logarithm
to base 2 of the result.

The reason we take the logarithm of N/n, rather than just using N/n on its own, is so
that we don't get such high values of IDF whenever we're in a situation where N is very
large and n is relatively small (which is often the case). Some people use a formula for
IDF that takes a logarithm to base 10, rather than to base 2; other people use a formula
for IDF that adds 1 to each final value (this is to ensure that you don't end up with
values of 0 for terms that actually appear in every document in the collection, since
log2(1) = 0). It really doesn't make much difference which formula you use.

So, the IDF weights for each term in the collection can be calculated and expressed in
the form of a single inverse document-frequency vector as follows:

IDF_{D,t_j} = <log2(6/5), log2(6/5), log2(6/2), log2(6/3)>

            = <0.26, 0.26, 1.58, 1.00>
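A short Python sketch of the same IDF calculation (the document-frequency counts n_{D,t} are read off the collection above; variable names are illustrative):

```python
import math

N = 6                                                     # N_D: documents in the collection
doc_freq = {"care": 5, "cat": 5, "dog": 2, "persian": 3}  # n_{D,t}: documents containing each term

idf = [math.log2(N / doc_freq[t]) for t in ["care", "cat", "dog", "persian"]]
print([round(w, 2) for w in idf])                         # [0.26, 0.26, 1.58, 1.0]
```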

Combined W weights

The third step is to calculate combined W weights, i.e., TF.IDF weights for each term in
each document. The TF.IDF weight in document d_i of term t_j is given by W_{d_i,t_j}, where

$W_{d_i,t_j} = TF_{d_i,t_j} \times IDF_{D,t_j}$

What you need to do is this: for each term in each document, multiply the TF value for
that term by the corresponding IDF value for that term. You end up with a matrix of
values again, each row representing a document and each column representing a term,
but this time the values aren't TF weights, they're TF.IDF weights.

So, remember the TF vectors look like this:


TFdocA = <1, 1, 0, 1>
TFdocB = <3, 3, 0, 3>
TFdocC = <0, 9, 0, 0>
TFdocD = <1, 1, 6, 1>
TFdocE = <1, 1, 1, 0>
TFdocF = <1, 0, 0, 0>
And the IDF vector looks like this:

IDF_{D,t_j} = <0.26, 0.26, 1.58, 1.00>

So, the W (i.e., TF.IDF) vectors look like this:

WdocA = <1 x 0.26, 1 x 0.26, 0 x 1.58, 1 x 1.00>
      = <0.26, 0.26, 0.00, 1.00>
WdocB = <3 x 0.26, 3 x 0.26, 0 x 1.58, 3 x 1.00>
      = <0.79, 0.79, 0.00, 3.00>
WdocC = <0 x 0.26, 9 x 0.26, 0 x 1.58, 0 x 1.00>
      = <0.00, 2.37, 0.00, 0.00>
WdocD = <1 x 0.26, 1 x 0.26, 6 x 1.58, 1 x 1.00>
      = <0.26, 0.26, 9.51, 1.00>
WdocE = <1 x 0.26, 1 x 0.26, 1 x 1.58, 0 x 1.00>
      = <0.26, 0.26, 1.58, 0.00>
WdocF = <1 x 0.26, 0 x 0.26, 0 x 1.58, 0 x 1.00>
      = <0.26, 0.00, 0.00, 0.00>
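The same W weights can be reproduced in a couple of lines (a sketch; the TF vectors and the rounded IDF vector are copied from above, so the third value for Doc D comes out as 9.48 rather than the 9.51 obtained with the unrounded IDF):

```python
# Combined W (TF.IDF) weights: element-wise product of each TF vector with the IDF vector.
tf = {
    "A": [1, 1, 0, 1], "B": [3, 3, 0, 3], "C": [0, 9, 0, 0],
    "D": [1, 1, 6, 1], "E": [1, 1, 1, 0], "F": [1, 0, 0, 0],
}
idf = [0.26, 0.26, 1.58, 1.00]

w = {doc: [round(t * i, 2) for t, i in zip(vec, idf)] for doc, vec in tf.items()}
for doc, vec in w.items():
    print(f"Wdoc{doc} = {vec}")
# WdocA = [0.26, 0.26, 0.0, 1.0], ..., WdocF = [0.26, 0.0, 0.0, 0.0]
```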

THE QUERIES

OK. Now, here are the queries again – 4 queries, Query 1 through Query 4, with the
terms making up those queries as follows:
Query 1 cat
Query 2 care, cat
Query 3 care, cat, persian
Query 4 care, cat(2)

In Query 4, the number “2” is there to indicate that you have somehow specified that the
term “cat” is twice as significant in this query as the term “care”. In other words, you
have assigned a weight to this term: not based on term frequencies (as the document
TF weights are), but rather based on your own perception of the relative significance of
terms in this representation of your information need.

We can represent these queries by vectors, in just the same way that we represented
documents by vectors earlier:

Query 1 = <0, 1, 0, 0>


Query 2 = <1, 1, 0, 0>
Query 3 = <1, 1, 0, 1>
Query 4 = <1, 2, 0, 0>

Note that the first three of these vectors are binary vectors, in that they are made up
purely of values of 0 or 1, respectively representing the absence or presence of a term
in the query. The fourth vector is a non-binary vector, which contains values
representing the weights assigned to each term in the query. But in general, either type
of query vector can be viewed as a vector made up of term weights, like the ones we
defined for documents earlier.

Wquery1 = <0, 1, 0, 0>

Wquery2 = <1, 1, 0, 0>

Wquery3 = <1, 1, 0, 1>

Wquery4 = <1, 2, 0, 0>
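For later use, these query W vectors can be written down directly in the same fixed term order (a small sketch; the dictionary name is illustrative):

```python
# Query W vectors in the fixed term order [care, cat, dog, persian].
# Query 4 carries the user-assigned weight of 2 on "cat".
query_w = {
    "query1": [0, 1, 0, 0],
    "query2": [1, 1, 0, 0],
    "query3": [1, 1, 0, 1],
    "query4": [1, 2, 0, 0],
}
```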

The cosine coefficient


Right. Now suppose you want to determine which documents are most similar to
which queries. In other words, you want to rank the documents in order of their
similarity to each query. (The basic assumption is that if one document-representation
is more similar to a given query-representation than another document, then the first
document is likely to be more relevant to your information need.) To do this, you need
to use a similarity coefficient – a formula that allows you to calculate numerical values
indicating how similar two things (two documents, for instance) are. Many of these
formulae (some association coefficients, for instance) produce values on a scale of 0 to 1,
where 0 represents complete dissimilarity and 1 represents complete similarity; other
formulae (some distance metrics, for instance) produce values on the same scale of 0 to 1,
but this time 0 represents complete similarity and 1 represents complete dissimilarity.
One of the most commonly used formulae in the first category is the cosine coefficient,
which looks like this:

$COS_{q_k,d_i} = \frac{\sum_{j=1}^{M_T} (W_{q_k,t_j} \times W_{d_i,t_j})}{\sqrt{\sum_{j=1}^{M_T} (W_{q_k,t_j})^2 \times \sum_{j=1}^{M_T} (W_{d_i,t_j})^2}}$

The way you use the cosine coefficient is like this.


1. You identify the query and the document you want to compare. (Remember that you
have to repeat the process, and calculate a new value of COS, for every
query/document pair; in a typical retrieval situation, you would want to calculate a
COS value indicating the similarity between any given query and every document in
the database.) Suppose, for example, you want to compare Query 1 and Doc A.
2. You take the W vector for the query and the W vector (i.e., the TF.IDF vector) for the
document.
The W vector for Query 1 looks like this:

W query1 = <0, 1, 0, 0>

And the W vector for Doc A looks like this:

WdocA = <0.26, 0.26, 0.00, 1.00>

3. You multiply together the corresponding W values for each term. In the example, you
end up with a vector that looks like this:

Wquery1 x WdocA = <0 x 0.26, 1 x 0.26, 0 x 0.00, 0 x 1.00>
                = <0.00, 0.26, 0.00, 0.00>

4. Then you add together (i.e., sum) the values in that vector. The result you get is the
value of the top half of the cosine formula: this is the so-called inner product or dot
product of the two original vectors. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j} \times W_{docA,t_j}) = 0.00 + 0.26 + 0.00 + 0.00 = 0.26$
5. Moving now to the bottom half of the formula, you first need to calculate the squares
of the W values in the query vector. In the example, you get a vector that looks like
this:

(Wquery1)² = <0 x 0, 1 x 1, 0 x 0, 0 x 0>
           = <0.00, 1.00, 0.00, 0.00>

6. Then you need to sum the values in that vector. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j})^2 = 0.00 + 1.00 + 0.00 + 0.00 = 1.00$

7. Similarly, you need to calculate the squares of the W values in the document vector.
In the example, you get a vector that looks like this:

(WdocA)² = <0.26 x 0.26, 0.26 x 0.26, 0.00 x 0.00, 1.00 x 1.00>
         = <0.07, 0.07, 0.00, 1.00>

8. Then you need to sum the values in that vector. In the example,

$\sum_{j=1}^{M_T} (W_{docA,t_j})^2 = 0.07 + 0.07 + 0.00 + 1.00 = 1.14$

9. Next, you need to multiply the results of steps (6) and (8) together. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2 = 1.00 \times 1.14 = 1.14$

10. Now take the square root of the result of step (9). This is the result of the bottom half
of the formula. In the example,

$\sqrt{\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2} = \sqrt{1.14} = 1.07$

11. Finally, you need to divide the result of step (4) by the result of step (10). This is the
value of the cosine coefficient! In the example,

$COS_{query1,docA} = \frac{\sum_{j=1}^{M_T} (W_{query1,t_j} \times W_{docA,t_j})}{\sqrt{\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2}} = 0.26 / 1.07 = 0.25$

So, after all that, we can say that the degree of similarity between Query 1 and Document A is
0.25, on a scale of 0 to 1, where 1 represents complete similarity.
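Steps (1) to (11) can be wrapped into a single function. A minimal Python sketch (the W vectors are copied from above; because these inputs are already rounded to two decimals, the result prints as 0.24 rather than the hand-worked 0.25):

```python
import math

def cosine(query_w, doc_w):
    """Cosine coefficient between a query W vector and a document W vector."""
    dot = sum(q * d for q, d in zip(query_w, doc_w))   # top half: inner product (steps 3-4)
    norm = math.sqrt(sum(q * q for q in query_w) *     # bottom half (steps 5-10)
                     sum(d * d for d in doc_w))
    return dot / norm if norm else 0.0

w_query1 = [0, 1, 0, 0]
w_docA   = [0.26, 0.26, 0.00, 1.00]
print(round(cosine(w_query1, w_docA), 2))              # 0.24 with these rounded inputs
```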

Refinements

We could have used different kinds of document-term weights, W_{d_i,t_j}, in the formula. For
example: instead of using TF.IDF weights, we could have used just TF weights.

We could have used different kinds of query-term weights, W_{q_k,t_j}, in the formula. For
example: instead of using just user-defined weights (which are a bit like TF weights), we could
have multiplied these by IDF weights to give TF.IDF values, and used those.

We could have used a different similarity coefficient entirely. There are many others. Other
association coefficients, for example, are based on the same inner product formula that the
cosine coefficient is based on, but use different methods of normalization (i.e., different ways of
countering the effect of document length); we could, for instance, have used the Dice
coefficient, which looks like this:

$DICE_{q_k,d_i} = \frac{2 \sum_{j=1}^{M_T} (W_{q_k,t_j} \times W_{d_i,t_j})}{\sum_{j=1}^{M_T} (W_{q_k,t_j})^2 + \sum_{j=1}^{M_T} (W_{d_i,t_j})^2}$

Homework 2: Rank the above documents with respect to Query 1 using the vector
space model.

