
Information Storage and Retrieval 2011

CHAPTER 3
INFORMATION RETRIEVAL MODELS
Example one (Boolean Model)
Given
Doc 1: Information storage and retrieval.
Doc 2: Expert system and information retrieval system
Doc 3: Information processing and management
Doc 4: Information retrieval on Archive

Suppose our query consists of "Information AND Retrieval" which document will be
retrieved based on Boolean model?

Answer: {1, 2, 4}

Inverted index:

Index Term      Documents
Information     1, 2, 3, 4
Storage         1
Retrieval       1, 2, 4
System          2
Processing      3
Management      3
Archive         4
Expert          2

Information AND Retrieval
= {1, 2, 3, 4} ∩ {1, 2, 4}
= {1, 2, 4}
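The same retrieval can be checked programmatically. Below is a minimal Python sketch (the index is hard-coded from the table above; names such as `inverted_index` and `boolean_and` are illustrative, not part of the original example):

```python
# Boolean model: an inverted index maps each term to the set of document IDs
# containing it; the AND operator is just set intersection.
inverted_index = {
    "information": {1, 2, 3, 4},
    "storage":     {1},
    "retrieval":   {1, 2, 4},
    "system":      {2},
    "processing":  {3},
    "management":  {3},
    "archive":     {4},
    "expert":      {2},
}

def boolean_and(*terms):
    """Return the set of documents containing every one of the given terms."""
    result = inverted_index.get(terms[0].lower(), set())
    for term in terms[1:]:
        result = result & inverted_index.get(term.lower(), set())
    return result

print(boolean_and("Information", "Retrieval"))  # {1, 2, 4}
```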
Homework: Determine the rank of the above documents with respect to the
given query using the vector space model.

Example Two (Vector Space Model)
THE COLLECTION
Here’s the collection – 6 documents, Doc A through Doc F, with term-occurrences as
follows:
Doc A   care, cat, persian
Doc B   care, care, care, cat, cat, cat, persian, persian, persian
Doc C   cat, cat, cat, cat, cat, cat, cat, cat, cat
Doc D   care, cat, dog, dog, dog, dog, dog, dog, persian
Doc E   care, cat, dog
Doc F   care

TF weights
From this initial specification, we can make a number of observations:
1. The length of a document, l_d, is the total number of term-occurrences in it. The
length of Doc A is 3, the length of Doc B is 9, and so on. In other words,
l_docA = 3
l_docB = 9
l_docC = 9
l_docD = 9
l_docE = 3
l_docF = 1
2. The total number of term-occurrences in the collection, f_D, is the sum of the
document lengths: $f_D = \sum_{j=1}^{N} l_{d_j}$. In other words, for this collection,

f_D = 34
3. The total number of term-types in the collection is 4: these term-types, in
alphabetical order, are “care”, “cat”, “dog” and “persian”. Each document is made
up of a different number of occurrences of each of these term-types. Doc A, for
instance, is made up of 1 occurrence of the term-type “care”, 1 occurrence
of the term-type “cat”, 0 occurrences of the term-type “dog”, and 1 occurrence
of the term-type “persian”. Doc B, on the other hand, is made up of 3
occurrences of the term-type “care”, 3 occurrences of the term-type “cat”, 0
occurrences of the term-type “dog”, and 3 occurrences of the term-type “persian”.
From now on, for brevity, we’ll use the word “term” to mean “term-type”, and
“occurrence” to mean “term-occurrence”.

4. We can present this information in the form of a vector for each document.
Each document vector consists of an ordered list of values, each value
indicating the number of occurrences of a particular term. So, for Doc A, the
vector is <1, 1, 0, 1>, and for Doc B, the vector is <3, 3, 0, 3>. Note that the
order of terms represented by the values is the same in each vector: this is
essential if the vectors are to be compared, either with each other or with query
vectors. We can write:
Doc A = <1, 1, 0, 1>
Doc B = <3, 3, 0, 3>
Doc C = <0, 9, 0, 0>
Doc D = <1, 1, 6, 1>
Doc E = <1, 1, 1, 0>
Doc F = <1, 0, 0, 0>

5. The values making up these vectors are actually values of f_{d_i,t_j}, the within-
document frequency. These values are the ones that are commonly used as
TF weights, where TF stands for term frequency, and

$TF_{d_i,t_j} = f_{d_i,t_j}$
So we could call the vectors term-frequency vectors:
TFdocA = <1, 1, 0, 1>
TFdocB = <3, 3, 0, 3>
TFdocC = <0, 9, 0, 0>
TFdocD = <1, 1, 6, 1>
TFdocE = <1, 1, 1, 0>
TFdocF = <1, 0, 0, 0>

6. In fact, putting the vectors for every document in the collection together in this
way forms a term-frequency matrix, where the rows represent documents, the
columns represent terms, and the individual values represent individual term
frequencies:

        care  cat  dog  persian

Doc A    1     1    0     1
Doc B    3     3    0     3
Doc C    0     9    0     0
Doc D    1     1    6     1
Doc E    1     1    1     0
Doc F    1     0    0     0
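As a quick check, this term-frequency matrix can be built directly from the collection. A small Python sketch (the collection is hard-coded to match the listing above; variable names are illustrative):

```python
from collections import Counter

# The collection, copied from the listing above.
docs = {
    "A": ["care", "cat", "persian"],
    "B": ["care"] * 3 + ["cat"] * 3 + ["persian"] * 3,
    "C": ["cat"] * 9,
    "D": ["care", "cat"] + ["dog"] * 6 + ["persian"],
    "E": ["care", "cat", "dog"],
    "F": ["care"],
}

terms = ["care", "cat", "dog", "persian"]  # fixed term order used by every vector

# Term-frequency vector for each document: occurrence counts in the fixed term order.
tf = {name: [Counter(words)[t] for t in terms] for name, words in docs.items()}

for name, vector in tf.items():
    print(f"TFdoc{name} = {vector}")
# TFdocA = [1, 1, 0, 1], TFdocB = [3, 3, 0, 3], ..., TFdocF = [1, 0, 0, 0]
```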

IDF weights
The next thing we can do is calculate the IDF weights for each term. IDF stands for
inverse document frequency, but can also be conceptualized as a within-collection
frequency weight, in contrast with TF, which is a within-document frequency weight.
The IDF weight for a particular term does not vary from document to document,
whereas the TF weight for a particular term may well be very different for different
documents (as we saw earlier).

The formula you need to use to calculate the IDF weight for a term t_j is this:

$IDF_{D,t_j} = \log_2(N_D / n_{D,t_j})$

where

N_D = the total number of documents in the collection D

and

n_{D,t_j} = the number of documents in the collection D that contain at least one
occurrence of the term t_j.
So, to calculate a value for IDF, you need firstly to divide N by n, then take the logarithm
to base 2 of the result.

The reason we take the logarithm of N/n, rather than just using N/n on its own, is so
that we don't get such high values of IDF whenever we're in a situation where N is very
large and n is relatively small (which is often the case). Some people use a formula for
IDF that takes a logarithm to base 10, rather than to base 2; other people use a formula
for IDF that adds 1 to each final value (this is to ensure that you don't end up with
values of 0 for terms that actually appear in every document in the collection, since
log2(1) = 0). It really doesn't make much difference which formula you use.

So, the IDF weights for each term in the collection can be calculated and expressed in
the form of a single inverse document-frequency vector as follows:

IDF_{D,t_j} = <log2(6/5), log2(6/5), log2(6/2), log2(6/3)>

            = <0.26, 0.26, 1.58, 1.00>
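A short Python sketch of the same IDF calculation (the document-frequency counts n_{D,t} are read off the collection above; variable names are illustrative):

```python
import math

N = 6                                                     # N_D: documents in the collection
doc_freq = {"care": 5, "cat": 5, "dog": 2, "persian": 3}  # n_{D,t}: documents containing each term

idf = [math.log2(N / doc_freq[t]) for t in ["care", "cat", "dog", "persian"]]
print([round(w, 2) for w in idf])                         # [0.26, 0.26, 1.58, 1.0]
```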

Combined W weights

The third step is to calculate combined W weights, i.e., TF.IDF weights for each term in
each document. The TF.IDF weight in document d_i of term t_j is given by W_{d_i,t_j}, where

$W_{d_i,t_j} = TF_{d_i,t_j} \times IDF_{D,t_j}$

What you need to do is this: for each term in each document, multiply the TF value for
that term by the corresponding IDF value for that term. You end up with a matrix of
values again, each row representing a document and each column representing a term,
but this time the values aren't TF weights, they're TF.IDF weights.

So, remember the TF vectors look like this:


TFdocA = <1, 1, 0, 1>
TFdocB = <3, 3, 0, 3>
TFdocC = <0, 9, 0, 0>
TFdocD = <1, 1, 6, 1>
TFdocE = <1, 1, 1, 0>
TFdocF = <1, 0, 0, 0>
And the IDF vector looks like this:

IDF_{D,t_j} = <0.26, 0.26, 1.58, 1.00>

So, the W (i.e., TF.IDF) vectors look like this:

WdocA = <1 x 0.26, 1 x 0.26, 0 x 1.58, 1 x 1.00>
      = <0.26, 0.26, 0.00, 1.00>
WdocB = <3 x 0.26, 3 x 0.26, 0 x 1.58, 3 x 1.00>
      = <0.79, 0.79, 0.00, 3.00>
WdocC = <0 x 0.26, 9 x 0.26, 0 x 1.58, 0 x 1.00>
      = <0.00, 2.37, 0.00, 0.00>
WdocD = <1 x 0.26, 1 x 0.26, 6 x 1.58, 1 x 1.00>
      = <0.26, 0.26, 9.51, 1.00>
WdocE = <1 x 0.26, 1 x 0.26, 1 x 1.58, 0 x 1.00>
      = <0.26, 0.26, 1.58, 0.00>
WdocF = <1 x 0.26, 0 x 0.26, 0 x 1.58, 0 x 1.00>
      = <0.26, 0.00, 0.00, 0.00>
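The same W weights can be reproduced in a couple of lines (a sketch; the TF vectors and the rounded IDF vector are copied from above, so the third value for Doc D comes out as 9.48 rather than the 9.51 obtained with the unrounded IDF):

```python
# Combined W (TF.IDF) weights: element-wise product of each TF vector with the IDF vector.
tf = {
    "A": [1, 1, 0, 1], "B": [3, 3, 0, 3], "C": [0, 9, 0, 0],
    "D": [1, 1, 6, 1], "E": [1, 1, 1, 0], "F": [1, 0, 0, 0],
}
idf = [0.26, 0.26, 1.58, 1.00]

w = {doc: [round(t * i, 2) for t, i in zip(vec, idf)] for doc, vec in tf.items()}
for doc, vec in w.items():
    print(f"Wdoc{doc} = {vec}")
# WdocA = [0.26, 0.26, 0.0, 1.0], ..., WdocF = [0.26, 0.0, 0.0, 0.0]
```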

THE QUERIES

OK. Now, here are the queries again – 4 queries, Query 1 through Query 4, with the
terms making up those queries as follows:
Query 1 cat
Query 2 care, cat
Query 3 care, cat, persian
Query 4 care, cat(2)

In Query 4, the number “2” is there to indicate that you have somehow specified that the
term “cat” is twice as significant in this query as the term “care”. In other words, you
have assigned a weight to this term: not based on term frequencies (as the document
TF weights are), but rather based on your own perception of the relative significance of
terms in this representation of your information need.

We can represent these queries by vectors, in just the same way that we represented
documents by vectors earlier:

Query 1 = <0, 1, 0, 0>


Query 2 = <1, 1, 0, 0>
Query 3 = <1, 1, 0, 1>
Query 4 = <1, 2, 0, 0>

Note that the first three of these vectors are binary vectors, in that they are made up
purely of values of 0 or 1, respectively representing the absence or presence of a term
in the query. The fourth vector is a non-binary vector, which contains values
representing the weights assigned to each term in the query. But in general, either type
of query vector can be viewed as a vector made up of term weights, like the ones we
defined for documents earlier.

Wquery1 = <0, 1, 0, 0>

Wquery2 = <1, 1, 0, 0>

Wquery3 = <1, 1, 0, 1>

Wquery4 = <1, 2, 0, 0>
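For later use, these query W vectors can be written down directly in the same fixed term order (a small sketch; the dictionary name is illustrative):

```python
# Query W vectors in the fixed term order [care, cat, dog, persian].
# Query 4 carries the user-assigned weight of 2 on "cat".
query_w = {
    "query1": [0, 1, 0, 0],
    "query2": [1, 1, 0, 0],
    "query3": [1, 1, 0, 1],
    "query4": [1, 2, 0, 0],
}
```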

The cosine coefficient


Right. Now suppose you want to determine which documents are most similar to
which queries. In other words, you want to rank the documents in order of their
similarity to each query. (The basic assumption is that if one document-representation
is more similar to a given query-representation than another document, then the first
document is likely to be more relevant to your information need.) To do this, you need
to use a similarity coefficient – a formula that allows you to calculate numerical values
indicating how similar two things (two documents, for instance) are. Many of these
formulae (some association coefficients, for instance) produce values on a scale of 0 to 1,
where 0 represents complete dissimilarity and 1 represents complete similarity; other
formulae (some distance metrics, for instance) produce values on the same scale of 0 to 1,
but this time 0 represents complete similarity and 1 represents complete dissimilarity.
One of the most commonly used formulae in the first category is the cosine coefficient,
which looks like this:

$COS_{q_k,d_i} = \frac{\sum_{j=1}^{M_T} (W_{q_k,t_j} \times W_{d_i,t_j})}{\sqrt{\sum_{j=1}^{M_T} (W_{q_k,t_j})^2 \times \sum_{j=1}^{M_T} (W_{d_i,t_j})^2}}$

The way you use the cosine coefficient is like this.


1. You identify the query and the document you want to compare. (Remember that you
have to repeat the process, and calculate a new value of COS, for every
query/document pair; in a typical retrieval situation, you would want to calculate a
COS value indicating the similarity between any given query and every document in
the database.) Suppose, for example, you want to compare Query 1 and Doc A.
2. You take the W vector for the query and the W vector (i.e., the TF.IDF vector) for the
document.
The W vector for Query 1 looks like this:

W query1 = <0, 1, 0, 0>

And the W vector for Doc A looks like this:

WdocA = <0.26, 0.26, 0.00, 1.00>

3. You multiply together the corresponding W values for each term. In the example, you
end up with a vector that looks like this:

Wquery1 x WdocA = <0 x 0.26, 1 x 0.26, 0 x 0.00, 0 x 1.00>
                = <0.00, 0.26, 0.00, 0.00>

4. Then you add together (i.e., sum) the values in that vector. The result you get is the
value of the top half of the cosine formula: this is the so-called inner product or dot
product of the two original vectors. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j} \times W_{docA,t_j}) = 0.00 + 0.26 + 0.00 + 0.00 = 0.26$
5. Moving now to the bottom half of the formula, you first need to calculate the squares
of the W values in the query vector. In the example, you get a vector that looks like
this:

(Wquery1)² = <0 x 0, 1 x 1, 0 x 0, 0 x 0>
           = <0.00, 1.00, 0.00, 0.00>

6. Then you need to sum the values in that vector. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j})^2 = 0.00 + 1.00 + 0.00 + 0.00 = 1.00$

7. Similarly, you need to calculate the squares of the W values in the document vector.
In the example, you get a vector that looks like this:

(WdocA)² = <0.26 x 0.26, 0.26 x 0.26, 0.00 x 0.00, 1.00 x 1.00>
         = <0.07, 0.07, 0.00, 1.00>

8. Then you need to sum the values in that vector. In the example,

$\sum_{j=1}^{M_T} (W_{docA,t_j})^2 = 0.07 + 0.07 + 0.00 + 1.00 = 1.14$

9. Next, you need to multiply the results of steps (6) and (8) together. In the example,

$\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2 = 1.00 \times 1.14 = 1.14$

10. Now take the square root of the result of step (9). This is the result of the bottom half
of the formula. In the example,

$\sqrt{\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2} = \sqrt{1.14} = 1.07$

11. Finally, you need to divide the result of step (4) by the result of step (10). This is the
value of the cosine coefficient! In the example,

$COS_{query1,docA} = \frac{\sum_{j=1}^{M_T} (W_{query1,t_j} \times W_{docA,t_j})}{\sqrt{\sum_{j=1}^{M_T} (W_{query1,t_j})^2 \times \sum_{j=1}^{M_T} (W_{docA,t_j})^2}} = 0.26 / 1.07 = 0.25$

So, after all that, we can say that the degree of similarity between Query 1 and Document A is
0.25, on a scale of 0 to 1, where 1 represents complete similarity.
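Steps (1) to (11) can be wrapped into a single function. A minimal Python sketch (the W vectors are copied from above; because these inputs are already rounded to two decimals, the result prints as 0.24 rather than the hand-worked 0.25):

```python
import math

def cosine(query_w, doc_w):
    """Cosine coefficient between a query W vector and a document W vector."""
    dot = sum(q * d for q, d in zip(query_w, doc_w))   # top half: inner product (steps 3-4)
    norm = math.sqrt(sum(q * q for q in query_w) *     # bottom half (steps 5-10)
                     sum(d * d for d in doc_w))
    return dot / norm if norm else 0.0

w_query1 = [0, 1, 0, 0]
w_docA   = [0.26, 0.26, 0.00, 1.00]
print(round(cosine(w_query1, w_docA), 2))              # 0.24 with these rounded inputs
```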

Refinements

We could have used different kinds of document-term weights, W_{d_i,t_j}, in the formula. For
example: instead of using TF.IDF weights, we could have used just TF weights.

We could have used different kinds of query-term weights, W_{q_k,t_j}, in the formula. For
example: instead of using just user-defined weights (which are a bit like TF weights), we could
have multiplied these by IDF weights to give TF.IDF values, and used those.

We could have used a different similarity coefficient entirely. There are many others. Other
association coefficients, for example, are based on the same inner product formula that the
cosine coefficient is based on, but use different methods of normalization (i.e., different ways of
countering the effect of document length); we could, for instance, have used the Dice
coefficient, which looks like this:

$DICE_{q_k,d_i} = \frac{2 \sum_{j=1}^{M_T} (W_{q_k,t_j} \times W_{d_i,t_j})}{\sum_{j=1}^{M_T} (W_{q_k,t_j})^2 + \sum_{j=1}^{M_T} (W_{d_i,t_j})^2}$

Homework 2: Rank the above documents with respect to Query 1 using the vector
space model.

