IR Unit-3

WELCOME
Department Optional Course -4

Subject: Information Retrieval.
Course Code: CSDC7023
Mr. S. G. Shaikh
Asst. Professor
Dept. Of Computer Engg,
AIKTC, New Panvel
salim.shaikh@aiktc.ac.in
Cell. +91 9960726716
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Subject: Information Retrieval CSDC7023
Prerequisite: Data structures and algorithms
Course Objectives:
The course aims students :
1 To learn the fundamentals of Information Retrieval
2 To analyze various Information retrieval modeling techniques
3 To understand query processing and its applications
4 To explore the various indexing and scoring techniques
5 To assess the various evaluation methods
6 To analyze various information retrieval for real world application.
Course Outcomes:
Learner will be able to: -
1 Describe and Analyze the concepts, challlenges of the Information retrieval system.
2 Design the various modeling techniques for information retrieval systems.
3 Implements the query structure and various query operations
4 Analyzing the indexing and scoring operation in information retrieval systems
5 Perform the evaluation of information retrieval systems
6 Analyze various information retrieval for real world application
Unit No-03: Query and Operations in Information Retrieval
Syllabus:-
• Query structures, Keyboard based querying, Pattern matching, Structured
queries, User relevance feedback, Automatic local analysis, Automatic global
analysis
•Reference Books Used

• T1:Modern information retrieval, Baeza-Yates, R. and Ribeiro-Neto, B., 1999. ACM press
• T1:Introduction to Modern Information Retrieval. G.G. Chowdhury. NealSchuman
• R1: Storage Network Management and Retrieval, VaishaliKhairnar
Introduction to Information Retrieval
Image Source: Google Images
Types of Queries in IR
• During the process of indexing, many keywords are associated with document set
which contains words, phrases, date created, author names, and type of
document.
• They are used by an IR system to build an inverted index which is then consulted
during the search. The queries formulated by users are compared to the set of
index keywords.
• Most IR systems also allow the use of Boolean and other operators to build a
complex query.
• The query language with these operators enriches the expressiveness of a user’s
information need.
• The IR system finds the relevant documents from a large data set according to the
user query. Queries submitted by users to search engines might be ambiguous,
concise and their meaning may change over time.
Types of Queries in IR
• Some of the types of Queries in IR systems are
1. Keyword Queries
2. Boolean Queries
3. Phrase Queries
4. Proximity Queries
5. Wildcard Queries
6. Natural Language Queries
1. Keyword Queries
 Keyword-based queries are the simplest and most commonly used forms of IR
queries: the user just enters keyword combinations to retrieve documents.
 The query keyword terms are implicitly connected by a logical AND operator.
 A query such as ‘database concepts’ retrieves documents that contain both the
words ‘data-base’ and ‘concepts’ at the top of the retrieved results. In addition,
most systems also retrieve documents that contain only ‘database’ or only
‘concepts’ in their text.
 Some systems remove most commonly occurring words (such as a, the, of, and so
on, called stopwords) as a preprocessing step before sending the filtered query
key-words to the IR engine. Most IR systems do not pay attention to the ordering
of these words in the query. All retrieval models provide support for keyword
queries.
2. Boolean Queries
 Some IR systems allow using the AND, OR, NOT, ( ), + , and – Boolean operators in combinations of
keyword formulations. AND requires that both terms be found.
 OR lets either term be found. NOT means any record containing the second term will be excluded.
‘( )’ means the Boolean operators can be nested using parentheses. ‘+’ is equivalent to AND,
requiring the term; the ‘+’ should be placed directly in front of the search term.
 ‘–’ is equivalent to AND NOT and means to exclude the term; the ‘–’ should be placed directly in
front of the search term not wanted.
 Complex Boolean queries can be built out of these operators and their combinations, and they are
evaluated according to the classical rules of Boolean algebra.
 No ranking is possible, because a document either satisfies such a query (is “relevant”) or does not
satisfy it (is “nonrelevant”). A document is retrieved for a Boolean query if the query is logically
true as an exact match in the document.
 Users generally do not use combinations of these complex Boolean operators, and IR systems
support a restricted version of these set operators. Boolean retrieval models can directly sup-port
different Boolean operator implementations for these kinds of queries.
3. Phrase Queries
 When documents are represented using an inverted keyword index for searching, the
relative order of the terms in the document is lost.
 In order to perform exact phrase retrieval, these phrases should be encoded in the
inverted index or implemented differently (with relative positions of word occurrences
in documents).
 A phrase query consists of a sequence of words that makes up a phrase. The phrase is
generally enclosed within double quotes. Each retrieved document must contain at
least one instance of the exact phrase.
 Phrase searching is a more restricted and specific version of proximity searching that
we mention below. For example, a phrase searching query could be ‘conceptual
database design’.
 If phrases are indexed by the retrieval model, any retrieval model can be used for these
query types. A phrase thesaurus may also be used in semantic models for fast
dictionary searching for phrases.
4. Proximity Queries
 Proximity search refers to a search that accounts for how close within a record
multiple terms should be to each other. The most commonly used proximity search
option is a phrase search that requires terms to be in the exact order. Other proximity
operators can specify how close terms should be to each other. Some will also specify
the order of the search terms. Each search engine can define proximity operators
differently, and the search engines use various operator names such as NEAR,
ADJ(adjacent), or AFTER. In some cases, a sequence of single words is given, together
with a maximum allowed distance between them. Vector space models that also
maintain information about positions and offsets of tokens (words) have robust
implementations for this query type. However, providing support for complex proximity
operators becomes computationally expensive because it requires the time-consuming
preprocessing of documents, and is thus suitable for smaller document collections
rather than for the Web.
5. Wildcard Queries
 Wildcard searching is generally meant to support regular expressions and pattern

matching-based searching in text.
 In IR systems, certain kinds of wildcard search support may be implemented—usually

words with any trailing characters (for example, ‘data*’ would retrieve data, database,
data point, dataset, and so on).
 Providing support for wildcard searches in IR systems involves preprocessing over-head

and is not considered worth the cost by many Web search engines today.
 Retrieval models do not directly provide support for this query type.
6. Natural Language Queries
There are a few natural language search engines that aim to understand the structure
and meaning of queries written in natural language text, generally as a question or
narrative. This is an active area of research that employs techniques like shallow semantic
parsing of text, or query reformulations based on natural language under-standing. The
system tries to formulate answers for such queries from retrieved results. Some search
systems are starting to provide natural language interfaces to provide answers to specific
types of questions, such as definition and factoid questions, which ask for definitions of
technical terms or common facts that can be retrieved from specialized databases. Such
questions are usually easier to answer because there are strong linguistic patterns giving
clues to specific types of sentences—for example, ‘defined as’ or ‘refers to’. Semantic
models can provide support for this query type.
Pattern matching in Python with Regex
You may be familiar with searching for text by pressing ctrl-F and
typing in the words you’re looking for. Regular expressions go one
step further: They allow you to specify a pattern of text to search
for.
Regular expressions, called regexes for short, are descriptions for a
pattern of text. For example, a \d in a regex stands for a digit
character — that is, any single numeral 0 to 9.
Pattern matching in Python with Regex
following regex is used in Python to match a string of three numbers, a hyphen, three more
numbers, another hyphen, and four numbers.
Any other string would not match the pattern.
\d\d\d-\d\d\d-\d\d\d\d
Regular expressions can be much more sophisticated. For example, adding a 3 in curly
brackets ({3}) after a pattern is like saying, “ Match this pattern three times.” So the slightly
shorter regex
\d{3}-\d{3}-\d{4}
(It matches the correct phone number format.)
RegEx Functions
Function Description
Returns a list containing all
findall
matches
Returns a Match object if
search there is a match anywhere
in the string
Returns a list where the
split string has been split at each
match
Replaces one or many
sub
matches with a string
Metacharacters
Character Description Example
[] A set of characters "[a-m]"
\ Signals a special sequence (can also be used to "\d"
escape special characters)
. Any character (except newline character) "he..o"
^ Starts with "^hello"

$ Ends with "planet$"
* Zero or more occurrences "he.*o"
+ One or more occurrences "he.+o"
? Zero or one occurrences "he.?o"
{} Exactly the specified number of occurrences "he.{2}o"
| Either or "falls|stays"
() Capture and group
Special Sequences
Character Description Example
\A Returns a match if the specified characters are at the beginning of the "\AThe"
string
\b Returns a match where the specified characters are at the beginning or at r"\bain"
the end of a word r"ain\b"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\B Returns a match where the specified characters are present, but NOT at the r"\Bain"
beginning (or at the end) of a word r"ain\B"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\d Returns a match where the string contains digits (numbers from 0-9) "\d"
\D Returns a match where the string DOES NOT contain digits "\D"
\s Returns a match where the string contains a white space character "\s"
\S Returns a match where the string DOES NOT contain a white space "\S"
character
\w Returns a match where the string contains any word characters (characters "\w"
from a to Z, digits from 0-9, and the underscore _ character)
\W Returns a match where the string DOES NOT contain any word characters "\W"
\Z Returns a match if the specified characters are at the end of the string "Spain\Z"
Sets
Set Description
[arn] Returns a match where one of the specified characters (a, r, or n)
is present
[a-n] Returns a match for any lower case character, alphabetically
between a and n
[^arn] Returns a match for any character EXCEPT a, r, and n
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3)
are present
[0-9] Returns a match for any digit between 0 and 9
[0-5][0-9] Returns a match for any two-digit numbers from 00 and 59
[a-zA-Z] Returns a match for any character alphabetically between a and z,
lower case OR upper case
[+] In sets, +, *, ., |, (), $,{} has no special meaning, so [+] means:
return a match for any + character in the string
Motivation User Relevance Feedback
The same word can have different meanings (polysemy)

Two different words can have the same meaning
(synonymy)
Vocabulary of searcher may not match that of the
documents Consider the query = {plane fuel}
While this is relatively unambiguous (wrt the meaning of
each word in context), exact matching will miss
documents containing aircraft, airplane, or jet
Relevance feedback and query expansion aim to
overcome the problem of synonymy
Example
Improving Recall
Local analysis: query-time analysis on a portion of

documents returned for a user query
Main local method: relevance feedback
Global analysis: perform a global analysis once
(e.g., of collection) to produce thesaurus
Use thesaurus for query expansion
Relevance feedback and query expansion
□ Goal: To refine the answer set by involving the user in

the retrieval process (feedback/interaction)
□ Local Methods (adjust the user queries)

■ Relevance feedback
■ Pseudo (or Blind) Relevance Feedback
■ (Global) indirect Relevance Feedback
□ Global Methods (independent of the queries and results)
■ Query expansion/reformulation with a thesaurus
(WordNet)
■ Query expansion via automatic thesaurus generation
■ Other techniques (spelling correction,...)
Relevance feedback
□ Basic Procedure of RF
1. The user issues a simple query
2. The system returns an initial set of retrieval results
3. The user marks some of these documents as
relevant/irrelevant
4. The system computes a better representation of the
information need based on this feedback
5. The system displays a revised set of results
Repeat the procedure one or more times.
□ This process helps the user to focalize its own information

need as well.
Relevance feedback
Relevance Feedback: Example

□ Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback step
Results after Relevance Feedback
Rocchio Algorithm
□ The Rocchio algorithm incorporates relevance feedback information
into the vector space model.
□ Want to maximize sim (Q, Cr) - sim (Q, Cnr)
□ The optimal query vector for separating relevant and non-relevant
documents (with cosine sim.):
r 1 r 1 r
r∑ j r∑ j
Qopt  d  d
Cr d j Cr N  Cr d j C r
□ Qopt = optimal query; Cr = set of rel. doc vectors; N = collection size
□ Unrealistic: we don’t know relevant documents.
The Theoretically Best Query
x x
x x
o x x
x x x x
x x
o x
o
 o x x x
o o x
x
Optimal query
x non-relevant documents
o relevant documents
Rocchio 1971 Algorithm (SMART)
r r r
qm  qr0  
1 1
r∑ j r∑ j
d   d
Dr d j Dr Dnr d j Dnr
□ qm = modified query vector; q0 = original query vector; α,β,γ: weights
(hand-chosen or set empirically); Dr = set of known relevant doc vectors; Dnr =

set of known irrelevant doc vectors
□ New query moves toward relevant documents and away from irrelevant
documents
□ Tradeoff  vs.  and  : If we have a lot of judged documents, we want a higher 
and .
□ Term weight can go negative
■ Negative term weights are ignored
■ Alternatively, weights can be normalized in [0,1]
Relevance feedback on initial query
Initial
query x x
x
o x
 x x
x x
x x
o x
x o
x o x
o x
x
Revised x o
query x known non-relevant documents
x o known relevant documents
Relevance Feedback in vector spaces
□ We can modify the query based on relevance feedback and apply

standard vector space model.
□ Use only the docs that were marked.
□ Relevance feedback can improve recall and precision
□ Relevance feedback is most useful for increasing recall

in situations where recall is important
Positive vs Negative Feedback
□ Usual choices for parameters ,  nad 

■  >>  , i.e. greater importance to the docs judged relevant than to the docs
judged irrelevant
■   0, as the docs marked irrelevant are typically near-positive. However,
many systems only allow positive feedback (=0).
■   0, in order to prevent overfitting, i.e. the excessive influence of
‘noisy’ characteristics of the docs marked (ir)relevant on the resulting
query
■ Reasonable values can be =1, =.75, =.15
□ The values of the parameters could be made dependent on the iteration, i.e.
increasing  and decreasing  and  (later queries already incorporate the
contribution of previous feedback iterations)
Probabilistic relevance feedback
□ Rather than reweighting in a vector space…
□ If user has told us some relevant and irrelevant documents, then we can
proceed to build a classifier, such as a Naive Bayes model:
■ P(tk|R) = |Drk| / |Dr|
■ P(tk|NR) = (Nk - |Drk|) / (N - |Dr|)
□ tk = term in document; Drk = known relevant doc containing tk; Nk = total
number of docs containing tk
□ More on later lectures on probabilistic classification
■ This is effectively another way of changing the (implicit) query term

weights
■ But note: the above proposal preserves no memory of the original weights
Relevance Feedback: Assumptions
□ A1: User has sufficient knowledge for initial query.
□ A2: Relevance prototypes are “well-behaved”.
■ Term distribution in relevant documents will be similar
■ Term distribution in non-relevant documents will be different from
those in relevant documents
□ Either: All relevant documents are tightly clustered around a single
prototype.
□ Or: There are different prototypes, but they have significant
vocabulary overlap.
□ Similarities between relevant and irrelevant documents are small
Excite Relevance Feedback
Spink et al. 2000

□ Only about 4% of query sessions from a user used
relevance feedback option
■ Expressed as “More like this” link next to each result
□ But about 70% of users only looked at first page of
results and didn’t pursue things further
■ So 4% is about 1/8 of people extending search
□ Relevance feedback improved results about 2/3 of the time
Pseudo Relevance Feedback
□ Automatic local analysis

□ Pseudo relevance feedback attempts to automate the manual
part of relevance feedback.
□ Retrieve an initial set of relevant documents.

□ Assume that top m ranked documents are relevant.
□ Do relevance feedback
□ Mostly works (perhaps better than global analysis!)

■ Found to improve performance in TREC ad-hoc task
■ Danger of query drift
Indirect relevance feedback
□ On the web, DirectHit introduced a form of indirect

relevance feedback.
□ DirectHit ranked documents higher that users look at
more often.
■ Clicked on links are assumed likely to be relevant
□ Assuming the displayed summaries are good, etc.
□ Globally: Not user or query specific.

□ This is the general area of clickstream mining
Relevance Feedback Summary
□ Relevance feedback has been shown to be very effective at improving

relevance of results.
■ Requires enough judged documents, otherwise it’s unstable (≥ 5
recommended)
■ Requires queries for which the set of relevant documents is
medium to large
□ Full relevance feedback is painful for the user.

□ Full relevance feedback is not very efficient in most IR systems.
□ Other types of interactive retrieval may improve relevance by
as much with less work.
Overview
1 Introduction
2 Relevance Feedback Rocchio Algorithm

Relevance-Based Language Models
3 Query Expansion
The Basics
The user issues a (short, simple) query.

The search engine returns a set of documents.
User marks some docs as relevant (possibly
some as non-relevant).
Search engine computes a new representation of
the information need.
Hope: better than the initial query.
Search engine runs new query and returns new results.
New results have (hopefully) better recall (and possibly
also better precision).
A limited form of RF is often expressed as “more like this” or
”find similar”.
Example
Example
Outline
1 Introduction
2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion
Rocchio Basics
Developed in the late 60s or early 70s.

It was developed using the VSM as its basis.
Therefore, we represent documents as points
in a high-dimensional term space.
Uses centroids to calculate the center of a set of
documents.
Rocchio Diagram
Rocchio aims to find the optimal query ˙qoptthat

maximises:
˙qopt = arg max[sim(˙q,Cr ) − sim(˙q,C nr)] (1)
˙q
where sim(˙q, Cr ) is the similarity between a query q and the set

of relevant documents Cr .
Rocchio Diagram
Rocchio aims to find the optimal query ˙qoptthat

maximises:
˙qopt = arg max[sim(˙q,Cr ) − sim(˙q,C nr)] (1)
q̇
where sim(˙q, Cr ) is the similarity between a query q and the set

of relevant documents Cr . Using cosine similarity the optimal
query becomes:
1 Σ ˙ 1 Σ ˙
q̇opt = dj − dj (2)
|Cr | |Cnr|
ḋj∈C r ḋj∈C nr
which is the difference between the centroids of the relevant

and non-relevant document vectors.
Rocchio Diagram
Rocchio in practice
However, we usually do not know the full relevant

and non-relevant sets.
For example, a user might only label a few documents
as relevant.
Therefore, in practice Rocchio is often parameterised as
follows:
1 Σ ˙ 1 Σ
˙qm= α˙q0+β dj − γ ḋj (3)
|Cr | |Cnr|
ḋj∈C r ḋj∈C nr
where α, β, and γ are weights that are attached to

each component.
Rocchio Summary
Rocchio has been shown useful for increasing recall

Contains aspects of positive and negative feedback
Positive feedback is much more valuable (i.e.
indications of what is relevant and γ < β
Reasonable values of the parameters are α = 1.0, β =
0.75,
γ = 0.15
Outline
1 Introduction
2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion
Relevance-Based Language Models I
The query-likelihood language model (earlier lecture) had no
concept of relevance (if you remember)
Relevance-Based language models take a probabilistic
language modelling approach to modelling relevance
The main assumption is that a document is generated from
either one of two classes (i.e. relevant or non-relevant)
Documents are then ranked according to their probability of
being drawn from the relevance class
P(D|R)P(R)
P(R|D) = (4)
P(D|R)P(R)+ P(D|NR)P(NR)
which is rank equivalent to ranking by log-odds
P(D|R)
= log (5)
P(D|NR)
Relevance-Based Language Models II
Lavrenko (2001) introduced the idea of relevance-

based language models
Outlined a number of different generative models
One of the best performing models is one called RM3
(useful for both relevance and pseudo-relevance
feedback)
Relevance-Based Language Models III
Given a set of known relevant documents R one can estimate

a relevance language model (e.g. multinomial θR)
In practice, this can be smoothed with the original query
model and a background model (not shown)
One could estimate the relevance model as:
(1 − π)θR +πθq (6)
where π controls how much of the original query one wishes

to retain.
When does RF work?
When users are willing to give feedback!

When the user knows the terms in the collection well
enough for an initial query.
When relevant documents contain similar terms (similar
to the cluster hypothesis)
The cluster hypothesis states that if there is a

document from a cluster that is relevant to a search
request, then it is likely that other documents from the
same cluster are also relevant. - Jardine and van
Rijsbergen
Relevance Feedback - Evaluation
How to evaluate if RF
works?
Relevance Feedback - Evaluation
How to evaluate if RF works?

Have two collections, with relevance judgements for the
same information needs (queries)
User studies: time taken to find # of relevant
documents (with and without feedback)
Other types of relevance feedback
Implicit relevance feedback

Pseudo relevance feedback - when does it
work?
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 29
Overview
1 Introduction
2 Relevance Feedback
Rocchio Algorithm
Relevance-Based Language Models
3 Query Expansion
Query Expansion Motivation
Query Expansion Motivation
Query Expansion Introduction
Query expansion is another method for increasing recall

We use “global query expansion” to refer to “global
methods for query reformulation”
In global query expansion, the query is modified
based on some global resource, i.e. a resource that is
not
query-dependent
Often the problem aims to find (near-
)synonyms Distributional Semantics (word
embeddings)
What’s the different between “local” and “global”
methods?
Query Expansion Methods
• Use of a controlled vocabulary that is maintained by human editors

(e.g. sets of keywords for publications - Medline)
• A manual thesaurus (e.g. wordnet)
• An automatically derived thesaurus
• Query reformulations based on query log mining (i.e. what the large
search engines do)
Automatic thesaurus generation I
Let A be a term-document matrix

Where each cell Atd is a weighted count of term t
in document (or window) d
Row normalise the matrix (e.g. L2
normalisation) Then C = AAT is a term-term
similarity matrix
The similarity between any two terms u and v is in Cuv
Given any particular term q, the most similar terms can
be easily retrieved
Automatic thesaurus generation II
Other approaches involve distributional semantics

Where words with similar meanings appear in similar
contexts Word embeddings - word2vec, glove, etc
Can be useful but global expansion still suffers from
problems of polysemy
A naive approach to word-level expansion might lead to
{apple computer} → {apple fruit computer}
Summary
• QE is transparent in that it allows the user to see

(select) expansion terms
• Local approaches (PRF) to expanding queries tend to be
more effective
• E.g. {apple computer} → {apple computer jobs iphone
ipad macintosh}
• Local approaches tend to automatically disambiguate
the individual query terms. Why?
• Query log mining approaches have also been shown to
be useful
Reading
Manning, Raghavan, Schu¨tze: Introduction to

Information Retrieval (MRS), chapter 9: Relevance
feedback and query expansion, chapter 16.1:
Clustering in information retrieval
Victor Lavrenko and W. Bruce Croft: Relevance-
Based Language Models
End of Unit-3

IR Unit-3

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

IR Unit-3

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

IR Unit-3

Uploaded by

Copyright:

Available Formats

WELCOME

Department Optional Course -4

•Reference Books Used

Image Source: Google Images

• Some of the types of Queries in IR systems are

6. Natural Language Queries

 Wildcard searching is generally meant to support regular expressions and pattern

 In IR systems, certain kinds of wildcard search support may be implemented—usually

 Providing support for wildcard searches in IR systems involves preprocessing over-head

Any other string would not match the pattern.

(It matches the correct phone number format.)

. Any character (except newline character) "he..o"

^ Starts with "^hello"

The same word can have different meanings (polysemy)

Local analysis: query-time analysis on a portion of

□ Goal: To refine the answer set by involving the user in

□ Local Methods (adjust the user queries)

Repeat the procedure one or more times.

□ This process helps the user to focalize its own information

Relevance Feedback: Example

□ Qopt = optimal query; Cr = set of rel. doc vectors; N = collection size

□ Unrealistic: we don’t know relevant documents.

□ qm = modified query vector; q0 = original query vector; α,β,γ: weights

(hand-chosen or set empirically); Dr = set of known relevant doc vectors; Dnr =

□ We can modify the query based on relevance feedback and apply

□ Use only the docs that were marked.

□ Relevance feedback can improve recall and precision

□ Relevance feedback is most useful for increasing recall

□ Usual choices for parameters ,  nad 

□ Rather than reweighting in a vector space…

■ This is effectively another way of changing the (implicit) query term

□ A2: Relevance prototypes are “well-behaved”.

■ Term distribution in relevant documents will be similar

■ Term distribution in non-relevant documents will be different from

those in relevant documents

□ Either: All relevant documents are tightly clustered around a single

□ Or: There are different prototypes, but they have significant

□ Similarities between relevant and irrelevant documents are small

Spink et al. 2000

□ Automatic local analysis

□ Retrieve an initial set of relevant documents.

□ Mostly works (perhaps better than global analysis!)

□ On the web, DirectHit introduced a form of indirect

□ Globally: Not user or query specific.

□ Relevance feedback has been shown to be very effective at improving

□ Full relevance feedback is painful for the user.

2 Relevance Feedback Rocchio Algorithm

The user issues a (short, simple) query.

Developed in the late 60s or early 70s.

Rocchio aims to find the optimal query ˙qoptthat

where sim(˙q, Cr ) is the similarity between a query q and the set

Rocchio aims to find the optimal query ˙qoptthat

where sim(˙q, Cr ) is the similarity between a query q and the set

which is the difference between the centroids of the relevant

However, we usually do not know the full relevant

where α, β, and γ are weights that are attached to

Rocchio has been shown useful for increasing recall

Lavrenko (2001) introduced the idea of relevance-