[go: up one dir, main page]

0% found this document useful (0 votes)
28 views75 pages

IR Unit-3

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 75

WELCOME

Department Optional Course -4


Subject: Information Retrieval.
Course Code: CSDC7023

Mr. S. G. Shaikh
Asst. Professor
Dept. Of Computer Engg,
AIKTC, New Panvel
salim.shaikh@aiktc.ac.in
Cell. +91 9960726716

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Subject: Information Retrieval CSDC7023
Prerequisite: Data structures and algorithms
Course Objectives:
The course aims students :
1 To learn the fundamentals of Information Retrieval
2 To analyze various Information retrieval modeling techniques
3 To understand query processing and its applications
4 To explore the various indexing and scoring techniques
5 To assess the various evaluation methods
6 To analyze various information retrieval for real world application.

Course Outcomes:
Learner will be able to: -
1 Describe and Analyze the concepts, challlenges of the Information retrieval system.
2 Design the various modeling techniques for information retrieval systems.
3 Implements the query structure and various query operations
4 Analyzing the indexing and scoring operation in information retrieval systems
5 Perform the evaluation of information retrieval systems
6 Analyze various information retrieval for real world application

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Unit No-03: Query and Operations in Information Retrieval

Syllabus:-
• Query structures, Keyboard based querying, Pattern matching, Structured
queries, User relevance feedback, Automatic local analysis, Automatic global
analysis

•Reference Books Used


• T1:Modern information retrieval, Baeza-Yates, R. and Ribeiro-Neto, B., 1999. ACM press
• T1:Introduction to Modern Information Retrieval. G.G. Chowdhury. NealSchuman
• R1: Storage Network Management and Retrieval, VaishaliKhairnar

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Introduction to Information Retrieval

Image Source: Google Images

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Types of Queries in IR

• During the process of indexing, many keywords are associated with document set
which contains words, phrases, date created, author names, and type of
document.

• They are used by an IR system to build an inverted index which is then consulted
during the search. The queries formulated by users are compared to the set of
index keywords.

• Most IR systems also allow the use of Boolean and other operators to build a
complex query.

• The query language with these operators enriches the expressiveness of a user’s
information need.

• The IR system finds the relevant documents from a large data set according to the
user query. Queries submitted by users to search engines might be ambiguous,
concise and their meaning may change over time.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Types of Queries in IR

• Some of the types of Queries in IR systems are

1. Keyword Queries

2. Boolean Queries

3. Phrase Queries

4. Proximity Queries

5. Wildcard Queries

6. Natural Language Queries

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
1. Keyword Queries

 Keyword-based queries are the simplest and most commonly used forms of IR
queries: the user just enters keyword combinations to retrieve documents.

 The query keyword terms are implicitly connected by a logical AND operator.

 A query such as ‘database concepts’ retrieves documents that contain both the
words ‘data-base’ and ‘concepts’ at the top of the retrieved results. In addition,
most systems also retrieve documents that contain only ‘database’ or only
‘concepts’ in their text.

 Some systems remove most commonly occurring words (such as a, the, of, and so
on, called stopwords) as a preprocessing step before sending the filtered query
key-words to the IR engine. Most IR systems do not pay attention to the ordering
of these words in the query. All retrieval models provide support for keyword
queries.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
2. Boolean Queries

 Some IR systems allow using the AND, OR, NOT, ( ), + , and – Boolean operators in combinations of
keyword formulations. AND requires that both terms be found.

 OR lets either term be found. NOT means any record containing the second term will be excluded.
‘( )’ means the Boolean operators can be nested using parentheses. ‘+’ is equivalent to AND,
requiring the term; the ‘+’ should be placed directly in front of the search term.

 ‘–’ is equivalent to AND NOT and means to exclude the term; the ‘–’ should be placed directly in
front of the search term not wanted.

 Complex Boolean queries can be built out of these operators and their combinations, and they are
evaluated according to the classical rules of Boolean algebra.

 No ranking is possible, because a document either satisfies such a query (is “relevant”) or does not
satisfy it (is “nonrelevant”). A document is retrieved for a Boolean query if the query is logically
true as an exact match in the document.

 Users generally do not use combinations of these complex Boolean operators, and IR systems
support a restricted version of these set operators. Boolean retrieval models can directly sup-port
different Boolean operator implementations for these kinds of queries.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
3. Phrase Queries
 When documents are represented using an inverted keyword index for searching, the
relative order of the terms in the document is lost.

 In order to perform exact phrase retrieval, these phrases should be encoded in the
inverted index or implemented differently (with relative positions of word occurrences
in documents).

 A phrase query consists of a sequence of words that makes up a phrase. The phrase is
generally enclosed within double quotes. Each retrieved document must contain at
least one instance of the exact phrase.

 Phrase searching is a more restricted and specific version of proximity searching that
we mention below. For example, a phrase searching query could be ‘conceptual
database design’.

 If phrases are indexed by the retrieval model, any retrieval model can be used for these
query types. A phrase thesaurus may also be used in semantic models for fast
dictionary searching for phrases.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
4. Proximity Queries

 Proximity search refers to a search that accounts for how close within a record
multiple terms should be to each other. The most commonly used proximity search
option is a phrase search that requires terms to be in the exact order. Other proximity
operators can specify how close terms should be to each other. Some will also specify
the order of the search terms. Each search engine can define proximity operators
differently, and the search engines use various operator names such as NEAR,
ADJ(adjacent), or AFTER. In some cases, a sequence of single words is given, together
with a maximum allowed distance between them. Vector space models that also
maintain information about positions and offsets of tokens (words) have robust
implementations for this query type. However, providing support for complex proximity
operators becomes computationally expensive because it requires the time-consuming
preprocessing of documents, and is thus suitable for smaller document collections
rather than for the Web.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
5. Wildcard Queries

 Wildcard searching is generally meant to support regular expressions and pattern


matching-based searching in text.

 In IR systems, certain kinds of wildcard search support may be implemented—usually


words with any trailing characters (for example, ‘data*’ would retrieve data, database,
data point, dataset, and so on).

 Providing support for wildcard searches in IR systems involves preprocessing over-head


and is not considered worth the cost by many Web search engines today.

 Retrieval models do not directly provide support for this query type.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
6. Natural Language Queries
There are a few natural language search engines that aim to understand the structure
and meaning of queries written in natural language text, generally as a question or
narrative. This is an active area of research that employs techniques like shallow semantic
parsing of text, or query reformulations based on natural language under-standing. The
system tries to formulate answers for such queries from retrieved results. Some search
systems are starting to provide natural language interfaces to provide answers to specific
types of questions, such as definition and factoid questions, which ask for definitions of
technical terms or common facts that can be retrieved from specialized databases. Such
questions are usually easier to answer because there are strong linguistic patterns giving
clues to specific types of sentences—for example, ‘defined as’ or ‘refers to’. Semantic
models can provide support for this query type.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pattern matching in Python with Regex
You may be familiar with searching for text by pressing ctrl-F and
typing in the words you’re looking for. Regular expressions go one
step further: They allow you to specify a pattern of text to search
for.
Regular expressions, called regexes for short, are descriptions for a
pattern of text. For example, a \d in a regex stands for a digit
character — that is, any single numeral 0 to 9.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pattern matching in Python with Regex
following regex is used in Python to match a string of three numbers, a hyphen, three more
numbers, another hyphen, and four numbers.

Any other string would not match the pattern.

\d\d\d-\d\d\d-\d\d\d\d

Regular expressions can be much more sophisticated. For example, adding a 3 in curly
brackets ({3}) after a pattern is like saying, “ Match this pattern three times.” So the slightly
shorter regex

\d{3}-\d{3}-\d{4}

(It matches the correct phone number format.)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
RegEx Functions

Function Description
Returns a list containing all
findall
matches
Returns a Match object if
search there is a match anywhere
in the string
Returns a list where the
split string has been split at each
match
Replaces one or many
sub
matches with a string

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Metacharacters
Character Description Example
[] A set of characters "[a-m]"
\ Signals a special sequence (can also be used to "\d"
escape special characters)

. Any character (except newline character) "he..o"

^ Starts with "^hello"


$ Ends with "planet$"
* Zero or more occurrences "he.*o"
+ One or more occurrences "he.+o"
? Zero or one occurrences "he.?o"
{} Exactly the specified number of occurrences "he.{2}o"

| Either or "falls|stays"
() Capture and group

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Special Sequences
Character Description Example
\A Returns a match if the specified characters are at the beginning of the "\AThe"
string
\b Returns a match where the specified characters are at the beginning or at r"\bain"
the end of a word r"ain\b"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\B Returns a match where the specified characters are present, but NOT at the r"\Bain"
beginning (or at the end) of a word r"ain\B"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\d Returns a match where the string contains digits (numbers from 0-9) "\d"

\D Returns a match where the string DOES NOT contain digits "\D"

\s Returns a match where the string contains a white space character "\s"

\S Returns a match where the string DOES NOT contain a white space "\S"
character
\w Returns a match where the string contains any word characters (characters "\w"
from a to Z, digits from 0-9, and the underscore _ character)

\W Returns a match where the string DOES NOT contain any word characters "\W"

\Z Returns a match if the specified characters are at the end of the string "Spain\Z"

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Sets

Set Description
[arn] Returns a match where one of the specified characters (a, r, or n)
is present
[a-n] Returns a match for any lower case character, alphabetically
between a and n
[^arn] Returns a match for any character EXCEPT a, r, and n
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3)
are present
[0-9] Returns a match for any digit between 0 and 9
[0-5][0-9] Returns a match for any two-digit numbers from 00 and 59
[a-zA-Z] Returns a match for any character alphabetically between a and z,
lower case OR upper case
[+] In sets, +, *, ., |, (), $,{} has no special meaning, so [+] means:
return a match for any + character in the string

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Motivation User Relevance Feedback

The same word can have different meanings (polysemy)


Two different words can have the same meaning
(synonymy)
Vocabulary of searcher may not match that of the
documents Consider the query = {plane fuel}
While this is relatively unambiguous (wrt the meaning of
each word in context), exact matching will miss
documents containing aircraft, airplane, or jet
Relevance feedback and query expansion aim to
overcome the problem of synonymy

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Improving Recall

Local analysis: query-time analysis on a portion of


documents returned for a user query
Main local method: relevance feedback
Global analysis: perform a global analysis once
(e.g., of collection) to produce thesaurus
Use thesaurus for query expansion

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback and query expansion

□ Goal: To refine the answer set by involving the user in


the retrieval process (feedback/interaction)

□ Local Methods (adjust the user queries)


■ Relevance feedback
■ Pseudo (or Blind) Relevance Feedback
■ (Global) indirect Relevance Feedback
□ Global Methods (independent of the queries and results)
■ Query expansion/reformulation with a thesaurus
(WordNet)
■ Query expansion via automatic thesaurus generation
■ Other techniques (spelling correction,...)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback
□ Basic Procedure of RF
1. The user issues a simple query
2. The system returns an initial set of retrieval results
3. The user marks some of these documents as
relevant/irrelevant
4. The system computes a better representation of the
information need based on this feedback
5. The system displays a revised set of results

Repeat the procedure one or more times.

□ This process helps the user to focalize its own information


need as well.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback

Relevance Feedback: Example


□ Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Results for Initial Query

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback step

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Results after Relevance Feedback

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Algorithm
□ The Rocchio algorithm incorporates relevance feedback information
into the vector space model.
□ Want to maximize sim (Q, Cr) - sim (Q, Cnr)
□ The optimal query vector for separating relevant and non-relevant
documents (with cosine sim.):

r 1 r 1 r
r∑ j r∑ j
Qopt  d  d
Cr d j Cr N  Cr d j C r

□ Qopt = optimal query; Cr = set of rel. doc vectors; N = collection size

□ Unrealistic: we don’t know relevant documents.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
The Theoretically Best Query

x x
x x
o x x
x x x x
x x
o x
o
 o x x x
o o x
x
Optimal query
x non-relevant documents
o relevant documents

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio 1971 Algorithm (SMART)

r r r
qm  qr0  
1 1
r∑ j r∑ j
d   d
Dr d j Dr Dnr d j Dnr

□ qm = modified query vector; q0 = original query vector; α,β,γ: weights

(hand-chosen or set empirically); Dr = set of known relevant doc vectors; Dnr =


set of known irrelevant doc vectors
□ New query moves toward relevant documents and away from irrelevant
documents
□ Tradeoff  vs.  and  : If we have a lot of judged documents, we want a higher 
and .
□ Term weight can go negative
■ Negative term weights are ignored
■ Alternatively, weights can be normalized in [0,1]

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback on initial query

Initial
query x x
x
o x
 x x
x x
x x
o x
x o
x o x
o x
x
Revised x o
query x known non-relevant documents
x o known relevant documents

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback in vector spaces

□ We can modify the query based on relevance feedback and apply


standard vector space model.

□ Use only the docs that were marked.

□ Relevance feedback can improve recall and precision

□ Relevance feedback is most useful for increasing recall


in situations where recall is important

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Positive vs Negative Feedback

□ Usual choices for parameters ,  nad 


■  >>  , i.e. greater importance to the docs judged relevant than to the docs
judged irrelevant
■   0, as the docs marked irrelevant are typically near-positive. However,
many systems only allow positive feedback (=0).
■   0, in order to prevent overfitting, i.e. the excessive influence of
‘noisy’ characteristics of the docs marked (ir)relevant on the resulting
query
■ Reasonable values can be =1, =.75, =.15

□ The values of the parameters could be made dependent on the iteration, i.e.
increasing  and decreasing  and  (later queries already incorporate the
contribution of previous feedback iterations)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Probabilistic relevance feedback

□ Rather than reweighting in a vector space…

□ If user has told us some relevant and irrelevant documents, then we can
proceed to build a classifier, such as a Naive Bayes model:
■ P(tk|R) = |Drk| / |Dr|
■ P(tk|NR) = (Nk - |Drk|) / (N - |Dr|)
□ tk = term in document; Drk = known relevant doc containing tk; Nk = total
number of docs containing tk
□ More on later lectures on probabilistic classification

■ This is effectively another way of changing the (implicit) query term


weights

■ But note: the above proposal preserves no memory of the original weights

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback: Assumptions
□ A1: User has sufficient knowledge for initial query.

□ A2: Relevance prototypes are “well-behaved”.

■ Term distribution in relevant documents will be similar

■ Term distribution in non-relevant documents will be different from

those in relevant documents

□ Either: All relevant documents are tightly clustered around a single

prototype.

□ Or: There are different prototypes, but they have significant

vocabulary overlap.

□ Similarities between relevant and irrelevant documents are small

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Excite Relevance Feedback

Spink et al. 2000


□ Only about 4% of query sessions from a user used
relevance feedback option
■ Expressed as “More like this” link next to each result
□ But about 70% of users only looked at first page of
results and didn’t pursue things further
■ So 4% is about 1/8 of people extending search
□ Relevance feedback improved results about 2/3 of the time

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pseudo Relevance Feedback

□ Automatic local analysis


□ Pseudo relevance feedback attempts to automate the manual
part of relevance feedback.

□ Retrieve an initial set of relevant documents.


□ Assume that top m ranked documents are relevant.

□ Do relevance feedback

□ Mostly works (perhaps better than global analysis!)


■ Found to improve performance in TREC ad-hoc task
■ Danger of query drift

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Indirect relevance feedback

□ On the web, DirectHit introduced a form of indirect


relevance feedback.
□ DirectHit ranked documents higher that users look at
more often.
■ Clicked on links are assumed likely to be relevant
□ Assuming the displayed summaries are good, etc.

□ Globally: Not user or query specific.


□ This is the general area of clickstream mining

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback Summary

□ Relevance feedback has been shown to be very effective at improving


relevance of results.
■ Requires enough judged documents, otherwise it’s unstable (≥ 5
recommended)
■ Requires queries for which the set of relevant documents is
medium to large

□ Full relevance feedback is painful for the user.


□ Full relevance feedback is not very efficient in most IR systems.
□ Other types of interactive retrieval may improve relevance by
as much with less work.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Overview

1 Introduction

2 Relevance Feedback Rocchio Algorithm


Relevance-Based Language Models

3 Query Expansion

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
The Basics

The user issues a (short, simple) query.


The search engine returns a set of documents.
User marks some docs as relevant (possibly
some as non-relevant).
Search engine computes a new representation of
the information need.
Hope: better than the initial query.
Search engine runs new query and returns new results.
New results have (hopefully) better recall (and possibly
also better precision).
A limited form of RF is often expressed as “more like this” or
”find similar”.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Outline

1 Introduction

2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Basics

Developed in the late 60s or early 70s.


It was developed using the VSM as its basis.
Therefore, we represent documents as points
in a high-dimensional term space.
Uses centroids to calculate the center of a set of
documents.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram

Rocchio aims to find the optimal query ˙qoptthat


maximises:
˙qopt = arg max[sim(˙q,Cr ) − sim(˙q,C nr)] (1)
˙q

where sim(˙q, Cr ) is the similarity between a query q and the set


of relevant documents Cr .

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram

Rocchio aims to find the optimal query ˙qoptthat


maximises:
˙qopt = arg max[sim(˙q,Cr ) − sim(˙q,C nr)] (1)

where sim(˙q, Cr ) is the similarity between a query q and the set


of relevant documents Cr . Using cosine similarity the optimal
query becomes:
1 Σ ˙ 1 Σ ˙
q̇opt = dj − dj (2)
|Cr | |Cnr|
ḋj∈C r ḋj∈C nr

which is the difference between the centroids of the relevant


and non-relevant document vectors.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio in practice

However, we usually do not know the full relevant


and non-relevant sets.
For example, a user might only label a few documents
as relevant.
Therefore, in practice Rocchio is often parameterised as
follows:
1 Σ ˙ 1 Σ
˙qm= α˙q0+β dj − γ ḋj (3)
|Cr | |Cnr|
ḋj∈C r ḋj∈C nr

where α, β, and γ are weights that are attached to


each component.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Summary

Rocchio has been shown useful for increasing recall


Contains aspects of positive and negative feedback
Positive feedback is much more valuable (i.e.
indications of what is relevant and γ < β
Reasonable values of the parameters are α = 1.0, β =
0.75,
γ = 0.15

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Outline

1 Introduction

2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models I
The query-likelihood language model (earlier lecture) had no
concept of relevance (if you remember)
Relevance-Based language models take a probabilistic
language modelling approach to modelling relevance
The main assumption is that a document is generated from
either one of two classes (i.e. relevant or non-relevant)
Documents are then ranked according to their probability of
being drawn from the relevance class

P(D|R)P(R)
P(R|D) = (4)
P(D|R)P(R)+ P(D|NR)P(NR)
which is rank equivalent to ranking by log-odds
P(D|R)
= log (5)
P(D|NR)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models II

Lavrenko (2001) introduced the idea of relevance-


based language models
Outlined a number of different generative models
One of the best performing models is one called RM3
(useful for both relevance and pseudo-relevance
feedback)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models III

Given a set of known relevant documents R one can estimate


a relevance language model (e.g. multinomial θR)
In practice, this can be smoothed with the original query
model and a background model (not shown)
One could estimate the relevance model as:

(1 − π)θR +πθq (6)

where π controls how much of the original query one wishes


to retain.

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
When does RF work?

When users are willing to give feedback!


When the user knows the terms in the collection well
enough for an initial query.
When relevant documents contain similar terms (similar
to the cluster hypothesis)

The cluster hypothesis states that if there is a


document from a cluster that is relevant to a search
request, then it is likely that other documents from the
same cluster are also relevant. - Jardine and van
Rijsbergen

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback - Evaluation

How to evaluate if RF
works?

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback - Evaluation

How to evaluate if RF works?


Have two collections, with relevance judgements for the
same information needs (queries)
User studies: time taken to find # of relevant
documents (with and without feedback)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Other types of relevance feedback

Implicit relevance feedback


Pseudo relevance feedback - when does it
work?

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 29
Overview

1 Introduction

2 Relevance Feedback
Rocchio Algorithm
Relevance-Based Language Models

3 Query Expansion

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Query Expansion Motivation

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 292
Query Expansion Motivation

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 292
Query Expansion Introduction

Query expansion is another method for increasing recall


We use “global query expansion” to refer to “global
methods for query reformulation”
In global query expansion, the query is modified
based on some global resource, i.e. a resource that is
not
query-dependent
Often the problem aims to find (near-
)synonyms Distributional Semantics (word
embeddings)
What’s the different between “local” and “global”
methods?

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Query Expansion Methods

• Use of a controlled vocabulary that is maintained by human editors


(e.g. sets of keywords for publications - Medline)
• A manual thesaurus (e.g. wordnet)
• An automatically derived thesaurus
• Query reformulations based on query log mining (i.e. what the large
search engines do)

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Automatic thesaurus generation I

Let A be a term-document matrix


Where each cell Atd is a weighted count of term t
in document (or window) d
Row normalise the matrix (e.g. L2
normalisation) Then C = AAT is a term-term
similarity matrix
The similarity between any two terms u and v is in Cuv
Given any particular term q, the most similar terms can
be easily retrieved

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Automatic thesaurus generation II

Other approaches involve distributional semantics


Where words with similar meanings appear in similar
contexts Word embeddings - word2vec, glove, etc
Can be useful but global expansion still suffers from
problems of polysemy
A naive approach to word-level expansion might lead to
{apple computer} → {apple fruit computer}

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Summary

• QE is transparent in that it allows the user to see


(select) expansion terms
• Local approaches (PRF) to expanding queries tend to be
more effective
• E.g. {apple computer} → {apple computer jobs iphone
ipad macintosh}
• Local approaches tend to automatically disambiguate
the individual query terms. Why?
• Query log mining approaches have also been shown to
be useful

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Reading

Manning, Raghavan, Schu¨tze: Introduction to


Information Retrieval (MRS), chapter 9: Relevance
feedback and query expansion, chapter 16.1:
Clustering in information retrieval
Victor Lavrenko and W. Bruce Croft: Relevance-
Based Language Models

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
End of Unit-3

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel

You might also like