IR Unit-3
IR Unit-3
IR Unit-3
Mr. S. G. Shaikh
Asst. Professor
Dept. Of Computer Engg,
AIKTC, New Panvel
salim.shaikh@aiktc.ac.in
Cell. +91 9960726716
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Subject: Information Retrieval CSDC7023
Prerequisite: Data structures and algorithms
Course Objectives:
The course aims students :
1 To learn the fundamentals of Information Retrieval
2 To analyze various Information retrieval modeling techniques
3 To understand query processing and its applications
4 To explore the various indexing and scoring techniques
5 To assess the various evaluation methods
6 To analyze various information retrieval for real world application.
Course Outcomes:
Learner will be able to: -
1 Describe and Analyze the concepts, challlenges of the Information retrieval system.
2 Design the various modeling techniques for information retrieval systems.
3 Implements the query structure and various query operations
4 Analyzing the indexing and scoring operation in information retrieval systems
5 Perform the evaluation of information retrieval systems
6 Analyze various information retrieval for real world application
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Unit No-03: Query and Operations in Information Retrieval
Syllabus:-
• Query structures, Keyboard based querying, Pattern matching, Structured
queries, User relevance feedback, Automatic local analysis, Automatic global
analysis
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Introduction to Information Retrieval
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Types of Queries in IR
• During the process of indexing, many keywords are associated with document set
which contains words, phrases, date created, author names, and type of
document.
• They are used by an IR system to build an inverted index which is then consulted
during the search. The queries formulated by users are compared to the set of
index keywords.
• Most IR systems also allow the use of Boolean and other operators to build a
complex query.
• The query language with these operators enriches the expressiveness of a user’s
information need.
• The IR system finds the relevant documents from a large data set according to the
user query. Queries submitted by users to search engines might be ambiguous,
concise and their meaning may change over time.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Types of Queries in IR
1. Keyword Queries
2. Boolean Queries
3. Phrase Queries
4. Proximity Queries
5. Wildcard Queries
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
1. Keyword Queries
Keyword-based queries are the simplest and most commonly used forms of IR
queries: the user just enters keyword combinations to retrieve documents.
The query keyword terms are implicitly connected by a logical AND operator.
A query such as ‘database concepts’ retrieves documents that contain both the
words ‘data-base’ and ‘concepts’ at the top of the retrieved results. In addition,
most systems also retrieve documents that contain only ‘database’ or only
‘concepts’ in their text.
Some systems remove most commonly occurring words (such as a, the, of, and so
on, called stopwords) as a preprocessing step before sending the filtered query
key-words to the IR engine. Most IR systems do not pay attention to the ordering
of these words in the query. All retrieval models provide support for keyword
queries.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
2. Boolean Queries
Some IR systems allow using the AND, OR, NOT, ( ), + , and – Boolean operators in combinations of
keyword formulations. AND requires that both terms be found.
OR lets either term be found. NOT means any record containing the second term will be excluded.
‘( )’ means the Boolean operators can be nested using parentheses. ‘+’ is equivalent to AND,
requiring the term; the ‘+’ should be placed directly in front of the search term.
‘–’ is equivalent to AND NOT and means to exclude the term; the ‘–’ should be placed directly in
front of the search term not wanted.
Complex Boolean queries can be built out of these operators and their combinations, and they are
evaluated according to the classical rules of Boolean algebra.
No ranking is possible, because a document either satisfies such a query (is “relevant”) or does not
satisfy it (is “nonrelevant”). A document is retrieved for a Boolean query if the query is logically
true as an exact match in the document.
Users generally do not use combinations of these complex Boolean operators, and IR systems
support a restricted version of these set operators. Boolean retrieval models can directly sup-port
different Boolean operator implementations for these kinds of queries.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
3. Phrase Queries
When documents are represented using an inverted keyword index for searching, the
relative order of the terms in the document is lost.
In order to perform exact phrase retrieval, these phrases should be encoded in the
inverted index or implemented differently (with relative positions of word occurrences
in documents).
A phrase query consists of a sequence of words that makes up a phrase. The phrase is
generally enclosed within double quotes. Each retrieved document must contain at
least one instance of the exact phrase.
Phrase searching is a more restricted and specific version of proximity searching that
we mention below. For example, a phrase searching query could be ‘conceptual
database design’.
If phrases are indexed by the retrieval model, any retrieval model can be used for these
query types. A phrase thesaurus may also be used in semantic models for fast
dictionary searching for phrases.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
4. Proximity Queries
Proximity search refers to a search that accounts for how close within a record
multiple terms should be to each other. The most commonly used proximity search
option is a phrase search that requires terms to be in the exact order. Other proximity
operators can specify how close terms should be to each other. Some will also specify
the order of the search terms. Each search engine can define proximity operators
differently, and the search engines use various operator names such as NEAR,
ADJ(adjacent), or AFTER. In some cases, a sequence of single words is given, together
with a maximum allowed distance between them. Vector space models that also
maintain information about positions and offsets of tokens (words) have robust
implementations for this query type. However, providing support for complex proximity
operators becomes computationally expensive because it requires the time-consuming
preprocessing of documents, and is thus suitable for smaller document collections
rather than for the Web.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
5. Wildcard Queries
Retrieval models do not directly provide support for this query type.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
6. Natural Language Queries
There are a few natural language search engines that aim to understand the structure
and meaning of queries written in natural language text, generally as a question or
narrative. This is an active area of research that employs techniques like shallow semantic
parsing of text, or query reformulations based on natural language under-standing. The
system tries to formulate answers for such queries from retrieved results. Some search
systems are starting to provide natural language interfaces to provide answers to specific
types of questions, such as definition and factoid questions, which ask for definitions of
technical terms or common facts that can be retrieved from specialized databases. Such
questions are usually easier to answer because there are strong linguistic patterns giving
clues to specific types of sentences—for example, ‘defined as’ or ‘refers to’. Semantic
models can provide support for this query type.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pattern matching in Python with Regex
You may be familiar with searching for text by pressing ctrl-F and
typing in the words you’re looking for. Regular expressions go one
step further: They allow you to specify a pattern of text to search
for.
Regular expressions, called regexes for short, are descriptions for a
pattern of text. For example, a \d in a regex stands for a digit
character — that is, any single numeral 0 to 9.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pattern matching in Python with Regex
following regex is used in Python to match a string of three numbers, a hyphen, three more
numbers, another hyphen, and four numbers.
\d\d\d-\d\d\d-\d\d\d\d
Regular expressions can be much more sophisticated. For example, adding a 3 in curly
brackets ({3}) after a pattern is like saying, “ Match this pattern three times.” So the slightly
shorter regex
\d{3}-\d{3}-\d{4}
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
RegEx Functions
Function Description
Returns a list containing all
findall
matches
Returns a Match object if
search there is a match anywhere
in the string
Returns a list where the
split string has been split at each
match
Replaces one or many
sub
matches with a string
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Metacharacters
Character Description Example
[] A set of characters "[a-m]"
\ Signals a special sequence (can also be used to "\d"
escape special characters)
| Either or "falls|stays"
() Capture and group
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Special Sequences
Character Description Example
\A Returns a match if the specified characters are at the beginning of the "\AThe"
string
\b Returns a match where the specified characters are at the beginning or at r"\bain"
the end of a word r"ain\b"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\B Returns a match where the specified characters are present, but NOT at the r"\Bain"
beginning (or at the end) of a word r"ain\B"
(the "r" in the beginning is making sure that the string is being treated as a
"raw string")
\d Returns a match where the string contains digits (numbers from 0-9) "\d"
\D Returns a match where the string DOES NOT contain digits "\D"
\s Returns a match where the string contains a white space character "\s"
\S Returns a match where the string DOES NOT contain a white space "\S"
character
\w Returns a match where the string contains any word characters (characters "\w"
from a to Z, digits from 0-9, and the underscore _ character)
\W Returns a match where the string DOES NOT contain any word characters "\W"
\Z Returns a match if the specified characters are at the end of the string "Spain\Z"
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Sets
Set Description
[arn] Returns a match where one of the specified characters (a, r, or n)
is present
[a-n] Returns a match for any lower case character, alphabetically
between a and n
[^arn] Returns a match for any character EXCEPT a, r, and n
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3)
are present
[0-9] Returns a match for any digit between 0 and 9
[0-5][0-9] Returns a match for any two-digit numbers from 00 and 59
[a-zA-Z] Returns a match for any character alphabetically between a and z,
lower case OR upper case
[+] In sets, +, *, ., |, (), $,{} has no special meaning, so [+] means:
return a match for any + character in the string
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Motivation User Relevance Feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Improving Recall
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback and query expansion
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback
□ Basic Procedure of RF
1. The user issues a simple query
2. The system returns an initial set of retrieval results
3. The user marks some of these documents as
relevant/irrelevant
4. The system computes a better representation of the
information need based on this feedback
5. The system displays a revised set of results
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Results for Initial Query
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback step
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Results after Relevance Feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Algorithm
□ The Rocchio algorithm incorporates relevance feedback information
into the vector space model.
□ Want to maximize sim (Q, Cr) - sim (Q, Cnr)
□ The optimal query vector for separating relevant and non-relevant
documents (with cosine sim.):
r 1 r 1 r
r∑ j r∑ j
Qopt d d
Cr d j Cr N Cr d j C r
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
The Theoretically Best Query
x x
x x
o x x
x x x x
x x
o x
o
o x x x
o o x
x
Optimal query
x non-relevant documents
o relevant documents
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio 1971 Algorithm (SMART)
r r r
qm qr0
1 1
r∑ j r∑ j
d d
Dr d j Dr Dnr d j Dnr
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance feedback on initial query
Initial
query x x
x
o x
x x
x x
x x
o x
x o
x o x
o x
x
Revised x o
query x known non-relevant documents
x o known relevant documents
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback in vector spaces
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Positive vs Negative Feedback
□ The values of the parameters could be made dependent on the iteration, i.e.
increasing and decreasing and (later queries already incorporate the
contribution of previous feedback iterations)
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Probabilistic relevance feedback
□ If user has told us some relevant and irrelevant documents, then we can
proceed to build a classifier, such as a Naive Bayes model:
■ P(tk|R) = |Drk| / |Dr|
■ P(tk|NR) = (Nk - |Drk|) / (N - |Dr|)
□ tk = term in document; Drk = known relevant doc containing tk; Nk = total
number of docs containing tk
□ More on later lectures on probabilistic classification
■ But note: the above proposal preserves no memory of the original weights
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback: Assumptions
□ A1: User has sufficient knowledge for initial query.
prototype.
vocabulary overlap.
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Excite Relevance Feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Pseudo Relevance Feedback
□ Do relevance feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Indirect relevance feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback Summary
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Overview
1 Introduction
3 Query Expansion
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
The Basics
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Example
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Outline
1 Introduction
2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Basics
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Diagram
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio in practice
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Rocchio Summary
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Outline
1 Introduction
2 Relevance
Feedback Rocchio
Algorithm
Relevance-Based Language
Models
3 Query
Expansion
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models I
The query-likelihood language model (earlier lecture) had no
concept of relevance (if you remember)
Relevance-Based language models take a probabilistic
language modelling approach to modelling relevance
The main assumption is that a document is generated from
either one of two classes (i.e. relevant or non-relevant)
Documents are then ranked according to their probability of
being drawn from the relevance class
P(D|R)P(R)
P(R|D) = (4)
P(D|R)P(R)+ P(D|NR)P(NR)
which is rank equivalent to ranking by log-odds
P(D|R)
= log (5)
P(D|NR)
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models II
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance-Based Language Models III
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
When does RF work?
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback - Evaluation
How to evaluate if RF
works?
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Relevance Feedback - Evaluation
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Other types of relevance feedback
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 29
Overview
1 Introduction
2 Relevance Feedback
Rocchio Algorithm
Relevance-Based Language Models
3 Query Expansion
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Query Expansion Motivation
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 292
Query Expansion Motivation
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel 292
Query Expansion Introduction
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Query Expansion Methods
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Automatic thesaurus generation I
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Automatic thesaurus generation II
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Summary
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Reading
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
End of Unit-3
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel