Information Retrieval
Unit-III Notes
Indexing: Static and Dynamic inverted indices- Index construction and Index
compression- Searching- Sequential search and pattern matching- Query
operations- Query languages- Query Processing-Relevance feedback and Query
Expansion-Automatic local and global analysis.
Static and Dynamic Inverted Indices
An Inverted Index is a data structure used in information retrieval systems to
efficiently retrieve documents or web pages containing a specific term or set of
terms.
In an inverted index, the index is organized by terms (words), and each term
points to a list of documents or web pages that contain that term.
There are two primary types of inverted indices: static and dynamic.
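The core idea can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the sample documents and IDs are invented for the example, and tokenization is just whitespace splitting.

```python
from collections import defaultdict

# Toy document collection (document IDs are assumed for illustration).
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs are lazy",
}

# Build the inverted index: each term maps to a sorted list of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

print(postings["quick"])   # [1, 3]
print(postings["lazy"])    # [2, 3]
```

Querying a term is now a dictionary lookup, which is exactly why inverted indices make term-based retrieval fast.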
1. Static Inverted Index
A static inverted index is constructed once and is not modified after it has been
created. It represents a snapshot of the document collection at the time of index
creation. This means that if new documents are added or existing ones are
deleted, the index must be rebuilt to reflect these changes.
Features:
• Build time: The static inverted index is built in one pass through the
collection. Once built, it's used for querying.
• No updates: After its creation, it doesn’t change. If documents change or
new documents are added, the index must be rebuilt.
• Efficiency: Searching in a static inverted index is very fast since it’s already
pre-built and optimized for querying.
• Usage: Often used in environments where the document collection doesn’t
change frequently or when you can afford to periodically rebuild the index
(e.g., a collection that doesn't change in real time).
Example:
• A search engine that crawls and indexes a website periodically, and the
index is rebuilt every night.
2. Dynamic Inverted Index
• A dynamic inverted index allows for real-time updates. It can
accommodate changes in the document collection, such as additions,
deletions, and updates, without needing to rebuild the entire index from
scratch.
Features:
• Updates allowed: New documents can be added, and existing documents
can be deleted or modified. The index is updated incrementally to reflect
these changes.
• Complexity: It’s more complex to implement and manage because it must
efficiently handle updates while maintaining fast search performance.
• Efficient updates: To keep the index up-to-date, dynamic inverted indices
often use techniques like merging and partitioning to ensure that updates
don’t degrade search performance.
• Usage: Commonly used in systems where the document collection is
constantly changing, such as news websites, social media platforms, and
real-time search engines.
Example:
• A real-time search system that indexes new documents as they are
uploaded and deletes outdated ones automatically.
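The incremental add/delete behavior described above can be sketched as a small class. This is an illustrative toy (real systems add batching, merging, and persistence), and all names here are invented for the example.

```python
from collections import defaultdict

class DynamicIndex:
    """Minimal incremental inverted index (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc IDs
        self.docs = {}                     # doc ID -> stored tokens

    def add(self, doc_id, text):
        """Index a new document incrementally."""
        tokens = text.lower().split()
        self.docs[doc_id] = tokens
        for term in tokens:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document from every posting list it appears in."""
        for term in self.docs.pop(doc_id, []):
            self.postings[term].discard(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

idx = DynamicIndex()
idx.add(1, "breaking news today")
idx.add(2, "sports news update")
idx.delete(1)                   # outdated document removed in place
print(idx.search("news"))       # [2]
```

Note that neither add nor delete requires rebuilding the whole index, which is the defining property of the dynamic approach.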
Index Construction and Compression
In Information Retrieval (IR), an index is a data structure that allows efficient
retrieval of documents based on queries. The process of index construction
involves creating this data structure, while index compression aims to reduce the
storage space and improve efficiency.
1. Index Construction
Index construction involves building an inverted index, which is the core data
structure used in search engines.
Steps in Index Construction
1. Document Parsing
1. Convert documents into a sequence of tokens (words, numbers,
etc.).
2. Normalize the text: lowercasing, stemming (reducing a word to its
root form), stopword removal (dropping common words that carry
little meaning), etc.
2. Tokenization
1. Split text into terms and assign unique identifiers.
3. Building Posting Lists (a posting list stores the documents that
contain a specific term)
1. Create a dictionary of terms and associate each with a list of
document IDs where it appears.
2. Store additional information such as term frequency (TF), positions, etc.
4. Sorting and Merging
• If the index is built in multiple passes, intermediate structures are merged.
5. Storing the Index
The inverted index is stored on disk with efficient access methods.
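The steps above can be combined into a short construction sketch. The stopword list and documents are assumed for illustration, normalization is limited to lowercasing and stopword removal, and each posting carries a term frequency.

```python
import re
from collections import defaultdict, Counter

STOPWORDS = {"the", "is", "a", "of", "and"}   # tiny illustrative list

def tokenize(text):
    """Parse, lowercase, and drop stopwords (no stemming, for brevity)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]

docs = {1: "The cat sat on the mat", 2: "The cat is a cat"}

# Posting-list entries carry the term frequency (TF) per document.
index = defaultdict(dict)          # term -> {doc_id: tf}
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        index[term][doc_id] = tf

print(index["cat"])   # {1: 1, 2: 2}
```

For large collections the same logic is run in blocks that fit in memory, with the intermediate indexes merged afterwards (the sorting-and-merging step).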
Challenges in Index Construction
• Scalability: Large-scale document collections require efficient indexing
techniques.
• Memory Constraints: Full in-memory indexing may not be feasible for large
datasets.
• Dynamic Updates: Handling document insertions and deletions efficiently.
2. Index Compression
Index compression reduces storage space and speeds up retrieval by minimizing
disk I/O. Compression techniques are applied to both the dictionary (list of
unique terms) and postings lists (list of document occurrences).
Dictionary Compression
• Blocking: Store dictionary terms in fixed-size blocks to reduce lookups.
• Front Coding: Store common prefixes only once.
• Tries and Ternary Search Trees: Space-efficient data structures for storing the term dictionary.
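Front coding exploits the fact that a sorted dictionary contains long runs of shared prefixes. A minimal sketch (the term list is an assumed example) encodes each term as the length of the prefix it shares with its predecessor plus the remaining suffix:

```python
def front_code(sorted_terms):
    """Encode each term as (shared-prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1                      # count characters shared with prev
        encoded.append((k, term[k:]))
        prev = term
    return encoded

terms = ["automata", "automate", "automatic", "automation"]
print(front_code(terms))
# [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```

Only the first term is stored in full; each later entry stores a count and a short suffix, which is where the space saving comes from.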
Postings List Compression
1. Gap Encoding
1. Instead of storing absolute document IDs, it stores differences (gaps).
2. Example: Instead of [2, 5, 10], store [2, 3, 5].
3. This reduces redundancy and allows better compression.
2. Variable-Length Encoding
1. γ (Gamma) Encoding: Encodes each gap as a unary length prefix followed by a binary offset.
2. δ (Delta) Encoding: Gamma-encodes the length prefix itself, which is more efficient for large gaps.
3. Bit-Level Compression
1. Golomb Coding: Used for compressing sparse postings lists.
2. Elias Codes: Efficient encoding for integers in postings lists.
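Gap encoding and Elias gamma coding can be sketched briefly. This follows the common "unary length prefix + binary offset" presentation of the gamma code; the postings list matches the example above.

```python
def to_gaps(doc_ids):
    """Gap-encode a sorted postings list: keep the first ID, then deltas."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gamma(n):
    """Elias gamma code: unary length prefix, then binary offset (n >= 1)."""
    offset = bin(n)[3:]                  # binary of n without the leading 1
    return "1" * len(offset) + "0" + offset

print(to_gaps([2, 5, 10]))   # [2, 3, 5]
print(gamma(5))              # '11001': offset '01' preceded by unary '110'
```

Because gaps are usually much smaller than absolute document IDs, the variable-length gamma codes for them are short, which is what makes the combination effective.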
Advantages of Compression
• Reduces storage: Saves disk space.
• Faster retrieval: Less data to read from disk.
• Improves caching: Smaller indexes fit better in memory.
Sequential search and pattern matching
1. Sequential Search
Definition:
Sequential search (also called linear search) is a fundamental search technique
where each element of a dataset is examined one by one until the desired element
(or pattern) is found.
Characteristics:
• Works for both ordered and unordered data.
• Simple and easy to implement.
• Inefficient for large datasets due to its linear nature.
Algorithm:
1. Start from the first element.
2. Compare the current element with the search query (pattern or keyword).
3. If a match is found, return the position of the element.
4. If no match is found, continue checking until the end.
5. If the search reaches the end without finding a match, return "Not
Found."
Time Complexity:
• Best Case: O(1) (match found in the first position).
• Worst Case: O(n) (match found at the last position or not found).
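The algorithm above translates directly into a few lines of Python (the filename list is an assumed example):

```python
def sequential_search(items, target):
    """Scan items one by one; return the first matching index, or -1."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1   # reached the end without a match

filenames = ["report.pdf", "notes.txt", "index.html"]
print(sequential_search(filenames, "notes.txt"))    # 1
print(sequential_search(filenames, "missing.doc"))  # -1
```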
Applications in Information Retrieval:
• Searching for a specific keyword in an unstructured document.
• Scanning through a list of filenames for a match.
• Finding records in an unsorted database.
• Spell-checking systems, where words are checked sequentially.
2. Pattern Matching in Information Retrieval
Definition:
Pattern matching is the process of locating occurrences of a specific sequence of
characters (pattern) within a larger body of text (document, database, or
corpus).
Types of Pattern Matching:
1. Exact Pattern Matching – Finding an exact occurrence of the pattern.
2. Approximate Pattern Matching – Finding patterns even with minor
differences (e.g., typos or variations).
Algorithms for Pattern Matching:
1. Naïve String Matching Algorithm
• Method: Checks the pattern at every possible position in the text.
• Time Complexity: O(mn) (where m = pattern length, n = text length).
• Use Case: Small text searches.
2. Knuth-Morris-Pratt (KMP) Algorithm
• Method: Uses a precomputed prefix table to avoid redundant comparisons.
• Time Complexity: O(m+n).
• Use Case: Searching within large documents efficiently.
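A compact KMP sketch is shown below; the prefix (failure) table records, for each pattern position, the length of the longest proper prefix that is also a suffix, so the search never re-examines text characters.

```python
def kmp_search(text, pattern):
    """Return all start indices of pattern in text, in O(m + n) time."""
    # Build the prefix (failure) table for the pattern.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    # Scan the text, reusing the table to skip redundant comparisons.
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("ababcabcabababd", "ababd"))   # [10]
```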
3. Boyer-Moore Algorithm
• Method: Compares from right to left and skips characters using heuristics
(Bad Character Rule & Good Suffix Rule).
• Time Complexity: O(n/m) in the best case (where m is the pattern length); O(mn) in the worst case for the basic version.
• Use Case: Large-scale text searching (e.g., in databases).
4. Rabin-Karp Algorithm
• Method: Uses a hashing technique to check if the pattern exists.
• Time Complexity: O(n) (average case).
• Use Case: Searching multiple patterns in large text bodies.
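The rolling-hash idea behind Rabin-Karp can be sketched as follows; the base and modulus are arbitrary small values chosen for the example, and each hash hit is verified with a direct comparison to rule out collisions.

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Rolling-hash search; verifies candidates to rule out collisions."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the leading char
    p_hash = t_hash = 0
    for i in range(m):                       # hash pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                        # slide the window by one char
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))     # [0, 7]
```

Because the window hash is updated in O(1), the average-case cost is linear in the text length, and hashing all patterns once makes multi-pattern search cheap.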
Pattern matching is crucial in information retrieval (IR) for various applications:
• Search Engines (Google, Bing) – Locating relevant documents based on
query terms.
• Plagiarism Detection – Matching text against a large corpus.
• Spam Filtering – Identifying spam emails based on predefined patterns.
• DNA Sequence Matching – Bioinformatics applications.
• Log File Analysis – Finding error patterns in system logs.
• Natural Language Processing (NLP) – Identifying keywords, entities, and
phrases in text.
Conclusion
• Sequential search is simple but inefficient for large datasets.
• Pattern matching algorithms (like KMP, Boyer-Moore, and Rabin-Karp)
optimize text searching in information retrieval systems.
• These techniques are widely used in search engines, text mining, and data
analytics.
Query Operations
• Introduction
• In information retrieval, "query operations" refers to the actions
performed on a user's search query by an information retrieval system to
identify relevant documents from a collection.
• 1. Challenge in Query Formulation
• Users often struggle to create effective queries due to limited knowledge of
the dataset or retrieval system.
• Web search engines show that users frequently refine their queries
multiple times for better results.
• The initial query should be seen as a starting point, with later refinements
improving retrieval efficiency.
• 2. Query Reformulation Techniques
• Query reformulation involves two main strategies:
• Query Expansion – Adding new terms to the original query.
• Term Reweighting – Adjusting the importance of existing terms in the
query.
• 3. Three Categories of Query Expansion & Term Reweighting
• (a) Feedback-Based Approaches
• Uses information from the user's interaction with retrieved documents.
• Includes user relevance feedback for refining queries.
• Common in vector space models and probabilistic models.
• (b) Local Analysis (Local Context Analysis)
• Relies on terms and patterns found in the initially retrieved set (called the
local set of documents).
• Two techniques for local analysis are discussed.
• (c) Global Analysis
• Uses term relationships across the entire document collection, not just the
retrieved set.
• Two global analysis methods are covered.
1. Boolean Query Operations
Uses logical operators to combine search terms:
• AND → Retrieves documents containing all specified terms (e.g., data
AND science).
• OR → Retrieves documents containing at least one of the terms (e.g., AI
OR machine learning).
• NOT → Excludes documents containing a specific term (e.g., big data NOT
Hadoop).
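With posting lists stored as sets of document IDs, the three Boolean operators map directly onto set operations. The postings below are assumed for illustration:

```python
# Posting lists represented as sets of document IDs (assumed example data).
postings = {
    "data":    {1, 2, 4},
    "science": {2, 4, 5},
    "hadoop":  {4},
}

# data AND science -> documents containing both terms (intersection)
print(sorted(postings["data"] & postings["science"]))   # [2, 4]

# data OR science -> documents containing either term (union)
print(sorted(postings["data"] | postings["science"]))   # [1, 2, 4, 5]

# data NOT hadoop -> documents with "data" but not "hadoop" (difference)
print(sorted(postings["data"] - postings["hadoop"]))    # [1, 2]
```

Real systems intersect sorted posting lists with merge-style algorithms rather than hash sets, but the semantics are the same.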
2. Proximity Query Operations
Searches based on the closeness of words:
• Phrase Search → Finds exact phrases using quotation marks (e.g., "data
mining techniques").
• NEAR Operator → Finds words appearing close to each other (e.g., deep
NEAR learning).
• WITHIN Operator → Specifies the distance between words (e.g., data W/5
analytics means "data" must appear within five words of "analytics").
3. Wildcard and Truncation Operations
Used to match variations of words:
• Wildcard (*, ?) → Replaces characters in a word.
• comput* retrieves "computer," "computing," "computation," etc.
• wom?n retrieves "woman" and "women."
• Truncation (*) → Retrieves words with the same root (e.g., learn* for
"learn," "learning," "learner").
4. Fuzzy and Approximate Matching
• Fuzzy Search (~) → Finds words similar to the query (e.g., color~ retrieves
"colour").
• Soundex / Phonetic Matching → Retrieves words with similar
pronunciation (e.g., "Smith" and "Smyth").
5. Field-Based Search
Searches specific parts of documents:
• title:("machine learning") → Searches only in the title field.
• author:(John Smith) → Retrieves documents by a specific author.
6. Ranking and Relevance-Based Operations
TF-IDF (Term Frequency-Inverse Document Frequency) → Ranks documents
based on word importance.
BM25 (Best Matching 25) → An advanced ranking function improving on TF-IDF.
Vector Space Model & Cosine Similarity → Measures similarity between query
and documents.
Neural Retrieval (BERT, Transformers) → Uses deep learning models for
ranking results.
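TF-IDF weighting combined with cosine similarity can be sketched in a few lines. This toy uses raw term frequency and an unsmoothed IDF over three assumed documents; it will fail for query terms absent from the collection, which real systems handle with smoothing.

```python
import math
from collections import Counter

docs = ["machine learning for text", "deep learning models",
        "text retrieval systems"]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(tokens):
    """TF-IDF vector (term -> weight) with raw TF and idf = log(N/df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / sum(t in d for d in tokenized))
            for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

query_vec = tf_idf("text learning".split())
scores = sorted(((cosine(query_vec, tf_idf(d)), i)
                 for i, d in enumerate(tokenized)), reverse=True)
print(scores)   # the document containing both query terms ranks first
```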
7. Query Expansion and Refinement
• Synonym Expansion → Adds synonyms to improve search (car → vehicle,
automobile).
• Relevance Feedback → Adjusts results based on user feedback.
• Query Rewriting → Reformulates the query for better results.
8. Faceted and Filtered Search
Allows narrowing down results based on categories:
• Faceted Search → Uses structured filters like date, category, or price.
• Filtered Search → Excludes or includes specific attributes (year:2023,
category:technology).
Query Languages
Query languages are used in Information Retrieval (IR) to structure search
queries, retrieve relevant documents, and refine results. They range from simple
keyword-based languages to advanced structured query languages.
1. Types of Queries in Information Retrieval
Different retrieval models handle queries differently.
Full-text systems vs. keyword-based ranking systems (e.g., search engines) vs.
hypertext models.
The way a user formulates a query depends on the retrieval model in use.
2. Query Languages vs. Data Retrieval Languages
Query languages help users retrieve relevant information.
Data retrieval languages are designed for structured data retrieval (e.g., SQL).
Keyword-based retrieval is the main approach in basic retrieval models.
3. Query Languages for Data Retrieval
Some languages focus on data retrieval, not necessarily information retrieval.
These languages cannot easily adapt to ranking-based retrieval.
Many query languages are designed for higher-level systems (e.g., online
databases, CD-ROM archives).
4. Protocols vs. Query Languages
In some systems, protocols define how queries are structured instead of formal
languages.
The choice of query language depends on the user experience:
If a user knows exactly what they need → retrieval is easier.
If not → ranking and refinement techniques become essential.
5. Challenges in Query Languages
Query languages focus on text structure rather than semantic meaning.
This limitation affects how well the system finds relevant documents.
Advanced techniques are needed to enhance retrieval efficiency.
6. Improving Query Effectiveness
Query Expansion → Adding related terms.
Synonyms & Thesaurus Use → Finding conceptually similar words.
Stemming → Reducing words to their root forms.
7. Stopwords and Keyword Processing
Some words appear frequently but do not carry significant meaning (e.g., "the,"
"is," "and").
These words are called stopwords and are usually removed during query
processing.
Stopword removal is assumed to have already been performed during
preprocessing.
While stopwords are often removed in information retrieval, they may still be
relevant in data retrieval.
"Keywords" refer to words that can be retrieved by queries, excluding
stopwords.
8. Retrieval Unit in Information Systems
Retrieval unit = The fundamental element retrieved as a result of a query.
Can include:
Files
Documents
Web pages
Paragraphs
Other structural units
Retrieved units may be ranked by relevance or another criterion.
For simplicity, retrieval units are referred to as "documents."
9. Chapter Organization
Keyword-Based Query Languages
Covers simple keywords, phrases, and Boolean operators.
Manipulates retrieved documents for relevance.
Pattern Matching in Queries
Includes complex search techniques beyond simple keywords.
Enhances retrieval capabilities.
Querying Based on Text Structure
Focuses on how documents are structured.
More relevant to protocol-based retrieval.
Standard Retrieval Protocols
Discussion on protocols used in:
Internet-based retrieval systems
CD-ROM publishers
Query processing
Query processing is a core component of any information retrieval (IR) system.
It refers to the series of steps that transform a user’s natural language query into
a structured form that can be effectively matched against an index of documents.
This process not only enhances retrieval accuracy but also optimizes system
performance by reducing search time and computational cost.
Challenges in Query Processing
Despite the many advances, several challenges remain:
• Ambiguity and Vagueness: Natural language queries are inherently
ambiguous. Disambiguating user intent requires sophisticated natural
language processing (NLP) techniques.
• Vocabulary Mismatch: Users and documents often use different terms to
describe the same concept. Query expansion and semantic search techniques
attempt to bridge this gap.
• Scalability: As data volumes grow, efficiently processing and matching queries
in real time is a significant challenge.
• Latency: Balancing the thoroughness of query processing (e.g., semantic
analysis, intent recognition) with the need for rapid responses remains a critical
area of research.
• Query processing serves as the interface between the user’s information need
and the underlying IR system.
• The goal is to accurately interpret the query, transform it into a searchable
format, and then use retrieval models to find and rank the most relevant
documents.
Key Steps in Query Processing
2.1. Query Preprocessing
Before a query can be matched against an index, it must be preprocessed. Common
preprocessing steps include:
• Tokenization: Breaking the input query into individual tokens (words or
terms).
• Normalization: Converting text to a standard format (e.g., lowercasing,
removing punctuation).
• Stop-Word Removal: Eliminating common words (such as “the,” “is,” etc.)
that do not contribute significant meaning.
• Stemming and Lemmatization: Reducing words to their base or root form.
• Part-of-Speech Tagging: Identifying the grammatical roles of words, which
can assist in understanding query intent.
• These steps ensure that the query is stripped of irrelevant information and is
standardized for effective matching against the document index.
• Query Expansion
• Due to vocabulary mismatches between the query and document language,
query expansion is often used. This involves adding related terms or synonyms
to the original query to capture the user's intent. Techniques may include:
• Thesaurus-based Expansion: Using external resources (e.g., WordNet) to find
synonyms.
• Statistical Methods: Analyzing term co-occurrence in the document corpus.
• Latent Semantic Analysis: Discovering hidden relationships between terms.
• These techniques bridge the “intention gap” between the user's query and the
indexed content.
Matching and Ranking
Once preprocessed and potentially expanded, the query is matched against the
document index. Different IR models can be employed for this purpose:
• Boolean Model: Uses logical operators (AND, OR, NOT) to filter documents.
• Vector Space Model: Represents documents and queries as vectors in a
multidimensional space and uses similarity measures (e.g., cosine similarity) to
rank documents.
• Probabilistic Models: Estimate the likelihood that a document is relevant to
the query.
• Neural and Semantic Models: Leverage deep learning techniques to capture
semantic relationships beyond exact term matching.
• After matching, a ranking algorithm assigns relevance scores to the retrieved
documents. The final output is a ranked list where the highest scoring
documents are presented first.
• Advanced Techniques in Query Processing
• 3.1. User Intent Recognition
• Recent research has focused on understanding the underlying intent of a query:
whether it is informational, navigational, or transactional.
• By analyzing linguistic signals and contextual signals, systems can tailor the
processing strategy and retrieval results to better match the user's goals.
• For instance, frameworks have been developed that incorporate contextual
fuzzy linguistic inference systems (CFLIS) to assess query intent before
executing the search.
• Semantic Cache Optimization
• To reduce response time and improve efficiency, some systems maintain a
semantic cache of previously processed queries and their results.
• When a new query is submitted, the system first checks for semantically similar
queries in the cache, which can dramatically lower latency.
• This approach is particularly useful in high-volume or cloud-based search
systems.
• Selective Query Processing
• Selective query processing adapts system parameters on a per-query basis.
• Rather than applying a one-size-fits-all configuration, modern IR systems can
select the most effective configuration for each query, balancing risk and
reward to maximize effectiveness.
• Studies have shown that such risk-sensitive methods can significantly
outperform static configurations.
Neural Methods
Neural network-based approaches have recently been introduced to further enhance
query processing.
These methods address vocabulary mismatches and capture deeper semantic
relationships between queries and documents.
Neural architectures, such as those employing multi-head attention and recurrent
units, can dynamically adapt to the specific nuances of each query, improving both
precision and recall.
Relevance Feedback and Query Expansion
In the field of Information Retrieval (IR), relevance feedback and query expansion are key
techniques used to improve search results by refining queries and retrieving more relevant
documents.
1. Relevance Feedback (RF)
Relevance feedback is an interactive process in which users provide feedback on the relevance of
retrieved documents, and the system uses this feedback to refine the search query and improve
subsequent retrieval results.
Types of Relevance Feedback
1. Explicit Relevance Feedback:
o Users manually mark retrieved documents as relevant or irrelevant.
o The system modifies the query based on the marked documents.
o Example: Google’s "thumbs up/down" for recommendations.
2. Implicit Relevance Feedback:
o The system infers relevance based on user behavior, such as:
- Click-through rate (CTR)
- Time spent on a page
- Scrolling and interactions
o Example: Search engines adjusting results based on user click patterns.
3. Pseudo (Blind) Relevance Feedback:
o The system assumes that the top-ranked documents from the initial search are
relevant and modifies the query automatically.
o No user input is needed, reducing user effort.
o Example: Rocchio algorithm-based relevance feedback.
Common Approaches to Relevance Feedback
• Rocchio Algorithm: Adjusts the query vector based on relevant and non-relevant
document vectors.
• Probabilistic Models: Bayesian models update query terms based on probability
estimates.
• Neural Networks & Deep Learning: Machine learning models refine queries based
on historical user feedback.
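The Rocchio update can be sketched directly from its formula, q' = αq + (β/|Dr|)Σd_r − (γ/|Dn|)Σd_n. The vectors and default weights below are assumed for illustration (β > γ reflects the usual emphasis on relevant documents).

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward relevant document vectors
    and away from non-relevant ones. Vectors are term->weight dicts."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for d in relevant:
        for t, w in d.items():
            new_q[t] += beta * w / len(relevant)
    for d in nonrelevant:
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.9, "car": 0.8}]       # user marked relevant
nonrel = [{"jaguar": 0.5, "animal": 0.9}]  # user marked non-relevant
print(rocchio(q, rel, nonrel))
# "car" is added to the query; "animal" is suppressed
```

Note how the feedback both reweights the original term and expands the query with "car", illustrating the link between relevance feedback and query expansion.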
2. Query Expansion (QE)
Query expansion is the process of improving the original query by adding additional terms to
make it more effective.
Types of Query Expansion
1. Manual Query Expansion:
o Users refine their queries manually by adding or modifying terms.
o Example: Searching for “Apple” → Expanding to “Apple Inc. technology
company.”
2. Automatic Query Expansion (AQE):
o The system automatically adds relevant terms based on linguistic, statistical, or
semantic methods.
o Types of AQE:
- Thesaurus-Based Expansion: Uses synonyms, hypernyms, and
hyponyms (e.g., "car" → "automobile").
- Relevance Feedback-Based Expansion: Extracts terms from
relevant documents.
- Word Embeddings (e.g., Word2Vec, BERT): Expands queries
using context-aware embeddings.
- Query Log Analysis: Uses historical search data to suggest
expansions.
3. Local and Global Analysis:
o Local Analysis: Extracts terms from the top retrieved documents.
o Global Analysis: Uses external knowledge sources like Wikipedia, WordNet, or
query logs.
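Thesaurus-based expansion is the simplest form to sketch. The tiny hand-built thesaurus below is invented for the example; a real system might query WordNet or learned embeddings instead.

```python
# Tiny hand-built thesaurus (illustrative; real systems use WordNet etc.).
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie", "motion picture"],
}

def expand_query(terms):
    """Return the original terms plus any thesaurus synonyms."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["car", "rental"]))
# ['car', 'rental', 'automobile', 'vehicle']
```

In practice the added terms are usually given lower weights than the original ones so that expansion broadens recall without drowning out the user's own words.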
3. Relationship between RF and QE
Relevance feedback often leads to query expansion as the system identifies better terms
from relevant documents.
Query expansion can improve relevance feedback by refining initial queries before
feedback is applied.
Practical Applications
Search Engines (Google, Bing): Improve results based on user interaction.
Recommendation Systems: Netflix, YouTube use implicit relevance feedback.
Medical & Legal IR Systems: Refining searches for precision in specialized domains.
Conclusion
Relevance feedback and query expansion are crucial in enhancing search accuracy and user
experience in IR systems. By integrating these techniques, modern search engines and AI-based
retrieval systems can deliver more precise and meaningful results.