[go: up one dir, main page]

0% found this document useful (0 votes)
74 views11 pages

Inverted Index-Unit-3

Uploaded by

kruti
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
74 views11 pages

Inverted Index-Unit-3

Uploaded by

kruti
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 11

Introduction

In this document, we walk through the creation of a search engine system using an
inverted index as the backbone. We start by explaining what an inverted index is,
provide a basic Python implementation, and then progressively add advanced
features like boolean queries, phrase, fuzzy, and wildcard searches. We also
enhance our search relevance with TF-IDF and BM25 ranking. Finally, we integrate
the search engine into a Flask API, cover scaling strategies using SQLite and
multiprocessing, and discuss production deployment considerations.

Understanding the Inverted Index

Definition & Purpose

An inverted index is a data structure that maps words (or terms) to the documents
in which they appear. Instead of listing the words for each document, the inverted
index "inverts" the relationship by mapping each unique word to a list (or posting list)
of documents. This is the fundamental technology behind search engines, enabling
efficient full-text searches.

Basic Structure & Example

Components:

 Unique Words (Terms): A list of all words in the document collection.


 Posting List: For each word, a list of document IDs (and optionally word
positions) where the word appears.

Example:

Consider three documents:

 Doc1: "The cat sat on the mat."


 Doc2: "The dog sat on the log."
 Doc3: "The cat and the dog played together."

Extracted Words:
the, cat, sat, on, mat, dog, log, and, played, together

Inverted Index Table:

Word Posting List


the [Doc1, Doc2, Doc3]
cat [Doc1, Doc3]
sat [Doc1, Doc2]
on [Doc1, Doc2]
mat [Doc1]
Word Posting List
dog [Doc2, Doc3]
log [Doc2]
and [Doc3]
played [Doc3]
together [Doc3]

With this structure, a search query such as "cat" immediately returns Doc1 and
Doc3.

Basic Inverted Index Implementation in Python

Code Overview

Below is a simple Python implementation using a dictionary to map words to


document IDs.

from collections import defaultdict

class InvertedIndex:
def __init__(self):
self.index = defaultdict(set) # Map words to a set of document IDs

def add_document(self, doc_id, text):


"""Tokenize text and update the index."""
words = text.lower().split() # Basic tokenization
for word in words:
self.index[word].add(doc_id)

def search(self, query):


"""Return documents that contain the queried word."""
return self.index.get(query.lower(), set())

def display_index(self):
"""Prints the inverted index."""
for word, doc_ids in self.index.items():
print(f"{word}: {sorted(doc_ids)}")

# Example documents
documents = {
1: "The cat sat on the mat",
2: "The dog sat on the log",
3: "The cat and the dog played together"
}
# Build and display the index
index = InvertedIndex()
for doc_id, text in documents.items():
index.add_document(doc_id, text)

print("Inverted Index:")
index.display_index()
print("\nDocuments containing 'cat':", index.search("cat"))

Usage & Output

 Index Construction: Each document is tokenized and added to the


dictionary.
 Search: A query like "cat" returns {1, 3}.
 Output Example:

Inverted Index:
the: [1, 2, 3]
cat: [1, 3]
sat: [1, 2]
on: [1, 2]
mat: [1]
dog: [2, 3]
log: [2]
and: [3]
played: [3]
together: [3]

Documents containing 'cat': {1, 3}

Advanced Features of the Inverted Index

To handle more complex queries, we enhanced the basic index with additional
search functionalities:

Boolean Search

 Supports: Queries with AND and OR.


 Example:
"dog and cat" returns only documents containing both words.

Phrase Search

 Goal: Retrieve documents where a sequence of words appears in order.


 Method: Track positions of words in documents and check if positions align
for consecutive words.

Fuzzy Search
 Purpose: Correct for typos or similar words (e.g., "dgo" should match "dog").
 Implementation: Uses Python’s difflib.get_close_matches for fuzzy matching.

Wildcard Search

 Capability: Supports * (any sequence of characters) and ? (single character)


in search queries.
 Implementation: Converts wildcard patterns into regular expressions and
finds matching words.

A combined advanced implementation incorporates all these functionalities for robust


search capability.

Enhancing Relevance with Ranking Methods

TF-IDF Ranking

 TF (Term Frequency): Measures how often a term appears in a document.


 IDF (Inverse Document Frequency): Gives higher weight to rarer words.
 Use: Compute scores for documents matching a query to rank them by
relevance.

Code Highlight:

import math

def tf_idf_search(self, query):


words = query.lower().split()
scores = defaultdict(float)
for word in words:
if word in self.index:
idf = math.log((self.num_docs + 1) / (self.df[word] + 1)) + 1
for doc_id in self.index[word]:
tf = self.tf[word][doc_id]
scores[doc_id] += tf * idf
return sorted(scores.items(), key=lambda x: x[1], reverse=True)

BM25 Ranking

BM25 is a state-of-the-art ranking function that improves upon TF-IDF by accounting


for document length and other factors.

 Components:
o k1 and b: Tuning parameters.
o IDF and normalization: Similar to TF-IDF but adjusted for document
length.
 Outcome: More accurate ranking of search results.
Scaling the Inverted Index for Big Data

For a production system handling millions of documents, in-memory storage may not
be sufficient. The following strategies are used:

Disk-Based Storage with SQLite

 Advantage:
o Persistence of the index on disk.
o Efficient SQL queries and indexing.
 Implementation:
o Create tables for documents, index entries, and document statistics.

Parallel Processing for Indexing

 Objective:
o Speed up document processing by utilizing multiple CPU cores.
 Implementation:
o Use Python's multiprocessing.Pool to process documents concurrently.

An advanced implementation combines SQLite for storage and multiprocessing for


faster indexing.

Building a Flask API for the Search Engine

A RESTful API built using Flask provides endpoints to interact with the search
engine. This integration enables adding documents, performing different types of
searches, and returning ranked results.

API Endpoints Overview

 /add_document:
o Method: POST
o Purpose: Add a new document to the index.
 /search:
o Method: GET
o Purpose: Perform a simple word search.
 /boolean_search:
o Method: GET
o Purpose: Handle boolean queries.
 /bm25_search:
o Method: GET
o Purpose: Return BM25-ranked search results.
 /fuzzy_search & /wildcard_search:
o Methods: GET
o Purpose: Support advanced query types.

Flask API Code Example

from flask import Flask, request, jsonify


import sqlite3
import re
import math
from collections import defaultdict
from difflib import get_close_matches

app = Flask(__name__)
DB_NAME = "inverted_index.db"

class InvertedIndex:
def __init__(self, db_name=DB_NAME):
self.db_name = db_name
self.num_docs = 0
self._initialize_db()
self._update_num_docs()

def _initialize_db(self):
with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS documents (doc_id
INTEGER PRIMARY KEY, text TEXT)")
cursor.execute("CREATE TABLE IF NOT EXISTS index_table (word TEXT,
doc_id INTEGER, tf INTEGER)")
cursor.execute("CREATE TABLE IF NOT EXISTS doc_stats (doc_id
INTEGER PRIMARY KEY, length INTEGER)")
conn.commit()

def _update_num_docs(self):
with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM documents")
self.num_docs = cursor.fetchone()[0]

def add_document(self, doc_id, text):


words = re.findall(r'\w+', text.lower())
term_freq = defaultdict(int)
for word in words:
term_freq[word] += 1

with sqlite3.connect(self.db_name) as conn:


cursor = conn.cursor()
cursor.execute("INSERT OR REPLACE INTO documents (doc_id, text)
VALUES (?, ?)", (doc_id, text))
cursor.execute("INSERT OR REPLACE INTO doc_stats (doc_id, length)
VALUES (?, ?)", (doc_id, len(words)))
for word, freq in term_freq.items():
cursor.execute("INSERT INTO index_table (word, doc_id, tf) VALUES
(?, ?, ?)", (word, doc_id, freq))
conn.commit()
self.num_docs += 1

def search(self, word):


with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT doc_id FROM index_table WHERE
word = ?", (word.lower(),))
return {row[0] for row in cursor.fetchall()}

def boolean_search(self, query):


words = query.lower().split()
if "and" in words:
terms = [word for word in words if word != "and"]
result = set.intersection(*(self.search(term) for term in terms))
elif "or" in words:
terms = [word for word in words if word != "or"]
result = set.union(*(self.search(term) for term in terms))
else:
result = self.search(query)
return result

def bm25_search(self, query, k1=1.5, b=0.75):


words = query.lower().split()
scores = defaultdict(float)
avg_doc_length = self._get_avg_doc_length()
with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
for word in words:
cursor.execute("SELECT COUNT(DISTINCT doc_id) FROM index_table
WHERE word = ?", (word,))
df = cursor.fetchone()[0] or 0
idf = math.log((self.num_docs - df + 0.5) / (df + 0.5) + 1)
cursor.execute("SELECT doc_id, tf FROM index_table WHERE word = ?",
(word,))
for doc_id, tf in cursor.fetchall():
cursor.execute("SELECT length FROM doc_stats WHERE doc_id = ?",
(doc_id,))
doc_length = cursor.fetchone()[0]
tf_norm = ((tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_length /
avg_doc_length))))
scores[doc_id] += idf * tf_norm
return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def fuzzy_search(self, word, threshold=0.7):


with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT word FROM index_table")
all_words = [row[0] for row in cursor.fetchall()]
close_matches = get_close_matches(word.lower(), all_words, n=3,
cutoff=threshold)
matched_docs = set()
for match in close_matches:
matched_docs.update(self.search(match))
return matched_docs

def wildcard_search(self, pattern):


regex = re.compile(pattern.replace("*", ".*").replace("?", "."))
with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT word FROM index_table")
matched_words = [word for word, in cursor.fetchall() if regex.fullmatch(word)]
matched_docs = set()
for word in matched_words:
matched_docs.update(self.search(word))
return matched_docs

def _get_avg_doc_length(self):
with sqlite3.connect(self.db_name) as conn:
cursor = conn.cursor()
cursor.execute("SELECT AVG(length) FROM doc_stats")
avg = cursor.fetchone()[0]
return avg if avg is not None else 1

# Initialize the index


index = InvertedIndex()

@app.route('/add_document', methods=['POST'])
def add_document():
data = request.json
doc_id = data.get("doc_id")
text = data.get("text")
if not doc_id or not text:
return jsonify({"error": "doc_id and text are required"}), 400
index.add_document(doc_id, text)
return jsonify({"message": f"Document {doc_id} added successfully."})

@app.route('/search', methods=['GET'])
def search():
word = request.args.get("word")
if not word:
return jsonify({"error": "Query parameter 'word' is required"}), 400
result = list(index.search(word))
return jsonify({"results": result})
@app.route('/boolean_search', methods=['GET'])
def boolean_search():
query = request.args.get("query")
if not query:
return jsonify({"error": "Query parameter 'query' is required"}), 400
result = list(index.boolean_search(query))
return jsonify({"results": result})

@app.route('/bm25_search', methods=['GET'])
def bm25_search():
query = request.args.get("query")
if not query:
return jsonify({"error": "Query parameter 'query' is required"}), 400
results = index.bm25_search(query)
formatted_results = [{"doc_id": doc_id, "score": score} for doc_id, score in results]
return jsonify({"results": formatted_results})

@app.route('/fuzzy_search', methods=['GET'])
def fuzzy_search():
word = request.args.get("word")
if not word:
return jsonify({"error": "Query parameter 'word' is required"}), 400
result = list(index.fuzzy_search(word))
return jsonify({"results": result})

@app.route('/wildcard_search', methods=['GET'])
def wildcard_search():
pattern = request.args.get("pattern")
if not pattern:
return jsonify({"error": "Query parameter 'pattern' is required"}), 400
result = list(index.wildcard_search(pattern))
return jsonify({"results": result})

if __name__ == '__main__':
app.run(debug=True)

Production Enhancements & Deployment Strategies

To deploy the search engine API in production, several enhancements and best
practices should be considered:

Authentication & Authorization

 Secure Endpoints:
Use token-based authentication (e.g., Flask-JWT-Extended) or OAuth.
 User Management:
Maintain a user database to restrict and monitor access.

Pagination & Caching


 Pagination:
Introduce query parameters for page and limit to paginate results.
 Caching:
Use Flask-Caching or Redis to cache frequent queries, reducing load on the
database.

Logging & Monitoring

 Logging:
Utilize Python’s logging module or services like Sentry to capture errors.
 Monitoring:
Employ monitoring tools (e.g., Prometheus, Grafana) to track performance
metrics.

Containerization and WSGI Deployment

 Docker Containerization:
Create a Dockerfile for consistent deployment.

dockerfile
CopyEdit
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "your_flask_app.py"]

 Production WSGI Server:


Use Gunicorn to serve the Flask app.

bash
CopyEdit
gunicorn -w 4 -b 0.0.0.0:5000 your_flask_app:app

 Reverse Proxy:
Configure Nginx as a reverse proxy to handle SSL termination, load
balancing, and static file serving.

Scaling Out

 Horizontal Scaling:
Use Docker Compose or Kubernetes for managing multiple instances.
 Database Considerations:
For higher traffic, consider transitioning from SQLite to PostgreSQL or
Elasticsearch.
Conclusion

This document details the journey from understanding the concept of an inverted
index to developing a production-ready search engine API. Key takeaways include:

 Inverted Index: A powerful data structure for fast text searches.


 Advanced Features: Implementations of boolean, phrase, fuzzy, and
wildcard searches.
 Ranking Methods: Enhancing search relevance using TF-IDF and BM25.
 Scalability: Employing disk-based storage (SQLite) and multiprocessing.
 Flask API: Exposing search functionality through a RESTful web service.
 Production Deployment: Integrating authentication, pagination, caching, and
containerization for a robust, scalable system.

By following these guidelines and using the provided code examples, you can build
and deploy an efficient and scalable search engine solution.

You might also like