Introduction
In this document, we walk through the creation of a search engine system using an
inverted index as the backbone. We start by explaining what an inverted index is,
provide a basic Python implementation, and then progressively add advanced
features such as boolean, phrase, fuzzy, and wildcard searches. We also
enhance our search relevance with TF-IDF and BM25 ranking. Finally, we integrate
the search engine into a Flask API, cover scaling strategies using SQLite and
multiprocessing, and discuss production deployment considerations.
Understanding the Inverted Index
Definition & Purpose
An inverted index is a data structure that maps words (or terms) to the documents
in which they appear. Instead of listing the words for each document, the inverted
index "inverts" the relationship by mapping each unique word to a list (or posting list)
of documents. This is the fundamental technology behind search engines, enabling
efficient full-text searches.
Basic Structure & Example
Components:
Unique Words (Terms): A list of all words in the document collection.
Posting List: For each word, a list of document IDs (and optionally word
positions) where the word appears.
Example:
Consider three documents:
Doc1: "The cat sat on the mat."
Doc2: "The dog sat on the log."
Doc3: "The cat and the dog played together."
Extracted Words:
the, cat, sat, on, mat, dog, log, and, played, together
Inverted Index Table:
Word       Posting List
the        [Doc1, Doc2, Doc3]
cat        [Doc1, Doc3]
sat        [Doc1, Doc2]
on         [Doc1, Doc2]
mat        [Doc1]
dog        [Doc2, Doc3]
log        [Doc2]
and        [Doc3]
played     [Doc3]
together   [Doc3]
With this structure, a search query such as "cat" immediately returns Doc1 and
Doc3.
Basic Inverted Index Implementation in Python
Code Overview
Below is a simple Python implementation using a dictionary to map words to
document IDs.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(set)  # Map words to a set of document IDs

    def add_document(self, doc_id, text):
        """Tokenize text and update the index."""
        words = text.lower().split()  # Basic tokenization
        for word in words:
            self.index[word].add(doc_id)

    def search(self, query):
        """Return documents that contain the queried word."""
        return self.index.get(query.lower(), set())

    def display_index(self):
        """Print the inverted index."""
        for word, doc_ids in self.index.items():
            print(f"{word}: {sorted(doc_ids)}")

# Example documents
documents = {
    1: "The cat sat on the mat",
    2: "The dog sat on the log",
    3: "The cat and the dog played together"
}

# Build and display the index
index = InvertedIndex()
for doc_id, text in documents.items():
    index.add_document(doc_id, text)

print("Inverted Index:")
index.display_index()
print("\nDocuments containing 'cat':", index.search("cat"))
Usage & Output
Index Construction: Each document is tokenized and added to the
dictionary.
Search: A query like "cat" returns {1, 3}.
Output Example:
Inverted Index:
the: [1, 2, 3]
cat: [1, 3]
sat: [1, 2]
on: [1, 2]
mat: [1]
dog: [2, 3]
log: [2]
and: [3]
played: [3]
together: [3]
Documents containing 'cat': {1, 3}
Advanced Features of the Inverted Index
To handle more complex queries, we enhance the basic index with additional
search functionality:
Boolean Search
Supports: Queries with AND and OR.
Example:
"dog and cat" returns only documents containing both words.
Phrase Search
Goal: Retrieve documents where a sequence of words appears in order.
Method: Track positions of words in documents and check if positions align
for consecutive words.
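One way to implement this (a minimal sketch, not the only possible design) is a positional index that records where each word occurs, then checks that positions line up:

```python
from collections import defaultdict

class PositionalIndex:
    def __init__(self):
        # word -> doc_id -> list of positions where the word occurs
        self.index = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        for pos, word in enumerate(text.lower().split()):
            self.index[word][doc_id].append(pos)

    def phrase_search(self, phrase):
        words = phrase.lower().split()
        if not words or words[0] not in self.index:
            return set()
        results = set()
        # For each occurrence of the first word, check that the remaining
        # words appear at the immediately following positions
        for doc_id, positions in self.index[words[0]].items():
            for start in positions:
                if all(start + i in self.index.get(w, {}).get(doc_id, [])
                       for i, w in enumerate(words[1:], start=1)):
                    results.add(doc_id)
                    break
        return results
```

For the example documents above, `phrase_search("sat on the")` matches Doc1 and Doc2, while `phrase_search("the mat")` matches only Doc1.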
Fuzzy Search
Purpose: Correct for typos or similar words (e.g., "dgo" should match "dog").
Implementation: Uses Python’s difflib.get_close_matches for fuzzy matching.
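A quick illustration (note: difflib rates "dgo" vs "dog" at roughly 0.67, so a cutoff of 0.6 is needed to catch this particular typo):

```python
from difflib import get_close_matches

vocabulary = ["cat", "dog", "mat", "log", "played", "together"]

# cutoff is a similarity ratio between 0 and 1; lower values match more loosely
matches = get_close_matches("dgo", vocabulary, n=3, cutoff=0.6)
print(matches)  # "dog" is among the close matches
```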
Wildcard Search
Capability: Supports * (any sequence of characters) and ? (single character)
in search queries.
Implementation: Converts wildcard patterns into regular expressions and
finds matching words.
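A small sketch of the conversion (escaping real regex metacharacters first, so that e.g. a literal "." in a query is not treated as a wildcard):

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (* = any run of chars, ? = one char) into a regex."""
    escaped = re.escape(pattern)  # escape regex metacharacters, including * and ?
    return re.compile(escaped.replace(r"\*", ".*").replace(r"\?", "."))

vocabulary = ["cat", "mat", "coat", "dog", "cart"]
regex = wildcard_to_regex("c*t")
print([w for w in vocabulary if regex.fullmatch(w)])  # ['cat', 'coat', 'cart']
```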
A combined advanced implementation incorporates all these functionalities for robust
search capability.
Enhancing Relevance with Ranking Methods
TF-IDF Ranking
TF (Term Frequency): Measures how often a term appears in a document.
IDF (Inverse Document Frequency): Gives higher weight to rarer words.
Use: Compute scores for documents matching a query to rank them by
relevance.
Code Highlight:
import math

def tf_idf_search(self, query):
    # Assumes the index class also maintains self.num_docs, self.df (document
    # frequency per word), and self.tf (per-document term frequencies)
    words = query.lower().split()
    scores = defaultdict(float)
    for word in words:
        if word in self.index:
            idf = math.log((self.num_docs + 1) / (self.df[word] + 1)) + 1
            for doc_id in self.index[word]:
                tf = self.tf[word][doc_id]
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
BM25 Ranking
BM25 is a widely used ranking function that improves upon TF-IDF by accounting
for document length and term-frequency saturation.
Components:
- k1 and b: Tuning parameters (k1 controls term-frequency saturation, b controls
length normalization).
- IDF and normalization: Similar to TF-IDF, but term frequency is normalized by
document length.
Outcome: More accurate ranking of search results.
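As an illustrative in-memory sketch (the function and parameter names are ours; the scoring formula mirrors the Okapi BM25 variant used later in the Flask implementation):

```python
import math
from collections import defaultdict

def bm25_rank(query, index, doc_lengths, k1=1.5, b=0.75):
    """index: word -> {doc_id: term frequency}; doc_lengths: doc_id -> word count."""
    n = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / n
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        df = len(postings)                               # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rarer terms weigh more
        for doc_id, tf in postings.items():
            # Saturating TF, normalized by document length relative to the average
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lengths[doc_id] / avgdl))
            scores[doc_id] += idf * norm
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

A document that repeats the query term scores higher, but with diminishing returns as tf grows, and long documents are penalized relative to the average length.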
Scaling the Inverted Index for Big Data
For a production system handling millions of documents, in-memory storage may not
be sufficient. The following strategies are used:
Disk-Based Storage with SQLite
Advantage:
- Persistence of the index on disk.
- Efficient SQL queries and indexing.
Implementation:
- Create tables for documents, index entries, and document statistics.
Parallel Processing for Indexing
Objective:
- Speed up document processing by utilizing multiple CPU cores.
Implementation:
- Use Python's multiprocessing.Pool to process documents concurrently.
An advanced implementation combines SQLite for storage and multiprocessing for
faster indexing.
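A minimal sketch of the parallel-indexing idea (our own function names; tokenization runs in worker processes, while the merge stays serial to avoid shared-state issues):

```python
from collections import Counter
import multiprocessing as mp

def tokenize(item):
    """Worker: turn one (doc_id, text) pair into per-document term frequencies."""
    doc_id, text = item
    return doc_id, Counter(text.lower().split())

def build_parallel_index(documents, processes=4):
    """Tokenize documents across worker processes, then merge results serially."""
    # The "fork" start method keeps workers from re-importing the main module;
    # this assumes a Unix-like platform
    ctx = mp.get_context("fork")
    index = {}
    with ctx.Pool(processes) as pool:
        for doc_id, freqs in pool.imap_unordered(tokenize, documents.items()):
            for word, tf in freqs.items():
                index.setdefault(word, {})[doc_id] = tf
    return index

if __name__ == "__main__":
    docs = {1: "the cat sat on the mat", 2: "the dog sat on the log"}
    print(build_parallel_index(docs, processes=2))
```

In practice the merge step would write into SQLite, as in the storage scheme above, rather than into an in-memory dictionary.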
Building a Flask API for the Search Engine
A RESTful API built using Flask provides endpoints to interact with the search
engine. This integration enables adding documents, performing different types of
searches, and returning ranked results.
API Endpoints Overview
/add_document:
- Method: POST
- Purpose: Add a new document to the index.
/search:
- Method: GET
- Purpose: Perform a simple word search.
/boolean_search:
- Method: GET
- Purpose: Handle boolean queries.
/bm25_search:
- Method: GET
- Purpose: Return BM25-ranked search results.
/fuzzy_search & /wildcard_search:
- Methods: GET
- Purpose: Support advanced query types.
Flask API Code Example
from flask import Flask, request, jsonify
import sqlite3
import re
import math
from collections import defaultdict
from difflib import get_close_matches

app = Flask(__name__)
DB_NAME = "inverted_index.db"

class InvertedIndex:
    def __init__(self, db_name=DB_NAME):
        self.db_name = db_name
        self.num_docs = 0
        self._initialize_db()
        self._update_num_docs()

    def _initialize_db(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("CREATE TABLE IF NOT EXISTS documents (doc_id INTEGER PRIMARY KEY, text TEXT)")
            cursor.execute("CREATE TABLE IF NOT EXISTS index_table (word TEXT, doc_id INTEGER, tf INTEGER)")
            cursor.execute("CREATE TABLE IF NOT EXISTS doc_stats (doc_id INTEGER PRIMARY KEY, length INTEGER)")
            conn.commit()

    def _update_num_docs(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT COUNT(*) FROM documents")
            self.num_docs = cursor.fetchone()[0]

    def add_document(self, doc_id, text):
        words = re.findall(r'\w+', text.lower())
        term_freq = defaultdict(int)
        for word in words:
            term_freq[word] += 1
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("INSERT OR REPLACE INTO documents (doc_id, text) VALUES (?, ?)", (doc_id, text))
            cursor.execute("INSERT OR REPLACE INTO doc_stats (doc_id, length) VALUES (?, ?)", (doc_id, len(words)))
            # Remove stale postings so re-adding a document does not duplicate rows
            cursor.execute("DELETE FROM index_table WHERE doc_id = ?", (doc_id,))
            for word, freq in term_freq.items():
                cursor.execute("INSERT INTO index_table (word, doc_id, tf) VALUES (?, ?, ?)", (word, doc_id, freq))
            conn.commit()
        # Recount rather than increment, so replacing a document is counted correctly
        self._update_num_docs()

    def search(self, word):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT doc_id FROM index_table WHERE word = ?", (word.lower(),))
            return {row[0] for row in cursor.fetchall()}

    def boolean_search(self, query):
        words = query.lower().split()
        if "and" in words:
            terms = [word for word in words if word != "and"]
            result = set.intersection(*(self.search(term) for term in terms))
        elif "or" in words:
            terms = [word for word in words if word != "or"]
            result = set.union(*(self.search(term) for term in terms))
        else:
            result = self.search(query)
        return result

    def bm25_search(self, query, k1=1.5, b=0.75):
        words = query.lower().split()
        scores = defaultdict(float)
        avg_doc_length = self._get_avg_doc_length()
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            for word in words:
                cursor.execute("SELECT COUNT(DISTINCT doc_id) FROM index_table WHERE word = ?", (word,))
                df = cursor.fetchone()[0] or 0
                idf = math.log((self.num_docs - df + 0.5) / (df + 0.5) + 1)
                cursor.execute("SELECT doc_id, tf FROM index_table WHERE word = ?", (word,))
                for doc_id, tf in cursor.fetchall():
                    cursor.execute("SELECT length FROM doc_stats WHERE doc_id = ?", (doc_id,))
                    doc_length = cursor.fetchone()[0]
                    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_length / avg_doc_length)))
                    scores[doc_id] += idf * tf_norm
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

    def fuzzy_search(self, word, threshold=0.6):
        # Default of 0.6 rather than 0.7: difflib scores "dgo" vs "dog" at about
        # 0.67, so a 0.7 cutoff would miss the example typo
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT word FROM index_table")
            all_words = [row[0] for row in cursor.fetchall()]
        close_matches = get_close_matches(word.lower(), all_words, n=3, cutoff=threshold)
        matched_docs = set()
        for match in close_matches:
            matched_docs.update(self.search(match))
        return matched_docs

    def wildcard_search(self, pattern):
        # Escape regex metacharacters first, then translate the wildcards
        regex = re.compile(re.escape(pattern).replace(r"\*", ".*").replace(r"\?", "."))
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT word FROM index_table")
            matched_words = [word for (word,) in cursor.fetchall() if regex.fullmatch(word)]
        matched_docs = set()
        for word in matched_words:
            matched_docs.update(self.search(word))
        return matched_docs

    def _get_avg_doc_length(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT AVG(length) FROM doc_stats")
            avg = cursor.fetchone()[0]
            return avg if avg is not None else 1

# Initialize the index
index = InvertedIndex()

@app.route('/add_document', methods=['POST'])
def add_document():
    data = request.get_json(silent=True) or {}
    doc_id = data.get("doc_id")
    text = data.get("text")
    if doc_id is None or not text:
        return jsonify({"error": "doc_id and text are required"}), 400
    index.add_document(doc_id, text)
    return jsonify({"message": f"Document {doc_id} added successfully."})

@app.route('/search', methods=['GET'])
def search():
    word = request.args.get("word")
    if not word:
        return jsonify({"error": "Query parameter 'word' is required"}), 400
    result = list(index.search(word))
    return jsonify({"results": result})

@app.route('/boolean_search', methods=['GET'])
def boolean_search():
    query = request.args.get("query")
    if not query:
        return jsonify({"error": "Query parameter 'query' is required"}), 400
    result = list(index.boolean_search(query))
    return jsonify({"results": result})

@app.route('/bm25_search', methods=['GET'])
def bm25_search():
    query = request.args.get("query")
    if not query:
        return jsonify({"error": "Query parameter 'query' is required"}), 400
    results = index.bm25_search(query)
    formatted_results = [{"doc_id": doc_id, "score": score} for doc_id, score in results]
    return jsonify({"results": formatted_results})

@app.route('/fuzzy_search', methods=['GET'])
def fuzzy_search():
    word = request.args.get("word")
    if not word:
        return jsonify({"error": "Query parameter 'word' is required"}), 400
    result = list(index.fuzzy_search(word))
    return jsonify({"results": result})

@app.route('/wildcard_search', methods=['GET'])
def wildcard_search():
    pattern = request.args.get("pattern")
    if not pattern:
        return jsonify({"error": "Query parameter 'pattern' is required"}), 400
    result = list(index.wildcard_search(pattern))
    return jsonify({"results": result})

if __name__ == '__main__':
    app.run(debug=True)
Production Enhancements & Deployment Strategies
To deploy the search engine API in production, several enhancements and best
practices should be considered:
Authentication & Authorization
Secure Endpoints:
Use token-based authentication (e.g., Flask-JWT-Extended) or OAuth.
User Management:
Maintain a user database to restrict and monitor access.
Pagination & Caching
Pagination:
Introduce query parameters for page and limit to paginate results.
Caching:
Use Flask-Caching or Redis to cache frequent queries, reducing load on the
database.
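For pagination, a minimal helper (our own sketch) that slices a ranked result list and reports metadata:

```python
def paginate(results, page=1, limit=10):
    """Return one page of results plus metadata; pages are 1-based."""
    start = (page - 1) * limit
    return {
        "page": page,
        "limit": limit,
        "total": len(results),
        "results": results[start:start + limit],
    }

print(paginate(list(range(25)), page=3, limit=10))
```

In the Flask endpoints, page and limit would come from request.args with sensible defaults before calling such a helper on the ranked results.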
Logging & Monitoring
Logging:
Utilize Python’s logging module or services like Sentry to capture errors.
Monitoring:
Employ monitoring tools (e.g., Prometheus, Grafana) to track performance
metrics.
Containerization and WSGI Deployment
Docker Containerization:
Create a Dockerfile for consistent deployment.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "your_flask_app.py"]
Production WSGI Server:
Use Gunicorn to serve the Flask app.
gunicorn -w 4 -b 0.0.0.0:5000 your_flask_app:app
Reverse Proxy:
Configure Nginx as a reverse proxy to handle SSL termination, load
balancing, and static file serving.
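A minimal Nginx server block for this setup might look like the following (the server name, port, and header choices are illustrative, not prescriptive):

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://127.0.0.1:5000;  # Gunicorn bind address from above
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```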
Scaling Out
Horizontal Scaling:
Use Docker Compose or Kubernetes for managing multiple instances.
Database Considerations:
For higher traffic, consider transitioning from SQLite to PostgreSQL or
Elasticsearch.
Conclusion
This document details the journey from understanding the concept of an inverted
index to developing a production-ready search engine API. Key takeaways include:
Inverted Index: A powerful data structure for fast text searches.
Advanced Features: Implementations of boolean, phrase, fuzzy, and
wildcard searches.
Ranking Methods: Enhancing search relevance using TF-IDF and BM25.
Scalability: Employing disk-based storage (SQLite) and multiprocessing.
Flask API: Exposing search functionality through a RESTful web service.
Production Deployment: Integrating authentication, pagination, caching, and
containerization for a robust, scalable system.
By following these guidelines and using the provided code examples, you can build
and deploy an efficient and scalable search engine solution.