Introduction
In this document, we walk through the creation of a search engine system using an
inverted index as the backbone. We start by explaining what an inverted index is,
provide a basic Python implementation, and then progressively add advanced
features such as boolean, phrase, fuzzy, and wildcard searches. We also
enhance our search relevance with TF-IDF and BM25 ranking. Finally, we integrate
the search engine into a Flask API, cover scaling strategies using SQLite and
multiprocessing, and discuss production deployment considerations.
Understanding the Inverted Index
Definition & Purpose
An inverted index is a data structure that maps words (or terms) to the documents
in which they appear. Instead of listing the words for each document, the inverted
index "inverts" the relationship by mapping each unique word to a list (or posting list)
of documents. This is the fundamental technology behind search engines, enabling
efficient full-text searches.
Basic Structure & Example
Components:
Unique Words (Terms): A list of all words in the document collection.
Posting List: For each word, a list of document IDs (and optionally word
positions) where the word appears.
Example:
Consider three documents:
Doc1: "The cat sat on the mat."
Doc2: "The dog sat on the log."
Doc3: "The cat and the dog played together."
Extracted Words:
the, cat, sat, on, mat, dog, log, and, played, together
Inverted Index Table:
Word       Posting List
the        [Doc1, Doc2, Doc3]
cat        [Doc1, Doc3]
sat        [Doc1, Doc2]
on         [Doc1, Doc2]
mat        [Doc1]
dog        [Doc2, Doc3]
log        [Doc2]
and        [Doc3]
played     [Doc3]
together   [Doc3]
With this structure, a search query such as "cat" immediately returns Doc1 and
Doc3.
Basic Inverted Index Implementation in Python
Code Overview
Below is a simple Python implementation using a dictionary to map words to
document IDs.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(set)  # Map words to a set of document IDs

    def add_document(self, doc_id, text):
        """Tokenize text and update the index."""
        words = text.lower().split()  # Basic tokenization
        for word in words:
            self.index[word].add(doc_id)

    def search(self, query):
        """Return documents that contain the queried word."""
        return self.index.get(query.lower(), set())

    def display_index(self):
        """Print the inverted index."""
        for word, doc_ids in self.index.items():
            print(f"{word}: {sorted(doc_ids)}")

# Example documents
documents = {
    1: "The cat sat on the mat",
    2: "The dog sat on the log",
    3: "The cat and the dog played together"
}

# Build and display the index
index = InvertedIndex()
for doc_id, text in documents.items():
    index.add_document(doc_id, text)

print("Inverted Index:")
index.display_index()
print("\nDocuments containing 'cat':", index.search("cat"))
Usage & Output
Index Construction: Each document is tokenized and added to the
dictionary.
Search: A query like "cat" returns {1, 3}.
Output Example:
Inverted Index:
the: [1, 2, 3]
cat: [1, 3]
sat: [1, 2]
on: [1, 2]
mat: [1]
dog: [2, 3]
log: [2]
and: [3]
played: [3]
together: [3]
Documents containing 'cat': {1, 3}
Advanced Features of the Inverted Index
To handle more complex queries, we enhance the basic index with additional
search functionality:
Boolean Search
Supports: Queries with AND and OR.
Example:
"dog and cat" returns only documents containing both words.
Phrase Search
Goal: Retrieve documents where a sequence of words appears in order.
Method: Track positions of words in documents and check if positions align
for consecutive words.
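One way to implement this (a minimal sketch, not the only possible design) is a positional index that records where each word occurs, then checks that positions line up:

```python
from collections import defaultdict

class PositionalIndex:
    def __init__(self):
        # word -> doc_id -> list of positions where the word occurs
        self.index = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        for pos, word in enumerate(text.lower().split()):
            self.index[word][doc_id].append(pos)

    def phrase_search(self, phrase):
        words = phrase.lower().split()
        if not words or words[0] not in self.index:
            return set()
        results = set()
        # For each occurrence of the first word, check that the remaining
        # words appear at the immediately following positions
        for doc_id, positions in self.index[words[0]].items():
            for start in positions:
                if all(start + i in self.index.get(w, {}).get(doc_id, [])
                       for i, w in enumerate(words[1:], start=1)):
                    results.add(doc_id)
                    break
        return results
```

For the example documents above, `phrase_search("sat on the")` matches Doc1 and Doc2, while `phrase_search("the mat")` matches only Doc1.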
Fuzzy Search
Purpose: Correct for typos or similar words (e.g., "dgo" should match "dog").
Implementation: Uses Python’s difflib.get_close_matches for fuzzy matching.
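A quick illustration (note: difflib rates "dgo" vs "dog" at roughly 0.67, so a cutoff of 0.6 is needed to catch this particular typo):

```python
from difflib import get_close_matches

vocabulary = ["cat", "dog", "mat", "log", "played", "together"]

# cutoff is a similarity ratio between 0 and 1; lower values match more loosely
matches = get_close_matches("dgo", vocabulary, n=3, cutoff=0.6)
print(matches)  # "dog" is among the close matches
```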
Wildcard Search
Capability: Supports * (any sequence of characters) and ? (single character)
in search queries.
Implementation: Converts wildcard patterns into regular expressions and
finds matching words.
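A small sketch of the conversion (escaping real regex metacharacters first, so that e.g. a literal "." in a query is not treated as a wildcard):

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (* = any run of chars, ? = one char) into a regex."""
    escaped = re.escape(pattern)  # escape regex metacharacters, including * and ?
    return re.compile(escaped.replace(r"\*", ".*").replace(r"\?", "."))

vocabulary = ["cat", "mat", "coat", "dog", "cart"]
regex = wildcard_to_regex("c*t")
print([w for w in vocabulary if regex.fullmatch(w)])  # ['cat', 'coat', 'cart']
```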
A combined advanced implementation incorporates all these functionalities for robust
search capability.
Enhancing Relevance with Ranking Methods
TF-IDF Ranking
TF (Term Frequency): Measures how often a term appears in a document.
IDF (Inverse Document Frequency): Gives higher weight to rarer words.
Use: Compute scores for documents matching a query to rank them by
relevance.
Code Highlight:
import math

def tf_idf_search(self, query):
    # Assumes the index class also maintains self.num_docs, self.df (document
    # frequency per word), and self.tf (per-document term frequencies)
    words = query.lower().split()
    scores = defaultdict(float)
    for word in words:
        if word in self.index:
            idf = math.log((self.num_docs + 1) / (self.df[word] + 1)) + 1
            for doc_id in self.index[word]:
                tf = self.tf[word][doc_id]
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
BM25 Ranking
BM25 is a widely used ranking function that improves upon TF-IDF by accounting
for document length and term-frequency saturation.
Components:
- k1 and b: Tuning parameters (k1 controls term-frequency saturation, b controls
length normalization).
- IDF and normalization: Similar to TF-IDF, but term frequency is normalized by
document length.
Outcome: More accurate ranking of search results.
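As an illustrative in-memory sketch (the function and parameter names are ours; the scoring formula mirrors the Okapi BM25 variant used later in the Flask implementation):

```python
import math
from collections import defaultdict

def bm25_rank(query, index, doc_lengths, k1=1.5, b=0.75):
    """index: word -> {doc_id: term frequency}; doc_lengths: doc_id -> word count."""
    n = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / n
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        df = len(postings)                               # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rarer terms weigh more
        for doc_id, tf in postings.items():
            # Saturating TF, normalized by document length relative to the average
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lengths[doc_id] / avgdl))
            scores[doc_id] += idf * norm
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

A document that repeats the query term scores higher, but with diminishing returns as tf grows, and long documents are penalized relative to the average length.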
Scaling the Inverted Index for Big Data
For a production system handling millions of documents, in-memory storage may not
be sufficient. The following strategies are used:
Disk-Based Storage with SQLite
Advantage:
- Persistence of the index on disk.
- Efficient SQL queries and indexing.
Implementation:
- Create tables for documents, index entries, and document statistics.
Parallel Processing for Indexing
Objective:
- Speed up document processing by utilizing multiple CPU cores.
Implementation:
- Use Python's multiprocessing.Pool to process documents concurrently.
An advanced implementation combines SQLite for storage and multiprocessing for
faster indexing.
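A minimal sketch of the parallel-indexing idea (our own function names; tokenization runs in worker processes, while the merge stays serial to avoid shared-state issues):

```python
from collections import Counter
import multiprocessing as mp

def tokenize(item):
    """Worker: turn one (doc_id, text) pair into per-document term frequencies."""
    doc_id, text = item
    return doc_id, Counter(text.lower().split())

def build_parallel_index(documents, processes=4):
    """Tokenize documents across worker processes, then merge results serially."""
    # The "fork" start method keeps workers from re-importing the main module;
    # this assumes a Unix-like platform
    ctx = mp.get_context("fork")
    index = {}
    with ctx.Pool(processes) as pool:
        for doc_id, freqs in pool.imap_unordered(tokenize, documents.items()):
            for word, tf in freqs.items():
                index.setdefault(word, {})[doc_id] = tf
    return index

if __name__ == "__main__":
    docs = {1: "the cat sat on the mat", 2: "the dog sat on the log"}
    print(build_parallel_index(docs, processes=2))
```

In practice the merge step would write into SQLite, as in the storage scheme above, rather than into an in-memory dictionary.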
Building a Flask API for the Search Engine
A RESTful API built using Flask provides endpoints to interact with the search
engine. This integration enables adding documents, performing different types of
searches, and returning ranked results.
API Endpoints Overview
/add_document:
- Method: POST
- Purpose: Add a new document to the index.
/search:
- Method: GET
- Purpose: Perform a simple word search.
/boolean_search:
- Method: GET
- Purpose: Handle boolean queries.
/bm25_search:
- Method: GET
- Purpose: Return BM25-ranked search results.
/fuzzy_search & /wildcard_search:
- Methods: GET
- Purpose: Support advanced query types.
Flask API Code Example
from flask import Flask, request, jsonify
import sqlite3
import re
import math
from collections import defaultdict
from difflib import get_close_matches

app = Flask(__name__)
DB_NAME = "inverted_index.db"

class InvertedIndex:
    def __init__(self, db_name=DB_NAME):
        self.db_name = db_name
        self.num_docs = 0
        self._initialize_db()
        self._update_num_docs()

    def _initialize_db(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("CREATE TABLE IF NOT EXISTS documents (doc_id INTEGER PRIMARY KEY, text TEXT)")
            cursor.execute("CREATE TABLE IF NOT EXISTS index_table (word TEXT, doc_id INTEGER, tf INTEGER)")
            cursor.execute("CREATE TABLE IF NOT EXISTS doc_stats (doc_id INTEGER PRIMARY KEY, length INTEGER)")
            conn.commit()

    def _update_num_docs(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT COUNT(*) FROM documents")
            self.num_docs = cursor.fetchone()[0]

    def add_document(self, doc_id, text):
        words = re.findall(r'\w+', text.lower())
        term_freq = defaultdict(int)
        for word in words:
            term_freq[word] += 1
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("INSERT OR REPLACE INTO documents (doc_id, text) VALUES (?, ?)", (doc_id, text))
            cursor.execute("INSERT OR REPLACE INTO doc_stats (doc_id, length) VALUES (?, ?)", (doc_id, len(words)))
            # Remove stale postings so re-adding a document does not duplicate rows
            cursor.execute("DELETE FROM index_table WHERE doc_id = ?", (doc_id,))
            for word, freq in term_freq.items():
                cursor.execute("INSERT INTO index_table (word, doc_id, tf) VALUES (?, ?, ?)", (word, doc_id, freq))
            conn.commit()
        # Recount rather than increment, so replacing a document is counted correctly
        self._update_num_docs()

    def search(self, word):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT doc_id FROM index_table WHERE word = ?", (word.lower(),))
            return {row[0] for row in cursor.fetchall()}

    def boolean_search(self, query):
        words = query.lower().split()
        if "and" in words:
            terms = [word for word in words if word != "and"]
            result = set.intersection(*(self.search(term) for term in terms))
        elif "or" in words:
            terms = [word for word in words if word != "or"]
            result = set.union(*(self.search(term) for term in terms))
        else:
            result = self.search(query)
        return result

    def bm25_search(self, query, k1=1.5, b=0.75):
        words = query.lower().split()
        scores = defaultdict(float)
        avg_doc_length = self._get_avg_doc_length()
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            for word in words:
                cursor.execute("SELECT COUNT(DISTINCT doc_id) FROM index_table WHERE word = ?", (word,))
                df = cursor.fetchone()[0] or 0
                idf = math.log((self.num_docs - df + 0.5) / (df + 0.5) + 1)
                cursor.execute("SELECT doc_id, tf FROM index_table WHERE word = ?", (word,))
                for doc_id, tf in cursor.fetchall():
                    cursor.execute("SELECT length FROM doc_stats WHERE doc_id = ?", (doc_id,))
                    doc_length = cursor.fetchone()[0]
                    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_length / avg_doc_length)))
                    scores[doc_id] += idf * tf_norm
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

    def fuzzy_search(self, word, threshold=0.6):
        # Default of 0.6 rather than 0.7: difflib scores "dgo" vs "dog" at about
        # 0.67, so a 0.7 cutoff would miss the example typo
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT word FROM index_table")
            all_words = [row[0] for row in cursor.fetchall()]
        close_matches = get_close_matches(word.lower(), all_words, n=3, cutoff=threshold)
        matched_docs = set()
        for match in close_matches:
            matched_docs.update(self.search(match))
        return matched_docs

    def wildcard_search(self, pattern):
        # Escape regex metacharacters first, then translate the wildcards
        regex = re.compile(re.escape(pattern).replace(r"\*", ".*").replace(r"\?", "."))
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT DISTINCT word FROM index_table")
            matched_words = [word for (word,) in cursor.fetchall() if regex.fullmatch(word)]
        matched_docs = set()
        for word in matched_words:
            matched_docs.update(self.search(word))
        return matched_docs

    def _get_avg_doc_length(self):
        with sqlite3.connect(self.db_name) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT AVG(length) FROM doc_stats")
            avg = cursor.fetchone()[0]
            return avg if avg is not None else 1

# Initialize the index
index = InvertedIndex()

@app.route('/add_document', methods=['POST'])
def add_document():
    data = request.get_json(silent=True) or {}
    doc_id = data.get("doc_id")
    text = data.get("text")
    if doc_id is None or not text:
        return jsonify({"error": "doc_id and text are required"}), 400
    index.add_document(doc_id, text)
    return jsonify({"message": f"Document {doc_id} added successfully."})

@app.route('/search', methods=['GET'])
def search():
    word = request.args.get("word")
    if not word:
        return jsonify({"error": "Query parameter 'word' is required"}), 400
    result = list(index.search(word))
    return jsonify({"results": result})

@app.route('/boolean_search', methods=['GET'])
def boolean_search():
    query = request.args.get("query")
    if not query:
        return jsonify({"error": "Query parameter 'query' is required"}), 400
    result = list(index.boolean_search(query))
    return jsonify({"results": result})

@app.route('/bm25_search', methods=['GET'])
def bm25_search():
    query = request.args.get("query")
    if not query:
        return jsonify({"error": "Query parameter 'query' is required"}), 400
    results = index.bm25_search(query)
    formatted_results = [{"doc_id": doc_id, "score": score} for doc_id, score in results]
    return jsonify({"results": formatted_results})

@app.route('/fuzzy_search', methods=['GET'])
def fuzzy_search():
    word = request.args.get("word")
    if not word:
        return jsonify({"error": "Query parameter 'word' is required"}), 400
    result = list(index.fuzzy_search(word))
    return jsonify({"results": result})

@app.route('/wildcard_search', methods=['GET'])
def wildcard_search():
    pattern = request.args.get("pattern")
    if not pattern:
        return jsonify({"error": "Query parameter 'pattern' is required"}), 400
    result = list(index.wildcard_search(pattern))
    return jsonify({"results": result})

if __name__ == '__main__':
    app.run(debug=True)
Production Enhancements & Deployment Strategies
To deploy the search engine API in production, several enhancements and best
practices should be considered:
Authentication & Authorization
Secure Endpoints:
Use token-based authentication (e.g., Flask-JWT-Extended) or OAuth.
User Management:
Maintain a user database to restrict and monitor access.
Pagination & Caching
Pagination:
Introduce query parameters for page and limit to paginate results.
Caching:
Use Flask-Caching or Redis to cache frequent queries, reducing load on the
database.
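For pagination, a minimal helper (our own sketch) that slices a ranked result list and reports metadata:

```python
def paginate(results, page=1, limit=10):
    """Return one page of results plus metadata; pages are 1-based."""
    start = (page - 1) * limit
    return {
        "page": page,
        "limit": limit,
        "total": len(results),
        "results": results[start:start + limit],
    }

print(paginate(list(range(25)), page=3, limit=10))
```

In the Flask endpoints, page and limit would come from request.args with sensible defaults before calling such a helper on the ranked results.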
Logging & Monitoring
Logging:
Utilize Python’s logging module or services like Sentry to capture errors.
Monitoring:
Employ monitoring tools (e.g., Prometheus, Grafana) to track performance
metrics.
Containerization and WSGI Deployment
Docker Containerization:
Create a Dockerfile for consistent deployment.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "your_flask_app.py"]
Production WSGI Server:
Use Gunicorn to serve the Flask app.
gunicorn -w 4 -b 0.0.0.0:5000 your_flask_app:app
Reverse Proxy:
Configure Nginx as a reverse proxy to handle SSL termination, load
balancing, and static file serving.
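A minimal Nginx server block for this setup might look like the following (the server name, port, and header choices are illustrative, not prescriptive):

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://127.0.0.1:5000;  # Gunicorn bind address from above
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```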
Scaling Out
Horizontal Scaling:
Use Docker Compose or Kubernetes for managing multiple instances.
Database Considerations:
For higher traffic, consider transitioning from SQLite to PostgreSQL or
Elasticsearch.
Conclusion
This document details the journey from understanding the concept of an inverted
index to developing a production-ready search engine API. Key takeaways include:
Inverted Index: A powerful data structure for fast text searches.
Advanced Features: Implementations of boolean, phrase, fuzzy, and
wildcard searches.
Ranking Methods: Enhancing search relevance using TF-IDF and BM25.
Scalability: Employing disk-based storage (SQLite) and multiprocessing.
Flask API: Exposing search functionality through a RESTful web service.
Production Deployment: Integrating authentication, pagination, caching, and
containerization for a robust, scalable system.
By following these guidelines and using the provided code examples, you can build
and deploy an efficient and scalable search engine solution.