SearchEngine-rs

A high-performance hybrid web search engine implemented in Rust, combining efficient inverted indexing with state-of-the-art semantic search capabilities. It supports indexing millions of documents with limited memory and provides fast multilingual query processing through both command-line and web interfaces.

Overview

This search engine was built from scratch to demonstrate core information retrieval principles and modern hybrid search techniques. It features:

  • Hybrid Search: Combines BM25 lexical search with dense vector semantic search.
  • Cross-Encoder Reranking: Improves result precision by reranking top candidates using a cross-encoder model.
  • Memory-efficient indexing of large datasets using block-based processing and k-way merging.
  • Fast query processing with caching optimization and variable-byte compression.
  • Multilingual support through Unicode-aware tokenization.
  • Dual interfaces for both command-line and web-based interaction.
  • Centralized Configuration: All settings managed via a single TOML configuration file.
  • Python Interop: Seamless integration with Python for deep learning models (Sentence Transformers).

The system has been tested on the MS MARCO datasets:

  • Passage-ranking: 8.8M documents, indexed in ~100 seconds using <600MB RAM
  • Document-ranking: 3.2M documents, indexed in ~600 seconds using <800MB RAM

Architecture

The search engine consists of three main components:

1. Indexer (src/bin/indexer.rs)

Builds an inverted index from TSV datasets by:

  • Reading and tokenizing documents with a language-agnostic tokenizer
  • Building in-memory inverted index mappings (term → postings list)
  • Periodically flushing partial indexes to disk when memory threshold is reached
  • Supporting multiple output formats: text, binary, and variable-byte encoding
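
For a rough idea of the block-and-flush approach, here is a compact, self-contained sketch with toy data and a made-up flush_partial_index stub; it is illustrative only and not the code in src/bin/indexer.rs:

use std::collections::BTreeMap;

// term -> postings list of (doc_id, term_frequency)
type PartialIndex = BTreeMap<String, Vec<(u32, u32)>>;

// Stand-in for the real on-disk writer of a partial index.
fn flush_partial_index(block_no: usize, index: &PartialIndex) {
    println!("flushing block {block_no}: {} terms", index.len());
}

fn main() {
    let docs: Vec<(u32, &str)> = vec![(0, "hello world"), (1, "hello rust"), (2, "search engine")];
    let block_size = 2; // cf. block_size in [indexer]; the example config uses 50_000 documents

    let mut index = PartialIndex::new();
    let (mut in_block, mut blocks) = (0usize, 0usize);
    for (doc_id, text) in &docs {
        // Count term frequencies for this document, then append to each postings list.
        let mut tf: BTreeMap<&str, u32> = BTreeMap::new();
        for tok in text.split_whitespace() {
            *tf.entry(tok).or_insert(0) += 1;
        }
        for (term, freq) in tf {
            index.entry(term.to_string()).or_default().push((*doc_id, freq));
        }
        in_block += 1;
        // Flush a sorted partial index to disk whenever the block is full.
        if in_block == block_size {
            flush_partial_index(blocks, &index);
            index.clear();
            in_block = 0;
            blocks += 1;
        }
    }
    if !index.is_empty() {
        flush_partial_index(blocks, &index);
    }
}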

2. Merger (src/bin/merger.rs)

Merges partial indexes into a final consolidated index using:

  • K-way merge algorithm with min-heap for efficient processing
  • Delta encoding (d-gap) to compress document IDs
  • Lexicon construction for fast term lookup
  • Pre-computing BM25 statistics
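
The toy example below illustrates the min-heap merge and d-gap encoding for a single term's postings; the real merger in src/bin/merger.rs works over whole partial indexes and also writes the lexicon and BM25 statistics:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Merge several sorted doc-id runs for one term and emit d-gaps.
fn merge_to_gaps(runs: &[Vec<u32>]) -> Vec<u32> {
    // Min-heap of (next doc_id, run index, position within that run).
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut gaps = Vec::new();
    let mut prev = 0u32;
    while let Some(Reverse((doc_id, run_idx, pos))) = heap.pop() {
        // Delta (d-gap) encoding: store the difference from the previous doc id,
        // so values stay small and compress well with variable-byte encoding.
        gaps.push(doc_id - prev);
        prev = doc_id;
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    gaps
}

fn main() {
    let runs = vec![vec![3, 9, 27], vec![5, 8], vec![6, 30]];
    // doc ids in order: 3, 5, 6, 8, 9, 27, 30  ->  gaps: 3, 2, 1, 2, 1, 18, 3
    println!("{:?}", merge_to_gaps(&runs));
}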

3. Query Processor (src/bin/query_processor.rs)

Processes user queries and returns ranked results by:

  • Lexical Search: Standard BM25 ranking using the inverted index.
  • Semantic Search: Dense vector retrieval using pre-computed embeddings (via Python/PyO3).
  • Hybrid Reranking: Re-scoring top results using a Cross-Encoder model for higher accuracy.
  • Caching: LRU caching for inverted lists and document content.
  • Snippet Generation: Direct file seeking for O(1) snippet extraction.
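
As an illustration of the direct-seek snippet path, here is a minimal sketch that assumes a byte offset per document is available (e.g., recorded alongside docs.txt); it is not the repository's actual implementation:

use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};

// Seek straight to a document's stored byte offset in the collection TSV and
// read a single line, instead of scanning the whole file.
fn read_snippet(collection: &str, byte_offset: u64, snippet_len: usize) -> std::io::Result<String> {
    let mut reader = BufReader::new(File::open(collection)?);
    reader.seek(SeekFrom::Start(byte_offset))?;
    let mut line = String::new();
    reader.read_line(&mut line)?;
    // Take at most snippet_len characters (cf. snippet_len in the config) so
    // multilingual text is never cut in the middle of a codepoint.
    Ok(line.chars().take(snippet_len).collect())
}

fn main() -> std::io::Result<()> {
    let snippet = read_snippet("./data/msmarco-docs.tsv", 0, 200)?;
    println!("{snippet}");
    Ok(())
}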

Installation

Prerequisites

  • Rust 2024 edition (Rust 1.85 or later)
  • Cargo (ships with the Rust toolchain)
  • Python 3.10+ (for semantic search features)

Python Environment Setup

To enable semantic search, you need to set up a Python environment with the required dependencies:

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

Build

# Clone the repository
git clone https://github.com/JDScript/SearchEngine-rs.git
cd SearchEngine-rs

# Build in release mode (recommended for performance)
cargo build --release

Configuration

The system is configured via config/search_engine.toml. This file controls all aspects of indexing, merging, and querying.

Example configuration:

[indexer]
dataset = "./data/msmarco-docs.tsv"
output = "./index"
format = "vbyte"
block_size = 50000
buffer_size = 8_388_608

[merger]
input = "./index/temporary"
output = "./index"
input_format = "vbyte"
output_format = "vbyte"
store_diff = true
chunk_size = 8192
output_entry_size = 131072
docs = "./index/docs.txt"
bm25_output = "./index/bm25.bin"
bm25_k = 0.9
bm25_b = 0.4

[query]
lexicon = "./index/storage_vbyte.txt"
docs    = "./index/docs.txt"
ids     = "./index/index.vbyte"
freqs   = "./index/freqs.vbyte"
format  = "vbyte"
collection = "./data/msmarco-docs.tsv"
bm25 = "./index/bm25.bin"

[query.semantic]
model = "msmarco-MiniLM-L6-cos-v5"
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
embeddings_path = "corpus_embeddings_msmarco-MiniLM-L6-cos-v5.pt"
doc_ids_path = "embedding_doc_ids.pt"
device = "cuda:0" # Use "cpu" if no GPU is available
port = 8080
snippet_len = 200
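
For reference, a minimal sketch of loading this file with serde. The struct layout and the use of the toml crate are assumptions made for illustration; the real loading lives in src/lib/config.rs:

use serde::Deserialize;

#[derive(Deserialize)]
struct IndexerConfig {
    dataset: String,
    output: String,
    block_size: usize,
}

#[derive(Deserialize)]
struct SearchEngineConfig {
    indexer: IndexerConfig,
    // [merger], [query] and [query.semantic] are omitted here; serde ignores unknown sections.
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config/search_engine.toml")?;
    let config: SearchEngineConfig = toml::from_str(&raw)?;
    println!(
        "indexing {} into {} (blocks of {} docs)",
        config.indexer.dataset, config.indexer.output, config.indexer.block_size
    );
    Ok(())
}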

Usage

Step 1: Index Your Dataset

cargo run --bin indexer --release -- --config config/search_engine.toml

Step 2: Merge Partial Indexes

cargo run --bin merger --release -- --config config/search_engine.toml

Step 3: Generate Embeddings (Optional, for Semantic Search)

If you want to use semantic search, you need to generate embeddings for your dataset.

# Ensure your virtual environment is active
source .venv/bin/activate

# Run the embedding generation script
python -m python.semantic_search.create_embedding

Note: This script reads the docs.txt generated by the indexer to ensure alignment between the Rust index and Python embeddings.

Step 4: Query the Index

Command-Line Interface

cargo run --bin query_processor --release -- --config config/search_engine.toml

Once running, you can use the following commands:

  • <query text>: Run a standard BM25 search in the current mode.
  • /mode sem: Switch to Semantic Search mode.
  • /mode dis: Switch to Disjunctive (OR) mode.
  • /mode con: Switch to Conjunctive (AND) mode.
  • /rerank on: Enable Cross-Encoder reranking.
  • /rerank off: Disable Cross-Encoder reranking.

Web Interface

If port is set to a non-zero value in config/search_engine.toml (e.g., 8080), the query processor will start a web server.

Open http://localhost:8080 in your browser to use the search interface.

Performance (Not Updated Yet)

Indexing (MS MARCO Passage-Ranking, 8.8M documents)

Format   Time      Total Size   Size vs. Text   Peak RAM
Text     110.12s   3.86GB       100%            ~600MB
Binary   89.38s    6.15GB       159%            ~600MB
VByte    102.68s   2.07GB       54%             ~600MB

Merging (MS MARCO Passage-Ranking)

Input → Output    Time     Final Size (Index + Freqs)   Peak RAM
Binary → Binary   13.98s   2.86GB                       ~700MB
VByte → VByte     28.94s   830.5MB                      ~700MB
Text → Text       26.88s   1.83GB                       ~700MB

Query Processing (MS MARCO Passage-Ranking)

Query              Mode   Load Time*    Search Time*    Total Time*
Hello World        AND    239ms (0ms)   1.3ms (2.5ms)   243ms (3ms)
Hello World        OR     232ms (0ms)   65ms (110ms)    498ms (305ms)
Dog Cat Elephant   AND    99ms (0ms)    0.2ms (0.6ms)   100ms (1ms)
Dog Cat Elephant   OR     79ms (0ms)    57ms (30ms)     221ms (80ms)

*Times shown as: first run (cached run)

Key Observations:

  • Variable-byte encoding provides the best compression (~50% reduction) with acceptable performance
  • Conjunctive (AND) queries are significantly faster than disjunctive (OR) queries
  • Caching dramatically improves repeated query performance
  • Common terms (e.g., "and", "the") significantly slow down disjunctive queries

Features

Core Functionality

  • Block-based Indexing: Processes datasets larger than available RAM by flushing partial indexes periodically
  • K-way Merging: Efficiently merges multiple partial indexes with minimal memory usage
  • BM25 Ranking: Industry-standard probabilistic relevance ranking with tunable parameters (k=0.9, b=0.4)
  • Variable-byte Encoding: Efficient compression for document IDs and term frequencies
  • Delta Encoding: Further compresses sorted document IDs using gap encoding
  • LRU Caching: Intelligent caching of frequently accessed inverted lists
  • Direct Snippet Retrieval: O(1) snippet generation using stored document offsets
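
As an example of the caching idea from the LRU Caching item above, here is a minimal sketch using the lru crate. That crate is not in the dependency list below, so treat this purely as an illustration of the access pattern, not the repository's cache:

use lru::LruCache;
use std::num::NonZeroUsize;

// Stand-in for the real disk read + vbyte decode of a postings list.
fn decode_postings_from_disk(_term: &str) -> Vec<u32> {
    vec![1, 5, 9]
}

fn main() {
    // Fixed-capacity cache of decoded postings lists, keyed by term. The capacity is an
    // entry count, not memory-aware, matching the note under Known Limitations.
    let mut cache: LruCache<String, Vec<u32>> = LruCache::new(NonZeroUsize::new(1024).unwrap());

    let term = "rust".to_string();
    if cache.get(&term).is_none() {
        cache.put(term.clone(), decode_postings_from_disk(&term));
    }
    let postings = cache.get(&term).unwrap();
    println!("{} postings for '{}'", postings.len(), term);
}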

Query Modes

  • Conjunctive (AND): Returns documents containing all query terms (intersection)
  • Disjunctive (OR): Returns documents containing any query term (union)
  • Semantic: Returns documents based on vector similarity (cosine similarity)
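
For example, conjunctive evaluation boils down to intersecting sorted postings lists. A minimal sketch (not the repository's query code):

// Conjunctive (AND) evaluation: intersect two sorted doc-id lists in one pass.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let dogs = vec![2, 7, 11, 42];
    let cats = vec![3, 7, 19, 42];
    assert_eq!(intersect(&dogs, &cats), vec![7, 42]); // docs containing both terms
}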

Multilingual Support

The tokenizer handles multiple languages through Unicode-aware character processing:

  • ASCII text (case-insensitive)
  • Chinese, Japanese, Korean (CJK)
  • Arabic, Hebrew, and other RTL scripts
  • Cyrillic, Greek, and other alphabets

Project Structure

search_engine/
├── config/               # Configuration files
│   └── search_engine.toml
├── src/
│   ├── bin/              # Executable binaries
│   │   ├── indexer.rs    # Index builder
│   │   ├── merger.rs     # Index merger
│   │   └── query_processor.rs  # Query handler & web server
│   └── lib/              # Shared library code
│       ├── config.rs     # Configuration loading
│       ├── encoding.rs   # Variable-byte encoding/decoding
│       ├── format.rs     # File format handlers
│       ├── inverted_list.rs  # Inverted list abstraction
│       ├── merger.rs     # K-way merge logic
│       ├── types.rs      # Core data structures
│       ├── utils.rs      # Utilities (tokenization, etc.)
│       └── mod.rs        # Library root
├── python/               # Python scripts for semantic search
│   └── semantic_search/
│       ├── create_embedding.py # Offline embedding generation
│       └── search.py           # Runtime search logic
├── static/               # Web interface assets
│   ├── index.html        # Main web UI
│   └── assets/           # CSS, JS, images
├── data/                 # Dataset and indexes
├── Cargo.toml            # Rust package manifest
└── README.md             # This file

Technical Details

Tokenization

The tokenizer uses a character-based approach that:

  1. Converts ASCII letters to lowercase
  2. Preserves non-ASCII characters (for multilingual support)
  3. Treats whitespace and punctuation as delimiters
  4. Supports Unicode character classes (General Punctuation, CJK Symbols)
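
A minimal sketch of a tokenizer following these four rules (illustrative only; the real implementation lives in src/lib/utils.rs):

// Delimiters: whitespace, ASCII punctuation, plus the Unicode blocks called out above.
fn is_delimiter(c: char) -> bool {
    c.is_whitespace()
        || c.is_ascii_punctuation()
        || ('\u{2000}'..='\u{206F}').contains(&c) // General Punctuation
        || ('\u{3000}'..='\u{303F}').contains(&c) // CJK Symbols and Punctuation
}

// Lowercase ASCII letters only; all other characters pass through untouched.
fn tokenize(text: &str) -> Vec<String> {
    text.split(is_delimiter)
        .filter(|t| !t.is_empty())
        .map(|t| t.chars().map(|c| c.to_ascii_lowercase()).collect())
        .collect()
}

fn main() {
    println!("{:?}", tokenize("Hello, World!"));      // ["hello", "world"]
    println!("{:?}", tokenize("搜索引擎、Rust 検索")); // ["搜索引擎", "rust", "検索"]
}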

Variable-byte Encoding

Each integer is encoded using 7 bits per byte:

  • MSB (bit 7) = continuation flag (0 = more bytes follow, 1 = last byte)
  • Lower 7 bits store value chunks in little-endian order
  • Highly efficient for small numbers (<128 uses 1 byte)
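
A self-contained sketch matching this byte layout (illustrative; the repository's version is in src/lib/encoding.rs):

// Encode one integer: 7 value bits per byte, least-significant group first,
// continuation bit (MSB) set only on the final byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    loop {
        let chunk = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(chunk | 0x80); // last byte: continuation bit = 1
            break;
        } else {
            out.push(chunk); // more bytes follow: continuation bit = 0
        }
    }
}

// Decode one integer, returning (value, bytes consumed).
fn vbyte_decode(bytes: &[u8]) -> (u32, usize) {
    let (mut value, mut shift, mut used) = (0u32, 0, 0);
    for &b in bytes {
        value |= ((b & 0x7F) as u32) << shift;
        shift += 7;
        used += 1;
        if b & 0x80 != 0 {
            break; // continuation bit set: this was the last byte
        }
    }
    (value, used)
}

fn main() {
    let mut buf = Vec::new();
    vbyte_encode(127, &mut buf); // fits in one byte: [0xFF]
    vbyte_encode(300, &mut buf); // two bytes: [0x2C, 0x82]
    assert_eq!(vbyte_decode(&buf), (127, 1));
    assert_eq!(vbyte_decode(&buf[1..]), (300, 2));
}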

BM25 Scoring

BM25(d, q) = Σ IDF(t) · [f(t,d) · (k+1)] / [f(t,d) + k·(1 - b + b·|d|/avgdl)]
             t∈q

Where:

  • f(t, d) = term frequency in document
  • |d| = document length
  • avgdl = average document length
  • k = 0.9, b = 0.4 (tuning parameters)
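
A per-term scoring helper mirroring this formula; idf is assumed to be pre-computed (the merger stores BM25 statistics in bm25.bin), and the full document score is the sum of these contributions over all query terms:

// Per-term BM25 contribution for one document, following the formula above.
fn bm25_term_score(idf: f64, tf: f64, doc_len: f64, avgdl: f64, k: f64, b: f64) -> f64 {
    idf * (tf * (k + 1.0)) / (tf + k * (1.0 - b + b * doc_len / avgdl))
}

fn main() {
    // Example with the default parameters from the config (k = 0.9, b = 0.4).
    let score = bm25_term_score(2.3, 3.0, 120.0, 64.0, 0.9, 0.4);
    println!("per-term BM25 contribution: {score:.4}");
}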

Known Limitations

Indexer

  • Single-threaded: Could benefit from parallel processing for multi-core systems
  • Naive tokenization: No stemming, lemmatization, or stop-word removal
  • Static memory control: Block size based on document count rather than actual memory usage

Merger

  • Memory spikes: Very common terms may cause temporary memory pressure
  • Sequential merging: Could use multi-level merging for hundreds of partial indexes

Query Processor

  • No query optimization: Doesn't reorder terms or perform early termination
  • Limited caching: Cache size is arbitrary rather than memory-aware
  • Basic term matching: No phrase search, proximity queries, or wildcard support
  • Online BM25: Scores computed on-the-fly (could precompute partial scores)

Future Improvements

  • Parallel indexing with thread-based document partitioning
  • Advanced tokenization with stemming and language detection
  • Query optimization (term reordering, early termination)
  • Phrase and proximity search support
  • Score caching and pre-computation
  • Dynamic memory-aware caching

Dependencies

Key Rust crates used:

  • axum - Web framework for HTTP server
  • tokio - Async runtime
  • clap - Command-line argument parsing
  • serde/serde_json - Serialization
  • byteorder - Binary I/O utilities
  • indicatif - Progress bars
  • colored - Terminal colors
  • pyo3 - Rust bindings for Python (used for semantic search)

See Cargo.toml for complete list.

License

This project is an academic assignment for CS-GY 6913: Web Search Engines at NYU.

Author

Zeyu Jiang (zj2631, N16724747)
zeyu.jiang@nyu.edu

Acknowledgments

  • Dr. Torsten Suel for course instruction
  • Open-source Rust community for excellent libraries
  • Claude for assistance with web interface and Rust programming guidance
