A high-performance hybrid web search engine implemented in Rust, combining efficient inverted indexing with state-of-the-art semantic search capabilities. It supports indexing millions of documents with limited memory and provides fast multilingual query processing through both command-line and web interfaces.
This search engine was built from scratch to demonstrate core information retrieval principles and modern hybrid search techniques. It features:
- Hybrid Search: Combines BM25 lexical search with dense vector semantic search.
- Cross-Encoder Reranking: Improves result precision by reranking top candidates using a cross-encoder model.
- Memory-Efficient Indexing: Processes large datasets using block-based indexing and k-way merging.
- Fast Query Processing: Low-latency queries via caching and variable-byte compression.
- Multilingual Support: Unicode-aware tokenization.
- Dual Interfaces: Both command-line and web-based interaction.
- Centralized Configuration: All settings managed via a single TOML configuration file.
- Python Interop: Seamless integration with Python for deep learning models (Sentence Transformers).
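For a concrete picture of the interop, here is a minimal sketch of calling sentence-transformers from Rust through PyO3 to embed a query. The helper name and inline model string are illustrative assumptions; the project's actual bridge logic lives in python/semantic_search/.

```rust
use pyo3::prelude::*;

/// Hypothetical helper: embed a query by calling the Python
/// sentence-transformers package through PyO3.
fn embed_query(query: &str) -> PyResult<Vec<f32>> {
    Python::with_gil(|py| {
        let st = py.import("sentence_transformers")?;
        let model = st
            .getattr("SentenceTransformer")?
            .call1(("msmarco-MiniLM-L6-cos-v5",))?;
        // encode() returns a numpy array; tolist() turns it into a
        // Python list we can extract into a Rust Vec<f32>.
        model
            .call_method1("encode", (query,))?
            .call_method0("tolist")?
            .extract()
    })
}
```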
The system has been tested on the MS MARCO datasets:
- Passage-ranking: 8.8M documents, indexed in ~100 seconds using <600MB RAM
- Document-ranking: 3.2M documents, indexed in ~600 seconds using <800MB RAM
The search engine consists of three main components:
Indexer: Builds an inverted index from TSV datasets by:
- Reading and tokenizing documents with a language-agnostic tokenizer
- Building in-memory inverted index mappings (term → postings list)
- Periodically flushing partial indexes to disk when memory threshold is reached
- Supporting multiple output formats: text, binary, and variable-byte encoding
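A minimal sketch of this block-and-flush loop, with whitespace splitting standing in for the real Unicode-aware tokenizer and a caller-supplied `flush` standing in for the partial-index writer:

```rust
use std::collections::HashMap;

/// In-memory partial index for one block: term -> (doc_id, tf) postings.
type PartialIndex = HashMap<String, Vec<(u32, u32)>>;

fn index_blocks<'a>(
    docs: impl Iterator<Item = (u32, &'a str)>,
    block_size: usize,
    mut flush: impl FnMut(&PartialIndex), // writes a partial index to disk
) {
    let mut index = PartialIndex::new();
    let mut docs_in_block = 0;
    for (doc_id, text) in docs {
        // Count term frequencies within this document.
        let mut tf: HashMap<&str, u32> = HashMap::new();
        for term in text.split_whitespace() {
            *tf.entry(term).or_insert(0) += 1;
        }
        for (term, f) in tf {
            index.entry(term.to_string()).or_default().push((doc_id, f));
        }
        docs_in_block += 1;
        // Threshold reached: flush the partial index and start a new block.
        if docs_in_block == block_size {
            flush(&index);
            index.clear();
            docs_in_block = 0;
        }
    }
    if !index.is_empty() {
        flush(&index); // final partial block
    }
}
```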
Merger: Merges partial indexes into a final consolidated index using:
- K-way merge algorithm with min-heap for efficient processing (sketched after this list)
- Delta encoding (d-gap) to compress document IDs
- Lexicon construction for fast term lookup
- Pre-computing BM25 statistics
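The heart of the merge can be sketched with Rust's `BinaryHeap`, wrapped in `Reverse` to behave as a min-heap. This toy version merges in-memory runs of `(term, doc_id)` postings rather than the on-disk blocks the real merger streams:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted runs of (term, doc_id) postings into one sorted
/// stream, always popping the globally smallest next posting.
fn kway_merge(runs: &[Vec<(String, u32)>]) -> Vec<(String, u32)> {
    let mut heap = BinaryHeap::new(); // max-heap; Reverse makes it a min-heap
    let mut pos = vec![0usize; runs.len()];
    for (r, run) in runs.iter().enumerate() {
        if let Some(p) = run.first() {
            heap.push(Reverse((p.clone(), r)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((posting, r))) = heap.pop() {
        merged.push(posting);
        pos[r] += 1;
        if let Some(p) = runs[r].get(pos[r]) {
            heap.push(Reverse((p.clone(), r))); // refill from the same run
        }
    }
    merged
}

fn main() {
    let runs = vec![
        vec![("apple".to_string(), 1), ("cat".to_string(), 4)],
        vec![("apple".to_string(), 2), ("bee".to_string(), 3)],
    ];
    assert_eq!(
        kway_merge(&runs),
        vec![
            ("apple".to_string(), 1),
            ("apple".to_string(), 2),
            ("bee".to_string(), 3),
            ("cat".to_string(), 4),
        ]
    );
}
```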
Query Processor: Processes user queries and returns ranked results by:
- Lexical Search: Standard BM25 ranking using the inverted index.
- Semantic Search: Dense vector retrieval using pre-computed embeddings (via Python/PyO3).
- Hybrid Reranking: Re-scoring top results using a Cross-Encoder model for higher accuracy.
- Caching: LRU caching for inverted lists and document content.
- Snippet Generation: Direct file seeking for O(1) snippet extraction.
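The snippet step amounts to a single `seek` plus one bounded read. A sketch, assuming the byte `offset` comes from the document table built at index time:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// O(1) snippet extraction: jump straight to the document's stored byte
/// offset in the collection file instead of scanning the file.
fn read_snippet(path: &str, offset: u64, snippet_len: usize) -> std::io::Result<String> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; snippet_len];
    let n = file.read(&mut buf)?;
    buf.truncate(n);
    // Lossy conversion guards against cutting a multi-byte UTF-8 char.
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```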
Prerequisites:
- Rust with the 2024 edition (rustc/Cargo 1.85.0 or later)
- Python 3.10+ (for semantic search features)
To enable semantic search, you need to set up a Python environment with the required dependencies:
```bash
# Install dependencies
uv sync
source .venv/bin/activate
```

```bash
# Clone the repository
git clone https://github.com/JDScript/SearchEngine-rs.git
cd SearchEngine-rs

# Build in release mode (recommended for performance)
cargo build --release
```

The system is configured via `config/search_engine.toml`. This file controls all aspects of indexing, merging, and querying.
Example configuration:
```toml
[indexer]
dataset = "./data/msmarco-docs.tsv"
output = "./index"
format = "vbyte"
block_size = 50000
buffer_size = 8_388_608
[merger]
input = "./index/temporary"
output = "./index"
input_format = "vbyte"
output_format = "vbyte"
store_diff = true
chunk_size = 8192
output_entry_size = 131072
docs = "./index/docs.txt"
bm25_output = "./index/bm25.bin"
bm25_k = 0.9
bm25_b = 0.4
[query]
lexicon = "./index/storage_vbyte.txt"
docs = "./index/docs.txt"
ids = "./index/index.vbyte"
freqs = "./index/freqs.vbyte"
format = "vbyte"
collection = "./data/msmarco-docs.tsv"
bm25 = "./index/bm25.bin"
[query.semantic]
model = "msmarco-MiniLM-L6-cos-v5"
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
embeddings_path = "corpus_embeddings_msmarco-MiniLM-L6-cos-v5.pt"
doc_ids_path = "embedding_doc_ids.pt"
device = "cuda:0" # Use "cpu" if no GPU is available
port = 8080
snippet_len = 200
```

Run the indexer, then the merger:

```bash
cargo run --bin indexer --release -- --config config/search_engine.toml
cargo run --bin merger --release -- --config config/search_engine.toml
```

If you want to use semantic search, you need to generate embeddings for your dataset:
```bash
# Ensure your virtual environment is active
source .venv/bin/activate
# Run the embedding generation script
python -m python.semantic_search.create_embedding
```

Note: This script reads the `docs.txt` generated by the indexer to ensure alignment between the Rust index and the Python embeddings.
Start the query processor:

```bash
cargo run --bin query_processor --release -- --config config/search_engine.toml
```

Once running, you can use the following commands:
- `query`: Standard BM25 search.
- `/mode sem`: Switch to Semantic Search mode.
- `/mode dis`: Switch to Disjunctive (OR) mode.
- `/mode con`: Switch to Conjunctive (AND) mode.
- `/rerank on`: Enable Cross-Encoder reranking.
- `/rerank off`: Disable Cross-Encoder reranking.
If port is set to a non-zero value in config/search_engine.toml (e.g., 8080), the query processor will start a web server.
Open http://localhost:8080 in your browser to use the search interface.
Indexing performance by output format:

| Format | Time | Total Size | Compression | Peak RAM |
|---|---|---|---|---|
| Text | 110.12s | 3.86GB | 100% | ~600MB |
| Binary | 89.38s | 6.15GB | 159% | ~600MB |
| VByte | 102.68s | 2.07GB | 54% | ~600MB |
Merging performance by format:

| Input → Output | Time | Final Size (Index + Freqs) | Peak RAM |
|---|---|---|---|
| Binary → Binary | 13.98s | 2.86GB | ~700MB |
| VByte → VByte | 28.94s | 830.5MB | ~700MB |
| Text → Text | 26.88s | 1.83GB | ~700MB |
Query processing performance:

| Query | Mode | Load Time* | Search Time* | Total Time* |
|---|---|---|---|---|
| Hello World | AND | 239ms (0ms) | 1.3ms (2.5ms) | 243ms (3ms) |
| Hello World | OR | 232ms (0ms) | 65ms (110ms) | 498ms (305ms) |
| Dog Cat Elephant | AND | 99ms (0ms) | 0.2ms (0.6ms) | 100ms (1ms) |
| Dog Cat Elephant | OR | 79ms (0ms) | 57ms (30ms) | 221ms (80ms) |
*Times shown as: first run (cached run)
Key Observations:
- Variable-byte encoding provides the best compression (~50% reduction) with acceptable performance
- Conjunctive (AND) queries are significantly faster than disjunctive (OR) queries
- Caching dramatically improves repeated query performance
- Common terms (e.g., "and", "the") significantly slow down disjunctive queries
- Block-based Indexing: Processes datasets larger than available RAM by flushing partial indexes periodically
- K-way Merging: Efficiently merges multiple partial indexes with minimal memory usage
- BM25 Ranking: Industry-standard probabilistic relevance ranking with tunable parameters (k=0.9, b=0.4)
- Variable-byte Encoding: Efficient compression for document IDs and term frequencies
- Delta Encoding: Further compresses sorted document IDs using gap encoding (see the d-gap sketch after this list)
- LRU Caching: Intelligent caching of frequently accessed inverted lists
- Direct Snippet Retrieval: O(1) snippet generation using stored document offsets
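The d-gap transform referenced above is a one-liner in each direction; a small round-trip sketch:

```rust
/// Delta-encode a sorted doc-ID list into gaps (first ID stored as-is).
fn to_gaps(doc_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    doc_ids
        .iter()
        .map(|&id| {
            let gap = id - prev;
            prev = id;
            gap
        })
        .collect()
}

/// Decode by prefix-summing the gaps back into absolute doc IDs.
fn from_gaps(gaps: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    gaps.iter()
        .map(|&g| {
            acc += g;
            acc
        })
        .collect()
}

fn main() {
    let ids = vec![3u32, 7, 8, 15, 100];
    let gaps = to_gaps(&ids);
    assert_eq!(gaps, vec![3, 4, 1, 7, 85]); // small gaps -> short vbyte codes
    assert_eq!(from_gaps(&gaps), ids);
}
```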
- Conjunctive (AND): Returns documents containing all query terms (intersection; sketched after this list)
- Disjunctive (OR): Returns documents containing any query term (union)
- Semantic: Returns documents based on vector similarity (cosine similarity)
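The performance gap between the two lexical modes comes down to how postings lists are combined: AND can advance with a two-pointer walk over sorted doc IDs, while OR must touch every posting in the union. A sketch of the intersection:

```rust
/// Conjunctive (AND) evaluation: intersect two sorted doc-ID lists
/// with a two-pointer walk.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    assert_eq!(intersect(&[1, 4, 7, 9], &[2, 4, 9, 11]), vec![4, 9]);
}
```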
The tokenizer handles multiple languages through Unicode-aware character processing:
- ASCII text (case-insensitive)
- Chinese, Japanese, Korean (CJK)
- Arabic, Hebrew, and other RTL scripts
- Cyrillic, Greek, and other alphabets
```
search_engine/
├── config/                      # Configuration files
│   └── search_engine.toml
├── src/
│   ├── bin/                     # Executable binaries
│   │   ├── indexer.rs           # Index builder
│   │   ├── merger.rs            # Index merger
│   │   └── query_processor.rs   # Query handler & web server
│   └── lib/                     # Shared library code
│       ├── config.rs            # Configuration loading
│       ├── encoding.rs          # Variable-byte encoding/decoding
│       ├── format.rs            # File format handlers
│       ├── inverted_list.rs     # Inverted list abstraction
│       ├── merger.rs            # K-way merge logic
│       ├── types.rs             # Core data structures
│       ├── utils.rs             # Utilities (tokenization, etc.)
│       └── mod.rs               # Library root
├── python/                      # Python scripts for semantic search
│   └── semantic_search/
│       ├── create_embedding.py  # Offline embedding generation
│       └── search.py            # Runtime search logic
├── static/                      # Web interface assets
│   ├── index.html               # Main web UI
│   └── assets/                  # CSS, JS, images
├── data/                        # Dataset and indexes
├── Cargo.toml                   # Rust package manifest
└── README.md                    # This file
```
The tokenizer uses a character-based approach that:
- Converts ASCII letters to lowercase
- Preserves non-ASCII characters (for multilingual support)
- Treats whitespace and punctuation as delimiters
- Supports Unicode character classes (General Punctuation, CJK Symbols)
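A simplified Rust version of these rules (the real tokenizer also treats Unicode General Punctuation and CJK Symbols as delimiters, which this sketch omits):

```rust
/// Character-based tokenizer: lowercase ASCII alphanumerics, preserve
/// non-ASCII characters, split on whitespace and ASCII punctuation.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut cur = String::new();
    for c in text.chars() {
        if c.is_whitespace() || c.is_ascii_punctuation() {
            if !cur.is_empty() {
                tokens.push(std::mem::take(&mut cur));
            }
        } else if c.is_ascii_alphanumeric() {
            cur.push(c.to_ascii_lowercase());
        } else if !c.is_ascii() {
            cur.push(c); // preserve CJK, Arabic, Cyrillic, ... as-is
        }
        // remaining ASCII control characters are dropped as delimiters
    }
    if !cur.is_empty() {
        tokens.push(cur);
    }
    tokens
}

fn main() {
    assert_eq!(tokenize("Hello, World!"), vec!["hello", "world"]);
    assert_eq!(tokenize("東京 2024"), vec!["東京", "2024"]);
}
```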
Each integer is encoded using 7 bits per byte:
- MSB (bit 7) = continuation flag (0 = more bytes follow, 1 = last byte)
- Lower 7 bits store value chunks in little-endian order
- Highly efficient for small numbers (<128 uses 1 byte)
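A sketch of this exact byte layout; note that the continuation convention described above is the inverse of LEB128-style varints (here the MSB is set on the last byte):

```rust
/// Encode one u32: low 7 bits per byte, least-significant chunk first,
/// MSB set only on the final byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    while n >= 0x80 {
        out.push((n & 0x7F) as u8); // continuation byte: MSB = 0
        n >>= 7;
    }
    out.push((n as u8) | 0x80); // last byte: MSB = 1
}

/// Decode one value; returns (value, bytes consumed).
fn vbyte_decode(bytes: &[u8]) -> (u32, usize) {
    let (mut n, mut shift, mut i) = (0u32, 0, 0);
    loop {
        let b = bytes[i];
        n |= ((b & 0x7F) as u32) << shift;
        i += 1;
        if b & 0x80 != 0 {
            return (n, i); // MSB set => last byte
        }
        shift += 7;
    }
}

fn main() {
    let mut buf = Vec::new();
    vbyte_encode(127, &mut buf); // one byte: 0xFF
    vbyte_encode(300, &mut buf); // two bytes: 0x2C, 0x82
    assert_eq!(buf, vec![0xFF, 0x2C, 0x82]);
    let (a, used) = vbyte_decode(&buf);
    assert_eq!((a, used), (127, 1));
    let (b, _) = vbyte_decode(&buf[used..]);
    assert_eq!(b, 300);
}
```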
```
BM25(d, q) = Σ_{t∈q} IDF(t) · [f(t,d) · (k+1)] / [f(t,d) + k·(1 - b + b·|d|/avgdl)]
```

Where:
- f(t,d) = frequency of term t in document d
- |d| = document length
- avgdl = average document length in the collection
- k = 0.9, b = 0.4 (tuning parameters)
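The per-term contribution translates directly to code; a sketch in which IDF is an input, since the engine precomputes collection statistics at merge time:

```rust
/// One term's BM25 contribution, matching the formula above.
fn bm25_term(idf: f64, tf: f64, doc_len: f64, avgdl: f64) -> f64 {
    const K: f64 = 0.9; // term-frequency saturation
    const B: f64 = 0.4; // document-length normalization
    idf * (tf * (K + 1.0)) / (tf + K * (1.0 - B + B * doc_len / avgdl))
}

fn main() {
    // A term appearing 3 times in an average-length document:
    // 2.0 * (3 * 1.9) / (3 + 0.9) = 11.4 / 3.9 ≈ 2.923
    let score = bm25_term(2.0, 3.0, 100.0, 100.0);
    assert!((score - 2.923).abs() < 1e-3);
}
```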
- Single-threaded: Could benefit from parallel processing for multi-core systems
- Naive tokenization: No stemming, lemmatization, or stop-word removal
- Static memory control: Block size based on document count rather than actual memory usage
- Memory spikes: Very common terms may cause temporary memory pressure
- Sequential merging: Could use multi-level merging for hundreds of partial indexes
- No query optimization: Doesn't reorder terms or perform early termination
- Limited caching: Cache size is arbitrary rather than memory-aware
- Basic term matching: No phrase search, proximity queries, or wildcards
- Online BM25: Scores computed on-the-fly (could precompute partial scores)
- Parallel indexing with thread-based document partitioning
- Advanced tokenization with stemming and language detection
- Query optimization (term reordering, early termination)
- Phrase and proximity search support
- Score caching and pre-computation
- Dynamic memory-aware caching
Key Rust crates used:
- `axum` - Web framework for the HTTP server
- `tokio` - Async runtime
- `clap` - Command-line argument parsing
- `serde` / `serde_json` - Serialization
- `byteorder` - Binary I/O utilities
- `indicatif` - Progress bars
- `colored` - Terminal colors
- `pyo3` - Rust bindings for Python (used for semantic search)
See Cargo.toml for complete list.
This project is an academic assignment for CS-GY 6913: Web Search Engines at NYU.
Zeyu Jiang (zj2631, N16724747)
zeyu.jiang@nyu.edu
- Dr. Torsten Suel for course instruction
- Open-source Rust community for excellent libraries
- Claude for assistance with web interface and Rust programming guidance