SearchEngine-rs

A high-performance hybrid web search engine implemented in Rust, combining efficient inverted indexing with state-of-the-art semantic search capabilities. It supports indexing millions of documents with limited memory and provides fast multilingual query processing through both command-line and web interfaces.

Overview

This search engine was built from scratch to demonstrate core information retrieval principles and modern hybrid search techniques. It features:

  • Hybrid Search: Combines BM25 lexical search with dense vector semantic search.
  • Cross-Encoder Reranking: Improves result precision by reranking top candidates using a cross-encoder model.
  • Memory-efficient indexing of large datasets using block-based processing and k-way merging.
  • Fast query processing with caching optimization and variable-byte compression.
  • Multilingual support through Unicode-aware tokenization.
  • Dual interfaces for both command-line and web-based interaction.
  • Centralized Configuration: All settings managed via a single TOML configuration file.
  • Python Interop: Seamless integration with Python for deep learning models (Sentence Transformers).

The system has been tested on the MS MARCO datasets:

  • Passage-ranking: 8.8M documents, indexed in ~100 seconds using <600MB RAM
  • Document-ranking: 3.2M documents, indexed in ~600 seconds using <800MB RAM

Architecture

The search engine consists of three main components:

1. Indexer (src/bin/indexer.rs)

Builds an inverted index from TSV datasets by:

  • Reading and tokenizing documents with a language-agnostic tokenizer
  • Building in-memory inverted index mappings (term → postings list)
  • Periodically flushing partial indexes to disk when memory threshold is reached
  • Supporting multiple output formats: text, binary, and variable-byte encoding
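
For a rough idea of the block-and-flush approach, here is a compact, self-contained sketch with toy data and a made-up flush_partial_index stub; it is illustrative only and not the code in src/bin/indexer.rs:

use std::collections::BTreeMap;

// term -> postings list of (doc_id, term_frequency)
type PartialIndex = BTreeMap<String, Vec<(u32, u32)>>;

// Stand-in for the real on-disk writer of a partial index.
fn flush_partial_index(block_no: usize, index: &PartialIndex) {
    println!("flushing block {block_no}: {} terms", index.len());
}

fn main() {
    let docs: Vec<(u32, &str)> = vec![(0, "hello world"), (1, "hello rust"), (2, "search engine")];
    let block_size = 2; // cf. block_size in [indexer]; the example config uses 50_000 documents

    let mut index = PartialIndex::new();
    let (mut in_block, mut blocks) = (0usize, 0usize);
    for (doc_id, text) in &docs {
        // Count term frequencies for this document, then append to each postings list.
        let mut tf: BTreeMap<&str, u32> = BTreeMap::new();
        for tok in text.split_whitespace() {
            *tf.entry(tok).or_insert(0) += 1;
        }
        for (term, freq) in tf {
            index.entry(term.to_string()).or_default().push((*doc_id, freq));
        }
        in_block += 1;
        // Flush a sorted partial index to disk whenever the block is full.
        if in_block == block_size {
            flush_partial_index(blocks, &index);
            index.clear();
            in_block = 0;
            blocks += 1;
        }
    }
    if !index.is_empty() {
        flush_partial_index(blocks, &index);
    }
}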

2. Merger (src/bin/merger.rs)

Merges partial indexes into a final consolidated index using:

  • K-way merge algorithm with min-heap for efficient processing
  • Delta encoding (d-gap) to compress document IDs
  • Lexicon construction for fast term lookup
  • Pre-computing BM25 statistics
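
The toy example below illustrates the min-heap merge and d-gap encoding for a single term's postings; the real merger in src/bin/merger.rs works over whole partial indexes and also writes the lexicon and BM25 statistics:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Merge several sorted doc-id runs for one term and emit d-gaps.
fn merge_to_gaps(runs: &[Vec<u32>]) -> Vec<u32> {
    // Min-heap of (next doc_id, run index, position within that run).
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut gaps = Vec::new();
    let mut prev = 0u32;
    while let Some(Reverse((doc_id, run_idx, pos))) = heap.pop() {
        // Delta (d-gap) encoding: store the difference from the previous doc id,
        // so values stay small and compress well with variable-byte encoding.
        gaps.push(doc_id - prev);
        prev = doc_id;
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    gaps
}

fn main() {
    let runs = vec![vec![3, 9, 27], vec![5, 8], vec![6, 30]];
    // doc ids in order: 3, 5, 6, 8, 9, 27, 30  ->  gaps: 3, 2, 1, 2, 1, 18, 3
    println!("{:?}", merge_to_gaps(&runs));
}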

3. Query Processor (src/bin/query_processor.rs)

Processes user queries and returns ranked results by:

  • Lexical Search: Standard BM25 ranking using the inverted index.
  • Semantic Search: Dense vector retrieval using pre-computed embeddings (via Python/PyO3).
  • Hybrid Reranking: Re-scoring top results using a Cross-Encoder model for higher accuracy.
  • Caching: LRU caching for inverted lists and document content.
  • Snippet Generation: Direct file seeking for O(1) snippet extraction.
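
As an illustration of the direct-seek snippet path, here is a minimal sketch that assumes a byte offset per document is available (e.g., recorded alongside docs.txt); it is not the repository's actual implementation:

use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};

// Seek straight to a document's stored byte offset in the collection TSV and
// read a single line, instead of scanning the whole file.
fn read_snippet(collection: &str, byte_offset: u64, snippet_len: usize) -> std::io::Result<String> {
    let mut reader = BufReader::new(File::open(collection)?);
    reader.seek(SeekFrom::Start(byte_offset))?;
    let mut line = String::new();
    reader.read_line(&mut line)?;
    // Take at most snippet_len characters (cf. snippet_len in the config) so
    // multilingual text is never cut in the middle of a codepoint.
    Ok(line.chars().take(snippet_len).collect())
}

fn main() -> std::io::Result<()> {
    let snippet = read_snippet("./data/msmarco-docs.tsv", 0, 200)?;
    println!("{snippet}");
    Ok(())
}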

Installation

Prerequisites

  • Rust 2024 edition (Rust 1.85 or later)
  • Cargo (ships with the Rust toolchain)
  • Python 3.10+ (for semantic search features)

Python Environment Setup

To enable semantic search, you need to set up a Python environment with the required dependencies:

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

Build

# Clone the repository
git clone https://github.com/JDScript/SearchEngine-rs.git
cd SearchEngine-rs

# Build in release mode (recommended for performance)
cargo build --release

Configuration

The system is configured via config/search_engine.toml. This file controls all aspects of indexing, merging, and querying.

Example configuration:

[indexer]
dataset = "./data/msmarco-docs.tsv"
output = "./index"
format = "vbyte"
block_size = 50000
buffer_size = 8_388_608

[merger]
input = "./index/temporary"
output = "./index"
input_format = "vbyte"
output_format = "vbyte"
store_diff = true
chunk_size = 8192
output_entry_size = 131072
docs = "./index/docs.txt"
bm25_output = "./index/bm25.bin"
bm25_k = 0.9
bm25_b = 0.4

[query]
lexicon = "./index/storage_vbyte.txt"
docs    = "./index/docs.txt"
ids     = "./index/index.vbyte"
freqs   = "./index/freqs.vbyte"
format  = "vbyte"
collection = "./data/msmarco-docs.tsv"
bm25 = "./index/bm25.bin"

[query.semantic]
model = "msmarco-MiniLM-L6-cos-v5"
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
embeddings_path = "corpus_embeddings_msmarco-MiniLM-L6-cos-v5.pt"
doc_ids_path = "embedding_doc_ids.pt"
device = "cuda:0" # Use "cpu" if no GPU is available
port = 8080
snippet_len = 200
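
For reference, a minimal sketch of loading this file with serde. The struct layout and the use of the toml crate are assumptions made for illustration; the real loading lives in src/lib/config.rs:

use serde::Deserialize;

#[derive(Deserialize)]
struct IndexerConfig {
    dataset: String,
    output: String,
    block_size: usize,
}

#[derive(Deserialize)]
struct SearchEngineConfig {
    indexer: IndexerConfig,
    // [merger], [query] and [query.semantic] are omitted here; serde ignores unknown sections.
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config/search_engine.toml")?;
    let config: SearchEngineConfig = toml::from_str(&raw)?;
    println!(
        "indexing {} into {} (blocks of {} docs)",
        config.indexer.dataset, config.indexer.output, config.indexer.block_size
    );
    Ok(())
}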

Usage

Step 1: Index Your Dataset

cargo run --bin indexer --release -- --config config/search_engine.toml

Step 2: Merge Partial Indexes

cargo run --bin merger --release -- --config config/search_engine.toml

Step 3: Generate Embeddings (Optional, for Semantic Search)

If you want to use semantic search, you need to generate embeddings for your dataset.

# Ensure your virtual environment is active
source .venv/bin/activate

# Run the embedding generation script
python -m python.semantic_search.create_embedding

Note: This script reads the docs.txt generated by the indexer to ensure alignment between the Rust index and Python embeddings.

Step 4: Query the Index

Command-Line Interface

cargo run --bin query_processor --release -- --config config/search_engine.toml

Once running, you can use the following commands:

  • <query text>: Run a standard BM25 search in the current mode.
  • /mode sem: Switch to Semantic Search mode.
  • /mode dis: Switch to Disjunctive (OR) mode.
  • /mode con: Switch to Conjunctive (AND) mode.
  • /rerank on: Enable Cross-Encoder reranking.
  • /rerank off: Disable Cross-Encoder reranking.

Web Interface

If port is set to a non-zero value in config/search_engine.toml (e.g., 8080), the query processor will start a web server.

Open http://localhost:8080 in your browser to use the search interface.

Performance (Not Updated Yet)

Indexing (MS MARCO Passage-Ranking, 8.8M documents)

Format   Time      Total Size   Size vs. Text   Peak RAM
Text     110.12s   3.86GB       100%            ~600MB
Binary   89.38s    6.15GB       159%            ~600MB
VByte    102.68s   2.07GB       54%             ~600MB

Merging (MS MARCO Passage-Ranking)

Input → Output    Time     Final Size (Index + Freqs)   Peak RAM
Binary → Binary   13.98s   2.86GB                       ~700MB
VByte → VByte     28.94s   830.5MB                      ~700MB
Text → Text       26.88s   1.83GB                       ~700MB

Query Processing (MS MARCO Passage-Ranking)

Query              Mode   Load Time*    Search Time*    Total Time*
Hello World        AND    239ms (0ms)   1.3ms (2.5ms)   243ms (3ms)
Hello World        OR     232ms (0ms)   65ms (110ms)    498ms (305ms)
Dog Cat Elephant   AND    99ms (0ms)    0.2ms (0.6ms)   100ms (1ms)
Dog Cat Elephant   OR     79ms (0ms)    57ms (30ms)     221ms (80ms)

*Times shown as: first run (cached run)

Key Observations:

  • Variable-byte encoding provides the best compression (~50% reduction) with acceptable performance
  • Conjunctive (AND) queries are significantly faster than disjunctive (OR) queries
  • Caching dramatically improves repeated query performance
  • Common terms (e.g., "and", "the") significantly slow down disjunctive queries

Features

Core Functionality

  • Block-based Indexing: Processes datasets larger than available RAM by flushing partial indexes periodically
  • K-way Merging: Efficiently merges multiple partial indexes with minimal memory usage
  • BM25 Ranking: Industry-standard probabilistic relevance ranking with tunable parameters (k=0.9, b=0.4)
  • Variable-byte Encoding: Efficient compression for document IDs and term frequencies
  • Delta Encoding: Further compresses sorted document IDs using gap encoding
  • LRU Caching: Intelligent caching of frequently accessed inverted lists
  • Direct Snippet Retrieval: O(1) snippet generation using stored document offsets
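
As an example of the caching idea from the LRU Caching item above, here is a minimal sketch using the lru crate. That crate is not in the dependency list below, so treat this purely as an illustration of the access pattern, not the repository's cache:

use lru::LruCache;
use std::num::NonZeroUsize;

// Stand-in for the real disk read + vbyte decode of a postings list.
fn decode_postings_from_disk(_term: &str) -> Vec<u32> {
    vec![1, 5, 9]
}

fn main() {
    // Fixed-capacity cache of decoded postings lists, keyed by term. The capacity is an
    // entry count, not memory-aware, matching the note under Known Limitations.
    let mut cache: LruCache<String, Vec<u32>> = LruCache::new(NonZeroUsize::new(1024).unwrap());

    let term = "rust".to_string();
    if cache.get(&term).is_none() {
        cache.put(term.clone(), decode_postings_from_disk(&term));
    }
    let postings = cache.get(&term).unwrap();
    println!("{} postings for '{}'", postings.len(), term);
}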

Query Modes

  • Conjunctive (AND): Returns documents containing all query terms (intersection)
  • Disjunctive (OR): Returns documents containing any query term (union)
  • Semantic: Returns documents based on vector similarity (cosine similarity)
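
For example, conjunctive evaluation boils down to intersecting sorted postings lists. A minimal sketch (not the repository's query code):

// Conjunctive (AND) evaluation: intersect two sorted doc-id lists in one pass.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let dogs = vec![2, 7, 11, 42];
    let cats = vec![3, 7, 19, 42];
    assert_eq!(intersect(&dogs, &cats), vec![7, 42]); // docs containing both terms
}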

Multilingual Support

The tokenizer handles multiple languages through Unicode-aware character processing:

  • ASCII text (case-insensitive)
  • Chinese, Japanese, Korean (CJK)
  • Arabic, Hebrew, and other RTL scripts
  • Cyrillic, Greek, and other alphabets

Project Structure

search_engine/
├── config/               # Configuration files
│   └── search_engine.toml
├── src/
│   ├── bin/              # Executable binaries
│   │   ├── indexer.rs    # Index builder
│   │   ├── merger.rs     # Index merger
│   │   └── query_processor.rs  # Query handler & web server
│   └── lib/              # Shared library code
│       ├── config.rs     # Configuration loading
│       ├── encoding.rs   # Variable-byte encoding/decoding
│       ├── format.rs     # File format handlers
│       ├── inverted_list.rs  # Inverted list abstraction
│       ├── merger.rs     # K-way merge logic
│       ├── types.rs      # Core data structures
│       ├── utils.rs      # Utilities (tokenization, etc.)
│       └── mod.rs        # Library root
├── python/               # Python scripts for semantic search
│   └── semantic_search/
│       ├── create_embedding.py # Offline embedding generation
│       └── search.py           # Runtime search logic
├── static/               # Web interface assets
│   ├── index.html        # Main web UI
│   └── assets/           # CSS, JS, images
├── data/                 # Dataset and indexes
├── Cargo.toml            # Rust package manifest
└── README.md             # This file

Technical Details

Tokenization

The tokenizer uses a character-based approach that:

  1. Converts ASCII letters to lowercase
  2. Preserves non-ASCII characters (for multilingual support)
  3. Treats whitespace and punctuation as delimiters
  4. Supports Unicode character classes (General Punctuation, CJK Symbols)
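
A minimal sketch of a tokenizer following these four rules (illustrative only; the real implementation lives in src/lib/utils.rs):

// Delimiters: whitespace, ASCII punctuation, plus the Unicode blocks called out above.
fn is_delimiter(c: char) -> bool {
    c.is_whitespace()
        || c.is_ascii_punctuation()
        || ('\u{2000}'..='\u{206F}').contains(&c) // General Punctuation
        || ('\u{3000}'..='\u{303F}').contains(&c) // CJK Symbols and Punctuation
}

// Lowercase ASCII letters only; all other characters pass through untouched.
fn tokenize(text: &str) -> Vec<String> {
    text.split(is_delimiter)
        .filter(|t| !t.is_empty())
        .map(|t| t.chars().map(|c| c.to_ascii_lowercase()).collect())
        .collect()
}

fn main() {
    println!("{:?}", tokenize("Hello, World!"));      // ["hello", "world"]
    println!("{:?}", tokenize("搜索引擎、Rust 検索")); // ["搜索引擎", "rust", "検索"]
}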

Variable-byte Encoding

Each integer is encoded using 7 bits per byte:

  • MSB (bit 7) = continuation flag (0 = more bytes follow, 1 = last byte)
  • Lower 7 bits store value chunks in little-endian order
  • Highly efficient for small numbers (<128 uses 1 byte)
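
A self-contained sketch matching this byte layout (illustrative; the repository's version is in src/lib/encoding.rs):

// Encode one integer: 7 value bits per byte, least-significant group first,
// continuation bit (MSB) set only on the final byte.
fn vbyte_encode(mut n: u32, out: &mut Vec<u8>) {
    loop {
        let chunk = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(chunk | 0x80); // last byte: continuation bit = 1
            break;
        } else {
            out.push(chunk); // more bytes follow: continuation bit = 0
        }
    }
}

// Decode one integer, returning (value, bytes consumed).
fn vbyte_decode(bytes: &[u8]) -> (u32, usize) {
    let (mut value, mut shift, mut used) = (0u32, 0, 0);
    for &b in bytes {
        value |= ((b & 0x7F) as u32) << shift;
        shift += 7;
        used += 1;
        if b & 0x80 != 0 {
            break; // continuation bit set: this was the last byte
        }
    }
    (value, used)
}

fn main() {
    let mut buf = Vec::new();
    vbyte_encode(127, &mut buf); // fits in one byte: [0xFF]
    vbyte_encode(300, &mut buf); // two bytes: [0x2C, 0x82]
    assert_eq!(vbyte_decode(&buf), (127, 1));
    assert_eq!(vbyte_decode(&buf[1..]), (300, 2));
}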

BM25 Scoring

BM25(d, q) = Σ IDF(t) · [f(t,d) · (k+1)] / [f(t,d) + k·(1 - b + b·|d|/avgdl)]
             t∈q

Where:

  • f(t, d) = term frequency in document
  • |d| = document length
  • avgdl = average document length
  • k = 0.9, b = 0.4 (tuning parameters)
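
A per-term scoring helper mirroring this formula; idf is assumed to be pre-computed (the merger stores BM25 statistics in bm25.bin), and the full document score is the sum of these contributions over all query terms:

// Per-term BM25 contribution for one document, following the formula above.
fn bm25_term_score(idf: f64, tf: f64, doc_len: f64, avgdl: f64, k: f64, b: f64) -> f64 {
    idf * (tf * (k + 1.0)) / (tf + k * (1.0 - b + b * doc_len / avgdl))
}

fn main() {
    // Example with the default parameters from the config (k = 0.9, b = 0.4).
    let score = bm25_term_score(2.3, 3.0, 120.0, 64.0, 0.9, 0.4);
    println!("per-term BM25 contribution: {score:.4}");
}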

Known Limitations

Indexer

  • Single-threaded: Could benefit from parallel processing for multi-core systems
  • Naive tokenization: No stemming, lemmatization, or stop-word removal
  • Static memory control: Block size based on document count rather than actual memory usage

Merger

  • Memory spikes: Very common terms may cause temporary memory pressure
  • Sequential merging: Could use multi-level merging for hundreds of partial indexes

Query Processor

  • No query optimization: Doesn't reorder terms or perform early termination
  • Limited caching: Cache size is arbitrary rather than memory-aware
  • Basic term matching: No phrase search, proximity queries, or wildcard support
  • Online BM25: Scores computed on-the-fly (could precompute partial scores)

Future Improvements

  • Parallel indexing with thread-based document partitioning
  • Advanced tokenization with stemming and language detection
  • Query optimization (term reordering, early termination)
  • Phrase and proximity search support
  • Score caching and pre-computation
  • Dynamic memory-aware caching

Dependencies

Key Rust crates used:

  • axum - Web framework for HTTP server
  • tokio - Async runtime
  • clap - Command-line argument parsing
  • serde/serde_json - Serialization
  • byteorder - Binary I/O utilities
  • indicatif - Progress bars
  • colored - Terminal colors
  • pyo3 - Rust bindings for Python (used for semantic search)

See Cargo.toml for complete list.

License

This project is an academic assignment for CS-GY 6913: Web Search Engines at NYU.

Author

Zeyu Jiang (zj2631, N16724747)
zeyu.jiang@nyu.edu

Acknowledgments

  • Dr. Torsten Suel for course instruction
  • Open-source Rust community for excellent libraries
  • Claude for assistance with web interface and Rust programming guidance
