GitHub - breadrock1/doc-searcher: A document search project based on Rust and OpenSearch.

Doc-Search Metaverse project

Doc-Search is a simple and flexible document search application that uses Rust and OpenSearch to provide efficient full-text search over documents. The project aims to offer a straightforward way to index and search a large corpus of documents with the speed and accuracy of OpenSearch.

The main goal is to implement a simple yet powerful system for storing and indexing documents with search functionality (full-text, semantic and hybrid). I chose OpenSearch as the default search engine, but you may plug in your own backend, such as Tantivy or Qdrant, by implementing several async traits.
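The idea of swapping search backends behind a trait can be sketched as follows. This is a minimal illustration, not the project's actual interface: `Searcher`, `FoundPart`, `InMemorySearcher` and `top_hit` are hypothetical names, and the methods are synchronous to keep the sketch dependency-free, whereas the project describes its traits as async.

```rust
// Hypothetical names throughout: `Searcher`, `FoundPart`, `InMemorySearcher`
// and `top_hit` are for illustration only, not the project's real API.

/// One matching document part with its relevance score.
#[derive(Debug, Clone, PartialEq)]
struct FoundPart {
    doc_id: String,
    score: f32,
}

/// Backend-agnostic search interface (synchronous here for simplicity).
trait Searcher {
    fn fulltext(&self, query: &str, limit: usize) -> Vec<FoundPart>;
}

/// Toy backend: scores a document by how often the query term occurs in it.
struct InMemorySearcher {
    docs: Vec<(String, String)>, // (document id, content)
}

impl Searcher for InMemorySearcher {
    fn fulltext(&self, query: &str, limit: usize) -> Vec<FoundPart> {
        let needle = query.to_lowercase();
        let mut hits: Vec<FoundPart> = self
            .docs
            .iter()
            .filter_map(|(id, text)| {
                let count = text.to_lowercase().matches(needle.as_str()).count();
                (count > 0).then(|| FoundPart {
                    doc_id: id.clone(),
                    score: count as f32,
                })
            })
            .collect();
        // Highest score first, then keep at most `limit` results.
        hits.sort_by(|a, b| b.score.total_cmp(&a.score));
        hits.truncate(limit);
        hits
    }
}

/// Application code depends only on the trait, so the backend is swappable.
fn top_hit(engine: &dyn Searcher, query: &str) -> Option<String> {
    engine.fulltext(query, 1).into_iter().next().map(|p| p.doc_id)
}
```

A Tantivy- or Qdrant-backed type would implement the same trait, and callers such as `top_hit` would not need to change.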

The principal schema: architecture.png

Doc-Search includes the following sub-services:

  • Cache Service - API of caching service like Redis;
  • Metrics Service - API of metrics to Prometheus monitoring;
  • Storage Service - API (CRUD) of indexed folders and documents;
  • Searcher Service - API of searcher functionalities (fulltext, semantic, hybrid);
  • Embeddings Service (removed) - API of embeddings service if you would like to use own model.

Changelog:

OpenSearch instead of Elasticsearch: the Searcher and Storage services currently share a common implementation based on OpenSearch.

Removed custom embeddings functionality: after switching from Elasticsearch to OpenSearch, integrating a custom embeddings model is no longer necessary, because newer versions of OpenSearch provide an ML plugin with the required functionality (chunking and embedding). The Embeddings module has therefore been removed from the code base. When I add Qdrant support, this functionality will be added back into the infrastructure together with the Qdrant client implementation.

Features

Service based:

  • Rust Performance: Benefit from the speed and safety of Rust;
  • REST API: Easy-to-use REST API for searching documents and managing indexing;
  • Swagger: Swagger documentation for all available endpoints;
  • Remote logging: Send error and warning messages or other metrics to a remote server;
  • Docker Support: Easy deployment with Docker and docker-compose;
  • Caching Queries: Store data in a cache service like Redis or your own solution;
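Query caching of this kind can be sketched as a small trait with an in-memory implementation. The names `QueryCache`, `InMemoryCache` and `cached_search` are assumptions made for illustration; a Redis-backed type would implement the same trait, and the project's real cache interface may look different.

```rust
use std::collections::HashMap;

// Hypothetical names: `QueryCache`, `InMemoryCache` and `cached_search`
// are assumptions for illustration, not the project's real cache API.
trait QueryCache {
    fn get(&self, key: &str) -> Option<String>;
    fn set(&mut self, key: &str, value: String);
}

/// Simplest possible cache backend: a map held in process memory.
#[derive(Default)]
struct InMemoryCache {
    entries: HashMap<String, String>,
}

impl QueryCache for InMemoryCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }

    fn set(&mut self, key: &str, value: String) {
        self.entries.insert(key.to_string(), value);
    }
}

/// Return the cached result for `query`, or compute it with `search`
/// and store it so the next identical query skips the search engine.
fn cached_search<C, F>(cache: &mut C, query: &str, search: F) -> String
where
    C: QueryCache,
    F: FnOnce(&str) -> String,
{
    if let Some(hit) = cache.get(query) {
        return hit;
    }
    let result = search(query);
    cache.set(query, result.clone());
    result
}
```

This sketch has no expiry or invalidation; a real deployment would likely rely on the cache service's TTL support.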

Searching:

  • Full-Text Search: Quickly find documents by content using the chosen search engine;
  • Semantic Search: Fast semantic search via an external embeddings service;
  • Hybrid Search: Fast hybrid search via an external embeddings service;
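Hybrid search combines a full-text ranking with a semantic ranking. One common way to do that, shown here purely as an illustration, is to min-max normalize each score list and take a weighted sum. The function names and the `semantic_weight` parameter are assumptions; the source does not say how the project or its OpenSearch pipeline actually combines scores.

```rust
use std::collections::HashMap;

/// Min-max normalize scores into [0, 1]; if all scores are equal, each maps to 1.0.
fn normalize(scores: &[(String, f32)]) -> HashMap<String, f32> {
    let min = scores.iter().map(|(_, s)| *s).fold(f32::INFINITY, f32::min);
    let max = scores.iter().map(|(_, s)| *s).fold(f32::NEG_INFINITY, f32::max);
    scores
        .iter()
        .map(|(id, s)| {
            let n = if max > min { (*s - min) / (max - min) } else { 1.0 };
            (id.clone(), n)
        })
        .collect()
}

/// Weighted sum of normalized full-text and semantic scores.
/// A document missing from one list contributes 0 for that list.
/// `semantic_weight` is a hypothetical tuning knob in [0, 1].
fn hybrid_rank(
    fulltext: &[(String, f32)],
    semantic: &[(String, f32)],
    semantic_weight: f32,
) -> Vec<(String, f32)> {
    let mut combined: HashMap<String, f32> = HashMap::new();
    for (id, s) in &normalize(fulltext) {
        *combined.entry(id.clone()).or_insert(0.0) += (1.0 - semantic_weight) * *s;
    }
    for (id, s) in &normalize(semantic) {
        *combined.entry(id.clone()).or_insert(0.0) += semantic_weight * *s;
    }
    let mut ranked: Vec<(String, f32)> = combined.into_iter().collect();
    // Highest combined score first; ties are broken by id for a stable order.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

With `semantic_weight` set to 0.0 this reduces to the full-text ranking, and with 1.0 to the semantic ranking.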

Domain

There are the following domains:

domain
   |----> Document storage (core)
   |        |----> Index
   |        |       |----> Context: index management in vector storage
   |        |       |----> Services: IIndexStorage
   |        |----> Document
   |                |----> Context: splits a document into parts and stores them in vector storage
   |                |----> Services: IDocumentPartStorage
   |
   |----> Document searching (core)
   |        |----> Found document
   |        |       |----> Context: results of multiple search kinds
   |        |       |----> Services: ISearcher
   |        |----> Pagination
   |                |----> Context: pagination of found results
   |                |----> Services: IPAginator

And there are the following use cases:

usecase
   |----> Storage Use Case
   |        |----> CRUD of index and document
   |        |----> split a large document into parts to store
   |        |----> upload a file to storage and create a new task processing event
   |
   |----> Searching Use Case
   |        |----> search document parts using multiple algorithms
   |        |----> paginate found document part results
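Two of the use cases above, splitting a large document into parts and paginating found parts, can be sketched in a few lines. This is a simplified assumption: `split_into_parts` cuts on a fixed character count and `paginate` uses 0-based page numbers, while the project's real chunking and pagination rules are not shown in the source.

```rust
/// Split `text` into parts of at most `max_chars` characters.
/// Works on `char`s rather than bytes, so multi-byte UTF-8 text is never cut mid-character.
/// (Hypothetical helper; fixed-size chunking is an assumption.)
fn split_into_parts(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}

/// Return page `page` (0-based) of `items`, with `page_size` items per page.
/// A page past the end yields an empty vector.
fn paginate<T: Clone>(items: &[T], page: usize, page_size: usize) -> Vec<T> {
    items
        .iter()
        .skip(page.saturating_mul(page_size))
        .take(page_size)
        .cloned()
        .collect()
}
```

For example, splitting "abcdefgh" with `max_chars` of 3 gives the parts "abc", "def" and "gh"; page 0 with a page size of 2 then holds the first two parts and page 1 holds the last one.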

The context map:

+----------------+         +-----------------+
| StorageUseCase | <────── | SearcherUseCase |
+----------------+         +-----------------+
        |                           |
        ▼                           ▼
+----------------+         +-----------------+
| Storage Domain |         | Searcher Domain |
+----------------+         +-----------------+

Context data flow:

HTTP Request
     │
     ▼
HTTP Handler (ServerState)
     │
     ▼
ServerAppState
    ├── StorageUseCase (application)
    │       │
    │       ▼
    │    Storage (domain)
    │
    └── SearcherUseCase (application)
            │
            ▼
          Task (domain)

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Rust
  • Docker & docker-compose
  • Cache (Redis)
  • Opensearch

Quick Start

  1. See the scripts in docs/opensearch for how to load the ML cluster into a single node and set up the infrastructure (ingest and search pipelines, model deployment).
  2. Clone the repository
  3. Run cargo install --path . to build the project
  4. Set up the .env file with the service credentials
  5. Run cargo run --bin init-infrastructure to initialize the OpenSearch schemas
  6. Run cargo run --bin launch to launch the service

Features of project

Features to parse and store documents locally from the current service (not stable):

  • enable-unique-doc-id - enables generation of a unique document id based on the index and document ids.
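Deriving one id from an index id and a document id can be sketched with a deterministic hash. The example below uses 64-bit FNV-1a as a stand-in; the function names are hypothetical, and the source does not show which scheme the enable-unique-doc-id feature really uses.

```rust
/// FNV-1a 64-bit offset basis.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;

/// Fold `bytes` into a running 64-bit FNV-1a hash starting from `state`.
fn fnv1a(state: u64, bytes: &[u8]) -> u64 {
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(state, |h, b| (h ^ u64::from(*b)).wrapping_mul(PRIME))
}

/// Hypothetical helper: derive a stable id from the index id and document id.
/// A zero byte separates the two parts so that ("ab", "c") and ("a", "bc")
/// are hashed as different byte sequences.
fn unique_doc_id(index_id: &str, document_id: &str) -> String {
    let h = fnv1a(FNV_OFFSET, index_id.as_bytes());
    let h = fnv1a(h, &[0]);
    let h = fnv1a(h, document_id.as_bytes());
    // Fixed-width lowercase hex, 16 characters for 64 bits.
    format!("{:016x}", h)
}
```

A hand-rolled hash is used instead of the standard library's default hasher because that hasher's output is not guaranteed to stay the same across Rust releases, which would make stored ids unstable.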
