Long Doc Assist

Problem Statement

Project Goal

The goal of this project is to create a system that can help users understand and extract information from long documents, such as insurance policies, by leveraging the capabilities of language models and vector databases.

Data Sources

Long Documents: The system will be designed to work with long documents, such as insurance policies, legal contracts, or technical manuals. These documents can be in PDF format and may contain complex language and structure.

Challenges faced

  • Navigating the many options in the LangChain documentation and understanding its APIs.
  • Not all PDFs work out of the box; they need to be in a parseable, text-heavy format.

Next Steps

  • The application handles most PDFs, but support still needs to be built for a wider range of PDF formats.
  • The chunking strategy should be exposed in a dropdown so the user can choose which strategy to apply.
  • AI evaluations should be conducted to produce a measurable score of how well the application performs.
  • Overall, this is a working general-purpose PDF upload-and-query application.

Solution Overview

Design Choices

  • Document Ingestion: The system will use a document loader to read and process long documents, splitting them into smaller, manageable chunks for better retrieval and understanding.
  • Vector Store: A vector database will be used to store the document chunks, allowing for efficient retrieval based on semantic similarity. Qdrant will be used as the vector store.
  • Language Model: The system will use a language model to generate responses based on user queries and the retrieved document chunks. OpenAI's GPT-3.5-turbo will be used for this purpose.
  • User Interface: A web application will be built using Streamlit to provide an interactive interface for users to upload documents, ask questions, and receive answers based on the content of the documents.
  • Retrieval-Augmented Generation (RAG): The system will implement a RAG approach, where the language model retrieves relevant information from the vector store before generating a response to the user's query. This allows for more accurate and context-aware answers.
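
As a rough illustration of the RAG flow described above (a conceptual sketch, not the repository's code; the `answer` function and its prompt wording are illustrative):

```python
# Conceptual RAG loop: retrieve semantically similar chunks from the vector
# store, then let the language model answer using only those chunks.
def answer(query: str, vector_store, llm) -> str:
    chunks = vector_store.similarity_search(query, k=4)    # retrieval
    context = "\n\n".join(c.page_content for c in chunks)  # augmentation
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content                      # generation
```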

Why LangChain?

  • LangChain provides a modular and flexible framework for building applications with language models, making it easier to integrate various components such as document loaders, text splitters, vector stores, and language models.
  • It supports various document loaders and vector stores, allowing for easy integration with different data sources and storage solutions.
  • LangChain has a strong community and extensive documentation, making it easier to find support and resources for building applications.
  • LangChain's integration with popular cloud providers and deployment platforms makes it easy to deploy applications at scale, ensuring that the solution can handle large volumes of data and user queries efficiently.

Architecture

Architecture Diagram

Architecture Description

This diagram represents the architecture of a Long Document Assistant system that allows users to interact with large documents using a chatbot interface.

🧑‍💻 User Interaction

  • The user uploads a document.
  • Then interacts with the Long Doc Bot (built using Streamlit) to ask questions about the document.

🧠 Document Ingestion

  • The Document Ingestion Engine handles:
    • Reading the document
    • Chunking it into manageable parts
    • Embedding each chunk using OpenAI embeddings
  • The embedded chunks are stored in Qdrant, a vector database.

🔍 Query Processing

  • The Query Engine:
    • Retrieves the most relevant chunks from Qdrant
    • Combines them with the user’s question
    • Sends the combined context to OpenAI to generate a response

🤖 Response Delivery

  • The Long Doc Bot presents the generated answer back to the user, enabling conversational querying of large documents.

Solution Implementation

Technology Stack

  • LangChain: A framework for building applications with language models.
  • Qdrant: A vector database for storing and retrieving document chunks.
  • OpenAI: A language model for generating responses based on user queries and creating embeddings for document chunks.
  • Streamlit: A web application framework for building interactive applications.

Development Environment

  • Install poetry for dependency management.
  • Install pyenv for Python version management.
  • Install docker and docker-compose to run the Qdrant DB locally.
  • Place an OPENAI_API_KEY.txt file containing your OpenAI API key at the project root.
  • Run docker-compose up -d to start the Qdrant DB, then view the Qdrant dashboard at http://localhost:6333/dashboard.
  • Run poetry install to install the dependencies.
  • Run poetry shell to activate the virtual environment.
  • Run jupyter lab to start the Jupyter notebook server.
  • Run sh build.sh to update README.md and output_longdocassist.html.
  • Run sh run.sh to start the Streamlit application.

OpenAI components

  • get_openai_api_key: Reads the OpenAI API key from a file.
  • get_embeddings_model: Returns an instance of OpenAIEmbeddings for generating embeddings.
  • get_llm: Returns an instance of ChatOpenAI for generating responses.
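
A minimal sketch of these helpers, assuming the langchain-openai package; the embedding model name is an assumption, while the key file location and GPT-3.5-turbo come from the sections above:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def get_openai_api_key(path: str = "OPENAI_API_KEY.txt") -> str:
    # Read the API key from the file placed at the project root.
    with open(path) as f:
        return f.read().strip()

def get_embeddings_model(api_key: str) -> OpenAIEmbeddings:
    # Embedding model name is an assumption, not taken from the repository.
    return OpenAIEmbeddings(api_key=api_key, model="text-embedding-3-small")

def get_llm(api_key: str) -> ChatOpenAI:
    # GPT-3.5-turbo, per the design choices above.
    return ChatOpenAI(api_key=api_key, model="gpt-3.5-turbo", temperature=0)
```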

Qdrant components

  • re_create_collection: Deletes and recreates a Qdrant collection with the specified name and vector parameters.
  • get_vector_store: Creates a QdrantVectorStore instance for storing and retrieving document chunks, using the specified collection name and embedding model.
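
A sketch of the Qdrant helpers, assuming the qdrant-client and langchain-qdrant packages; the local URL matches the development setup below, and the default vector size (1536, the dimensionality of the assumed embedding model) is illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(url="http://localhost:6333")

def re_create_collection(name: str, vector_size: int = 1536) -> None:
    # Drop the collection if it already exists, then create it fresh.
    if client.collection_exists(name):
        client.delete_collection(name)
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
    )

def get_vector_store(name: str, embeddings) -> QdrantVectorStore:
    # Wrap the Qdrant collection as a LangChain vector store.
    return QdrantVectorStore(client=client, collection_name=name, embedding=embeddings)
```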

Ingestion Engine components

  • get_documents: Loads documents from a PDF file using PyPDFLoader.
  • get_chunks: Splits the loaded documents into smaller chunks using RecursiveCharacterTextSplitter.
  • ingest_document: Ingests the document chunks into a Qdrant vector store, using the file name as the collection name.
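
A sketch of the ingestion pipeline, reusing the Qdrant helpers above; the chunk size and overlap are assumptions:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_documents(pdf_path: str):
    # PyPDFLoader yields one Document per PDF page.
    return PyPDFLoader(pdf_path).load()

def get_chunks(documents, chunk_size: int = 1000, chunk_overlap: int = 200):
    # Split pages into overlapping chunks for retrieval.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(documents)

def ingest_document(pdf_path: str, embeddings):
    # Use the file name as the collection name, as described above.
    collection = Path(pdf_path).stem
    re_create_collection(collection)
    store = get_vector_store(collection, embeddings)
    store.add_documents(get_chunks(get_documents(pdf_path)))
    return store
```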

Query Engine components

  • get_user_query_response: Takes a user query and a vector store, retrieves relevant documents, and generates a response using the language model. It uses RetrievalQA to handle the retrieval and response generation.
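
A sketch of the query helper built on RetrievalQA, as named above; the number of retrieved chunks (k=4) is an assumption:

```python
from langchain.chains import RetrievalQA

def get_user_query_response(query: str, vector_store, llm) -> str:
    # Retrieve relevant chunks and let the LLM answer from them.
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    )
    return qa.invoke({"query": query})["result"]
```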

Streamlit Application (User Interface)

  • The Streamlit application allows users to upload a PDF document, which is then processed and stored in the Qdrant vector store.
  • Users can ask questions about the document, and the application retrieves relevant information and generates responses using the language model.
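
A minimal sketch of the Streamlit glue, reusing the helpers above; the widget labels and temp-file handling are illustrative, and a real app would cache ingestion rather than re-run it on every interaction:

```python
import tempfile

import streamlit as st

st.title("Long Doc Assist")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Persist the upload so PyPDFLoader can read it from a file path.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded.read())
    api_key = get_openai_api_key()
    store = ingest_document(tmp.name, get_embeddings_model(api_key))
    question = st.text_input("Ask a question about the document")
    if question:
        st.write(get_user_query_response(question, store, get_llm(api_key)))
```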
