The goal of this project is to create a system that can help users understand and extract information from long documents, such as insurance policies, by leveraging the capabilities of language models and vector databases.
Long Documents: The system will be designed to work with long documents, such as insurance policies, legal contracts, or technical manuals. These documents can be in PDF format and may contain complex language and structure.
- Going through the various options available in the LangChain documentation and understanding the APIs.
- Not all PDFs work out of the box; they need to contain extractable text rather than, say, scanned images.
- The application handles most PDFs, but support still needs to be built for the many other PDF formats.
- The chunking strategy should be exposed as a dropdown so the user can choose it (see the sketch after this list).
- AI evaluations of the application should be conducted to produce a measurable score of how well it is performing.
- Overall, this is a working general-purpose PDF upload-and-query application.
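As a rough sketch of how the chunking-strategy dropdown could work in Streamlit (the strategy names and chunk parameters below are hypothetical, not features of the current application):

```python
import streamlit as st
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Hypothetical mapping of user-facing names to text-splitter factories.
CHUNKING_STRATEGIES = {
    "Recursive (default)": lambda: RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ),
    "Fixed-size": lambda: CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
}

choice = st.selectbox("Chunking strategy", list(CHUNKING_STRATEGIES))
splitter = CHUNKING_STRATEGIES[choice]()
# Later: chunks = splitter.split_documents(documents)
```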
- Document Ingestion: The system will use a document loader to read and process long documents, splitting them into smaller, manageable chunks for better retrieval and understanding.
- Vector Store: A vector database will be used to store the document chunks, allowing for efficient retrieval based on semantic similarity. Qdrant will be used as the vector store.
- Language Model: The system will use a language model to generate responses based on user queries and the retrieved document chunks. OpenAI's GPT-3.5-turbo will be used for this purpose.
- User Interface: A web application will be built using Streamlit to provide an interactive interface for users to upload documents, ask questions, and receive answers based on the content of the documents.
- Retrieval-Augmented Generation (RAG): The system will implement a RAG approach, where the language model retrieves relevant information from the vector store before generating a response to the user's query. This allows for more accurate and context-aware answers.
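The retrieve-then-generate step can be illustrated with a minimal sketch (assuming an already-populated vector store; import paths vary by LangChain version):

```python
from langchain_openai import ChatOpenAI

def answer(vector_store, question: str) -> str:
    # Retrieve the chunks most semantically similar to the question.
    docs = vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Ground the model's answer in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return llm.invoke(prompt).content
```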
- LangChain provides a modular and flexible framework for building applications with language models, making it easier to integrate components such as document loaders, text splitters, vector stores, and language models.
- It supports a wide range of document loaders and vector stores, allowing easy integration with different data sources and storage solutions.
- LangChain has a strong community and extensive documentation, making it easier to find support and resources for building applications.
- LangChain's integrations with popular cloud providers and deployment platforms make it easy to deploy applications at scale, ensuring that the solution can handle large volumes of data and user queries efficiently.
This diagram represents the architecture of a Long Document Assistant system that allows users to interact with large documents using a chatbot interface.
- The user uploads a document.
- The user then interacts with the Long Doc Bot (built using Streamlit) to ask questions about the document.
- The Document Ingestion Engine (sketched in code after this list) handles:
  - Reading the document
  - Chunking it into manageable parts
  - Embedding each chunk using OpenAI embeddings
- The embedded chunks are stored in Qdrant, a vector database.
- The Query Engine:
  - Retrieves the most relevant chunks from Qdrant
  - Combines them with the user's question
  - Sends the combined context to OpenAI to generate a response
- The Long Doc Bot presents the generated answer back to the user, enabling conversational querying of large documents.
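A condensed sketch of the ingestion path described above (the file name and collection name are placeholders; imports assume recent langchain_community, langchain_openai, and langchain_qdrant packages):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Ingestion: read the PDF and split it into overlapping chunks.
documents = PyPDFLoader("policy.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(documents)

# Embed each chunk and store the vectors in the local Qdrant instance.
store = QdrantVectorStore.from_documents(
    chunks,
    OpenAIEmbeddings(),
    url="http://localhost:6333",
    collection_name="policy",
)
```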
- LangChain: A framework for building applications with language models.
- Qdrant: A vector database for storing and retrieving document chunks.
- OpenAI: Provides the language model for generating responses to user queries and the embedding model for document chunks.
- Streamlit: A web application framework for building interactive applications.
- Install Poetry for dependency management
- Install pyenv for Python version management
- Install Docker and docker-compose to run the Qdrant DB locally
- Place an OPENAI_API_KEY.txt file containing your OpenAI API key at the project root.
- Run `docker-compose up -d` to start the Qdrant DB, then view the Qdrant dashboard at http://localhost:6333/dashboard (a sample compose file is sketched after this list)
- Run `poetry install` to install the dependencies
- Run `poetry shell` to activate the virtual environment
- Run `jupyter lab` to start the Jupyter notebook server
- Run `sh build.sh` to update the README.md and output_longdocassist.html
- Run `sh run.sh` to start the Streamlit application
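For reference, a minimal docker-compose.yml for running Qdrant locally might look like the sketch below; the project's actual compose file may differ (the storage path is an assumption):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API and web dashboard
    volumes:
      - ./qdrant_storage:/qdrant/storage   # persist vectors across restarts
```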
- get_openai_api_key: Reads the OpenAI API key from a file.
- get_embeddings_model: Returns an instance of OpenAIEmbeddings for generating embeddings.
- get_llm: Returns an instance of ChatOpenAI for generating responses.
- re_create_collection: Deletes and recreates a Qdrant collection with the specified name and vector parameters.
- get_vector_store: Creates a QdrantVectorStore instance for storing and retrieving document chunks, using the specified collection name and embedding model.
- get_documents: Loads documents from a PDF file using PyPDFLoader.
- get_chunks: Splits the loaded documents into smaller chunks using RecursiveCharacterTextSplitter.
- ingest_document: Ingests the document chunks into a Qdrant vector store, using the file name as the collection name.
- get_user_query_response: Takes a user query and a vector store, retrieves relevant documents, and generates a response using the language model. It uses RetrievalQA to handle the retrieval and response generation.
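Based on the descriptions above, get_user_query_response might be wired together roughly like this (an illustrative sketch, not the project's exact code):

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

def get_user_query_response(query: str, vector_store) -> str:
    """Retrieve relevant chunks and generate an answer with the LLM."""
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    # The "stuff" chain places all retrieved chunks directly into the prompt.
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    )
    return qa.invoke({"query": query})["result"]
```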
- The Streamlit application allows users to upload a PDF document, which is then processed and stored in the Qdrant vector store.
- Users can ask questions about the document, and the application retrieves relevant information and generates responses using the language model.
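Putting it together, the Streamlit flow might look like the sketch below (assuming ingest_document returns the populated vector store; widget labels and the temp-file handling are illustrative):

```python
import streamlit as st

st.title("Long Doc Assistant")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Persist the upload so the PDF loader can read it from disk.
    with open(uploaded.name, "wb") as f:
        f.write(uploaded.getbuffer())
    vector_store = ingest_document(uploaded.name)  # helper described above

    query = st.text_input("Ask a question about the document")
    if query:
        st.write(get_user_query_response(query, vector_store))
```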