The goal of this project is to create a system that can help users understand and extract information from long documents, such as insurance policies, by leveraging the capabilities of language models and vector databases.
Long Documents: The system will be designed to work with long documents, such as insurance policies, legal contracts, or technical manuals. These documents can be in PDF format and may contain complex language and structure.
- Going through the various options available in the LangChain documentation and understanding the APIs.
- Not all PDFs work out of the box; they need to contain extractable text rather than, say, scanned images.
- The application handles most PDFs, but support still needs to be built for the many other PDF formats.
- The chunking strategy should be exposed as a dropdown so the user can choose it (see the sketch after this list).
- AI evaluations of the application should be conducted to produce a measurable score of how well it is performing.
- Overall, this is a working general-purpose PDF upload-and-query application.
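As a rough sketch of how the chunking-strategy dropdown could work in Streamlit (the strategy names and chunk parameters below are hypothetical, not features of the current application):

```python
import streamlit as st
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Hypothetical mapping of user-facing names to text-splitter factories.
CHUNKING_STRATEGIES = {
    "Recursive (default)": lambda: RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ),
    "Fixed-size": lambda: CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
}

choice = st.selectbox("Chunking strategy", list(CHUNKING_STRATEGIES))
splitter = CHUNKING_STRATEGIES[choice]()
# Later: chunks = splitter.split_documents(documents)
```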
- Document Ingestion: The system will use a document loader to read and process long documents, splitting them into smaller, manageable chunks for better retrieval and understanding.
- Vector Store: A vector database will be used to store the document chunks, allowing for efficient retrieval based on semantic similarity. Qdrant will be used as the vector store.
- Language Model: The system will use a language model to generate responses based on user queries and the retrieved document chunks. OpenAI's GPT-3.5-turbo will be used for this purpose.
- User Interface: A web application will be built using Streamlit to provide an interactive interface for users to upload documents, ask questions, and receive answers based on the content of the documents.
- Retrieval-Augmented Generation (RAG): The system will implement a RAG approach, where the language model retrieves relevant information from the vector store before generating a response to the user's query. This allows for more accurate and context-aware answers.
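The retrieve-then-generate step can be illustrated with a minimal sketch (assuming an already-populated vector store; import paths vary by LangChain version):

```python
from langchain_openai import ChatOpenAI

def answer(vector_store, question: str) -> str:
    # Retrieve the chunks most semantically similar to the question.
    docs = vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Ground the model's answer in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return llm.invoke(prompt).content
```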
- LangChain provides a modular and flexible framework for building applications with language models, making it easier to integrate components such as document loaders, text splitters, vector stores, and language models.
- It supports a wide range of document loaders and vector stores, allowing easy integration with different data sources and storage solutions.
- LangChain has a strong community and extensive documentation, making it easier to find support and resources for building applications.
- LangChain's integrations with popular cloud providers and deployment platforms make it easy to deploy applications at scale, ensuring that the solution can handle large volumes of data and user queries efficiently.
This diagram represents the architecture of a Long Document Assistant system that allows users to interact with large documents using a chatbot interface.
- The user uploads a document.
- The user then interacts with the Long Doc Bot (built using Streamlit) to ask questions about the document.
- The Document Ingestion Engine (sketched in code after this list) handles:
  - Reading the document
  - Chunking it into manageable parts
  - Embedding each chunk using OpenAI embeddings
- The embedded chunks are stored in Qdrant, a vector database.
- The Query Engine:
  - Retrieves the most relevant chunks from Qdrant
  - Combines them with the user's question
  - Sends the combined context to OpenAI to generate a response
- The Long Doc Bot presents the generated answer back to the user, enabling conversational querying of large documents.
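A condensed sketch of the ingestion path described above (the file name and collection name are placeholders; imports assume recent langchain_community, langchain_openai, and langchain_qdrant packages):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Ingestion: read the PDF and split it into overlapping chunks.
documents = PyPDFLoader("policy.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(documents)

# Embed each chunk and store the vectors in the local Qdrant instance.
store = QdrantVectorStore.from_documents(
    chunks,
    OpenAIEmbeddings(),
    url="http://localhost:6333",
    collection_name="policy",
)
```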
- LangChain: A framework for building applications with language models.
- Qdrant: A vector database for storing and retrieving document chunks.
- OpenAI: Provides the language model for generating responses to user queries and the embedding model for document chunks.
- Streamlit: A web application framework for building interactive applications.
- Install Poetry for dependency management
- Install pyenv for Python version management
- Install Docker and docker-compose to run the Qdrant DB locally
- Place an OPENAI_API_KEY.txt file containing your OpenAI API key at the project root.
- Run `docker-compose up -d` to start the Qdrant DB, then view the Qdrant dashboard at http://localhost:6333/dashboard (a sample compose file is sketched after this list)
- Run `poetry install` to install the dependencies
- Run `poetry shell` to activate the virtual environment
- Run `jupyter lab` to start the Jupyter notebook server
- Run `sh build.sh` to update the README.md and output_longdocassist.html
- Run `sh run.sh` to start the Streamlit application
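For reference, a minimal docker-compose.yml for running Qdrant locally might look like the sketch below; the project's actual compose file may differ (the storage path is an assumption):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API and web dashboard
    volumes:
      - ./qdrant_storage:/qdrant/storage   # persist vectors across restarts
```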
- get_openai_api_key: Reads the OpenAI API key from a file.
- get_embeddings_model: Returns an instance of OpenAIEmbeddings for generating embeddings.
- get_llm: Returns an instance of ChatOpenAI for generating responses.
- re_create_collection: Deletes and recreates a Qdrant collection with the specified name and vector parameters.
- get_vector_store: Creates a QdrantVectorStore instance for storing and retrieving document chunks, using the specified collection name and embedding model.
- get_documents: Loads documents from a PDF file using PyPDFLoader.
- get_chunks: Splits the loaded documents into smaller chunks using RecursiveCharacterTextSplitter.
- ingest_document: Ingests the document chunks into a Qdrant vector store, using the file name as the collection name.
- get_user_query_response: Takes a user query and a vector store, retrieves relevant documents, and generates a response using the language model. It uses RetrievalQA to handle the retrieval and response generation.
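Based on the descriptions above, get_user_query_response might be wired together roughly like this (an illustrative sketch, not the project's exact code):

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

def get_user_query_response(query: str, vector_store) -> str:
    """Retrieve relevant chunks and generate an answer with the LLM."""
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    # The "stuff" chain places all retrieved chunks directly into the prompt.
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    )
    return qa.invoke({"query": query})["result"]
```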
- The Streamlit application allows users to upload a PDF document, which is then processed and stored in the Qdrant vector store.
- Users can ask questions about the document, and the application retrieves relevant information and generates responses using the language model.
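Putting it together, the Streamlit flow might look like the sketch below (assuming ingest_document returns the populated vector store; widget labels and the temp-file handling are illustrative):

```python
import streamlit as st

st.title("Long Doc Assistant")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Persist the upload so the PDF loader can read it from disk.
    with open(uploaded.name, "wb") as f:
        f.write(uploaded.getbuffer())
    vector_store = ingest_document(uploaded.name)  # helper described above

    query = st.text_input("Ask a question about the document")
    if query:
        st.write(get_user_query_response(query, vector_store))
```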