
[KnowledgeBase] Rich Text KB Sources Are Not Chunked #13834

@Gordon-BP

Description

Make sure the issue is related to code located in this repository.

  • I confirm that the reported bug or feature request is not related to Botpress on premise version (v12 and below)
  • I confirm that the reported bug or feature request is not related to the Botpress Studio
  • I confirm that the reported bug or feature request is not related to the Botpress Dashboard

Description of the bug or feature request

Issue Summary

If you copy-paste an enormous amount of text into a Rich Text KB source, the entire text is stored as a single chunk and stuffed into every KB search, slowing searches down and driving costs through the roof.
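For a rough sense of scale, here is a back-of-the-envelope sketch (it assumes the common ~4 characters per token heuristic and a ~270,000-character novel; both figures are assumptions, not measurements):

```ts
// Back-of-the-envelope estimate, assuming ~4 characters per token
// (a common heuristic for English text; real tokenizer counts vary).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// The Great Gatsby's plain text is on the order of 270,000 characters
// (assumed size, for illustration only).
const novel = 'x'.repeat(270_000)

console.log(estimateTokens(novel)) // ≈ 67,500 tokens — close to the ~70k observed below
```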

Steps to Reproduce

  1. Go to Project Gutenberg and download The Great Gatsby. Save it as a markdown file. (A scripted version of this step is sketched after the list.)
  2. Open up a new bot, or load from this JSON, with two knowledge bases: Rich Text and File
  3. Upload the markdown file from step 1 into the "File" KB


  4. Copy/paste the text from the markdown file into a rich text source in the "Rich Text" KB


  5. In your main flow, make two nodes, "Search file" and "Search rich text". Disable KB search on the start node.
  • Each node gets two cards: one card to query the KB and one text card to say the answer
  • Scope the search so that "Search file" only searches the "File" KB and "Search rich text" only searches the "Rich Text" KB


  6. Set your knowledge agent's model to GPT-4.1 Mini so you don't go broke
  7. Connect the start node to the "Search rich text" node and ask the bot "Who is Tom Buchanan?" Look at the logs to see how many chunks/tokens were used. Mine used 70,451 tokens!


  8. Now connect the start node to the "Search file" node and ask "Who is Nick Carraway?" Look at the logs to see how many chunks/tokens were used. Mine used only 10,221 tokens this time!

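If you'd rather script step 1, a minimal sketch (it assumes Project Gutenberg ebook #64317 is The Great Gatsby and that the plain-text cache URL below is still live — verify both before relying on it):

```ts
// Sketch of step 1 in code. Assumes Gutenberg ebook #64317 is The Great
// Gatsby and that this cache URL is current. Requires Node 18+ (ESM).
import { writeFile } from 'node:fs/promises'

const url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt'
const text = await (await fetch(url)).text()
await writeFile('the-great-gatsby.md', text)
```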

Desired Outcome

Since rich text sources get turned into HTML and uploaded to S3 anyway, why not treat them like any other file and chunk them before indexing? It would be great if there were no functional difference between plain text in a rich text source and the same text saved in a markdown file and uploaded.
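To make the ask concrete, here is a minimal sketch of what pre-indexing chunking could look like (hypothetical helper names and chunk sizes, not the actual Botpress pipeline): strip the stored HTML down to plain text, then split it into fixed-size, overlapping chunks and index each one, the same way uploaded files are handled.

```ts
// Hypothetical sketch only — not the actual Botpress indexing code.
// Splits rich-text content into fixed-size, overlapping chunks so a KB
// search retrieves a few relevant chunks instead of the whole document.

const CHUNK_SIZE = 2_000 // characters per chunk (assumed value)
const CHUNK_OVERLAP = 200 // overlap so sentences straddling a boundary survive

function htmlToPlainText(html: string): string {
  // Naive tag stripping for illustration; a real pipeline would use a parser.
  return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
}

function chunkText(text: string): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += CHUNK_SIZE - CHUNK_OVERLAP) {
    chunks.push(text.slice(start, start + CHUNK_SIZE))
  }
  return chunks
}

// Usage: index each chunk individually, as is already done for uploaded files.
const richTextHtml = '<p>In my younger and more vulnerable years...</p>'
const chunks = chunkText(htmlToPlainText(richTextHtml))
console.log(`${chunks.length} chunk(s) to index`)
```

With overlap, any sentence that crosses a chunk boundary still appears whole in at least one chunk, so retrieval quality shouldn't suffer at the seams.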
