
[KnowledgeBase] Rich Text KB Sources Are Not Chunked #13834

@Gordon-BP

Description

Make sure the issue is related to code located in this repository.

  • I confirm that the reported bug or feature request is not related to Botpress on premise version (v12 and below)
  • I confirm that the reported bug or feature request is not related to the Botpress Studio
  • I confirm that the reported bug or feature request is not related to the Botpress Dashboard

Description of the bug or feature request

Issue Summary

If you copy-paste an enormous amount of text into a Rich Text KB source, the entire text is stored as a single chunk and stuffed into every KB search, slowing searches down and driving costs through the roof.
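For a rough sense of scale, here is a back-of-the-envelope sketch (it assumes the common ~4 characters per token heuristic and a ~270,000-character novel; both figures are assumptions, not measurements):

```ts
// Back-of-the-envelope estimate, assuming ~4 characters per token
// (a common heuristic for English text; real tokenizer counts vary).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// The Great Gatsby's plain text is on the order of 270,000 characters
// (assumed size, for illustration only).
const novel = 'x'.repeat(270_000)

console.log(estimateTokens(novel)) // ≈ 67,500 tokens — close to the ~70k observed below
```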

Steps to Reproduce

  1. Go to Project Gutenberg and download The Great Gatsby. Save it as a markdown file. (A scripted version of this step is sketched after the list.)
  2. Open up a new bot, or load from this JSON, with two knowledge bases: Rich Text and File
  3. Upload the markdown file from step 1 into the "File" KB


  4. Copy/paste the text from the markdown file into a rich text source in the "Rich Text" KB


  5. In your main flow, make two nodes, "Search file" and "Search rich text". Disable KB search on the start node.
  • Each node gets two cards: one card to query the KB and one text card to say the answer
  • Scope the search so that "Search file" only searches the "File" KB and "Search rich text" only searches the "Rich Text" KB


  6. Set your knowledge agent's model to GPT-4.1 Mini so you don't go broke
  7. Connect the start node to the "Search rich text" node and ask the bot "Who is Tom Buchanan?" Look at the logs to see how many chunks/tokens were used. Mine used 70,451 tokens!


  8. Now connect the start node to the "Search file" node and ask "Who is Nick Carraway?" Look at the logs to see how many chunks/tokens were used. Mine used only 10,221 tokens this time!

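If you'd rather script step 1, a minimal sketch (it assumes Project Gutenberg ebook #64317 is The Great Gatsby and that the plain-text cache URL below is still live — verify both before relying on it):

```ts
// Sketch of step 1 in code. Assumes Gutenberg ebook #64317 is The Great
// Gatsby and that this cache URL is current. Requires Node 18+ (ESM).
import { writeFile } from 'node:fs/promises'

const url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt'
const text = await (await fetch(url)).text()
await writeFile('the-great-gatsby.md', text)
```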

Desired Outcome

Since rich text sources get turned into HTML and uploaded to S3 anyway, why not treat them like any other file and chunk them before indexing? It would be great if there were no functional difference between plain text in a rich text source and the same text saved in a markdown file and uploaded.
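To make the ask concrete, here is a minimal sketch of what pre-indexing chunking could look like (hypothetical helper names and chunk sizes, not the actual Botpress pipeline): strip the stored HTML down to plain text, then split it into fixed-size, overlapping chunks and index each one, the same way uploaded files are handled.

```ts
// Hypothetical sketch only — not the actual Botpress indexing code.
// Splits rich-text content into fixed-size, overlapping chunks so a KB
// search retrieves a few relevant chunks instead of the whole document.

const CHUNK_SIZE = 2_000 // characters per chunk (assumed value)
const CHUNK_OVERLAP = 200 // overlap so sentences straddling a boundary survive

function htmlToPlainText(html: string): string {
  // Naive tag stripping for illustration; a real pipeline would use a parser.
  return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
}

function chunkText(text: string): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += CHUNK_SIZE - CHUNK_OVERLAP) {
    chunks.push(text.slice(start, start + CHUNK_SIZE))
  }
  return chunks
}

// Usage: index each chunk individually, as is already done for uploaded files.
const richTextHtml = '<p>In my younger and more vulnerable years...</p>'
const chunks = chunkText(htmlToPlainText(richTextHtml))
console.log(`${chunks.length} chunk(s) to index`)
```

With overlap, any sentence that crosses a chunk boundary still appears whole in at least one chunk, so retrieval quality shouldn't suffer at the seams.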
