This is an obvious-in-hindsight idea that should have been implemented long ago. It parallels the "sentence-numbering trick" used in the Relevance Extractor.
Currently, `DocChatAgent.answer_from_docs(query, passages)` (where `passages` are already relevant extracts from chunks, pulled using the LLM) sends this prompt to the LLM:
```
Answer the QUERY based on the PASSAGES, and append CITE SOURCES you have used,
showing for each source, the SOURCE and EXTRACTS, where EXTRACTS should at most
contain the first 3 and last 3 words of each extract.

PASSAGES:
{passages}

QUERY:
{query}
```
This results in an LLM response that looks like:
```
In the year 2050, GPT10 was released. Additionally, all countries merged into Lithuania.

SOURCE: wikipedia
EXTRACTS: In the year ... GPT10 was released.

SOURCE: almanac
EXTRACTS: In the year ... merged into Lithuania.

SOURCE: world history, 2070 edition
EXTRACTS: All countries had ... back in 2050
```
There are many issues with this:
- Having the LLM generate (even partial) extracts is wasteful (token cost), slow, and yields incomplete extracts, since we try to save tokens by generating only the first/last few words of each one.
- When the response is long, several references may be involved, but the above scheme lumps them all at the end rather than attaching granular references to the specific parts of the response they support, so we can't tell which parts of the response came from which reference.
This can be much improved by instead doing this:
- number the passages sent in the prompt: `[1]`, `[2]`, etc.
- ask the LLM to cite sources using markdown footnote notation, e.g. `[^1][^3]`
- have the code extract the full, detailed cited texts and display them (again in markdown footnote syntax) after the LLM generates its answer; a rough sketch follows below
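As a minimal sketch of the prompt-building side (the function name and prompt wording here are illustrative assumptions, not actual DocChatAgent code):

```python
# Hypothetical sketch: number the passages and request footnote-style citations.
# The function name and prompt wording are assumptions, not actual Langroid code.
def build_numbered_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n\n".join(
        f"[{i}] {p}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the QUERY based on the numbered PASSAGES below.\n"
        "Cite the passages you used with markdown footnote notation,\n"
        "e.g. [^1] or [^2][^5], placed right after the relevant statement.\n"
        "Do NOT quote the passage text itself.\n\n"
        f"PASSAGES:\n{numbered}\n\n"
        f"QUERY:\n{query}\n"
    )
```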
So the idea is to just have the LLM generate granular, numerical citations, and let the code extract the detailed source text (so we don't spend LLM tokens on it).
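The extraction side might look roughly like this (again a sketch, assuming each passage has an associated source name, which in practice would come from the chunk metadata):

```python
import re

# Hypothetical sketch: pull the [^i] markers out of the LLM's answer and
# append the full text of only the cited passages. `sources[i-1]` is assumed
# to hold the source name of passage i (chunk metadata in practice).
def append_citations(answer: str, passages: list[str], sources: list[str]) -> str:
    cited = sorted({int(n) for n in re.findall(r"\[\^(\d+)\]", answer)})
    footnotes = "\n\n".join(
        f"[^{i}] {sources[i - 1]}\n{passages[i - 1]}"
        for i in cited
        if 1 <= i <= len(passages)
    )
    return f"{answer}\n\nSOURCES:\n\n{footnotes}" if footnotes else answer
```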
This will result in a response that is much more like a standard footnote or reference format:
```
In the year 2050, GPT10 was released [^1]. Additionally, all countries merged into Lithuania [^2][^5].

SOURCES:

[^1] wikipedia
In the year 2050, GPT10 was released.

[^2] almanac
In the year 2050, all countries merged into Lithuania.

[^5] world history, 2070 edition
All countries had already become part of Lithuania, back in 2050
```
Note the granular citations. Also, unlike the existing approach, the citations are detailed, not mere snippets, and are not generated by the LLM: the code looks them up from the LLM's numerical citations.
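Putting the two sketches above together on the toy example (with the LLM call stubbed out, since how it is invoked is outside this sketch):

```python
passages = [
    "In the year 2050, GPT10 was released.",
    "In the year 2050, all countries merged into Lithuania.",
]
sources = ["wikipedia", "almanac"]

prompt = build_numbered_prompt("What happened in 2050?", passages)
# answer = llm.generate(prompt)  # however the LLM is actually invoked
answer = "GPT10 was released [^1], and all countries merged into Lithuania [^2]."
print(append_citations(answer, passages, sources))
```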