Reading order follows PDF creation order instead of human reading order #2791

srimaailuri · 2025-12-16T07:47:18Z

Hi Docling team,

I’m parsing PDFs with Docling and noticed that the extracted text follows the PDF creation/object order rather than the human reading order, even for normal text and paragraphs.
In some PDFs, paragraphs that appear later on the page visually are returned before earlier paragraphs, which breaks the natural reading flow.

This happens even when OCR and table structure extraction are enabled.

Code used:

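from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR, table structure recovery, and picture extraction
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = doc_converter.convert("sample.pdf")
doc = result.document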

Is there any existing option or recommended way in Docling to enforce human/visual reading order for paragraphs?

Thanks!

@srimaailuri · 2025-12-16T07:50:53Z

dosubot[bot]
bot Dec 16, 2025

Hi @srimaailuri! I'm Dosu and I’m helping the docling team.

Docling doesn’t currently offer a built-in option or pipeline parameter to enforce human/visual reading order for paragraphs during PDF parsing. The reading order is handled internally by the layout model and isn’t user-configurable. The default Heron layout model generally gives the best results for multi-column layouts, but it doesn’t guarantee correct reading order for every PDF, especially documents with complex layouts.

The most effective workaround is to use the external package docling-hierarchical-pdf as a post-processing step. After converting your PDF with Docling’s DocumentConverter, you can call ResultPostprocessor(result).process() to reorder paragraphs and infer heading hierarchy in place. Here’s how you can integrate it:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

result = DocumentConverter().convert("sample.pdf")
ResultPostprocessor(result).process()
# Now result.document will reflect human/visual reading order

This approach works for both scanned and text-based PDFs. It uses the PDF’s TOC/bookmarks if available; otherwise it falls back to style-based inference.

Other workarounds include manual postprocessing, such as sorting extracted text by spatial coordinates or clustering text boxes by x-coordinate, but these require custom code and may not handle all cases reliably.
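As a rough illustration of the manual approach, here is a minimal sketch that clusters text boxes into columns by their left edge and then reads each column top to bottom. It does not use Docling's API: `TextBox`, `reading_order`, and the `column_gap` threshold are hypothetical names introduced for this example, and the coordinates are assumed to have a top-left origin (convert first if your bounding boxes use a bottom-left origin).

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical text box; coordinates in points, origin at the top-left."""
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge
    text: str


def reading_order(boxes, column_gap=20.0):
    """Group boxes into columns by left edge, then read each column top to bottom."""
    columns = []  # each column is a list of boxes with similar left edges
    for box in sorted(boxes, key=lambda b: b.x0):
        # Stay in the current column while the left edge moves by at most column_gap
        if columns and box.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = []
    for column in columns:  # columns are already ordered left to right
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# Two-column page whose boxes arrive in "creation order"
boxes = [
    TextBox(310, 50, 560, 120, "C"),   # right column, top
    TextBox(50, 200, 300, 280, "B"),   # left column, bottom
    TextBox(52, 50, 300, 180, "A"),    # left column, top
    TextBox(312, 140, 560, 220, "D"),  # right column, bottom
]
print([b.text for b in reading_order(boxes)])  # ['A', 'B', 'C', 'D']
```

This only covers simple column layouts. Headings or figures that span several columns, rotated text, and footnotes would need extra handling, and `column_gap` would have to be tuned per document.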

Improvements to reading order (such as spatial ordering options or ML-based strategies) are under discussion but not yet available in released Docling versions.

If you need help adapting this to your pipeline setup, let me know your Docling version and how you instantiate your pipeline—I can provide a more tailored code example.

yqliving · 2026-01-15T21:49:43Z

yqliving
Jan 15, 2026

@dosu in your last comment you wrote, "The most effective workaround is to use the external package docling-hierarchical-pdf as a post-processing step." For a text-based PDF, what happens if the backend PDF parser returns an incorrect reading order? Will docling-hierarchical-pdf overwrite that order in the exported plain text or Markdown?
