Reading order follows PDF creation order instead of human reading order #2791
Replies: 2 comments
Hi @srimaailuri! I’m Dosu and I’m helping the docling team.

Docling doesn’t currently offer a built-in option or pipeline parameter to enforce human/visual reading order for paragraphs during PDF parsing. The reading order is handled internally by the layout model and isn’t user-configurable. The default Heron layout model generally gives the best results for multi-column layouts, but it doesn’t guarantee correct reading order for all PDFs, especially complex or multi-column documents.

The most effective workaround is to use the external package docling-hierarchical-pdf as a post-processing step. After converting your PDF with Docling’s DocumentConverter, run ResultPostprocessor on the result:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

result = DocumentConverter().convert("sample.pdf")
ResultPostprocessor(result).process()
# Now result.document will reflect human/visual reading order

This approach works for both scanned and text-based PDFs. It uses the PDF’s TOC/bookmarks if available; otherwise it falls back to style-based inference.

Other workarounds include manual postprocessing, such as sorting extracted text by spatial coordinates or clustering text boxes by x-coordinates, but these require custom code and may not handle all cases reliably.

Improvements to reading order (such as spatial ordering options or ML-based strategies) are under discussion but not yet available in released Docling versions.

If you need help adapting this to your pipeline setup, let me know your Docling version and how you instantiate your pipeline, and I can provide a more tailored code example.

To reply, just mention @dosu.
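The coordinate-based workaround mentioned above (clustering text boxes by x-coordinate, then sorting each cluster vertically) could look roughly like the sketch below. It is an illustration only and does not use Docling's API: `TextBox`, its fields, `sort_reading_order`, and the `column_gap` threshold are all hypothetical names chosen for this example. It also assumes a top-left coordinate origin, so a smaller `top` means higher on the page; with a bottom-left origin the vertical sort would need to be reversed.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical container for one extracted paragraph and its position."""
    text: str
    x0: float   # left edge of the box
    top: float  # distance from the top of the page (top-left origin assumed)


def sort_reading_order(boxes, column_gap=20.0):
    """Group boxes into columns by left edge, then read each column top to bottom."""
    columns = []  # each entry is a list of boxes whose left edges are close together
    for box in sorted(boxes, key=lambda b: b.x0):
        # Start a new column when the left edge jumps by more than column_gap.
        if columns and box.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = []
    for column in columns:  # columns are already ordered left to right
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


# Two-column page whose paragraphs arrive in an arbitrary (creation) order.
boxes = [
    TextBox("right column, second", 320, 400),
    TextBox("left column, first", 50, 100),
    TextBox("right column, first", 322, 100),
    TextBox("left column, second", 52, 400),
]
for box in sort_reading_order(boxes):
    print(box.text)
```

A sketch like this only covers simple column layouts. Elements that span several columns, such as full-width headings or tables, would need extra handling, which is part of why this kind of manual postprocessing is not reliable in all cases.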
@dosu, in your last comment you said, "The most effective workaround is to use the external package docling-hierarchical-pdf as a post-processing step." What happens with a text-based PDF if the backend PDF parser returns an incorrect reading order? Will docling-hierarchical-pdf overwrite the result in the exported plain text or Markdown?
Hi Docling team,
I’m parsing PDFs with Docling and noticed that the extracted text follows the PDF creation/object order rather than the human reading order, even for normal text and paragraphs.
In some PDFs, paragraphs that appear later on the page visually are returned before earlier paragraphs, which breaks the natural reading flow.
This happens even when OCR and table structure extraction are enabled.
Code used:
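from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR, table structure extraction, and picture image generation
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.do_cell_matching = True

# Apply these options to PDF inputs only
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = doc_converter.convert("sample.pdf")
doc = result.document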
Is there any existing option or recommended way in Docling to enforce human/visual reading order for paragraphs?
Thanks!