Open
Labels
enhancement: New feature or request
Description
Requested feature
The current docling PDF pipeline has functional limitations when processing fillable PDFs. docling should be able to extract the text from the fillable areas and place it in the correct reading order.
Summary of limitations by PDF backend:
- PyPdfiumDocumentBackend:
  - Text from the fillable areas is missing.
  - pypdfium2 5.x provides a page flatten function that makes the fillable data extractable, so docling needs to upgrade pypdfium2. A ticket is open for this: [Bee] support the latest pypdfium2 version 5.x #2874
- DoclingParseV4DocumentBackend:
  - Text from the fillable areas can be extracted, but the reading order is wrong: all the fillable content is appended to the end of the page.
- With force_full_page_ocr set to True:
  - No matter which backend is selected, the text from the fillable areas is missing.
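To make the expected reading order concrete, here is a minimal sketch of the behavior being requested. It does not use docling internals. The `Cell` type, the `merge_in_reading_order` helper, the top-left coordinate origin, and the `row_tol` value are all assumptions made for illustration. The idea is that form-field values are interleaved with the regular text by position instead of being appended after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical text cell; coordinates assume a top-left page origin."""
    text: str
    left: float
    top: float
    is_form_field: bool = False


def merge_in_reading_order(text_cells, field_cells, row_tol=3.0):
    """Interleave form-field cells with regular text cells by position.

    Cells whose `top` differs from the first cell of the current row by at
    most `row_tol` are treated as one row; rows are read top to bottom and
    each row left to right. This is a simple heuristic, not docling's
    reading-order model.
    """
    cells = sorted([*text_cells, *field_cells], key=lambda c: (c.top, c.left))
    rows = []
    for cell in cells:
        if rows and abs(cell.top - rows[-1][0].top) <= row_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [c for row in rows for c in sorted(row, key=lambda c: c.left)]


# Made-up example data: two labels and the values typed into their fields.
labels = [Cell("Invoice No:", 50, 100), Cell("Date:", 50, 120)]
fields = [Cell("A-1", 150, 101, True), Cell("2024-01-01", 150, 119, True)]

for cell in merge_in_reading_order(labels, fields):
    print(cell.text)
```

With the current DoclingParseV4DocumentBackend behavior described above, the two field values would instead come after both labels, at the end of the page.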
Version:
docling==2.69.0
docling-core==2.59.0
docling-hierarchical-pdf==0.1.3
docling-ibm-models==3.9.1
docling-parse==4.7.3
Testing file:
INV_059_1.pdf
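A possible reproduction script for the three cases above, as a sketch only. The helper name `build_converter` and the `CASES` list are made up for this example, and the script assumes INV_059_1.pdf is in the working directory and that the docling version listed above exposes the imports shown.

```python
from pathlib import Path

BACKEND_NAMES = ("pypdfium", "docling_parse_v4")

# (backend name, force_full_page_ocr) combinations described in this issue.
CASES = [("pypdfium", False), ("docling_parse_v4", False),
         ("pypdfium", True), ("docling_parse_v4", True)]


def build_converter(backend_name, force_full_page_ocr=False):
    """Build a DocumentConverter for one PDF backend (hypothetical helper)."""
    if backend_name not in BACKEND_NAMES:
        raise ValueError(f"unknown backend: {backend_name!r}")

    # Imported lazily so the name check above works without docling installed.
    from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    backends = {
        "pypdfium": PyPdfiumDocumentBackend,
        "docling_parse_v4": DoclingParseV4DocumentBackend,
    }
    options = PdfPipelineOptions()
    options.ocr_options.force_full_page_ocr = force_full_page_ocr
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=options, backend=backends[backend_name]
            )
        }
    )


if __name__ == "__main__":
    pdf_path = Path("INV_059_1.pdf")
    for name, full_ocr in CASES:
        result = build_converter(name, full_ocr).convert(pdf_path)
        print(f"--- backend={name} force_full_page_ocr={full_ocr} ---")
        # Check whether the form-field values appear, and where.
        print(result.document.export_to_markdown())
```

Comparing the printed Markdown across the four runs should show whether the fillable values are missing or misplaced for each configuration.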