[Bee] Fillable PDF handling · Issue #2899 · docling-project/docling · GitHub

[Bee] Fillable PDF handling #2899

@yqliving

Description

Requested feature

The current docling PDF pipeline has functional limitations when processing fillable PDFs. Docling should be able to extract the text from the fillable areas and keep it in the correct reading order.

Summary of limitations by PDF backend:

  1. PyPdfiumDocumentBackend:
  2. DoclingParseV4DocumentBackend
    • Text from the fillable areas can be extracted, but the reading order is wrong: all the fillable content is appended to the end of the page.
  3. With force_full_page_ocr set to True
    • Regardless of which backend is selected, the text from the fillable areas is missing.
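
To make the expected reading order concrete, here is a minimal sketch of merging field values into the page text by position instead of appending them at the end of the page. It is not docling's implementation: the `TextCell` type, the `merge_in_reading_order` function, the `line_tolerance` parameter, and the sample values are all hypothetical, and it assumes a simple single-column layout with coordinates measured from the top-left of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextCell:
    """A piece of text with its position on the page (hypothetical type)."""
    text: str
    left: float
    top: float  # distance from the top of the page; smaller means higher up
    from_form_field: bool = False


def merge_in_reading_order(page_cells, field_cells, line_tolerance=3.0):
    """Interleave static text and form-field values by position.

    Cells whose `top` is within `line_tolerance` of the first cell of the
    current line are treated as the same line, then ordered left to right.
    """
    cells = sorted([*page_cells, *field_cells], key=lambda c: (c.top, c.left))
    lines = []
    for cell in cells:
        if lines and abs(cell.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(cell)
        else:
            lines.append([cell])
    return [
        " ".join(c.text for c in sorted(line, key=lambda c: c.left))
        for line in lines
    ]


# Made-up sample data: two labels and the values typed into their fields.
labels = [TextCell("Invoice No:", 50, 100), TextCell("Date:", 50, 120)]
fields = [
    TextCell("A-1001", 150, 101, from_form_field=True),
    TextCell("2024-01-05", 150, 121, from_form_field=True),
]
merged = merge_in_reading_order(labels, fields)
print(merged)
```

With this ordering each field value lands next to its label, whereas the behavior described for DoclingParseV4DocumentBackend would put both values after "Date:". Multi-column pages would need a real layout model rather than a plain top-then-left sort.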

Version:
docling==2.69.0
docling-core==2.59.0
docling-hierarchical-pdf==0.1.3
docling-ibm-models==3.9.1
docling-parse==4.7.3

Testing file:
INV_059_1.pdf
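
A possible way to reproduce the comparison is sketched below. It is an assumption-laden sketch, not a verified script: the import paths and option names are the ones I understand docling 2.x to expose and may differ between versions, and the helper names `configurations` and `convert_to_markdown` are hypothetical.

```python
from pathlib import Path

# The two backends named in the report.
BACKEND_NAMES = ("PyPdfiumDocumentBackend", "DoclingParseV4DocumentBackend")


def configurations():
    """Every (backend name, force_full_page_ocr) pair to compare."""
    return [(name, ocr) for name in BACKEND_NAMES for ocr in (False, True)]


def convert_to_markdown(pdf_path, backend_name, force_full_page_ocr):
    """Convert one PDF with the given backend and OCR setting."""
    # Imported lazily so the configuration list can be inspected without docling.
    # These import paths are assumptions about the installed docling version.
    from docling.backend.docling_parse_v4_backend import (
        DoclingParseV4DocumentBackend,
    )
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    backends = {
        "PyPdfiumDocumentBackend": PyPdfiumDocumentBackend,
        "DoclingParseV4DocumentBackend": DoclingParseV4DocumentBackend,
    }

    options = PdfPipelineOptions()
    options.ocr_options.force_full_page_ocr = force_full_page_ocr

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=options,
                backend=backends[backend_name],
            )
        }
    )
    return converter.convert(Path(pdf_path)).document.export_to_markdown()
```

Calling `convert_to_markdown("INV_059_1.pdf", name, ocr)` for each pair returned by `configurations()` and comparing the four Markdown outputs should show where the field text is missing or misplaced.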

Labels: enhancement (New feature or request)
