For code blocks we have the CodeFormula VLM, which is specifically tuned for the OCR task of reading code, i.e. preserving proper syntax, whitespace, etc. The current implementation runs the model on the transformers runtime (sometimes problematic, e.g. on macOS), but we are just about to generalize the model runtimes so it can also run via an API, MLX, etc.
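As a minimal sketch of turning this enrichment on (assuming the `do_code_enrichment` option exposed on Docling's PDF pipeline options; option names may vary between versions, and the input filename here is just an example):

```python
# Sketch: enable CodeFormula code enrichment in Docling's PDF pipeline.
# Assumes the `do_code_enrichment` pipeline option triggers the CodeFormula VLM.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True  # run CodeFormula on detected code blocks

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("paper_with_code.pdf")  # example input path
print(result.document.export_to_markdown())
```

Note that this downloads and runs the CodeFormula model locally on the transformers runtime, which is exactly the part the upcoming runtime generalization is meant to relax.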
Thanks for the quick reply! I guess it is the same problem as reported in #2833. Understood; I will try it out. But just to be sure: is there no way to treat code blocks like any other images? Or, more generally: extract only the text and send all remaining figures/blocks to an external VLM for transcription as images? My reasoning is that a properly deployed VLM can be more performant and/or produce better output than the small internal Docling models.
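For the figures-as-images route described above, a sketch under the assumption that Docling's `generate_picture_images` pipeline option and `PictureItem.get_image` behave as in current releases; the external-VLM call is a purely hypothetical placeholder for whatever client you deploy:

```python
# Sketch: export detected figures as cropped images and hand them to an
# external VLM. Assumes `generate_picture_images` / `PictureItem.get_image`;
# `transcribe_with_external_vlm` is a hypothetical stand-in for your own client.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import PictureItem

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True  # keep cropped figure images
pipeline_options.images_scale = 2.0              # higher resolution for the VLM

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
doc = converter.convert("input.pdf").document  # example input path

for item, _level in doc.iterate_items():
    if isinstance(item, PictureItem):
        pil_image = item.get_image(doc)  # PIL.Image of the cropped figure
        # transcription = transcribe_with_external_vlm(pil_image)  # hypothetical
```

As the maintainer notes, code blocks are a separate element class handled by the CodeFormula enrichment, so they would not show up in this loop; this only covers pictures/diagrams.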
Hi, I noticed that code blocks are ignored when OCR is disabled. My plan was to send the code blocks to an external VLM for transcription, like any other image, but that does not work: general images and diagrams work, code blocks do not.
Enabling the expensive OCR just for that does not seem right (and it is slow).
Or is this somehow configurable? Thanks!