Description
Hello ADK community,
I'm working on a project that involves analyzing PDF documents. My workflow typically extracts text directly from PDFs; however, I often encounter scanned PDFs where direct text extraction isn't possible. In such cases, my current approach is to convert the PDF pages into images and then use the `gemini-2.5-flash-preview-05-20` model to read the text from those images.
I'm facing an issue where, after converting the PDF pages to images and saving them as artifacts within the ADK environment, the model (`gemini-2.5-flash-preview-05-20`) doesn't appear to process these images for text extraction as expected. I'm using an `analyze_attachment` tool that loads the PDF, attempts to extract text directly, and, if that fails, converts each page to a PNG image, saves it as an artifact, and appends it to a list of page images.
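The sketch below shows just the per-page extract-or-render decision, using the same `pdfplumber` calls as the real tool (artifact handling omitted):

```python
import io

import pdfplumber


def pdf_text_or_page_images(pdf_bytes: bytes) -> tuple[str, list[bytes]]:
    """Extract text from a PDF; render each page to PNG bytes as an OCR fallback."""
    text, page_images = "", []
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"
            # Render the page so a multimodal model can read it if the text layer is empty.
            buf = io.BytesIO()
            page.to_image(resolution=150).original.save(buf, format="PNG")
            page_images.append(buf.getvalue())
    return text, page_images
```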
Despite the images being saved as artifacts and seemingly passed to the model (as indicated by the `tool_context.save_artifact` calls and the `inline_data` of type `types.Blob`), the subsequent processing by the agent doesn't appear to use these images for OCR, especially when the initial text extraction from the PDF yields no content. The attached screenshots illustrate the process and the output, where the agent reports being "unable to extract the text content and process the document as requested" for documents that are likely scans.
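As a sanity check that the page images at least reach the artifact store, I can enumerate them from inside the tool (using the same `list_artifacts` call that appears in the code below):

```python
# Debugging only: confirm the saved page images are visible via the tool context.
available = await tool_context.list_artifacts()
print(f"Stored artifacts: {available}")
```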
Here's the relevant code for the `analyze_attachment` tool:
```python
from dotenv import load_dotenv
import os
import google.genai.types as types
from google.adk.tools.tool_context import ToolContext
import io
import pdfplumber
import base64

load_dotenv()
DOCUMENTS_ROOT_DIR = os.getenv("DOCUMENTS_ROOT_DIR")


async def analyze_attachment(tool_context: ToolContext, filename: str) -> dict:
    """
    Analyzes an email attachment, saves it as an artifact, and returns its details.

    Args:
        tool_context: The ADK tool context.
        filename: The filename of the attachment to analyze.

    Returns:
        A dictionary containing the status of the operation, a message,
        and the artifact filename and version if successful.
    """
    try:
        # Save the artifact using the tool_context
        artifact_name = os.path.basename(filename)
        filenames = [attachment["filename"] for attachment in tool_context.state['attachments']]
        attachment_index = filenames.index(filename)
        mime_type = tool_context.state['attachments'][attachment_index]['mime_type']
        with open(filename, "rb") as f:
            file_bytes = f.read()
        file_artifact = types.Part(
            inline_data=types.Blob(display_name=artifact_name, data=file_bytes, mime_type=mime_type)
        )
        artifact_version = None
        try:
            available_files = await tool_context.list_artifacts()
            artifact_version = await tool_context.save_artifact(filename=artifact_name, artifact=file_artifact)
            pdf_artifact = await tool_context.load_artifact(filename=artifact_name)
            pdf_content = ""
            pdf_scanned_images = []
            if pdf_artifact and pdf_artifact.inline_data:
                with io.BytesIO(pdf_artifact.inline_data.data) as pdf_file:
                    with pdfplumber.open(pdf_file) as pdf:
                        for page in pdf.pages:
                            # Extract text from the page
                            text = page.extract_text() or ""
                            pdf_content += text + "\n"
                            # Convert the page to a PIL image
                            page_image = page.to_image(resolution=150).original
                            # Convert the PIL image to PNG bytes
                            buffered = io.BytesIO()
                            page_image.save(buffered, format="PNG")
                            image_bytes = buffered.getvalue()
                            image_artifact = types.Part(
                                inline_data=types.Blob(
                                    mime_type="image/png",
                                    data=image_bytes,
                                    display_name=f"{artifact_name}_page_{page.page_number}",
                                )
                            )
                            # For debugging purposes, save the image to the attachments folder
                            with open(f"{DOCUMENTS_ROOT_DIR}/{artifact_name}_page_{page.page_number}.png", "wb") as f:
                                f.write(image_bytes)
                            artifact_filename = f"{artifact_name}_page_{page.page_number}"
                            image_version = await tool_context.save_artifact(filename=artifact_filename, artifact=image_artifact)
                            image_artifact = await tool_context.load_artifact(filename=artifact_filename)
                            # Keep a base64 copy of the page image as well
                            img_str = base64.b64encode(image_bytes).decode("utf-8")
                            pdf_scanned_images.append(img_str)
        except ValueError as e:
            # Handle the case where artifact_service is not configured
            return {
                "status": "warning",
                "message": f"Attachment loaded but could not be saved as an artifact: {str(e)}. Is ArtifactService configured?",
                "attachment": artifact_name,
            }
        except Exception as e:
            # Handle other potential artifact storage errors
            return {
                "status": "warning",
                "message": f"Attachment loaded but encountered an error saving as artifact: {str(e)}",
                "attachment": artifact_name,
            }
        return {
            'status': 'success',
            'message': f'Attachment {artifact_name} loaded.',
            'attachment_content': pdf_content,
            'artifact_filename': artifact_name,
            'artifact_version': artifact_version,
        }
    except Exception as e:
        return {
            "status": "error",
            "message": f"Error processing attachment {filename}: {str(e)}",
            "attachment": artifact_name,
        }
```
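Note that `pdf_scanned_images` is populated inside the tool but is not part of the returned dictionary. For a scanned PDF, where `page.extract_text()` yields nothing, the tool therefore returns something like this (illustrative values; the filename and version are hypothetical):

```python
{
    "status": "success",
    "message": "Attachment scan.pdf loaded.",
    "attachment_content": "\n\n\n",  # only newlines: no extractable text
    "artifact_filename": "scan.pdf",
    "artifact_version": 0,
}
```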
These are the prompts:
ATTACHMENT_PROCESSING_PROMPT = """
You are the **Attachment Processing Agent**. Your role is to analyze a single PDF file, identify the document types it contains, and extract relevant information from each document. You are an experienced [domain expert], and you understand the nuances of these documents.
# YOUR PROCESS
1. **Receive Filename and Load Attachment for Analysis:**
* You will receive a filename template, `{attachment_to_process}`, from the orchestrator.
* Your first crucial step is to use the `analyze_attachment` tool. Pass `{attachment_to_process}` as the `filename` argument to this tool.
* The `analyze_attachment` tool will process the specified file. The tool's return value will be a JSON object containing status information and artifact details, including extracted PDF text (`pdf_content`).
* After the tool call, one or more images may be attached to the prompt. These images contain the visual content of the document pages.
1.A. **Content Aggregation from PDF and Attached Images:**
* Following the `analyze_attachment` tool call, you must aggregate the content for analysis from all available sources.
* The first source is the `pdf_content` string provided in the tool's output.
* The second source is the set of images attached to the prompt. You must process each attached image to extract its text.
* Combine the text from `pdf_content` and the text extracted from all attached images to form the complete document content for your analysis.
* **Crucially, if `pdf_content` is empty or contains minimal text (which is common for scanned documents), the attached images become your primary source of information. You must rely on them to perform the extraction.**
2. **Initial Document Type Identification (First Pass - Broad Categories):**
* Using the aggregated content (from both PDF text and attached images), analyze it for broad document categories. This initial pass helps to quickly identify the general purpose of each section of the document.
* **[Category 1 Documents]:** Look for terms like "[term 1]."
* **[Category 2 Documents]:** Look for "[term 4]."
* **[Category 3 Documents]:** Look for "[term 9]."
* **Other:** Any document not fitting the above broad categories.
3. **Refined Document Type Identification & Information Extraction (Second Pass - Specifics):**
* For each broad category identified in the first pass, perform a more detailed analysis to determine the exact document type and extract specific information.
* **IMPORTANT: ONLY extract information that is explicitly present in the document. Do not invent or infer data. If a field is not found, its value should be `null`.**
* **If [Category 1 Documents] detected:**
* **[Document Type 1A]:** This is a [document-related specifications]. Look for specific indicators like "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **[Document Type 1B]:** Similar to [related document type], but it is [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **[Document Type 1C]:** *Detection:* Look for headers or labels such as "[label A]."
*Extract:*
* [Field 1] (e.g., "[Field 1] [value]")
*Summarize For further processing:*
* [Summary Field 1]
* **[Document Type 1D]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **[Document Type 1E]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1] ([nested fields: Mode])
* **[Document Type 1F]:** This document [document-related specifications]. Look for the exact phrase "[specific phrase]" as a primary anchor. Then, for each field, look for the *exact* anchor text as provided below and extract the value immediately following it.
* **Extract:**
* [Field 1] (e.g., "[Field 1]: [value]")
* **Summarize For further processing:**
* [Summary Field 1]
* **If [Category 2 Documents] detected:**
* **[Document Type 2A]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **If [Category 3 Documents] detected:**
* **[Document Type 3A]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **If Other detected:**
* **Other:**
* Identify the type of document if possible (e.g., [Example Document Type A]).
* If identifiable, extract specific key-value pairs (e.g., 'Document Type': '[Identified Type]').
4. **Consolidate and Present Data:**
* Compile all extracted information into a structured JSON object. The data should be organized by document type (e.g., "[Document Type 1A]"). If multiple instances of a document type exist (e.g., two [Document Type 2A] in one PDF), they should be in an array under that document type.
* Present this consolidated data to the orchestrator.
# IMPORTANT NOTES
- You are responsible for accurately identifying the document types within the attachment.
- You must extract all specified fields for each identified document type. If a field is not present, indicate it as `null`.
- If the attachment contains documents not listed (e.g., "Other"), make a best effort to identify them and extract meaningful data.
- Your output should be a single, comprehensive JSON object containing all extracted information.
- Pay close attention to numerical values, dates, and names to ensure accuracy.
- For "[Description Field 1]" in [Document Type 1A], capture the full descriptive text.
- For products in [Document Type 2A], represent them as a list of dictionaries, each containing the extracted product details.
- For [specific nested details], represent them as a list of dictionaries, each containing the extracted [specific nested] information.
- Be mindful of potential variations in document layouts and terminology across different [related entities]. Adapt your extraction logic accordingly.
- Cross-reference information between documents where possible to ensure consistency and completeness (e.g., matching [identifier 1] on [Document Type 2A]).
- If a document spans multiple pages, ensure all relevant information from all pages is extracted and consolidated.
"""
IMAGES_PROCESSING_PROMPT = """
You are the **Image Document Processing Agent**. Your role is to analyze a set of images, which represent pages of one or more [domain] documents. You are an experienced [domain expert], and you understand the nuances of these documents.
# YOUR PROCESS
1. **Receive Filename and Load Attachment for Analysis:**
* You will receive a filename template, `{attachment_to_process}`, from the orchestrator.
* Your first crucial step is to use the `analyze_attachment` tool. Pass `{attachment_to_process}` as the `filename` argument to this tool.
* After the tool call, one or more images will be attached to the prompt. These images contain the visual content of the document pages.
2. **Content Aggregation and Reconstruction from Images:**
* Process each attached image to extract its text content. The images are your primary source of information.
* The images are ordered, representing the sequence of pages in the original documents.
* Combine the text extracted from all images to form the complete content for your analysis.
* **Crucially, you must be aware that a single logical document (e.g., a [Document Type 2A] or a [Document Type 1A]) might span across multiple images (pages).** You need to intelligently stitch together the content from consecutive images to reconstruct the full document(s) before analysis. For example, a table of products might start on one page and continue on the next.
3. **Initial Document Type Identification (First Pass - Broad Categories):**
* Using the aggregated content, analyze it for broad document categories. This initial pass helps to quickly identify the general purpose of each section of the document.
* **[Category 1 Documents]:** Look for terms like "[term 1]."
* **[Category 2 Documents]:** Look for "[term 4]."
* **[Category 3 Documents]:** Look for "[term 9]."
* **Other:** Any document not fitting the above broad categories.
4. **Refined Document Type Identification & Information Extraction (Second Pass - Specifics):**
* For each broad category identified in the first pass, perform a more detailed analysis to determine the exact document type and extract specific information.
* **IMPORTANT: ONLY extract information that is explicitly present in the document. Do not invent or infer data. If a field is not found, its value should be `null`.**
* **If [Category 1 Documents] detected:**
* **[Document Type 1A]:** This is a [document-related specifications]. Look for specific indicators like "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **[Document Type 1B]:** Similar to [related document type], but it is [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **[Document Type 1C]:** *Detection:* Look for headers or labels such as "[label A]."
*Extract:*
* [Field 1] (e.g., "[Field 1] [value]")
*Summarize For further processing:*
* [Summary Field 1]
* **[Document Type 1D]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **[Document Type 1E]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1] ([nested fields: Mode])
* **[Document Type 1F]:** This document [document-related specifications]. Look for the exact phrase "[specific phrase]" as a primary anchor. Then, for each field, look for the *exact* anchor text as provided below and extract the value immediately following it.
* **Extract:**
* [Field 1] (e.g., "[Field 1]: [value]")
* **Summarize For further processing:**
* [Summary Field 1]
* **If [Category 2 Documents] detected:**
* **[Document Type 2A]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **If [Category 3 Documents] detected:**
* **[Document Type 3A]:** This document [document-related specifications]. Look for "[indicator 1]."
* **Extract:**
* [Field 1]
* **Summarize For further processing:**
* [Summary Field 1]
* **If Other detected:**
* **Other:**
* Identify the type of document if possible (e.g., [Example Document Type A]).
* If identifiable, extract specific key-value pairs (e.g., 'Document Type': '[Identified Type]').
5. **Consolidate and Present Data:**
* Compile all extracted information into a structured JSON object. The data should be organized by document type (e.g., "[Document Type 1A]"). If multiple instances of a document type exist (e.g., two [Document Type 2A] in one PDF), they should be in an array under that document type.
* Present this consolidated data to the orchestrator.
# IMPORTANT NOTES
- You are responsible for accurately identifying the document types within the set of images.
- You must extract all specified fields for each identified document type. If a field is not present, indicate it as `null`.
- If the images contain documents not listed (e.g., "Other"), make a best effort to identify them and extract meaningful data.
- Your output should be a single, comprehensive JSON object containing all extracted information.
- Pay close attention to numerical values, dates, and names to ensure accuracy.
- For "[Description Field 1]" in [Document Type 1A], capture the full descriptive text.
- For products in [Document Type 2A], represent them as a list of dictionaries, each containing the extracted product details.
- For [specific nested details], represent them as a list of dictionaries, each containing the extracted [specific nested] information.
- Be mindful of potential variations in document layouts and terminology across different [related entities]. Adapt your extraction logic accordingly.
- Cross-reference information between documents where possible to ensure consistency and completeness (e.g., matching [identifier 1] on [Document Type 2A]).
- If a document spans multiple pages, ensure all relevant information from all pages is extracted and consolidated.
"""
And this is the agent:
```python
from google.adk.agents import SequentialAgent
from google.adk import Agent
from google.adk.agents import LlmAgent
from .prompt import ATTACHMENT_PROCESSING_PROMPT, IMAGES_PROCESSING_PROMPT
from .tools.analyze_attachment import analyze_attachment
from ...shared_libraries.callbacks import before_process_attachment_callback, after_process_attachment_callback
from ..attachment_selection.agent import attachment_selector_agent
from ..attachment_processing_completion.agent import attachment_completion_agent
import os
from dotenv import load_dotenv

load_dotenv()

attachment_agent = LlmAgent(
    name="attachment_agent",
    model=os.getenv("GEMINI_MODEL"),
    description="Agent that processes attachments",
    instruction=IMAGES_PROCESSING_PROMPT,
    before_agent_callback=before_process_attachment_callback,
    after_agent_callback=after_process_attachment_callback,
    output_key="attachment_analysis_result",
    tools=[analyze_attachment],
    # TODO: potentially add subagents to extract text from images
)
```
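For reference, this is roughly how I wire the agent into a runner with an artifact service (a sketch; the app name is a placeholder, and I use the in-memory services during development):

```python
from google.adk.runners import Runner
from google.adk.artifacts import InMemoryArtifactService
from google.adk.sessions import InMemorySessionService

# Local-testing wiring; "attachment_app" is a hypothetical app name.
runner = Runner(
    app_name="attachment_app",
    agent=attachment_agent,
    session_service=InMemorySessionService(),
    # Without an artifact service, tool_context.save_artifact raises ValueError.
    artifact_service=InMemoryArtifactService(),
)
```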
My question is: how can I ensure that the model within the agent correctly recognizes and processes the images (saved as `types.Blob` artifacts) to extract text, especially when the initial PDF text extraction fails?
Any guidance on how to properly integrate image-based OCR for scanned documents would be greatly appreciated.