feat: Add Audio Extraction by ronantakizawa · Pull Request #3720 · browser-use/browser-use · GitHub

@ronantakizawa ronantakizawa commented Dec 4, 2025

Adds a new interpret_audio action that enables agents to extract and transcribe audio content from web pages using OpenAI's Whisper API.

Problem

Browser-use can't understand audio content on a website. I tried to use browser-use for tasks that required listening to an audio file on a page, and it couldn't.

Solution

This feature allows browser-use agents to understand and process audio elements they encounter during web navigation.

Functionality:

- Extracts audio URLs from HTML <audio> elements on the current page
- Downloads audio files (with redirect support for CDN-hosted content)
- Transcribes audio using OpenAI Whisper API
- Optionally summarizes transcriptions using the agent's LLM
- Returns structured transcription results to the agent
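The first step in the list above, finding audio URLs in the page markup, can be sketched with the standard library alone. This is an illustrative sketch, not the code in this PR: `AudioSourceParser` and `find_audio_urls` are hypothetical names, and it only reads static HTML, whereas the PR description mentions CDP and JS-based lookups as well.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AudioSourceParser(HTMLParser):
    """Collects src attributes from <audio> tags and their <source> children.

    Hypothetical helper for illustration; not part of browser-use.
    """

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []
        self._in_audio = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "audio":
            self._in_audio = True
        # Accept <audio src=...> and <source src=...> nested inside <audio>.
        if tag == "audio" or (tag == "source" and self._in_audio):
            src = attr_map.get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "audio":
            self._in_audio = False


def find_audio_urls(html: str, base_url: str) -> list[str]:
    """Return absolute audio URLs found in the given HTML, in document order."""
    parser = AudioSourceParser(base_url)
    parser.feed(html)
    return parser.urls
```

A `<source>` tag inside a `<video>` element is ignored here on purpose, to keep the sketch limited to the `<audio>` case described in the list.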

Testing script

  import asyncio
  import os
  from browser_use import Agent, BrowserSession
  from browser_use.llm import ChatOpenAI

  async def test_archive_audio():
      assert os.getenv('OPENAI_API_KEY'), "Set OPENAI_API_KEY"

      browser = BrowserSession()
      llm = ChatOpenAI(model='gpt-4o')

      agent = Agent(
          task="Go to https://archive.org/details/testmp3testfile and transcribe the audio",
          llm=llm,
          browser_session=browser,
          max_steps=10,
      )

      try:
          result = await agent.run()
          print(f"\n✅ Transcription result:\n{result}")
      finally:
          await browser.kill()

  asyncio.run(test_archive_audio())

Summary by cubic

Add interpret_audio to let agents transcribe and optionally summarize audio from web pages using OpenAI Whisper. This helps browser-use agents understand audio/video content they encounter.

  • New Features

    • Extracts audio source from audio/video elements (via attributes, CDP, or JS). Works across iframes and shadow DOM, with optional element index.
    • Downloads with redirects, resolves relative URLs, supports base64 data URLs, and returns clear errors for blob/streaming URLs.
    • Transcribes with OpenAI Whisper; optional summary via the page-extraction LLM; cleans up temp files; returns structured results and memory notes.
    • Docs updated to include interpret_audio in available tools.
  • Dependencies

    • Added aiofiles>=24.1.0.
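The URL handling described in the summary (resolving relative URLs, accepting data URLs, rejecting blob URLs with a clear error) could look roughly like the sketch below. The function name `classify_audio_src` and the returned labels are assumptions made for illustration and do not come from the PR.

```python
from urllib.parse import urljoin, urlsplit


def classify_audio_src(src: str, page_url: str) -> tuple[str, str]:
    """Decide how an audio source should be fetched.

    Returns a (kind, value) pair, where kind is one of:
      "download"    - value is an absolute URL to fetch over HTTP
      "data"        - value is a data: URL to decode locally
      "unsupported" - value is a human-readable error message
    Hypothetical helper; labels are illustrative only.
    """
    scheme = urlsplit(src).scheme.lower()
    if scheme == "data":
        return ("data", src)
    if scheme == "blob":
        # blob: URLs reference in-memory browser objects, so an HTTP client
        # has nothing to download.
        return ("unsupported", "blob: URLs cannot be downloaded outside the browser")
    # Absolute URLs pass through unchanged; relative ones are resolved.
    return ("download", urljoin(page_url, src))
```

Keeping the decision in one small function means the download and decode paths never see a URL type they cannot handle.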

Written for commit 0357cb7. Summary will update automatically on new commits.

CLAassistant commented Dec 4, 2025

CLA assistant check
All committers have signed the CLA.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 4 files

Prompt for AI agents (all 1 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="browser_use/tools/service.py">

<violation number="1" location="browser_use/tools/service.py:964">
P2: Data URLs (`data:`) are excluded from relative URL conversion but not handled explicitly. Attempting to download a data URL via httpx will fail. Consider adding similar handling for data URLs (either decode them directly or return an appropriate error).</violation>
</file>
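
One way to address the review comment is to decode `data:` URLs locally instead of passing them to an HTTP client. The sketch below uses only the standard library; `decode_data_url` is a hypothetical name, and this is not necessarily how the PR resolved the issue.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(url: str) -> tuple[str, bytes]:
    """Split a data: URL into (media_type, raw_bytes).

    Handles both base64 and percent-encoded payloads.
    Hypothetical helper for illustration; not part of browser-use.
    """
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("malformed data URL: missing comma")
    params = header.split(";")
    # An empty media type defaults to text/plain.
    media_type = params[0] or "text/plain"
    if "base64" in params[1:]:
        return media_type, base64.b64decode(payload)
    return media_type, unquote_to_bytes(payload)
```

The returned bytes could then be written to a temporary file and passed to the transcription step in the same way as a downloaded file.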

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR
