Audio and Speech - OpenAI API

Explore audio and speech features in the OpenAI API.

The OpenAI API provides a range of audio capabilities. If you know what you want to build, find your use case below to get started. If you're not sure where to start, read this page as an overview.

Build with audio

Build voice agents
Build interactive voice-driven applications.

Transcribe audio
Convert speech to text instantly and accurately.

Speak text
Turn text into natural-sounding speech in real time.

A tour of audio use cases

LLMs can process audio by using sound as input, creating sound as output, or both. OpenAI has several API endpoints that help you build audio applications or voice agents.

Voice agents

Voice agents understand audio to handle tasks and respond back in natural language. There are two main ways to approach voice agents: either with speech-to-speech models and the Realtime API, or by chaining together a speech-to-text model, a text language model to process the request, and a text-to-speech model to respond. Speech-to-speech is lower latency and more natural, but chaining together a voice agent is a reliable way to extend a text-based agent into a voice agent. If you are already using the Agents SDK, you can extend your existing agents with voice capabilities using the chained approach.

Streaming audio

Process audio in real time to build voice agents and other low-latency applications, including transcription use cases. You can stream audio in and out of a model with the Realtime API. Our advanced speech models provide automatic speech recognition for improved accuracy, low-latency interactions, and multilingual support.
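
Below is a minimal sketch of streaming a spoken response over the Realtime API's WebSocket interface, assuming a Node environment with the ws package installed; the endpoint, beta header, and event names (response.create, response.audio.delta) follow the beta Realtime protocol and should be verified against the current API reference.

javascript
import WebSocket from "ws";

// Connect to the Realtime API over WebSocket (beta header required)
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Ask the model to generate a spoken (and text) response
  ws.send(
    JSON.stringify({
      type: "response.create",
      response: { modalities: ["audio", "text"] },
    })
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Audio arrives incrementally as base64-encoded chunks
  if (event.type === "response.audio.delta") {
    const chunk = Buffer.from(event.delta, "base64");
    // ...play or buffer the audio chunk here...
  }
});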

Text to speech

For turning text into speech, use the Audio API audio/speech endpoint. Models compatible with this endpoint are gpt-4o-mini-tts, tts-1, and tts-1-hd. With gpt-4o-mini-tts, you can ask the model to speak a certain way or with a certain tone of voice.
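
For example, here is a minimal sketch of calling the audio/speech endpoint with the official Node SDK; the output file name and spoken text are placeholders, and the instructions field (which steers delivery) applies to gpt-4o-mini-tts.

javascript
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate speech; `instructions` steers tone with gpt-4o-mini-tts
const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "coral",
  input: "Today is a wonderful day to build something people love!",
  instructions: "Speak in a cheerful and positive tone.",
});

// The response body is binary audio (MP3 by default)
writeFileSync("speech.mp3", Buffer.from(await speech.arrayBuffer()));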

Speech to text

For speech to text, use the Audio API audio/transcriptions endpoint. Models compatible with this endpoint are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1. With streaming, you can continuously pass in audio and get a continuous stream of text back.
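
A minimal sketch of a one-shot transcription with the Node SDK (the input file name is a placeholder):

javascript
import { createReadStream } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Transcribe a local audio file to text
const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("meeting.mp3"), // placeholder file name
  model: "gpt-4o-transcribe",
});

console.log(transcription.text);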

Choosing the right API

There are multiple APIs for transcribing or generating audio:

API                    SUPPORTED MODALITIES                 STREAMING SUPPORT
Realtime API           Audio and text inputs and outputs    Audio streaming in and out
Chat Completions API   Audio and text inputs and outputs    Audio streaming out
Transcription API      Audio inputs                         Audio streaming out
Speech API             Text inputs and audio outputs        Audio streaming out

General use APIs vs. specialized APIs

The main distinction is general use APIs vs. specialized APIs. With the Realtime and Chat Completions APIs, you can use our latest models' native audio understanding and generation capabilities and combine them with other features like function calling. These APIs can be used for a wide range of use cases, and you can select the model you want to use.

On the other hand, the Transcription, Translation, and Speech APIs are specialized to work with specific models and are only meant for one purpose.

Talking with a model vs. controlling the script

Another way to select the right API is to ask yourself how much control you need. To design conversational interactions, where the model thinks and responds in speech, use the Realtime or Chat Completions API, depending on whether you need low latency or not. You won't know exactly what the model will say ahead of time, as it will generate audio responses directly, but the conversation will feel natural.

For more control and predictability, you can use the speech-to-text / LLM / text-to-speech pattern, so you know exactly what the model will say and can control the response. Please note that with this method, there will be added latency.

This is what the Audio APIs are for: pair an LLM with the audio/transcriptions and audio/speech endpoints to take spoken user input, process and generate a text response, and then convert that to speech that the user can hear.
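
A minimal sketch of this chained speech-to-text / LLM / text-to-speech pattern with the Node SDK; the file names are placeholders, and the intermediate answer can be inspected or edited before it is spoken.

javascript
import { createReadStream, writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// 1. Speech to text: transcribe the user's spoken input
const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("question.wav"), // placeholder recording
  model: "gpt-4o-mini-transcribe",
});

// 2. LLM: generate a text response you fully control
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: transcription.text }],
});
const answer = completion.choices[0].message.content;

// 3. Text to speech: speak exactly the text decided on above
const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "alloy",
  input: answer,
});
writeFileSync("answer.mp3", Buffer.from(await speech.arrayBuffer()));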

Recommendations

If you need real-time interactions or transcription, use the Realtime API.

If realtime is not a requirement but you're looking to build a voice agent or an audio-based application that requires features such as function calling, use the Chat Completions API.

For use cases with one specific purpose, use the Transcription, Translation, or Speech APIs.

Add audio to your existing application

Models such as GPT-4o or GPT-4o mini are natively multimodal, meaning they can understand and generate multiple modalities as input and output.

If you already have a text-based LLM application with the Chat Completions endpoint, you may want to add audio capabilities. For example, if your chat application supports text input, you can add audio input and output: just include audio in the modalities array and use an audio model, like gpt-4o-audio-preview.

Audio is not yet supported in the Responses API.

Audio output from model

Create a human-like audio response to a prompt:

javascript
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?",
    },
  ],
  store: true,
});

// Inspect returned data
console.log(response.choices[0]);

// Write the base64-encoded WAV audio to a binary file
// (no encoding option: the data is a Buffer, not text)
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64")
);

Audio input to model
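
The companion example sends audio into the model instead; below is a minimal sketch in the same style, assuming a local question.wav recording (a placeholder) that is base64-encoded and passed as an input_audio content part.

javascript
import { readFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Read a local recording and base64-encode it for the request
const base64Audio = readFileSync("question.wav").toString("base64"); // placeholder file

const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this recording?" },
        {
          type: "input_audio",
          input_audio: { data: base64Audio, format: "wav" },
        },
      ],
    },
  ],
});

// The reply includes both a transcript and audio data
console.log(response.choices[0].message);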

