Audio and Speech - OpenAI API

Explore audio and speech features in the OpenAI API.

The OpenAI API provides a range of audio capabilities. If you know what you want to build, find your use case below to get started. If you're not sure where to start, read this page as an overview.

Build with audio

Build voice agents
Build interactive voice-driven applications.

Transcribe audio
Convert speech to text instantly and accurately.

Speak text
Turn text into natural-sounding speech in real time.

A tour of audio use cases

LLMs can process audio by using sound as input, creating sound as output, or both. OpenAI has several API endpoints that help you build audio applications or voice agents.

Voice agents

Voice agents understand audio to handle tasks and respond back in natural language. There are two main ways to approach voice agents: either with speech-to-speech models and the Realtime API, or by chaining together a speech-to-text model, a text language model to process the request, and a text-to-speech model to respond. Speech-to-speech is lower latency and more natural, but chaining together a voice agent is a reliable way to extend a text-based agent into a voice agent. If you are already using the Agents SDK, you can extend your existing agents with voice capabilities using the chained approach.

Streaming audio

Process audio in real time to build voice agents and other low-latency applications, including transcription use cases. You can stream audio in and out of a model with the Realtime API. Our advanced speech models provide automatic speech recognition for improved accuracy, low-latency interactions, and multilingual support.
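
Below is a minimal sketch of streaming a spoken response over the Realtime API's WebSocket interface, assuming a Node environment with the ws package installed; the endpoint, beta header, and event names (response.create, response.audio.delta) follow the beta Realtime protocol and should be verified against the current API reference.

javascript
import WebSocket from "ws";

// Connect to the Realtime API over WebSocket (beta header required)
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Ask the model to generate a spoken (and text) response
  ws.send(
    JSON.stringify({
      type: "response.create",
      response: { modalities: ["audio", "text"] },
    })
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Audio arrives incrementally as base64-encoded chunks
  if (event.type === "response.audio.delta") {
    const chunk = Buffer.from(event.delta, "base64");
    // ...play or buffer the audio chunk here...
  }
});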

Text to speech

For turning text into speech, use the Audio API audio/speech endpoint. Models compatible with this endpoint are gpt-4o-mini-tts, tts-1, and tts-1-hd. With gpt-4o-mini-tts, you can ask the model to speak a certain way or with a certain tone of voice.
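
For example, here is a minimal sketch of calling the audio/speech endpoint with the official Node SDK; the output file name and spoken text are placeholders, and the instructions field (which steers delivery) applies to gpt-4o-mini-tts.

javascript
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate speech; `instructions` steers tone with gpt-4o-mini-tts
const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "coral",
  input: "Today is a wonderful day to build something people love!",
  instructions: "Speak in a cheerful and positive tone.",
});

// The response body is binary audio (MP3 by default)
writeFileSync("speech.mp3", Buffer.from(await speech.arrayBuffer()));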

Speech to text

For speech to text, use the Audio API audio/transcriptions endpoint. Models compatible with this endpoint are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1. With streaming, you can continuously pass in audio and get a continuous stream of text back.
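
A minimal sketch of a one-shot transcription with the Node SDK (the input file name is a placeholder):

javascript
import { createReadStream } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Transcribe a local audio file to text
const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("meeting.mp3"), // placeholder file name
  model: "gpt-4o-transcribe",
});

console.log(transcription.text);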

Choosing the right API

There are multiple APIs for transcribing or generating audio:

API                    SUPPORTED MODALITIES                 STREAMING SUPPORT
Realtime API           Audio and text inputs and outputs    Audio streaming in and out
Chat Completions API   Audio and text inputs and outputs    Audio streaming out
Transcription API      Audio inputs                         Audio streaming out
Speech API             Text inputs and audio outputs        Audio streaming out

General use APIs vs. specialized APIs

The main distinction is general use APIs vs. specialized APIs. With the Realtime and Chat Completions APIs, you can use our latest models' native audio understanding and generation capabilities and combine them with other features like function calling. These APIs can be used for a wide range of use cases, and you can select the model you want to use.

On the other hand, the Transcription, Translation, and Speech APIs are specialized to work with specific models and are only meant for one purpose.

Talking with a model vs. controlling the script

Another way to select the right API is to ask yourself how much control you need. To design conversational interactions, where the model thinks and responds in speech, use the Realtime or Chat Completions API, depending on whether you need low latency or not. You won't know exactly what the model will say ahead of time, as it will generate audio responses directly, but the conversation will feel natural.

For more control and predictability, you can use the speech-to-text / LLM / text-to-speech pattern, so you know exactly what the model will say and can control the response. Please note that with this method, there will be added latency.

This is what the Audio APIs are for: pair an LLM with the audio/transcriptions and audio/speech endpoints to take spoken user input, process and generate a text response, and then convert that to speech that the user can hear.
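
A minimal sketch of this chained speech-to-text / LLM / text-to-speech pattern with the Node SDK; the file names are placeholders, and the intermediate answer can be inspected or edited before it is spoken.

javascript
import { createReadStream, writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// 1. Speech to text: transcribe the user's spoken input
const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("question.wav"), // placeholder recording
  model: "gpt-4o-mini-transcribe",
});

// 2. LLM: generate a text response you fully control
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: transcription.text }],
});
const answer = completion.choices[0].message.content;

// 3. Text to speech: speak exactly the text decided on above
const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "alloy",
  input: answer,
});
writeFileSync("answer.mp3", Buffer.from(await speech.arrayBuffer()));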

Recommendations

If you need real-time interactions or transcription, use the Realtime API.

If realtime is not a requirement but you're looking to build a voice agent or an audio-based application that requires features such as function calling, use the Chat Completions API.

For use cases with one specific purpose, use the Transcription, Translation, or Speech APIs.

Add audio to your existing application

Models such as GPT-4o or GPT-4o mini are natively multimodal, meaning they can understand and generate multiple modalities as input and output.

If you already have a text-based LLM application with the Chat Completions endpoint, you may want to add audio capabilities. For example, if your chat application supports text input, you can add audio input and output: just include audio in the modalities array and use an audio model, like gpt-4o-audio-preview.

Audio is not yet supported in the Responses API.

Audio output from model

Create a human-like audio response to a prompt:

javascript
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?",
    },
  ],
  store: true,
});

// Inspect returned data
console.log(response.choices[0]);

// Write the base64-encoded WAV audio to a binary file
// (no encoding option: the data is a Buffer, not text)
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64")
);

Audio input to model
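
The companion example sends audio into the model instead; below is a minimal sketch in the same style, assuming a local question.wav recording (a placeholder) that is base64-encoded and passed as an input_audio content part.

javascript
import { readFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Read a local recording and base64-encode it for the request
const base64Audio = readFileSync("question.wav").toString("base64"); // placeholder file

const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this recording?" },
        {
          type: "input_audio",
          input_audio: { data: base64Audio, format: "wav" },
        },
      ],
    },
  ],
});

// The reply includes both a transcript and audio data
console.log(response.choices[0].message);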

