A Deep Dive into OpenAI's Realtime API for Voice and Text Interactions

By Tanu Chahal

31/10/2024


The Realtime API allows developers to build low-latency, multi-modal conversational experiences, supporting both text and audio as input and output, alongside function calling. One of its standout features is native speech-to-speech: audio is handled directly rather than passing through an intermediate text representation, which keeps latency low and responses natural and fluid. The voices generated by the API exhibit human-like inflections; they can whisper, laugh, and follow tone directions, making conversations more engaging and lifelike. The API also supports simultaneous multimodal output, so a response can produce audio and text at the same time, for example when the text stream needs to pass through moderation while the audio plays. Because audio is generated faster than real time, playback stays smooth, which makes the API well suited to interactive applications.

Currently in beta, the Realtime API does not offer client-side authentication, meaning developers must handle routing audio from the client to a secure application server, which in turn communicates with the API. Given the unpredictable nature of network conditions, especially for client-side or telephony applications, delivering audio in real-time at scale can be challenging. Developers are advised to consider third-party solutions for production use in scenarios where network reliability cannot be guaranteed. The API is designed to work via a server-side WebSocket interface, and a demo console is available to showcase its features, though the demo patterns are not recommended for production environments.
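To make this concrete, here is a minimal sketch of a server-side connection, assuming the Python websocket-client package and the beta endpoint, model name, and OpenAI-Beta header in use at the time of writing:

```python
import os
import websocket  # pip install websocket-client

# The connection is made from the application server, so the API key never
# reaches the browser or telephony client.
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

ws = websocket.create_connection(
    URL,
    header=[
        "Authorization: Bearer " + os.environ["OPENAI_API_KEY"],
        "OpenAI-Beta: realtime=v1",  # opts in to the beta interface
    ],
)

# The first event the server sends describes the newly created session.
print(ws.recv())
ws.close()
```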

The API operates in a stateful, event-driven manner, where communication between the client and server is conducted over WebSockets. Once connected, the API manages interactions via JSON-formatted events representing text, audio, function calls, and various configuration updates. A session is established upon connection; during the session, clients can send text or audio and receive audio output along with text transcripts of that output. The server maintains a list of items in the conversation history, allowing for a seamless back-and-forth interaction, with messages, function calls, and function call outputs as the key item types. Developers can append audio to the Input Audio Buffer, and the server responds with a variety of events based on the session's configuration, as sketched below.
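As an illustration of the event shapes, the snippet below builds a conversation.item.create event for a user text message and a helper that streams raw audio into the Input Audio Buffer; it is a sketch that assumes a ws connection like the one above and a caller that supplies PCM chunks:

```python
import base64
import json

# Append a user text message to the conversation history.
create_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "What's the weather like today?"}],
    },
}

def append_audio(ws, pcm_chunk: bytes) -> None:
    """Send one chunk of raw PCM16 audio to the Input Audio Buffer."""
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),  # audio is always base64 in events
    }
    ws.send(json.dumps(event))

# With a live connection: ws.send(json.dumps(create_item))
```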

One of the key aspects of the API is how it handles responses. In the default mode, the server performs voice activity detection (VAD) and generates a response once the user finishes speaking. For more customized experiences, developers can disable this turn detection and trigger responses explicitly, which is useful for push-to-talk interfaces or when the client runs its own voice detection. The API also supports function calling, where the model may invoke external tools or functions based on user input. These functions can be configured session-wide or set per response, providing flexibility for tasks like retrieving weather information or submitting orders.
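For example, a push-to-talk session might turn server VAD off and register a single tool; the get_weather function below is hypothetical and only illustrates the tool schema:

```python
import json

# Disable turn detection and register one function tool for the whole session.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": None,  # no server VAD; the client decides when a turn ends
        "tools": [
            {
                "type": "function",
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

# With VAD disabled, the client commits the buffered audio and requests a reply explicitly.
commit_turn = {"type": "input_audio_buffer.commit"}
request_reply = {"type": "response.create"}

# ws.send(json.dumps(session_update)); ws.send(json.dumps(commit_turn)); ws.send(json.dumps(request_reply))
```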

In terms of audio, the Realtime API currently supports two formats: raw 16-bit PCM at 24 kHz (mono, little-endian) and G.711 at 8 kHz (µ-law and A-law). Audio must be base64-encoded before being transmitted to the server, after which the API processes it and can return corresponding text and voice outputs. The system is designed to maintain a consistent interaction flow even in complex scenarios such as function calling or interruptions during audio playback, and developers can update the default settings and control various aspects of the conversation dynamically throughout the session.
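The sketch below prepares append events from a 24 kHz, mono, 16-bit WAV file (the file name is hypothetical), chunking the audio into roughly 100 ms base64-encoded pieces:

```python
import base64
import json
import wave

# Read a 24 kHz, mono, 16-bit PCM WAV file (hypothetical path).
with wave.open("hello_24khz_mono.wav", "rb") as wav:
    assert wav.getframerate() == 24000 and wav.getnchannels() == 1 and wav.getsampwidth() == 2
    pcm = wav.readframes(wav.getnframes())

CHUNK = 24000 * 2 // 10  # ~100 ms of 16-bit mono samples per event

events = [
    json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm[i:i + CHUNK]).decode("ascii"),
    })
    for i in range(0, len(pcm), CHUNK)
]
# Each string in `events` can be sent over the WebSocket, followed by an
# input_audio_buffer.commit once the utterance is complete.
```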

The Realtime API also emphasizes natural speech synthesis and transcription capabilities. While the model natively produces text transcripts of its audio outputs, there can occasionally be deviations between the spoken words and the text, particularly when the output includes elements like code or technical terms that the model might skip verbalizing. On the input side, transcription is not generated automatically, since the API processes audio directly; it can, however, be enabled by updating the session's configuration. Additionally, the API supports interruptions: users can halt an ongoing audio response, and the truncated response is retained in the conversation history so the interaction can continue from there.
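To close the loop, here is a sketch of two of those behaviors, assuming an open ws connection: enabling Whisper-based input transcription via session.update, and truncating an assistant item when the user interrupts playback:

```python
import json

# Input transcription is off by default; this asks the server to transcribe
# incoming audio with Whisper alongside the native audio processing.
enable_transcription = {
    "type": "session.update",
    "session": {"input_audio_transcription": {"model": "whisper-1"}},
}

def handle_interruption(ws, item_id: str, played_ms: int) -> None:
    """Stop the in-progress response and trim the stored item to what was heard."""
    ws.send(json.dumps({"type": "response.cancel"}))
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,         # the assistant message currently playing back
        "content_index": 0,
        "audio_end_ms": played_ms,  # how much of the audio the user actually heard
    }))

# ws.send(json.dumps(enable_transcription))
```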