Robots Atlas

OpenAI Launches GPT-Realtime-2: Voice Intelligence with GPT-5-Class Reasoning

On May 7, 2026, OpenAI released three new audio models in its Realtime API: GPT-Realtime-2 equipped with GPT-5-class reasoning, GPT-Realtime-Translate enabling live speech translation across more than 70 input languages and 13 output languages, and GPT-Realtime-Whisper — a model for streaming speech transcription. The launch sets a new benchmark for AI voice interfaces, which until now were confined to simple call-and-response interactions without genuine context analysis or multi-step reasoning.

Key takeaways

  • GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning and a context window expanded to 128K tokens (previously 32K)
  • GPT-Realtime-Translate supports 70+ input languages and 13 output languages, priced at $0.034/minute
  • GPT-Realtime-Whisper: live streaming speech-to-text transcription, priced at $0.017/minute
  • GPT-Realtime-2 scored 15.2% higher on Big Bench Audio and 13.8% higher on Audio MultiChallenge than its predecessor GPT-Realtime-1.5
  • Zillow reported a 26-point increase in call success rate after prompt optimization with GPT-Realtime-2 (95% vs. 69%)

Three models — three distinct use cases

GPT-Realtime-2 is not a simple successor to GPT-Realtime-1.5. OpenAI redesigned the model to carry a conversation while simultaneously calling tools, checking context, and adjusting tone to the situation. A new "preamble" feature lets the model speak short phrases — "let me check that" or "one moment" — before generating a full response, signaling to the user that the AI agent is actively processing the request rather than hanging. Parallel tool calls and the ability to announce them audibly — "checking your calendar" — make the voice assistant behave like a real conversational partner. The maximum context window has grown fourfold: from 32K to 128K tokens.
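The tool-calling and preamble behavior described above could be configured as a Realtime session event along these lines. This is a minimal sketch, not a confirmed API shape: the `preamble` field and the `check_calendar` tool are hypothetical illustrations, and only the model name comes from the launch announcement.

```python
import json

def build_session_update(voice: str = "alloy") -> dict:
    """Build a hypothetical session.update event enabling tools and preambles."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",        # model id from the launch
            "voice": voice,
            "modalities": ["audio", "text"],
            # Hypothetical switch for the "preamble" behavior: short spoken
            # fillers ("let me check that") while the full response is prepared.
            "preamble": {"enabled": True},
            "tools": [
                {
                    # Example tool the model could call in parallel and
                    # announce audibly ("checking your calendar").
                    "type": "function",
                    "name": "check_calendar",
                    "description": "Look up the user's calendar for a date",
                    "parameters": {
                        "type": "object",
                        "properties": {"date": {"type": "string"}},
                        "required": ["date"],
                    },
                }
            ],
        },
    }

event = build_session_update()
print(json.dumps(event)[:60])
```

In a live session this payload would be sent over the WebSocket or WebRTC data channel after the connection opens.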

GPT-Realtime-Translate follows a separate path: a dedicated translation model running on a distinct /v1/realtime/translations endpoint. Unlike the agent mode, a translation session is continuous: the model processes incoming audio immediately, without waiting for a turn to close. Deutsche Telekom is testing the model for multilingual customer interactions; BolnaAI reported a 12.5% lower word error rate (WER) in Hindi, Tamil, and Telugu than with any other model it tested.
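A continuous translation session on the dedicated endpoint might be configured roughly like this. The endpoint path and model name come from the article; the `output_language` parameter name and the auto-detection behavior are assumptions for illustration only.

```python
# Endpoint path as described in the article.
TRANSLATE_ENDPOINT = "wss://api.openai.com/v1/realtime/translations"

def build_translation_config(target_lang: str = "de") -> dict:
    """Build a hypothetical config for a continuous translation session."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",
            # Assumed: input language is detected among the 70+ supported
            # languages; only the output language (one of 13) is pinned.
            "output_language": target_lang,  # hypothetical parameter name
            "modalities": ["audio"],
        },
    }

print(build_translation_config("de")["session"]["output_language"])
```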

GPT-Realtime-Whisper rounds out the offering with streaming transcription. The model can power live captions, meeting notes generated mid-conversation, CRM systems, and customer support tools. Unlike traditional speech-to-speech AI, Whisper does not generate audio responses — it delivers text only.
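A text-only consumer for such a stream is simple to sketch: accumulate incremental transcript events as they arrive. The `transcript.delta` event name below is illustrative, not a documented event type.

```python
def collect_transcript(events) -> str:
    """Accumulate text from a stream of incremental transcription events."""
    parts = []
    for ev in events:
        # Hypothetical event name for an incremental transcript chunk.
        if ev.get("type") == "transcript.delta":
            parts.append(ev["delta"])
    return "".join(parts)

# Simulated event stream, e.g. for live captions or mid-meeting notes.
sample = [
    {"type": "transcript.delta", "delta": "Live captions "},
    {"type": "transcript.delta", "delta": "as you speak."},
    {"type": "session.updated"},  # non-transcript events are ignored
]
print(collect_transcript(sample))  # -> Live captions as you speak.
```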

Benchmarks and early tester results

OpenAI publishes two sets of benchmark results. Big Bench Audio measures reasoning capabilities of models supporting audio input — GPT-Realtime-2 (high) scores 15.2% higher than GPT-Realtime-1.5. Audio MultiChallenge tests multi-turn conversational intelligence, including instruction following, self-consistency, and handling natural speech corrections — GPT-Realtime-2 (xhigh) scores 13.8% higher than its predecessor.

Zillow, testing the model for voice-based property search, confirmed a 26-point increase in call success rate after prompt optimization (95% vs. 69%), along with improved Fair Housing compliance. Glean, Genspark, and Priceline report similar results in their own domains — productivity assistants, travel planning, and customer service.

Pricing and availability

All three models are available through the OpenAI API in the Realtime API. GPT-Realtime-2 is billed per token: $32 per 1M audio input tokens ($0.40 per 1M cached input tokens) and $64 per 1M audio output tokens. GPT-Realtime-Translate and GPT-Realtime-Whisper are billed per minute, at $0.034/min and $0.017/min respectively. Compared to the previous GPT-Realtime-1.5, the new models offer higher capability at a comparable or lower per-minute cost for translation and transcription use cases.
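As a back-of-the-envelope check on the rates above, the token counts in the example below are arbitrary illustrations:

```python
# Published rates: $32 / 1M audio input tokens, $64 / 1M audio output tokens
# for GPT-Realtime-2; $0.034/min (Translate) and $0.017/min (Whisper).
INPUT_PER_M = 32.0
OUTPUT_PER_M = 64.0

def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-session cost for GPT-Realtime-2 under per-token billing."""
    return (input_tokens / 1_000_000) * INPUT_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PER_M

def per_minute_cost(minutes: float, rate: float) -> float:
    """Per-session cost for the per-minute models."""
    return minutes * rate

# A session with 50k audio input tokens and 10k output tokens:
print(round(realtime2_cost(50_000, 10_000), 2))   # 1.6 + 0.64 -> 2.24
# One hour of translation vs. one hour of transcription:
print(round(per_minute_cost(60, 0.034), 3))       # -> 2.04
print(round(per_minute_cost(60, 0.017), 3))       # -> 1.02
```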

For developers building browser-based apps, the recommended path is WebRTC via the Agents SDK. Server applications handling media — phone switches, streaming — can use WebSocket. A SIP option is also available for telephony. EU Data Residency is fully supported for EU-based applications.
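The transport guidance above can be expressed as a tiny lookup helper. The scenario names are our own labels for the cases the article lists, not SDK values:

```python
# Deployment scenario -> recommended connection path, per the article.
TRANSPORTS = {
    "browser": "webrtc",          # browser apps, via the Agents SDK
    "server_media": "websocket",  # phone switches, streaming backends
    "telephony": "sip",           # call centers / PBX integration
}

def pick_transport(scenario: str) -> str:
    """Map a deployment scenario to its recommended Realtime transport."""
    try:
        return TRANSPORTS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}")

print(pick_transport("browser"))  # -> webrtc
```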

Market context: Google and ElevenLabs as reference points

OpenAI is not alone in the voice AI model space. Google offers Gemini Live with native audio support in the Gemini 2.5 Pro model, while ElevenLabs specializes in high-quality speech synthesis. The difference lies in approach: while competitors focus primarily on voice quality or the conversational layer, GPT-Realtime-2 combines reasoning, tool calls, and tone adjustment in a single model. Until now, achieving this required either a separate LLM and traditional TTS pipeline, or accepting weaker reasoning in a voice-native model.

Why it matters

Voice interfaces have long operated on the periphery of the AI ecosystem — useful for narrow applications (dictation, simple commands) but too unreliable for complex workflows. The core barrier was the absence of real-time reasoning: the model had to hand off the query to a separate LLM, process it as text, and only then generate a voice response. GPT-Realtime-2 closes this loop — reasoning happens directly in the audio layer, without the cost of modal switching. For customer service, education, and in-car systems, this means building assistants that genuinely understand complex questions rather than matching them to templates. Expanding the context window to 128K tokens allows the model to sustain long, coherent sessions — something previously exclusive to text-based chat interfaces.

What's next

  • OpenAI has announced SIP support for telephony — a new connection path for call centers and enterprise solutions, available alongside WebRTC and WebSocket
  • EU AI Act enters further application stages in 2026 — voice AI agent providers will need to meet disclosure requirements for AI system identification toward users (a requirement already embedded in OpenAI's usage policy)
  • The per-minute translation price ($0.034) at scale of millions of conversations creates significant revenue potential — OpenAI has not disclosed a forecast, but Deutsche Telekom and Priceline as first partners signal the direction of commercialization
