Robots Atlas

GPT Realtime Whisper

gpt-realtime-whisper · Family: GPT
OpenAI streaming speech-to-text model for low-latency realtime transcription, served via the Realtime transcription API.
✓ Active · ✓ Public access · Audio · 📁 GPT
Context window: 16K tokens
Max output: 2,000 tokens
Access: API · Deployment: ☁ Cloud

Overview

GPT-Realtime-Whisper is a specialized OpenAI streaming speech-to-text model designed for realtime applications that require low-latency transcript deltas emitted while the speaker is still talking. The model lets developers tune the trade-off between latency and transcription accuracy.

Transcription sessions use a session type of 'transcription' and support both WebSocket transport (24 kHz mono PCM, base64-encoded) and WebRTC. Server VAD (voice activity detection) can be configured with threshold, prefix_padding_ms, and silence_duration_ms parameters, or audio buffers can be committed manually. The model emits conversation.item.input_audio_transcription.delta and .completed events.
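The session setup and event shapes described above can be sketched in Python. The event type names (`transcription_session.update`, `input_audio_buffer.append`, `input_audio_buffer.commit`) and VAD parameters follow the Realtime transcription API as summarized here; treat the exact payload shapes as assumptions rather than a definitive client implementation.

```python
import base64

def build_session_update(threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500):
    """Configure a transcription session with server VAD (field names assumed)."""
    return {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",  # 24 kHz mono PCM over WebSocket
            "input_audio_transcription": {"model": "gpt-realtime-whisper"},
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            },
        },
    }

def build_audio_append(pcm_bytes):
    """Wrap a chunk of raw PCM16 audio as a base64-encoded append event."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    }

def build_commit():
    """Manually commit the audio buffer when server VAD is not used."""
    return {"type": "input_audio_buffer.commit"}
```

In a real client, each of these dicts would be JSON-serialized and sent over the WebSocket; with server VAD enabled, the commit event is unnecessary because the server segments turns on silence.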

Typical use cases include live captions, meeting transcription, lecture capture, telephony transcription, broadcast captioning, and dictation. Pricing is based on audio duration (USD per minute) rather than tokens. Within OpenAI's transcription model family, it is positioned as a natively streaming alternative to GPT-4o Transcribe, GPT-4o mini Transcribe, and Whisper-1.
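Because billing is per minute of audio rather than per token, cost estimation reduces to simple duration arithmetic. The rate below is a placeholder for illustration only, not a published price.

```python
RATE_USD_PER_MINUTE = 0.006  # hypothetical placeholder rate, not a real price

def transcription_cost(duration_seconds: float, rate: float = RATE_USD_PER_MINUTE) -> float:
    """Duration-based pricing: cost scales with audio minutes, not tokens."""
    return (duration_seconds / 60.0) * rate

# A one-hour meeting at the placeholder rate:
print(f"${transcription_cost(3600):.2f}")  # prints "$0.36"
```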

Classification
Audio
Family: GPT
Access & deployment
API
Cloud
Weights: Closed
Key parameters
📏 Context: 16K tokens
📥 Input: audio, text

Technical specification

Context window: 16K tokens
Max output tokens: 2,000 per response
Knowledge cutoff: 30 Sept 2024
Modalities:
⬇ Input: audio, text
⬆ Output: text

Capabilities and applications

Native model capabilities
Streaming Speech-to-Text
Real-time conversion of speech to text with immediate output as the speaker is talking.
Category: speech
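As a sketch of how a client might consume this streaming capability: partial `delta` events extend the caption currently on screen, and a `completed` event replaces it with the final transcript. The event type names match those listed in the overview; the dispatch logic itself is an illustrative assumption.

```python
# Minimal sketch of turning the model's transcription events into live captions.
# Event type names come from the model description; the client logic is illustrative.

def caption_stream(events):
    """Yield the caption line after each event: partials grow, completed lines reset."""
    current = ""
    for event in events:
        if event["type"] == "conversation.item.input_audio_transcription.delta":
            current += event["delta"]   # low-latency partial transcript
            yield current
        elif event["type"] == "conversation.item.input_audio_transcription.completed":
            yield event["transcript"]   # final transcript for the turn
            current = ""                # start the next caption

demo = [
    {"type": "conversation.item.input_audio_transcription.delta", "delta": "Hello"},
    {"type": "conversation.item.input_audio_transcription.delta", "delta": " world"},
    {"type": "conversation.item.input_audio_transcription.completed",
     "transcript": "Hello, world."},
]
print(list(caption_stream(demo)))  # ['Hello', 'Hello world', 'Hello, world.']
```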

Technical architecture

Core Architecture