Speech-to-speech AI
How it works
The model takes audio as input, analyzes its content and paralinguistic features, then generates a response directly as audio output. The native speech-to-speech variant operates as a single multimodal model, while the pipeline variant chains multiple components: ASR, LLM, and TTS.
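A minimal sketch of the two variants, using hypothetical transcribe, generate_reply, and synthesize callables as stand-ins for the ASR, LLM, and TTS stages (none of these names come from a specific library):

    def cascade_turn(transcribe, generate_reply, synthesize, input_audio: bytes) -> bytes:
        """One conversational turn through a cascade (pipeline) system.
        The three callables are stand-ins for the ASR, LLM, and TTS stages."""
        transcript = transcribe(input_audio)      # ASR: prosody, emotion, speaker identity are dropped here
        reply_text = generate_reply(transcript)   # LLM: text in, text out
        return synthesize(reply_text)             # TTS: expression rebuilt from text alone

    def native_turn(s2s_model, input_audio: bytes) -> bytes:
        """One conversational turn through a native speech-to-speech model."""
        # a single multimodal model maps input audio directly to output audio
        return s2s_model.generate(input_audio)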
Problem solved
Traditional voice pipelines increase latency and can lose information carried in speech, such as emotion, intent, accent, and prosodic nuances. Speech-to-speech AI mitigates both problems by handling voice input and output directly.
Components
Audio encoder: a module that converts the input audio signal (raw samples or mel spectrograms) into latent representations used by downstream components. In cascade architectures the encoder role is played by the ASR model, which produces text tokens. In end-to-end architectures the encoder processes audio into continuous representations that preserve paralinguistic information.
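As an illustration of typical encoder input features, a log-mel spectrogram can be computed with librosa; the parameter values here (80 mel bands, 25 ms window, 10 ms hop at 16 kHz) are common choices, not requirements of any particular model:

    import librosa

    # load audio resampled to 16 kHz mono
    waveform, sample_rate = librosa.load("input.wav", sr=16000, mono=True)
    # 80-band mel power spectrogram with a 25 ms window and 10 ms hop
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate, n_mels=80, n_fft=400, hop_length=160
    )
    # convert power to decibels (log-mel features)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (80, n_frames)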
Core model: the component that processes input representations and generates response representations. In cascade architectures this is a text-operating LLM. In end-to-end architectures it may be an LLM conditioned on audio tokens/embeddings, or a seq2seq model trained directly on audio pairs.
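A sketch of one way an end-to-end system can condition a text LLM on audio: the encoder's continuous features are mapped into the LLM's embedding space by a small projector and prepended to the text embeddings. Dimensions and module names below are illustrative assumptions, not taken from any specific model:

    import torch
    import torch.nn as nn

    class AudioProjector(nn.Module):
        """Maps encoder features (e.g., 1024-dim) into the LLM embedding space (e.g., 4096-dim)."""
        def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(audio_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
            # audio_features: (batch, n_frames, audio_dim) -> (batch, n_frames, llm_dim)
            return self.proj(audio_features)

    projector = AudioProjector()
    audio_features = torch.randn(1, 150, 1024)   # roughly 3 s of encoder output (illustrative)
    audio_embeds = projector(audio_features)     # ready to prepend to the text token embeddings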
Audio decoder: the component that generates the output audio signal from response representations. In cascade architectures this is a TTS module operating on text. In end-to-end architectures it decodes latent representations into spectrograms or audio tokens, which a vocoder model then converts to the output waveform.
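As a simple stand-in for the spectrogram-to-waveform step, librosa's Griffin-Lim based inversion can turn a mel spectrogram back into audio; production systems use neural vocoders (e.g., HiFi-GAN) instead. The parameters mirror the encoder sketch above:

    import librosa
    import soundfile as sf

    # compute a mel power spectrogram (same parameters as the encoder sketch above)
    waveform, _ = librosa.load("reply_source.wav", sr=16000, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=16000, n_mels=80, n_fft=400, hop_length=160)

    # invert the mel power spectrogram back to a waveform via Griffin-Lim
    reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=16000, n_fft=400, hop_length=160)
    sf.write("reply.wav", reconstructed, 16000)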
Voice activity detection (VAD): the component that detects the start and end of user speech in the audio stream, critical for natural turn-taking in conversation. Modern VAD models (e.g., Silero VAD) process 30 ms audio frames in under 1 ms on CPU.
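A sketch of offline speech-segment detection with Silero VAD loaded through torch.hub; the entry-point and helper names follow the project's published examples, so verify them against the current Silero VAD README:

    import torch

    # load Silero VAD and its helper utilities from torch.hub
    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils

    # read 16 kHz audio and get speech segments as sample offsets
    wav = read_audio("input.wav", sampling_rate=16000)
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
    print(speech_timestamps)  # e.g., [{'start': 1200, 'end': 30800}, ...]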
Implementation
In an STT→LLM→TTS cascade, the speech-to-text step irreversibly discards prosody, emotion, speaking rate, disfluencies, and voice characteristics. The TTS stage must reconstruct expression from scratch, losing naturalness and emotional context.
ASR errors propagate and can be amplified by the LLM (incorrect intent understanding) and the TTS stage (generating an incorrect or inadequate response). Error accumulation is particularly acute for key terms, proper names, and domain-specific jargon.
Direct (end-to-end) models require audio-to-audio pairs (input→output), which are significantly harder to collect than text or audio-transcript pairs. This is especially problematic for low-resource languages and often results in poorer generalization of direct models on them.
Real-time S2S systems are sensitive to network jitter and connection quality. Standard telephone codecs (e.g., 8 kHz G.711 in PSTN/Twilio) degrade audio below modern model requirements (models are typically trained on 16 kHz audio). GPT-4o Realtime and Gemini Live achieve their best results with 16 kHz wideband audio but lose their advantage over cascade systems on 8 kHz telephony.
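When ingesting 8 kHz telephony audio, upsampling to the model's expected rate is straightforward, although it cannot restore frequency content already removed by the codec. For example with librosa:

    import librosa

    # load narrowband telephony audio at its native 8 kHz rate
    narrowband, _ = librosa.load("call_leg.wav", sr=8000)
    # upsample to 16 kHz; the missing 4-8 kHz band is not recovered
    wideband = librosa.resample(narrowband, orig_sr=8000, target_sr=16000)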
Evolution
Jia et al. published Translatotron (2019, arXiv:1904.06037), the first seq2seq model for direct speech-to-speech translation without intermediate text. The model took source-language mel spectrograms and generated target-language mel spectrograms, and demonstrated voice-characteristic preservation via a speaker encoder. Translation quality was lower than that of cascade systems, but feasibility was demonstrated.
Google published Translatotron 2 (2021), achieving quality comparable to cascade systems on standard benchmarks while eliminating the voice-cloning vulnerability present in Translatotron 1.
Kyutai Labs published Moshi (2024), a speech-text foundation model for real-time dialogue. The model supports full-duplex operation: it can listen and speak simultaneously. Kyutai released the model weights and technical documentation. Moshi achieves a latency of ~160 ms theoretical and ~200 ms in practice.
OpenAI released GPT-4o with native speech-to-speech capabilities (May 2024 demo, October 2024 API). LLaMA-Omni (2024) demonstrated an open-source approach to end-to-end S2S based on LLaMA. End-to-end S2S architecture entered production commercial deployment at scale.
Technical details
Hyperparameters (configurable axes)
Fundamental choice between cascade architecture (STT→LLM→TTS) and direct end-to-end architecture. Determines latency, prosody preservation, debuggability, and training data requirements.
Whether the system supports half-duplex (turn-based, one side speaks at a time) or full-duplex (both sides can speak simultaneously, with barge-in capability). Full-duplex requires advanced VAD and barge-in mechanisms.
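A minimal sketch of barge-in handling in a turn-based loop with interruption, using hypothetical playback and vad_detects_speech objects rather than any specific library API:

    def playback_loop(playback, vad_detects_speech):
        """Play the assistant's audio, but stop immediately if the user starts speaking."""
        while playback.has_audio_remaining():
            if vad_detects_speech():
                playback.stop()          # barge-in: the user interrupted the assistant
                return "interrupted"
            playback.play_next_chunk()   # otherwise keep streaming output audio
        return "completed"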
Audio input/output format used by the model: raw waveform, mel spectrogram, discrete audio tokens (from audio codec e.g., EnCodec, SoundStream), or continuous embeddings.
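For the discrete-token representation, a sketch based on the encodec package's published usage example (the 24 kHz model quantized to 6 kbit/s; check the library's README for the current API):

    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    # pretrained 24 kHz EnCodec model, quantized at a 6 kbit/s target bandwidth
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)

    wav, sr = torchaudio.load("input.wav")
    wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

    with torch.no_grad():
        encoded_frames = model.encode(wav)
    # concatenate codebook indices: shape (batch, n_codebooks, n_frames)
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)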
Compute bottleneck
In cascade architectures the bottleneck is the sum of stage latencies: ASR + LLM + TTS, typically 2–4 seconds end-to-end. In end-to-end architectures the bottleneck is autoregressive audio token generation by the LLM (similar to text generation but with higher token volume per second of speech).
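An illustrative latency budget for the cascade case; the per-stage numbers below are assumptions for illustration, not benchmark results:

    # illustrative per-stage latencies in milliseconds (assumed values, not benchmarks)
    asr_ms = 500   # speech-to-text after the user stops speaking
    llm_ms = 1500  # full text response generation
    tts_ms = 700   # synthesis of the response audio

    # without streaming, the user waits for the sum of all stages
    print(asr_ms + llm_ms + tts_ms)  # 2700 ms, within the 2-4 s range noted above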
Execution paradigm
Both cascaded and end-to-end S2S architectures use dense processing in each of their components. Execution is stage-dependent: different components (encoder, LLM, decoder) are activated sequentially while processing a single query. In full-duplex systems (e.g., Moshi), input and output are processed simultaneously by a model capable of concurrently listening and speaking.
Parallelism
Multiple parallel requests from different users can be handled concurrently by separate model instances across devices. Audio streaming and stage overlapping can reduce perceived latency.
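A sketch of stage overlapping: each finished sentence is handed to TTS and playback while the LLM continues generating the rest of the reply. The stream_llm_tokens, synthesize_sentence, and play callables are hypothetical stand-ins:

    import asyncio

    async def respond(transcript, stream_llm_tokens, synthesize_sentence, play):
        """Overlap LLM generation with TTS and playback to cut perceived latency."""
        speak_queue: asyncio.Queue = asyncio.Queue()

        async def speaker():
            while True:
                sentence = await speak_queue.get()
                if sentence is None:              # sentinel: generation finished
                    return
                await play(await synthesize_sentence(sentence))

        speaker_task = asyncio.create_task(speaker())
        buffer = ""
        async for token in stream_llm_tokens(transcript):
            buffer += token
            if buffer.rstrip().endswith((".", "!", "?")):
                await speak_queue.put(buffer)     # TTS and playback proceed concurrently
                buffer = ""
        if buffer.strip():
            await speak_queue.put(buffer)
        await speak_queue.put(None)
        await speaker_task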
Hardware requirements
Both the components of cascaded models (Whisper, LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, GPT-4o backend) are Transformer architectures requiring GPUs with Tensor Cores for efficient inference. Real-time speech processing with latency <500 ms at production scale requires GPUs.
Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPU is insufficient to meet real-time latency requirements.