GPU Tensor Cores (PRIMARY)
Both the components of cascaded pipelines (Whisper, the LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, the GPT-4o backend) are Transformer architectures, which rely on GPUs with Tensor Cores for efficient low-precision matrix multiplication. Real-time speech processing at production scale, with end-to-end latency under 500 ms, requires GPU inference.
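As a minimal PyTorch sketch of the usual pattern (assuming a CUDA-capable machine; the layer and tensor shapes are illustrative): detect Tensor Core support via compute capability, then run inference under fp16 autocast so the matrix multiplications are eligible for Tensor Core execution.

```python
import torch

def has_tensor_cores() -> bool:
    # Tensor Cores first shipped with CUDA compute capability 7.0 (Volta).
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 7

if has_tensor_cores():
    # fp16 autocast routes matmuls through Tensor Cores on supported GPUs.
    layer = torch.nn.TransformerEncoderLayer(
        d_model=512, nhead=8, batch_first=True
    ).cuda().eval()
    x = torch.randn(1, 100, 512, device="cuda")  # (batch, frames, features)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        y = layer(x)
```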
End-to-end S2S models (~7B–70B parameters) require GPUs with large VRAM (24–80 GB): fp16 weights alone occupy about 2 bytes per parameter, so the larger models need quantization or multi-GPU sharding to fit. Cascaded systems can distribute components across smaller GPUs or the CPU, but the LLM component still needs a GPU to keep latency low.
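A back-of-the-envelope helper (the function name is ours, for illustration) shows how those VRAM figures follow from parameter count and precision; note it counts weights only, excluding the KV cache, activations, and framework overhead:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough VRAM for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (7, 70):
    for name, bpp in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
        print(f"{params}B @ {name}: ~{weight_vram_gb(params, bpp):.0f} GB")

# 7B  @ fp16: ~13 GB  -> fits a single 24 GB card
# 70B @ fp16: ~130 GB -> exceeds 80 GB; needs quantization or sharding
```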
CPU AVX (POSSIBLE)
Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPUs are insufficient to meet real-time latency requirements.
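A minimal sketch of CPU-only transcription, using faster-whisper (one common CPU-friendly Whisper runtime, whose CTranslate2 backend exploits AVX2; the audio file name is a placeholder):

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# int8 quantization keeps the tiny model fast on AVX2-capable CPUs.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav")  # placeholder input file
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```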
VAD runs efficiently on the CPU (under ~1 ms per 30 ms audio frame). Small STT models can likewise run on the CPU in resource-constrained environments, at the cost of higher latency.
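A sketch of per-frame VAD timing with Silero VAD loaded via torch.hub (assuming its published hub entry point; recent Silero releases expect 512-sample chunks at 16 kHz, i.e. 32 ms rather than 30 ms, and the zero chunk stands in for streamed audio):

```python
import time
import torch

model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
sample_rate = 16000
chunk = torch.zeros(512)  # one 32 ms frame; replace with real audio samples

start = time.perf_counter()
speech_prob = model(chunk, sample_rate).item()  # probability of speech
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"speech prob {speech_prob:.2f}, {elapsed_ms:.2f} ms per 32 ms frame")
```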