Speech-to-speech AI
How it works
The model takes audio as input, analyzes its content and paralinguistic features, then generates a response directly as audio output. The native speech-to-speech variant operates as a single multimodal model, while the pipeline variant chains multiple components: ASR, LLM, and TTS.
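A minimal sketch of the two variants, using hypothetical transcribe, generate_reply, and synthesize callables as stand-ins for the ASR, LLM, and TTS stages (none of these names come from a specific library):

    def cascade_turn(transcribe, generate_reply, synthesize, input_audio: bytes) -> bytes:
        """One conversational turn through a cascade (pipeline) system.
        The three callables are stand-ins for the ASR, LLM, and TTS stages."""
        transcript = transcribe(input_audio)      # ASR: prosody, emotion, speaker identity are dropped here
        reply_text = generate_reply(transcript)   # LLM: text in, text out
        return synthesize(reply_text)             # TTS: expression rebuilt from text alone

    def native_turn(s2s_model, input_audio: bytes) -> bytes:
        """One conversational turn through a native speech-to-speech model."""
        # a single multimodal model maps input audio directly to output audio
        return s2s_model.generate(input_audio)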
Problem solved
Traditional voice pipelines increase latency and can lose information carried in speech, such as emotion, intent, accent, and prosodic nuances. Speech-to-speech AI mitigates both problems by handling voice input and output directly.
Components
Audio encoder: a module that converts the input audio signal (raw samples or mel spectrograms) into latent representations used by downstream components. In cascade architectures the encoder role is played by the ASR model, which produces text tokens. In end-to-end architectures the encoder processes audio into continuous representations that preserve paralinguistic information.
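As an illustration of typical encoder input features, a log-mel spectrogram can be computed with librosa; the parameter values here (80 mel bands, 25 ms window, 10 ms hop at 16 kHz) are common choices, not requirements of any particular model:

    import librosa

    # load audio resampled to 16 kHz mono
    waveform, sample_rate = librosa.load("input.wav", sr=16000, mono=True)
    # 80-band mel power spectrogram with a 25 ms window and 10 ms hop
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate, n_mels=80, n_fft=400, hop_length=160
    )
    # convert power to decibels (log-mel features)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (80, n_frames)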
Core model: the component that processes input representations and generates response representations. In cascade architectures this is a text-operating LLM. In end-to-end architectures it may be an LLM conditioned on audio tokens/embeddings, or a seq2seq model trained directly on audio pairs.
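A sketch of one way an end-to-end system can condition a text LLM on audio: the encoder's continuous features are mapped into the LLM's embedding space by a small projector and prepended to the text embeddings. Dimensions and module names below are illustrative assumptions, not taken from any specific model:

    import torch
    import torch.nn as nn

    class AudioProjector(nn.Module):
        """Maps encoder features (e.g., 1024-dim) into the LLM embedding space (e.g., 4096-dim)."""
        def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(audio_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
            # audio_features: (batch, n_frames, audio_dim) -> (batch, n_frames, llm_dim)
            return self.proj(audio_features)

    projector = AudioProjector()
    audio_features = torch.randn(1, 150, 1024)   # roughly 3 s of encoder output (illustrative)
    audio_embeds = projector(audio_features)     # ready to prepend to the text token embeddings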
Audio decoder: the component that generates the output audio signal from response representations. In cascade architectures this is a TTS module operating on text. In end-to-end architectures it decodes latent representations into spectrograms or audio tokens, which a vocoder model then converts to the output waveform.
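As a simple stand-in for the spectrogram-to-waveform step, librosa's Griffin-Lim based inversion can turn a mel spectrogram back into audio; production systems use neural vocoders (e.g., HiFi-GAN) instead. The parameters mirror the encoder sketch above:

    import librosa
    import soundfile as sf

    # compute a mel power spectrogram (same parameters as the encoder sketch above)
    waveform, _ = librosa.load("reply_source.wav", sr=16000, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=16000, n_mels=80, n_fft=400, hop_length=160)

    # invert the mel power spectrogram back to a waveform via Griffin-Lim
    reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=16000, n_fft=400, hop_length=160)
    sf.write("reply.wav", reconstructed, 16000)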
Voice activity detection (VAD): the component that detects the start and end of user speech in the audio stream, critical for natural turn-taking in conversation. Modern VAD models (e.g., Silero VAD) process 30 ms audio frames in under 1 ms on CPU.
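A sketch of offline speech-segment detection with Silero VAD loaded through torch.hub; the entry-point and helper names follow the project's published examples, so verify them against the current Silero VAD README:

    import torch

    # load Silero VAD and its helper utilities from torch.hub
    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils

    # read 16 kHz audio and get speech segments as sample offsets
    wav = read_audio("input.wav", sampling_rate=16000)
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
    print(speech_timestamps)  # e.g., [{'start': 1200, 'end': 30800}, ...]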
Implementation
In an STT→LLM→TTS cascade, the speech-to-text step irreversibly discards prosody, emotion, speaking rate, disfluencies, and voice characteristics. The TTS stage must reconstruct expression from scratch, losing naturalness and emotional context.
ASR errors propagate and can be amplified by the LLM (incorrect intent understanding) and the TTS stage (generating an incorrect or inadequate response). Error accumulation is particularly acute for key terms, proper names, and domain-specific jargon.
Direct (end-to-end) models require audio-to-audio pairs (input→output), which are significantly harder to collect than text or audio-transcript pairs. This is especially problematic for low-resource languages and often results in poorer generalization of direct models on them.
Real-time S2S systems are sensitive to network jitter and connection quality. Standard telephone codecs (e.g., 8 kHz G.711 in PSTN/Twilio) degrade audio below modern model requirements (models are typically trained on 16 kHz audio). GPT-4o Realtime and Gemini Live achieve their best results with 16 kHz wideband audio but lose their advantage over cascade systems on 8 kHz telephony.
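When ingesting 8 kHz telephony audio, upsampling to the model's expected rate is straightforward, although it cannot restore frequency content already removed by the codec. For example with librosa:

    import librosa

    # load narrowband telephony audio at its native 8 kHz rate
    narrowband, _ = librosa.load("call_leg.wav", sr=8000)
    # upsample to 16 kHz; the missing 4-8 kHz band is not recovered
    wideband = librosa.resample(narrowband, orig_sr=8000, target_sr=16000)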
Evolution
Jia et al. published Translatotron (2019, arXiv:1904.06037), the first seq2seq model for direct speech-to-speech translation without intermediate text. The model took source-language mel spectrograms and generated target-language mel spectrograms, and demonstrated voice-characteristic preservation via a speaker encoder. Translation quality was lower than that of cascade systems, but feasibility was demonstrated.
Google published Translatotron 2 (2021), achieving quality comparable to cascade systems on standard benchmarks while eliminating the voice-cloning vulnerability present in Translatotron 1.
Kyutai Labs published Moshi (2024), a speech-text foundation model for real-time dialogue. The model supports full-duplex operation: it can listen and speak simultaneously. Kyutai released the model weights and technical documentation. Moshi achieves a latency of ~160 ms theoretical and ~200 ms in practice.
OpenAI released GPT-4o with native speech-to-speech capabilities (May 2024 demo, October 2024 API). LLaMA-Omni (2024) demonstrated an open-source approach to end-to-end S2S based on LLaMA. End-to-end S2S architecture entered production commercial deployment at scale.
Technical details
Hyperparameters (configurable axes)
Fundamental choice between cascade architecture (STT→LLM→TTS) and direct end-to-end architecture. Determines latency, prosody preservation, debuggability, and training data requirements.
Whether the system supports half-duplex (turn-based, one side speaks at a time) or full-duplex (both sides can speak simultaneously, with barge-in capability). Full-duplex requires advanced VAD and barge-in mechanisms.
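A minimal sketch of barge-in handling in a turn-based loop with interruption, using hypothetical playback and vad_detects_speech objects rather than any specific library API:

    def playback_loop(playback, vad_detects_speech):
        """Play the assistant's audio, but stop immediately if the user starts speaking."""
        while playback.has_audio_remaining():
            if vad_detects_speech():
                playback.stop()          # barge-in: the user interrupted the assistant
                return "interrupted"
            playback.play_next_chunk()   # otherwise keep streaming output audio
        return "completed"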
Audio input/output format used by the model: raw waveform, mel spectrogram, discrete audio tokens (from audio codec e.g., EnCodec, SoundStream), or continuous embeddings.
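For the discrete-token representation, a sketch based on the encodec package's published usage example (the 24 kHz model quantized to 6 kbit/s; check the library's README for the current API):

    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    # pretrained 24 kHz EnCodec model, quantized at a 6 kbit/s target bandwidth
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)

    wav, sr = torchaudio.load("input.wav")
    wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

    with torch.no_grad():
        encoded_frames = model.encode(wav)
    # concatenate codebook indices: shape (batch, n_codebooks, n_frames)
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)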
Compute bottleneck
In cascade architectures the bottleneck is the sum of stage latencies: ASR + LLM + TTS, typically 2–4 seconds end-to-end. In end-to-end architectures the bottleneck is autoregressive audio token generation by the LLM (similar to text generation but with higher token volume per second of speech).
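An illustrative latency budget for the cascade case; the per-stage numbers below are assumptions for illustration, not benchmark results:

    # illustrative per-stage latencies in milliseconds (assumed values, not benchmarks)
    asr_ms = 500   # speech-to-text after the user stops speaking
    llm_ms = 1500  # full text response generation
    tts_ms = 700   # synthesis of the response audio

    # without streaming, the user waits for the sum of all stages
    print(asr_ms + llm_ms + tts_ms)  # 2700 ms, within the 2-4 s range noted above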
Execution paradigm
Both cascaded and end-to-end S2S architectures use dense processing in each of their components. Execution is stage-dependent: different components (encoder, LLM, decoder) are activated sequentially while processing a single query. In full-duplex systems (e.g., Moshi), input and output are processed simultaneously by a model capable of concurrently listening and speaking.
Parallelism
Multiple parallel requests from different users can be handled concurrently by separate model instances across devices. Audio streaming and stage overlapping can reduce perceived latency.
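A sketch of stage overlapping: each finished sentence is handed to TTS and playback while the LLM continues generating the rest of the reply. The stream_llm_tokens, synthesize_sentence, and play callables are hypothetical stand-ins:

    import asyncio

    async def respond(transcript, stream_llm_tokens, synthesize_sentence, play):
        """Overlap LLM generation with TTS and playback to cut perceived latency."""
        speak_queue: asyncio.Queue = asyncio.Queue()

        async def speaker():
            while True:
                sentence = await speak_queue.get()
                if sentence is None:              # sentinel: generation finished
                    return
                await play(await synthesize_sentence(sentence))

        speaker_task = asyncio.create_task(speaker())
        buffer = ""
        async for token in stream_llm_tokens(transcript):
            buffer += token
            if buffer.rstrip().endswith((".", "!", "?")):
                await speak_queue.put(buffer)     # TTS and playback proceed concurrently
                buffer = ""
        if buffer.strip():
            await speak_queue.put(buffer)
        await speak_queue.put(None)
        await speaker_task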
Hardware requirements
Both the components of cascaded models (Whisper, LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, GPT-4o backend) are Transformer architectures requiring GPUs with Tensor Cores for efficient inference. Real-time speech processing with latency <500 ms at production scale requires GPUs.
Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPU is insufficient to meet real-time latency requirements.