GPU Tensor Cores (PRIMARY)
Both the components of cascaded pipelines (Whisper, the LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, the GPT-4o backend) are Transformer architectures, which rely on GPUs with Tensor Cores for efficient low-precision matrix multiplication. Real-time speech processing at production scale, with end-to-end latency under 500 ms, requires GPU inference.
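As a minimal PyTorch sketch of the usual pattern (assuming a CUDA-capable machine; the layer and tensor shapes are illustrative): detect Tensor Core support via compute capability, then run inference under fp16 autocast so the matrix multiplications are eligible for Tensor Core execution.

```python
import torch

def has_tensor_cores() -> bool:
    # Tensor Cores first shipped with CUDA compute capability 7.0 (Volta).
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 7

if has_tensor_cores():
    # fp16 autocast routes matmuls through Tensor Cores on supported GPUs.
    layer = torch.nn.TransformerEncoderLayer(
        d_model=512, nhead=8, batch_first=True
    ).cuda().eval()
    x = torch.randn(1, 100, 512, device="cuda")  # (batch, frames, features)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        y = layer(x)
```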
End-to-end S2S models (~7B–70B parameters) require GPUs with large VRAM (24–80 GB): fp16 weights alone occupy about 2 bytes per parameter, so the larger models need quantization or multi-GPU sharding to fit. Cascaded systems can distribute components across smaller GPUs or the CPU, but the LLM component still needs a GPU to keep latency low.
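A back-of-the-envelope helper (the function name is ours, for illustration) shows how those VRAM figures follow from parameter count and precision; note it counts weights only, excluding the KV cache, activations, and framework overhead:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough VRAM for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (7, 70):
    for name, bpp in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
        print(f"{params}B @ {name}: ~{weight_vram_gb(params, bpp):.0f} GB")

# 7B  @ fp16: ~13 GB  -> fits a single 24 GB card
# 70B @ fp16: ~130 GB -> exceeds 80 GB; needs quantization or sharding
```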
CPU AVX (POSSIBLE)
Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPUs are insufficient to meet real-time latency requirements.
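A minimal sketch of CPU-only transcription, using faster-whisper (one common CPU-friendly Whisper runtime, whose CTranslate2 backend exploits AVX2; the audio file name is a placeholder):

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# int8 quantization keeps the tiny model fast on AVX2-capable CPUs.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav")  # placeholder input file
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```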
VAD runs efficiently on the CPU (under ~1 ms per 30 ms audio frame). Small STT models can likewise run on the CPU in resource-constrained environments, at the cost of higher latency.
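A sketch of per-frame VAD timing with Silero VAD loaded via torch.hub (assuming its published hub entry point; recent Silero releases expect 512-sample chunks at 16 kHz, i.e. 32 ms rather than 30 ms, and the zero chunk stands in for streamed audio):

```python
import time
import torch

model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
sample_rate = 16000
chunk = torch.zeros(512)  # one 32 ms frame; replace with real audio samples

start = time.perf_counter()
speech_prob = model(chunk, sample_rate).item()  # probability of speech
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"speech prob {speech_prob:.2f}, {elapsed_ms:.2f} ms per 32 ms frame")
```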