TML-Interaction-Small

Small (research preview)

Natively interactive full-duplex 276B MoE model (12B active) from Thinking Machines Lab; processes audio, video and text in 200 ms micro-turns.

⏳ Preview, ⏳ Limited access, Multimodal, Audio, Specialized AI

Parameters

276B (12B active, MoE)


Release date

11 May 2026

Access: API

Deployment: ☁ Cloud

Overview

TML-Interaction-Small is an interaction model unveiled on 11 May 2026 by Thinking Machines Lab as a research preview. It uses a Mixture-of-Experts architecture with 276B total parameters and 12B active parameters. The model processes continuous audio, video and text streams in 200 ms micro-turns, generating text and audio concurrently without external voice-activity-detection components.
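To make the 200 ms micro-turn idea concrete, here is a minimal sketch of slicing a continuous audio buffer into fixed 200 ms windows. It only illustrates the windowing arithmetic and is not the model's actual input pipeline: the function name `micro_turns` and the 16 kHz sample rate are assumptions chosen for the example, since the source does not state a sample rate or any API.

```python
from typing import Iterator, Sequence

MICRO_TURN_MS = 200  # micro-turn length stated in the overview


def micro_turns(
    samples: Sequence[float],
    sample_rate_hz: int,
    turn_ms: int = MICRO_TURN_MS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length windows of `samples`.

    Hypothetical helper: the last window may be shorter than the others
    when the buffer length is not a multiple of the window size.
    """
    samples_per_turn = sample_rate_hz * turn_ms // 1000
    if samples_per_turn <= 0:
        raise ValueError("window must contain at least one sample")
    for start in range(0, len(samples), samples_per_turn):
        yield samples[start:start + samples_per_turn]


# One second of silence at an assumed 16 kHz sample rate.
one_second = [0.0] * 16_000
chunks = list(micro_turns(one_second, sample_rate_hz=16_000))
print(len(chunks), len(chunks[0]))
```

At the assumed 16 kHz, each 200 ms window holds 3,200 samples, so one second of audio yields five micro-turns.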

The architecture uses encoder-free early fusion: audio is represented as dMel, images are split into 40×40 patches encoded by an hMLP, and the audio decoder uses a flow head. All components are co-trained from scratch with the transformer. On FD-bench V1 the model reaches a turn-taking latency of 0.40 s, and it scores 43.4% on Audio MultiChallenge APR. The system is paired with an asynchronous background model that handles longer reasoning and tool use.
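The patching step can be illustrated with a short sketch that cuts an image into non-overlapping 40×40 tiles. This covers only the tiling; the hMLP that encodes each patch is not reproduced here. The function names, the row-major ordering, and the requirement that image dimensions be exact multiples of 40 are assumptions made for the example, not details given in the source.

```python
from typing import List, Tuple

PATCH = 40  # patch side length stated in the overview

Image = List[List[float]]  # rows of pixel values (single channel for brevity)


def patch_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid; assumes exact divisibility."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    return height // patch, width // patch


def split_into_patches(image: Image, patch: int = PATCH) -> List[Image]:
    """Split `image` into non-overlapping patch x patch tiles, row-major."""
    height = len(image)
    width = len(image[0]) if height else 0
    rows, cols = patch_grid(height, width, patch)
    tiles = []
    for r in range(rows):
        band = image[r * patch:(r + 1) * patch]
        for c in range(cols):
            tiles.append([row[c * patch:(c + 1) * patch] for row in band])
    return tiles


# An 80 x 120 dummy image gives a 2 x 3 grid of patches.
dummy = [[0.0] * 120 for _ in range(80)]
tiles = split_into_patches(dummy)
print(len(tiles), len(tiles[0]), len(tiles[0][0]))
```

In an early-fusion design of the kind described, each tile would then be mapped to one token embedding and fed to the transformer alongside the audio and text tokens.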

Classification

Multimodal, Audio, Specialized AI

Applications

Chatbots, Meeting / Note assistance, Translation, Knowledge work, Q&A / Question answering, Tutoring / Education, Search assistance

Access & deployment

API

Cloud

Weights: Closed

Key parameters

🧩 Parameters: 276B (12B active, MoE)

✓ Tools

📥 Input: text, audio, video

Technical specification

Parameters

276B (12B active, MoE)


Features: ✓ Tool use

Modalities

⬇ Input

text, audio, video

⬆ Output

text, audio

Capabilities and applications

Native model capabilities

Voice Conversation

Ability to conduct multi-turn real-time voice conversations with context retention and natural speech pacing.

Category: speech

Speech to text

Category: speech

Text to speech

Category: speech

Streaming Speech-to-Text

Real-time conversion of speech to text with immediate output as the speaker is talking.

Category: speech

Live Translation

Real-time speech translation between multiple languages without interrupting the audio stream.

Category: speech

Audio understanding

Category: audio

Video Understanding

Category: video

Multimodal understanding

Category: multimodal

Streaming output

Category: reasoning

Function Calling

Category: planning

Multilingual

Understanding and generating text in many languages.

Category: language

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Application domains

Chatbots, Meeting / Note assistance, Translation, Knowledge work, Q&A / Question answering, Tutoring / Education, Search assistance

Benchmark results

13 benchmarks

FD-bench V1 (turn-taking latency)

latency

0.40 s