Thinking Machines debuts: TML-Interaction-Small merges intelligence with real-time responsiveness

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati with co-founder Lilian Weng, released its first foundation model on May 11, 2026: TML-Interaction-Small. The 276-billion-parameter MoE model with 12 billion active parameters is designed so that real-time interactivity is built into the architecture itself rather than bolted on as an external layer. The company classifies it as a research preview of a new class of models intended to close the gap between autonomous agents and natural human-AI collaboration.

Key takeaways

  • TML-Interaction-Small: 276B MoE with 12B active parameters, trained from scratch
  • FD-bench v1.5: 77.8 points vs. 54.3 (Gemini-3.1-flash-live) and 46.8 (GPT-Realtime-2.0)
  • Response latency: 0.40 s vs. 1.18 s (GPT-Realtime-2.0) and 0.57 s (Gemini)
  • 200 ms micro-turns: model processes audio, video, and text simultaneously without VAD
  • Seed round valued at $120 million, investors include Accel, AMD, ServiceNow

The problem with turn-based models

Since the first voice assistants, the standard has been an alternating communication architecture: the user speaks and the model waits; the model responds and the user waits. Most commercial real-time systems layer Voice Activity Detection (VAD) on top of a base LLM to detect turn boundaries and simulate interactivity, and OpenAI and other leading companies ship real-time models built on this approach. Thinking Machines states plainly in its technical blog that such a harness-based architecture is inherently less intelligent than the model itself, because turn-boundary decisions are made by a component with far lower capabilities.
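
For contrast, here is a minimal sketch of the harness pattern the post critiques. All names here (mic, vad, llm, tts) are illustrative assumptions, not any vendor's actual API: an external VAD decides when a turn ends, and the LLM only runs between turns.

```python
def harness_loop(mic, vad, llm, tts):
    """Turn-based harness: an external VAD decides turn boundaries."""
    buffer = []
    for chunk in mic:                    # raw audio stream
        buffer.append(chunk)
        if vad.is_end_of_turn(chunk):    # a small model makes the call;
            reply = llm.respond(buffer)  # the LLM never sees mid-turn audio
            tts.speak(reply)             # the user must wait for the reply
            buffer.clear()
```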

The practical consequences are significant: a model cannot interrupt the user based on a "visual cue" (e.g., spotting a bug before the user finishes a sentence), cannot run live translation simultaneously, and cannot respond to non-verbal signals. As the company puts it: "People listen, speak, watch, and think at the same time. In real time. We designed AI that collaborates with people the same way."

Architecture: three design pillars

200 ms micro-turns

Instead of a flat token sequence, the model operates on a continuous stream of micro-turns. Every 200 ms is a discrete unit: the model receives input (audio, video, text) while simultaneously generating output. Turn boundaries are not imposed by an external component — the model itself decides whether to produce a speech token, a silence token (backchanneling), or a filler utterance. As the authors describe it: the model treats "time and overlapping speech as part of natural context," rather than as an exception requiring special handling.
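
A minimal Python sketch makes the contrast with the harness approach concrete. Everything here (MicroTurn, OutToken, model.step) is an illustrative assumption, not Thinking Machines' actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class OutToken(Enum):
    SPEECH = auto()    # audible response content
    SILENCE = auto()   # keep listening (backchanneling)
    FILLER = auto()    # "mm-hm", "right" while formulating a reply

@dataclass
class MicroTurn:
    audio: bytes          # 200 ms of audio samples
    frame: bytes | None   # optional video frame for this slice
    text: str | None      # optional text input

def run_session(model, input_stream, speaker):
    """One iteration per 200 ms: the model consumes a MicroTurn and
    decides itself what, if anything, to emit. No external VAD."""
    for turn in input_stream:        # one MicroTurn every 200 ms
        token = model.step(turn)     # hypothetical inference call
        if token is not OutToken.SILENCE:
            speaker.emit(token)      # speech or filler, possibly
                                     # overlapping the user's input
```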

Encoder-free early fusion

The dominant approach in omni-models is to pre-train separate audio encoders (e.g., Whisper) and image encoders, then attach them to an LLM. Thinking Machines rejected this path. Audio enters the model as a dMel representation via a lightweight embedding layer, and video frames are split into 40×40 patches encoded by a minimal MLP (hMLP). All components of the Mixture of Experts architecture, including the main transformer, are co-trained from scratch. The model thus learns to coordinate all three modalities from the very first training step, with no interface between separately optimized modules.
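
A hedged PyTorch sketch of what such early fusion could look like; the dimensions, the dMel bin count, and all module names are assumptions for illustration, not the published architecture:

```python
import torch
import torch.nn as nn

D_MODEL = 1024
N_MEL = 80     # assumed dMel bins per audio frame
PATCH = 40     # 40x40 RGB patches -> 40 * 40 * 3 input values

audio_embed = nn.Linear(N_MEL, D_MODEL)       # lightweight embedding layer
patch_embed = nn.Sequential(                  # minimal MLP in hMLP style
    nn.Linear(PATCH * PATCH * 3, D_MODEL),
    nn.GELU(),
    nn.Linear(D_MODEL, D_MODEL),
)
text_embed = nn.Embedding(50_000, D_MODEL)

def fuse(audio, patches, text_ids):
    """Project every modality straight into the shared token space.
    The main transformer (not shown) consumes this sequence and is
    co-trained with the embeddings from the first step."""
    return torch.cat([
        audio_embed(audio),      # (T_audio, D_MODEL)
        patch_embed(patches),    # (T_video, D_MODEL)
        text_embed(text_ids),    # (T_text,  D_MODEL)
    ], dim=0)
```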

Dual architecture: interaction front-end + background model

The interaction model handles the live conversation in real time. When a task requires deeper reasoning, it delegates to an asynchronous background model that handles search, web browsing, or tool calls. Results stream back, and the interaction model weaves them into the conversation at a natural moment — without an abrupt context switch. The creators describe this as "the response latency of a non-thinking model with the intelligence of a thinking one."
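
In asyncio terms, the pattern might look like the sketch below: the background agent runs off the hot path, and the live loop checks for finished results on every micro-turn. The router, agent, and output calls are hypothetical stand-ins, not the actual system:

```python
import asyncio

def needs_deep_reasoning(turn: str) -> bool:
    # Hypothetical router: send open questions to the slow path.
    return turn.endswith("?")

async def background_agent(task: str) -> str:
    # Stands in for search, browsing, or tool calls off the hot path.
    await asyncio.sleep(2.0)
    return f"[background result for {task!r}]"

async def interaction_loop(turns):
    pending: set[asyncio.Task] = set()
    async for turn in turns:             # one micro-turn every 200 ms
        if needs_deep_reasoning(turn):
            pending.add(asyncio.create_task(background_agent(turn)))
        for task in [t for t in pending if t.done()]:
            pending.discard(task)        # weave the result back in
            print("model:", task.result())
        print("model: instant reply to", repr(turn))  # low-latency front-end
```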

Benchmarks: leading on interactivity, not just intelligence

On FD-bench v1.5 (measuring interaction quality: interruptions, backchanneling, background speech) TML-Interaction-Small scored 77.8. By comparison, Gemini-3.1-flash-live (minimal) scored 54.3, and GPT-Realtime-2.0 (minimal) — 46.8. Turn latency (FD-bench v1) was 0.40 s vs. 1.18 s for GPT-Realtime-2.0. On the Audio MultiChallenge intelligence benchmark, TML-Interaction-Small achieved 43.4% — higher than all instant models (without thinking mode). With the background agent enabled (FD-bench v3, tools): 68.0% Pass@1 vs. 52.0% for GPT-Realtime-2.0 (minimal).

Thinking Machines also published internal benchmarks measuring time-awareness and visual proactivity. TimeSpeak (model initiates speech at user-specified times): 64.7 vs. 4.3 for GPT. CueSpeak (response to verbal cues): 81.7 vs. 2.9. RepCount-A (video repetition counting): 35.4 vs. 1.3. Reference model baselines for Audio MultiChallenge are reported by Scale AI. The company states that none of the evaluated models — including high-reasoning thinking models — can meaningfully perform these tasks.

Latency engineering and trainer-sampler alignment

Achieving 200 ms latency required several non-trivial engineering decisions. The team implemented a streaming-sessions mechanism in which the client sends each 200 ms chunk as a separate request while the inference server appends it to a persistent sequence in GPU memory, eliminating costly memory reallocations. Thinking Machines upstreamed an open-source version of this feature to the SGLang project. For MoE kernels, they adopted a gather+gemv strategy instead of the standard grouped gemm, which better fits the shapes that arise in bidirectional serving.
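
A pure-Python stand-in for the append-only session idea (the real mechanism manages a persistent token sequence in GPU memory inside SGLang; class and method names here are assumptions):

```python
from collections import defaultdict

class StreamingSessions:
    """Each 200 ms chunk arrives as its own request; the server appends
    it to a persistent per-session sequence instead of rebuilding the
    whole context on every call."""

    def __init__(self):
        self._seqs: dict[str, list[bytes]] = defaultdict(list)

    def append_chunk(self, session_id: str, chunk: bytes) -> int:
        seq = self._seqs[session_id]
        seq.append(chunk)   # earlier context is never copied or reallocated
        return len(seq)     # stand-in for a KV-cache offset
```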

A separate challenge was achieving bitwise trainer-sampler alignment: any inconsistency between these components causes "drift" during long-horizon RL training. The team rewrote critical kernels (including Attention Split-KV, all-reduce, and reduce-scatter with NVLS), bringing end-to-end overhead below 5%. One unexpected finding: the batch-invariant kernels were at times actually faster than the standard ones, because the custom communication kernels reduced communication latency.
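
A toy illustration of what bitwise alignment means in practice, using an assumed helper rather than the team's actual tooling: run the trainer and sampler on the same prefix and compare their raw logits for exact equality, not within a tolerance.

```python
import torch

def assert_bitwise_aligned(trainer_logits: torch.Tensor,
                           sampler_logits: torch.Tensor) -> None:
    # torch.equal tests exact elementwise equality; a tolerance-based
    # allclose would hide the tiny numeric discrepancies that accumulate
    # into drift over long-horizon RL training.
    if not torch.equal(trainer_logits, sampler_logits):
        frac = (trainer_logits != sampler_logits).float().mean().item()
        raise AssertionError(
            f"trainer/sampler logits differ on {frac:.2%} of entries")
```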

Why it matters

TML-Interaction-Small is the first public demonstration of the thesis that interactivity should scale alongside intelligence rather than being an afterthought layer. Existing real-time systems (GPT-Realtime, Gemini Live) achieved fluency at the cost of harness-based compromises: a VAD made decisions the model had no control over. Thinking Machines demonstrates that training from scratch on micro-turns enables qualitatively different capabilities: proactive interruptions on visual cues, simultaneous translation, and reaction to non-verbal signals.

From a market perspective, the company enters territory occupied by the OpenAI Realtime API and Google Gemini Live with its own architectural approach rather than an adaptation of an existing LLM. This is a meaningful shift: if the thesis that interactivity scales with intelligence proves correct, current harness-based systems will lose ground with each new model generation. That said, TML-Interaction-Small is still a research preview with limitations, particularly around long sessions and the compute requirements of audio-video streaming.

What's next?

  • A larger model (beyond 276B parameters) is announced for 2026; the current TML-Interaction-Small remains too slow to serve at larger scale
  • Limited research preview for external users announced "in the coming months" per the May 11, 2026 blog post
  • Research grant for the community to develop new interactivity benchmarks — details to follow soon
