Native Multimodal
Components
Module responsible for converting data from all modalities into a shared token space. Images are typically quantized with a vector quantizer (VQ-VAE) that produces discrete visual tokens; text is tokenized with a standard subword tokenizer; audio is converted to spectrograms or discrete acoustic tokens.
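A minimal sketch of what this shared token space can look like, assuming a toy VQ-style quantizer and illustrative vocabulary sizes (the class name, constants, and helper below are hypothetical, not any model's actual configuration):

```python
# Toy illustration of a shared token space for text and image tokens.
# TEXT_VOCAB, IMAGE_CODEBOOK and ToyImageQuantizer are assumptions for this sketch.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000       # assumed text tokenizer vocabulary size
IMAGE_CODEBOOK = 8_192    # assumed VQ-VAE codebook size

class ToyImageQuantizer(nn.Module):
    """Stand-in for a VQ-VAE encoder: maps image patch features to nearest codebook ids."""
    def __init__(self, patch_dim=768, codebook_size=IMAGE_CODEBOOK):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, patch_dim))

    def forward(self, patches):                        # patches: (num_patches, patch_dim)
        dists = torch.cdist(patches, self.codebook)    # distance to every codebook entry
        return dists.argmin(dim=-1)                    # discrete visual token ids

def to_shared_ids(text_ids, image_ids):
    """Offset image ids above the text range so both live in one shared vocabulary."""
    return torch.cat([text_ids, image_ids + TEXT_VOCAB])

visual_tokens = ToyImageQuantizer()(torch.randn(16, 768))   # 16 fake image patches
text_tokens = torch.randint(0, TEXT_VOCAB, (10,))
print(to_shared_ids(text_tokens, visual_tokens).shape)      # torch.Size([26])
```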
A single stack of transformer layers processing interleaved token sequences from all modalities. The self-attention mechanism operates on the combined sequence, allowing tokens from different modalities to attend to each other.
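Under the same assumptions, a sketch of the shared backbone using PyTorch's stock transformer layers with a causal mask, so every token can attend to earlier tokens regardless of modality (layer sizes and depth are illustrative):

```python
# One shared transformer stack over an interleaved text+image id sequence.
# Dimensions and depth are illustrative, not those of any released model.
import torch
import torch.nn as nn

SHARED_VOCAB = 32_000 + 8_192     # text vocabulary + offset image codebook (see above)
D_MODEL = 512

embed = nn.Embedding(SHARED_VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)

ids = torch.randint(0, SHARED_VOCAB, (1, 26))                 # interleaved mixed-modal ids
causal = nn.Transformer.generate_square_subsequent_mask(26)   # autoregressive mask
hidden = backbone(embed(ids), mask=causal)                    # cross-modal self-attention
print(hidden.shape)                                           # torch.Size([1, 26, 512])
```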
Training objective applied simultaneously to data from all modalities. Typically autoregressive next-token prediction on interleaved multimodal sequences, without separate per-modality pretraining phases.
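Concretely, the objective can be a single next-token cross-entropy over the interleaved sequence, with no per-modality terms; a minimal sketch, assuming logits over the shared vocabulary from the backbone above:

```python
# One autoregressive loss for all modalities: shift by one position and apply
# cross-entropy over the shared vocabulary, whatever modality each token belongs to.
import torch
import torch.nn.functional as F

def joint_autoregressive_loss(logits, token_ids):
    """logits: (batch, seq, shared_vocab); token_ids: (batch, seq) mixed-modal ids."""
    pred = logits[:, :-1, :]            # positions <= t predict token t+1
    target = token_ids[:, 1:]           # same shift for text and image tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

logits = torch.randn(2, 26, 40_192)     # e.g. produced by a head over the hidden states
ids = torch.randint(0, 40_192, (2, 26))
print(joint_autoregressive_loss(logits, ids))
```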
Separate output heads mapping the transformer's internal representation to the output space of each modality. May include a language head (softmax over text vocabulary) and a visual head (softmax over image token vocabulary or image decoder).
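A sketch of such heads over a shared hidden state; the two-head layout and vocabulary sizes are assumptions here, and designs with a single unified vocabulary can use one head instead:

```python
# Separate output heads mapping the shared hidden state to each modality's output space.
# Sizes are placeholders; a real visual head may feed a VQ-VAE decoder to produce pixels.
import torch
import torch.nn as nn

class ModalityHeads(nn.Module):
    def __init__(self, d_model=512, text_vocab=32_000, image_vocab=8_192):
        super().__init__()
        self.language_head = nn.Linear(d_model, text_vocab)   # logits over text tokens
        self.visual_head = nn.Linear(d_model, image_vocab)    # logits over image tokens

    def forward(self, hidden):
        # Both heads are computed here; the decoding loop picks the relevant one per position.
        return self.language_head(hidden), self.visual_head(hidden)

text_logits, image_logits = ModalityHeads()(torch.randn(1, 26, 512))
print(text_logits.shape, image_logits.shape)   # (1, 26, 32000) (1, 26, 8192)
```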
Implementation
Early-fusion native multimodal models trained from scratch on mixed-modal data are prone to training instability, including loss spikes and unstable gradient norms, caused by the heterogeneity of token distributions across modalities.
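One stabilization technique reported for early-fusion training (for example in Chameleon) is query-key normalization, which normalizes queries and keys before the attention dot product; the sketch below is an illustrative formulation, not the exact recipe of any specific model:

```python
# Illustrative QK-norm attention block: LayerNorm on per-head queries and keys bounds
# the attention logits, which is reported to reduce loss spikes in mixed-modal training.
# Hyperparameters here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.q_norm = nn.LayerNorm(self.d_head)   # normalize queries per head
        self.k_norm = nn.LayerNorm(self.d_head)   # normalize keys per head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.n_heads, self.d_head)
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, -1))

print(QKNormAttention()(torch.randn(1, 26, 512)).shape)   # torch.Size([1, 26, 512])
```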
Training a native multimodal model from scratch on all modalities simultaneously requires substantially more compute than fine-tuning a pre-trained LLM with a grafted vision encoder, making the approach inaccessible without significant infrastructure.
Imbalance in the quantity and quality of training data across modalities can cause the model to underperform on underrepresented modalities while excelling on the dominant one (typically text).
Enabling the model to generate outputs in multiple modalities (e.g., interleaved text and images) requires additional architectural support (separate output heads, image decoders) and alignment training that significantly increases engineering complexity.
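As a rough illustration of that extra machinery, the sketch below routes each decoding step to a text or image head based on a hypothetical begin-image sentinel token; in a full system the emitted image tokens would additionally pass through an image decoder (e.g., a VQ-VAE decoder) to produce pixels:

```python
# Toy routing between a language head and a visual head during interleaved decoding.
# BOI is a hypothetical "begin image" sentinel id; heads and sizes are placeholders.
import torch
import torch.nn as nn

D_MODEL, TEXT_VOCAB, IMAGE_VOCAB = 512, 32_000, 8_192
BOI = TEXT_VOCAB - 1                        # assumed sentinel marking the start of an image span

language_head = nn.Linear(D_MODEL, TEXT_VOCAB)
visual_head = nn.Linear(D_MODEL, IMAGE_VOCAB)

def route_step(prev_token, in_image_span, hidden):
    """Pick the head for the current position based on whether we are inside an image span."""
    if prev_token == BOI:
        in_image_span = True                # a real loop would also handle an end-of-image token
    head = visual_head if in_image_span else language_head
    return head(hidden).argmax(dim=-1), in_image_span   # greedy choice, for illustration only

next_id, in_span = route_step(prev_token=BOI, in_image_span=False, hidden=torch.randn(D_MODEL))
print(next_id.item(), in_span)              # an image-token id, True
```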
Evolution
BEiT (Bao et al., 2021) introduced self-supervised vision representation learning using discrete image patch tokens, establishing the conceptual foundation for treating image patches as tokens analogous to text tokens.
Aghajanyan et al. (2022) extended token-based modeling to mixed-modal documents with interleaved image and text tokens, enabling joint reasoning over both modalities in a unified architecture.
Google Gemini (2023) introduced a large-scale native multimodal model trained from the ground up on text, image, audio, and video, using a unified token stream and shared transformer backbone, establishing native multimodality as a viable paradigm at frontier scale.
Meta's Chameleon (2024) formalized the early-fusion token-based native multimodal paradigm in an open model, demonstrating stable training from scratch on ~10 trillion interleaved tokens using a unified discrete vocabulary for text and images.
OpenAI's GPT-4o (2024) adopted end-to-end training across text, audio, and visual modalities without separate cascaded models for speech recognition and synthesis, reducing latency and improving cross-modal reasoning.
Shukor et al. (2025, Apple/Sorbonne) established scaling laws for native multimodal models, showing that early-fusion architectures trained from scratch match or outperform late-fusion designs at equivalent compute, and that MoE integration enables implicit modality-specific specialization.
Technical details
Hyperparameters (configurable axes)
Whether modalities are fused at the input level (early fusion) or after separate encoding (late fusion). Determines when cross-modal attention can first occur.
Whether non-text modalities are represented as discrete tokens (via VQ-VAE) or as continuous embeddings projected into the shared space.
Which modalities are included in joint pretraining: text and images only, or also audio, video, and sensor data.
Whether MoE layers are incorporated to enable implicit modality-specific expert specialization, improving parameter efficiency (see the configuration sketch after this list).
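These axes can be collected into a single configuration object; the field names and defaults below are assumptions for illustration, not any framework's actual schema:

```python
# Hypothetical configuration covering the axes listed above.
from dataclasses import dataclass

@dataclass
class NativeMultimodalConfig:
    fusion: str = "early"                       # "early" (input-level) or "late" (after separate encoders)
    non_text_representation: str = "discrete"   # "discrete" (VQ tokens) or "continuous" (projected embeddings)
    modalities: tuple = ("text", "image")       # optionally extend with "audio", "video", "sensor"
    use_moe: bool = False                       # insert MoE layers for implicit modality specialization
    num_experts: int = 8                        # only relevant when use_moe is True

print(NativeMultimodalConfig(modalities=("text", "image", "audio"), use_moe=True))
```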
Execution paradigm
The base execution pattern is dense: all transformer parameters are activated for every token, regardless of modality. MoE variants introduce conditional expert activation, but the native multimodal paradigm does not inherently require routing.
In the baseline native multimodal architecture (e.g., Chameleon), there is no explicit routing mechanism: a single dense transformer processes all token types uniformly. When MoE layers are incorporated (e.g., Gemini-style or Aria), modality-specific expert specialization emerges implicitly.
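A minimal sketch of such an MoE feed-forward layer with top-1 routing; since the router sees only each token's hidden state and no explicit modality flag, any modality specialization of the experts is emergent rather than hard-coded (sizes and routing scheme are illustrative):

```python
# Toy top-1 mixture-of-experts feed-forward layer; the router receives no modality
# label, so experts can only specialize by modality implicitly through the hidden states.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=4, d_ff=1024):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)    # routing probabilities per token
        choice = gate.argmax(dim=-1)             # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

print(TopOneMoE()(torch.randn(26, 512)).shape)   # torch.Size([26, 512])
```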
Parallelism
Training on interleaved multimodal data can be parallelized across devices (data parallelism, tensor parallelism), but the sequential nature of autoregressive decoding limits parallelism during inference. Training from scratch on multimodal data requires a large number of GPUs/TPUs.
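A sketch of one way to shard such a training run across GPUs with PyTorch FSDP; the backbone here is a stand-in for the multimodal model, and the launch command is only indicative:

```python
# Wrapping a stand-in backbone in FullyShardedDataParallel so parameters, gradients,
# and optimizer state are sharded across the data-parallel group. Intended to be
# launched with one process per GPU, e.g.: torchrun --nproc_per_node=8 train.py
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_sharded_model():
    dist.init_process_group("nccl")                # assumes a torchrun-style launch
    backbone = nn.TransformerEncoder(              # placeholder for the multimodal backbone
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4,
    ).cuda()
    return FSDP(backbone)                          # tensor/pipeline parallelism would be layered on top
```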
Hardware requirements
Training and inference of native multimodal models rely on large matrix operations within the transformer (QKV projections, FFN) that are optimized for tensor cores on GPUs such as NVIDIA H100, A100, and GB200. Chameleon was trained on A100 GPU clusters.
Google Gemini, one of the key native multimodal models, was trained on TPU v4/v5. The Transformer architecture is well suited to the matrix-oriented TPU accelerators.