Native Multimodal
Components
Module responsible for converting data from all modalities into a shared token space. Images are typically quantized with a vector quantizer (VQ-VAE) that produces discrete visual tokens; text is tokenized with a standard subword tokenizer; audio is converted to spectrograms or discrete acoustic tokens.
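A minimal sketch of what this shared token space can look like, assuming a toy VQ-style quantizer and illustrative vocabulary sizes (the class name, constants, and helper below are hypothetical, not any model's actual configuration):

```python
# Toy illustration of a shared token space for text and image tokens.
# TEXT_VOCAB, IMAGE_CODEBOOK and ToyImageQuantizer are assumptions for this sketch.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000       # assumed text tokenizer vocabulary size
IMAGE_CODEBOOK = 8_192    # assumed VQ-VAE codebook size

class ToyImageQuantizer(nn.Module):
    """Stand-in for a VQ-VAE encoder: maps image patch features to nearest codebook ids."""
    def __init__(self, patch_dim=768, codebook_size=IMAGE_CODEBOOK):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, patch_dim))

    def forward(self, patches):                        # patches: (num_patches, patch_dim)
        dists = torch.cdist(patches, self.codebook)    # distance to every codebook entry
        return dists.argmin(dim=-1)                    # discrete visual token ids

def to_shared_ids(text_ids, image_ids):
    """Offset image ids above the text range so both live in one shared vocabulary."""
    return torch.cat([text_ids, image_ids + TEXT_VOCAB])

visual_tokens = ToyImageQuantizer()(torch.randn(16, 768))   # 16 fake image patches
text_tokens = torch.randint(0, TEXT_VOCAB, (10,))
print(to_shared_ids(text_tokens, visual_tokens).shape)      # torch.Size([26])
```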
A single stack of transformer layers processing interleaved token sequences from all modalities. The self-attention mechanism operates on the combined sequence, allowing tokens from different modalities to attend to each other.
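Under the same assumptions, a sketch of the shared backbone using PyTorch's stock transformer layers with a causal mask, so every token can attend to earlier tokens regardless of modality (layer sizes and depth are illustrative):

```python
# One shared transformer stack over an interleaved text+image id sequence.
# Dimensions and depth are illustrative, not those of any released model.
import torch
import torch.nn as nn

SHARED_VOCAB = 32_000 + 8_192     # text vocabulary + offset image codebook (see above)
D_MODEL = 512

embed = nn.Embedding(SHARED_VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)

ids = torch.randint(0, SHARED_VOCAB, (1, 26))                 # interleaved mixed-modal ids
causal = nn.Transformer.generate_square_subsequent_mask(26)   # autoregressive mask
hidden = backbone(embed(ids), mask=causal)                    # cross-modal self-attention
print(hidden.shape)                                           # torch.Size([1, 26, 512])
```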
Training objective applied simultaneously to data from all modalities. Typically autoregressive next-token prediction on interleaved multimodal sequences, without separate per-modality pretraining phases.
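Concretely, the objective can be a single next-token cross-entropy over the interleaved sequence, with no per-modality terms; a minimal sketch, assuming logits over the shared vocabulary from the backbone above:

```python
# One autoregressive loss for all modalities: shift by one position and apply
# cross-entropy over the shared vocabulary, whatever modality each token belongs to.
import torch
import torch.nn.functional as F

def joint_autoregressive_loss(logits, token_ids):
    """logits: (batch, seq, shared_vocab); token_ids: (batch, seq) mixed-modal ids."""
    pred = logits[:, :-1, :]            # positions <= t predict token t+1
    target = token_ids[:, 1:]           # same shift for text and image tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

logits = torch.randn(2, 26, 40_192)     # e.g. produced by a head over the hidden states
ids = torch.randint(0, 40_192, (2, 26))
print(joint_autoregressive_loss(logits, ids))
```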
Separate output heads mapping the transformer's internal representation to the output space of each modality. May include a language head (softmax over text vocabulary) and a visual head (softmax over image token vocabulary or image decoder).
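A sketch of such heads over a shared hidden state; the two-head layout and vocabulary sizes are assumptions here, and designs with a single unified vocabulary can use one head instead:

```python
# Separate output heads mapping the shared hidden state to each modality's output space.
# Sizes are placeholders; a real visual head may feed a VQ-VAE decoder to produce pixels.
import torch
import torch.nn as nn

class ModalityHeads(nn.Module):
    def __init__(self, d_model=512, text_vocab=32_000, image_vocab=8_192):
        super().__init__()
        self.language_head = nn.Linear(d_model, text_vocab)   # logits over text tokens
        self.visual_head = nn.Linear(d_model, image_vocab)    # logits over image tokens

    def forward(self, hidden):
        # Both heads are computed here; the decoding loop picks the relevant one per position.
        return self.language_head(hidden), self.visual_head(hidden)

text_logits, image_logits = ModalityHeads()(torch.randn(1, 26, 512))
print(text_logits.shape, image_logits.shape)   # (1, 26, 32000) (1, 26, 8192)
```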
Implementation
Early-fusion native multimodal models trained from scratch on mixed-modal data are prone to training instability, including loss spikes and unstable gradient norms, caused by the heterogeneity of token distributions across modalities.
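One stabilization technique reported for early-fusion training (for example in Chameleon) is query-key normalization, which normalizes queries and keys before the attention dot product; the sketch below is an illustrative formulation, not the exact recipe of any specific model:

```python
# Illustrative QK-norm attention block: LayerNorm on per-head queries and keys bounds
# the attention logits, which is reported to reduce loss spikes in mixed-modal training.
# Hyperparameters here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.q_norm = nn.LayerNorm(self.d_head)   # normalize queries per head
        self.k_norm = nn.LayerNorm(self.d_head)   # normalize keys per head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.n_heads, self.d_head)
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, -1))

print(QKNormAttention()(torch.randn(1, 26, 512)).shape)   # torch.Size([1, 26, 512])
```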
Training a native multimodal model from scratch on all modalities simultaneously requires substantially more compute than fine-tuning a pre-trained LLM with a grafted vision encoder, making the approach inaccessible without significant infrastructure.
Imbalance in the quantity and quality of training data across modalities can cause the model to underperform on underrepresented modalities while excelling on the dominant one (typically text).
Enabling the model to generate outputs in multiple modalities (e.g., interleaved text and images) requires additional architectural support (separate output heads, image decoders) and alignment training that significantly increases engineering complexity.
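As a rough illustration of that extra machinery, the sketch below routes each decoding step to a text or image head based on a hypothetical begin-image sentinel token; in a full system the emitted image tokens would additionally pass through an image decoder (e.g., a VQ-VAE decoder) to produce pixels:

```python
# Toy routing between a language head and a visual head during interleaved decoding.
# BOI is a hypothetical "begin image" sentinel id; heads and sizes are placeholders.
import torch
import torch.nn as nn

D_MODEL, TEXT_VOCAB, IMAGE_VOCAB = 512, 32_000, 8_192
BOI = TEXT_VOCAB - 1                        # assumed sentinel marking the start of an image span

language_head = nn.Linear(D_MODEL, TEXT_VOCAB)
visual_head = nn.Linear(D_MODEL, IMAGE_VOCAB)

def route_step(prev_token, in_image_span, hidden):
    """Pick the head for the current position based on whether we are inside an image span."""
    if prev_token == BOI:
        in_image_span = True                # a real loop would also handle an end-of-image token
    head = visual_head if in_image_span else language_head
    return head(hidden).argmax(dim=-1), in_image_span   # greedy choice, for illustration only

next_id, in_span = route_step(prev_token=BOI, in_image_span=False, hidden=torch.randn(D_MODEL))
print(next_id.item(), in_span)              # an image-token id, True
```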
Evolution
BEiT (Bao et al., 2021) introduced self-supervised vision representation learning using discrete image patch tokens, establishing the conceptual foundation for treating image patches as tokens analogous to text tokens.
Aghajanyan et al. (2022) extended token-based modeling to mixed-modal documents with interleaved image and text tokens, enabling joint reasoning over both modalities in a unified architecture.
Google Gemini (2023) introduced a large-scale native multimodal model trained from the ground up on text, image, audio, and video, using a unified token stream and shared transformer backbone, establishing native multimodality as a viable paradigm at frontier scale.
Meta's Chameleon (2024) formalized the early-fusion token-based native multimodal paradigm in an open model, demonstrating stable training from scratch on ~10 trillion interleaved tokens using a unified discrete vocabulary for text and images.
OpenAI's GPT-4o (2024) adopted end-to-end training across text, audio, and visual modalities without separate cascaded models for speech recognition and synthesis, reducing latency and improving cross-modal reasoning.
Shukor et al. (2025, Apple/Sorbonne) established scaling laws for native multimodal models, showing that early-fusion architectures trained from scratch match or outperform late-fusion designs at equivalent compute, and that MoE integration enables implicit modality-specific specialization.
Technical details
Hyperparameters (configurable axes)
Whether modalities are fused at the input level (early fusion) or after separate encoding (late fusion). Determines when cross-modal attention can first occur.
Whether non-text modalities are represented as discrete tokens (via VQ-VAE) or as continuous embeddings projected into the shared space.
Which modalities are included in joint pretraining: text and images only, or also audio, video, and sensor data.
Whether MoE layers are incorporated to enable implicit modality-specific expert specialization, improving parameter efficiency (see the configuration sketch after this list).
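These axes can be collected into a single configuration object; the field names and defaults below are assumptions for illustration, not any framework's actual schema:

```python
# Hypothetical configuration covering the axes listed above.
from dataclasses import dataclass

@dataclass
class NativeMultimodalConfig:
    fusion: str = "early"                       # "early" (input-level) or "late" (after separate encoders)
    non_text_representation: str = "discrete"   # "discrete" (VQ tokens) or "continuous" (projected embeddings)
    modalities: tuple = ("text", "image")       # optionally extend with "audio", "video", "sensor"
    use_moe: bool = False                       # insert MoE layers for implicit modality specialization
    num_experts: int = 8                        # only relevant when use_moe is True

print(NativeMultimodalConfig(modalities=("text", "image", "audio"), use_moe=True))
```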
Execution paradigm
The base execution pattern is dense: all transformer parameters are activated for every token, regardless of modality. MoE variants introduce conditional expert activation, but the native multimodal paradigm does not inherently require routing.
In the baseline native multimodal architecture (e.g., Chameleon), there is no explicit routing mechanism: a single dense transformer processes all token types uniformly. When MoE layers are incorporated (e.g., Gemini-style or Aria), modality-specific expert specialization emerges implicitly.
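A minimal sketch of such an MoE feed-forward layer with top-1 routing; since the router sees only each token's hidden state and no explicit modality flag, any modality specialization of the experts is emergent rather than hard-coded (sizes and routing scheme are illustrative):

```python
# Toy top-1 mixture-of-experts feed-forward layer; the router receives no modality
# label, so experts can only specialize by modality implicitly through the hidden states.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=4, d_ff=1024):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)    # routing probabilities per token
        choice = gate.argmax(dim=-1)             # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

print(TopOneMoE()(torch.randn(26, 512)).shape)   # torch.Size([26, 512])
```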
Parallelism
Training on interleaved multimodal data can be parallelized across devices (data parallelism, tensor parallelism), but the sequential nature of autoregressive decoding limits parallelism during inference. Training from scratch on multimodal data requires a large number of GPUs/TPUs.
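A sketch of one way to shard such a training run across GPUs with PyTorch FSDP; the backbone here is a stand-in for the multimodal model, and the launch command is only indicative:

```python
# Wrapping a stand-in backbone in FullyShardedDataParallel so parameters, gradients,
# and optimizer state are sharded across the data-parallel group. Intended to be
# launched with one process per GPU, e.g.: torchrun --nproc_per_node=8 train.py
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_sharded_model():
    dist.init_process_group("nccl")                # assumes a torchrun-style launch
    backbone = nn.TransformerEncoder(              # placeholder for the multimodal backbone
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4,
    ).cuda()
    return FSDP(backbone)                          # tensor/pipeline parallelism would be layered on top
```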
Hardware requirements
Training and inference of native multimodal models rely on large matrix operations within the transformer (QKV projections, FFN) that are optimized for tensor cores on GPUs such as NVIDIA H100, A100, and GB200. Chameleon was trained on A100 GPU clusters.
Google Gemini, one of the key native multimodal models, was trained on TPU v4/v5. The Transformer architecture is well suited to the matrix-oriented TPU accelerators.