DreamLite: ByteDance's model generates images on-device in 3 seconds

ByteDance has released DreamLite, a lightweight 0.39-billion-parameter diffusion model that is the first known on-device model to combine text-to-image generation and text-guided image editing in a single network. On iPhone 17 Pro, the model generates or edits a 1024×1024 image in approximately 3 seconds, running entirely locally with no cloud connection required. The inference code is open-source on GitHub, and the research paper is available on arXiv.

Key takeaways

  • Model size: 0.39B parameters (pruned SDXL U-Net as the backbone network)
  • Latency on iPhone 17 Pro: ~3 seconds for a 1024×1024 image, fully offline
  • Benchmarks: GenEval 0.72 / DPG 85.8 / ImgEdit 4.11 — above all known on-device models of comparable parameter count
  • DMD2 step distillation: sampling compressed from tens of denoising steps down to 4 inference steps
  • Code and paper open-source: GitHub ByteVisionLab/DreamLite, arXiv 2603.28713

The problem: two models on one device

The mobile diffusion model ecosystem has until now required separate networks for its two core tasks: text-to-image generation from scratch and text-guided editing of existing images relied on independent pipelines. For devices with constrained RAM, simultaneously loading two sets of model weights, each with its own weight tensors, memory buffers, and download cycle, is impractical from a product engineering perspective.

The second constraint is the quality-latency trade-off. Existing lightweight on-device models such as SnapGen++ (0.4B) and SANA-0.6B (0.6B) achieved GenEval scores of 0.66 and 0.64 respectively — noticeably below server-side models. Pushing for higher image quality extended inference time to 10–15 seconds, negating the practical value of real-time interaction. Neither model supported text-guided editing.

Architecture: one U-Net, two modes

DreamLite is built on a pruned U-Net from SDXL. The core unification mechanism is In-Context Spatial Concatenation: the model input is always a pair of spatial latent tensors concatenated along the width axis (left-right). In T2I mode, the right tensor is a fully black placeholder indicating no visual conditioning. In editing mode, the right tensor is the encoded original image to be modified.
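A minimal sketch of how such a paired input could be assembled in PyTorch is shown below; the tensor shapes, the function name, and the use of a zero tensor as the black placeholder are illustrative assumptions, not the released implementation.

    import torch

    def build_unet_input(noisy_latent, cond_latent=None):
        """Concatenate the working latent with a conditioning latent along width.

        noisy_latent: [B, C, H, W] latent being denoised (left half).
        cond_latent:  [B, C, H, W] encoded source image for editing, or None for
                      pure text-to-image generation.
        """
        if cond_latent is None:
            # T2I mode: the right half carries no visual information
            # (a zero tensor stands in for the black placeholder here).
            cond_latent = torch.zeros_like(noisy_latent)
        # Left-right concatenation along the width axis.
        return torch.cat([noisy_latent, cond_latent], dim=-1)  # [B, C, H, 2W]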

Routing between modes is handled by parameter-free task tokens prepended to the text prompt: [Generate] for text-to-image and [Edit] for editing. This allows a single network to distinguish the task without additional adapters, architecture branches, or routing modules — which is critical for maintaining the low parameter count.
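Conceptually, routing then amounts to no more than prefixing the prompt string before tokenization; the helper below is a hypothetical illustration, with the token strings taken from the article.

    def build_prompt(user_prompt: str, mode: str) -> str:
        """Prepend a parameter-free task token so one network can tell the tasks apart."""
        task_token = "[Generate]" if mode == "t2i" else "[Edit]"
        return f"{task_token} {user_prompt}"

    # Example usage:
    # build_prompt("a watercolor fox in a forest", mode="t2i")
    # build_prompt("replace the winter background with a summer scene", mode="edit")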

Training proceeds in three stages. Stage 1 is large-scale T2I pretraining on text-image pairs. Stage 2 activates in-context conditioning and trains the model on instruction-following editing while preserving original image structure. Stage 3 is joint optimization of both tasks under a unified in-context paradigm. The authors report that direct joint training without these preliminary stages led to instability for sub-gigabyte models.
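The staging can be read as a simple curriculum over task mixing ratios; the schedule below is a schematic reconstruction of that recipe, with all field values illustrative rather than taken from the paper.

    import random

    # Schematic three-stage curriculum: T2I pretraining, in-context editing,
    # then joint optimization of both tasks.
    STAGES = [
        {"name": "stage1_t2i_pretrain",   "task_mix": {"generate": 1.0}},
        {"name": "stage2_edit_incontext", "task_mix": {"edit": 1.0}},
        {"name": "stage3_joint_unified",  "task_mix": {"generate": 0.5, "edit": 0.5}},
    ]

    def sample_task(task_mix):
        """Pick the task for the next training batch according to the stage's mix."""
        tasks, weights = zip(*task_mix.items())
        return random.choices(tasks, weights=weights, k=1)[0]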

Quality alignment: RLHF and DMD2

After pretraining, the model undergoes two alignment rounds. The first is supervised fine-tuning (SFT) on high-quality curated data followed by RLHF preference optimization. For the T2I task, the reward model is HPSv3 (Human Preference Score v3). For editing, the reward model is EditReward. Preference optimization uses ReFL (Reward Feedback Learning), a reinforcement learning variant operating directly in the diffusion model's latent space.
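In ReFL-style training, a late-timestep prediction is decoded and scored by the reward model, and the negative reward is minimized. The sketch below captures that loop in schematic form; the function signatures are placeholders, not the actual DreamLite, HPSv3, or EditReward APIs.

    def refl_loss(unet, vae_decode, reward_model, latents, text_emb, timestep):
        """One ReFL-style update (schematic): maximize a differentiable reward."""
        pred_latent = unet(latents, timestep, text_emb)    # one-step prediction of the clean latent
        image = vae_decode(pred_latent)                    # decode latent to pixel space
        reward = reward_model(image, text_emb)             # HPSv3 for T2I, EditReward for editing
        return -reward.mean()                              # minimizing this maximizes the reward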

The second round is DMD2 (Distribution Matching Distillation 2) — a step distillation technique that compresses sampling from tens of denoising steps down to 4. Together, both rounds enable quality comparable to models 10–30 times larger in parameter count, at inference latency on the order of seconds on iPhone 17 Pro class hardware.
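After distillation, inference reduces to a very short denoising loop. The sketch below assumes a diffusers-style scheduler interface (set_timesteps, timesteps, step) purely for illustration; it is not the released inference code.

    import torch

    @torch.no_grad()
    def sample_4_steps(unet, scheduler, text_emb, latent_shape, device="cpu"):
        """Minimal 4-step sampling loop enabled by DMD2 step distillation."""
        latents = torch.randn(latent_shape, device=device)
        scheduler.set_timesteps(4)                     # distilled model needs only 4 steps
        for t in scheduler.timesteps:
            noise_pred = unet(latents, t, text_emb)    # predict noise for the current step
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        return latents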

Results: benchmarks and on-device tests

Across four key benchmarks, DreamLite (0.39B) scores: GenEval 0.72, DPG 85.8, ImgEdit 4.11, GEdit-EN-Q 6.88. For comparison: SnapGen++ (small, 0.4B) achieves GenEval 0.66 and DPG 85.2 with no editing results. SANA-0.6B (0.6B) scores GenEval 0.64 and DPG 83.6. EditMGT (0.96B), an editing-only specialist, achieves ImgEdit 2.89 and GEdit 6.33 — lower than DreamLite on both editing metrics despite 2.5× more parameters.

Comparing against server-side models: FLUX.1-Dev/Kontext (12B) scores GenEval 0.67 and DPG 84.0 — DreamLite with 30× fewer parameters scores higher on GenEval (0.72 vs 0.67). OmniGen2 (4B) records ImgEdit 3.44 and GEdit 6.79 — DreamLite with 10× fewer parameters achieves better ImgEdit (4.11) and comparable GEdit (6.88). Exceptions are LongCat-Image/Edit (6B) with ImgEdit 4.49 and BAGEL (7B) with GEdit 7.20, which maintain an advantage over DreamLite.

The on-device demo on iPhone 17 Pro covers three typical workflows: portrait generation with oil painting style transfer, landscape generation with seasonal background replacement (winter-to-summer), and product scene generation with flexible object substitution. All operations run entirely on-device — user data does not leave the device, which is particularly relevant in the context of data privacy regulations.

Why it matters

DreamLite demonstrates that unifying T2I generation and image editing is achievable below the 0.4B parameter threshold without significant quality loss on standard metrics. For on-device engineering this has direct consequences: one model instead of two means one download cycle, one memory budget, one maintenance point. For product teams building mobile creative tools, this simplifies application architecture without functional compromise.

In-Context Spatial Concatenation is conceptually simple and requires no adapters or routing modules. Parameter-free task token routing does not increase the model's parameter count when switching modes. The combination of DMD2 with RLHF alignment via ReFL establishes a practical optimization scheme for lightweight diffusion models — replicable by other research teams working on on-device deployment. Zero-transfer privacy as an architectural property is increasingly becoming a regulatory requirement under the EU AI Act and equivalent global frameworks.

What comes next

  • Inference code and paper are already available on GitHub (ByteVisionLab/DreamLite) and arXiv (2603.28713) — the community can test the model on a range of mobile devices beyond the presented iPhone 17 Pro
  • An interactive demo is available on HuggingFace Spaces (carlofkl/DreamLite) — enabling quality verification without physical Apple hardware
  • ByteDance has not announced DreamLite integration in any commercial product (e.g., CapCut) — availability as an in-app feature remains unannounced
