MSA

Key innovation

MSA (Memory Sparse Attention) introduces an end-to-end differentiable sparse latent memory layer embedded directly within the Transformer attention mechanism. It achieves near-linear complexity, O(L) in the context length L, and scales to 100 million context tokens without an external retrieval system.

Components

Memory Sparse Attention Layer

Replaces the standard full-attention mechanism in the upper Transformer layers; for each query it selects the top-k documents from a compressed memory bank and appends their K/V pairs to the local context.

The core attention layer that replaces full attention in upper transformer layers. For each query, a routing projector computes cosine similarity against all stored routing keys (Kᵣ), selects the top-k most relevant document blocks, and concatenates their compressed K/V with the local short-context K/V for standard autoregressive decoding. Lower layers retain independent per-document attention for hierarchical alignment.
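
The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not the official implementation: the function names (`select_top_k`, `build_attention_kv`), the 2-D vectors, and the choice to place retrieved K/V ahead of the local K/V are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_top_k(query, routing_keys, k):
    """Return indices of the k routing keys most similar to the query, best first."""
    scores = [cosine(query, key) for key in routing_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def build_attention_kv(selected, memory_kv, local_kv):
    """Concatenate the retrieved blocks' K/V with the local context's K/V.

    Assumption: retrieved blocks come first, the local context last.
    """
    keys, values = [], []
    for doc_id in selected:
        doc_k, doc_v = memory_kv[doc_id]
        keys.extend(doc_k)
        values.extend(doc_v)
    keys.extend(local_kv[0])
    values.extend(local_kv[1])
    return keys, values

# Toy memory bank: three document blocks with hypothetical 2-D routing keys.
routing_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
memory_kv = {
    0: (["k0a", "k0b"], ["v0a", "v0b"]),
    1: (["k1a"], ["v1a"]),
    2: (["k2a"], ["v2a"]),
}
local_kv = (["k_local"], ["v_local"])

selected = select_top_k([1.0, 0.1], routing_keys, k=2)
keys, values = build_attention_kv(selected, memory_kv, local_kv)
print(selected, keys)
```

Strings stand in for K/V tensors so the concatenation order is easy to read; a real layer would operate on batched tensors per attention head.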

Router (Routing Projector)

Routing key projection Kᵣ: a compressed document representation used to select top-k documents based on cosine similarity with the current query.

A lightweight projector that maps each document's token-level keys to a compressed routing key Kᵣ (via chunked mean pooling). At inference time, cosine similarity between the query vector and all stored Kᵣ vectors is computed to select the top-k most relevant document blocks. The routing keys are stored in GPU VRAM for fast scoring.
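
Chunked mean pooling, as mentioned above, can be illustrated as averaging consecutive groups of key vectors. The sketch below is a minimal stand-in under stated assumptions: the function name `chunked_mean_pool` and the chunk size are hypothetical, and averaging a shorter final chunk over its actual length is a choice made for the example, not something the source specifies.

```python
def chunked_mean_pool(token_keys, chunk_size):
    """Average each run of chunk_size consecutive key vectors into one vector."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pooled = []
    for start in range(0, len(token_keys), chunk_size):
        chunk = token_keys[start:start + chunk_size]
        dim = len(chunk[0])
        # Mean over the tokens in this chunk, one dimension at a time.
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled

# Five token-level keys pooled in chunks of two (the last chunk holds one token).
token_keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
routing = chunked_mean_pool(token_keys, chunk_size=2)
print(routing)
```

Five token keys become three pooled vectors, which is why scoring against pooled keys is much cheaper than scoring against every token.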

Document-wise RoPE

Positioning mechanism that resets the position counter to zero at the start of each document, enabling positional extrapolation from short training contexts to 100M inference tokens.

A modified Rotary Position Embedding (RoPE) scheme where positional indices reset to zero at each document boundary (Parallel RoPE). The active query context uses Global RoPE with an offset of k (number of retrieved blocks) to maintain causal order. This decouples positional encoding from global sequence length, enabling zero-shot extrapolation from short training contexts (e.g., 64K tokens) to 100M-token inference without additional training.

Parallel (Document-level) RoPE: Positional counter resets to zero at the start of each document in the memory bank, preventing positional drift.

Global RoPE: Applied to the active query context, offset by k (number of retrieved top-k blocks) to maintain correct causal ordering.
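
The two position-indexing schemes can be made concrete by generating the position ids each one would feed into RoPE. This is only a sketch of the indexing, assuming the offset for the active context is exactly k as described above; the function names are hypothetical and the rotation itself is omitted.

```python
def parallel_positions(doc_lengths):
    """Position ids for memory-bank documents: the counter restarts at 0 per document."""
    return [list(range(length)) for length in doc_lengths]

def global_positions(query_length, k):
    """Position ids for the active query context, shifted by k retrieved blocks."""
    return [k + i for i in range(query_length)]

# Two memory documents (3 and 2 tokens) and a 4-token query with k = 2 retrieved blocks.
memory_pos = parallel_positions([3, 2])
query_pos = global_positions(query_length=4, k=2)
print(memory_pos, query_pos)
```

Because no memory position id exceeds the longest single document, the position range seen at inference does not grow with the size of the memory bank.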

KV Cache Memory Store

Hierarchical storage of compressed document latent states: routing keys Kᵣ in GPU VRAM (fast access for scoring), full K/V tensors in CPU RAM (memory efficiency).

A hierarchical key-value memory store that holds compressed document representations. Routing keys (Kᵣ) reside in GPU VRAM for fast similarity scoring. Full K/V tensors are stored in CPU RAM and transferred on demand for the top-k selected documents. This hierarchical layout enables 100M-token inference on 2×A800 GPUs using the Memory Parallel inference engine.
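
The two-tier layout can be modelled with a small class in which plain dictionaries stand in for GPU VRAM and CPU RAM. Everything here is a hypothetical illustration (class name, method names, the transfer counter); it only shows that routing keys are always resident while K/V is moved for selected documents alone.

```python
class TieredMemoryStore:
    """Toy two-tier store: routing keys in a fast tier, full K/V in a slow tier."""

    def __init__(self):
        self.fast_routing_keys = {}  # stands in for GPU VRAM
        self.slow_kv = {}            # stands in for CPU RAM
        self.transfers = 0           # number of slow-to-fast document fetches

    def add_document(self, doc_id, routing_key, keys, values):
        """Register one pre-encoded document in both tiers."""
        self.fast_routing_keys[doc_id] = routing_key
        self.slow_kv[doc_id] = (keys, values)

    def fetch(self, doc_ids):
        """Move K/V for the selected documents only, counting each transfer."""
        fetched = {}
        for doc_id in doc_ids:
            fetched[doc_id] = self.slow_kv[doc_id]
            self.transfers += 1
        return fetched

store = TieredMemoryStore()
store.add_document(0, [1.0, 0.0], ["k0"], ["v0"])
store.add_document(1, [0.0, 1.0], ["k1"], ["v1"])
store.add_document(2, [0.7, 0.7], ["k2"], ["v2"])

fetched = store.fetch([0, 2])  # pretend these ids came from top-k routing
print(sorted(fetched), store.transfers)
```

Only two of the three documents cross the tier boundary, which is the property that keeps the slow tier's size from dominating per-step cost.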

Memory Interleave

Multi-step reasoning mechanism that iteratively interleaves document identifier generation and context expansion, enabling multi-hop retrieval across distributed memory chunks.

A reasoning mechanism that adaptively interleaves three modes: generative retrieval (model generates document IDs), context expansion (retrieved content is appended), and generation (final answer synthesis). This enables multi-hop reasoning across scattered memory fragments that single-round retrieval cannot handle.
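
The three modes can be pictured as a loop that alternates between proposing document IDs and expanding the context, then generates the answer. The sketch below uses stub callables in place of the model; `interleaved_answer`, `toy_propose`, `toy_generate`, and the `max_hops` cap are all hypothetical names and choices made for illustration.

```python
def interleaved_answer(question, memory, propose_doc_ids, generate_answer, max_hops=4):
    """Alternate generative retrieval and context expansion, then synthesize an answer."""
    context = []
    seen = set()
    for _ in range(max_hops):
        # Generative retrieval: the model proposes document ids given what it has so far.
        doc_ids = [d for d in propose_doc_ids(question, context) if d not in seen]
        if not doc_ids:
            break  # nothing new requested, so move on to generation
        # Context expansion: append the retrieved content.
        for doc_id in doc_ids:
            seen.add(doc_id)
            context.append(memory[doc_id])
    # Generation: final answer synthesis over the expanded context.
    return generate_answer(question, context), context

# Stubs standing in for the model; a two-hop toy question.
memory = {
    "doc_a": "Alice manages the Paris office.",
    "doc_b": "The Paris office opened in 2019.",
}

def toy_propose(question, context):
    if not context:
        return ["doc_a"]
    if len(context) == 1:
        return ["doc_b"]
    return []

def toy_generate(question, context):
    return f"answer based on {len(context)} documents"

answer, context = interleaved_answer(
    "When did the office Alice manages open?", memory, toy_propose, toy_generate
)
print(answer)
```

The second document is only requested after the first has been read, which is the multi-hop behaviour that a single retrieval round cannot reproduce.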

Implementation

Reference implementations

EverMind-AI/MSA (GitHub) · Python · EverMind · Official

EverMind-AI/MSA-4B (Hugging Face) · Python · EverMind · Official

Implementation pitfalls

Cold Start: Building a Memory Bank Before Inference · Medium

MSA requires pre-encoding all documents into compressed latent states (Kᵣ, K, V) before inference can begin. For very large memory banks (100M tokens), this pre-encoding phase requires significant compute and storage management.

Fix: Pre-encode documents offline and store the resulting routing keys and K/V tensors. Apply chunked mean pooling during this offline pass so that only the compressed states are stored. Confirm that CPU RAM and GPU VRAM are sufficient for the hierarchical store before serving.
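
A back-of-the-envelope estimate helps with the capacity check before serving. The helper below is a rough sketch, assuming routing keys and K/V are each pooled to one vector per chunk per routed layer; the function name and every numeric value in the example are hypothetical and are not taken from the paper or the MSA-4B configuration.

```python
def memory_bank_bytes(total_tokens, chunk_size, num_layers, num_kv_heads,
                      head_dim, bytes_per_value=2):
    """Rough byte counts for pooled routing keys and pooled K/V tensors."""
    chunks = -(-total_tokens // chunk_size)  # ceiling division
    vector_bytes = num_kv_heads * head_dim * bytes_per_value
    routing_bytes = chunks * num_layers * vector_bytes   # one Kr per chunk per layer
    kv_bytes = 2 * chunks * num_layers * vector_bytes    # one K and one V per chunk per layer
    return routing_bytes, kv_bytes

# Hypothetical toy configuration, chosen only to keep the arithmetic easy to follow.
routing_bytes, kv_bytes = memory_bank_bytes(
    total_tokens=1_000_000, chunk_size=64, num_layers=4, num_kv_heads=2, head_dim=64
)
print(routing_bytes, kv_bytes)
```

Under these assumptions the routing keys sized for the fast tier take half the space of the K/V sized for the slow tier, and both shrink linearly as the chunk size grows.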

Incorrect separation of routing and local layers · Medium

MSA applies routing only in upper transformer layers, while lower layers maintain independent per-document processing. Applying routing to too few or too many layers affects the balance between local reasoning and memory retrieval quality.

Fix: Follow the layer split configuration specified in the paper and official implementation. Ablate on a held-out long-context validation set when adapting to new backbone architectures.

RoPE inconsistency between training and inference modes · High

Document-wise RoPE requires that positional indices reset at each document boundary during both training and inference. Failure to apply this scheme consistently — or mixing with standard global RoPE — causes positional drift and severe performance degradation at long contexts.

Fix: Apply Parallel RoPE (document-level reset) to all memory-bank documents and Global RoPE with k-offset to the active query context, exactly as specified in the paper. Verify implementation against the official codebase.

CPU-GPU bandwidth bottleneck with large memory banks · Medium

On-demand transfer of top-k document K/V tensors from CPU RAM to GPU VRAM during each decoding step can become a latency bottleneck when k is large or K/V tensors are not compressed aggressively enough.

Fix: Keep k small (as validated in the paper). Apply aggressive K/V compression (chunked mean pooling) to reduce transfer volume. Use pinned CPU memory and asynchronous prefetching where possible.
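
The effect of k and of the compression ratio on transfer time can be estimated with simple arithmetic. The function below is a sketch under stated assumptions: K/V is pooled to one vector per chunk per routed layer, transfers are not overlapped with compute, and the function name, bandwidth figure, and model dimensions are hypothetical.

```python
def transfer_seconds(k, doc_tokens, chunk_size, num_layers, num_kv_heads,
                     head_dim, bytes_per_value, bandwidth_bytes_per_s):
    """Estimated CPU-to-GPU copy time for the K/V of k selected documents."""
    chunks_per_doc = -(-doc_tokens // chunk_size)  # ceiling division
    bytes_per_doc = (2 * chunks_per_doc * num_layers
                     * num_kv_heads * head_dim * bytes_per_value)
    return k * bytes_per_doc / bandwidth_bytes_per_s

# Hypothetical settings shared by all three estimates.
common = dict(doc_tokens=4096, num_layers=4, num_kv_heads=2, head_dim=64,
              bytes_per_value=2, bandwidth_bytes_per_s=16e9)

t_small_k = transfer_seconds(k=4, chunk_size=64, **common)
t_large_k = transfer_seconds(k=16, chunk_size=64, **common)
t_weak_compression = transfer_seconds(k=4, chunk_size=32, **common)
print(t_small_k, t_large_k, t_weak_compression)
```

In this model, transfer time grows linearly with k and doubles when the chunk size is halved, so both levers in the fix above act on the same bottleneck.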

AI models using this technology

Related producers

EverMind

Evolution

Original paper · 2026 · NeurIPS 2026 · Yu Chen

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen

2026

MSA paper published on arXiv and at NeurIPS 2026 (EverMind)

Inflection point

Yu Chen et al. (EverMind / Shanda Group) submitted 'MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens' to arXiv on March 6, 2026 (arXiv:2603.23516); the paper was accepted to NeurIPS 2026. It introduces the MSA architecture and reports less than 9% performance degradation when scaling from 16K to 100M tokens.

2026

Open-source release of MSA code and MSA-4B model on GitHub and Hugging Face

EverMind open-sourced the MSA codebase (github.com/EverMind-AI/MSA) and released the MSA-4B model checkpoint (based on Qwen3-4B-Instruct-2507) on Hugging Face (EverMind-AI/MSA-4B). The repository accumulated over 2,500 GitHub stars within one day of release.

Sources

EverMind-AI/MSA - GitHub repository

EverMind official website

Breaking the 100M Token Limit: EverMind's MSA Architecture Achieves Efficient End-to-End Long-Term Memory for LLMs
