The MSA layer is the core attention layer that replaces full attention in the upper transformer layers. For each query, a routing projector computes cosine similarity against all stored routing keys (Kᵣ) and selects the top-k most relevant document blocks; their compressed K/V is then concatenated with the local short-context K/V for standard autoregressive decoding. Lower layers retain independent per-document attention for hierarchical alignment.
The routing projector is a lightweight module that maps each document's token-level keys to a compressed routing key Kᵣ (via chunked mean pooling). At inference time, cosine similarity between the query vector and all stored Kᵣ vectors is computed to select the top-k most relevant document blocks. The Kᵣ vectors are stored in GPU VRAM for fast scoring.
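The routing step described above can be sketched in plain Python. This is a minimal illustration, not the MSA implementation: the function names (`chunked_mean_pool`, `cosine`, `top_k_blocks`), the treatment of each pooled chunk as one routable block, and the list-of-floats vector representation are all assumptions made for the example.

```python
import math

def chunked_mean_pool(keys, chunk_size):
    """Average token-level key vectors over consecutive chunks of chunk_size.

    The last chunk may be shorter than chunk_size; it is averaged over
    the tokens it actually contains.
    """
    pooled = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        pooled.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return pooled

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_blocks(query, routing_keys, k):
    """Return the indices of the k routing keys most similar to query, best first."""
    scores = [cosine(query, rk) for rk in routing_keys]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical toy data: four token-level keys pooled into two routing keys.
token_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
routing_keys = chunked_mean_pool(token_keys, chunk_size=2)
selected = top_k_blocks([0.9, 0.1], routing_keys, k=1)
```

Note that the scoring loop touches every routing key once per query, so its cost grows linearly with the number of stored blocks.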
Document-wise RoPE is a modified Rotary Position Embedding (RoPE) scheme. Positional indices reset to zero at each document boundary (Parallel RoPE), while the active query context uses Global RoPE with an offset of k (the number of retrieved blocks) to maintain causal order. Because positional encoding no longer depends on global sequence length, a model trained on short contexts (e.g., 64K tokens) can extrapolate zero-shot to 100M-token inference without additional training.
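The position-index assignment described above can be illustrated with a short sketch. The helper names `document_position_ids` and `query_position_ids` are hypothetical, and the sketch covers only how indices are assigned, not the rotary rotation itself.

```python
def document_position_ids(doc_lengths):
    """Positions restart at 0 for every document (the per-document scheme)."""
    return [list(range(n)) for n in doc_lengths]

def query_position_ids(query_length, k):
    """Query tokens take consecutive global positions starting at offset k,
    where k is the number of retrieved blocks."""
    return [k + i for i in range(query_length)]

# Two stored documents of 3 and 2 tokens, and a 3-token query with k = 2.
doc_positions = document_position_ids([3, 2])   # [[0, 1, 2], [0, 1]]
query_positions = query_position_ids(3, k=2)    # [2, 3, 4]
```

Because no document index ever exceeds that document's own length, the largest position the model sees is bounded by per-document length and query length rather than by the total size of the memory bank.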
The memory bank is a hierarchical key-value store that holds compressed document representations. Routing keys (Kᵣ) reside in GPU VRAM for fast similarity scoring. Full K/V tensors are stored in CPU RAM and transferred on demand for the top-k selected documents. This hierarchical layout enables 100M-token inference on 2×A800 GPUs using the Memory Parallel inference engine.
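The two-tier layout can be mimicked with a small class. This is a toy model under stated assumptions: the class name `TieredMemoryBank` and its methods are hypothetical, and both tiers are ordinary Python lists standing in for GPU VRAM and CPU RAM, so no real device transfer takes place.

```python
class TieredMemoryBank:
    """Toy two-tier store: small routing keys in a 'fast' tier, full K/V in a 'slow' tier."""

    def __init__(self):
        self.routing_keys = []   # stands in for the GPU-resident routing keys
        self.kv_store = []       # stands in for the CPU-resident full K/V
        self.transfers = 0       # number of blocks fetched from the slow tier

    def add_block(self, routing_key, keys, values):
        """Register one document block and return its block id."""
        self.routing_keys.append(routing_key)
        self.kv_store.append((keys, values))
        return len(self.routing_keys) - 1

    def fetch(self, block_ids):
        """Simulate an on-demand transfer of only the selected blocks."""
        self.transfers += len(block_ids)
        return [self.kv_store[i] for i in block_ids]

bank = TieredMemoryBank()
for name in ("a", "b", "c"):
    bank.add_block([1.0, 0.0], keys=[name + "_k"], values=[name + "_v"])
fetched = bank.fetch([2, 0])   # only two of the three blocks are moved
```

The design point the sketch makes is that scoring only needs `routing_keys`, so the large `kv_store` entries are read only for the blocks that win the top-k selection.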
A reasoning mechanism that adaptively interleaves three modes: generative retrieval (model generates document IDs), context expansion (retrieved content is appended), and generation (final answer synthesis). This enables multi-hop reasoning across scattered memory fragments that single-round retrieval cannot handle.
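The interleaving of the three modes can be pictured as a simple control loop. The sketch below is an assumption-heavy illustration: `interleaved_reasoning`, the `step` callable (which stands in for the model deciding between retrieving and answering), and the `lookup` callable (which stands in for fetching a document by ID) are all hypothetical names, and the round limit is an arbitrary safeguard.

```python
def interleaved_reasoning(question, step, lookup, max_rounds=4):
    """Alternate retrieval and context expansion until the model answers.

    step(context) must return ("retrieve", [doc_id, ...]) or ("answer", text).
    lookup(doc_id) must return the content to append for that document.
    Returns (answer, context); answer is None if max_rounds is exhausted.
    """
    context = [question]
    for _ in range(max_rounds):
        action, payload = step(context)
        if action == "answer":              # generation: final answer synthesis
            return payload, context
        for doc_id in payload:              # generative retrieval produced these IDs
            context.append(lookup(doc_id))  # context expansion
    return None, context

# Hypothetical two-hop example: the second lookup depends on the first.
docs = {"d1": "Alice works at Acme.", "d2": "Acme is based in Oslo."}

def toy_step(context):
    if len(context) == 1:
        return "retrieve", ["d1"]
    if len(context) == 2:
        return "retrieve", ["d2"]
    return "answer", "Oslo"

answer, final_context = interleaved_reasoning("Where does Alice work?", toy_step, docs.get)
```

In the toy run, the answer only becomes available after two retrieval rounds, which is the kind of multi-hop case that a single retrieval pass would miss.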
MSA requires pre-encoding all documents into compressed latent states (Kᵣ, K, V) before inference can begin. For very large memory banks (100M tokens), this pre-encoding phase requires significant compute and storage management.
MSA applies routing only in upper transformer layers, while lower layers maintain independent per-document processing. Applying routing to too few or too many layers affects the balance between local reasoning and memory retrieval quality.
Document-wise RoPE requires that positional indices reset at each document boundary during both training and inference. Applying this scheme inconsistently, or mixing it with standard global RoPE, causes positional drift and severe performance degradation at long contexts.
On-demand transfer of top-k document K/V tensors from CPU RAM to GPU VRAM during each decoding step can become a latency bottleneck when k is large or K/V tensors are not compressed aggressively enough.
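A back-of-the-envelope estimate shows how the transfer volume grows with k. The formula below is an assumption for illustration, not a figure from the MSA paper: it supposes that every routing layer fetches keys and values for every token of each selected block, and all parameter names (`block_tokens`, `routing_layers`, `kv_heads`, `head_dim`) are hypothetical.

```python
def kv_transfer_bytes(k, block_tokens, routing_layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes moved per decoding step if k blocks are fetched for every routing layer."""
    per_token = 2 * kv_heads * head_dim * bytes_per_value   # factor 2: one K and one V
    return k * block_tokens * routing_layers * per_token

def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time, ignoring latency and overlap with compute."""
    return num_bytes / bandwidth_bytes_per_s

# Tiny illustrative numbers (not a real configuration).
small = kv_transfer_bytes(k=2, block_tokens=4, routing_layers=1, kv_heads=1, head_dim=8)
doubled = kv_transfer_bytes(k=4, block_tokens=4, routing_layers=1, kv_heads=1, head_dim=8)
```

Under these assumptions the volume is linear in k and in the compressed block length, which is why both a large k and weak compression push the CPU-to-GPU link toward saturation.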
Yu Chen et al. (EverMind / Shanda Group) submitted 'MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens' to arXiv on March 6, 2026 (arXiv:2603.23516), accepted to NeurIPS 2026. The paper introduces the MSA architecture and demonstrates less than 9% performance degradation when scaling from 16K to 100M tokens.
EverMind open-sourced the MSA codebase (github.com/EverMind-AI/MSA) and released the MSA-4B model checkpoint (based on Qwen3-4B-Instruct-2507) on Hugging Face (EverMind-AI/MSA-4B). The repository accumulated over 2,500 GitHub stars within one day of release.
Time complexity: O(L · k · d) per layer. Space complexity: O(L · d_r) GPU VRAM + O(L · n · d) CPU RAM.
Computing cosine similarity between the query and all stored routing keys Kᵣ scales linearly with the number of documents L in the memory bank. This is the dominant inference bottleneck at very large memory bank sizes (100M tokens).
k is the number of document blocks retrieved per query step. It controls the precision-cost tradeoff: a higher k improves recall at the cost of more K/V transfers and more attention computation.
The per-document token window size used during training. Due to Document-wise RoPE, models trained on short contexts (e.g., 4K–64K tokens) extrapolate to 100M+ token memory banks at inference without retraining.
Routing (MSA layer) is applied only in upper transformer layers; lower layers process documents independently. The split between local and memory-routing layers affects both memory capacity and reasoning depth.
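The layer split can be expressed as a single configuration value. In this sketch the function `layer_plan` and the parameter `first_routing_layer` are hypothetical names, and the layer counts are arbitrary; the source does not state which split MSA actually uses.

```python
def layer_plan(num_layers, first_routing_layer):
    """Label each layer as per-document ('local') or memory-routing ('msa')."""
    if not 0 <= first_routing_layer <= num_layers:
        raise ValueError("first_routing_layer must lie in [0, num_layers]")
    return ["local" if i < first_routing_layer else "msa" for i in range(num_layers)]

plan = layer_plan(num_layers=4, first_routing_layer=2)
# plan == ["local", "local", "msa", "msa"]
```

Moving `first_routing_layer` down gives more layers access to memory, while moving it up leaves more layers for independent per-document processing.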
Total number of tokens stored in the long-term memory bank. MSA has been validated up to 100M tokens on 2×A800 GPUs.
Only the top-k selected document K/V blocks contribute to attention computation. Routing is applied in upper layers only; lower layers process each document independently (dense, per-document).
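The sparse attention step can be sketched for a single query and a single head. This is a simplified illustration under assumptions: the function names are hypothetical, vectors are plain Python lists, and positional encoding, masking, and multi-head handling are omitted.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)                               # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def sparse_memory_attention(query, selected_blocks, local_keys, local_values):
    """Attend over the K/V of the selected blocks followed by the local context.

    selected_blocks is a list of (keys, values) pairs, one per retrieved block.
    Blocks that were not selected never enter the computation.
    """
    keys, values = [], []
    for block_keys, block_values in selected_blocks:
        keys.extend(block_keys)
        values.extend(block_values)
    keys.extend(local_keys)
    values.extend(local_values)
    return attend(query, keys, values)

# With an all-zero query every position gets equal weight, so the output is the mean value.
out = sparse_memory_attention(
    query=[0.0, 0.0],
    selected_blocks=[([[1.0, 0.0]], [[2.0, 0.0]])],
    local_keys=[[0.0, 1.0]],
    local_values=[[0.0, 4.0]],
)
```

The attention cost depends on the number of selected tokens plus the local context length, not on the total size of the memory bank.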
A routing projector computes cosine similarity between the current query and all stored compressed routing keys (Kᵣ). The top-k highest-scoring document blocks are selected and their K/V tensors are retrieved for concatenation with the local context.
Training is parallelizable across documents (each document is processed independently in the lower layers). Inference uses the Memory Parallel engine for distributed router scoring across devices. Top-k selection and the subsequent K/V retrieval are sequential within each decoding step.
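One common way to distribute router scoring is to take a local top-k on each shard and then merge the candidates. The sketch below shows that generic pattern; it is an assumption about how such a merge could work, not a description of the Memory Parallel engine's internals, and the function names are hypothetical.

```python
import heapq

def shard_top_k(scores, k, offset):
    """Top-k of one shard as (score, global_index) pairs, best first."""
    return heapq.nlargest(k, ((s, offset + i) for i, s in enumerate(scores)))

def merged_top_k(shards, k):
    """Merge per-shard candidates into a global top-k of block indices.

    shards is a list of score lists, one per device, in global block order.
    Each shard contributes at most k candidates, so the merge step handles
    at most k * len(shards) items regardless of memory bank size.
    """
    candidates = []
    offset = 0
    for scores in shards:
        candidates.extend(shard_top_k(scores, k, offset))
        offset += len(scores)
    return [index for _, index in heapq.nlargest(k, candidates)]

# Two hypothetical devices holding 3 and 2 routing scores.
best = merged_top_k([[0.1, 0.9, 0.3], [0.8, 0.2]], k=2)   # global indices 1 and 3
```

The per-shard scoring can run concurrently, while the final merge and the K/V fetch that follows it remain a sequential step, matching the description above.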
MSA requires GPU Tensor Cores for efficient transformer attention computation and router scoring. The validated configuration uses 2×A800 GPUs for 100M-token inference, with routing keys stored in GPU VRAM and full K/V in CPU RAM.