Architecture

Self-Attention

2017 · Active · Published · Updated: 6 May 2026
Key innovation
Replaced fixed context representations with a dynamic, global attention mechanism that computes dependencies between every pair of tokens in a sequence in a single pass.
Category
Architecture
Abstraction level
Primitive
Operation level
Layer · Architecture block
Use cases
Language modeling · Machine translation · Text understanding · Code generation · Image processing (ViT)

How it works

For an input X, three matrices are computed: Q = X*W_Q, K = X*W_K, V = X*W_V. The output is Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. Dividing by sqrt(d_k) prevents excessively large dot-product values. In Multi-Head Attention, the process runs in parallel across h independent heads, and the results are concatenated and projected back to the model dimension.
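
A minimal NumPy sketch of the computation above, assuming illustrative sizes and random projection weights (not a reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) similarities between every token pair
    weights = softmax(scores, axis=-1)         # each row is a distribution over all tokens
    return weights @ V                         # (n, d_k) context vectors

# Multi-Head Attention runs h copies of this with separate W_q/W_k/W_v
# and concatenates the h outputs.

# Toy example with illustrative sizes: 5 tokens, d_model = 16, d_k = 8.
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```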

Problem solved

Recurrent neural networks (RNN, LSTM) process sequences step by step, making it difficult to model long-range dependencies and preventing full parallelization of training.

Implementation

Implementation pitfalls
Quadratic memory complexity · Severity: High

The n x n attention matrix requires O(n^2) memory, making it prohibitive for sequences longer than ~4k tokens without memory-efficient exact implementations (FlashAttention) or sub-quadratic approximations (Linformer, Performer).

No positional inductive bias · Severity: High

Self-attention is permutation-invariant; without explicit positional encoding, token order is invisible to the model.
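
The original Transformer addressed this by adding fixed sinusoidal positional encodings to the token embeddings before the first attention layer; a minimal sketch of that scheme, with illustrative dimensions:

```python
import numpy as np

def sinusoidal_positional_encoding(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    positions = np.arange(n)[:, None]                      # (n, 1) token positions
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the (hypothetical) embedding matrix so token order becomes visible to attention:
# X = token_embeddings + sinusoidal_positional_encoding(n, d_model)
```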

Dot-product scaling · Severity: Medium

Without dividing by sqrt(d_k), dot-products grow large, pushing softmax into near-zero gradient regions.
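
A small numeric illustration under assumed random inputs and an arbitrary d_k = 512: unscaled logits have variance on the order of d_k, so the softmax collapses onto a single token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, n = 512, 8
q = rng.standard_normal(d_k)                 # one query vector
K = rng.standard_normal((n, d_k))            # keys for n tokens

raw = K @ q                                  # unscaled logits, variance roughly d_k
scaled = raw / np.sqrt(d_k)                  # rescaled logits, variance roughly 1

print(raw.var(), scaled.var())               # ~d_k vs ~1
print(softmax(raw))     # typically one weight near 1, the rest near 0 (saturated softmax)
print(softmax(scaled))  # a smoother distribution, so gradients reach all tokens
```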

Evolution

Original paper · NeurIPS 2017 · Vaswani et al.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
2014
Attention mechanism for seq2seq

Bahdanau et al. introduced soft attention for neural machine translation, a precursor to self-attention.

2017
Self-Attention in Transformer
Inflection point

Vaswani et al. propose self-attention as the sole mechanism, replacing recurrence entirely.

2020
Efficient attention variants (Linformer, Performer)

Multiple works propose sub-quadratic approximations to full self-attention for long sequences.

2022
FlashAttention - IO-aware implementation
Inflection point

Dao et al. introduce FlashAttention, achieving 2-4x speedup via tiled computation without approximation.

Technical details

Computational complexity

Time complexity: O(n^2 * d). Space complexity: O(n^2 + n*d).
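
A back-of-the-envelope sketch of the quadratic term, assuming fp16 storage (2 bytes per element) and illustrative sequence lengths:

```python
def attention_matrix_bytes(n, bytes_per_elem=2):
    # Memory for one n x n attention-weight matrix (one head, one layer).
    return n * n * bytes_per_elem

for n in (1024, 4096, 32768):
    print(f"n={n:>5}: {attention_matrix_bytes(n) / 2**20:,.0f} MiB")
# Roughly 2 MiB, 32 MiB, and 2 GiB respectively, per head and per layer.
```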

Compute bottleneck

Attention matrix

Computing and storing the n x n attention matrix is the primary bottleneck for long sequences.

Execution paradigm

Primary mode
dense
Activation pattern
all_paths_active

Parallelism

Parallelism level
fully_parallel
Scope
training · inference · across_tokens

Hardware requirements

GPU · Primary

Matrix multiplications for Q, K, V projections and attention score computation are highly optimized on GPU tensor cores.

TPU · Primary

TPUs are optimized for large matrix multiplications present in attention computation.