Robots Atlas

Self-Attention

Replaced fixed context representations with a dynamic, global attention mechanism that computes dependencies between every pair of tokens in a sequence in a single pass.

Category
Abstraction level
Operation level
Language modeling · Machine translation · Text understanding · Code generation · Image processing (ViT)

For input X, three matrices are computed: Q = X*W_Q, K = X*W_K, V = X*W_V. The result is Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. Dividing by sqrt(d_k) prevents excessively large dot-product values. In Multi-Head Attention, the process runs in parallel across h independent heads, and the results are concatenated.
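The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation; the shapes and matrix names are chosen for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project the input into queries, keys, and values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)   # (n, n) attention matrix
    return weights @ V          # (n, d_k) output

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (4, 8)
```

Multi-head attention repeats this with h independent sets of projection matrices and concatenates the h outputs along the feature axis.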

Recurrent neural networks (RNN, LSTM) process sequences step by step, making it difficult to model long-range dependencies and preventing full parallelization of training.

Time complexity

O(n² · d), where n = sequence length and d = model dimension; quadratic in sequence length

Memory complexity

Attention matrix requires O(n^2) memory

Bottleneck: Attention matrix

Computing and storing the n x n attention matrix is the primary bottleneck for long sequences.

Parallelism

Fully parallel

Paradigm

Dense

All paths active

Common pitfalls

Quadratic memory complexity
HIGH

The n x n attention matrix requires O(n^2) memory, making it prohibitive for sequences longer than ~4k tokens without approximations like FlashAttention.
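As a rough sanity check on that claim, the arithmetic for a single attention matrix (assuming fp16 storage at 2 bytes per entry; the head count is illustrative):

```python
# Memory for one n x n attention matrix in fp16 (2 bytes/entry),
# per head and per batch element -- illustrative arithmetic only.
n = 4096
bytes_per_entry = 2
mb = n * n * bytes_per_entry / 2**20
print(mb)  # 32.0 (MB); with e.g. 32 heads this is 1 GB per batch element
```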

No positional inductive bias
HIGH

Self-attention is permutation-invariant; without explicit positional encoding, token order is invisible to the model.
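This property is easy to verify numerically. In the toy single-head sketch below (a shared random projection for Q, K, and V; all names hypothetical), permuting the input rows permutes the output rows identically, so nothing about token order survives the computation:

```python
import numpy as np

def self_attention(X, W):
    # Minimal single-head self-attention with one shared projection W
    # for Q, K, and V -- a deliberately tiny illustration.
    Q, K, V = X @ W, X @ W, X @ W
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 4))
perm = rng.permutation(5)

# Permuting the input rows permutes the output rows identically:
# without positional encoding, "dog bites man" = "man bites dog".
assert np.allclose(self_attention(X, W)[perm], self_attention(X[perm], W))
```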

Dot-product scaling
MEDIUM

Without dividing by sqrt(d_k), dot-products grow large, pushing softmax into near-zero gradient regions.
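A quick simulation shows why: for unit-variance random queries and keys, dot products have standard deviation of roughly sqrt(d_k), and dividing by sqrt(d_k) restores unit scale (the sample size here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# Dot products of unit-variance random vectors have variance d_k,
# so their typical magnitude grows like sqrt(d_k) ~ 22.6 here.
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
dots = (q * k).sum(axis=1)
print(dots.std())                   # ~ sqrt(512), i.e. ~22.6
print((dots / np.sqrt(d_k)).std())  # ~ 1.0 after scaling
```

With unscaled logits of magnitude ~20, softmax assigns nearly all mass to one position, and its gradients vanish; scaling keeps the logits in a well-conditioned range.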

GENESIS · Source paper

Attention Is All You Need
2017 · NeurIPS 2017 · Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
2014

Attention mechanism for seq2seq

Bahdanau et al. introduced soft attention for neural machine translation, precursor to self-attention.

2017

Self-Attention in Transformer

breakthrough

Vaswani et al. propose self-attention as the sole mechanism, replacing recurrence entirely.

2020

Efficient attention variants (Linformer, Performer)

Multiple works propose sub-quadratic approximations to full self-attention for long sequences.

2022

Flash Attention - IO-aware implementation

breakthrough

Dao et al. introduce FlashAttention, achieving 2-4x speedup via tiled computation without approximation.

GPU Tensor Cores · PRIMARY

Matrix multiplications for Q, K, V projections and attention score computation are highly optimized on GPU tensor cores.

TPU · PRIMARY

TPUs are optimized for large matrix multiplications present in attention computation.

Commonly used with

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so the path between any two tokens has constant length (O(1), versus O(n) in RNNs), making long-range dependencies easy to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), remains an active research direction (FlashAttention, sliding window, linear attention, SSMs).
