Self-Attention
Replaced fixed context representations with a dynamic, global attention mechanism that computes dependencies between every pair of tokens in a sequence in a single pass.
For input X, three projections are computed: Q = X·W_Q, K = X·W_K, V = X·W_V. The result is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))·V. Dividing by sqrt(d_k) keeps the dot products from growing too large and saturating the softmax. In Multi-Head Attention the same computation runs in parallel across h independent heads; the head outputs are concatenated and projected by W_O.
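For reference, a minimal NumPy sketch of the formulas above; the toy dimensions (n = 4 tokens, d_model = 8, h = 2 heads) and the per-head slicing are illustrative choices, not from the source:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) attention matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # X: (n, d_model); all weight matrices: (d_model, d_model)
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Split into h heads, attend independently, then concatenate and project.
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: n = 4 tokens, d_model = 8, h = 2 heads.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (4, 8)
```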
Recurrent neural networks (RNN, LSTM) process sequences step by step, making it difficult to model long-range dependencies and preventing full parallelization of training.
Time complexity: O(n^2 · d), where n = sequence length and d = model dimension; quadratic in sequence length.
Memory: the attention matrix requires O(n^2) memory.
Computing and storing the n x n attention matrix is the primary bottleneck for long sequences.
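To make the O(n^2) term concrete, a back-of-the-envelope estimate; the sequence length, head count, and fp16 storage are assumptions for illustration:

```python
# Rough memory for the attention matrices of a single layer, single example.
n = 4096             # sequence length (illustrative)
h = 16               # attention heads (illustrative)
bytes_per_elem = 2   # fp16

attn_bytes = h * n * n * bytes_per_elem
print(f"{attn_bytes / 2**20:.0f} MiB")  # 512 MiB just for one layer's attention scores
```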
Fully parallel
Dense
All paths active
Common pitfalls
Quadratic memory complexity (HIGH)
The n x n attention matrix requires O(n^2) memory, making it prohibitive for sequences longer than ~4k tokens without sub-quadratic approximations or IO-aware implementations such as FlashAttention.
No positional inductive bias (HIGH)
Self-attention is permutation-invariant; without explicit positional encoding, token order is invisible to the model.
Dot-product scaling (MEDIUM)
Without dividing by sqrt(d_k), dot-products grow large, pushing softmax into near-zero gradient regions.
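A quick numerical sketch of the scaling pitfall above, using random unit-variance vectors (d_k = 512 is an illustrative choice):

```python
import numpy as np

# For random q, k with unit-variance components, q·k has variance d_k,
# so raw dot products grow like sqrt(d_k); dividing by sqrt(d_k) restores unit variance.
rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=-1)

print(dots.std())                    # ~sqrt(512) ≈ 22.6
print((dots / np.sqrt(d_k)).std())   # ~1.0
```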
GENESIS · Source paper
Attention Is All You Need
Attention mechanism for seq2seq
Bahdanau et al. introduced soft attention for neural machine translation, a precursor to self-attention.
Self-Attention in Transformer
Breakthrough: Vaswani et al. propose self-attention as the sole mechanism, replacing recurrence entirely.
Efficient attention variants (Linformer, Performer)
Multiple works propose sub-quadratic approximations to full self-attention for long sequences.
Flash Attention - IO-aware implementation
Breakthrough: Dao et al. introduce FlashAttention, achieving a 2-4x speedup via tiled, IO-aware computation without approximation.
Matrix multiplications for Q, K, V projections and attention score computation are highly optimized on GPU tensor cores.
TPUs are optimized for large matrix multiplications present in attention computation.
Commonly used with
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so any two positions are connected by a constant number of steps (rather than a number growing linearly with distance, as in RNNs), making long-range dependencies easier to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), remains an active research direction (FlashAttention, sliding window attention, linear attention, SSMs).
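For orientation, a minimal sketch of a single encoder block, assuming PyTorch and its nn.MultiheadAttention module; the hyperparameters follow the base configuration of the paper (d_model = 512, 8 heads, d_ff = 2048), while positional encoding and masking are omitted here:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block (post-LN, as in Vaswani et al.):
    multi-head self-attention and a feed-forward network, each wrapped
    in a residual connection followed by LayerNorm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)                 # self-attention: Q = K = V = x
        x = self.norm1(x + self.dropout(attn_out))       # Add & Norm
        x = self.norm2(x + self.dropout(self.ff(x)))     # Add & Norm
        return x

# Toy usage: batch of 2 sequences, 16 tokens each, d_model = 512.
block = EncoderBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```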
| Title | Publisher | Type |
|---|---|---|
| Attention Is All You Need | — | scientific article |