Self-Attention
How it works
For an input sequence X, three projections are computed: Q = X*W_Q, K = X*W_K, V = X*W_V. The output is Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. Dividing by sqrt(d_k) keeps the dot products from growing too large (see Implementation below). In Multi-Head Attention, the same computation runs in parallel across h independent heads; the per-head results are concatenated and passed through an output projection W_O.
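A minimal NumPy sketch of the single-head formula above (the sizes n, d_model, d_k are illustrative, not taken from any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.
    X: (n, d_model); W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled attention logits
    return softmax(scores, axis=-1) @ V   # (n, d_v) weighted sum of values

# Toy usage: 5 tokens, d_model=16, d_k=d_v=8
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)    # shape (5, 8)
```

A multi-head version would apply `self_attention` h times with independent weight triples and concatenate the h outputs along the feature axis before the W_O projection.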
Problem solved
Recurrent neural networks (RNNs, LSTMs) process sequences step by step, which makes long-range dependencies hard to model and prevents training from being fully parallelized. Self-attention instead connects every pair of positions directly in a constant number of sequential operations, so the whole sequence can be processed in parallel.
Implementation
The n x n attention matrix requires O(n^2) memory, which becomes prohibitive for sequences beyond ~4k tokens unless the full matrix is avoided, either through sub-quadratic approximations or through exact tiled algorithms like FlashAttention.
Self-attention is permutation-equivariant; without explicit positional encoding, token order is invisible to the model.
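The standard remedy from the original Transformer is to add fixed sinusoidal positional encodings to the token embeddings before the first attention layer. A sketch, where `X_embedded` is a hypothetical (n, d_model) embedding matrix and d_model is assumed even:

```python
import numpy as np

def sinusoidal_positional_encoding(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n)[:, None]               # (n, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)              # odd dimensions get cos
    return pe

# Inject order before attention:
# X_input = X_embedded + sinusoidal_positional_encoding(n, d_model)
```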
Without dividing by sqrt(d_k), dot products grow with dimension: for query/key components with unit variance, the dot product has variance d_k, so logit magnitudes scale like sqrt(d_k). Large logits saturate the softmax into a near one-hot distribution, where gradients are close to zero.
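A quick check of both the variance claim and the saturation it causes (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# Dot products of unit-variance vectors have variance d_k ...
q = rng.normal(size=(1000, d_k))
k = rng.normal(size=(1000, d_k))
dots = (q * k).sum(axis=1)
print(dots.std())                   # ~sqrt(512) = 22.6
print((dots / np.sqrt(d_k)).std())  # ~1.0 after scaling

# ... and logits at that scale make softmax effectively one-hot:
logits = rng.normal(size=8) * np.sqrt(d_k)  # typical unscaled magnitudes
p = np.exp(logits - logits.max())
p /= p.sum()
print(p.max())                      # ~1.0: saturated, gradients near zero
```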
Evolution
Bahdanau et al. (2014) introduced soft attention for neural machine translation, a precursor to self-attention.
Vaswani et al. (2017) proposed self-attention as the sole sequence-modeling mechanism in "Attention Is All You Need", replacing recurrence entirely.
Multiple works (e.g., Linformer, Performer, Longformer) proposed sub-quadratic approximations to full self-attention for long sequences.
Dao et al. (2022) introduced FlashAttention, achieving a 2-4x speedup via tiled, IO-aware computation without any approximation.
Technical details
Computational complexity
Time complexity: O(n^2 * d). Space complexity: O(n^2 + n*d).
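Back-of-envelope numbers for the quadratic term (sizes are illustrative):

```python
# Memory for the n x n attention score matrix in fp16
n, heads, bytes_per_el = 4096, 16, 2
per_head = n * n * bytes_per_el          # 32 MiB for one head
per_layer = heads * per_head             # 512 MiB for one layer's scores
print(per_head / 2**20, per_layer / 2**20)
# Doubling n to 8192 quadruples both numbers (O(n^2) scaling).
```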
Compute bottleneck
Computing and storing the n x n attention matrix is the primary bottleneck for long sequences.
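The full matrix never has to be materialized at once, though. Below is a didactic NumPy sketch of the online-softmax idea behind tiled kernels such as FlashAttention, processing one query row against K/V in blocks; the real implementation also tiles queries and keeps blocks in fast GPU SRAM:

```python
import numpy as np

def attention_row_streaming(q, K, V, block=128):
    """Attention output for one query q, never storing all n logits."""
    d_k = q.shape[0]
    m = -np.inf                     # running max of logits seen so far
    l = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of value rows
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d_k)  # one block of logits
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)   # rescale old accumulators to the new max
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ V[start:start + block]
        m = m_new
    return acc / l

# Sanity check against the dense computation
rng = np.random.default_rng(0)
n, d = 1000, 64
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(attention_row_streaming(q, K, V), w @ V)
```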
Execution paradigm
Fully parallel: all token positions are processed simultaneously, in contrast to the sequential steps of recurrent models.
Hardware requirements
Matrix multiplications for Q, K, V projections and attention score computation are highly optimized on GPU tensor cores.
TPUs are likewise optimized for the large matrix multiplications that dominate attention.
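For illustration, a short PyTorch (2.0+) snippet; `scaled_dot_product_attention` is a fused API that dispatches to an efficient backend (including FlashAttention-style kernels) when the hardware and dtype allow it. Shapes and sizes here are arbitrary:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision on GPU lets the matmuls run on tensor cores
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, sequence length, head dim)
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v)  # (1, 8, 1024, 64)
```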