Self-Attention
Replaced fixed context representations with a dynamic, global attention mechanism that computes dependencies between every pair of tokens in a sequence in a single pass.
For input X, three projections are computed: Q = X·W_Q, K = X·W_K, V = X·W_V. The result is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))·V. Dividing by sqrt(d_k) keeps the dot products from growing too large and saturating the softmax. In Multi-Head Attention the same computation runs in parallel across h independent heads; the head outputs are concatenated and projected by W_O.
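For reference, a minimal NumPy sketch of the formulas above; the toy dimensions (n = 4 tokens, d_model = 8, h = 2 heads) and the per-head slicing are illustrative choices, not from the source:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) attention matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # X: (n, d_model); all weight matrices: (d_model, d_model)
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Split into h heads, attend independently, then concatenate and project.
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: n = 4 tokens, d_model = 8, h = 2 heads.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (4, 8)
```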
Recurrent neural networks (RNN, LSTM) process sequences step by step, making it difficult to model long-range dependencies and preventing full parallelization of training.
Time complexity: O(n^2 · d), where n = sequence length and d = model dimension; quadratic in sequence length.
Memory: the attention matrix requires O(n^2) memory.
Computing and storing the n x n attention matrix is the primary bottleneck for long sequences.
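To make the O(n^2) term concrete, a back-of-the-envelope estimate; the sequence length, head count, and fp16 storage are assumptions for illustration:

```python
# Rough memory for the attention matrices of a single layer, single example.
n = 4096             # sequence length (illustrative)
h = 16               # attention heads (illustrative)
bytes_per_elem = 2   # fp16

attn_bytes = h * n * n * bytes_per_elem
print(f"{attn_bytes / 2**20:.0f} MiB")  # 512 MiB just for one layer's attention scores
```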
Fully parallel
Dense
All paths active
Common pitfalls
Quadratic memory complexity (HIGH)
The n x n attention matrix requires O(n^2) memory, making it prohibitive for sequences longer than ~4k tokens without sub-quadratic approximations or IO-aware implementations such as FlashAttention.
No positional inductive bias (HIGH)
Self-attention is permutation-invariant; without explicit positional encoding, token order is invisible to the model.
Dot-product scaling (MEDIUM)
Without dividing by sqrt(d_k), dot-products grow large, pushing softmax into near-zero gradient regions.
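A quick numerical sketch of the scaling pitfall above, using random unit-variance vectors (d_k = 512 is an illustrative choice):

```python
import numpy as np

# For random q, k with unit-variance components, q·k has variance d_k,
# so raw dot products grow like sqrt(d_k); dividing by sqrt(d_k) restores unit variance.
rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=-1)

print(dots.std())                    # ~sqrt(512) ≈ 22.6
print((dots / np.sqrt(d_k)).std())   # ~1.0
```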
GENESIS · Source paper
Attention Is All You Need
Attention mechanism for seq2seq
Bahdanau et al. introduced soft attention for neural machine translation, a precursor to self-attention.
Self-Attention in Transformer
Breakthrough: Vaswani et al. propose self-attention as the sole mechanism, replacing recurrence entirely.
Efficient attention variants (Linformer, Performer)
Multiple works propose sub-quadratic approximations to full self-attention for long sequences.
Flash Attention - IO-aware implementation
Breakthrough: Dao et al. introduce FlashAttention, achieving a 2-4x speedup via tiled, IO-aware computation without approximation.
Matrix multiplications for Q, K, V projections and attention score computation are highly optimized on GPU tensor cores.
TPUs are optimized for large matrix multiplications present in attention computation.
Commonly used with
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so any two positions are connected by a constant number of steps (rather than a number growing linearly with distance, as in RNNs), making long-range dependencies easier to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), remains an active research direction (FlashAttention, sliding window attention, linear attention, SSMs).
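For orientation, a minimal sketch of a single encoder block, assuming PyTorch and its nn.MultiheadAttention module; the hyperparameters follow the base configuration of the paper (d_model = 512, 8 heads, d_ff = 2048), while positional encoding and masking are omitted here:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block (post-LN, as in Vaswani et al.):
    multi-head self-attention and a feed-forward network, each wrapped
    in a residual connection followed by LayerNorm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)                 # self-attention: Q = K = V = x
        x = self.norm1(x + self.dropout(attn_out))       # Add & Norm
        x = self.norm2(x + self.dropout(self.ff(x)))     # Add & Norm
        return x

# Toy usage: batch of 2 sequences, 16 tokens each, d_model = 512.
block = EncoderBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```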
| Title | Publisher | Type |
|---|---|---|
| Attention Is All You Need | — | scientific article |