Robots Atlas

Transformer (System)

Replacing recurrent networks (RNN, LSTM) with an architecture based solely on attention, enabling full training parallelism and direct modeling of long-range dependencies without gradient degradation.

Category
Abstraction level
language modeling · chat systems · coding assistants

Self-attention enables parallel sequence processing while capturing all pairwise dependencies.

RNN/LSTM architectures process tokens sequentially, making training slow and long-range dependencies difficult to capture.

01

Token Embedding Layer and Positional Encoding

Converts a sequence of tokens into continuous representations with positional information.

Modular

Converts discrete input tokens into dense vectors (d_model) and adds positional encodings. The original paper used sinusoidal positional encodings; later variants use learned positional embeddings or rotary encodings (RoPE, ALiBi).

i/o: in [B, T] → out [B, T, d_model]
Sinusoidal positional encoding · Learned positional embeddings · Rotary Position Embeddings (RoPE) · ALiBi
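
A minimal PyTorch sketch of this layer, assuming the sinusoidal encoding from the original paper; vocab_size, max_len, and the √d_model input scaling are illustrative choices, not a reference implementation.

```python
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Token embedding + sinusoidal positional encoding (illustrative)."""
    def __init__(self, vocab_size, d_model, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model
        # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)            # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))                      # [d_model/2]
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, tokens):
        # tokens: [B, T] -> [B, T, d_model]
        x = self.embed(tokens) * math.sqrt(self.d_model)   # scaling used in the original paper
        return x + self.pe[: tokens.size(1)]
```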
02

Multi-Head Self-Attention

Computes contextual token representations by attending to all other positions in the sequence.

Modular

The core mechanism of the Transformer. Attention is computed as a weighted sum of values (V), where the weights derive from the compatibility between queries (Q) and keys (K) across all sequence positions. Multi-head attention allows the model to learn different types of dependencies in separate subspaces simultaneously. Complexity: O(n²·d) with respect to sequence length n.

i/o: in [B, T, d_model] → out [B, T, d_model]
Full multi-head attention · Grouped Query Attention (GQA) · Multi-Query Attention (MQA) · Flash Attention
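
A compact PyTorch sketch of standard multi-head self-attention, assuming a fused Q/K/V projection and omitting dropout; shapes follow the [B, T, d_model] convention above.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention (no dropout, illustrative)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: [B, T, d_model] -> [B, h, T, d_k]
        q, k, v = (t.view(B, T, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))

        # softmax(QK^T / sqrt(d_k)) V -- the O(T^2) score matrix is materialized here
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)      # [B, h, T, T]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        ctx = scores.softmax(dim=-1) @ v                            # [B, h, T, d_k]

        # Merge heads and project back to d_model
        return self.out(ctx.transpose(1, 2).reshape(B, T, self.h * self.d_k))
```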
03

Feed-Forward Network (FFN)

Nonlinear transformation of each token's representation, applied independently across positions.

Modular

Two-layer fully connected network applied position-wise (independently to each token) after the self-attention layer. In the original paper: Linear(d_model → d_ff) + ReLU + Linear(d_ff → d_model), where d_ff = 4 · d_model. It serves as the primary repository of the model's parametric knowledge.

i/o: in [B, T, d_model] → out [B, T, d_model]
FFN with ReLU · FFN with GELU · SwiGLU
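
A position-wise FFN sketch in PyTorch using the original ReLU variant; swapping in GELU or a gated SwiGLU layer changes only the inner structure.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> ReLU -> Linear, applied to each token independently."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model               # d_ff = 4 * d_model in the original paper
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                           # GELU / SwiGLU in later variants
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # [B, T, d_model] -> [B, T, d_model]
        return self.net(x)
```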
04

Layer Normalization

Stabilizes activation distributions and facilitates gradient flow through deep Transformer stacks.

Modular

Activation statistics are normalized independently for each token along the d_model dimension. This stabilizes training in deep Transformer networks. The original paper used Post-LN (applied after attention and FFN); newer implementations adopt Pre-LN (applied before attention/FFN) for improved training stability.

Post-LN · Pre-LN · RMSNorm
05

Residual Connections

Ensuring gradient stability in deep Transformer layer stacks.

Each sublayer (attention and FFN) has a residual connection that adds the input to the sublayer's output: output = LayerNorm(x + Sublayer(x)). This enables training of very deep networks by providing a direct gradient path.
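
A schematic comparison of the two residual orderings discussed above; here sublayer stands for an attention or FFN module and norm for nn.LayerNorm (or RMSNorm), so this is illustrative wiring rather than a full block.

```python
def post_ln_step(x, sublayer, norm):
    # Original Post-LN: normalize after the residual sum
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer, norm):
    # Pre-LN: normalize the sublayer input, leave the residual path untouched
    return x + sublayer(norm(x))
```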

Time

n = input sequence length (number of tokens); d = d_model (embedding dimension). Complexity refers to a single self-attention layer. Full model: O(N · n² · d), where N = number of layers. For feed-forward layers: O(n · d · d_ff) = O(n · d²) when d_ff = 4d.

Quadratic complexity with respect to sequence length (n²) is a fundamental limitation for scaling to long contexts. When n > d, attention dominates FFN in compute cost. Numerous works have attempted to reduce this complexity (Longformer, BigBird; FlashAttention does not reduce asymptotic complexity but dramatically lowers the IO constant).
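
A back-of-the-envelope FLOP comparison under the definitions above, counting a multiply-add as 2 FLOPs and omitting the Q/K/V/output projections; the specific n and d are hypothetical.

```python
# Per-layer forward FLOPs for one sequence (illustrative 32k-context configuration)
n, d, d_ff = 32768, 4096, 4 * 4096

attn_flops = 4 * n * n * d                # Q·K^T and attention·V, ~O(n^2 · d)
ffn_flops = 4 * n * d * d_ff              # two linear layers, ~O(n · d · d_ff)

print(f"attention ≈ {attn_flops / 1e12:.1f} TFLOPs, FFN ≈ {ffn_flops / 1e12:.1f} TFLOPs")
# At this context length the quadratic attention term already exceeds the FFN cost.
```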

Memory complexity

The attention matrix is stored per head per layer, requiring O(n²) memory in a standard implementation, plus O(n·d) for intermediate activations. With Flash Attention, HBM memory usage drops to O(n) instead of O(n²).

The n×n attention matrix is the dominant memory component for long sequences in standard implementations. Flash Attention eliminates the need to materialize the full matrix by using block-wise computation.

Bottleneck: Quadratic self-attention complexity

The attention matrix computation Q·Kᵀ requires O(n²·d) floating-point operations and O(n²) memory per head per layer. For long sequences (n > 4096), these costs grow quadratically, becoming the dominant factor limiting context scalability.

Parallelism

Partially parallel

Training is fully parallel across tokens (all tokens processed simultaneously via attention masking). Inference is sequential token-by-token during autoregressive generation, but parallel during prompt processing (prefill). Tensor and pipeline parallelism enable scaling across multiple GPUs/TPUs.

Paradigm

Dense

All paths active

A standard Transformer activates all parameters for every token — no routing, sparsity, or conditional activation. This contrasts with MoE (Mixture of Experts), which activates only a subset of FFN experts. The dense nature is both a strength (predictability, simplicity) and a scalability limitation (cost scales linearly with parameter count).

Number of Layers (Depth)

Critical
  • 6: Original Transformer (Vaswani 2017).
  • 12: BERT-base, GPT-2 small.
  • 32–96: Typical range for large language models (LLaMA 2/3, GPT-4 class).

The number of Transformer blocks (N) in the encoder and/or decoder. The original paper used N=6. Increasing depth improves the model's ability to capture complex dependencies.

Embedding Dimension (Width)

Critical
  • 512: Transformer base (Vaswani 2017).
  • 768: BERT-base.
  • 4096–8192: Typical range for large language models.

The dimensionality of token representation vectors (d_model). Original paper: d_model=512 (base). Controls the representational capacity of the model.

Number of Attention Heads

Standard
  • 8: Transformer base (Vaswani 2017).
  • 12: BERT-base, GPT-2 small.
  • 32–128: Typical range for large models.

The number of parallel attention heads (h). In the original paper: h=8. Each head operates in dimension d_k = d_model / h. More heads allow the model to capture different types of dependencies simultaneously.

Feed-Forward Layer Dimension

Standard
  • 4 × d_model: Standard — e.g. 2048 with d_model=512.
  • 8/3 × d_model: Used with SwiGLU (e.g., LLaMA).

The hidden dimension of the FFN sublayer. In the original paper: d_ff = 4 · d_model = 2048. The FFN layer stores a substantial portion of the model's parametric knowledge.
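
A rough parameter-count sketch combining the hyperparameters above; biases, normalization parameters, and any cross-attention are omitted, and the ~37k vocabulary is an illustrative value.

```python
# Rough parameter count from depth, width, FFN size, and vocabulary (illustrative)
def transformer_params(n_layers, d_model, d_ff, vocab_size):
    attn = 4 * d_model * d_model           # Q, K, V and output projections
    ffn = 2 * d_model * d_ff               # two position-wise linear layers
    embed = vocab_size * d_model           # token embedding (often tied with the output head)
    return n_layers * (attn + ffn) + embed

# Example: N=6, d_model=512, d_ff=2048, ~37k shared BPE vocabulary
print(f"{transformer_params(6, 512, 2048, 37000) / 1e6:.0f}M parameters")   # ≈ 38M
```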

Context Length (Sequence Window)

Critical
  • 512–1024: Early BERT and GPT-2 models.
  • 4096–128k: Modern models with extended context windows (LLaMA 3, Claude 3, GPT-4 Turbo).
  • 1M+: Experimental models with very long context windows (Gemini 1.5 Pro).

The maximum number of tokens processed simultaneously. This directly determines the quadratic cost of attention. The original paper does not specify a fixed context limit — it depends on available computational resources.

Dropout Rate

Standard
  • 0.0: No dropout — used in very large language models.
  • 0.1: Original Transformer base configuration.

Regularization via random activation dropout during training. The original paper uses P_drop = 0.1 for the base model and 0.3 for the big model.

Common pitfalls

Training instability in Post-LN without warmup
HIGH

The original Post-LN variant (normalization after residual summation) is susceptible to gradient explosion without careful learning rate warmup. Deep Post-LN models frequently fail to converge under standard hyperparameters.

Use Pre-LN (normalization before attention and FFN) or RMSNorm Pre-LN — more stable and less sensitive to learning rate warmup. Alternatively, apply careful learning rate warmup (warmup steps = 4000 as in the original paper).

GPU memory overflow with long sequences
CRITICAL

The standard attention implementation materializes an n×n score matrix per head per layer in GPU memory. For n=4096 with 32 heads and 32 layers in float16, this amounts to tens of gigabytes for attention activations alone, exceeding the memory capacity of typical GPUs.

Use Flash Attention or Flash Attention 2/3 — reduces HBM memory from O(n²) to O(n). Alternatively, gradient checkpointing lowers activation memory at the cost of additional computation.
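
A quick estimate of the figure quoted above, assuming a 32-head, 32-layer model at n=4096 in float16; the configuration is illustrative.

```python
# Attention-score activation memory for a standard (non-fused) implementation
n, heads, layers, bytes_per = 4096, 32, 32, 2      # float16

per_head = n * n * bytes_per                       # one n x n score matrix
total = per_head * heads * layers
print(f"{total / 2**30:.0f} GiB of attention scores alone")   # 32 GiB
```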

Incorrect attention mask implementation in batch processing
HIGH

When batch processing sequences of varying lengths, padding and masking of padding positions in the attention matrix are required. An incorrect mask implementation causes information leakage from padding positions into real tokens, degrading model quality.

Apply attention masks carefully — the padding mask must be added to attention logits (not multiplied) using a value of -inf. Verify correctness by inspecting outputs at padding token positions.
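
A minimal sketch of additive padding masking; the [B, h, T, T] and [B, T] shapes are assumptions about the surrounding code, and a large negative constant stands in for -inf.

```python
import torch

def apply_padding_mask(scores, pad_mask):
    """scores: [B, h, T, T] attention logits; pad_mask: [B, T], 1 = real token, 0 = padding."""
    # Add a very large negative value (stand-in for -inf) so softmax assigns ~0 weight
    # to padded key positions; never multiply the scores by the mask.
    neg_inf = torch.finfo(scores.dtype).min
    return scores.masked_fill(pad_mask[:, None, None, :] == 0, neg_inf)
```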

Numerical instability of softmax with large attention logits
HIGH

The softmax function in the attention mechanism can overflow and produce NaN values when applied to large dot products Q·Kᵀ — particularly when the logits are not scaled by 1/√d_k or when float16 arithmetic is used. The original paper introduces the 1/√d_k scaling to address this.

Always scale attention logits by 1/√d_k before softmax. Use numerically stable softmax implementations (log-sum-exp trick). Apply Flash Attention, which incorporates stable numerics by design.
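
A sketch of the scaled, numerically stable computation: apply the 1/√d_k scaling, then subtract the per-row maximum before exponentiating (the log-sum-exp trick).

```python
import math
import torch

def stable_attention_weights(q, k):
    """q, k: [..., T, d_k] -> softmax(QK^T / sqrt(d_k)), computed without overflow."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # scale by 1/sqrt(d_k)
    scores = scores - scores.amax(dim=-1, keepdim=True)         # subtract row max
    weights = scores.exp()
    return weights / weights.sum(dim=-1, keepdim=True)
```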

Linear KV-cache cost growth over long sessions
MEDIUM

During autoregressive inference, the keys (K) and values (V) of each previously generated token must be stored in memory (KV-cache) to avoid recomputation. Cache size grows linearly with sequence length, limiting the number of concurrent sessions.

Use Grouped Query Attention (GQA) or Multi-Query Attention (MQA) to reduce KV-cache size by sharing K/V across heads. Implement KV-cache memory management techniques such as paging (e.g., vLLM's PagedAttention).
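
A back-of-the-envelope KV-cache estimate; the layer count, head dimension, and sequence length are illustrative, and the GQA line assumes 8 shared KV heads.

```python
# Per-sequence KV-cache size for a decoder-only model (illustrative configuration, float16)
layers, kv_heads, d_head, seq_len, bytes_per = 32, 32, 128, 8192, 2

mha_cache = 2 * layers * kv_heads * d_head * seq_len * bytes_per   # K and V tensors
gqa_cache = 2 * layers * 8 * d_head * seq_len * bytes_per          # GQA with 8 shared KV heads

print(f"full MHA: {mha_cache / 2**30:.1f} GiB, GQA: {gqa_cache / 2**30:.1f} GiB")  # 4.0 vs 1.0 GiB
```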

GENESIS · Source paper

Attention Is All You Need
2017 · NeurIPS 2017 · Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
2017

Original Transformer — Attention Is All You Need

breakthrough

Vaswani et al. (Google Brain/Research) propose the Transformer architecture for English-German and English-French machine translation. It eliminates RNNs and CNNs in favor of an attention-only mechanism, achieving a new state of the art on WMT benchmarks.

2018

GPT-1 — unidirectional decoding for language generation

breakthrough

OpenAI adapts the Transformer decoder architecture to language via pretraining on large corpora and fine-tuning on downstream tasks. This establishes the decoder-only Transformer paradigm for language modeling.

2018

BERT — bidirectional Transformer encoder

breakthrough

Devlin et al. (Google) introduce BERT (Bidirectional Encoder Representations from Transformers), which pretrains a Transformer encoder via masked token prediction and next-sentence prediction, setting new state-of-the-art results on 11 NLP benchmarks simultaneously.

2020

GPT-3 — scaling to 175 billion parameters and emergent few-shot capabilities

breakthrough

OpenAI scales the Transformer decoder architecture to 175 billion parameters. The demonstration of few-shot and in-context learning capabilities without fine-tuning was a landmark result revealing emergent properties of scaling.

2020

Vision Transformer (ViT) — Transformer for Images

breakthrough

Dosovitskiy et al. (Google Brain) apply the Transformer architecture directly to sequences of image patches, achieving results competitive with CNN-based networks on ImageNet when sufficient training data is available.

2022

Flash Attention — IO-aware attention implementation

breakthrough

Dao et al. (Stanford) publish FlashAttention — an implementation that computes attention in tiles without materializing the full n×n matrix, reducing HBM memory usage from O(n²) to O(n) while producing mathematically identical results. This makes practical training on long contexts feasible.

2023

LLaMA — open-source Transformer models in the GPT-3 class

breakthrough

Meta AI releases LLaMA — a family of open-source Transformer decoder models (7B–65B parameters) trained longer on more data, matching GPT-3 performance with fewer parameters. It accelerates open LLM research.

GPU Tensor Cores (PRIMARY)

Matrix operations dominating attention computations (Q·Kᵀ, attention·V) and FFN layers are GEMM operations optimally executed by GPU tensor cores (NVIDIA A100/H100/H200). The Transformer architecture is the de facto benchmark driving tensor core requirements in modern GPUs.

NVIDIA designs tensor cores and libraries (cuBLAS, cuDNN, Flash Attention CUDA kernels) around the operations that dominate Transformer workloads. Training large models is performed almost exclusively on A100/H100-class GPUs or newer.

TPU (GOOD)

Google TPU v4/v5 are designed for efficient execution of Transformer matrix operations. Several key models (PaLM, Gemini) were trained exclusively on TPUs.

XLA compilation on TPU requires static tensor shapes, which constrains the implementation of dynamic padding and variable sequence lengths.