PEFT / LoRA
How it works
In the PEFT approach, the base model remains largely frozen, and training covers only a small set of additional parameters, typically in the form of adapters. With LoRA, the update to a large weight matrix is approximated by the product of two smaller low-rank matrices, which are added to selected layers — most commonly attention or linear projection layers. This reduces the number of trainable parameters dramatically. After training, the adapter can be stored separately from the base model and loaded only when needed, simplifying deployment, sharing, and experimentation with multiple fine-tuned variants.
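A minimal sketch of this mechanism in PyTorch (illustrative only; the `LoRALinear` wrapper is a hypothetical name, not the reference implementation):

```python
import math

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))    # r x k, random init
        self.B = nn.Parameter(torch.zeros(d, r))    # d x r, zeros: no change at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```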
Problem solved
PEFT / LoRA addresses the high cost of full fine-tuning of large models. Traditional full-parameter fine-tuning requires substantial GPU memory, training time, and storage for multiple model variants. Deploying separate, fully fine-tuned copies of a large model for different tasks is also operationally expensive. PEFT techniques reduce these costs by training only a small number of additional parameters that can be easily saved, swapped, and reused.
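A sketch of this save-and-swap workflow using the Hugging Face PEFT library (the model name and adapter directory are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder name
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)

# ... train ...
model.save_pretrained("adapter-task-a")   # stores only the small adapter weights

# Later: reuse the same frozen base model with any saved adapter.
base = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base, "adapter-task-a")
```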
Key mechanisms
Components
Matrix A ∈ ℝ^(r×k) maps input from dimension k to the low-rank space of dimension r. Initialized randomly (Kaiming uniform or Gaussian). Trainable.
Matrix B ∈ ℝ^(d×r) maps from the low-rank space of dimension r back to the original output dimension d. Initialized to zeros, ensuring identical model output at the start of training. Trainable.
Matrix W₀ ∈ ℝ^(d×k): the original pretrained weight matrix. During LoRA training, W₀ is frozen (requires_grad=False) and not updated by the optimizer. Post-training merge: W = W₀ + (α/r)·BA.
Scaling factor α/r: controls the contribution of the weight update ΔW = (α/r)·BA to the layer output. α (lora_alpha) is a tunable scaling coefficient; r (rank) determines the adaptation-space dimension. The α/r ratio makes the optimal learning rate approximately rank-independent. These components are exercised in the check after this list.
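A quick check of the initialization contract, reusing the hypothetical `LoRALinear` sketched under "How it works": because B starts at zero, the adapted layer initially reproduces the base layer exactly.

```python
import torch
import torch.nn as nn

base = nn.Linear(64, 64)
lora = LoRALinear(base, r=8, alpha=16)   # the module sketched above

x = torch.randn(2, 64)
# B = 0 makes BA = 0, so at initialization the adapted layer matches the base.
assert torch.allclose(lora(x), base(x))
```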
Common pitfalls
Too low a rank (r = 1–4) for a complex task prevents the adapter from learning the required transformations: the model picks up only style and format, not content. Manifests as high training loss and poor task performance.
If matrix B is not initialized to zeros, the model at training start is not identical to the base model, disrupting stability and hindering convergence. Per the paper, B must be zero-initialized.
Omitting the W = W₀ + (α/r)·BA merge before deployment leaves two computation paths per adapted layer instead of one, increasing latency. Merging manually without the (α/r) scaling factor produces incorrect weights (see the merge sketch after this list).
LoRA adapters require a higher learning rate than full fine-tuning (typically 10–100× higher), since far fewer parameters are trained. Too low an LR leads to slow convergence or no adaptation.
If the (α/r) factor is not correctly applied in the forward pass (h = W₀x + (α/r)·BAx), adaptation outputs are unscaled, leading to unpredictable model behavior, especially when changing r without adjusting α.
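A single-layer merge sketch with illustrative tensors (in the PEFT library, `merge_and_unload()` performs this step), showing that merging with the (α/r) factor reproduces the two-path forward exactly:

```python
import torch

d, k, r, alpha = 512, 512, 8, 16
W0 = torch.randn(d, k)         # frozen pretrained weight
A = torch.randn(r, k) * 0.01   # trained low-rank factors
B = torch.randn(d, r) * 0.01

# Correct merge: the (alpha/r) scaling must be included.
W = W0 + (alpha / r) * (B @ A)

x = torch.randn(k)
h_two_paths = W0 @ x + (alpha / r) * (B @ (A @ x))  # unmerged: two matmul paths
h_merged = W @ x                                    # merged: a single matmul
assert torch.allclose(h_two_paths, h_merged, atol=1e-3)
```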
Evolution
2021: Hu et al. published LoRA (arXiv:2106.09685). The method freezes base weights and adds low-rank matrix pairs A and B to selected Transformer layers, reducing trainable parameters by up to 10,000× while matching the performance of full fine-tuning on GPT-3 175B.
2022: The paper was accepted at ICLR 2022. Microsoft released the official reference implementation, and the Stable Diffusion community adopted LoRA for fine-tuning diffusion models.
2023: Hugging Face integrated LoRA into the PEFT library, making it accessible to millions of developers. Dettmers et al. published QLoRA, enabling fine-tuning of 65B-parameter models on a single 48 GB GPU by combining 4-bit quantization with LoRA adapters.
2023: Zhang et al. published AdaLoRA, which dynamically allocates rank across layers based on singular-value importance scores.
2023–2024: Kalajdzievski proposed rsLoRA, replacing α/r with α/√r scaling for more stable training at higher ranks. Liu et al. published DoRA, which decomposes weights into magnitude and directional components and applies LoRA only to the direction.
Technical details
Hyperparameters (configurable axes)
Rank r: dimension of the low-rank adaptation space. Directly controls the number of trainable parameters (r·(d + k) per layer) and the expressiveness of the adapter.
Alpha (lora_alpha): scaling coefficient in the formula (α/r)·BA. Controls the effective strength of the adaptation. Common heuristic: α = r or α = 2r.
Target modules: which weight matrices of the model are adapted by LoRA. In the original paper: Wq and Wv in self-attention; in practice, often all linear layers for maximum performance.
LoRA dropout: dropout rate applied to the LoRA adapter path during training as regularization. All four axes appear in the configuration sketch after this list.
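A configuration touching all four axes, sketched with the PEFT `LoraConfig` API (the target module names assume a Llama-style base model):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                                  # rank of the adaptation space
    lora_alpha=32,                         # effective update is (alpha/r) * BA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # matrices to adapt
    lora_dropout=0.05,                     # regularization on the adapter path
)
```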
Computational complexity
Space complexity: O(r·(d + k)) additional parameters per adapted layer. PEFT / LoRA lowers the computational cost of fine-tuning by training only a small fraction of the parameters. According to the original paper, LoRA can cut the number of trainable parameters by orders of magnitude and reduce memory requirements relative to full fine-tuning while retaining competitive quality.
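A worked example of the savings, assuming an illustrative d = k = 4096 projection at r = 8:

```python
d = k = 4096   # hidden size typical of a 7B-class model (assumption)
r = 8
full = d * k              # 16,777,216 trainable parameters per matrix
lora = r * (d + k)        # 65,536 trainable parameters per matrix
print(f"{lora / full:.2%}")   # ~0.39% of the full matrix's parameters
```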
Evaluation
PEFT / LoRA is not a benchmark but a family of training techniques. Evaluation typically compares against full fine-tuning on final model quality, number of trainable parameters, memory consumption, training speed, and ease of adapter deployment.
Execution paradigm
LoRA is dense: both the original path W₀x and the adaptation path (α/r)·BAx are computed for every input. There is no conditional activation or sparsity. At inference time (after weight merging), the overhead is zero.
Parallelism
The BA computations are fully parallelizable within and across layers, analogously to standard matrix multiplication in the Transformer. There are no sequential dependencies between LoRA adapters in different layers.
Hardware requirements
LoRA relies on matrix multiplication (BA) and addition to linear-layer outputs, operations fully accelerated by GPU Tensor Cores (NVIDIA A100, H100, RTX 3090+). Because optimizer states are kept only for the small adapters, training GPU memory drops roughly 3× relative to full fine-tuning.
TPU v4/v5 handle the GEMM operations used in LoRA efficiently, and Flax (on JAX) supports LoRA implementations on TPU. Google has used PEFT when fine-tuning its PaLM and Gemini models.