Robots Atlas
Training

PEFT / LoRA

2021 · Active · Published: 20 March 2026 · Updated: 20 March 2026
Key innovation
Freezing the original model weights and adding a pair of small low-rank matrices (A and B) as trainable weight updates, reducing the number of trainable parameters by orders of magnitude without adding inference-time computational overhead.
Category
Training
Abstraction level
Pattern
Operation level
Post-training · Training
Use cases
Fine-tuning large language models for domain-specific tasks
Instruction fine-tuning of open-weight models
Fine-tuning multimodal and vision-language models
Efficient fine-tuning of diffusion and generative models
Building multiple lightweight variants of a single base model
Research experiments on limited hardware
Deploying adapters for different clients, languages, or specializations

How it works

In the PEFT approach, the base model remains largely frozen, and training covers only a small set of additional parameters, typically in the form of adapters. With LoRA, the update to a large weight matrix is approximated by the product of two smaller low-rank matrices, which are added to selected layers — most commonly attention or linear projection layers. This reduces the number of trainable parameters dramatically. After training, the adapter can be stored separately from the base model and loaded only when needed, simplifying deployment, sharing, and experimentation with multiple fine-tuned variants.

Problem solved

PEFT / LoRA addresses the high cost of full fine-tuning of large models. Traditional full-parameter fine-tuning requires substantial GPU memory, training time, and storage for multiple model variants. Deploying separate, fully fine-tuned copies of a large model for different tasks is also operationally expensive. PEFT techniques reduce these costs by training only a small number of additional parameters that can be easily saved, swapped, and reused.

Key mechanisms

Freezing the majority of base model parameters
Training a small number of additional parameters
Using adapters instead of full fine-tuning
LoRA: decomposing weight updates into two low-rank matrices
Significant reduction in GPU memory usage during training
Separate saving and loading of lightweight adapters
Easy sharing of multiple fine-tuned variants of a single base model

Strengths & limitations

Strengths
Significantly reduces memory and training cost compared to full fine-tuning
Enables training large models on cheaper or lower-end hardware
Simplifies storage and distribution of lightweight adapters
Can achieve quality comparable to full fine-tuning
Accelerates experimentation and iteration on model adaptation
Integrates well with the Hugging Face ecosystem
Allows a single base model to be shared across multiple adapters
Limitations
Does not always match full fine-tuning across every task and architecture
Effectiveness depends on the choice of layers, adapter rank, and hyperparameters
May be less flexible for tasks requiring deep changes to the model's representations
Complex adapter stacks can complicate variant management
Performance and quality depend on the specific implementation and framework integration
PEFT is a broad family of methods, so the term is often used imprecisely

Components

Matrix A (down-projection): Projects the input into a low-rank space; the first of the two trainable decomposition matrices.

Matrix A ∈ ℝ^(r×k) maps input from dimension k to the low-rank space of dimension r. Initialized randomly (Kaiming uniform or Gaussian). Trainable.

IN: Input vector or batch of vectors with dimension k (the input dimension of the adapted weight matrix).
OUT: Projected representation in the low-rank space of dimension r.
Matrix B (up-projection): Projects from the low-rank space back to the output dimension; zero initialization ensures a stable training start.

Matrix B ∈ ℝ^(d×r) maps from the low-rank space of dimension r back to the original output dimension d. Initialized to zeros, ensuring identical model output at the start of training. Trainable.

IN: Low-rank representation from matrix A, of dimension r.
OUT: Reconstructed weight update in the original output dimension d.
Frozen base weights (W₀): Stores the pretrained model's knowledge; serves as the starting point for adaptation.

Original pretrained weight matrix of dimensions d×k. During LoRA training, W₀ parameters are frozen (requires_grad=False) and not updated by the optimizer. Post-training: W = W₀ + (α/r)·BA.

α/r scaling factor: Controls adaptation strength and stabilizes training dynamics.

Scaling hyperparameter controlling the contribution of the weight update ΔW = (α/r)·BA to the layer output. α (lora_alpha) is a tunable scaling coefficient; r (rank) determines the adaptation space dimension. The α/r ratio makes optimal learning rate approximately rank-independent.

rsLoRA (rank-stabilized scaling): Uses α/√r instead of α/r, improving training stability at higher ranks.
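After training, the update can be folded into the base weights (W = W₀ + (α/r)·BA), so inference uses a single dense matrix with zero extra cost. A minimal NumPy check of that identity (dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 16, 8, 16

W0 = rng.normal(size=(d, k))               # frozen base weights
A = rng.normal(size=(r, k))                # trained down-projection
B = rng.normal(size=(d, r))                # trained up-projection (non-zero after training)

W_merged = W0 + (alpha / r) * (B @ A)      # fold the adapter into the base matrix

x = rng.normal(size=(4, k))
two_path = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T
# Merged single-matrix inference matches the two-path training-time computation
assert np.allclose(x @ W_merged.T, two_path)

# rsLoRA variant: replace (alpha / r) with alpha / np.sqrt(r) in both places
```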


Implementation

Implementation pitfalls
Rank too low — insufficient adapter expressive capacity (High)

A rank that is too low (r=1–4) for a complex task prevents the adapter from learning the required transformations. The model learns only style and format, not content. This manifests as high training loss and poor task performance.

Fix: Start with r=8–16 and increase gradually. Monitor validation loss. For complex multi-domain tasks, consider r=64 or AdaLoRA.
Non-zero initialization of matrix B at the start (High)

If matrix B is not initialized to zeros, the model at training start is not identical to the base model, disrupting stability and hindering convergence. Per the paper, B must be zero-initialized.

Fix: Use a standard implementation (HF PEFT or microsoft/LoRA) that guarantees zero initialization of B. For manual implementations: B = torch.zeros(d, r).
Incorrect weight merging before inference (Medium)

Omitting the W = W₀ + (α/r)·BA merge before deployment results in two forward passes instead of one, increasing latency. Manual merging without the (α/r) scaling factor produces incorrect results.

Fix: Use `merge_and_unload()` from PEFT or apply the update manually: `model.weight.data += lora_B @ lora_A * (alpha / r)`. Verify results on a test set after merging.
Using the same learning rate as in full fine-tuning (Medium)

LoRA adapters require a higher learning rate than full fine-tuning (typically 10–100× higher) because far fewer parameters are trained. Too low a learning rate leads to slow convergence or no adaptation at all.

Fix: Use a learning rate in the range of 1e-4 to 3e-4 (vs. a typical 1e-5 to 5e-5 for full fine-tuning). Apply a cosine LR schedule with warmup.
Omitting the α/r scaling factor in the implementation (High)

If the (α/r) factor is not correctly applied in the forward pass (h = W₀x + (α/r)·BAx), adaptation outputs are unscaled, leading to unpredictable model behavior, especially when changing r without adjusting α.

Fix: Use established implementations. When implementing manually, verify: output = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T.

Evolution

Original paper · ICLR 2022 · Edward J. Hu et al.
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
2021
LoRA preprint published on arXiv (June 2021)
Inflection point

Hu et al. published arXiv:2106.09685. Method freezes base weights and adds low-rank matrix pairs A and B to selected Transformer layers. Reduced trainable parameters by 10,000× with performance matching full fine-tuning of GPT-3 175B.

2022
Acceptance at ICLR 2022 and microsoft/LoRA implementation
Inflection point

Paper accepted at ICLR 2022. Microsoft released the official reference implementation. The Stable Diffusion community adopted LoRA for fine-tuning diffusion models.

2023
Hugging Face PEFT (February 2023) and QLoRA (May 2023)
Inflection point

Hugging Face integrated LoRA into the PEFT library, making it accessible to millions of developers. Dettmers et al. published QLoRA enabling fine-tuning of 65B models on a single 48 GB GPU via 4-bit quantization combined with LoRA adapters.

2023
AdaLoRA (ICLR 2023) – adaptive rank allocation

Zhang et al. published AdaLoRA, dynamically allocating rank across layers based on singular value importance scores.

2024
DoRA (ICML 2024) and rsLoRA – further LoRA variants

Liu et al. published DoRA decomposing weights into directional and magnitude components, applying LoRA only to direction. Kalajdzievski proposed rsLoRA with α/√r scaling for more stable training at higher ranks.

Technical details

Hyperparameters (configurable axes)

Adapter rank (Critical)

Dimension of the low-rank adaptation space. Directly controls the number of trainable parameters (r·(d+k) per layer) and the expressiveness of the adapter.

1–4: Very low capacity; useful for adjusting output style and format.
8–16: General purpose; recommended as a starting point by the paper's authors (r=8 as baseline).
32–64: Complex tasks requiring larger behavioral changes in the model.
128–256: Rarely necessary; risk of overfitting on small datasets.
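To make the r·(d+k) count concrete, a back-of-the-envelope comparison for a single square projection matrix of a 4096-dimensional model (hypothetical sizes, typical for a 7B-class model):

```python
d = k = 4096                     # dimensions of one attention projection matrix
full = d * k                     # full fine-tuning: every weight is trainable
for r in (4, 8, 64):
    lora = r * (d + k)           # LoRA: only A (r x k) and B (d x r) are trained
    print(f"r={r:3d}: {lora:,} trainable vs {full:,} full ({full / lora:.0f}x fewer)")
    # e.g. r=  8: 65,536 trainable vs 16,777,216 full (256x fewer)
```

The reduction applies per adapted layer; the total saving across a full model is what drives the orders-of-magnitude figures reported in the paper.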
Scaling coefficient α (High)

Scaling coefficient in the formula (α/r)·BA. Controls the effective strength of adaptation. Common heuristic: alpha = r or alpha = 2r.

8: When r=8, the α/r ratio equals 1.
16: When r=8, the α/r ratio equals 2 (adaptation gain).
32: A popular default value in many implementations.
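These heuristics can be checked directly; with α tied to r (e.g. α = 2r), the effective LoRA scale stays constant as rank changes, while the rsLoRA variant scales with √r instead (illustrative values):

```python
import math

def scaling(alpha, r, rs=False):
    # Standard LoRA uses alpha/r; rsLoRA (Kalajdzievski) uses alpha/sqrt(r)
    return alpha / math.sqrt(r) if rs else alpha / r

assert scaling(8, 8) == 1.0                        # alpha = r  -> ratio 1
assert scaling(16, 8) == 2.0                       # alpha = 2r -> ratio 2
# With the alpha = 2r heuristic the LoRA scale is rank-independent:
assert scaling(16, 8) == scaling(64, 32) == 2.0
# rsLoRA keeps a larger effective scale at higher ranks:
assert scaling(16, 8, rs=True) == 16 / math.sqrt(8)
```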
Target modules / layers (High)

Which weight matrices of the model are adapted by LoRA. In the original paper: Wq and Wv in self-attention. In practice: often all linear layers for maximum performance.

q_proj, v_proj: Minimal configuration consistent with Hu et al. 2021.
q_proj, k_proj, v_proj, o_proj: The entire self-attention block.
all-linear: All linear layers; used in QLoRA for highest quality.
Adapter dropout (Low)

Dropout rate applied to LoRA adapter outputs during training as regularization.

0: No dropout; the default setting in many implementations.
0.05: Light regularization.
0.1: Stronger regularization when the overfitting risk is high.
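Pulled together, these axes map onto a Hugging Face PEFT configuration roughly as follows (a hedged sketch: the field names follow the PEFT `LoraConfig` API, while the model name and chosen values are purely illustrative):

```python
# Requires: pip install peft transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling coefficient (alpha/r = 2)
    target_modules=["q_proj", "v_proj"],   # minimal Hu et al. configuration
    lora_dropout=0.05,                     # light regularization
    bias="none",
    task_type="CAUSAL_LM",
)

# Illustrative model id; any causal LM supported by transformers works
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports trainable vs total parameter counts
```

After training, `model.save_pretrained(...)` stores only the lightweight adapter, which can later be loaded on top of the shared frozen base model.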

Computational complexity

Computational characteristics
Significantly fewer trainable parameters than in full fine-tuning
Lower GPU memory usage and reduced storage costs
Higher training throughput in many scenarios
Low memory overhead of adapters relative to the base model
Adapter training runs on consumer-grade or low-cost GPU hardware

Space complexity: O(r·(d + k)). PEFT / LoRA reduces the computational cost of fine-tuning by training only a small fraction of the parameters. According to the original paper, LoRA can reduce the number of trained parameters by orders of magnitude and lower memory requirements relative to full fine-tuning while retaining competitive quality.

Benchmark notes

PEFT / LoRA is not a benchmark but a family of training techniques. Evaluation typically involves comparison with full fine-tuning across final model quality, number of trainable parameters, memory consumption, training speed, and ease of adapter deployment.

Execution paradigm

Primary mode
dense

LoRA is dense: both the original path W₀x and the adaptation path (α/r)·BAx are computed for every input. There is no conditional activation or sparsity. At inference time (after weight merging), the overhead is zero.

Activation pattern
all_paths_active

Parallelism

Parallelism level
fully_parallel

The BA computations are fully parallelizable within and across layers, analogously to standard matrix multiplication in the Transformer. There are no sequential dependencies between LoRA adapters in different layers.

Scope
training · inference · across_layers · across_devices

Hardware requirements

Primary

LoRA operates on matrix multiplication (BA) and addition to linear layer outputs — operations fully accelerated by GPU Tensor Cores (NVIDIA A100, H100, RTX 3090+). Reducing trainable parameters yields approximately 3× lower GPU memory consumption for optimizer states.

Good fit

TPU v4/v5 efficiently handle the GEMM operations used in LoRA. The Flax/JAX library enables LoRA implementation on TPU. Used by Google when fine-tuning Gemini and PaLM models with PEFT.