
PEFT / LoRA

Freezing the original model weights and adding a pair of small low-rank matrices (A and B) as a trainable weight update, reducing the number of trainable parameters by orders of magnitude without adding inference-time computational overhead once the adapter is merged back into the base weights.

Use cases

  • Fine-tuning large language models for domain-specific tasks
  • Instruction fine-tuning of open-weight models
  • Fine-tuning multimodal and vision-language models
  • Efficient fine-tuning of diffusion and generative models
  • Building multiple lightweight variants of a single base model
  • Research experiments on limited hardware
  • Deploying adapters for different clients, languages, or specializations

In the PEFT approach, the base model remains largely frozen, and training covers only a small set of additional parameters, typically in the form of adapters. With LoRA, the update to a large weight matrix is approximated by the product of two smaller low-rank matrices, which are added to selected layers — most commonly attention or linear projection layers. This reduces the number of trainable parameters dramatically. After training, the adapter can be stored separately from the base model and loaded only when needed, simplifying deployment, sharing, and experimentation with multiple fine-tuned variants.
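A minimal PyTorch sketch of this mechanism may help make it concrete (illustrative only: the wrapper name `LoRALinear` and the defaults are ours, not the reference implementation):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze W0 (and bias): no gradients, no optimizer states
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))   # down-projection, A ∈ R^(r×k)
        self.B = nn.Parameter(torch.zeros(d, r))   # up-projection, zero init => ΔW = 0 at the start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r                     # the α/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0·x + (α/r)·B·A·x ; shapes: [batch, k] -> [batch, d]
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```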

PEFT / LoRA addresses the high cost of full fine-tuning of large models. Traditional full-parameter fine-tuning requires substantial GPU memory, training time, and storage for multiple model variants. Deploying separate, fully fine-tuned copies of a large model for different tasks is also operationally expensive. PEFT techniques reduce these costs by training only a small number of additional parameters that can be easily saved, swapped, and reused.

Mechanisms

  • Freezing the majority of base model parameters
  • Training a small number of additional parameters
  • Using adapters instead of full fine-tuning
  • LoRA: decomposing weight updates into two low-rank matrices
  • Significant reduction in GPU memory usage during training
  • Separate saving and loading of lightweight adapters
  • Easy sharing of multiple fine-tuned variants of a single base model
01

Matrix A (down-projection)

Projects the input into a low-rank space; the first of the two active decomposition matrices.

Matrix A ∈ ℝ^(r×k) maps input from dimension k to the low-rank space of dimension r. Initialized randomly (Kaiming uniform or Gaussian). Trainable.

i/o
in
[batch, k]: input vector or batch of vectors of dimension k (the input dimension of the adapted weight matrix).
out
[batch, r]: projected representation in the low-rank space of dimension r.
02

Matrix B (up-projection)

Up-projection from the low-rank space back to the output dimension; zero initialization ensures a stable training start.

Matrix B ∈ ℝ^(d×r) maps from the low-rank space of dimension r back to the original output dimension d. Initialized to zeros, ensuring identical model output at the start of training. Trainable.

i/o
in
[batch, r]: low-rank representation of dimension r produced by matrix A.
out
[batch, d]: reconstructed weight update in the original output dimension d.
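
A quick sanity check of the zero initialization, reusing the shapes defined above (tensor names are illustrative):

```python
import torch

batch, k, r, d = 4, 1024, 8, 1024
x = torch.randn(batch, k)
A = torch.randn(r, k)          # down-projection (random init)
B = torch.zeros(d, r)          # up-projection (zero init)

delta = (x @ A.T) @ B.T        # [batch, d] adapter output
assert torch.all(delta == 0)   # ΔW·x = 0, so training starts exactly from the base model
```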
03

Frozen pretrained weights W₀

Stores pretrained model knowledge; serves as the starting point for adaptation.

Original pretrained weight matrix of dimensions d×k. During LoRA training, W₀ parameters are frozen (requires_grad=False) and not updated by the optimizer. Post-training: W = W₀ + (α/r)·BA.

04

Scaling Factor α/r

Controls adaptation strength and stabilizes training dynamics.

Modular

Scaling hyperparameter controlling the contribution of the weight update ΔW = (α/r)·BA to the layer output. α (lora_alpha) is a tunable scaling coefficient; r (rank) determines the adaptation space dimension. The α/r ratio makes optimal learning rate approximately rank-independent.

Variant: rsLoRA (rank-stabilized scaling, replacing α/r with α/√r)
Memory complexity

r — LoRA adapter rank; d — layer output dimension; k — layer input dimension. A full weight update would require O(d·k) parameters. For typical Transformer layers (d = k = 1024, r = 8): 16,384 LoRA parameters instead of 1,048,576 (a ~64× reduction).

Base weights W₀ are not counted as trainable parameters: they occupy memory (and can be kept in bfloat16/fp16), but they need no gradients or optimizer states. Total GPU memory is reduced chiefly by the absence of optimizer states (Adam moments) for the frozen weights.
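
The parameter arithmetic from the example above, worked through:

```python
d, k, r = 1024, 1024, 8
full_update = d * k               # 1,048,576 parameters for a dense ΔW
lora_params = r * (d + k)         # 16,384 parameters for A and B combined
print(full_update // lora_params) # 64, i.e. the ~64× reduction
```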

Parallelism

Fully parallel

The BA computations are fully parallelizable within and across layers, analogously to standard matrix multiplication in the Transformer. There are no sequential dependencies between LoRA adapters in different layers.

Paradigm

Dense

All paths active

LoRA is dense: both the original path W₀x and the adaptation path (α/r)·BAx are computed for every input. There is no conditional activation or sparsity. At inference time (after weight merging), the overhead is zero.
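
That the overhead vanishes after merging can be verified numerically; a small illustrative check (dimensions arbitrary):

```python
import torch

d, k, r, alpha = 64, 64, 8, 16
W0 = torch.randn(d, k)                               # frozen base weights
A, B = torch.randn(r, k), torch.randn(d, r)
x = torch.randn(5, k)

unmerged = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T  # two dense paths at train time
merged = x @ (W0 + (alpha / r) * B @ A).T            # a single path after merging
assert torch.allclose(unmerged, merged, atol=1e-5)
```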

Adapter Rank

Critical
  • 1–4: very low capacity; useful for adjusting output style and format.
  • 8–16: general purpose; recommended as a starting point by the paper's authors (r=8 as baseline).
  • 32–64: complex tasks requiring larger behavioral changes in the model.
  • 128–256: rarely necessary; risk of overfitting on small datasets.

Dimension of the low-rank adaptation space. Directly controls the number of trainable parameters (r·(d+k) per layer) and the expressiveness of the adapter.

Scaling coefficient α

Standard
  • 8: with r=8, the α/r ratio equals 1.
  • 16: with r=8, the α/r ratio equals 2 (adaptation gain).
  • 32: popular default value in many implementations.

Scaling coefficient in the formula (α/r)·BA. Controls the effective strength of adaptation. Common heuristic: alpha = r or alpha = 2r.

Target modules / layers

Standard
  • q_proj, v_proj: minimal configuration consistent with Hu et al. 2021.
  • q_proj, k_proj, v_proj, o_proj: the entire self-attention block.
  • all-linear: all linear layers; used in QLoRA for highest quality.

Which weight matrices of the model are adapted by LoRA. In the original paper: Wq and Wv in self-attention. In practice: often all linear layers for maximum performance.

Adapter dropout

Standard
  • 0: no dropout; the default setting in many implementations.
  • 0.05: light regularization.
  • 0.1: stronger regularization for high overfitting-risk scenarios.

Dropout rate applied to LoRA adapter outputs during training as regularization.
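
All four hyperparameters map directly onto Hugging Face PEFT's `LoraConfig`; a typical setup might look like the following sketch (the model id and exact values are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder id
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling coefficient α (α/r = 2)
    target_modules=["q_proj", "v_proj"],  # minimal configuration per Hu et al. 2021
    lora_dropout=0.05,                    # light regularization
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # reports trainable vs. total parameters
```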

Strengths

  • Significantly reduces memory and training cost compared to full fine-tuning
  • Enables training large models on cheaper or lower-end hardware
  • Simplifies storage and distribution of lightweight adapters
  • Can achieve quality comparable to full fine-tuning
  • Accelerates experimentation and iteration on model adaptation
  • Integrates well with the Hugging Face ecosystem
  • Allows a single base model to be shared across multiple adapters

Limitations

  • Does not always match full fine-tuning across every task and architecture
  • Effectiveness depends on the choice of layers, adapter rank, and hyperparameters
  • May be less flexible for tasks requiring deep changes to the model's representations
  • Complex adapter stacks can complicate variant management
  • Performance and quality depend on the specific implementation and framework integration
  • PEFT is a broad family of methods, so the term is often used imprecisely

Computational characteristics

  • Significantly fewer trainable parameters than in full fine-tuning
  • Lower GPU memory usage and reduced storage costs
  • Higher training throughput in many scenarios
  • Low memory overhead of adapters relative to the base model
  • Adapter training runs on consumer-grade or low-cost GPU hardware

PEFT / LoRA is not a benchmark but a family of training techniques. Evaluation typically involves comparison with full fine-tuning across final model quality, number of trainable parameters, memory consumption, training speed, and ease of adapter deployment.

Common pitfalls

Rank too low — insufficient adapter expressive capacity
HIGH

A rank that is too low (r=1–4) for a complex task prevents the adapter from learning the required transformations: the model picks up style and format, but not content. This manifests as high training loss and poor task performance.

Start with r=8–16 and increase gradually. Monitor validation loss. For complex multi-domain tasks, consider r=64 or AdaLoRA.

Non-zero initialization of matrix B at the start
HIGH

If matrix B is not initialized to zeros, the model at training start is not identical to the base model, disrupting stability and hindering convergence. Per the paper, B must be zero-initialized.

Use a standard implementation (HF PEFT or microsoft/LoRA) that guarantees zero initialization of B. For manual implementations: B = torch.zeros(d, r).

Incorrect weight merging before inference
MEDIUM

Omitting the W = W₀ + (α/r)·BA merge before deployment leaves the adapter path active at inference, adding an extra matrix multiplication per adapted layer and increasing latency. Merging manually without the (α/r) scaling factor produces incorrect results.

Use `merge_and_unload()` from PEFT or apply the update manually: `model.weight.data += lora_B @ lora_A * (alpha / r)`. Verify results on a test set after merging.
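
A sketch of that flow with PEFT (model id and adapter path are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")     # placeholder id
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = model.merge_and_unload()  # folds (α/r)·BA into W0: one forward path at inference
merged.save_pretrained("merged-model")
```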

Using the same learning rate as in full fine-tuning
MEDIUM

LoRA adapters typically need a higher learning rate than full fine-tuning (often 10–100× higher), since far fewer parameters are trained. Too low an LR leads to slow convergence or no visible adaptation.

Use a LR in the range of 1e-4 to 3e-4 (vs. typical 1e-5 to 5e-5 for full fine-tuning). Apply a cosine LR schedule with warmup.
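
With Hugging Face `transformers`, this could translate into training arguments along these lines (values are illustrative, not prescriptive):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="lora-run",          # placeholder
    learning_rate=2e-4,             # roughly 10-100× higher than typical full fine-tuning LRs
    lr_scheduler_type="cosine",     # cosine decay ...
    warmup_ratio=0.03,              # ... with a short warmup
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
```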

Omitting the α/r scaling factor in the implementation
HIGH

If the (α/r) factor is not correctly applied in the forward pass (h = W₀x + (α/r)·BAx), adaptation outputs are unscaled, leading to unpredictable model behavior, especially when changing r without adjusting α.

Use established implementations. When implementing manually, verify: `output = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T`.

GENESIS · Source paper

LoRA: Low-Rank Adaptation of Large Language Models
ICLR 2022 · Edward J. Hu, Yelong Shen, Phillip Wallis et al.
2021

LoRA preprint published on arXiv (June 2021)

breakthrough

Hu et al. published arXiv:2106.09685. The method freezes base weights and adds low-rank matrix pairs A and B to selected Transformer layers, reducing trainable parameters by 10,000× while matching the performance of full fine-tuning of GPT-3 175B.

2022

Acceptance at ICLR 2022 and microsoft/LoRA implementation

breakthrough

Paper accepted at ICLR 2022. Microsoft released the official reference implementation. The Stable Diffusion community adopted LoRA for fine-tuning diffusion models.

2023

Hugging Face PEFT (February 2023) and QLoRA (May 2023)

breakthrough

Hugging Face integrated LoRA into the PEFT library, making it accessible to millions of developers. Dettmers et al. published QLoRA enabling fine-tuning of 65B models on a single 48 GB GPU via 4-bit quantization combined with LoRA adapters.

2023

AdaLoRA (ICLR 2023) – adaptive rank allocation

Zhang et al. published AdaLoRA, dynamically allocating rank across layers based on singular value importance scores.

2024

DoRA (ICML 2024) and rsLoRA – further LoRA variants

Liu et al. published DoRA decomposing weights into directional and magnitude components, applying LoRA only to direction. Kalajdzievski proposed rsLoRA with α/√r scaling for more stable training at higher ranks.

GPU Tensor Cores · PRIMARY

LoRA operates on matrix multiplication (BA) and addition to linear layer outputs — operations fully accelerated by GPU Tensor Cores (NVIDIA A100, H100, RTX 3090+). Reducing trainable parameters yields approximately 3× lower GPU memory consumption for optimizer states.

QLoRA (4-bit base model + LoRA adapters) enables fine-tuning of 65B models on a 48 GB GPU or 7B models on a 16 GB GPU. LoRA adapters can be trained in fp16/bf16 while base weights remain in INT8/INT4 (bitsandbytes).
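
A sketch of this QLoRA-style setup with `transformers`, `bitsandbytes`, and PEFT (the model id and hyperparameters are placeholders; the "all-linear" shortcut needs a recent PEFT version):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and activations in bf16
)
base = AutoModelForCausalLM.from_pretrained("base-model-id", quantization_config=bnb)  # placeholder id
base = prepare_model_for_kbit_training(base)
config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(base, config)
```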

TPU · GOOD

TPU v4/v5 handle the GEMM operations used by LoRA efficiently. Flax/JAX support LoRA implementations on TPU. Used by Google when fine-tuning Gemini and PaLM models with PEFT.

LoRA memory savings are less critical on TPUs (large HBM capacity), but reducing trainable parameters still accelerates training.

Related AI models

Gemma

LoRA: Low-Rank Adaptation of Large Language Models

The paper that originally introduced LoRA, first published in 2021.

scientific article · arXiv
PEFT

Official documentation of the PEFT library describing parameter-efficient fine-tuning.

documentation · Hugging Face
PEFT: State-of-the-art Parameter-Efficient Fine-Tuning

Official repository of the Hugging Face PEFT library.

repository · GitHub
PEFT in Transformers

Documentation for integrating PEFT adapters with the Transformers library.

documentation · Hugging Face
Parameter-Efficient Fine-Tuning using PEFT

Blog post introducing the PEFT library in the Hugging Face ecosystem in 2023.

blog · Hugging Face
LoRA

Documentation of LoRA as one of the methods supported by PEFT.

documentation · Hugging Face