PEFT / LoRA
How it works
In the PEFT approach, the base model remains largely frozen, and training covers only a small set of additional parameters, typically in the form of adapters. With LoRA, the update to a large weight matrix is approximated by the product of two smaller low-rank matrices, which are added to selected layers — most commonly attention or linear projection layers. This reduces the number of trainable parameters dramatically. After training, the adapter can be stored separately from the base model and loaded only when needed, simplifying deployment, sharing, and experimentation with multiple fine-tuned variants.
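A minimal sketch of this mechanism in PyTorch (illustrative only; the `LoRALinear` wrapper is a hypothetical name, not the reference implementation):

```python
import math

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))    # r x k, random init
        self.B = nn.Parameter(torch.zeros(d, r))    # d x r, zeros: no change at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```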
Problem solved
PEFT / LoRA addresses the high cost of full fine-tuning of large models. Traditional full-parameter fine-tuning requires substantial GPU memory, training time, and storage for multiple model variants. Deploying separate, fully fine-tuned copies of a large model for different tasks is also operationally expensive. PEFT techniques reduce these costs by training only a small number of additional parameters that can be easily saved, swapped, and reused.
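A sketch of this save-and-swap workflow using the Hugging Face PEFT library (the model name and adapter directory are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder name
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)

# ... train ...
model.save_pretrained("adapter-task-a")   # stores only the small adapter weights

# Later: reuse the same frozen base model with any saved adapter.
base = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base, "adapter-task-a")
```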
Key mechanisms
Components
Matrix A ∈ ℝ^(r×k) maps input from dimension k to the low-rank space of dimension r. Initialized randomly (Kaiming uniform or Gaussian). Trainable.
Matrix B ∈ ℝ^(d×r) maps from the low-rank space of dimension r back to the original output dimension d. Initialized to zeros, ensuring identical model output at the start of training. Trainable.
Matrix W₀ ∈ ℝ^(d×k): the original pretrained weight matrix. During LoRA training, W₀ is frozen (requires_grad=False) and not updated by the optimizer. Post-training merge: W = W₀ + (α/r)·BA.
Scaling factor α/r: controls the contribution of the weight update ΔW = (α/r)·BA to the layer output. α (lora_alpha) is a tunable scaling coefficient; r (rank) determines the adaptation-space dimension. The α/r ratio makes the optimal learning rate approximately rank-independent. These components are exercised in the check after this list.
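A quick check of the initialization contract, reusing the hypothetical `LoRALinear` sketched under "How it works": because B starts at zero, the adapted layer initially reproduces the base layer exactly.

```python
import torch
import torch.nn as nn

base = nn.Linear(64, 64)
lora = LoRALinear(base, r=8, alpha=16)   # the module sketched above

x = torch.randn(2, 64)
# B = 0 makes BA = 0, so at initialization the adapted layer matches the base.
assert torch.allclose(lora(x), base(x))
```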
Common pitfalls
Too low a rank (r = 1–4) for a complex task prevents the adapter from learning the required transformations: the model picks up only style and format, not content. Manifests as high training loss and poor task performance.
If matrix B is not initialized to zeros, the model at training start is not identical to the base model, disrupting stability and hindering convergence. Per the paper, B must be zero-initialized.
Omitting the W = W₀ + (α/r)·BA merge before deployment leaves two computation paths per adapted layer instead of one, increasing latency. Merging manually without the (α/r) scaling factor produces incorrect weights (see the merge sketch after this list).
LoRA adapters require a higher learning rate than full fine-tuning (typically 10–100× higher), since far fewer parameters are trained. Too low an LR leads to slow convergence or no adaptation.
If the (α/r) factor is not correctly applied in the forward pass (h = W₀x + (α/r)·BAx), adaptation outputs are unscaled, leading to unpredictable model behavior, especially when changing r without adjusting α.
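A single-layer merge sketch with illustrative tensors (in the PEFT library, `merge_and_unload()` performs this step), showing that merging with the (α/r) factor reproduces the two-path forward exactly:

```python
import torch

d, k, r, alpha = 512, 512, 8, 16
W0 = torch.randn(d, k)         # frozen pretrained weight
A = torch.randn(r, k) * 0.01   # trained low-rank factors
B = torch.randn(d, r) * 0.01

# Correct merge: the (alpha/r) scaling must be included.
W = W0 + (alpha / r) * (B @ A)

x = torch.randn(k)
h_two_paths = W0 @ x + (alpha / r) * (B @ (A @ x))  # unmerged: two matmul paths
h_merged = W @ x                                    # merged: a single matmul
assert torch.allclose(h_two_paths, h_merged, atol=1e-3)
```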
Evolution
2021: Hu et al. published LoRA (arXiv:2106.09685). The method freezes base weights and adds low-rank matrix pairs A and B to selected Transformer layers, reducing trainable parameters by up to 10,000× while matching the performance of full fine-tuning on GPT-3 175B.
2022: The paper was accepted at ICLR 2022. Microsoft released the official reference implementation, and the Stable Diffusion community adopted LoRA for fine-tuning diffusion models.
2023: Hugging Face integrated LoRA into the PEFT library, making it accessible to millions of developers. Dettmers et al. published QLoRA, enabling fine-tuning of 65B-parameter models on a single 48 GB GPU by combining 4-bit quantization with LoRA adapters.
2023: Zhang et al. published AdaLoRA, which dynamically allocates rank across layers based on singular-value importance scores.
2023–2024: Kalajdzievski proposed rsLoRA, replacing α/r with α/√r scaling for more stable training at higher ranks. Liu et al. published DoRA, which decomposes weights into magnitude and directional components and applies LoRA only to the direction.
Technical details
Hyperparameters (configurable axes)
Rank r: dimension of the low-rank adaptation space. Directly controls the number of trainable parameters (r·(d + k) per layer) and the expressiveness of the adapter.
Alpha (lora_alpha): scaling coefficient in the formula (α/r)·BA. Controls the effective strength of the adaptation. Common heuristic: α = r or α = 2r.
Target modules: which weight matrices of the model are adapted by LoRA. In the original paper: Wq and Wv in self-attention; in practice, often all linear layers for maximum performance.
LoRA dropout: dropout rate applied to the LoRA adapter path during training as regularization. All four axes appear in the configuration sketch after this list.
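A configuration touching all four axes, sketched with the PEFT `LoraConfig` API (the target module names assume a Llama-style base model):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                                  # rank of the adaptation space
    lora_alpha=32,                         # effective update is (alpha/r) * BA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # matrices to adapt
    lora_dropout=0.05,                     # regularization on the adapter path
)
```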
Computational complexity
Space complexity: O(r·(d + k)) additional parameters per adapted layer. PEFT / LoRA lowers the computational cost of fine-tuning by training only a small fraction of the parameters. According to the original paper, LoRA can cut the number of trainable parameters by orders of magnitude and reduce memory requirements relative to full fine-tuning while retaining competitive quality.
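A worked example of the savings, assuming an illustrative d = k = 4096 projection at r = 8:

```python
d = k = 4096   # hidden size typical of a 7B-class model (assumption)
r = 8
full = d * k              # 16,777,216 trainable parameters per matrix
lora = r * (d + k)        # 65,536 trainable parameters per matrix
print(f"{lora / full:.2%}")   # ~0.39% of the full matrix's parameters
```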
Evaluation
PEFT / LoRA is not a benchmark but a family of training techniques. Evaluation typically compares against full fine-tuning on final model quality, number of trainable parameters, memory consumption, training speed, and ease of adapter deployment.
Execution paradigm
LoRA is dense: both the original path W₀x and the adaptation path (α/r)·BAx are computed for every input. There is no conditional activation or sparsity. At inference time (after weight merging), the overhead is zero.
Parallelism
The BA computations are fully parallelizable within and across layers, analogously to standard matrix multiplication in the Transformer. There are no sequential dependencies between LoRA adapters in different layers.
Hardware requirements
LoRA relies on matrix multiplication (BA) and addition to linear-layer outputs, operations fully accelerated by GPU Tensor Cores (NVIDIA A100, H100, RTX 3090+). Because optimizer states are kept only for the small adapters, training GPU memory drops roughly 3× relative to full fine-tuning.
TPU v4/v5 handle the GEMM operations used in LoRA efficiently, and Flax (on JAX) supports LoRA implementations on TPU. Google has used PEFT when fine-tuning its PaLM and Gemini models.