
PEFT / LoRA

Freezing the original model weights and adding a pair of small low-rank matrices (A and B) as a trainable weight update, reducing the number of trainable parameters by orders of magnitude without adding inference-time computational overhead once the adapter is merged back into the base weights.

Use cases

  • Fine-tuning large language models for domain-specific tasks
  • Instruction fine-tuning of open-weight models
  • Fine-tuning multimodal and vision-language models
  • Efficient fine-tuning of diffusion and generative models
  • Building multiple lightweight variants of a single base model
  • Research experiments on limited hardware
  • Deploying adapters for different clients, languages, or specializations

In the PEFT approach, the base model remains largely frozen, and training covers only a small set of additional parameters, typically in the form of adapters. With LoRA, the update to a large weight matrix is approximated by the product of two smaller low-rank matrices, which are added to selected layers — most commonly attention or linear projection layers. This reduces the number of trainable parameters dramatically. After training, the adapter can be stored separately from the base model and loaded only when needed, simplifying deployment, sharing, and experimentation with multiple fine-tuned variants.
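A minimal PyTorch sketch of this mechanism may help make it concrete (illustrative only: the wrapper name `LoRALinear` and the defaults are ours, not the reference implementation):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze W0 (and bias): no gradients, no optimizer states
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))   # down-projection, A ∈ R^(r×k)
        self.B = nn.Parameter(torch.zeros(d, r))   # up-projection, zero init => ΔW = 0 at the start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r                     # the α/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0·x + (α/r)·B·A·x ; shapes: [batch, k] -> [batch, d]
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```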

PEFT / LoRA addresses the high cost of full fine-tuning of large models. Traditional full-parameter fine-tuning requires substantial GPU memory, training time, and storage for multiple model variants. Deploying separate, fully fine-tuned copies of a large model for different tasks is also operationally expensive. PEFT techniques reduce these costs by training only a small number of additional parameters that can be easily saved, swapped, and reused.

Mechanisms

  • Freezing the majority of base model parameters
  • Training a small number of additional parameters
  • Using adapters instead of full fine-tuning
  • LoRA: decomposing weight updates into two low-rank matrices
  • Significant reduction in GPU memory usage during training
  • Separate saving and loading of lightweight adapters
  • Easy sharing of multiple fine-tuned variants of a single base model
01

Matrix A (down-projection)

Projects the input into a low-rank space; the first of the two active decomposition matrices.

Matrix A ∈ ℝ^(r×k) maps input from dimension k to the low-rank space of dimension r. Initialized randomly (Kaiming uniform or Gaussian). Trainable.

i/o
in
[batch, k]: input vector or batch of vectors of dimension k (the input dimension of the adapted weight matrix).
out
[batch, r]: projected representation in the low-rank space of dimension r.
02

Matrix B (up-projection)

Up-projection from the low-rank space back to the output dimension; zero initialization ensures a stable training start.

Matrix B ∈ ℝ^(d×r) maps from the low-rank space of dimension r back to the original output dimension d. Initialized to zeros, ensuring identical model output at the start of training. Trainable.

i/o
in
[batch, r]: low-rank representation of dimension r produced by matrix A.
out
[batch, d]: reconstructed weight update in the original output dimension d.
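
A quick sanity check of the zero initialization, reusing the shapes defined above (tensor names are illustrative):

```python
import torch

batch, k, r, d = 4, 1024, 8, 1024
x = torch.randn(batch, k)
A = torch.randn(r, k)          # down-projection (random init)
B = torch.zeros(d, r)          # up-projection (zero init)

delta = (x @ A.T) @ B.T        # [batch, d] adapter output
assert torch.all(delta == 0)   # ΔW·x = 0, so training starts exactly from the base model
```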
03

Frozen pretrained weights W₀

Stores pretrained model knowledge; serves as the starting point for adaptation.

Original pretrained weight matrix of dimensions d×k. During LoRA training, W₀ parameters are frozen (requires_grad=False) and not updated by the optimizer. Post-training: W = W₀ + (α/r)·BA.

04

Scaling Factor α/r

Controls adaptation strength and stabilizes training dynamics.

Modular

Scaling hyperparameter controlling the contribution of the weight update ΔW = (α/r)·BA to the layer output. α (lora_alpha) is a tunable scaling coefficient; r (rank) determines the adaptation space dimension. The α/r ratio makes optimal learning rate approximately rank-independent.

Variant: rsLoRA (rank-stabilized scaling, replacing α/r with α/√r)
Memory complexity

r — LoRA adapter rank; d — layer output dimension; k — layer input dimension. A full weight update would require O(d·k) parameters. For typical Transformer layers (d = k = 1024, r = 8): 16,384 LoRA parameters instead of 1,048,576 (a ~64× reduction).

Base weights W₀ are not counted as trainable parameters: they occupy memory (and can be kept in bfloat16/fp16), but they need no gradients or optimizer states. Total GPU memory is reduced chiefly by the absence of optimizer states (Adam moments) for the frozen weights.
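
The parameter arithmetic from the example above, worked through:

```python
d, k, r = 1024, 1024, 8
full_update = d * k               # 1,048,576 parameters for a dense ΔW
lora_params = r * (d + k)         # 16,384 parameters for A and B combined
print(full_update // lora_params) # 64, i.e. the ~64× reduction
```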

Parallelism

Fully parallel

The BA computations are fully parallelizable within and across layers, analogously to standard matrix multiplication in the Transformer. There are no sequential dependencies between LoRA adapters in different layers.

Paradigm

Dense

All paths active

LoRA is dense: both the original path W₀x and the adaptation path (α/r)·BAx are computed for every input. There is no conditional activation or sparsity. At inference time (after weight merging), the overhead is zero.
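
That the overhead vanishes after merging can be verified numerically; a small illustrative check (dimensions arbitrary):

```python
import torch

d, k, r, alpha = 64, 64, 8, 16
W0 = torch.randn(d, k)                               # frozen base weights
A, B = torch.randn(r, k), torch.randn(d, r)
x = torch.randn(5, k)

unmerged = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T  # two dense paths at train time
merged = x @ (W0 + (alpha / r) * B @ A).T            # a single path after merging
assert torch.allclose(unmerged, merged, atol=1e-5)
```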

Adapter Rank

Critical
  • 1–4: very low capacity; useful for adjusting output style and format.
  • 8–16: general purpose; recommended as a starting point by the paper's authors (r=8 as baseline).
  • 32–64: complex tasks requiring larger behavioral changes in the model.
  • 128–256: rarely necessary; risk of overfitting on small datasets.

Dimension of the low-rank adaptation space. Directly controls the number of trainable parameters (r·(d+k) per layer) and the expressiveness of the adapter.

Scaling coefficient α

Standard
  • 8: with r=8, the α/r ratio equals 1.
  • 16: with r=8, the α/r ratio equals 2 (adaptation gain).
  • 32: popular default value in many implementations.

Scaling coefficient in the formula (α/r)·BA. Controls the effective strength of adaptation. Common heuristic: alpha = r or alpha = 2r.

Target modules / layers

Standard
  • q_proj, v_proj: minimal configuration consistent with Hu et al. 2021.
  • q_proj, k_proj, v_proj, o_proj: the entire self-attention block.
  • all-linear: all linear layers; used in QLoRA for highest quality.

Which weight matrices of the model are adapted by LoRA. In the original paper: Wq and Wv in self-attention. In practice: often all linear layers for maximum performance.

Adapter dropout

Standard
  • 0: no dropout; the default setting in many implementations.
  • 0.05: light regularization.
  • 0.1: stronger regularization for high overfitting-risk scenarios.

Dropout rate applied to LoRA adapter outputs during training as regularization.
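
All four hyperparameters map directly onto Hugging Face PEFT's `LoraConfig`; a typical setup might look like the following sketch (the model id and exact values are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder id
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling coefficient α (α/r = 2)
    target_modules=["q_proj", "v_proj"],  # minimal configuration per Hu et al. 2021
    lora_dropout=0.05,                    # light regularization
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # reports trainable vs. total parameters
```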

Strengths

  • Significantly reduces memory and training cost compared to full fine-tuning
  • Enables training large models on cheaper or lower-end hardware
  • Simplifies storage and distribution of lightweight adapters
  • Can achieve quality comparable to full fine-tuning
  • Accelerates experimentation and iteration on model adaptation
  • Integrates well with the Hugging Face ecosystem
  • Allows a single base model to be shared across multiple adapters

Limitations

  • Does not always match full fine-tuning across every task and architecture
  • Effectiveness depends on the choice of layers, adapter rank, and hyperparameters
  • May be less flexible for tasks requiring deep changes to the model's representations
  • Complex adapter stacks can complicate variant management
  • Performance and quality depend on the specific implementation and framework integration
  • PEFT is a broad family of methods, so the term is often used imprecisely

Computational characteristics

  • Significantly fewer trainable parameters than in full fine-tuning
  • Lower GPU memory usage and reduced storage costs
  • Higher training throughput in many scenarios
  • Low memory overhead of adapters relative to the base model
  • Adapter training runs on consumer-grade or low-cost GPU hardware

PEFT / LoRA is not a benchmark but a family of training techniques. Evaluation typically involves comparison with full fine-tuning across final model quality, number of trainable parameters, memory consumption, training speed, and ease of adapter deployment.

Common pitfalls

Rank too low — insufficient adapter expressive capacity
HIGH

A rank that is too low (r=1–4) for a complex task prevents the adapter from learning the required transformations: the model picks up style and format, but not content. This manifests as high training loss and poor task performance.

Start with r=8–16 and increase gradually. Monitor validation loss. For complex multi-domain tasks, consider r=64 or AdaLoRA.

Non-zero initialization of matrix B at the start
HIGH

If matrix B is not initialized to zeros, the model at training start is not identical to the base model, disrupting stability and hindering convergence. Per the paper, B must be zero-initialized.

Use a standard implementation (HF PEFT or microsoft/LoRA) that guarantees zero initialization of B. For manual implementations: B = torch.zeros(d, r).

Incorrect weight merging before inference
MEDIUM

Omitting the W = W₀ + (α/r)·BA merge before deployment leaves the adapter path active at inference, adding an extra matrix multiplication per adapted layer and increasing latency. Merging manually without the (α/r) scaling factor produces incorrect results.

Use `merge_and_unload()` from PEFT or apply the update manually: `model.weight.data += lora_B @ lora_A * (alpha / r)`. Verify results on a test set after merging.
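
A sketch of that flow with PEFT (model id and adapter path are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")     # placeholder id
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = model.merge_and_unload()  # folds (α/r)·BA into W0: one forward path at inference
merged.save_pretrained("merged-model")
```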

Using the same learning rate as in full fine-tuning
MEDIUM

LoRA adapters typically need a higher learning rate than full fine-tuning (often 10–100× higher), since far fewer parameters are trained. Too low an LR leads to slow convergence or no visible adaptation.

Use a LR in the range of 1e-4 to 3e-4 (vs. typical 1e-5 to 5e-5 for full fine-tuning). Apply a cosine LR schedule with warmup.
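
With Hugging Face `transformers`, this could translate into training arguments along these lines (values are illustrative, not prescriptive):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="lora-run",          # placeholder
    learning_rate=2e-4,             # roughly 10-100× higher than typical full fine-tuning LRs
    lr_scheduler_type="cosine",     # cosine decay ...
    warmup_ratio=0.03,              # ... with a short warmup
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
```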

Omitting the α/r scaling factor in the implementation
HIGH

If the (α/r) factor is not correctly applied in the forward pass (h = W₀x + (α/r)·BAx), adaptation outputs are unscaled, leading to unpredictable model behavior, especially when changing r without adjusting α.

Use established implementations. When implementing manually, verify: `output = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T`.

GENESIS · Source paper

LoRA: Low-Rank Adaptation of Large Language Models
ICLR 2022 · Edward J. Hu, Yelong Shen, Phillip Wallis et al.
2021

LoRA preprint published on arXiv (June 2021)

breakthrough

Hu et al. published arXiv:2106.09685. The method freezes base weights and adds low-rank matrix pairs A and B to selected Transformer layers, reducing trainable parameters by 10,000× while matching the performance of full fine-tuning of GPT-3 175B.

2022

Acceptance at ICLR 2022 and microsoft/LoRA implementation

breakthrough

Paper accepted at ICLR 2022. Microsoft released the official reference implementation. The Stable Diffusion community adopted LoRA for fine-tuning diffusion models.

2023

Hugging Face PEFT (February 2023) and QLoRA (May 2023)

breakthrough

Hugging Face integrated LoRA into the PEFT library, making it accessible to millions of developers. Dettmers et al. published QLoRA enabling fine-tuning of 65B models on a single 48 GB GPU via 4-bit quantization combined with LoRA adapters.

2023

AdaLoRA (ICLR 2023) – adaptive rank allocation

Zhang et al. published AdaLoRA, dynamically allocating rank across layers based on singular value importance scores.

2024

DoRA (ICML 2024) and rsLoRA – further LoRA variants

Liu et al. published DoRA decomposing weights into directional and magnitude components, applying LoRA only to direction. Kalajdzievski proposed rsLoRA with α/√r scaling for more stable training at higher ranks.

GPU Tensor Cores · PRIMARY

LoRA operates on matrix multiplication (BA) and addition to linear layer outputs — operations fully accelerated by GPU Tensor Cores (NVIDIA A100, H100, RTX 3090+). Reducing trainable parameters yields approximately 3× lower GPU memory consumption for optimizer states.

QLoRA (4-bit base model + LoRA adapters) enables fine-tuning of 65B models on a 48 GB GPU or 7B models on a 16 GB GPU. LoRA adapters can be trained in fp16/bf16 while base weights remain in INT8/INT4 (bitsandbytes).
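
A sketch of this QLoRA-style setup with `transformers`, `bitsandbytes`, and PEFT (the model id and hyperparameters are placeholders; the "all-linear" shortcut needs a recent PEFT version):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and activations in bf16
)
base = AutoModelForCausalLM.from_pretrained("base-model-id", quantization_config=bnb)  # placeholder id
base = prepare_model_for_kbit_training(base)
config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(base, config)
```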

TPU · GOOD

TPU v4/v5 handle the GEMM operations used by LoRA efficiently. Flax/JAX support LoRA implementations on TPU. Used by Google when fine-tuning Gemini and PaLM models with PEFT.

LoRA memory savings are less critical on TPUs (large HBM capacity), but reducing trainable parameters still accelerates training.

Related AI models

Gemma

LoRA: Low-Rank Adaptation of Large Language Models

The paper that originally introduced LoRA, first published in 2021.

scientific article · arXiv
PEFT

Official documentation of the PEFT library describing parameter-efficient fine-tuning.

documentation · Hugging Face
PEFT: State-of-the-art Parameter-Efficient Fine-Tuning

Official repository of the Hugging Face PEFT library.

repository · GitHub
PEFT in Transformers

Documentation for integrating PEFT adapters with the Transformers library.

documentation · Hugging Face
Parameter-Efficient Fine-Tuning using PEFT

Blog post introducing the PEFT library in the Hugging Face ecosystem in 2023.

blog · Hugging Face
LoRA

Documentation of LoRA as one of the methods supported by PEFT.

documentation · Hugging Face