QLoRA has three technical components. (1) 4-bit NormalFloat (NF4): a new data type for weight quantisation. Pretrained neural network weights are approximately normally distributed around zero, so NF4 places its 16 quantisation levels at quantiles of the normal distribution. This is information-theoretically optimal for such data and empirically better than plain 4-bit integers or floats. (2) Double Quantization (DQ): the quantisation constants themselves are also quantised. Each block of 64 weights has its own 32-bit scaling constant; DQ quantises these constants to 8-bit, saving ~0.37 bit/parameter (~3 GB for a 65B model). (3) Paged Optimizers: NVIDIA unified memory automatically pages optimizer states between GPU and CPU RAM when a memory spike occurs (e.g. a mini-batch with a long sequence), preventing OOM. Forward/backward: the frozen NF4 weights are dequantised to bf16 on the fly for each matmul, and gradients update only the LoRA adapters (effective weight W + BA, where B∈R^{d×r}, A∈R^{r×k}, r≪d). The base is never updated. Result: Guanaco (QLoRA on LLaMA 65B) reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark after 24 hours of training on a single GPU.
Full 16-bit fine-tuning of an LLM must hold weights, gradients, and optimizer states in GPU memory; for a 65B model that is more than 780 GB of VRAM, i.e. a multi-GPU cluster. LoRA alone reduces the number of trained parameters, but the frozen base model must still sit in memory in 16-bit (130 GB for 65B). QLoRA attacks this last cost: it quantises the frozen base to 4-bit, so the 65B base fits in ~35 GB and the whole fine-tuning run in under 48 GB. The key result was showing that a 4-bit base plus LoRA adapters does not degrade quality relative to 16-bit fine-tuning; aggressive quantisation during training had previously been assumed to hurt results.
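The memory figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes one common accounting for full fine-tuning with Adam (2 bytes of weights, 2 bytes of gradients, and 8 bytes of optimizer moments per parameter); that breakdown is an assumption chosen because it matches the 780 GB figure, not something the text spells out. The 4-bit number excludes quantisation constants, which is one reason the practical figure is closer to ~35 GB.

```python
PARAMS = 65e9  # parameter count of the 65B model

# Assumed bytes per parameter for full fine-tuning with Adam:
# 2 (16-bit weights) + 2 (16-bit gradients) + 8 (two 32-bit moment estimates).
FULL_FT_BYTES_PER_PARAM = 2 + 2 + 8


def gigabytes(n_bytes):
    """Convert a byte count to decimal gigabytes."""
    return n_bytes / 1e9


full_ft_gb = gigabytes(PARAMS * FULL_FT_BYTES_PER_PARAM)  # 780.0
base_16bit_gb = gigabytes(PARAMS * 2)                     # 130.0
base_4bit_gb = gigabytes(PARAMS * 0.5)                    # 32.5, before constants

print(full_ft_gb, base_16bit_gb, base_4bit_gb)
```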
The base model quantised to 4-bit NormalFloat. Quantisation levels are normal-distribution quantiles — optimal for zero-centered weights. Dequantised to bf16 on the fly during matmul, never updated.
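To make the quantile idea concrete, here is a simplified sketch of block-wise quantisation onto normal-distribution quantiles. It is not the exact NF4 code table used by bitsandbytes (the real table is asymmetric and contains an exact zero); the level construction, the helper names, and the sample block are all illustrative assumptions.

```python
import random
from statistics import NormalDist


def make_levels(bits=4):
    """Evenly spaced quantiles of N(0, 1), rescaled so the largest magnitude is 1."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


LEVELS = make_levels()  # 16 levels in [-1, 1], denser near zero


def quantise_block(block):
    """Scale a block by its absolute maximum, then store the nearest level index."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantise_block(codes, absmax):
    """Look up each level and undo the scaling (done on the fly before a matmul)."""
    return [LEVELS[c] * absmax for c in codes]


rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(64)]  # one block of 64 weights
codes, absmax = quantise_block(block)
restored = dequantise_block(codes, absmax)
```

Each weight is stored as a 4-bit index, and the only per-block metadata is `absmax`, the scaling constant that Double Quantization later compresses.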
Trained matrices B∈R^{d×r}, A∈R^{r×k} (r≪d) in bf16. The only part updated by gradient. For best quality, adapters are attached to ALL linear layers.
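A minimal pure-Python sketch of the adapter path, assuming the usual LoRA convention that B starts at zero so the adapted layer initially reproduces the frozen base. The helper names, the `scale` parameter, and the tiny dimensions are illustrative only.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)               # frozen base path (dequantised in QLoRA)
    update = matvec(B, matvec(A, x))  # low-rank path through r dimensions
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d, k, r):
    """Entries in B (d x r) plus entries in A (r x k)."""
    return r * (d + k)


rng = random.Random(0)
d, k, r = 8, 8, 2
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B_zero = [[0.0] * r for _ in range(d)]  # conventional zero initialisation
x = [rng.gauss(0, 1) for _ in range(k)]

y_init = lora_forward(W, A, B_zero, x)  # identical to the base output
n_adapter = trainable_params(d, k, r)   # 32, versus d * k = 64 for full W
```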
The NF4 block scaling constants (normally 32-bit) are themselves quantised to 8-bit. Saves ~0.37 bit/parameter (~3 GB for 65B) without quality loss.
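The ~0.37 bit/parameter saving follows from simple arithmetic. The sketch below assumes the paper's setup of 64 weights per first-level constant, 256 first-level constants per second-level block, and one 32-bit second-level constant per such block.

```python
BLOCK = 64      # weights sharing one first-level scaling constant
DQ_BLOCK = 256  # first-level constants sharing one second-level constant

# Without DQ: one 32-bit constant per 64 weights.
plain_bits = 32 / BLOCK                          # 0.5 bit/parameter

# With DQ: one 8-bit constant per 64 weights, plus one 32-bit
# second-level constant per 64 * 256 weights.
dq_bits = 8 / BLOCK + 32 / (BLOCK * DQ_BLOCK)    # 0.126953125 bit/parameter

saved_bits = plain_bits - dq_bits                # 0.373046875 bit/parameter
saved_gb = 65e9 * saved_bits / 8 / 1e9           # about 3.03 GB for 65B

print(round(saved_bits, 3), round(saved_gb, 2))
```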
Optimizer states (e.g. 8-bit AdamW) are paged between GPU and CPU RAM via NVIDIA unified memory during memory spikes, preventing OOM.
Naive LoRA attaches adapters only to q/v. QLoRA shows that to reach full-fine-tuning quality, adapters MUST be on all linear layers (including MLP gate/up/down). Skipping this leaves quality on the table.
NF4 matches the normal distribution of weights and empirically beats int4/fp4. Using a worse quantisation type lowers quality for no reason.
LoRA adapters are bf16, the base is 4-bit. A naive merge (W + BA) into a 4-bit base loses adapter precision. Merge into a 16-bit dequantised base or keep adapters separate.
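A toy illustration of why re-quantising after a merge can erase the adapter. It uses a hypothetical uniform grid with step 0.1 as a stand-in for a 4-bit code book (not NF4 itself), and made-up weights and updates that are smaller than half a grid step.

```python
STEP = 0.1  # hypothetical grid spacing standing in for a coarse 4-bit grid


def to_codes(weights, step=STEP):
    """Round each weight to the index of its nearest grid point."""
    return [round(w / step) for w in weights]


base = [0.1, -0.2, 0.3, 0.0]       # base weights, already on the grid
delta = [0.01, -0.02, 0.03, 0.02]  # adapter update BA, below half a step

merged_16bit = [w + d for w, d in zip(base, delta)]  # merge at high precision

# Re-quantising the merged weights yields the same codes as the base alone,
# so the adapter's contribution is rounded away.
adapter_lost = to_codes(merged_16bit) == to_codes(base)
print(adapter_lost)  # True
```

Keeping the merged weights in 16-bit, or serving the adapter separately, avoids this rounding step.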
Memory spikes on long sequences (gradient checkpointing reload) cause OOM even though average usage fits in VRAM. The paged optimizer solves this via unified memory.
Low-Rank Adaptation — training only low-rank adapters instead of full weights. The foundation of QLoRA.
Tim Dettmers and co-authors introduce LLM.int8(), 8-bit LLM inference quantisation without quality loss, shipped in the bitsandbytes library. The direct precursor of QLoRA's 4-bit quantisation.
Dettmers, Pagnoni, Holtzman, Zettlemoyer publish QLoRA (arXiv:2305.14314, NeurIPS 2023). NF4 + Double Quantization + Paged Optimizers enable fine-tuning 65B on a single 48 GB GPU. The Guanaco model reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark.
QLoRA lands in PEFT and Transformers within weeks — `load_in_4bit=True` + LoRA becomes a one-line recipe. Mass adoption by the open-source community.
Optimised frameworks emerge (Unsloth with custom Triton kernels gives 2× faster QLoRA), making 4-bit fine-tuning the standard on consumer hardware.
Successors improve QLoRA: DoRA (decomposes magnitude/direction), LoftQ (better adapter initialisation for a quantised base), QA-LoRA (quantisation-aware adapters for 4-bit deployment after fine-tuning).
Time complexity: O(T · |θ|) + NF4→bf16 dequantisation overhead (~5–30% over bf16 LoRA). Space complexity: O(0.5 · |θ|) base (4-bit) + O(2 · r · d · L) adapters + O(8-bit optimizer states).
Each matmul requires dequantising a weight block from NF4 to bf16. This is QLoRA's main time overhead vs plain LoRA. Custom kernels (Unsloth Triton) reduce it from ~30% to ~5–10%.
Rank of the adapter matrices B and A. Determines the number of trained parameters: r·(d+k) per adapted matrix, i.e. 2·r·d when d=k. QLoRA shows that a small rank (r=8–64) suffices: once adapters are on all linear layers, larger r does not improve quality, which suggests the limit is in the data rather than adapter capacity.
Base quantisation type. NF4 (4-bit NormalFloat) is the flagship contribution, designed to be information-theoretically optimal for normally distributed weights. FP4 is an alternative that is empirically worse. The paper shows NF4's advantage over FP4 and int4 experimentally rather than by formal proof.
Whether to quantise the quantisation constants. Saves ~0.37 bit/parameter (~3 GB for 65B) at zero quality loss. Almost always enabled.
Number of weights sharing one scaling constant. Smaller block = more accurate quantisation but more constants. The paper uses 64 for NF4 and 256 for DQ.
Whether to use unified memory to page optimizer states during memory spikes. Prevents OOM on long sequences. No quality cost, small time overhead during spikes.
Which layers get adapters. The paper shows that for best quality LoRA should attach to ALL linear layers (q,k,v,o,gate,up,down), not just attention — an important finding distinguishing QLoRA from naive LoRA.
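The hyperparameters above map onto configuration objects in Hugging Face Transformers and PEFT. The sketch below is one plausible setup, not the paper's exact recipe: the rank, alpha, and dropout values are illustrative, the module names assume a LLaMA-style model as named in Transformers, and the quantisation block size is not set here. The settings are kept as plain dictionaries so they can be inspected without the libraries installed; `build_configs` is a hypothetical helper that needs `torch`, `transformers`, and `peft`.

```python
# Quantisation settings for the frozen base model.
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",            # NF4 rather than "fp4"
    "bnb_4bit_use_double_quant": True,       # Double Quantization
    "bnb_4bit_compute_dtype": "bfloat16",    # dtype used after dequantisation
}

# Adapter settings; values are illustrative, not a recommendation.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": [                      # all linear layers (LLaMA-style names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Optimizer name as accepted by transformers.TrainingArguments(optim=...).
OPTIMIZER_NAME = "paged_adamw_32bit"


def build_configs():
    """Build the config objects; imports are local so the dicts stay usable alone."""
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_kwargs = dict(BNB_KWARGS)
    bnb_kwargs["bnb_4bit_compute_dtype"] = getattr(
        torch, BNB_KWARGS["bnb_4bit_compute_dtype"]
    )
    return BitsAndBytesConfig(**bnb_kwargs), LoraConfig(**LORA_KWARGS)
```

In a typical workflow the first object would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and the second to `peft.get_peft_model`.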
QLoRA modifies weight representation and the training path, not the model structure — the network stays dense. It can be combined with MoE (quantising experts) but that is orthogonal.
QLoRA is standard supervised training with on-the-fly dequantisation. It scales via DDP/FSDP; NF4→bf16 dequantisation is cheap and parallel. The main gain is memory reduction, not a change to the parallelism profile.
QLoRA was designed for NVIDIA GPUs with bitsandbytes (custom CUDA kernels for NF4 dequantisation) and unified memory (paged optimizers). Works from consumer RTX 3090/4090 to data-center A100/H100.
The idea is portable, but efficient 4-bit NF4 kernels are CUDA-first. ROCm/Metal support exists but is less mature.