Training

QLoRA

2023ActivePublished: 10 June 2026Updated: 10 June 2026Published

Key innovation

Enables fine-tuning a 65B LLM on a single 48 GB GPU by quantising the frozen base model to a 4-bit NF4 format and training only LoRA adapters in higher precision. It combines 4-bit NormalFloat, double quantisation, and paged optimizers, cutting fine-tuning memory ~3× with no quality loss versus full 16-bit.

How it works

QLoRA has three technical components. (1) 4-bit NormalFloat (NF4) — a new data type for weight quantisation. Neural network weights are approximately normally distributed (zero-centered), so NF4 uses normal-distribution quantiles as quantisation levels — information-theoretically optimal for such data, better than plain 4-bit int or float. (2) Double Quantization (DQ) — the quantisation constants themselves are also quantised. Each block of 64 weights has its own 32-bit scaling constant; DQ quantises these constants to 8-bit, saving ~0.37 bit/parameter (for 65B = ~3 GB). (3) Paged Optimizers — use NVIDIA unified memory to automatically page optimizer states between GPU and CPU RAM when a memory spike occurs (e.g. a long sequence), preventing OOM. Forward/backward: frozen NF4 weights are dequantised to bf16 on the fly during matmul, gradient flows only through the LoRA adapters (W + BA, where B∈R^{d×r}, A∈R^{r×k}, r≪d). The base is never updated. Result: Guanaco quality (QLoRA on Llama 65B) matches ChatGPT on the Vicuna benchmark, trained on a single GPU in 24h.

Problem solved

Full LLM fine-tuning requires holding weights, gradients, and optimizer states in 16-bit — for a 65B model that is >780 GB VRAM, i.e. a multi-GPU cluster. LoRA alone reduces the number of trained parameters, but the frozen base model still must sit in 16-bit (130 GB for 65B). QLoRA attacks this last cost: it quantises the frozen base to 4-bit, so 65B fits in ~35 GB and the whole fine-tuning in 48 GB. The key was showing that a 4-bit base + LoRA adapters does NOT degrade quality — previously aggressive quantisation during training was assumed to hurt results.

Components

4-bit NormalFloat (NF4) baseCompressed base model knowledge

The base model quantised to 4-bit NormalFloat. Quantisation levels are normal-distribution quantiles — optimal for zero-centered weights. Dequantised to bf16 on the fly during matmul, never updated.

INOriginal base-model weights.

OUT4-bit weights + per-block scaling constants.

NF4Default — optimal for normal distribution.

FP4Alternative, empirically worse.

Official

LoRA adapters (B, A)Trainable weight delta

Trained matrices B∈R^{d×r}, A∈R^{r×k} (r≪d) in bf16. The only part updated by gradient. For best quality attached to ALL linear layers.

Double Quantization (DQ)Additional memory reduction

The NF4 block scaling constants (normally 32-bit) are themselves quantised to 8-bit. Saves ~0.37 bit/parameter (~3 GB for 65B) without quality loss.

Official

Paged OptimizersOOM protection on long sequences

Optimizer states (e.g. 8-bit AdamW) are paged between GPU and CPU RAM via NVIDIA unified memory during memory spikes, preventing OOM.

Official

Implementation

Reference implementations

artidoro/qlora (official repo)

Python (PyTorch) · Artidoro Pagnoni and Tim Dettmers (UW)

Official

bitsandbytes (4-bit NF4 kernels)

Python / CUDA · Tim Dettmers / bitsandbytes Foundation

Official

Hugging Face PEFT (load_in_4bit + LoRA)

Python · Hugging Face

Unsloth (2× faster QLoRA, custom Triton)

Python / Triton · Unsloth AI

Implementation pitfalls

LoRA only on attention instead of all linear layersHigh

Naive LoRA attaches adapters only to q/v. QLoRA shows that to reach full-fine-tuning quality, adapters MUST be on all linear layers (including MLP gate/up/down). Skipping this leaves quality on the table.

Fix:Set target_modules to all linear layers (`all-linear` in PEFT).

Using plain int4/fp4 instead of NF4Medium

NF4 matches the normal distribution of weights and empirically beats int4/fp4. Using a worse quantisation type lowers quality for no reason.

Fix:Use `bnb_4bit_quant_type="nf4"` in the bitsandbytes config.

Merging adapters into a 4-bit base → quality lossMedium

LoRA adapters are bf16, the base is 4-bit. A naive merge (W + BA) into a 4-bit base loses adapter precision. Merge into a 16-bit dequantised base or keep adapters separate.

Fix:Dequantise the base to fp16 before merging, or keep adapters as a separate artifact (recommended for deployment).

No paged optimizer on long sequences → OOMLow

Memory spikes on long sequences (gradient checkpointing reload) cause OOM even though average usage fits in VRAM. The paged optimizer solves this via unified memory.

Fix:Enable `optim="paged_adamw_8bit"` and gradient checkpointing.

Evolution

Original paper · 2023 · NeurIPS 2023 (University of Washington) · Tim Dettmers

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

2021

LoRA (Hu et al., Microsoft)

Low-Rank Adaptation — training only low-rank adapters instead of full weights. The foundation of QLoRA.

PEFT / LoRA (concept)

2022

LLM.int8() (Dettmers et al.)

The same author introduces 8-bit LLM inference quantisation without quality loss (the bitsandbytes library). The direct precursor of QLoRA's 4-bit quantisation.

2023

QLoRA — UW paper

Inflection point

Dettmers, Pagnoni, Holtzman, Zettlemoyer publish QLoRA (arXiv:2305.14314, NeurIPS 2023). NF4 + Double Quantization + Paged Optimizers enable fine-tuning 65B on a single 48 GB GPU. The Guanaco model matches ChatGPT on the Vicuna benchmark.

QLoRA: Efficient Finetuning of Quantized LLMs (paper)

2023

Integration into Hugging Face PEFT / bitsandbytes

QLoRA lands in PEFT and Transformers within weeks — `load_in_4bit=True` + LoRA becomes a one-line recipe. Mass adoption by the open-source community.

2024

Unsloth, Axolotl, Llama-Factory — production tools

Optimised frameworks emerge (Unsloth with custom Triton kernels gives 2× faster QLoRA), making 4-bit fine-tuning the standard on consumer hardware.

2024

Variants: DoRA, QA-LoRA, LoftQ

Successors improve QLoRA: DoRA (decomposes magnitude/direction), LoftQ (better adapter initialisation for a quantised base), QA-LoRA (quantisation-aware adapters for 4-bit deployment after fine-tuning).

QLoRA

How it works

Problem solved

Components

Implementation

Evolution

QLoRA

How it works

Problem solved

Components

Implementation

Evolution

Computational complexity

Compute bottleneck

Hyperparameters (configurable axes)

Execution paradigm

Parallelism

Hardware requirements