Interactive explainer

Fine-tune an LLM
on a single GPU

QLoRA combines 4-bit quantization with low-rank adapters to make fine-tuning large language models accessible to everyone.

7×
Memory reduction
0.06%
Params trained
1 GPU
All you need
01

The problem with full fine-tuning

Fine-tuning means updating a pre-trained model on your own data. But with billions of parameters, every single weight needs its own gradient and optimizer state in memory — and that adds up fast.

Pre-trained LLM — every parameter must be updated
~28 GB
GPU memory (7B, FP32)
7 billion
Trainable parameters
Multi-GPU
Hardware required
The bottleneck isn't compute — it's memory. Storing weights, gradients, and optimizer states for billions of parameters requires expensive multi-GPU setups that most teams don't have.
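The arithmetic behind that 28 GB figure is simple. A back-of-the-envelope sketch, counting weights only — gradients and optimizer states add several multiples of this on top during full fine-tuning:

```python
# Weights-only GPU memory for a model at a given precision.
# Full fine-tuning also stores a gradient per weight plus optimizer
# state (e.g. Adam's two moments), multiplying this figure further.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9  # decimal GB

n = 7e9                             # 7B-parameter model
print(weight_memory_gb(n, 4))       # FP32: 28.0 GB
print(weight_memory_gb(n, 2))       # FP16: 14.0 GB
print(weight_memory_gb(n, 0.5))     # NF4 (4 bits): 3.5 GB
```

The three results line up with the stat cards in sections 01, 02, and 03.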
02

LoRA: low-rank adapters

Instead of updating the full weight matrix W, freeze it entirely and train two tiny matrices A and B. The adapted output is W + A×B — a low-rank update that captures task-specific knowledge with a fraction of the parameters.

Diagram: frozen base weights W (7B parameters, not updated); adapter A, d × r (down-project); adapter B, r × d (up-project); W' = W + A × B. r is tiny (4–64), making A × B very small. Example shown: r = 8.
~14 GB
GPU memory (FP16 base)
~4.2M
Trainable parameters
0.06%
Of total params
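The low-rank update can be sketched in a few lines of NumPy. Toy version with illustrative dimensions — real LoRA also scales the A×B term by alpha/r and applies it inside each attention/MLP projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 8                          # hidden size, LoRA rank

W = rng.standard_normal((d, d))         # frozen base weight: never updated
A = rng.standard_normal((d, r)) * 0.01  # trainable down-projection (d x r)
B = np.zeros((r, d))                    # trainable up-projection (r x d),
                                        # zero-init so W' == W before training

def forward(x):
    # W' = W + A @ B, computed without materialising the d x d update
    return x @ W + (x @ A) @ B

trainable = A.size + B.size             # 2 * d * r = 65,536 params
print(trainable)                        # vs 16.8M in this one W matrix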
03

NF4 quantization

Neural network weights follow a roughly Gaussian distribution. NormalFloat4 (NF4) maps them to 16 optimally spaced fixed points, compressing from 16 bits down to 4 bits per weight — a 4× memory reduction with minimal quality loss.

Diagram: FP16 weights (16 bits per parameter) → NF4 weights (4 bits per parameter). Values are mapped to 16 fixed points on a normal distribution, which is near-optimal for neural net weights since they are roughly Gaussian. Result: 4× memory reduction with minimal quality loss.
~3.5 GB
GPU memory (7B, NF4)
4×
Memory savings
~1%
Quality loss
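The idea can be sketched with stdlib quantiles. This is a simplified illustration: real NF4 uses a specific hard-coded table of 16 normal-quantile levels (including an exact zero) plus blockwise absmax scaling, but placing levels at Gaussian quantiles captures the intuition:

```python
import numpy as np
from statistics import NormalDist

# 16 levels placed at quantiles of the standard normal, so all 16 codes
# get used roughly equally when weights are roughly Gaussian.
probs = (np.arange(16) + 0.5) / 16
levels = np.array([NormalDist().inv_cdf(p) for p in probs])

def quantize(w):
    scale = np.abs(w).max() / np.abs(levels).max()   # absmax scaling
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale               # 4-bit codes + one scale

def dequantize(idx, scale):
    return levels[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)     # Gaussian-ish "weights"
idx, scale = quantize(w)
w_hat = dequantize(idx, scale)
print(np.abs(w - w_hat).mean())                      # small reconstruction error
```

Each weight is stored as a 4-bit index into the 16-entry table, plus a shared scale per block — which is itself quantized in QLoRA's "double quantization".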
04

QLoRA: the full picture

Combine NF4 quantization with LoRA adapters. The base model is stored in 4-bit, keeping memory tiny. During forward passes, weights are dequantized to FP16 on the fly. Gradients only flow through the small A and B matrices — so you get full fine-tuning quality at a fraction of the cost.

Diagram: 4-bit quantized base (NF4), frozen — ~3.5 GB for a 7B model — with adapter A (FP16) and adapter B (FP16) on top. Three QLoRA innovations: (1) NF4 data type — quantization levels optimal for Gaussian weights; (2) double quantization — quantize the quantization constants too; (3) paged optimizers — offload optimizer states to CPU when GPU memory spikes.
~4 GB
Total GPU memory
~4M
Trainable params
1 GPU
Hardware needed
97%
Of full fine-tune quality
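In practice, this recipe maps to a few configuration objects in the Hugging Face stack. A sketch assuming recent versions of transformers, peft, and bitsandbytes; the hyperparameter values are illustrative, not prescriptive:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Base model stored in 4-bit NF4, dequantized to 16-bit for matmuls
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # innovation 1: NF4 data type
    bnb_4bit_use_double_quant=True,          # innovation 2: double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # on-the-fly dequant target
)

# Small trainable adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=8,                  # rank of A and B
    lora_alpha=16,        # scaling applied to the A @ B update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

These are then passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `get_peft_model(model, lora_config)`; paged optimizers (innovation 3) come from bitsandbytes, e.g. setting `optim="paged_adamw_8bit"` in `TrainingArguments`.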
05

Side-by-side comparison

Three approaches to fine-tuning a 7B-parameter model, measured by GPU memory. QLoRA achieves near-identical quality to full fine-tuning at a fraction of the cost.

Full fine-tune
28 GB
LoRA (FP16)
14 GB
QLoRA (NF4)
4 GB
The sweet spot. QLoRA gives you 97% of full fine-tuning quality while using 7× less memory. A 65B model that previously required a multi-node cluster can now be fine-tuned on a single 48 GB GPU.
06

Deploying for inference

After training, you have two deployment strategies depending on whether you're serving one task or many.

Diagram: 4-bit base W → dequantize → FP16 W; A × B = ΔW; W' = W + ΔW → a single merged FP16 model (optionally re-quantized). Same speed as the original — the adapters are gone.
Best for single-task deployment. Merge A×B into the base weights, export one model file. Inference cost is identical to the original — no adapter overhead at all.
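The merge step is just matrix addition. A NumPy sketch showing that the merged model produces the same outputs as the base-plus-adapter path, up to float rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8
W = rng.standard_normal((d, d))          # dequantized base (FP16 in practice)
A = rng.standard_normal((d, r)) * 0.01   # trained adapter matrices
B = rng.standard_normal((r, d)) * 0.01

W_merged = W + A @ B                     # fold the adapter into the base

x = rng.standard_normal((4, d))
y_adapter = x @ W + (x @ A) @ B          # base + adapter at inference time
y_merged = x @ W_merged                  # one matmul, no adapter overhead
print(np.allclose(y_adapter, y_merged))  # identical up to rounding
```

After merging, the exported weight file has the same shape as the original model, so any existing serving stack runs it unchanged.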
Diagram: one shared 4-bit base model, loaded once in GPU memory, with hot-swappable adapters — chat (~8 MB) for "tell me a joke", code (~8 MB) for "fix this bug", medical (~8 MB) for "diagnose this". Each adapter is a few MB, swapped per request type, so one GPU serves many specialised models. PEFT and vLLM support this natively.
Best for multi-task serving. Load the base model once, swap tiny adapter files on demand per request. One GPU effectively serves many specialised models — each adapter adds only a few megabytes.
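Hot-swapping amounts to keeping one base weight and a dictionary of small (A, B) pairs, chosen per request. A toy NumPy version of what PEFT does with named adapters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8
W = rng.standard_normal((d, d))   # shared base, loaded once

# Each "adapter" is just two small matrices — megabytes, not gigabytes
adapters = {
    name: (rng.standard_normal((d, r)) * 0.01,
           rng.standard_normal((r, d)) * 0.01)
    for name in ("chat", "code", "medical")
}

def serve(x, task):
    A, B = adapters[task]             # swap per request: no base reload
    return x @ W + (x @ A) @ B

x = rng.standard_normal((1, d))
print(serve(x, "chat").shape)         # every task shares the same base W
```

Loading a new adapter means reading a few MB from disk into the dictionary, while the multi-gigabyte base stays resident in GPU memory.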