QLoRA combines 4-bit quantization with low-rank adapters to make fine-tuning large language models accessible to everyone.
7×
Memory reduction
0.06%
Params trained
1 GPU
All you need
01
The problem with full fine-tuning
Fine-tuning means updating a pre-trained model on your own data. But with billions of parameters, every single weight needs a gradient and optimizer state held in memory, and that adds up fast.
~28 GB
GPU memory (7B, FP32)
7 billion
Trainable parameters
Multi-GPU
Hardware required
The bottleneck isn't compute — it's memory. Storing weights, gradients, and optimizer states for billions of parameters requires expensive multi-GPU setups that most teams don't have.
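The arithmetic behind those figures is simple. A back-of-the-envelope sketch (exact totals vary by optimizer and framework; the 28 GB card above counts weights alone):

```python
# Back-of-the-envelope memory for full fine-tuning a 7B model in FP32.
params = 7e9

weights = params * 4           # 4 bytes per FP32 weight
grads = params * 4             # one gradient per weight
adam_states = params * 4 * 2   # Adam keeps two moment buffers per weight

gb = 1 / 1e9
print(f"weights alone: {weights * gb:.0f} GB")                          # 28 GB
print(f"with training state: {(weights + grads + adam_states) * gb:.0f} GB")
```

Even before activations, the training state is several times the size of the weights themselves.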
02
LoRA: low-rank adapters
Instead of updating the full weight matrix W, freeze it entirely and train two tiny matrices A and B. The adapted output is W + A×B — a low-rank update that captures task-specific knowledge with a fraction of the parameters.
8
Rank (r)
~14 GB
GPU memory (FP16 base)
~4.2M
Trainable parameters
0.06%
Of total params
Adapter A (down-project) Adapter B (up-project) Frozen base
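The idea fits in a few lines of NumPy. A minimal sketch (dimensions are illustrative, much smaller than a real model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                   # input dim, output dim, LoRA rank

W = rng.standard_normal((d, k))         # frozen pre-trained weight
A = rng.standard_normal((d, r)) * 0.01  # trainable down-projection
B = np.zeros((r, k))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus low-rank update: x(W + AB) = xW + (xA)B.
    return x @ W + (x @ A) @ B

x = rng.standard_normal((1, d))
# With B initialised to zero, the adapted model starts identical to the base.
assert np.allclose(lora_forward(x), x @ W)

# Trainable parameters: d*r + r*k, versus d*k frozen.
print(d * r + r * k, "trainable vs", d * k, "frozen")
```

Only A and B receive gradients; at rank 8 the update adds d·r + r·k parameters per adapted matrix, a small fraction of the d·k weights it modifies.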
03
NF4 quantization
Neural network weights follow a roughly Gaussian distribution. NormalFloat4 (NF4) exploits this by placing its 16 quantization levels at the quantiles of a standard normal, compressing weights from 16 bits down to 4 bits each, a 4× memory reduction with minimal quality loss.
~3.5 GB
GPU memory (7B, NF4)
4×
Memory savings
~1%
Quality loss
04
QLoRA: the full picture
Combine NF4 quantization with LoRA adapters. The base model is stored in 4-bit, keeping memory tiny. During forward passes, weights are dequantized to 16-bit on the fly. Gradients only flow through the small A and B matrices, so you get near-full fine-tuning quality at a fraction of the cost.
~4 GB
Total GPU memory
~4M
Trainable params
1 GPU
Hardware needed
97%
Of full fine-tune quality
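In practice this is a few lines with the Hugging Face stack. A sketch, not a recipe: the model name and hyperparameters are placeholders, and argument defaults shift between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 storage
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantize to 16-bit for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder; any causal LM works
    quantization_config=bnb,
)
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # only the A/B adapter weights
```

The wrapped model trains like any other: the frozen 4-bit base handles the forward pass while the optimizer only ever sees the adapter parameters.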
05
Side-by-side comparison
Three approaches to fine-tuning a 7B-parameter model, measured by GPU memory. QLoRA achieves near-identical quality to full fine-tuning at a fraction of the cost.
Full fine-tune
28 GB
LoRA (FP16)
14 GB
QLoRA (NF4)
4 GB
The sweet spot. QLoRA gives you 97% of full fine-tuning quality while using 7× less memory. A 65B model that previously required a multi-node cluster can now be fine-tuned on a single 48 GB GPU.
06
Deploying for inference
After training, you have two deployment strategies depending on whether you're serving one task or many.
Best for single-task deployment. Merge A×B into the base weights, export one model file. Inference cost is identical to the original — no adapter overhead at all.
Best for multi-task serving. Load the base model once, swap tiny adapter files on demand per request. One GPU effectively serves many specialised models — each adapter adds only a few megabytes.
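The merge option can be sketched in NumPy: folding A×B into W yields one dense matrix with identical outputs and zero adapter overhead (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 256, 256, 8
W = rng.standard_normal((d, k))          # trained base (dequantized)
A = rng.standard_normal((d, r)) * 0.01   # trained adapters
B = rng.standard_normal((r, k)) * 0.01

W_merged = W + A @ B                     # one-time merge after training

x = rng.standard_normal((4, d))
adapter_out = x @ W + (x @ A) @ B        # base + adapter path
merged_out = x @ W_merged                # single dense matmul
assert np.allclose(adapter_out, merged_out)

# The multi-task alternative keeps W fixed and swaps (A, B) pairs per task:
# each adapter stores d*r + r*k values, tiny next to W's d*k.
print("adapter-to-base size ratio:", (d * r + r * k) / (d * k))
```

The same identity x(W + AB) = xW + (xA)B is what makes adapter swapping cheap in the multi-task case: the heavy W term is shared, and only the low-rank pair changes per request.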