Interactive explainer

Fine-tune an LLM
on a single GPU

QLoRA combines 4-bit quantization with low-rank adapters to make fine-tuning large language models accessible to everyone.

7×
Memory reduction
0.06%
Params trained
1 GPU
All you need
01

The problem with full fine-tuning

Fine-tuning means updating a pre-trained model on your own data. But with billions of parameters, every single weight needs its own gradient and optimizer state in memory — and that adds up fast.

Pre-trained LLM — every parameter must be updated
~28 GB
GPU memory (7B, FP32)
7 billion
Trainable parameters
Multi-GPU
Hardware required
The bottleneck isn't compute — it's memory. Storing weights, gradients, and optimizer states for billions of parameters requires expensive multi-GPU setups that most teams don't have.
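The arithmetic behind that 28 GB figure is simple. A back-of-the-envelope sketch, counting weights only — gradients and optimizer states add several multiples of this on top during full fine-tuning:

```python
# Weights-only GPU memory for a model at a given precision.
# Full fine-tuning also stores a gradient per weight plus optimizer
# state (e.g. Adam's two moments), multiplying this figure further.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9  # decimal GB

n = 7e9                             # 7B-parameter model
print(weight_memory_gb(n, 4))       # FP32: 28.0 GB
print(weight_memory_gb(n, 2))       # FP16: 14.0 GB
print(weight_memory_gb(n, 0.5))     # NF4 (4 bits): 3.5 GB
```

The three results line up with the stat cards in sections 01, 02, and 03.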
02

LoRA: low-rank adapters

Instead of updating the full weight matrix W, freeze it entirely and train two tiny matrices A and B. The adapted output is W + A×B — a low-rank update that captures task-specific knowledge with a fraction of the parameters.

Diagram: frozen base weights W (7B parameters, not updated); adapter A, d × r (down-project); adapter B, r × d (up-project); W' = W + A × B. r is tiny (4–64), making A × B very small. Example shown: r = 8.
~14 GB
GPU memory (FP16 base)
~4.2M
Trainable parameters
0.06%
Of total params
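The low-rank update can be sketched in a few lines of NumPy. Toy version with illustrative dimensions — real LoRA also scales the A×B term by alpha/r and applies it inside each attention/MLP projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 8                          # hidden size, LoRA rank

W = rng.standard_normal((d, d))         # frozen base weight: never updated
A = rng.standard_normal((d, r)) * 0.01  # trainable down-projection (d x r)
B = np.zeros((r, d))                    # trainable up-projection (r x d),
                                        # zero-init so W' == W before training

def forward(x):
    # W' = W + A @ B, computed without materialising the d x d update
    return x @ W + (x @ A) @ B

trainable = A.size + B.size             # 2 * d * r = 65,536 params
print(trainable)                        # vs 16.8M in this one W matrix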
03

NF4 quantization

Neural network weights follow a roughly Gaussian distribution. NormalFloat4 (NF4) maps them to 16 optimally spaced fixed points, compressing from 16 bits down to 4 bits per weight — a 4× memory reduction with minimal quality loss.

Diagram: FP16 weights (16 bits per parameter) → NF4 weights (4 bits per parameter). Values are mapped to 16 fixed points on a normal distribution, which is near-optimal for neural net weights since they are roughly Gaussian. Result: 4× memory reduction with minimal quality loss.
~3.5 GB
GPU memory (7B, NF4)
4×
Memory savings
~1%
Quality loss
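The idea can be sketched with stdlib quantiles. This is a simplified illustration: real NF4 uses a specific hard-coded table of 16 normal-quantile levels (including an exact zero) plus blockwise absmax scaling, but placing levels at Gaussian quantiles captures the intuition:

```python
import numpy as np
from statistics import NormalDist

# 16 levels placed at quantiles of the standard normal, so all 16 codes
# get used roughly equally when weights are roughly Gaussian.
probs = (np.arange(16) + 0.5) / 16
levels = np.array([NormalDist().inv_cdf(p) for p in probs])

def quantize(w):
    scale = np.abs(w).max() / np.abs(levels).max()   # absmax scaling
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale               # 4-bit codes + one scale

def dequantize(idx, scale):
    return levels[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)     # Gaussian-ish "weights"
idx, scale = quantize(w)
w_hat = dequantize(idx, scale)
print(np.abs(w - w_hat).mean())                      # small reconstruction error
```

Each weight is stored as a 4-bit index into the 16-entry table, plus a shared scale per block — which is itself quantized in QLoRA's "double quantization".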
04

QLoRA: the full picture

Combine NF4 quantization with LoRA adapters. The base model is stored in 4-bit, keeping memory tiny. During forward passes, weights are dequantized to FP16 on the fly. Gradients only flow through the small A and B matrices — so you get full fine-tuning quality at a fraction of the cost.

Diagram: 4-bit quantized base (NF4), frozen — ~3.5 GB for a 7B model — with adapter A (FP16) and adapter B (FP16) on top. Three QLoRA innovations: (1) NF4 data type — quantization levels optimal for Gaussian weights; (2) double quantization — quantize the quantization constants too; (3) paged optimizers — offload optimizer states to CPU when GPU memory spikes.
~4 GB
Total GPU memory
~4M
Trainable params
1 GPU
Hardware needed
97%
Of full fine-tune quality
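In practice, this recipe maps to a few configuration objects in the Hugging Face stack. A sketch assuming recent versions of transformers, peft, and bitsandbytes; the hyperparameter values are illustrative, not prescriptive:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Base model stored in 4-bit NF4, dequantized to 16-bit for matmuls
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # innovation 1: NF4 data type
    bnb_4bit_use_double_quant=True,          # innovation 2: double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # on-the-fly dequant target
)

# Small trainable adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=8,                  # rank of A and B
    lora_alpha=16,        # scaling applied to the A @ B update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

These are then passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `get_peft_model(model, lora_config)`; paged optimizers (innovation 3) come from bitsandbytes, e.g. setting `optim="paged_adamw_8bit"` in `TrainingArguments`.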
05

Side-by-side comparison

Three approaches to fine-tuning a 7B-parameter model, measured by GPU memory. QLoRA achieves near-identical quality to full fine-tuning at a fraction of the cost.

Full fine-tune
28 GB
LoRA (FP16)
14 GB
QLoRA (NF4)
4 GB
The sweet spot. QLoRA gives you 97% of full fine-tuning quality while using 7× less memory. A 65B model that previously required a multi-node cluster can now be fine-tuned on a single 48 GB GPU.
06

Deploying for inference

After training, you have two deployment strategies depending on whether you're serving one task or many.

Diagram: 4-bit base W → dequantize → FP16 W; A × B = ΔW; W' = W + ΔW → a single merged FP16 model (optionally re-quantized). Same speed as the original — the adapters are gone.
Best for single-task deployment. Merge A×B into the base weights, export one model file. Inference cost is identical to the original — no adapter overhead at all.
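The merge step is just matrix addition. A NumPy sketch showing that the merged model produces the same outputs as the base-plus-adapter path, up to float rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8
W = rng.standard_normal((d, d))          # dequantized base (FP16 in practice)
A = rng.standard_normal((d, r)) * 0.01   # trained adapter matrices
B = rng.standard_normal((r, d)) * 0.01

W_merged = W + A @ B                     # fold the adapter into the base

x = rng.standard_normal((4, d))
y_adapter = x @ W + (x @ A) @ B          # base + adapter at inference time
y_merged = x @ W_merged                  # one matmul, no adapter overhead
print(np.allclose(y_adapter, y_merged))  # identical up to rounding
```

After merging, the exported weight file has the same shape as the original model, so any existing serving stack runs it unchanged.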
Diagram: one shared 4-bit base model, loaded once in GPU memory, with hot-swappable adapters — chat (~8 MB) for "tell me a joke", code (~8 MB) for "fix this bug", medical (~8 MB) for "diagnose this". Each adapter is a few MB, swapped per request type, so one GPU serves many specialised models. PEFT and vLLM support this natively.
Best for multi-task serving. Load the base model once, swap tiny adapter files on demand per request. One GPU effectively serves many specialised models — each adapter adds only a few megabytes.
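Hot-swapping amounts to keeping one base weight and a dictionary of small (A, B) pairs, chosen per request. A toy NumPy version of what PEFT does with named adapters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8
W = rng.standard_normal((d, d))   # shared base, loaded once

# Each "adapter" is just two small matrices — megabytes, not gigabytes
adapters = {
    name: (rng.standard_normal((d, r)) * 0.01,
           rng.standard_normal((r, d)) * 0.01)
    for name in ("chat", "code", "medical")
}

def serve(x, task):
    A, B = adapters[task]             # swap per request: no base reload
    return x @ W + (x @ A) @ B

x = rng.standard_normal((1, d))
print(serve(x, "chat").shape)         # every task shares the same base W
```

Loading a new adapter means reading a few MB from disk into the dictionary, while the multi-gigabyte base stays resident in GPU memory.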