TL;DR: Training memory is dominated by optimizer state and activations, not weights. A 7B model needs ~112 GB of static state to train naively in mixed precision. An 8-bit optimizer brings that to ~70 GB, and activation checkpointing shrinks the activations on top; to fit on a single 24 GB card you also need 4-bit quantized weights with LoRA. This post walks through the memory math and the exact order I apply these techniques.
“CUDA out of memory” is the most common error in deep learning, and the most commonly misdiagnosed. The instinct is to shrink the batch size and hope. The better move is to know where every byte goes — because once you can account for the memory, you can negotiate with it.
Where does GPU memory go during training?
For standard mixed-precision training with Adam, per-parameter memory looks like this:
| Component | Precision | Bytes/param |
|---|---|---|
| Weights | bf16 | 2 |
| Gradients | bf16 | 2 |
| Master weights | fp32 | 4 |
| Adam momentum | fp32 | 4 |
| Adam variance | fp32 | 4 |
| Total (static) | | 16 |
Sixteen bytes per parameter. A 7B model is 112 GB before a single token is processed. That is the number that matters — not the 14 GB you see quoted for “a 7B model”, which is inference-only weight memory.
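To make the arithmetic concrete, here is a minimal sketch that sums the table's bytes-per-parameter column and scales it to a 7B model. The function name `static_memory_gb` is my own, and I'm using 1 GB = 1e9 bytes, which is the convention the round numbers in this post imply.

```python
# Bytes per parameter for each static component, copied from the table above.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "master_weights_fp32": 4,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
}


def static_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Static training memory in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(bytes_per_param.values()) / 1e9


training_gb = static_memory_gb(7e9)   # full mixed-precision training state
inference_gb = 7e9 * 2 / 1e9          # bf16 weights only
print(f"training: {training_gb:.0f} GB, inference weights: {inference_gb:.0f} GB")
```

Swapping entries in the dictionary is a quick way to check what any later technique buys you before touching a training script.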
On top of the static cost come activations: every intermediate tensor the backward pass needs. Activation memory scales with batch_size × sequence_length × hidden_dim × layers, so it explodes exactly when you want longer contexts or bigger batches. For a 7B model at 4k context with a modest batch, activations can add tens of gigabytes.
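The scaling rule can be turned into a rough estimator. This is a back-of-envelope sketch, not a profiler. The shape (hidden size 4096, 32 layers) is my assumption for a 7B-like model, and `tensors_per_layer` is a hypothetical fudge factor for how many batch × sequence × hidden tensors each layer keeps for the backward pass. The real number depends on the architecture and attention implementation.

```python
def activation_gb(batch, seq_len, hidden, layers,
                  tensors_per_layer=1, bytes_per_elem=2):
    """Rough activation memory in GB (1 GB = 1e9 bytes).

    Counts `tensors_per_layer` tensors of shape (batch, seq_len, hidden)
    per layer, stored at `bytes_per_elem` bytes each (2 for bf16).
    """
    elems = batch * seq_len * hidden * layers * tensors_per_layer
    return elems * bytes_per_elem / 1e9


# Assumed 7B-like shape: hidden=4096, layers=32 (not stated in the post).
floor_4k = activation_gb(batch=1, seq_len=4096, hidden=4096, layers=32)
floor_8k = activation_gb(batch=1, seq_len=8192, hidden=4096, layers=32)
# Hypothetical multiplier of 16 saved tensors per layer, for illustration only.
rough_4k = activation_gb(batch=1, seq_len=4096, hidden=4096, layers=32,
                         tensors_per_layer=16)

print(f"one tensor per layer at 4k: {floor_4k:.2f} GB")
print(f"one tensor per layer at 8k: {floor_8k:.2f} GB")
print(f"16 tensors per layer at 4k: {rough_4k:.1f} GB")
```

The useful part is the linearity: doubling the context or the batch doubles the estimate, whatever multiplier turns out to be right for your model.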
Technique 1: Activation checkpointing — trade compute for memory
Activation checkpointing (also called gradient checkpointing) stores only a subset of activations — typically the inputs to each transformer block — and recomputes the rest during the backward pass.
```python
# Hugging Face Transformers: one call on the model
model.gradient_checkpointing_enable()

# Or in raw PyTorch: wrap each block's forward call
from torch.utils.checkpoint import checkpoint
out = checkpoint(block, hidden_states, use_reentrant=False)
```
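The deal you’re making: roughly one extra forward pass of compute (~30–35% slower steps) in exchange for a much smaller activation footprint. Checkpointing every block still stores one input tensor per layer, but it drops the far larger intermediates inside each block. In the idealized analysis, keeping a checkpoint only every √layers layers takes stored activations from O(layers) to roughly O(√layers). In practice this is the single highest-leverage flag in memory-constrained training, and I enable it before anything else. A slower step that runs beats a fast step that OOMs.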
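Where the square root comes from is easiest to see in a toy model. The sketch below assumes an idealized chain in which every layer output costs one unit of memory, a checkpoint is kept every `every_k` layers, and only one segment is recomputed at a time. Real transformer blocks hold many intermediates per block, so treat the units as illustrative only. `peak_units` is a name I made up for this example.

```python
import math


def peak_units(n_layers, every_k):
    """Peak stored activations, in units of one layer output.

    Keeps one checkpoint per segment of `every_k` layers, plus the
    `every_k` activations of the single segment being recomputed.
    """
    checkpoints = math.ceil(n_layers / every_k)
    return checkpoints + every_k


n_layers = 32
# Search every spacing and keep the one with the smallest peak.
best_k = min(range(1, n_layers + 1), key=lambda k: peak_units(n_layers, k))

print(f"store everything: {n_layers} units")
print(f"checkpoint every {best_k} layers: {peak_units(n_layers, best_k)} units")
```

For 32 layers the minimum lands at 12 units, close to 2·√32 ≈ 11.3, which is the O(√layers) behavior in miniature.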
Technique 2: 8-bit optimizers — attack the biggest line item
Adam’s momentum and variance are 8 of the 16 bytes per parameter. Quantizing optimizer state to 8-bit (block-wise, with outlier handling) cuts that to ~2 bytes with negligible quality loss in most fine-tuning runs:
```python
import bitsandbytes as bnb
# Drop-in AdamW replacement that keeps both moment buffers in 8-bit
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)
```
Static memory drops from 16 to ~10 bytes/param. For a 7B model that’s 42 GB back. Alternatives worth knowing: Adafactor (factorized second moment) and fused optimizers that keep master weights in bf16 — each shaves bytes with different stability tradeoffs.
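To see why the "block-wise" part matters, here is a deliberately simplified sketch in plain Python. It uses linear absmax scaling per block, which is my simplification: the real bitsandbytes optimizers use a non-linear quantization map and fused kernels. The point it illustrates is that an outlier only degrades the precision of its own block.

```python
def quantize_blockwise(values, block_size):
    """Quantize to int8 codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Invert quantize_blockwise, up to rounding error."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Toy optimizer state with a single outlier (8.0) in the second half.
state = [0.010, -0.020, 0.015, 0.005, 8.0, 0.010, -0.020, 0.030]

per_block = dequantize_blockwise(*quantize_blockwise(state, 4), 4)
one_scale = dequantize_blockwise(*quantize_blockwise(state, 8), 8)

print("block size 4:", [f"{v:.4f}" for v in per_block[:4]])
print("block size 8:", [f"{v:.4f}" for v in one_scale[:4]])
```

With a single scale for all eight values, the outlier sets the step size and the small entries in the first half round to zero. With blocks of four, they survive with an error below 1e-4.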
Technique 3: Quantized weights + LoRA — the QLoRA recipe
If full fine-tuning still doesn’t fit, stop updating all the weights. The QLoRA recipe: freeze the base model in 4-bit NF4, train small LoRA adapters in bf16 on top.
- Base weights: ~0.5 bytes/param → a 7B base in ~3.5 GB
- Trainable params: ~0.1–1% of the model, so optimizer state is negligible
- Result: 7B fine-tuning fits comfortably on a 12–24 GB consumer GPU
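The bullets above can be checked with a few lines of arithmetic. This is a sketch under assumptions that are mine, not the post's: a 7B-like shape with 32 layers and hidden size 4096, LoRA rank 16 applied to four square attention projections per layer, and adapters trained with the same 16 bytes/param of static state as full mixed-precision training.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed shape and LoRA config (hypothetical, for illustration).
layers, hidden, rank, adapted_per_layer = 32, 4096, 16, 4
base_params = 7e9

trainable = layers * adapted_per_layer * lora_params(hidden, hidden, rank)
trainable_pct = 100 * trainable / base_params

base_gb = base_params * 0.5 / 1e9    # 4-bit NF4 base, ~0.5 bytes/param
adapter_gb = trainable * 16 / 1e9    # weights + grads + master + Adam states

print(f"trainable params: {trainable:,} ({trainable_pct:.2f}% of base)")
print(f"static memory: {base_gb:.1f} GB base + {adapter_gb:.2f} GB adapters")
```

Under these assumptions the adapters are about a quarter of a percent of the model, so their full-precision optimizer state costs well under 1 GB. Nearly all of the remaining budget goes to activations.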
The catch people skip: 4-bit training quality is task-dependent. For instruction tuning and style adaptation, QLoRA is usually indistinguishable from full fine-tuning. For heavy domain shift or continued pretraining, measure before you commit.
Technique 4: CPU offloading — when the GPU is simply too small
Offloading (DeepSpeed ZeRO-Offload / ZeRO-3, FSDP with CPU offload) parks optimizer state — and optionally parameters — in system RAM, streaming them to the GPU as needed.
It works, and it’s the reason a 13B full fine-tune is possible on a single 24 GB card at all. But be honest about the cost: every offloaded step crosses PCIe at ~32 GB/s (Gen4 x16), against on-device memory bandwidth of roughly 1 TB/s on a 24 GB consumer card and 2–3 TB/s on HBM datacenter parts. Offloading is a capacity tool, not a throughput tool. My rule: exhaust checkpointing, 8-bit optimizers, and quantization first; offload only what still doesn’t fit.
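A quick lower bound shows what the bandwidth gap means per step. The traffic model here is my assumption: with optimizer state offloaded, each optimizer step sends bf16 gradients to the CPU (2 bytes/param) and brings updated bf16 weights back (2 bytes/param), with no overlap between transfer and compute. Real frameworks overlap some of this, so read the result as an order of magnitude.

```python
def transfer_seconds(n_params, bytes_per_param, bandwidth_gb_s):
    """Time to move n_params * bytes_per_param bytes at the given bandwidth."""
    return n_params * bytes_per_param / 1e9 / bandwidth_gb_s


n_params = 7e9
# Assumed traffic per optimizer step: 2 bytes of gradients out + 2 bytes of weights in.
bytes_moved_per_param = 4

pcie_seconds = transfer_seconds(n_params, bytes_moved_per_param, 32)    # PCIe Gen4 x16
hbm_seconds = transfer_seconds(n_params, bytes_moved_per_param, 2000)   # 2 TB/s HBM

print(f"PCIe: {pcie_seconds:.3f} s per step")
print(f"HBM:  {hbm_seconds:.3f} s for the same bytes")
```

Close to a second of pure transfer per optimizer step for a 7B model is why I treat offloading as the last resort, and why large gradient-accumulation counts help amortize it.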
The order I actually apply things
| Step | Technique | Cost |
|---|---|---|
| 1 | Mixed precision (bf16) | none — do it always |
| 2 | Activation checkpointing | ~30% step time |
| 3 | 8-bit optimizer | negligible |
| 4 | Gradient accumulation (smaller micro-batch) | none at fixed tokens/step |
| 5 | QLoRA (4-bit base + adapters) | task-dependent quality |
| 6 | CPU offloading | PCIe-bound throughput |
Each step down the table trades more away. The skill isn’t knowing the techniques — it’s stopping at the highest row that fits.
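"Stop at the highest row that fits" can be written down as a tiny search. This sketch only models the rows that change static bytes per parameter, ignores the small LoRA adapter state, and takes the activation budget as an input you estimate yourself. The `LADDER` list and `first_fit` function are hypothetical names for illustration.

```python
# (configuration, static bytes per parameter), ordered from least to most compromise.
LADDER = [
    ("bf16 + Adam, full fine-tune", 16.0),
    ("bf16 + 8-bit Adam, full fine-tune", 10.0),
    ("QLoRA, 4-bit base (adapter state ignored)", 0.5),
]


def first_fit(n_params, gpu_gb, activation_gb):
    """Return the first configuration whose static + activation memory fits."""
    for name, bytes_per_param in LADDER:
        needed_gb = n_params * bytes_per_param / 1e9 + activation_gb
        if needed_gb <= gpu_gb:
            return name
    return "CPU offloading"


# activation_gb=6 is an assumed post-checkpointing budget, not a measurement.
for gpu_gb in (160, 80, 24):
    print(f"7B on {gpu_gb} GB: {first_fit(7e9, gpu_gb, activation_gb=6)}")
print(f"70B on 24 GB: {first_fit(70e9, 24, activation_gb=6)}")
```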
FAQ
How much GPU memory do I need to fine-tune a 7B model? Full fine-tuning in bf16 with Adam: ~112 GB of static state plus activations (multi-GPU territory). With checkpointing + 8-bit Adam: ~70 GB of static state plus checkpointed activations, roughly 80 GB in total. QLoRA: 12–24 GB on a single card.
Does activation checkpointing change model quality? No. It recomputes the same values the original forward pass produced, so gradients match up to floating-point nondeterminism in the recomputed kernels. It only costs step time.
Is 4-bit quantization safe for inference too? Mostly yes for weights (NF4/GPTQ/AWQ at 4-bit typically costs a fraction of a point on standard evals), but KV-cache and activation quantization are where quality gets fragile. Different problem, different post.
Why not just use gradient accumulation for everything? Accumulation reduces activation memory by shrinking the micro-batch, but does nothing for the 16 bytes/param of static state — which is usually the wall.
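For readers who haven't convinced themselves that accumulation is free in terms of the math, here is a minimal sketch with a one-parameter linear model and made-up data. It checks that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient, which is why only activation memory changes.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full_grad = mse_grad(w, xs, ys)   # one batch of 8

micro_batch = 2
accum_steps = len(xs) // micro_batch
accumulated = 0.0
for i in range(0, len(xs), micro_batch):
    g = mse_grad(w, xs[i:i + micro_batch], ys[i:i + micro_batch])
    accumulated += g / accum_steps   # scale so the sum is a mean over micro-batches

print(f"full batch:  {full_grad:.6f}")
print(f"accumulated: {accumulated:.6f}")
```

The division by `accum_steps` is the part people forget in hand-written loops; without it the effective learning rate silently scales with the number of micro-batches.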