2026-06-12

How to Fit Large Language Models on Small GPUs

Where GPU memory actually goes during LLM training, and how activation checkpointing, quantization, 8-bit optimizers, and CPU offloading win it back.

table of contents

Where does GPU memory go during training?
Technique 1: Activation checkpointing — trade compute for memory
Technique 2: 8-bit optimizers — attack the biggest line item
Technique 3: Quantized weights + LoRA — the QLoRA recipe
Technique 4: CPU offloading — when the GPU is simply too small
The order I actually apply things
FAQ

TL;DR: Training memory is dominated by optimizer state and activations, not weights. A 7B model needs ~112 GB of static memory to train naively in mixed precision with Adam, before counting activations. Activation checkpointing and an 8-bit optimizer cut that substantially, and freezing the base model in 4-bit with LoRA adapters on top brings 7B fine-tuning onto a single 24 GB card. This post walks through the memory math and the exact order I apply these techniques.

“CUDA out of memory” is the most common error in deep learning, and the most commonly misdiagnosed. The instinct is to shrink the batch size and hope. The better move is to know where every byte goes — because once you can account for the memory, you can negotiate with it.

Where does GPU memory go during training?

For standard mixed-precision training with Adam, per-parameter memory looks like this:

Component	Precision	Bytes/param
Weights	bf16	2
Gradients	bf16	2
Master weights	fp32	4
Adam momentum	fp32	4
Adam variance	fp32	4
Total (static)		16

Sixteen bytes per parameter. A 7B model is 112 GB before a single token is processed. That is the number that matters — not the 14 GB you see quoted for “a 7B model”, which is inference-only weight memory.
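
The arithmetic is simple enough to keep in a scratch file. A minimal sketch that reproduces the table; the function names are mine, and GB here means 10^9 bytes:

```python
# Bytes per parameter for mixed-precision Adam, copied from the table above.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "master_weights_fp32": 4,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
}


def static_training_gb(n_params: float) -> float:
    """Static training memory in GB (1e9 bytes), excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def inference_weights_gb(n_params: float, bytes_per_weight: int = 2) -> float:
    """Weight-only memory in GB, the figure usually quoted for inference."""
    return n_params * bytes_per_weight / 1e9


print(static_training_gb(7e9))    # 112.0
print(inference_weights_gb(7e9))  # 14.0
```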

On top of the static cost come activations: every intermediate tensor the backward pass needs. Activation memory scales with batch_size × sequence_length × hidden_dim × layers, so it explodes exactly when you want longer contexts or bigger batches. For a 7B model at 4k context with a modest batch, activations can add tens of gigabytes.
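
To get a feel for that scaling, here is a rough sketch. The model shape (hidden_dim=4096, 32 layers) is an assumed 7B-like configuration, and `tensors_per_layer=16` is an illustrative guess, not a measurement; the real multiplier depends on the architecture and attention implementation.

```python
def saved_tensor_bytes(batch_size: int, seq_len: int, hidden_dim: int,
                       n_layers: int, bytes_per_value: int = 2) -> int:
    """Bytes to keep ONE (batch, seq, hidden) tensor per layer (bf16 by default)."""
    return batch_size * seq_len * hidden_dim * n_layers * bytes_per_value


def rough_activation_gb(batch_size: int, seq_len: int, hidden_dim: int,
                        n_layers: int, tensors_per_layer: int) -> float:
    """Very rough estimate: `tensors_per_layer` hidden-sized tensors saved per layer."""
    one = saved_tensor_bytes(batch_size, seq_len, hidden_dim, n_layers)
    return tensors_per_layer * one / 1e9


# ASSUMPTION: 7B-like shape, not taken from any specific model card.
base = saved_tensor_bytes(batch_size=2, seq_len=4096, hidden_dim=4096, n_layers=32)
print(base)  # 2147483648 bytes, about 2.1 GB for a single tensor per layer

# ASSUMPTION: 16 saved tensors per layer is a placeholder multiplier.
print(rough_activation_gb(2, 4096, 4096, 32, tensors_per_layer=16))

# Doubling the context length doubles the footprint.
print(saved_tensor_bytes(2, 8192, 4096, 32) / base)  # 2.0
```

Even at batch size 2, a plausible multiplier lands in the tens of gigabytes, and every factor in the product is one you tend to want to increase.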

Technique 1: Activation checkpointing — trade compute for memory

Activation checkpointing (also called gradient checkpointing) stores only a subset of activations — typically the inputs to each transformer block — and recomputes the rest during the backward pass.

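# Option A: HF Transformers (model is a PreTrainedModel)
model.gradient_checkpointing_enable()

# Option B: raw PyTorch, wrapping the call to one transformer block
# (block and hidden_states stand in for your own module and its input)
from torch.utils.checkpoint import checkpoint
out = checkpoint(block, hidden_states, use_reentrant=False)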

The deal you’re making: roughly one extra forward pass of compute (~30–35% slower steps). In exchange, each block keeps only its input rather than every intermediate tensor, and with sparser checkpoints (about one every √layers blocks) stored activations drop from O(layers) to roughly O(√layers). In practice this is the single highest-leverage flag in memory-constrained training, and I enable it before anything else. A slower step that runs beats a fast step that OOMs.
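
Where the √layers figure comes from can be seen with a toy cost model. This is a simplified sketch, not how any framework accounts for memory: it counts one unit per saved checkpoint plus one unit per layer in the segment currently being recomputed, and the function names are mine.

```python
import math


def peak_stored_activations(n_layers: int, segment_len: int) -> int:
    """Units held at peak when a checkpoint is saved every `segment_len` layers.

    Simplified model: one saved checkpoint per segment, plus the activations
    of the single segment being recomputed during the backward pass.
    """
    n_checkpoints = math.ceil(n_layers / segment_len)
    return n_checkpoints + segment_len


def best_segment_len(n_layers: int) -> int:
    """Smallest segment length that minimizes the toy peak cost."""
    return min(range(1, n_layers + 1),
               key=lambda k: peak_stored_activations(n_layers, k))


k = best_segment_len(32)
print(k, peak_stored_activations(32, k))  # 4 12
```

For 32 layers the minimum is 12 units, close to 2·√32 ≈ 11.3, against 32 with everything stored.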

Technique 2: 8-bit optimizers — attack the biggest line item

Adam’s momentum and variance are 8 of the 16 bytes per parameter. Quantizing optimizer state to 8-bit (block-wise, with outlier handling) cuts that to ~2 bytes with negligible quality loss in most fine-tuning runs:

import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

Static memory drops from 16 to ~10 bytes/param. For a 7B model that’s 42 GB back. Alternatives worth knowing: Adafactor (factorized second moment) and fused optimizers that keep master weights in bf16 — each shaves bytes with different stability tradeoffs.
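
Why block-wise matters is easiest to see on a toy example. The sketch below uses plain absmax rounding to int8 with one scale per block; it is a deliberately simplified stand-in, not the scheme bitsandbytes actually implements, and the function names are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Absmax int8 quantization with one float scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        scale = max(abs(v) for v in block) / 127 or 1.0
        codes = [round(v / scale) for v in block]  # each code lies in [-127, 127]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]


# Toy optimizer state with one large outlier (50.0) in the second block.
state = [0.001, -0.002, 0.0015, 0.0005, 50.0, 0.003, -0.001, 0.002]

per_block = dequantize_blockwise(quantize_blockwise(state, block_size=4))
one_block = dequantize_blockwise(quantize_blockwise(state, block_size=8))

print(per_block[0])  # close to 0.001: the outlier lives in a different block
print(one_block[0])  # 0.0: a single global scale flattens every small value
```

With per-block scales the outlier only damages its own block; with one global scale it wipes out all the small entries.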

Technique 3: Quantized weights + LoRA — the QLoRA recipe

If full fine-tuning still doesn’t fit, stop updating all the weights. The QLoRA recipe: freeze the base model in 4-bit NF4, train small LoRA adapters in bf16 on top.

Base weights: ~0.5 bytes/param → a 7B base in ~3.5 GB
Trainable params: ~0.1–1% of the model, so optimizer state is negligible
Result: 7B fine-tuning fits comfortably on a 12–24 GB consumer GPU
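
The numbers in that list can be sanity-checked with a few lines. This sketch assumes a 7B-like shape (hidden size 4096, 32 layers), rank-16 adapters on four square attention projections per layer, and the same 16 bytes/param for trainable adapter state; all of those are illustrative choices, not the settings of any particular run.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# ASSUMPTIONS: hypothetical 7B-like shape and adapter placement.
HIDDEN, LAYERS, RANK = 4096, 32, 16
ADAPTED_PER_LAYER = 4  # e.g. four HIDDEN x HIDDEN attention projections
N_BASE_PARAMS = 7e9

trainable = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
base_gb = N_BASE_PARAMS * 0.5 / 1e9       # 4-bit base: ~0.5 bytes/param
adapter_state_gb = trainable * 16 / 1e9   # weights + grads + Adam state

print(trainable)                          # 16777216
print(base_gb)                            # 3.5
print(round(100 * trainable / N_BASE_PARAMS, 2))  # 0.24 (percent of the model)
print(round(adapter_state_gb, 2))         # 0.27
```

Under these assumptions the static footprint is under 4 GB, which leaves the rest of the card for activations and quantization overhead.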

The catch people skip: QLoRA quality is task-dependent. The adapters train in bf16, but every forward pass still runs through a frozen 4-bit base. For instruction tuning and style adaptation, QLoRA is usually indistinguishable from full fine-tuning. For heavy domain shift or continued pretraining, measure before you commit.

Technique 4: CPU offloading — when the GPU is simply too small

Offloading (DeepSpeed ZeRO-Offload / ZeRO-3, FSDP with CPU offload) parks optimizer state — and optionally parameters — in system RAM, streaming them to the GPU as needed.

It works, and it’s the reason a 13B full fine-tune is possible on a single 24 GB card at all. But be honest about the cost: every offloaded step crosses PCIe at ~32 GB/s per direction (Gen4 x16), against on-device memory bandwidth of roughly 1 TB/s on a 24 GB consumer card and 2–3 TB/s on HBM parts. Offloading is a capacity tool, not a throughput tool. My rule: exhaust checkpointing, 8-bit optimizers, and quantization first; offload only what still doesn’t fit.
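
A back-of-envelope estimate makes the gap concrete. The sketch assumes a traffic pattern where bf16 gradients go to the CPU and updated bf16 weights come back once per step, at peak bandwidth with no overlap with compute; real offload engines overlap transfers and rarely hit peak, so treat the result as an order of magnitude only.

```python
def transfer_seconds(n_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move n_bytes at a given sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


N_PARAMS = 13e9
PCIE_GEN4_X16 = 32e9  # ~32 GB/s, the figure quoted in the text
HBM = 2e12            # low end of the 2-3 TB/s range quoted in the text

# ASSUMPTION: per step, 2 bytes/param out (gradients) and 2 bytes/param back (weights).
bytes_per_step = N_PARAMS * 2 + N_PARAMS * 2

pcie_s = transfer_seconds(bytes_per_step, PCIE_GEN4_X16)
hbm_s = transfer_seconds(bytes_per_step, HBM)

print(pcie_s)          # 1.625 seconds of pure transfer per step
print(pcie_s / hbm_s)  # PCIe is roughly 60x slower for the same bytes
```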

The order I actually apply things

Step	Technique	Cost
1	Mixed precision (bf16)	none — do it always
2	Activation checkpointing	~30% step time
3	8-bit optimizer	negligible
4	Gradient accumulation (smaller micro-batch)	none at fixed tokens/step
5	QLoRA (4-bit base + adapters)	task-dependent quality
6	CPU offloading	PCIe-bound throughput

Each step down the table trades more away. The skill isn’t knowing the techniques — it’s stopping at the highest row that fits.

FAQ

How much GPU memory do I need to fine-tune a 7B model? Full fine-tuning in bf16 with Adam: ~112 GB plus activations (multi-GPU territory). With checkpointing + 8-bit Adam: ~70 GB of static state plus checkpointed activations, so ~80 GB in practice. QLoRA: 12–24 GB on a single card.

Does activation checkpointing change model quality? No. It recomputes the same values the forward pass produced, so the gradients are mathematically identical; any difference in the loss curve is floating-point noise from nondeterministic kernels. It only costs step time.

Is 4-bit quantization safe for inference too? Mostly yes for weights (NF4/GPTQ/AWQ at 4-bit typically costs a fraction of a point on standard evals), but KV-cache and activation quantization are where quality gets fragile. Different problem, different post.

Why not just use gradient accumulation for everything? Accumulation reduces activation memory by shrinking the micro-batch, but does nothing for the 16 bytes/param of static state — which is usually the wall.
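
What accumulation does preserve is the gradient itself. A minimal sketch with a one-parameter linear model and made-up data: scaling each micro-batch gradient by 1/accum_steps and summing reproduces the full-batch gradient, provided the micro-batches are equal in size.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) for a one-parameter linear model."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.5]
w = 0.5

full = grad_mse(w, xs, ys)

accum_steps = 2
micro = len(xs) // accum_steps
accumulated = 0.0
for i in range(accum_steps):
    start, stop = i * micro, (i + 1) * micro
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad_mse(w, xs[start:stop], ys[start:stop]) / accum_steps

print(full)         # -23.5
print(accumulated)  # -23.5
```

Only one micro-batch of activations is alive at a time, while weights, gradients, and optimizer state stay exactly as large as before.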
