TL;DR — SFT teaches a model what to do; preference optimization teaches it what better looks like; RL with verifiable rewards teaches it what correct means where correctness is checkable. Most teams should start with high-quality SFT, add DPO when they have real preference signal, and reserve online RL for when they have either a strong reward model or a verifier. The bottleneck at every stage is data quality, not algorithm choice.

A base model is a spectacular autocomplete with no opinion about being useful. Post-training is where the actual product gets made — and it’s also where teams burn the most compute on the wrong stage of the pipeline. Here’s the decision logic I use.

What each stage actually does

Supervised fine-tuning (SFT) is imitation: show the model demonstrations, minimize token-level cross-entropy. It’s unbeatable for format, tone, tool-call syntax, and task coverage — anything you can demonstrate. Its ceiling: the model can’t exceed its teacher, and it learns to imitate confidence as readily as competence.
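To make "token-level cross-entropy" concrete, here is a minimal sketch of the SFT objective for a single demonstration. The function name `sft_loss`, the example probabilities, and the choice to mask out prompt tokens are all assumptions for illustration; masking the prompt is a common convention, not something this post prescribes.

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the tokens selected by loss_mask."""
    picked = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not picked:
        raise ValueError("loss_mask selects no tokens")
    return sum(picked) / len(picked)

# Hypothetical log-probs the model assigns to each token of one demonstration.
# The first two positions are prompt tokens, so they are masked out (assumption).
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]

loss = sft_loss(logprobs, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

Nothing in this objective asks whether the demonstration was right. It only rewards assigning high probability to whatever the teacher wrote, which is why confidence gets imitated as readily as competence.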

Preference optimization (RLHF with PPO, or the DPO family) works on comparisons instead of demonstrations: response A beats response B. This captures the things that are easy to judge but hard to demonstrate — helpfulness, harmlessness, taste. The LIMA result still holds up as intuition: a base model already has most capabilities; a surprisingly small amount of high-quality alignment data surfaces them. What preference tuning adds beyond SFT is calibration to human judgment, not new knowledge.
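For the DPO side of that family, the per-pair loss can be sketched in a few lines. This assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and a frozen reference model; the argument names and the `beta=0.1` default are illustrative, not a recommendation.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is a summed log-probability of a full response.
    """
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: margin is zero, loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(f"neutral={neutral:.4f} improved={improved:.4f}")
```

The loss only sees the relative shift between two existing responses. It has no term that could inject facts the model never had, which fits the claim that preference tuning calibrates rather than teaches.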

RL with verifiable rewards (RLVR) replaces the learned reward model with a checker: unit tests for code, answer matching for math, execution for agents. Where a verifier exists, it’s the strongest signal available — the model can genuinely exceed its demonstrations because it’s optimizing against correctness, not imitation.
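As a rough illustration of the answer-matching case, a verifier can be as small as the sketch below. The "last number in the response is the final answer" convention and both function names are assumptions made for this example; real pipelines usually require a stricter output format.

```python
import re

def extract_final_answer(response):
    """Return the last number in the response as a string, or None.

    Treating the last number as the final answer is a hypothetical convention.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else None

def math_reward(response, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0

print(math_reward("12 * 3 = 36, so the total is 36.", "36"))  # correct answer
print(math_reward("I am not sure.", "36"))                     # no answer found
```

Even this toy checker has exploitable edges: a response that rambles and happens to end on the right number still scores 1.0.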

The failure modes that define the tradeoffs

SFT overfits to style. Push too many epochs on narrow data and you get a model that has memorized the shape of a good answer — confident structure, hedged caveats — detached from content.

Reward models get Goodharted. Optimize hard against a learned RM and the policy finds its blind spots: longer answers, more bullet points, flattery. Length bias is the classic tell — if mean response length is climbing while win rates stay flat, you’re optimizing the RM, not quality. This is why PPO pipelines lean on KL penalties against a reference policy and why RM refreshes matter.
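The KL penalty mentioned above is often implemented by subtracting a scaled estimate of the policy-versus-reference divergence from the RM score. The sketch below uses the simple summed log-ratio over sampled tokens as that estimate; the function name and the `kl_coef` value are assumptions, and production PPO code typically applies the penalty per token rather than per sequence.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Reward-model score minus a KL penalty against the reference policy.

    policy_logprobs / ref_logprobs: per-token log-probs of the sampled response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio on the sampled tokens: a single-sample KL estimate.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - kl_coef * kl_estimate

# Same RM score, but the second response drifted far from the reference policy.
close = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.5, -1.5], kl_coef=0.1)
print(close, drifted)
```

The penalty makes it more expensive for the policy to wander into regions where the RM was never trained, which is where its blind spots tend to live.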

DPO inherits its dataset’s biases, permanently. DPO is offline: it can’t explore, so whatever spurious correlations live in your preference pairs get baked in. It’s also why DPO on preferences sampled from someone else’s model underperforms — the comparisons are off-policy for you.

Verifiers get gamed too. Weak test suites teach models to pass tests rather than write correct code. A verifier is just a reward model you wrote by hand — audit it like one.

A practical decision table

| Situation | Recipe |
|---|---|
| Model won’t follow format / instructions | SFT on 1–10k high-quality demonstrations |
| Outputs correct but tone/quality inconsistent | SFT, then DPO on ~10k+ genuine preference pairs |
| Have budget + strong RM, chasing top-line quality | Online RL (PPO-family) with KL control |
| Domain has checkable answers (code, math, agents) | RLVR; GRPO-style methods keep infra manageable |
| No preference data yet | Do not synthesize it from the same model you’re training and call it alignment; start with rejection sampling against a rubric |
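Part of why GRPO-style methods keep infrastructure manageable is that they drop the learned value network: the baseline for each response is just the statistics of its own sampling group. A minimal sketch of that group-relative advantage follows; the function name, the use of population standard deviation, and the `eps` constant are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 0/1 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every sample gets the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second call shows a practical consequence: prompts the model always solves, or never solves, contribute zero advantage, so prompt difficulty matters as much as the verifier.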

Where synthetic data fits

Synthetic data is now load-bearing in every serious pipeline, in three honest roles: coverage (generating task diversity for SFT that no human team would write), rejection sampling (generate N candidates, keep what a verifier or strong judge accepts — quality comes from the filter, not the generator), and AI feedback (RLAIF/constitutional-style preference labels, which scale where human annotation can’t).
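The rejection-sampling role is simple enough to sketch directly. Here `generate` and `accept` are hypothetical callables standing in for a model sampler and a verifier or judge; the toy versions exist only so the example runs on its own.

```python
def rejection_sample(prompt, generate, accept, n=8):
    """Generate n candidates and keep only those the filter accepts."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return [c for c in candidates if accept(prompt, c)]

def toy_generate(prompt, seed):
    # Hypothetical stand-in for model sampling: cycles through 3, 4, 5.
    return str(3 + seed % 3)

def toy_accept(prompt, candidate):
    # Hypothetical stand-in for a verifier with a known correct answer.
    return candidate == "4"

kept = rejection_sample("What is 2 + 2?", toy_generate, toy_accept, n=6)
print(kept)  # → ['4', '4']
```

The generator here is wrong two times out of three and the kept set is still clean, which is the point: the quality of the resulting data is bounded by `accept`, not by `generate`.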

The dishonest role: laundering a model’s own distribution through itself and expecting improvement. Without an external signal — a verifier, a stronger judge, real-world outcomes — self-training compounds biases instead of correcting them.

The uncomfortable summary

Algorithm choice is the least important decision in post-training. SFT vs DPO vs PPO moves your evals less than the quality of demonstrations, the honesty of preference labels, and the strength of your verifiers. Post-training is a data problem wearing an RL costume — budget accordingly.

FAQ

Can I skip SFT and go straight to DPO? Usually no. DPO assumes the policy can already produce responses in the neighborhood of both chosen and rejected samples. Without SFT the pairs are off-distribution and training is unstable. A short SFT stage first is cheap insurance.

How many preference pairs do I need for DPO? Order of magnitude: 5k–50k genuine, on-policy-ish pairs beat millions of scraped or synthetic ones. If two annotators wouldn’t agree on a pair, it’s noise, not signal.
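One way to act on the annotator-agreement rule is to filter pairs by vote share before training. The record layout (`"a"`, `"b"`, `"votes"`) and the default threshold of unanimous agreement are assumptions for this sketch; adapt them to however your labels are stored.

```python
def filter_agreed_pairs(pairs, min_agreement=1.0):
    """Keep pairs where the annotators' majority share meets min_agreement."""
    kept = []
    for pair in pairs:
        votes = pair["votes"]  # each vote is "a" or "b"
        if not votes:
            continue
        a_share = votes.count("a") / len(votes)
        agreement = max(a_share, 1.0 - a_share)
        if agreement < min_agreement:
            continue  # annotators disagree: treat as noise
        winner, loser = ("a", "b") if a_share > 0.5 else ("b", "a")
        kept.append({
            "prompt": pair["prompt"],
            "chosen": pair[winner],
            "rejected": pair[loser],
        })
    return kept

pairs = [
    {"prompt": "p1", "a": "clear answer", "b": "vague answer", "votes": ["a", "a", "a"]},
    {"prompt": "p2", "a": "x", "b": "y", "votes": ["a", "b"]},
    {"prompt": "p3", "a": "x", "b": "y", "votes": ["b", "b"]},
]
clean = filter_agreed_pairs(pairs)
print(len(clean), [p["prompt"] for p in clean])
```

Tracking how many pairs the filter drops is useful on its own: a high drop rate usually points at an ambiguous rubric rather than careless annotators.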

Is PPO dead now that DPO exists? No. Offline methods keep closing the gap, but when you have a good reward model or a verifier, online exploration still wins — the model finds behaviors no offline dataset contains. The real question is whether your team can afford the infrastructure and the debugging.

What should I evaluate after post-training? Three things, always: target-task win rate (humans or a strong judge), regression suite on capabilities you didn’t train (post-training can silently tax math and code), and distribution drift — length, refusal rate, hedging. Improvement on one axis with silent regression on the others is the default outcome, not the exception.
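The distribution-drift check is the cheapest of the three to automate. The sketch below compares mean response length and refusal rate before and after post-training; the substring-based refusal detector and the marker phrases are crude assumptions, and a real setup would likely use a classifier or a judge model instead.

```python
def drift_report(before, after, refusal_markers=("I can't", "I cannot")):
    """Compare mean word count and refusal rate between two response sets."""
    def stats(responses):
        n = len(responses)
        mean_len = sum(len(r.split()) for r in responses) / n
        refusal_rate = sum(
            any(marker in r for marker in refusal_markers) for r in responses
        ) / n
        return mean_len, refusal_rate

    len_before, ref_before = stats(before)
    len_after, ref_after = stats(after)
    return {
        "mean_length_ratio": len_after / len_before,
        "refusal_rate_delta": ref_after - ref_before,
    }

# Responses to the same two prompts, before and after post-training (toy data).
before = ["Paris.", "Use a for loop."]
after = ["The answer is Paris.", "I cannot help with that."]
report = drift_report(before, after)
print(report)
```

Run it on a fixed prompt set at every checkpoint. A length ratio that keeps climbing while win rate stays flat is the same length-bias signal described in the failure modes above.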