2026-06-25 data/ 4 min read

Data Curation for LLMs: Filtering, Deduplication, and Mixing in Practice

A practical walkthrough of the LLM data pipeline — quality filtering, exact and near deduplication with MinHash, decontamination, and mixture weights.

table of contents

Stage 1: Extraction — most “web text” is not text
Stage 2: Quality filtering — heuristics first, classifiers second
Stage 3: Deduplication — the highest-leverage stage
Stage 4: Decontamination — don’t grade yourself on the training set
Stage 5: Mixing — the quiet hyperparameter
The loop that makes it a system
FAQ

TL;DR: Model quality is decided upstream of the training loop. A defensible data pipeline has five stages: extraction, quality filtering, deduplication (exact → near → semantic), decontamination, and mixing. Deduplication is the highest-leverage stage: it reduces memorization, keeps repeated copies of benchmark text from inflating your numbers, and stops you paying to train on the same document fifty times.

Every team I’ve seen debate architecture for weeks will approve a training dataset in an afternoon. The irony is that between two modern architectures the difference is a rounding error, while between two datasets it’s the difference between a model people use and one they screenshot for the group chat.

Here’s the pipeline I consider table stakes, stage by stage.

Stage 1: Extraction — most “web text” is not text

Raw web dumps are markup soup. The extractor you choose (trafilatura, resiliparse, a custom DOM pass) silently decides your data distribution: how tables survive, whether code blocks keep indentation, whether boilerplate navigation becomes 15% of your tokens. Run two extractors over the same WARC files and diff the output — the disagreement rate is usually a shock the first time.

Stage 2: Quality filtering — heuristics first, classifiers second

Cheap heuristics remove the worst offenders at almost no compute cost: documents that are too short or too long, mean word length outliers, symbol-to-word ratio, bullet-point density, repeated-line fraction, language ID confidence. This is the Gopher/C4-style rule set and it still earns its keep.
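To make the rule set concrete, here is a minimal sketch of a few of those heuristics. Every threshold and the function name `heuristic_flags` are illustrative assumptions of mine, not values from Gopher or C4, and language ID is left out because it needs an external model.

```python
from collections import Counter

# All thresholds are illustrative assumptions; tune them with ablations.
MIN_WORDS, MAX_WORDS = 50, 100_000
MEAN_WORD_LEN = (3.0, 10.0)
MAX_SYMBOL_RATIO = 0.1            # '#' and '...' occurrences per word
MAX_BULLET_FRACTION = 0.9         # share of lines that start with a bullet
MAX_REPEATED_LINE_FRACTION = 0.3  # share of lines that occur more than once


def heuristic_flags(text: str) -> list[str]:
    """Return the names of the rules a document fails (empty list = keep)."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    flags = []

    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        flags.append("length")
    if words:
        mean_len = sum(len(w) for w in words) / len(words)
        if not MEAN_WORD_LEN[0] <= mean_len <= MEAN_WORD_LEN[1]:
            flags.append("mean_word_length")
        symbols = text.count("#") + text.count("...")
        if symbols / len(words) > MAX_SYMBOL_RATIO:
            flags.append("symbol_ratio")
    if lines:
        bullets = sum(ln.startswith(("-", "*", "•")) for ln in lines)
        if bullets / len(lines) > MAX_BULLET_FRACTION:
            flags.append("bullet_density")
        counts = Counter(lines)
        repeated = sum(c for c in counts.values() if c > 1)
        if repeated / len(lines) > MAX_REPEATED_LINE_FRACTION:
            flags.append("repeated_lines")
    return flags
```

Returning the names of the failed rules, rather than a bare boolean, makes it cheap to log which rule removed which document.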

Then model-based filtering: score documents with a small classifier trained to predict “high quality” — where the definition of quality is yours. The FineWeb-Edu lesson generalizes: a classifier trained on “is this educational?” beat far more elaborate pipelines. The classifier doesn’t need to be smart; it needs to encode a consistent preference at corpus scale.

# The shape of every scalable quality filter:
def keep(document, small_model, threshold):
    score = small_model(document)      # cheap, runs on billions of docs
    return score > threshold           # threshold set by eval ablations,
                                       # never by eyeballing examples

The failure mode to respect: every filter is a distribution edit. Aggressive perplexity filtering quietly deletes code, math, and non-standard dialects. Track what you remove, not just how much.
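One way to track what a filter removes is to report removal rates per slice. The sketch below assumes each document already carries a slice label (domain, language, format); `removal_rates` and the toy `mostly_letters` filter are hypothetical stand-ins for your own pipeline. The toy filter is not a perplexity filter, but it fails the same way.

```python
from collections import Counter


def removal_rates(docs, keep):
    """Fraction of documents removed per slice.

    `docs` is an iterable of (slice_label, text) pairs and `keep` is any
    filter predicate. Both are hypothetical stand-ins for a real pipeline.
    """
    seen, removed = Counter(), Counter()
    for label, text in docs:
        seen[label] += 1
        if not keep(text):
            removed[label] += 1
    return {label: removed[label] / seen[label] for label in seen}


def mostly_letters(text: str) -> bool:
    # Toy filter (an assumption): keep text that is >= 80% letters or spaces.
    return sum(c.isalpha() or c.isspace() for c in text) / len(text) >= 0.8


docs = [
    ("prose", "The cat sat on the mat and looked out of the window."),
    ("prose", "Rain is expected later in the week across the region."),
    ("code", "for (i = 0; i < n; ++i) { a[i] += b[i]; }"),
    ("code", "def f(x): return x ** 2"),
]
rates = removal_rates(docs, mostly_letters)
print(rates)
```

On this toy sample the filter keeps all the prose and removes all the code, which is exactly the kind of silent distribution edit an aggregate removal count hides.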

Stage 3: Deduplication — the highest-leverage stage

Three tiers, in order of cost:

Exact dedup. Hash normalized text (casefold, collapse whitespace) and keep one document per hash. Trivial and non-negotiable.
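A minimal sketch of that exact-dedup pass, using only the standard library. The NFKC step and the choice of SHA-256 are my assumptions; the casefold and whitespace collapse follow the description above.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # NFKC is an extra assumption; casefold + whitespace collapse are from the text.
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def exact_dedup(docs):
    """Yield the first document seen for each normalized-content hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


unique = list(exact_dedup(["Hello  World", "hello world\n", "Goodbye"]))
```

Storing digests instead of full text keeps memory per document constant. At corpus scale the `seen` set would live in a sharded or on-disk store rather than in one process.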

Near dedup. MinHash + LSH over shingles catches the web’s true nature: the same article with different ads, dates, and footers.

from datasketch import MinHash, MinHashLSH

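def sketch(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    # 5-gram shingles; max(1, ...) gives documents under five tokens one
    # shingle, so their sketches are not all left empty and identical.
    for i in range(max(1, len(tokens) - 4)):
        m.update(' '.join(tokens[i:i + 5]).encode('utf-8'))
    return m

# Approximate Jaccard cutoff of 0.8; num_perm must match sketch().
lsh = MinHashLSH(threshold=0.8, num_perm=128)
# lsh.insert(doc_id, sketch(text)) indexes a document;
# lsh.query(sketch(text)) returns the ids of likely near-duplicates.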

At Jaccard ≈ 0.8 you’ll typically flag 20–40% of a raw web corpus as near-duplicate. That’s not a bug in your pipeline; that’s the web.
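MinHash only estimates Jaccard similarity, and the LSH threshold is approximate, so it helps to spot-check a small sample against the exact value. The sketch below is a brute-force reference under my own assumptions (the helper names are hypothetical, and the shingling mirrors the 5-gram scheme above). It is quadratic in the number of documents, so it is for samples only.

```python
def shingles(text: str, k: int = 5) -> set:
    """Word k-gram shingles; short documents get a single shingle."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dup_pairs(docs: dict, threshold: float = 0.8) -> list:
    """All pairs of doc ids whose exact Jaccard is >= threshold. O(n^2)."""
    sets = {key: shingles(text) for key, text in docs.items()}
    keys = sorted(sets)
    return [
        (a, b)
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
        if jaccard(sets[a], sets[b]) >= threshold
    ]


body = " ".join(f"w{i}" for i in range(100))
sample = {
    "a": body,
    "b": body + " footer",  # same article, one extra trailing token
    "c": "something else entirely here now today",
}
pairs = near_dup_pairs(sample)
```

Comparing these exact pairs with what the LSH index returns on the same sample gives a rough read on false positives and false negatives before you commit to a threshold.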

Semantic dedup. Embed documents, cluster, drop near-neighbors (the SemDeDup idea). Catches paraphrases and templated content the other tiers miss. Expensive — reserve it for high-value subsets.
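As a rough illustration of the drop-near-neighbors step, the sketch below assumes embeddings are already computed and skips clustering entirely, so it is a simplification of the SemDeDup idea rather than an implementation of it. The 0.95 cosine threshold and the function names are placeholders.

```python
import math


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings: dict, threshold: float = 0.95) -> list:
    """Greedy pass: keep a doc only if no already-kept doc is within `threshold`.

    `embeddings` maps doc id -> vector (assumed precomputed). Comparing against
    every kept doc is quadratic; clustering first would restrict comparisons
    to documents in the same cluster.
    """
    kept = []
    for key, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(key)
    return kept


toy = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0]}
survivors = semantic_dedup(toy)
```

The greedy order decides which member of a near-duplicate group survives, so sort by a quality score first if you have one.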

Why care this much? Three compounding reasons: duplicated text is memorized (extraction attacks get easier), duplicated benchmark text contaminates evals (your MMLU score becomes fiction), and duplicated anything wastes compute (you’re paying H100 hours to re-learn the same document).

Stage 4: Decontamination — don’t grade yourself on the training set

Before anything ships, scan the corpus for n-gram overlap (13-grams is a common choice) against every benchmark you report. Remove hits, and report the hit rate itself — a corpus with 0.0% benchmark overlap is a corpus nobody checked.
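A minimal sketch of that scan, assuming whitespace tokenization and an in-memory set of benchmark n-grams (real pipelines usually normalize more aggressively and shard the index). The function names are hypothetical. Benchmark items shorter than n tokens produce no n-grams here and would need separate handling.

```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n: int = 13) -> set:
    """Union of all n-grams across every benchmark item you report on."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(docs, index, n: int = 13):
    """Return (clean_docs, hit_rate); a doc is a hit if it shares any n-gram."""
    clean, hits = [], 0
    for doc in docs:
        if ngrams(doc, n) & index:
            hits += 1
        else:
            clean.append(doc)
    return clean, (hits / len(docs) if docs else 0.0)


question = " ".join(f"q{i}" for i in range(15))
index = build_benchmark_index([question])
corpus = [
    "intro text " + question + " trailing",  # leaked benchmark item
    " ".join(f"d{i}" for i in range(30)),    # unrelated document
]
clean, hit_rate = decontaminate(corpus, index)
```

Returning the hit rate alongside the cleaned corpus makes it easy to report the number rather than only act on it.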

Stage 5: Mixing — the quiet hyperparameter

You now have clean pools: web, code, math, papers, books, dialogue. The mixture weights over these pools are among the most consequential hyperparameters you own, and most teams set them by folklore.

The approaches that work, in increasing rigor: copy a published recipe (fine for a baseline), grid-search weights with small proxy models and extrapolate (the workhorse), or learn weights directly (DoReMi-style reweighting). Two ground rules I hold: never let one source silently dominate after dedup shrinks the pools by different amounts, and re-run the ablation when you change any upstream filter — mixture weights tuned for last month’s pipeline are stale.
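The first ground rule is easy to check with arithmetic: given mixture weights and a token budget, compute how many times each pool gets repeated. The pool sizes, weights, budget, and the 4-epoch warning level below are all made-up numbers for illustration.

```python
def effective_epochs(pool_tokens: dict, weights: dict, budget: float) -> dict:
    """Times each pool is repeated if `budget` tokens are sampled with `weights`."""
    total = sum(weights.values())
    return {
        name: (weights[name] / total) * budget / pool_tokens[name]
        for name in pool_tokens
    }


# Hypothetical post-dedup pool sizes in tokens, with hand-set weights.
weights = {"web": 0.6, "code": 0.2, "math": 0.2}
after_dedup = {"web": 600e9, "code": 150e9, "math": 10e9}

epochs = effective_epochs(after_dedup, weights, budget=500e9)
for name, value in epochs.items():
    flag = "  <- repeated heavily" if value > 4 else ""  # 4 is an assumed cutoff
    print(f"{name}: {value:.2f} epochs{flag}")
```

In this toy setup the math pool shrank the most during dedup, so an unchanged 20% weight means it is seen about ten times while web is seen half a time. That is the kind of drift worth catching before the run starts.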

The loop that makes it a system

Curation isn’t a pipeline you run once; it’s a loop: filter → dedup → mix → train a small proxy → eval → adjust. The teams with the best data are not the ones with the cleverest filters — they’re the ones who close this loop fastest and log every decision so results are attributable.

FAQ

How much does deduplication actually help? Consistently positive at web scale: better held-out perplexity at fixed compute, sharply less verbatim memorization, and honest (usually lower) benchmark numbers after decontamination.

Doesn’t filtering reduce diversity? It can — that’s the central tension. The mitigation is measuring removal rates per domain/language/format, and keeping quality thresholds looser for rare-but-valuable slices like code and math.

Do these stages apply to fine-tuning datasets too? Even more so. At 10k examples, one duplicated instruction template can be 5% of your dataset, and near-dup prompts between train and test make your internal evals meaningless.

Exact order of operations? Extract → language ID → heuristic filter → exact dedup → near dedup → model-based filter → decontaminate → mix. Dedup before the expensive classifier — no reason to score the same document twice.
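That ordering can be written down as a list of named stages with a document count logged after each one. This is a sketch under my own assumptions: `run_pipeline` is a hypothetical helper and the lambdas are placeholders for the real stages.

```python
def run_pipeline(docs, stages):
    """Apply (name, fn) stages in order; each fn maps a list of docs to a list."""
    log = [("input", len(docs))]
    for name, fn in stages:
        docs = fn(docs)
        log.append((name, len(docs)))
    return docs, log


# Placeholder stages; real ones would wrap language ID, filters, dedup, etc.
stages = [
    ("heuristic_filter", lambda ds: [d for d in ds if len(d.split()) >= 3]),
    ("exact_dedup", lambda ds: list(dict.fromkeys(ds))),  # order-preserving
    ("model_filter", lambda ds: ds),                      # classifier stand-in
]
result, log = run_pipeline(["a b c", "a b c", "x", "d e f g"], stages)
for name, count in log:
    print(f"{name}: {count} docs")
```

Keeping the order in data rather than in call sites makes it easy to swap two stages and compare the per-stage counts between runs.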

#data-curation
#deduplication
#minhash
#data-mixing
#llm-pretraining