Context
The corpus contains exact duplicates (mirrored sites, RSS reposts) AND near-duplicates (a 95%-identical article republished with a different intro paragraph). LLM training-data quality depends on removing both classes — the closest match is the harder one. The classic options:
- Hash-only dedup — SHA-256 of normalized content; remove exact matches. Misses near-dupes entirely.
- MinHash + LSH — shingle-based similarity sketches with locality-sensitive hashing for sub-linear lookup. The de-facto industry default for corpus dedup.
- Embedding similarity — embed each doc, dedup via cosine threshold. Quality depends on the embedding model; cost scales with corpus size.
- Exact-substring scan — find longest common substring between doc pairs. O(N²) in doc count.
The tutorial target: 100K-doc dedup at 95% precision (true near-dupe catch rate measured against a labeled validation set).
Decision
Adopt MinHash + LSH via the datasketch library, with shingle-based
similarity. Tier-1 single-node implementation in minhash.py;
optional Tier-2 distributed scaling via pyspark in
distributed_dedup.py.
# minhash.py — TextMinHash with shingles
from datasketch import MinHash, MinHashLSH
class TextMinHash:
def __init__(self, num_perm: int = 128, shingle_size: int = 5):
self.num_perm = num_perm
self.k = shingle_size
def signature(self, text: str) -> MinHash:
m = MinHash(num_perm=self.num_perm)
# character k-shingles for short docs, word k-shingles for long
shingles = self._shingles(text)
for s in shingles:
m.update(s.encode("utf-8"))
return m
class DedupIndex:
def __init__(self, threshold: float = 0.85):
self.lsh = MinHashLSH(threshold=threshold, num_perm=128)
def query(self, sig: MinHash) -> list[str]:
return self.lsh.query(sig) # candidate near-dupes
def insert(self, doc_id: str, sig: MinHash):
self.lsh.insert(doc_id, sig)
Tradeoffs we accept
| Lever | Hash-only | MinHash + LSH (chosen) | Embedding similarity | Exact-substring |
|---|---|---|---|---|
| Catches exact dupes | ✓ | ✓ | ✓ | ✓ |
| Catches near-dupes (95% similar) | ✗ | ✓ | ✓ (depends on model) | ✓ (slow) |
| Catches paraphrased dupes (50% similar) | ✗ | Partial | ✓ | ✗ |
| Time complexity (N docs) | O(N) | O(N log N) | O(N²) embed + O(N log N) ANN | O(N²) |
| Dedup throughput at 100K | ~5 min (single-node) | ~10 min (single-node) | ~30 min + GPU | ~hours |
| Cost | $0 | $0 (CPU only) | $20+ (one-shot embed) | $0 |
| Tunable | None | threshold, num_perm, shingle_size | Embedding model, threshold | Min-substring length |
| Tutorial reproducibility | Trivial | pip install datasketch | New model + GPU | Slow on laptop |
We optimize for near-dupe recall at production cost. Hash-only catches ~30% of real dupes in the labeled validation set; MinHash + LSH hits 95% at the 0.85 Jaccard threshold (this is the tutorial's measured number). Embedding similarity beats MinHash on paraphrased dupes (50%-similar) but is overkill for training-data quality at v1 — the LLM judge eval downstream catches paraphrased issues anyway.
Consequences (positive)
- 95% precision on the labeled validation set, validated in
data/processed/dedup_input.jsonl(10.3 MB fixture with engineered exact + near-duplicate clusters). - The
datasketchMinHashLSH index is in-memory; queries are sub-ms for 100K-doc corpora on a single laptop. - The tunable knobs (
threshold,num_perm,shingle_size) are documented with a recall-vs-precision sweep in part-2. - Optional distributed scaling (
distributed_dedup.pyvia pyspark) is documented but not required for the tutorial path.
Consequences (negative)
- Misses paraphrased dupes. A 50%-Jaccard "rewritten" article
passes the 0.85 threshold. Mitigation: downstream LLM-judge eval
(M04) catches semantic dupes via
failure_analysis.pyclustering. - Threshold tuning is corpus-specific. 0.85 is good for the bundled corpus; a denser corpus might need 0.90; a sparser one 0.80. The tuning sweep in part-2 documents how to find the right number for new corpora.
- Memory ceiling at scale. A 10M-doc LSH index doesn't fit in
16 GB of RAM. Mitigation:
distributed_dedup.pyshards the LSH across pyspark workers (Tier-2 path). - Optional Spark dep. The Tier-2 path requires pyspark, which adds ~200 MB of disk. Mitigation: Tier-1 path (single-node) is sufficient for the tutorial corpus and has no Spark dependency.
Reversal plan
The dedup interface is DedupIndex.{insert, query} and a
compute_signatures(docs) -> list[MinHash] function. Replacement is
bounded:
- Hash-only — replace
compute_signatureswithsha256(content); replaceDedupIndexwith aset[str]. ~10 lines. Loses near-dupe detection. - Embedding similarity — replace
signature(text)with aembed(text) -> vectorcall; swap LSH for a vector index (Faiss, pgvector). ~50 lines + GPU/embed budget. Better recall on paraphrases. - Hybrid — run MinHash first (fast, cheap), then embedding similarity on the candidate near-dupe set. Best-of-both. Documented as a stretch goal in part-2.
Estimated effort: 0.5–3 engineer-days depending on swap target. Reversible.
References
minhash.py(TextMinHash + shingles)dedup/deduplicate.py(single-node orchestrator)distributed_dedup.py(optional pyspark Tier-2 scaling)pipeline/dedup.py(pipeline-level abstraction)data/processed/dedup_input.jsonl(10.3 MB fixture with labeled clusters)- ADR-001 (aiohttp crawler — produces the input)
- ADR-003 (tokenization — consumes the deduplicated output)