ADR-002: Dedup is MinHash + LSH (datasketch), not hash-only or embedding-similarity | LLM Ingestion Pipeline

Context

The corpus contains exact duplicates (mirrored sites, RSS reposts) AND near-duplicates (a 95%-identical article republished with a different intro paragraph). LLM training-data quality depends on removing both classes — the closest match is the harder one. The classic options:

Hash-only dedup — SHA-256 of normalized content; remove exact matches. Misses near-dupes entirely.
MinHash + LSH — shingle-based similarity sketches with locality-sensitive hashing for sub-linear lookup. The de-facto industry default for corpus dedup.
Embedding similarity — embed each doc, dedup via cosine threshold. Quality depends on the embedding model; cost scales with corpus size.
Exact-substring scan — find longest common substring between doc pairs. O(N²) in doc count.

The tutorial target: 100K-doc dedup at 95% precision (true near-dupe catch rate measured against a labeled validation set).

Decision

Adopt MinHash + LSH via the datasketch library, with shingle-based similarity. Tier-1 single-node implementation in minhash.py; optional Tier-2 distributed scaling via pyspark in distributed_dedup.py.

# minhash.py — TextMinHash with shingles
from datasketch import MinHash, MinHashLSH

class TextMinHash:
    def __init__(self, num_perm: int = 128, shingle_size: int = 5):
        self.num_perm = num_perm
        self.k = shingle_size

    def signature(self, text: str) -> MinHash:
        m = MinHash(num_perm=self.num_perm)
        # character k-shingles for short docs, word k-shingles for long
        shingles = self._shingles(text)
        for s in shingles:
            m.update(s.encode("utf-8"))
        return m

class DedupIndex:
    def __init__(self, threshold: float = 0.85):
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def query(self, sig: MinHash) -> list[str]:
        return self.lsh.query(sig)  # candidate near-dupes

    def insert(self, doc_id: str, sig: MinHash):
        self.lsh.insert(doc_id, sig)

Tradeoffs we accept

Lever	Hash-only	MinHash + LSH (chosen)	Embedding similarity	Exact-substring
Catches exact dupes	✓	✓	✓	✓
Catches near-dupes (95% similar)	✗	✓	✓ (depends on model)	✓ (slow)
Catches paraphrased dupes (50% similar)	✗	Partial	✓	✗
Time complexity (N docs)	O(N)	O(N log N)	O(N²) embed + O(N log N) ANN	O(N²)
Dedup throughput at 100K	~5 min (single-node)	~10 min (single-node)	~30 min + GPU	~hours
Cost	$0	$0 (CPU only)	$20+ (one-shot embed)	$0
Tunable	None	`threshold`, `num_perm`, `shingle_size`	Embedding model, threshold	Min-substring length
Tutorial reproducibility	Trivial	`pip install datasketch`	New model + GPU	Slow on laptop

We optimize for near-dupe recall at production cost. Hash-only catches ~30% of real dupes in the labeled validation set; MinHash + LSH hits 95% at the 0.85 Jaccard threshold (this is the tutorial's measured number). Embedding similarity beats MinHash on paraphrased dupes (50%-similar) but is overkill for training-data quality at v1 — the LLM judge eval downstream catches paraphrased issues anyway.

Consequences (positive)

95% precision on the labeled validation set, validated in data/processed/dedup_input.jsonl (10.3 MB fixture with engineered exact + near-duplicate clusters).
The datasketch MinHashLSH index is in-memory; queries are sub-ms for 100K-doc corpora on a single laptop.
The tunable knobs (threshold, num_perm, shingle_size) are documented with a recall-vs-precision sweep in part-2.
Optional distributed scaling (distributed_dedup.py via pyspark) is documented but not required for the tutorial path.

Consequences (negative)

Misses paraphrased dupes. A 50%-Jaccard "rewritten" article passes the 0.85 threshold. Mitigation: downstream LLM-judge eval (M04) catches semantic dupes via failure_analysis.py clustering.
Threshold tuning is corpus-specific. 0.85 is good for the bundled corpus; a denser corpus might need 0.90; a sparser one 0.80. The tuning sweep in part-2 documents how to find the right number for new corpora.
Memory ceiling at scale. A 10M-doc LSH index doesn't fit in 16 GB of RAM. Mitigation: distributed_dedup.py shards the LSH across pyspark workers (Tier-2 path).
Optional Spark dep. The Tier-2 path requires pyspark, which adds ~200 MB of disk. Mitigation: Tier-1 path (single-node) is sufficient for the tutorial corpus and has no Spark dependency.

Reversal plan

The dedup interface is DedupIndex.{insert, query} and a compute_signatures(docs) -> list[MinHash] function. Replacement is bounded:

Hash-only — replace compute_signatures with sha256(content); replace DedupIndex with a set[str]. ~10 lines. Loses near-dupe detection.
Embedding similarity — replace signature(text) with a embed(text) -> vector call; swap LSH for a vector index (Faiss, pgvector). ~50 lines + GPU/embed budget. Better recall on paraphrases.
Hybrid — run MinHash first (fast, cheap), then embedding similarity on the candidate near-dupe set. Best-of-both. Documented as a stretch goal in part-2.

Estimated effort: 0.5–3 engineer-days depending on swap target. Reversible.

References

minhash.py (TextMinHash + shingles)
dedup/deduplicate.py (single-node orchestrator)
distributed_dedup.py (optional pyspark Tier-2 scaling)
pipeline/dedup.py (pipeline-level abstraction)
data/processed/dedup_input.jsonl (10.3 MB fixture with labeled clusters)
ADR-001 (aiohttp crawler — produces the input)
ADR-003 (tokenization — consumes the deduplicated output)