Skip to content
Back to LLM Ingestion Pipeline

Dedup is MinHash + LSH (datasketch), not hash-only or embedding-similarity

✓ AcceptedLLM Ingestion Pipeline01 — Data Foundation (sub-part: Deduplication & Cleaning)
By AI-DE Engineering Team·Stakeholders: data engineer, search relevance reviewer

Context

The corpus contains exact duplicates (mirrored sites, RSS reposts) AND near-duplicates (a 95%-identical article republished with a different intro paragraph). LLM training-data quality depends on removing both classes — the closest match is the harder one. The classic options:

  1. Hash-only dedup — SHA-256 of normalized content; remove exact matches. Misses near-dupes entirely.
  2. MinHash + LSH — shingle-based similarity sketches with locality-sensitive hashing for sub-linear lookup. The de-facto industry default for corpus dedup.
  3. Embedding similarity — embed each doc, dedup via cosine threshold. Quality depends on the embedding model; cost scales with corpus size.
  4. Exact-substring scan — find longest common substring between doc pairs. O(N²) in doc count.

The tutorial target: 100K-doc dedup at 95% precision (true near-dupe catch rate measured against a labeled validation set).

Decision

Adopt MinHash + LSH via the datasketch library, with shingle-based similarity. Tier-1 single-node implementation in minhash.py; optional Tier-2 distributed scaling via pyspark in distributed_dedup.py.

# minhash.py — TextMinHash with shingles
from datasketch import MinHash, MinHashLSH

class TextMinHash:
    def __init__(self, num_perm: int = 128, shingle_size: int = 5):
        self.num_perm = num_perm
        self.k = shingle_size

    def signature(self, text: str) -> MinHash:
        m = MinHash(num_perm=self.num_perm)
        # character k-shingles for short docs, word k-shingles for long
        shingles = self._shingles(text)
        for s in shingles:
            m.update(s.encode("utf-8"))
        return m

class DedupIndex:
    def __init__(self, threshold: float = 0.85):
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def query(self, sig: MinHash) -> list[str]:
        return self.lsh.query(sig)  # candidate near-dupes

    def insert(self, doc_id: str, sig: MinHash):
        self.lsh.insert(doc_id, sig)

Tradeoffs we accept

LeverHash-onlyMinHash + LSH (chosen)Embedding similarityExact-substring
Catches exact dupes
Catches near-dupes (95% similar)✓ (depends on model)✓ (slow)
Catches paraphrased dupes (50% similar)Partial
Time complexity (N docs)O(N)O(N log N)O(N²) embed + O(N log N) ANNO(N²)
Dedup throughput at 100K~5 min (single-node)~10 min (single-node)~30 min + GPU~hours
Cost$0$0 (CPU only)$20+ (one-shot embed)$0
TunableNonethreshold, num_perm, shingle_sizeEmbedding model, thresholdMin-substring length
Tutorial reproducibilityTrivialpip install datasketchNew model + GPUSlow on laptop

We optimize for near-dupe recall at production cost. Hash-only catches ~30% of real dupes in the labeled validation set; MinHash + LSH hits 95% at the 0.85 Jaccard threshold (this is the tutorial's measured number). Embedding similarity beats MinHash on paraphrased dupes (50%-similar) but is overkill for training-data quality at v1 — the LLM judge eval downstream catches paraphrased issues anyway.

Consequences (positive)

  • 95% precision on the labeled validation set, validated in data/processed/dedup_input.jsonl (10.3 MB fixture with engineered exact + near-duplicate clusters).
  • The datasketch MinHashLSH index is in-memory; queries are sub-ms for 100K-doc corpora on a single laptop.
  • The tunable knobs (threshold, num_perm, shingle_size) are documented with a recall-vs-precision sweep in part-2.
  • Optional distributed scaling (distributed_dedup.py via pyspark) is documented but not required for the tutorial path.

Consequences (negative)

  • Misses paraphrased dupes. A 50%-Jaccard "rewritten" article passes the 0.85 threshold. Mitigation: downstream LLM-judge eval (M04) catches semantic dupes via failure_analysis.py clustering.
  • Threshold tuning is corpus-specific. 0.85 is good for the bundled corpus; a denser corpus might need 0.90; a sparser one 0.80. The tuning sweep in part-2 documents how to find the right number for new corpora.
  • Memory ceiling at scale. A 10M-doc LSH index doesn't fit in 16 GB of RAM. Mitigation: distributed_dedup.py shards the LSH across pyspark workers (Tier-2 path).
  • Optional Spark dep. The Tier-2 path requires pyspark, which adds ~200 MB of disk. Mitigation: Tier-1 path (single-node) is sufficient for the tutorial corpus and has no Spark dependency.

Reversal plan

The dedup interface is DedupIndex.{insert, query} and a compute_signatures(docs) -> list[MinHash] function. Replacement is bounded:

  1. Hash-only — replace compute_signatures with sha256(content); replace DedupIndex with a set[str]. ~10 lines. Loses near-dupe detection.
  2. Embedding similarity — replace signature(text) with a embed(text) -> vector call; swap LSH for a vector index (Faiss, pgvector). ~50 lines + GPU/embed budget. Better recall on paraphrases.
  3. Hybrid — run MinHash first (fast, cheap), then embedding similarity on the candidate near-dupe set. Best-of-both. Documented as a stretch goal in part-2.

Estimated effort: 0.5–3 engineer-days depending on swap target. Reversible.

References

  • minhash.py (TextMinHash + shingles)
  • dedup/deduplicate.py (single-node orchestrator)
  • distributed_dedup.py (optional pyspark Tier-2 scaling)
  • pipeline/dedup.py (pipeline-level abstraction)
  • data/processed/dedup_input.jsonl (10.3 MB fixture with labeled clusters)
  • ADR-001 (aiohttp crawler — produces the input)
  • ADR-003 (tokenization — consumes the deduplicated output)
Built into the project

This decision shipped as part of LLM Ingestion Pipeline — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open