Context
The pipeline ingests 1k+ pages of web data per project run and is designed to scale to 1M-document batch ingestion. The crawler is the front of the pipeline: every downstream module (dedup, quality, tokenization, RAG) depends on what it produces. The classic options:
- Scrapy — the de-facto Python crawler framework. Spider classes, middleware pipelines, request scheduler, selectors. Heavy but battle-tested.
- requests-html — synchronous, lightweight, JavaScript rendering built in. Simple but blocks on every request.
- Playwright / Puppeteer — full browser automation. Necessary for JS-heavy sites; slow + memory-heavy for static pages.
- aiohttp + custom orchestration — async HTTP client with manual request scheduling, rate limiting, and content extraction.
We are building a tutorial pipeline that has to be reproducible by a learner on a laptop in <2 hours and survive a real production batch ingest.
Decision
Adopt aiohttp with a custom BaseCrawler + TokenBucket rate limiter + a four-extractor fallback chain (in fallback order: trafilatura, readability-lxml, BeautifulSoup, pypdf).
```python
# base_crawler.py: the spine
import asyncio
import time

from aiohttp import ClientSession, ClientTimeout


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts of up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        # time.monotonic() works even when no event loop is running yet.
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            # Refill for the time elapsed since the last call, capped at burst.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                # Wait until one full token has accrued, then spend it.
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self.tokens = 0
                # Reset the clock so the sleep is not credited a second time.
                self.last = time.monotonic()
            else:
                self.tokens -= 1


class BaseCrawler:
    def __init__(self, rate=2.0, concurrency=10):
        self.bucket = TokenBucket(rate=rate, burst=int(rate))
        self.sem = asyncio.Semaphore(concurrency)

    async def fetch(self, url: str, session: ClientSession) -> dict:
        await self.bucket.acquire()
        async with self.sem:
            # aiohttp expects a ClientTimeout object rather than a bare number.
            async with session.get(url, timeout=ClientTimeout(total=30)) as resp:
                return {"url": url, "status": resp.status, "body": await resp.text()}
```
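```python
# extractors.py: fallback chain
class ContentExtractor:
    def __init__(self, chain):
        # Assumption: the chain is injected as callables (html, url) -> str,
        # in the priority order listed in extract().
        self._chain = list(chain)

    def extract(self, html: str, url: str) -> str:
        # 1. trafilatura (best for articles)
        # 2. readability-lxml (medium quality, broader coverage)
        # 3. BeautifulSoup with custom rules (legal docs, code samples)
        # 4. pypdf (when content-type is PDF)
        for extractor in self._chain:
            text = extractor(html, url)
            # Accept the first extractor that returns more than 200 characters.
            if text and len(text) > 200:
                return text
        return ""
```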
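To see the fallback behavior in isolation, the following is a minimal sketch that runs offline with two stand-in extractors. `article_only`, `strip_all_tags`, and `run_chain` are hypothetical names for illustration; the real chain would call trafilatura, readability-lxml, BeautifulSoup, and pypdf instead.

```python
import re
from typing import Callable, List

Extractor = Callable[[str, str], str]

MIN_CHARS = 200  # same acceptance threshold as the chain above


def article_only(html: str, url: str) -> str:
    """Hypothetical stand-in for an article extractor: reads only <article> bodies."""
    match = re.search(r"<article>(.*?)</article>", html, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", " ", match.group(1)).strip() if match else ""


def strip_all_tags(html: str, url: str) -> str:
    """Hypothetical last-resort extractor: drop every tag, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()


def run_chain(chain: List[Extractor], html: str, url: str) -> str:
    """Return the first extractor result longer than MIN_CHARS, else ''."""
    for extractor in chain:
        text = extractor(html, url)
        if text and len(text) > MIN_CHARS:
            return text
    return ""


chain = [article_only, strip_all_tags]
page_with_article = "<html><article>" + "word " * 60 + "</article></html>"
page_without_article = "<html><div>" + "term " * 60 + "</div></html>"
short_page = "<html><p>too short</p></html>"

from_article = run_chain(chain, page_with_article, "https://example.com/a")
from_fallback = run_chain(chain, page_without_article, "https://example.com/b")
rejected = run_chain(chain, short_page, "https://example.com/c")
```

The second page has no `<article>` element, so the first extractor returns an empty string and the chain falls through to the tag stripper. The third page is rejected by every extractor because its text never exceeds the length threshold.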
Tradeoffs we accept
| Lever | Scrapy | requests-html | Playwright | aiohttp (chosen) |
|---|---|---|---|---|
| Day-1 setup | scrapy startproject + middleware config | pip install | Browser binary download | pip install aiohttp |
| Concurrency | Built-in (Twisted reactor) | Sync (slow) | Process-per-page (heavy) | asyncio + Semaphore |
| Throughput on static pages (1k pages, 2 req/s) | ~10 minutes | ~30+ minutes | ~30+ minutes | ~10 minutes |
| Tutorial reproducibility | Heavy (Scrapy mental model) | Easy but slow | Browser binary in CI | Easy + fast |
| Customization | Middleware pipeline (powerful, opaque) | Hard | Possible | Plain Python |
| Rate limiting | Built-in DOWNLOAD_DELAY | Build it | Build it | TokenBucket (this ADR) |
| JS rendering | Splash sidecar | Built-in | Native | Not supported (CUT) |
| robots.txt + politeness | Built-in RobotsTxtMiddleware | Build it | Build it | Build it |
| Memory footprint | ~100 MB | ~50 MB | ~500 MB+ | ~30 MB |
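
The throughput row can be sanity-checked with simple arithmetic: under a rate limit, crawl time is bounded below by pages divided by rate, whatever the framework. The sketch below assumes a single global limit of 2 req/s and ignores network latency; `min_crawl_seconds` is a hypothetical helper, not part of the pipeline.

```python
def min_crawl_seconds(pages: int, rate_per_s: float, burst: int = 0) -> float:
    """Lower bound on wall-clock time for a rate-limited crawl.

    The first `burst` requests are served immediately; every later request
    waits 1 / rate_per_s seconds on average.
    """
    return max(0, pages - burst) / rate_per_s


tutorial_seconds = min_crawl_seconds(1_000, 2.0)
batch_seconds = min_crawl_seconds(1_000_000, 2.0)

tutorial_minutes = tutorial_seconds / 60   # about 8.3 minutes
batch_days = batch_seconds / 86_400        # about 5.8 days
```

The 1k-page figure is consistent with the "~10 minutes" entries for the two async options. The 1M-document figure suggests that a batch ingest would need a higher aggregate rate, for example separate limits per host, rather than one global 2 req/s bucket.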
We optimize for tutorial reproducibility + control over the request
loop. Scrapy's spider/middleware/pipeline mental model adds 2-3
hours to a learner's onramp before they crawl their first page; aiohttp
gets to "fetch a URL with rate limiting" in 30 lines. The cost is
JS rendering (cut from scope) and reimplementing politeness rules
(robots.txt parsing in crawl/ is ~80 lines; documented).
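
For a sense of what the politeness check involves, here is a minimal sketch that uses the standard library's `urllib.robotparser` rather than the custom parser in `crawl/`. The user-agent string and the sample rules are assumptions, inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "tutorial-crawler"  # hypothetical user-agent string

# Sample rules, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/docs/intro")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(USER_AGENT)  # seconds requested by the site, or None
```

A crawler would typically call `can_fetch` before spending a rate-limiter token, and could feed `crawl_delay` into the per-host rate.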
Consequences (positive)
- A learner runs `python crawl/run.py --seeds urls.txt` and sees the TokenBucket throttle live in stdout, which is pedagogically clearer than configuring `DOWNLOAD_DELAY` in `settings.py`.
- The four-extractor fallback chain (`extractors.py`) handles articles, legal docs, code samples, and PDFs from a single URL list. Scrapy would require per-extractor middleware classes.
- Memory footprint is small enough to run alongside the Ray cluster on a single 16GB laptop during M01-M02 hacking.
- The `BaseCrawler.fetch()` interface is async-native, which matches the rest of the pipeline (dedup, embed, vLLM serving all async).
Consequences (negative)
- No JS rendering. Sites that require JS execution to expose content are out of scope. Mitigation: the bundled `data/raw_corpus/` ships 120 synthetic .txt files for the demo path; Tier-2 documents the Playwright swap.
- Politeness primitives are BYO. robots.txt parsing, retry/backoff, and duplicate-URL detection are application code. Mitigation: `rate_limiter.py` + `crawl/` ship working defaults; learners read the code, not a Scrapy config.
- No native fixture replay. Scrapy's `httpcache` would let learners re-run a crawl against cached responses. We document a `requests-mock` fixture pattern in part-1.
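
As an illustration of one of the BYO politeness primitives, the following sketch shows retry with exponential backoff and jitter around any async call. `with_retries` and `flaky_fetch` are hypothetical names, and the broad `except Exception` is a simplification; real code would likely catch only transient errors such as `aiohttp.ClientError` and `asyncio.TimeoutError`.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Run `call`, retrying on exception with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The backoff window doubles per attempt, capped at max_delay.
            window = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the window.
            await asyncio.sleep(random.uniform(0, window))
    raise AssertionError("unreachable")


calls = {"count": 0}


async def flaky_fetch() -> dict:
    """Simulated fetch that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"url": "https://example.com/", "status": 200, "body": "ok"}


result = asyncio.run(with_retries(flaky_fetch, attempts=3, base_delay=0.01))
```

The returned dict deliberately has the same shape as `BaseCrawler.fetch()`, so a wrapper like this could sit around the fetch call without changing downstream code.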
Reversal plan
The crawler interface is BaseCrawler.fetch(url, session) -> dict.
Replacement is bounded:
- Scrapy swap: replace `BaseCrawler` with a Scrapy `Spider` that yields the same dict shape. Move TokenBucket logic to a Scrapy `DOWNLOADER_MIDDLEWARES` entry. ~1 engineer-week.
- Playwright addition (not replacement): add a `BrowserCrawler` subclass that uses Playwright for JS-heavy URLs; route based on URL pattern in a `select_crawler(url)` function. 3-5 engineer-days.
- Hybrid: keep aiohttp for static, add a Playwright sidecar via feature flag.
Estimated effort: 3-5 engineer-days for Playwright addition; 1 engineer-week for full Scrapy swap. Reversible.
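
The URL-pattern routing in the Playwright option could look roughly like the sketch below. The patterns are invented for illustration, and both classes are empty stand-ins so the example runs without aiohttp or Playwright installed; only the `select_crawler` logic is the point.

```python
import re
from typing import List, Pattern, Type

# Hypothetical patterns for sites known to need JS rendering.
JS_HEAVY_PATTERNS: List[Pattern[str]] = [
    re.compile(r"^https?://app\.example\.com/"),
    re.compile(r"^https?://[^/]+/spa/"),
]


class BaseCrawler:
    """Stand-in for the aiohttp crawler."""


class BrowserCrawler(BaseCrawler):
    """Stand-in for the Playwright-backed subclass."""


def select_crawler(url: str) -> Type[BaseCrawler]:
    """Route JS-heavy URLs to the browser crawler, everything else to aiohttp."""
    if any(pattern.search(url) for pattern in JS_HEAVY_PATTERNS):
        return BrowserCrawler
    return BaseCrawler


static_choice = select_crawler("https://example.com/docs/intro.html")
host_choice = select_crawler("https://app.example.com/dashboard")
path_choice = select_crawler("https://news.example.com/spa/feed")
```

Because `BrowserCrawler` subclasses `BaseCrawler`, callers keep using the same `fetch(url, session) -> dict` contract regardless of which class is selected.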
References
- `base_crawler.py` (TokenBucket + async fetch)
- `extractors.py` (ContentExtractor fallback chain)
- `crawl/` (robots.txt + URL deduplication helpers)
- `data/raw_corpus/` (120 synthetic .txt fixtures for the tutorial path)
- ADR-002 (MinHash + LSH dedup, which consumes the crawl output)