ADR-001: Use aiohttp + custom crawler over Scrapy / requests-html | LLM Ingestion Pipeline

Context

The pipeline ingests 1k+ pages of web data per project run, scaling to 1M-doc-capable batch ingestion. The crawler is the front of the pipeline — every downstream module (dedup, quality, tokenization, RAG) depends on what it produces. The classic options:

Scrapy — the de-facto Python crawler framework. Spider classes, middleware pipelines, request scheduler, selectors. Heavy but battle-tested.
requests-html — synchronous, lightweight, JavaScript rendering built in. Simple but blocks on every request.
Playwright / Puppeteer — full browser automation. Necessary for JS-heavy sites; slow + memory-heavy for static pages.
aiohttp + custom orchestration — async HTTP client with manual request scheduling, rate limiting, and content extraction.

We are building a tutorial pipeline that has to be reproducible by a learner on a laptop in <2 hours and survive a real production batch ingest.

Decision

Adopt aiohttp with a custom BaseCrawler + TokenBucket rate limiter + a four-extractor fallback chain (BeautifulSoup, readability-lxml, trafilatura, pypdf).

# base_crawler.py — the spine
from aiohttp import ClientSession, TCPConnector
import asyncio

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = asyncio.get_event_loop().time()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = asyncio.get_event_loop().time()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self.tokens = 0
            else:
                self.tokens -= 1

class BaseCrawler:
    def __init__(self, rate=2.0, concurrency=10):
        self.bucket = TokenBucket(rate=rate, burst=int(rate))
        self.sem = asyncio.Semaphore(concurrency)

    async def fetch(self, url: str, session: ClientSession) -> dict:
        await self.bucket.acquire()
        async with self.sem:
            async with session.get(url, timeout=30) as resp:
                return {"url": url, "status": resp.status, "body": await resp.text()}

# extractors.py — fallback chain
class ContentExtractor:
    def extract(self, html: str, url: str) -> str:
        # 1. trafilatura (best for articles)
        # 2. readability-lxml (medium quality, broader coverage)
        # 3. BeautifulSoup with custom rules (legal docs, code samples)
        # 4. pypdf (when content-type is PDF)
        for extractor in self._chain:
            text = extractor(html, url)
            if text and len(text) > 200:
                return text
        return ""

Tradeoffs we accept

Lever	Scrapy	requests-html	Playwright	aiohttp (chosen)
Day-1 setup	`scrapy startproject` + middleware config	`pip install`	Browser binary download	`pip install aiohttp`
Concurrency	Built-in (Twisted reactor)	Sync (slow)	Process-per-page (heavy)	asyncio + Semaphore
Throughput on static pages (1k pages, 2 req/s)	~10 minutes	~30+ minutes	~30+ minutes	~10 minutes
Tutorial reproducibility	Heavy (Scrapy mental model)	Easy but slow	Browser binary in CI	Easy + fast
Customization	Middleware pipeline (powerful, opaque)	Hard	Possible	Plain Python
Rate limiting	Built-in `DOWNLOAD_DELAY`	Build it	Build it	TokenBucket (this ADR)
JS rendering	Splash sidecar	Built-in	Native	Not supported (CUT)
robots.txt + politeness	Built-in `RobotsTxtMiddleware`	Build it	Build it	Build it
Memory footprint	~100 MB	~50 MB	~500 MB+	~30 MB

We optimize for tutorial reproducibility + control over the request loop. Scrapy's spider/middleware/pipeline mental model adds 2-3 hours to a learner's onramp before they crawl their first page; aiohttp gets to "fetch a URL with rate limiting" in 30 lines. The cost is JS rendering (cut from scope) and reimplementing politeness rules (robots.txt parsing in crawl/ is ~80 lines; documented).

Consequences (positive)

A learner runs python crawl/run.py --seeds urls.txt and sees the TokenBucket throttle live in stdout — pedagogically clearer than configuring DOWNLOAD_DELAY in settings.py.
The four-extractor fallback chain (extractors.py) handles articles, legal docs, code samples, and PDFs from a single URL list. Scrapy would require per-extractor middleware classes.
Memory footprint is small enough to run alongside the Ray cluster on a single 16GB laptop during M01-M02 hacking.
The BaseCrawler.fetch() interface is async-native, which matches the rest of the pipeline (dedup, embed, vLLM serving all async).

Consequences (negative)

No JS rendering. Sites that require JS execution to expose content are out of scope. Mitigation: the bundled data/raw_corpus/ ships 120 synthetic .txt files for the demo path; Tier-2 documents the Playwright swap.
Politeness primitives are BYO. robots.txt parsing, retry/backoff, duplicate-URL detection are application code. Mitigation: rate_limiter.py + crawl/ ship working defaults; learners read the code, not a Scrapy config.
No native fixture replay. Scrapy's httpcache would let learners re-run a crawl against cached responses. We document a requests-mock fixture pattern in part-1.

Reversal plan

The crawler interface is BaseCrawler.fetch(url, session) -> dict. Replacement is bounded:

Scrapy swap — replace BaseCrawler with a Scrapy Spider that yields the same dict shape. Move TokenBucket logic to a Scrapy DOWNLOADER_MIDDLEWARES entry. ~1 engineer-week.
Playwright addition (not replacement) — add a BrowserCrawler subclass that uses Playwright for JS-heavy URLs; route based on URL pattern in a select_crawler(url) function. ~3 engineer-days.
Hybrid — keep aiohttp for static, add a Playwright sidecar via feature flag.

Estimated effort: 3-5 engineer-days for Playwright addition; 1 engineer-week for full Scrapy swap. Reversible.

References

base_crawler.py (TokenBucket + async fetch)
extractors.py (ContentExtractor fallback chain)
crawl/ (robots.txt + URL deduplication helpers)
data/raw_corpus/ (120 synthetic .txt fixtures for the tutorial path)
ADR-002 (MinHash + LSH dedup — consumes the crawl output)