Use aiohttp + custom crawler over Scrapy / requests-html

Status: Accepted · LLM Ingestion Pipeline · 01: Data Foundation (sub-part: Data Collection & Crawling)
By AI-DE Engineering Team · Stakeholders: data engineer, infra reviewer

Context

The pipeline ingests 1k+ web pages per project run and must scale to batch ingestion of up to 1M documents. The crawler is the front of the pipeline: every downstream module (dedup, quality, tokenization, RAG) depends on what it produces. The classic options:

  1. Scrapy — the de-facto Python crawler framework. Spider classes, middleware pipelines, request scheduler, selectors. Heavy but battle-tested.
  2. requests-html — synchronous, lightweight, JavaScript rendering built in. Simple but blocks on every request.
  3. Playwright / Puppeteer — full browser automation. Necessary for JS-heavy sites; slow + memory-heavy for static pages.
  4. aiohttp + custom orchestration — async HTTP client with manual request scheduling, rate limiting, and content extraction.

We are building a tutorial pipeline that has to be reproducible by a learner on a laptop in <2 hours and survive a real production batch ingest.

Decision

Adopt aiohttp with a custom BaseCrawler, a TokenBucket rate limiter, and a four-extractor fallback chain (trafilatura, readability-lxml, BeautifulSoup, pypdf, tried in that order).

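# base_crawler.py: the spine
import asyncio
import time

from aiohttp import ClientSession, ClientTimeout


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second, up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        # time.monotonic() needs no running event loop, so the bucket can be
        # constructed outside a coroutine.
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        # Sleeping while holding the lock is deliberate: waiters queue up and
        # are released one at a time at the configured rate.
        async with self.lock:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                await asyncio.sleep((1 - self.tokens) / self.rate)
                # The sleep paid for exactly one token. Reset the clock so the
                # slept time is not credited again on the next refill.
                self.tokens = 0
                self.last = time.monotonic()
            else:
                self.tokens -= 1


class BaseCrawler:
    def __init__(self, rate=2.0, concurrency=10):
        # Allow at least one immediate request even when rate < 1
        # (int(rate) alone would give a burst of 0).
        self.bucket = TokenBucket(rate=rate, burst=max(1, int(rate)))
        self.sem = asyncio.Semaphore(concurrency)
        # aiohttp expects a ClientTimeout object; a bare number is deprecated.
        self.timeout = ClientTimeout(total=30)

    async def fetch(self, url: str, session: ClientSession) -> dict:
        await self.bucket.acquire()          # rate limit first
        async with self.sem:                 # then cap in-flight requests
            async with session.get(url, timeout=self.timeout) as resp:
                return {"url": url, "status": resp.status, "body": await resp.text()}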
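
# extractors.py: fallback chain
from typing import Callable, Sequence

Extractor = Callable[[str, str], str]  # (html, url) -> extracted text


class ContentExtractor:
    MIN_CHARS = 200  # shorter results count as a failed extraction

    def __init__(self, chain: Sequence[Extractor]):
        # Extractors are tried in the order given. Intended order:
        # 1. trafilatura (best for articles)
        # 2. readability-lxml (medium quality, broader coverage)
        # 3. BeautifulSoup with custom rules (legal docs, code samples)
        # 4. pypdf (when content-type is PDF)
        self._chain = list(chain)

    def extract(self, html: str, url: str) -> str:
        for extractor in self._chain:
            text = extractor(html, url)
            if text and len(text) > self.MIN_CHARS:
                return text
        return ""  # every extractor failed or returned too little text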

Tradeoffs we accept

| Lever | Scrapy | requests-html | Playwright | aiohttp (chosen) |
|---|---|---|---|---|
| Day-1 setup | scrapy startproject + middleware config | pip install | Browser binary download | pip install aiohttp |
| Concurrency | Built-in (Twisted reactor) | Sync (slow) | Process-per-page (heavy) | asyncio + Semaphore |
| Throughput on static pages (1k pages, 2 req/s) | ~10 minutes | ~30+ minutes | ~30+ minutes | ~10 minutes |
| Tutorial reproducibility | Heavy (Scrapy mental model) | Easy but slow | Browser binary in CI | Easy + fast |
| Customization | Middleware pipeline (powerful, opaque) | Hard | Possible | Plain Python |
| Rate limiting | Built-in DOWNLOAD_DELAY | Build it | Build it | TokenBucket (this ADR) |
| JS rendering | Splash sidecar | Built-in | Native | Not supported (CUT) |
| robots.txt + politeness | Built-in RobotsTxtMiddleware | Build it | Build it | Build it |
| Memory footprint | ~100 MB | ~50 MB | ~500 MB+ | ~30 MB |
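
The throughput row can be sanity-checked with simple arithmetic: at a fixed rate limit, the limiter sets a floor on wall-clock time regardless of framework. A rough sketch, where `estimate_crawl_seconds` and the 0.5 s per-request latency are illustrative assumptions rather than measurements from this project:

```python
def estimate_crawl_seconds(pages: int, rate: float, concurrency: int, latency_s: float) -> float:
    """Lower bound on crawl time: the slower of the rate limit and the worker pool."""
    rate_bound = pages / rate                     # limiter admits `rate` requests per second
    pool_bound = pages * latency_s / concurrency  # `concurrency` requests in flight at once
    return max(rate_bound, pool_bound)


# 1k pages at 2 req/s, 10 requests in flight, assumed 0.5 s per request.
async_s = estimate_crawl_seconds(1000, rate=2.0, concurrency=10, latency_s=0.5)
print(f"{async_s / 60:.1f} min")  # 8.3 min
```

Under this model a synchronous client (concurrency of 1) becomes pool-bound as soon as per-request latency exceeds 1/rate, which is 0.5 s at 2 req/s.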

We optimize for tutorial reproducibility + control over the request loop. Scrapy's spider/middleware/pipeline mental model adds 2-3 hours to a learner's onramp before they crawl their first page; aiohttp gets to "fetch a URL with rate limiting" in 30 lines. The cost is JS rendering (cut from scope) and reimplementing politeness rules (robots.txt parsing in crawl/ is ~80 lines; documented).
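
To give a sense of what "reimplementing politeness rules" involves, here is a minimal sketch of a robots.txt check built on the standard library's `urllib.robotparser`. It is not the code in crawl/; the helper names, the `tutorial-crawler` user agent, and the sample rules are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "tutorial-crawler") -> bool:
    """True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for a site being crawled.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_robots(SAMPLE)
print(is_allowed(robots, "https://example.com/docs/intro"))    # True
print(is_allowed(robots, "https://example.com/private/keys"))  # False
print(robots.crawl_delay("tutorial-crawler"))                  # 2
```

A crawler would typically cache one parser per host and, where a Crawl-delay is present, use it to lower that host's token-bucket rate.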

Consequences (positive)

  • A learner runs python crawl/run.py --seeds urls.txt and sees the TokenBucket throttle live in stdout — pedagogically clearer than configuring DOWNLOAD_DELAY in settings.py.
  • The four-extractor fallback chain (extractors.py) handles articles, legal docs, code samples, and PDFs from a single URL list. Scrapy would require per-extractor middleware classes.
  • Memory footprint is small enough to run alongside the Ray cluster on a single 16GB laptop during M01-M02 hacking.
  • The BaseCrawler.fetch() interface is async-native, which matches the rest of the pipeline (dedup, embed, and vLLM serving are all async).
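
The fallback behaviour of the extractor chain can be illustrated without installing any extraction library. In the sketch below the two extractors are hypothetical stand-ins for real wrappers, and the `try/except` guard is an addition that the snippet in the Decision section does not include:

```python
from typing import Callable, List

Extractor = Callable[[str, str], str]  # (html, url) -> extracted text


def run_chain(chain: List[Extractor], html: str, url: str, min_chars: int = 200) -> str:
    """Return the first extraction longer than min_chars, else an empty string."""
    for extractor in chain:
        try:
            text = extractor(html, url)
        except Exception:
            continue  # a crashing extractor should not abort the whole chain
        if text and len(text) > min_chars:
            return text
    return ""


# Hypothetical stand-ins for real extractor wrappers.
def article_extractor(html: str, url: str) -> str:
    return ""  # pretend the page is not article-shaped


def generic_extractor(html: str, url: str) -> str:
    return "x" * 500  # pretend a broader extractor found enough text


result = run_chain([article_extractor, generic_extractor], "<html>...</html>", "https://example.com")
print(len(result))  # 500
```

Because each extractor is a plain callable, adding or reordering one is a one-line change to the list.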

Consequences (negative)

  • No JS rendering. Sites that require JS execution to expose content are out of scope. Mitigation: the bundled data/raw_corpus/ ships 120 synthetic .txt files for the demo path; Tier-2 documents the Playwright swap.
  • Politeness primitives are BYO. robots.txt parsing, retry/backoff, duplicate-URL detection are application code. Mitigation: rate_limiter.py + crawl/ ship working defaults; learners read the code, not a Scrapy config.
  • No native fixture replay. Scrapy's HttpCacheMiddleware (enabled with HTTPCACHE_ENABLED) would let learners re-run a crawl against cached responses. Mitigation: part-1 documents a requests-mock fixture pattern.

Reversal plan

The crawler interface is BaseCrawler.fetch(url, session) -> dict. Replacement is bounded:

  1. Scrapy swap: replace BaseCrawler with a Scrapy Spider that yields the same dict shape. Move TokenBucket logic to a Scrapy DOWNLOADER_MIDDLEWARES entry. ~1 engineer-week.
  2. Playwright addition (not replacement): add a BrowserCrawler subclass that uses Playwright for JS-heavy URLs; route based on URL pattern in a select_crawler(url) function. ~3-5 engineer-days.
  3. Hybrid: keep aiohttp for static pages and add a Playwright sidecar behind a feature flag.

Estimated effort: 3-5 engineer-days for Playwright addition; 1 engineer-week for full Scrapy swap. Reversible.
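
The URL-pattern routing mentioned for the Playwright addition might look like the following. This is only a sketch: the patterns and the "static"/"browser" keys are hypothetical, and a real version would map those keys to BaseCrawler and BrowserCrawler instances.

```python
import re
from typing import List, Pattern

# Hypothetical patterns for URLs known to need a real browser.
JS_HEAVY_PATTERNS: List[Pattern[str]] = [
    re.compile(r"^https?://app\.example\.com/"),  # a single-page application
    re.compile(r"#!"),                            # hash-bang client-side routes
]


def select_crawler(url: str, js_patterns: List[Pattern[str]] = JS_HEAVY_PATTERNS) -> str:
    """Return "browser" for URLs matching a JS-heavy pattern, else "static"."""
    if any(pattern.search(url) for pattern in js_patterns):
        return "browser"
    return "static"


print(select_crawler("https://example.com/docs/intro"))     # static
print(select_crawler("https://app.example.com/dashboard"))  # browser
```

Keeping the decision in one pure function makes the routing easy to unit-test and keeps the static path free of any Playwright import.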

References

  • base_crawler.py (TokenBucket + async fetch)
  • extractors.py (ContentExtractor fallback chain)
  • crawl/ (robots.txt + URL deduplication helpers)
  • data/raw_corpus/ (120 synthetic .txt fixtures for the tutorial path)
  • ADR-002 (MinHash + LSH dedup — consumes the crawl output)