# ADR-001 — Use aiohttp + custom crawler over Scrapy / requests-html

- **Status:** Accepted
- **Date:** 2026-05-09
- **Module:** 01 — Data Foundation (sub-part: Data Collection & Crawling)
- **Stakeholders:** data engineer, infra reviewer

## Context

The pipeline ingests 1k+ web pages per project run and must scale to
batch ingestion of up to 1M documents. The crawler is the front of the
pipeline — every downstream module (dedup, quality, tokenization, RAG)
depends on what it produces. The classic options:

1. **Scrapy** — the de-facto Python crawler framework. Spider classes,
   middleware pipelines, request scheduler, selectors. Heavy but
   battle-tested.
2. **requests-html** — synchronous, lightweight, JavaScript rendering
   built in. Simple, but the default `HTMLSession` blocks on every request.
3. **Playwright / Puppeteer** — full browser automation. Necessary for
   JS-heavy sites; slow and memory-heavy for static pages.
4. **aiohttp + custom orchestration** — async HTTP client with manual
   request scheduling, rate limiting, and content extraction.

We are building a tutorial pipeline that a learner must be able to
reproduce on a laptop in under 2 hours and that must also survive a real
production batch ingest.

## Decision

Adopt **aiohttp** with a custom `BaseCrawler`, a `TokenBucket` rate
limiter, and a four-extractor fallback chain (trafilatura, readability-lxml,
BeautifulSoup, pypdf, tried in that order).

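```python
# base_crawler.py: the spine
import asyncio
import time

from aiohttp import ClientSession, ClientTimeout


class TokenBucket:
    """Refills `rate` tokens per second and holds at most `burst` tokens."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        # time.monotonic() needs no running event loop, so the bucket can be
        # constructed outside a coroutine.
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self.lock:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                # Sleep while holding the lock so other waiters queue behind us.
                await asyncio.sleep((1 - self.tokens) / self.rate)
                # The sleep refilled the token consumed here; advance the clock
                # so that interval is not credited again on the next call.
                self.last = time.monotonic()
                self.tokens = 0.0
            else:
                self.tokens -= 1


class BaseCrawler:
    def __init__(self, rate: float = 2.0, concurrency: int = 10):
        # burst is at least 1 so rates below 1 req/s still start with a token
        self.bucket = TokenBucket(rate=rate, burst=max(1, int(rate)))
        self.sem = asyncio.Semaphore(concurrency)
        self.timeout = ClientTimeout(total=30)  # seconds, for the whole request

    async def fetch(self, url: str, session: ClientSession) -> dict:
        # Take the concurrency slot first, then the token, so a token is only
        # spent when the request is about to be sent.
        async with self.sem:
            await self.bucket.acquire()
            async with session.get(url, timeout=self.timeout) as resp:
                return {"url": url, "status": resp.status, "body": await resp.text()}
```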
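
A usage sketch for the spine above: fanning `fetch` out over a seed list with `asyncio.gather`. To keep it runnable without network access, a stand-in coroutine (`fake_fetch`, hypothetical) replaces `BaseCrawler.fetch`; `crawl_all` is likewise an illustrative name, not something the repository is known to contain.

```python
import asyncio
from typing import Awaitable, Callable


async def fake_fetch(url: str) -> dict:
    """Hypothetical stand-in for BaseCrawler.fetch; no network involved."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    if "bad" in url:
        raise ConnectionError(f"simulated failure for {url}")
    return {"url": url, "status": 200, "body": "<html>ok</html>"}


async def crawl_all(
    urls: list[str], fetch: Callable[[str], Awaitable[dict]]
) -> list[dict]:
    """Fetch every URL concurrently; a failure becomes a record, not an abort."""
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    records = []
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            records.append({"url": url, "status": None, "error": repr(result)})
        else:
            records.append(result)
    return records


records = asyncio.run(
    crawl_all(["https://example.com/a", "https://example.com/bad"], fake_fetch)
)
```

With the real crawler, the `fetch` argument would presumably be something like `lambda u: crawler.fetch(u, session)` inside an `async with ClientSession() as session:` block, so the rate limiter and semaphore stay in charge of pacing.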

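```python
# extractors.py: fallback chain
from typing import Callable, Optional, Sequence

# An extractor takes (html, url) and returns text, or None/"" on failure.
Extractor = Callable[[str, str], Optional[str]]


class ContentExtractor:
    MIN_CHARS = 200  # shorter results are treated as failed extractions

    def __init__(self, chain: Sequence[Extractor]):
        # Assumed wiring: callers pass the extractors in priority order.
        # 1. trafilatura (best for articles)
        # 2. readability-lxml (medium quality, broader coverage)
        # 3. BeautifulSoup with custom rules (legal docs, code samples)
        # 4. pypdf (when content-type is PDF)
        self._chain = list(chain)

    def extract(self, html: str, url: str) -> str:
        for extractor in self._chain:
            text = extractor(html, url)
            if text and len(text) > self.MIN_CHARS:
                return text
        return ""
```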

## Tradeoffs we accept

| Lever                                          | Scrapy                                    | requests-html | Playwright               | aiohttp (chosen)       |
| ---------------------------------------------- | ----------------------------------------- | ------------- | ------------------------ | ---------------------- |
| Day-1 setup                                    | `scrapy startproject` + middleware config | `pip install` | Browser binary download  | `pip install aiohttp`  |
| Concurrency                                    | Built-in (Twisted reactor)                | Sync (slow)   | Process-per-page (heavy) | asyncio + Semaphore    |
| Wall-clock time, 1k static pages at 2 req/s    | ~10 minutes                               | ~30+ minutes  | ~30+ minutes             | ~10 minutes            |
| Tutorial reproducibility                       | Heavy (Scrapy mental model)               | Easy but slow | Browser binary in CI     | Easy + fast            |
| Customization                                  | Middleware pipeline (powerful, opaque)    | Hard          | Possible                 | Plain Python           |
| Rate limiting                                  | Built-in `DOWNLOAD_DELAY`                 | Build it      | Build it                 | TokenBucket (this ADR) |
| JS rendering                                   | Splash sidecar                            | Built-in      | Native                   | Not supported (CUT)    |
| robots.txt + politeness                        | Built-in `RobotsTxtMiddleware`            | Build it      | Build it                 | Build it               |
| Memory footprint                               | ~100 MB                                   | ~50 MB        | ~500 MB+                 | ~30 MB                 |

We optimize for **tutorial reproducibility + control over the request
loop**. Scrapy's spider/middleware/pipeline mental model adds 2-3
hours to a learner's onramp before they crawl their first page; aiohttp
gets to "fetch a URL with rate limiting" in 30 lines. The cost is
JS rendering (cut from scope) and reimplementing politeness rules
(`robots.txt` parsing in `crawl/` is ~80 lines; documented).
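
The ~10 minute figure in the table is driven by the rate limit rather than by the HTTP client: at a fixed request rate, a crawl cannot finish faster than pages divided by rate. A quick check of that floor, with `min_crawl_seconds` as an illustrative helper name:

```python
def min_crawl_seconds(pages: int, rate_per_s: float, burst: int = 0) -> float:
    """Lower bound on crawl time under a token bucket.

    The first `burst` requests can go out immediately; every remaining
    request has to wait for a token refilled at `rate_per_s`.
    """
    return max(0, pages - burst) / rate_per_s


floor_s = min_crawl_seconds(pages=1000, rate_per_s=2.0, burst=2)
print(f"{floor_s:.0f} s = {floor_s / 60:.1f} min")  # prints "499 s = 8.3 min"
```

Network latency and extraction time come on top of this floor, which is consistent with the roughly 10 minutes quoted for the two rate-limited asynchronous options.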

## Consequences (positive)

- A learner runs `python crawl/run.py --seeds urls.txt` and sees the
  TokenBucket throttle live in stdout — pedagogically clearer than
  configuring `DOWNLOAD_DELAY` in `settings.py`.
- The four-extractor fallback chain (`extractors.py`) handles
  articles, legal docs, code samples, and PDFs from a single URL list.
  Scrapy would require per-extractor middleware classes.
- Memory footprint is small enough to run alongside the Ray cluster
  on a single 16 GB laptop during M01-M02 hacking.
- The `BaseCrawler.fetch()` interface is async-native, which matches
  the rest of the pipeline (dedup, embedding, and vLLM serving are all async).

## Consequences (negative)

- **No JS rendering.** Sites that require JS execution to expose
  content are out of scope. Mitigation: the bundled `data/raw_corpus/`
  ships 120 synthetic .txt files for the demo path; Tier-2 documents
  the Playwright swap.
- **Politeness primitives are BYO.** robots.txt parsing, retry/backoff,
  duplicate-URL detection are application code. Mitigation:
  `rate_limiter.py` + `crawl/` ship working defaults; learners read
  the code, not a Scrapy config.
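- **No native fixture replay.** Scrapy's `httpcache` would let learners
  re-run a crawl against cached responses. We document a `requests-mock`
  fixture pattern in part-1; because `requests-mock` only patches the
  `requests` library, replaying aiohttp traffic needs an aiohttp-aware
  mock such as `aioresponses`.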
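
As an illustration of the BYO politeness primitives, a robots.txt check can be sketched with the standard library's `urllib.robotparser`. This is not the code in `crawl/`; the user-agent string and the sample rules are made up for the example, and parsing from a string avoids any network access.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would fetch /robots.txt per host.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "tutorial-crawler"  # hypothetical user-agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/articles/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(USER_AGENT)  # seconds, or None when not specified
```

A `Crawl-delay` of 2 seconds corresponds to `rate=0.5` in the `TokenBucket`, so one plausible design is to lower the per-host rate whenever the site asks for a longer delay than the default.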

## Reversal plan

The crawler interface is `BaseCrawler.fetch(url, session) -> dict`.
Replacement is bounded:

1. **Scrapy swap** — replace `BaseCrawler` with a Scrapy `Spider` that
   yields the same dict shape. Move TokenBucket logic to a Scrapy
   `DOWNLOADER_MIDDLEWARES` entry. ~1 engineer-week.
2. **Playwright addition** (not replacement) — add a `BrowserCrawler`
   subclass that uses Playwright for JS-heavy URLs; route based on
   URL pattern in a `select_crawler(url)` function. ~3-5 engineer-days.
3. **Hybrid** — keep aiohttp for static, add a Playwright sidecar via
   feature flag.
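
The routing in item 2 could look roughly like the sketch below. The host pattern, path prefix, and returned labels are placeholders; in the real swap the function would presumably return a `BaseCrawler` or `BrowserCrawler` instance rather than a string.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules for URLs known to need a browser.
JS_HEAVY_HOSTS = [re.compile(r"(^|\.)app\.example\.com$")]  # matched on hostname
JS_HEAVY_PATH_PREFIXES = ("/dashboard/",)                   # matched on path


def select_crawler(url: str) -> str:
    """Return "browser" for URLs that need JS rendering, else "static"."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if any(pattern.search(host) for pattern in JS_HEAVY_HOSTS):
        return "browser"
    if parts.path.startswith(JS_HEAVY_PATH_PREFIXES):
        return "browser"
    return "static"
```

Keeping the rules in plain data structures means the hybrid option in item 3 can gate the `"browser"` branch behind a feature flag without touching the static path.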

Estimated effort: **3-5 engineer-days** for Playwright addition;
**1 engineer-week** for full Scrapy swap. Reversible.

## References

- `base_crawler.py` (TokenBucket + async fetch)
- `extractors.py` (ContentExtractor fallback chain)
- `crawl/` (robots.txt + URL deduplication helpers)
- `data/raw_corpus/` (120 synthetic .txt fixtures for the tutorial path)
- ADR-002 (MinHash + LSH dedup — consumes the crawl output)
