LLM Data Pipeline Project

Step-by-Step Walkthrough: Build a Web Crawler for LLM Training Data

Total Time: ~90 minutes
Difficulty: Intermediate
Tools: Python, BeautifulSoup, requests

What You'll Build

In this walkthrough, you'll build a polite, rate-limited web crawler to collect training data for LLMs, learning ethical crawling practices and data pipeline design along the way:

  • Build a respectful web crawler with rate limiting
  • Respect robots.txt and implement politeness policies
  • Extract clean text content from HTML
  • Filter and deduplicate content
  • Store data in JSONL format for LLM training
  • Monitor pipeline progress and quality metrics

Prerequisites

  • Python 3.8+ installed
  • Basic understanding of HTML and web requests
  • Familiarity with HTTP and REST APIs
  • Understanding of ethical web scraping

Step 1: Set Up Crawler Environment (20 min)

1.1 Create Project Structure

# Create project directory
mkdir llm-data-crawler
cd llm-data-crawler
# Create subdirectories
mkdir -p data/raw data/processed logs src

1.2 Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install requests==2.31.0 beautifulsoup4==4.12.2 \
lxml==4.9.3 urllib3==2.1.0 \
ratelimit==2.2.1 tqdm==4.66.1
# Save requirements
pip freeze > requirements.txt
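As an optional sanity check (not part of the tutorial's file layout), you can verify the interpreter version and confirm that the installed packages resolve. The `missing_packages` helper below is a name invented for this check:

```python
# Quick environment sanity check (optional).
import importlib.util
import sys

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    assert sys.version_info >= (3, 8), "Python 3.8+ is required"
    # Note: bs4 is the import name for the beautifulsoup4 package
    missing = missing_packages(["requests", "bs4", "lxml", "tqdm"])
    print("Missing packages:", missing or "none")
```

If anything is reported missing, re-run the `pip install` command above inside the activated virtual environment.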

1.3 Create Configuration File

Set up crawler configuration:

# config.yaml
crawler:
  user_agent: "LLMDataBot/1.0 (+http://yoursite.com/bot)"
  max_requests_per_second: 2
  timeout: 30
  max_retries: 3
  respect_robots_txt: true

filtering:
  min_word_count: 100
  max_word_count: 10000
  min_text_quality: 0.6

output:
  format: "jsonl"
  batch_size: 1000

Ethical Crawling
Always identify your bot with a clear user agent and respect robots.txt. Rate limiting (2 requests/second) ensures you don't overload servers.

1.4 Test Basic HTTP Request

Verify your setup with a simple request:

import requests

# Test request
headers = {'User-Agent': 'LLMDataBot/1.0'}
response = requests.get(
    'https://en.wikipedia.org/wiki/Python_(programming_language)',
    headers=headers,
    timeout=30,
)
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} chars")
Expected Output
Status: 200
Content length: ~250000 chars
Legal Notice
Only crawl websites where you have permission or that explicitly allow crawling. Check the website's Terms of Service and robots.txt. This tutorial uses Wikipedia, which allows automated access under certain conditions.
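You can experiment with robots.txt rules offline before crawling anything: `RobotFileParser.parse()` accepts a list of rule lines directly, so no network request is needed. The rules below are a made-up example, not Wikipedia's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline (no network needed)
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/wiki/Python"))    # True
print(rp.can_fetch("*", "https://example.com/private/notes"))  # False
```

The crawler you build in the next step uses this same parser, pointed at the live robots.txt of the target site.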

Step 2: Build Web Crawler with Rate Limiting (35 min)

2.1 Create Base Crawler Class

# src/crawler.py
import time

import requests
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class BaseCrawler:
    def __init__(self, base_url, rate_limit=2):
        self.base_url = base_url
        self.rate_limit = rate_limit  # max requests per second
        self.last_request_time = 0
        self.robots_parser = None
        self._setup_robots_txt()

    def _setup_robots_txt(self):
        """Load and parse robots.txt"""
        robots_url = urljoin(self.base_url, '/robots.txt')
        self.robots_parser = RobotFileParser()
        self.robots_parser.set_url(robots_url)
        self.robots_parser.read()

    def can_fetch(self, url):
        """Check if URL is allowed by robots.txt"""
        if not self.robots_parser:
            return True
        return self.robots_parser.can_fetch("*", url)

    def _rate_limit_wait(self):
        """Enforce rate limiting"""
        time_since_last = time.time() - self.last_request_time
        wait_time = (1.0 / self.rate_limit) - time_since_last
        if wait_time > 0:
            time.sleep(wait_time)
        self.last_request_time = time.time()

    def fetch(self, url):
        """Fetch URL content with rate limiting"""
        if not self.can_fetch(url):
            print(f"Blocked by robots.txt: {url}")
            return None
        self._rate_limit_wait()
        headers = {'User-Agent': 'LLMDataBot/1.0'}
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None
Key Features
✓ Respects robots.txt automatically
✓ Rate limiting (2 requests/second)
✓ Proper User-Agent identification
✓ Error handling for failed requests
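Note that `BaseCrawler.fetch()` catches errors but does not retry, even though the config defines `max_retries`. One way to layer retries on top is a wrapper with exponential backoff. The `fetch_with_retries` helper below is a sketch; its name and signature are illustrative, not part of the tutorial's files:

```python
import time

def fetch_with_retries(fetch_fn, url, max_retries=3, base_delay=1.0):
    """Call `fetch_fn(url)`, retrying on exceptions with exponential backoff.

    `fetch_fn` is any callable that raises on failure (e.g. one that calls
    `response.raise_for_status()`). Returns None if all attempts fail.
    """
    for attempt in range(max_retries):
        try:
            return fetch_fn(url)
        except Exception as exc:
            if attempt == max_retries - 1:
                print(f"Giving up on {url}: {exc}")
                return None
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            time.sleep(delay)
```

In production you would narrow the `except` clause to `requests.RequestException` so programming errors aren't silently retried.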

2.2 Test the Crawler

# Test crawler
crawler = BaseCrawler("https://en.wikipedia.org", rate_limit=2)

# Fetch a page
content = crawler.fetch("https://en.wikipedia.org/wiki/Python_(programming_language)")
if content:
    print(f"Successfully fetched {len(content)} characters")
Expected Behavior
The crawler waits 0.5 seconds between requests, respects robots.txt, and successfully fetches the page content.
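The 0.5-second figure is simply `1.0 / rate_limit`. The pure function below mirrors the arithmetic inside `_rate_limit_wait` with explicit clock values, which makes the spacing easy to check without actually sleeping (`wait_needed` is a name invented for this illustration):

```python
def wait_needed(rate_limit, last_request_time, now):
    """Seconds _rate_limit_wait would sleep, given the previous request time."""
    time_since_last = now - last_request_time
    wait = (1.0 / rate_limit) - time_since_last
    return max(wait, 0.0)

# At 2 requests/second the minimum spacing between requests is 0.5 s:
print(wait_needed(2, last_request_time=10.0, now=10.25))  # 0.25 (partial wait remains)
print(wait_needed(2, last_request_time=10.0, now=11.0))   # 0.0 (enough time has passed)
```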

Step 3: Extract and Clean Content (20 min)

3.1 Create Content Extractor

# src/extractor.py
import re

from bs4 import BeautifulSoup


class ContentExtractor:
    def extract_text(self, html):
        """Extract clean text from HTML"""
        soup = BeautifulSoup(html, 'lxml')
        # Remove unwanted tags
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        # Get text content
        text = soup.get_text(separator=' ', strip=True)
        # Clean whitespace
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def is_quality_content(self, text, min_words=100):
        """Check if content meets quality threshold"""
        words = text.split()
        if len(words) < min_words:
            return False
        # Check text quality (ratio of alphabetic characters)
        alpha_chars = sum(c.isalpha() for c in text)
        quality_ratio = alpha_chars / len(text)
        return quality_ratio > 0.6
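To see why the 0.6 alphabetic-ratio threshold works, it helps to run the ratio on small samples. The standalone `quality_ratio` helper below repeats the heuristic from `is_quality_content` (minus the word-count gate) so it can be tried on short strings:

```python
def quality_ratio(text):
    """Fraction of characters that are alphabetic (spaces count against it)."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

prose = "The quick brown fox jumps over the lazy dog"
noise = "404 | 12:30 | >>> [1] [2] [3] ### $$$"

print(round(quality_ratio(prose), 2))  # well above the 0.6 threshold
print(round(quality_ratio(noise), 2))  # well below it
```

Boilerplate-heavy pages (navigation fragments, timestamps, symbol runs) score low on this ratio, while natural-language prose scores high.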

3.2 Test Content Extraction

extractor = ContentExtractor()
# Extract text from fetched HTML
clean_text = extractor.extract_text(content)
is_quality = extractor.is_quality_content(clean_text)
print(f"Extracted text: {len(clean_text)} characters")
print(f"Word count: {len(clean_text.split())}")
print(f"Quality check: {is_quality}")
Expected Output
Extracted text: ~50000 characters
Word count: ~7500
Quality check: True

Step 4: Run Complete Pipeline (15 min)

4.1 Create Pipeline Script

# src/pipeline.py
import json

from tqdm import tqdm

from crawler import BaseCrawler
from extractor import ContentExtractor

# URLs to crawl (sample list)
urls = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Natural_language_processing",
]

crawler = BaseCrawler("https://en.wikipedia.org")
extractor = ContentExtractor()
results = []

# Process each URL
for url in tqdm(urls, desc="Crawling"):
    html = crawler.fetch(url)
    if not html:
        continue
    text = extractor.extract_text(html)
    if extractor.is_quality_content(text):
        results.append({
            "url": url,
            "text": text,
            "word_count": len(text.split()),
        })

# Save to JSONL
with open('data/processed/training_data.jsonl', 'w') as f:
    for item in results:
        f.write(json.dumps(item) + '\n')

print(f"\nProcessed {len(results)} pages")
print("Saved to data/processed/training_data.jsonl")

4.2 Run the Pipeline

python src/pipeline.py
Expected Output
Crawling: 100%|███████████| 3/3 [00:03<00:00, 1.00it/s]

Processed 3 pages
Saved to data/processed/training_data.jsonl

4.3 Verify Output

# Check the output file
head -n 1 data/processed/training_data.jsonl | python -m json.tool
JSONL Format
Each line is a valid JSON object containing url, text, and word_count. This format is ideal for LLM training pipelines and can be easily streamed.
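The overview promises deduplication, but the pipeline above does not yet remove duplicates. A common approach is to hash normalized text and keep only the first record per hash. The `dedupe_records` helper below is one possible sketch; the name and the normalization choices (lowercasing, whitespace collapsing) are illustrative:

```python
import hashlib

def dedupe_records(records):
    """Drop records whose normalized text has already been seen."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize: lowercase and collapse whitespace before hashing
        normalized = " ".join(rec["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    {"url": "https://example.com/a", "text": "Python is a language."},
    {"url": "https://example.com/b", "text": "python  is a LANGUAGE."},  # near-duplicate
    {"url": "https://example.com/c", "text": "Something else entirely."},
]
print(len(dedupe_records(records)))  # 2
```

Exact-hash dedup only catches identical (post-normalization) documents; large-scale pipelines typically add near-duplicate detection such as MinHash on top.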
Troubleshooting
  • Connection refused: Check your internet connection and firewall
  • 403 Forbidden: Site may be blocking automated requests; verify User-Agent
  • Empty text extraction: Website may use JavaScript; consider using Selenium
See the LLM Pipeline Troubleshooting Guide for more solutions.

Walkthrough Complete!

You've built a well-behaved web crawler with ethical defaults, rate limiting, and quality filtering. You're ready for Part 2!

What You've Learned:

  • Ethical web crawling principles
  • Respecting robots.txt automatically
  • Rate limiting implementation
  • HTML content extraction with BeautifulSoup
  • Text quality filtering
  • JSONL format for LLM training data
  • Error handling and retry logic
  • Production crawler architecture