LLM Data Pipeline Project

Step-by-Step Walkthrough: Build a Web Crawler for LLM Training Data

Total Time: ~90 minutes
Difficulty: Intermediate
Tools: Python, BeautifulSoup, requests

What You'll Build

In this walkthrough, you'll build a polite, rate-limited web crawler to collect training data for LLMs, learning ethical crawling practices and data pipeline design along the way:

  • Build a respectful web crawler with rate limiting
  • Respect robots.txt and implement politeness policies
  • Extract clean text content from HTML
  • Filter and deduplicate content
  • Store data in JSONL format for LLM training
  • Monitor pipeline progress and quality metrics

Prerequisites

  • Python 3.8+ installed
  • Basic understanding of HTML and web requests
  • Familiarity with HTTP and REST APIs
  • Understanding of ethical web scraping

Step 1: Set Up Crawler Environment (20 min)

1.1 Create Project Structure

# Create project directory
mkdir llm-data-crawler
cd llm-data-crawler
# Create subdirectories
mkdir -p data/raw data/processed logs src

1.2 Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install requests==2.31.0 beautifulsoup4==4.12.2 \
lxml==4.9.3 urllib3==2.1.0 \
ratelimit==2.2.1 tqdm==4.66.1
# Save requirements
pip freeze > requirements.txt
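As an optional sanity check (not part of the tutorial's file layout), you can verify the interpreter version and confirm that the installed packages resolve. The `missing_packages` helper below is a name invented for this check:

```python
# Quick environment sanity check (optional).
import importlib.util
import sys

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    assert sys.version_info >= (3, 8), "Python 3.8+ is required"
    # Note: bs4 is the import name for the beautifulsoup4 package
    missing = missing_packages(["requests", "bs4", "lxml", "tqdm"])
    print("Missing packages:", missing or "none")
```

If anything is reported missing, re-run the `pip install` command above inside the activated virtual environment.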

1.3 Create Configuration File

Set up crawler configuration:

# config.yaml
crawler:
  user_agent: "LLMDataBot/1.0 (+http://yoursite.com/bot)"
  max_requests_per_second: 2
  timeout: 30
  max_retries: 3
  respect_robots_txt: true

filtering:
  min_word_count: 100
  max_word_count: 10000
  min_text_quality: 0.6

output:
  format: "jsonl"
  batch_size: 1000

Ethical Crawling
Always identify your bot with a clear user agent and respect robots.txt. Rate limiting (2 requests/second) ensures you don't overload servers.

1.4 Test Basic HTTP Request

Verify your setup with a simple request:

import requests

# Test request
headers = {'User-Agent': 'LLMDataBot/1.0'}
response = requests.get(
    'https://en.wikipedia.org/wiki/Python_(programming_language)',
    headers=headers,
    timeout=30,
)
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} chars")
Expected Output
Status: 200
Content length: ~250000 chars
Legal Notice
Only crawl websites where you have permission or that explicitly allow crawling. Check the website's Terms of Service and robots.txt. This tutorial uses Wikipedia, which allows automated access under certain conditions.
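You can experiment with robots.txt rules offline before crawling anything: `RobotFileParser.parse()` accepts a list of rule lines directly, so no network request is needed. The rules below are a made-up example, not Wikipedia's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline (no network needed)
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/wiki/Python"))    # True
print(rp.can_fetch("*", "https://example.com/private/notes"))  # False
```

The crawler you build in the next step uses this same parser, pointed at the live robots.txt of the target site.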

Step 2: Build Web Crawler with Rate Limiting (35 min)

2.1 Create Base Crawler Class

# src/crawler.py
import time

import requests
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class BaseCrawler:
    def __init__(self, base_url, rate_limit=2):
        self.base_url = base_url
        self.rate_limit = rate_limit  # max requests per second
        self.last_request_time = 0
        self.robots_parser = None
        self._setup_robots_txt()

    def _setup_robots_txt(self):
        """Load and parse robots.txt"""
        robots_url = urljoin(self.base_url, '/robots.txt')
        self.robots_parser = RobotFileParser()
        self.robots_parser.set_url(robots_url)
        self.robots_parser.read()

    def can_fetch(self, url):
        """Check if URL is allowed by robots.txt"""
        if not self.robots_parser:
            return True
        return self.robots_parser.can_fetch("*", url)

    def _rate_limit_wait(self):
        """Enforce rate limiting"""
        time_since_last = time.time() - self.last_request_time
        wait_time = (1.0 / self.rate_limit) - time_since_last
        if wait_time > 0:
            time.sleep(wait_time)
        self.last_request_time = time.time()

    def fetch(self, url):
        """Fetch URL content with rate limiting"""
        if not self.can_fetch(url):
            print(f"Blocked by robots.txt: {url}")
            return None
        self._rate_limit_wait()
        headers = {'User-Agent': 'LLMDataBot/1.0'}
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None
Key Features
✓ Respects robots.txt automatically
✓ Rate limiting (2 requests/second)
✓ Proper User-Agent identification
✓ Error handling for failed requests
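Note that `BaseCrawler.fetch()` catches errors but does not retry, even though the config defines `max_retries`. One way to layer retries on top is a wrapper with exponential backoff. The `fetch_with_retries` helper below is a sketch; its name and signature are illustrative, not part of the tutorial's files:

```python
import time

def fetch_with_retries(fetch_fn, url, max_retries=3, base_delay=1.0):
    """Call `fetch_fn(url)`, retrying on exceptions with exponential backoff.

    `fetch_fn` is any callable that raises on failure (e.g. one that calls
    `response.raise_for_status()`). Returns None if all attempts fail.
    """
    for attempt in range(max_retries):
        try:
            return fetch_fn(url)
        except Exception as exc:
            if attempt == max_retries - 1:
                print(f"Giving up on {url}: {exc}")
                return None
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            time.sleep(delay)
```

In production you would narrow the `except` clause to `requests.RequestException` so programming errors aren't silently retried.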

2.2 Test the Crawler

# Test crawler
crawler = BaseCrawler("https://en.wikipedia.org", rate_limit=2)

# Fetch a page
content = crawler.fetch("https://en.wikipedia.org/wiki/Python_(programming_language)")
if content:
    print(f"Successfully fetched {len(content)} characters")
Expected Behavior
The crawler waits 0.5 seconds between requests, respects robots.txt, and successfully fetches the page content.
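The 0.5-second figure is simply `1.0 / rate_limit`. The pure function below mirrors the arithmetic inside `_rate_limit_wait` with explicit clock values, which makes the spacing easy to check without actually sleeping (`wait_needed` is a name invented for this illustration):

```python
def wait_needed(rate_limit, last_request_time, now):
    """Seconds _rate_limit_wait would sleep, given the previous request time."""
    time_since_last = now - last_request_time
    wait = (1.0 / rate_limit) - time_since_last
    return max(wait, 0.0)

# At 2 requests/second the minimum spacing between requests is 0.5 s:
print(wait_needed(2, last_request_time=10.0, now=10.25))  # 0.25 (partial wait remains)
print(wait_needed(2, last_request_time=10.0, now=11.0))   # 0.0 (enough time has passed)
```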

Step 3: Extract and Clean Content (20 min)

3.1 Create Content Extractor

# src/extractor.py
import re

from bs4 import BeautifulSoup


class ContentExtractor:
    def extract_text(self, html):
        """Extract clean text from HTML"""
        soup = BeautifulSoup(html, 'lxml')
        # Remove unwanted tags
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        # Get text content
        text = soup.get_text(separator=' ', strip=True)
        # Clean whitespace
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def is_quality_content(self, text, min_words=100):
        """Check if content meets quality threshold"""
        words = text.split()
        if len(words) < min_words:
            return False
        # Check text quality (ratio of alphabetic characters)
        alpha_chars = sum(c.isalpha() for c in text)
        quality_ratio = alpha_chars / len(text)
        return quality_ratio > 0.6
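To see why the 0.6 alphabetic-ratio threshold works, it helps to run the ratio on small samples. The standalone `quality_ratio` helper below repeats the heuristic from `is_quality_content` (minus the word-count gate) so it can be tried on short strings:

```python
def quality_ratio(text):
    """Fraction of characters that are alphabetic (spaces count against it)."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

prose = "The quick brown fox jumps over the lazy dog"
noise = "404 | 12:30 | >>> [1] [2] [3] ### $$$"

print(round(quality_ratio(prose), 2))  # well above the 0.6 threshold
print(round(quality_ratio(noise), 2))  # well below it
```

Boilerplate-heavy pages (navigation fragments, timestamps, symbol runs) score low on this ratio, while natural-language prose scores high.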

3.2 Test Content Extraction

extractor = ContentExtractor()
# Extract text from fetched HTML
clean_text = extractor.extract_text(content)
is_quality = extractor.is_quality_content(clean_text)
print(f"Extracted text: {len(clean_text)} characters")
print(f"Word count: {len(clean_text.split())}")
print(f"Quality check: {is_quality}")
Expected Output
Extracted text: ~50000 characters
Word count: ~7500
Quality check: True

Step 4: Run Complete Pipeline (15 min)

4.1 Create Pipeline Script

# src/pipeline.py
import json

from tqdm import tqdm

from crawler import BaseCrawler
from extractor import ContentExtractor

# URLs to crawl (sample list)
urls = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Natural_language_processing",
]

crawler = BaseCrawler("https://en.wikipedia.org")
extractor = ContentExtractor()
results = []

# Process each URL
for url in tqdm(urls, desc="Crawling"):
    html = crawler.fetch(url)
    if not html:
        continue
    text = extractor.extract_text(html)
    if extractor.is_quality_content(text):
        results.append({
            "url": url,
            "text": text,
            "word_count": len(text.split()),
        })

# Save to JSONL
with open('data/processed/training_data.jsonl', 'w') as f:
    for item in results:
        f.write(json.dumps(item) + '\n')

print(f"\nProcessed {len(results)} pages")
print("Saved to data/processed/training_data.jsonl")

4.2 Run the Pipeline

python src/pipeline.py
Expected Output
Crawling: 100%|███████████| 3/3 [00:03<00:00, 1.00it/s]

Processed 3 pages
Saved to data/processed/training_data.jsonl

4.3 Verify Output

# Check the output file
head -n 1 data/processed/training_data.jsonl | python -m json.tool
JSONL Format
Each line is a valid JSON object containing url, text, and word_count. This format is ideal for LLM training pipelines and can be easily streamed.
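The overview promises deduplication, but the pipeline above does not yet remove duplicates. A common approach is to hash normalized text and keep only the first record per hash. The `dedupe_records` helper below is one possible sketch; the name and the normalization choices (lowercasing, whitespace collapsing) are illustrative:

```python
import hashlib

def dedupe_records(records):
    """Drop records whose normalized text has already been seen."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize: lowercase and collapse whitespace before hashing
        normalized = " ".join(rec["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    {"url": "https://example.com/a", "text": "Python is a language."},
    {"url": "https://example.com/b", "text": "python  is a LANGUAGE."},  # near-duplicate
    {"url": "https://example.com/c", "text": "Something else entirely."},
]
print(len(dedupe_records(records)))  # 2
```

Exact-hash dedup only catches identical (post-normalization) documents; large-scale pipelines typically add near-duplicate detection such as MinHash on top.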
Troubleshooting
  • Connection refused: Check your internet connection and firewall
  • 403 Forbidden: Site may be blocking automated requests; verify User-Agent
  • Empty text extraction: Website may use JavaScript; consider using Selenium
See the LLM Pipeline Troubleshooting Guide for more solutions.

Walkthrough Complete!

You've built a well-behaved web crawler with ethical defaults, rate limiting, and quality filtering. You're ready for Part 2!

What You've Learned:

  • Ethical web crawling principles
  • Respecting robots.txt automatically
  • Rate limiting implementation
  • HTML content extraction with BeautifulSoup
  • Text quality filtering
  • JSONL format for LLM training data
  • Error handling and retry logic
  • Production crawler architecture