
LLM Evaluation Project

Step-by-Step Walkthrough: Build a Production LLM Evaluation System

Total Time: ~2 hours
Difficulty: Advanced
Tools: Python, OpenAI, SentenceTransformers

What You'll Build

In this walkthrough, you'll build a production-grade LLM evaluation system to rigorously test and compare model performance:

  • Implement multiple evaluation metrics (exact match, semantic, LLM-as-judge)
  • Build a flexible evaluation engine for batch processing
  • Create LLM-as-judge for nuanced quality assessment
  • Store and analyze evaluation results
  • Compare models with statistical significance testing

Prerequisites

  • Python 3.9+ installed
  • OpenAI API key
  • Understanding of LLM evaluation concepts
  • Familiarity with embeddings and similarity metrics
Step 1: Set Up Evaluation Framework

30 min

1.1 Create Project Structure

# Create project directory
mkdir llm-eval-framework
cd llm-eval-framework
# Create subdirectories
mkdir -p src/core src/metrics src/judges data/benchmarks results

1.2 Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install packages
pip install openai==1.6.1 sentence-transformers==2.2.2 \
    numpy==1.26.2 pandas==2.1.3 \
    python-dotenv==1.0.0 tqdm==4.66.1
pip freeze > requirements.txt

1.3 Define Type System

Create data structures for evaluations:

# src/core/types.py
from dataclasses import dataclass, field
from typing import Any, Dict, List
from datetime import datetime


@dataclass
class TestCase:
    """A single test case."""
    id: str
    input: str
    expected_output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EvaluationResult:
    """Result from one metric."""
    metric_name: str
    score: float
    passed: bool
    details: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SuiteResult:
    """Results from a full evaluation suite."""
    suite_name: str
    results: List[EvaluationResult]
    timestamp: datetime = field(default_factory=datetime.now)

    def aggregate_score(self) -> float:
        """Average score across all results."""
        if not self.results:
            return 0.0
        return sum(r.score for r in self.results) / len(self.results)
Type Safety
Dataclasses document each field's type, provide default values, and auto-generate __init__ and __eq__. Note the annotations are not enforced at runtime; serialization to JSON is straightforward via dataclasses.asdict().
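Because these are plain dataclasses, a JSON round trip is a one-liner with dataclasses.asdict(). A minimal sketch (the TestCase type from above is re-declared here so the snippet runs on its own):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


tc = TestCase(id="test-001", input="What is the capital of France?",
              expected_output="Paris")
payload = json.dumps(asdict(tc))            # dataclass -> dict -> JSON string
restored = TestCase(**json.loads(payload))  # JSON string -> dict -> dataclass
print(restored == tc)  # dataclasses generate __eq__, so the round trip compares equal
```

The same pattern works for EvaluationResult; SuiteResult needs its timestamp converted (e.g. via .isoformat()) first, as shown in step 4.3.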

1.4 Create Test Dataset

# Create sample test cases
import json

test_cases = [
    {
        "id": "test-001",
        "input": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "id": "test-002",
        "input": "Explain photosynthesis in one sentence.",
        "expected_output": "Plants convert sunlight into energy."
    }
]

with open('data/benchmarks/test_suite.json', 'w') as f:
    json.dump(test_cases, f, indent=2)
Step 2: Build Metrics and Judges

45 min

2.1 Implement Exact Match Metric

# src/metrics/exact_match.py
from typing import Protocol


class Metric(Protocol):
    """Base metric interface."""
    def evaluate(self, predicted: str, expected: str) -> float:
        ...


class ExactMatchMetric:
    """Binary exact match metric."""
    def __init__(self, case_sensitive: bool = False):
        self.case_sensitive = case_sensitive

    def evaluate(self, predicted: str, expected: str) -> float:
        predicted = predicted.strip()
        expected = expected.strip()
        if not self.case_sensitive:
            predicted = predicted.lower()
            expected = expected.lower()
        return 1.0 if predicted == expected else 0.0
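Exact match is strict by design: even a trailing period fails it. A common extension (not part of the framework above, just a sketch) also normalizes punctuation before comparing:

```python
import string


def normalize(text: str) -> str:
    # Lowercase, trim, and drop punctuation before comparing
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))


def exact_match(predicted: str, expected: str) -> float:
    # Like ExactMatchMetric, but tolerant of punctuation differences
    return 1.0 if normalize(predicted) == normalize(expected) else 0.0


print(exact_match("Paris.", "paris"))  # punctuation/case ignored -> 1.0
print(exact_match("Lyon", "Paris"))    # -> 0.0
```

Whether to normalize this aggressively depends on the task; for code or numeric outputs, punctuation often matters.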

2.2 Implement Semantic Similarity Metric

# src/metrics/semantic.py
from sentence_transformers import SentenceTransformer
import numpy as np


class SemanticSimilarityMetric:
    """Cosine similarity using sentence embeddings."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def evaluate(self, predicted: str, expected: str) -> float:
        # Generate embeddings
        pred_emb = self.model.encode(predicted)
        exp_emb = self.model.encode(expected)
        # Cosine similarity
        similarity = np.dot(pred_emb, exp_emb) / (
            np.linalg.norm(pred_emb) * np.linalg.norm(exp_emb)
        )
        return float(similarity)
Semantic vs Exact
Semantic similarity catches correct answers phrased differently. "Paris" and "The capital of France is Paris" score high (around 0.8) even though they are not exact matches.
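The cosine similarity formula used in SemanticSimilarityMetric can be sanity-checked on toy vectors without downloading a model (pure numpy, same computation as the metric):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Identical to the metric's computation: dot product over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # parallel -> 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal -> 0.0
```

With real sentence embeddings the values rarely reach the extremes; unrelated sentences typically land well above 0.0, so calibrate any pass threshold against your own data rather than assuming 0.0 means "unrelated".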

2.3 Create LLM-as-Judge

# src/judges/llm_judge.py
from openai import OpenAI
import os


class LLMJudge:
    """Use GPT-4 to judge answer quality."""
    def __init__(self):
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def evaluate(self, question: str, predicted: str, expected: str) -> float:
        prompt = f"""
Question: {question}
Expected Answer: {expected}
Model Answer: {predicted}

Rate how well the Model Answer addresses the question compared to
the Expected Answer. Score from 0.0 to 1.0.
Respond ONLY with a number between 0.0 and 1.0.
"""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            score = float(response.choices[0].message.content.strip())
            return max(0.0, min(1.0, score))  # Clamp to [0, 1]
        except (TypeError, ValueError):
            # Judge returned something that isn't a bare number
            return 0.0
LLM-as-Judge Benefits
GPT-4 can assess nuanced qualities like correctness, coherence, and helpfulness that simple metrics miss. Great for open-ended tasks!
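Judge scores can vary from run to run even at temperature 0. One hedge (a sketch, not part of the framework above) is to call the judge several times and take the median; passing the judge in as a plain callable keeps this testable with a stub:

```python
import statistics
from typing import Callable


def judge_with_retries(judge_fn: Callable[[], float], runs: int = 3) -> float:
    # Run the judge multiple times and take the median to damp variance
    scores = [judge_fn() for _ in range(runs)]
    return statistics.median(scores)


# Hypothetical wiring with the real judge:
#   score = judge_with_retries(lambda: judge.evaluate(question, predicted, expected))

# Demo with a stub that returns a slightly different score each call
noisy = iter([0.8, 0.6, 0.7])
print(judge_with_retries(lambda: next(noisy)))  # median of 0.8, 0.6, 0.7 -> 0.7
```

The median is preferred over the mean here because a single wildly off judgment shifts it less.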
Step 3: Create Evaluation Engine

25 min

3.1 Build Evaluation Engine

# src/core/engine.py
from typing import Callable, Dict, List

from tqdm import tqdm

from src.core.types import TestCase, EvaluationResult, SuiteResult


class EvaluationEngine:
    """Run evaluations across test cases and metrics."""
    def __init__(self, model_fn: Callable[[str], str]):
        """model_fn: function that takes an input string and returns model output."""
        self.model_fn = model_fn
        self.metrics: Dict[str, object] = {}

    def add_metric(self, name: str, metric) -> None:
        self.metrics[name] = metric

    def run_evaluation(self, test_cases: List[TestCase]) -> SuiteResult:
        results = []
        for test in tqdm(test_cases, desc="Evaluating"):
            # Get model prediction
            predicted = self.model_fn(test.input)
            # Run all metrics on the same prediction
            for metric_name, metric in self.metrics.items():
                score = metric.evaluate(predicted, test.expected_output)
                results.append(EvaluationResult(
                    metric_name=metric_name,
                    score=score,
                    passed=score > 0.7,
                    details={
                        "test_id": test.id,
                        "predicted": predicted,
                        "expected": test.expected_output
                    }
                ))
        return SuiteResult(suite_name="default", results=results)

3.2 Create Mock Model Function

# Simple mock model for testing
def mock_model(question: str) -> str:
    """Mock model - returns canned answers."""
    answers = {
        "What is the capital of France?": "Paris",
        "Explain photosynthesis in one sentence.":
            "Photosynthesis is how plants make energy from sunlight.",
    }
    return answers.get(question, "I don't know.")
Step 4: Run Evaluations and Analyze Results

20 min

4.1 Run Full Evaluation

# Load test cases
import json

from src.core.types import TestCase
from src.core.engine import EvaluationEngine
from src.metrics.exact_match import ExactMatchMetric
from src.metrics.semantic import SemanticSimilarityMetric

# Load data
with open('data/benchmarks/test_suite.json') as f:
    data = json.load(f)
test_cases = [TestCase(**tc) for tc in data]

# Set up engine
engine = EvaluationEngine(model_fn=mock_model)
engine.add_metric("exact_match", ExactMatchMetric())
engine.add_metric("semantic", SemanticSimilarityMetric())

# Run evaluation
results = engine.run_evaluation(test_cases)

4.2 Analyze Results

# Print summary
print("\nEvaluation Results:")
print(f"Total tests: {len(test_cases)}")
print(f"Average score: {results.aggregate_score():.2f}")

# Break down by metric
import pandas as pd

df = pd.DataFrame([
    {
        "metric": r.metric_name,
        "test_id": r.details["test_id"],
        "score": r.score,
        "passed": r.passed
    }
    for r in results.results
])

# Group by metric
print("\nScores by metric:")
print(df.groupby("metric")["score"].mean())
Expected Output (semantic scores are approximate and vary slightly across embedding model versions)
Evaluation Results:
Total tests: 2
Average score: 0.68

Scores by metric:
exact_match    0.50
semantic       0.85

4.3 Save Results

# Export to CSV
df.to_csv('results/eval_results.csv', index=False)

# Save detailed JSON
output = {
    "suite_name": results.suite_name,
    "timestamp": results.timestamp.isoformat(),
    "aggregate_score": results.aggregate_score(),
    "results": [
        {
            "metric": r.metric_name,
            "score": r.score,
            "passed": r.passed,
            "details": r.details
        }
        for r in results.results
    ]
}
with open('results/eval_results.json', 'w') as f:
    json.dump(output, f, indent=2)
print("\n✓ Results saved!")
Results Storage
Save both CSV (for analysis in Excel/Python) and JSON (for programmatic access). Include timestamps to track improvements over time.
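The intro also promised comparing models with statistical significance testing, which the steps above don't implement. A minimal sketch is a paired sign-flip permutation test on per-test-case scores (numpy only; the score arrays below are made up for illustration):

```python
import numpy as np


def paired_permutation_test(scores_a, scores_b, n_perm: int = 10000, seed: int = 0) -> float:
    # Null hypothesis: models A and B are interchangeable, so the sign of each
    # per-test-case score difference is arbitrary. Randomly flip signs and count
    # how often the permuted mean difference is at least as extreme as observed.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = np.abs((flips * diffs).mean(axis=1))
    return float((permuted >= observed).mean())  # p-value


# Hypothetical per-test-case scores for two models on the same suite
model_a = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9]
model_b = [0.6, 0.65, 0.7, 0.55, 0.6, 0.7]
print(f"p-value: {paired_permutation_test(model_a, model_b):.4f}")
```

Pairing matters: both models must be scored on the same test cases so that per-case differences are meaningful. With only a handful of cases, as here, p-values are coarse; real comparisons need much larger suites.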
Troubleshooting
  • Low scores: Check test case quality and metric thresholds
  • LLM judge variance: Run multiple times and average scores
  • Slow evaluations: Use batch processing or async requests
See the LLM Eval Troubleshooting Guide for more solutions.

Walkthrough Complete!

You've built a production LLM evaluation framework with multiple metrics, LLM-as-judge, and comprehensive result analysis. You're ready for Part 2!

What You've Learned:

  • Type-annotated evaluation framework built on dataclasses
  • Exact match and embedding-based semantic similarity metrics
  • LLM-as-judge implementation
  • Batch evaluation engine
  • Result aggregation and analysis with pandas
  • Exporting results to CSV and JSON