Skip to content
Back to LLM Evaluation Framework

Single shared `test_cases` table for all suites (DEPRECATED)

✗ DeprecatedLLM Evaluation Framework02 — Test Management System
By AI-DE Engineering Team·Stakeholders: ML engineer, eng manager

Context

Original v0 design (ADR-005, originally accepted) had one shared test_cases table with a suite_id foreign key. Tagging used a many-to-many junction test_case_tags against a global tags table. This was the simplest schema and shipped on day 3 of M02.

It worked fine until the regression suite crossed ~1,200 test cases. Then we hit two problems:

  1. Bulk-tag operations deadlocked under load. Re-tagging "all 800 RAG cases as tier_b_gold" would take a row-level lock on every matching test_case row. While that ran, the CI eval workflow (M04) couldn't append new test cases to the same suite — Postgres would queue the inserts behind the bulk update. P95 CI eval time jumped from 90s to 11min on the day this fired.
  2. Cross-suite tag pollution. A user adding the tag "flaky" to test cases in suite-A also affected suite-B because the tag dictionary was global. The same string "flaky" meant "transiently flaky on retrieval" in suite-A and "human-flagged for re-labeling" in suite-B.

What was originally decided

Single test_cases table with suite_id FK; many-to-many tags via test_case_tags against a global tags table.

# DEPRECATED — see "What we got wrong" below
class TestCase(Base):
    __tablename__ = "test_cases"
    id = Column(Integer, primary_key=True)
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False)
    input = Column(Text)
    expected_output = Column(Text)
    # ...
    tags = relationship("Tag", secondary=test_case_tags, lazy="selectin")

What we reversed to

Per-suite tables (logical partition) plus per-suite tag dictionaries:

class TestCase(Base):
    __tablename__ = "test_cases"
    id = Column(Integer, primary_key=True)
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
    # ... unchanged

class TestCaseTag(Base):
    """Per-suite tag dictionary — same string in suite-A vs suite-B can mean different things."""
    __tablename__ = "test_case_tags"
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
    tag = Column(String, nullable=False)
    test_case_id = Column(Integer, ForeignKey("test_cases.id"), nullable=False)
    __table_args__ = (
        UniqueConstraint("suite_id", "tag", "test_case_id"),
        Index("ix_tag_lookup", "suite_id", "tag"),  # lookups always scoped to suite
    )

Bulk operations now scope to a suite:

-- Before (held lock on the whole table):
UPDATE test_cases SET status = 'archived' WHERE id IN (...);

-- After (lock scoped to one suite):
UPDATE test_cases SET status = 'archived' WHERE suite_id = $1 AND id IN (...);
-- + the index ix_test_cases_suite_id makes this a quick range scan

Why reversed

Two concrete incidents drove the reversal:

  1. 2026-04-22 incident: Bulk re-tag on the regression suite (1,400 cases) took a 9-minute row-lock that blocked CI for 11 minutes. PR #312 backed up. Logged in runbooks/incident-2026-04-22-tag-deadlock.md.
  2. 2026-04-24 user-error: A user re-purposed the tag "flaky" in suite-A; the dashboard's suite-B tag-cloud silently changed because tags were global. Caught in code review of an unrelated PR.

The first incident was the gating one. The second was the credibility one — global state surprises future maintainers.

What we got wrong (and what we'd do again)

Got wrong: treated the eval test corpus like a relational catalog (one canonical entity, M2M relationships). It's actually closer to a multi-tenant data store — each test suite has its own evolution timeline and shouldn't be coupled to others' performance characteristics.

Would do again: single SQLAlchemy session per request, Pydantic schemas at the API boundary, integer surrogate keys. Those decisions still hold; only the multi-suite topology changed.

Reversal cost

  • Schema migration: 1 day (Alembic)
  • Code changes: ~2 days (junction table refactor, bulk ops scoping, tag-cloud dashboard rewrite)
  • Backfill: 4 hours (Python script that read the old test_case_tags and copy-on-write to the new test_case_tags)
  • Total: ~3.5 engineer-days

Lessons

  • "Global state is fine until it isn't" is true; the question is at what scale "isn't" arrives. For us, ~1,200 test cases. For yours, run the math up front.
  • Lock scope is a feature of the schema, not of the query. Adding suite_id to the WHERE clause didn't help when the schema didn't have an index supporting it.
  • Tag dictionaries that span tenants/suites are a class-of-bug, not just a schema decision. The naming collisions are quiet until they aren't.
  • Document deprecated decisions. The next engineer who proposes "let's just use one global tag table" needs to find this ADR.

References

  • api/models/database.py (current schema, post-reversal)
  • runbooks/incident-2026-04-22-tag-deadlock.md (the lock incident)
  • migrations/0007_per_suite_tags.py (Alembic migration)
  • ADR-001 (the gold set's tag-routing logic depends on per-suite tags)
Built into the project

This decision shipped as part of LLM Evaluation Framework — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open