ADR-005: Single shared `test_cases` table for all suites (DEPRECATED) | LLM Evaluation Framework

Context

Original v0 design (ADR-005, originally accepted) had one shared test_cases table with a suite_id foreign key. Tagging used a many-to-many junction test_case_tags against a global tags table. This was the simplest schema and shipped on day 3 of M02.

It worked fine until the regression suite crossed ~1,200 test cases. Then we hit two problems:

Bulk-tag operations deadlocked under load. Re-tagging "all 800 RAG cases as tier_b_gold" would take a row-level lock on every matching test_case row. While that ran, the CI eval workflow (M04) couldn't append new test cases to the same suite — Postgres would queue the inserts behind the bulk update. P95 CI eval time jumped from 90s to 11min on the day this fired.
Cross-suite tag pollution. A user adding the tag "flaky" to test cases in suite-A also affected suite-B because the tag dictionary was global. The same string "flaky" meant "transiently flaky on retrieval" in suite-A and "human-flagged for re-labeling" in suite-B.

What was originally decided

Single test_cases table with suite_id FK; many-to-many tags via test_case_tags against a global tags table.

# DEPRECATED — see "What we got wrong" below
class TestCase(Base):
    __tablename__ = "test_cases"
    id = Column(Integer, primary_key=True)
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False)
    input = Column(Text)
    expected_output = Column(Text)
    # ...
    tags = relationship("Tag", secondary=test_case_tags, lazy="selectin")

What we reversed to

Per-suite tables (logical partition) plus per-suite tag dictionaries:

class TestCase(Base):
    __tablename__ = "test_cases"
    id = Column(Integer, primary_key=True)
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
    # ... unchanged

class TestCaseTag(Base):
    """Per-suite tag dictionary — same string in suite-A vs suite-B can mean different things."""
    __tablename__ = "test_case_tags"
    suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
    tag = Column(String, nullable=False)
    test_case_id = Column(Integer, ForeignKey("test_cases.id"), nullable=False)
    __table_args__ = (
        UniqueConstraint("suite_id", "tag", "test_case_id"),
        Index("ix_tag_lookup", "suite_id", "tag"),  # lookups always scoped to suite
    )

Bulk operations now scope to a suite:

-- Before (held lock on the whole table):
UPDATE test_cases SET status = 'archived' WHERE id IN (...);

-- After (lock scoped to one suite):
UPDATE test_cases SET status = 'archived' WHERE suite_id = $1 AND id IN (...);
-- + the index ix_test_cases_suite_id makes this a quick range scan

Why reversed

Two concrete incidents drove the reversal:

2026-04-22 incident: Bulk re-tag on the regression suite (1,400 cases) took a 9-minute row-lock that blocked CI for 11 minutes. PR #312 backed up. Logged in runbooks/incident-2026-04-22-tag-deadlock.md.
2026-04-24 user-error: A user re-purposed the tag "flaky" in suite-A; the dashboard's suite-B tag-cloud silently changed because tags were global. Caught in code review of an unrelated PR.

The first incident was the gating one. The second was the credibility one — global state surprises future maintainers.

What we got wrong (and what we'd do again)

Got wrong: treated the eval test corpus like a relational catalog (one canonical entity, M2M relationships). It's actually closer to a multi-tenant data store — each test suite has its own evolution timeline and shouldn't be coupled to others' performance characteristics.

Would do again: single SQLAlchemy session per request, Pydantic schemas at the API boundary, integer surrogate keys. Those decisions still hold; only the multi-suite topology changed.

Reversal cost

Schema migration: 1 day (Alembic)
Code changes: ~2 days (junction table refactor, bulk ops scoping, tag-cloud dashboard rewrite)
Backfill: 4 hours (Python script that read the old test_case_tags and copy-on-write to the new test_case_tags)
Total: ~3.5 engineer-days

Lessons

"Global state is fine until it isn't" is true; the question is at what scale "isn't" arrives. For us, ~1,200 test cases. For yours, run the math up front.
Lock scope is a feature of the schema, not of the query. Adding suite_id to the WHERE clause didn't help when the schema didn't have an index supporting it.
Tag dictionaries that span tenants/suites are a class-of-bug, not just a schema decision. The naming collisions are quiet until they aren't.
Document deprecated decisions. The next engineer who proposes "let's just use one global tag table" needs to find this ADR.

References

api/models/database.py (current schema, post-reversal)
runbooks/incident-2026-04-22-tag-deadlock.md (the lock incident)
migrations/0007_per_suite_tags.py (Alembic migration)
ADR-001 (the gold set's tag-routing logic depends on per-suite tags)