Context
Original v0 design (ADR-005, originally accepted) had one shared test_cases table with a suite_id foreign key. Tagging used a many-to-many junction test_case_tags against a global tags table. This was the simplest schema and shipped on day 3 of M02.
It worked fine until the regression suite crossed ~1,200 test cases. Then we hit two problems:
- Bulk-tag operations deadlocked under load. Re-tagging "all 800 RAG cases as
tier_b_gold" would take a row-level lock on every matchingtest_caserow. While that ran, the CI eval workflow (M04) couldn't append new test cases to the same suite — Postgres would queue the inserts behind the bulk update. P95 CI eval time jumped from 90s to 11min on the day this fired. - Cross-suite tag pollution. A user adding the tag
"flaky"to test cases in suite-A also affected suite-B because the tag dictionary was global. The same string "flaky" meant "transiently flaky on retrieval" in suite-A and "human-flagged for re-labeling" in suite-B.
What was originally decided
Single test_cases table with suite_id FK; many-to-many tags via test_case_tags against a global tags table.
# DEPRECATED — see "What we got wrong" below
class TestCase(Base):
__tablename__ = "test_cases"
id = Column(Integer, primary_key=True)
suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False)
input = Column(Text)
expected_output = Column(Text)
# ...
tags = relationship("Tag", secondary=test_case_tags, lazy="selectin")
What we reversed to
Per-suite tables (logical partition) plus per-suite tag dictionaries:
class TestCase(Base):
__tablename__ = "test_cases"
id = Column(Integer, primary_key=True)
suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
# ... unchanged
class TestCaseTag(Base):
"""Per-suite tag dictionary — same string in suite-A vs suite-B can mean different things."""
__tablename__ = "test_case_tags"
suite_id = Column(Integer, ForeignKey("test_suites.id"), nullable=False, index=True)
tag = Column(String, nullable=False)
test_case_id = Column(Integer, ForeignKey("test_cases.id"), nullable=False)
__table_args__ = (
UniqueConstraint("suite_id", "tag", "test_case_id"),
Index("ix_tag_lookup", "suite_id", "tag"), # lookups always scoped to suite
)
Bulk operations now scope to a suite:
-- Before (held lock on the whole table):
UPDATE test_cases SET status = 'archived' WHERE id IN (...);
-- After (lock scoped to one suite):
UPDATE test_cases SET status = 'archived' WHERE suite_id = $1 AND id IN (...);
-- + the index ix_test_cases_suite_id makes this a quick range scan
Why reversed
Two concrete incidents drove the reversal:
- 2026-04-22 incident: Bulk re-tag on the regression suite (1,400 cases) took a 9-minute row-lock that blocked CI for 11 minutes. PR #312 backed up. Logged in
runbooks/incident-2026-04-22-tag-deadlock.md. - 2026-04-24 user-error: A user re-purposed the tag
"flaky"in suite-A; the dashboard's suite-B tag-cloud silently changed because tags were global. Caught in code review of an unrelated PR.
The first incident was the gating one. The second was the credibility one — global state surprises future maintainers.
What we got wrong (and what we'd do again)
Got wrong: treated the eval test corpus like a relational catalog (one canonical entity, M2M relationships). It's actually closer to a multi-tenant data store — each test suite has its own evolution timeline and shouldn't be coupled to others' performance characteristics.
Would do again: single SQLAlchemy session per request, Pydantic schemas at the API boundary, integer surrogate keys. Those decisions still hold; only the multi-suite topology changed.
Reversal cost
- Schema migration: 1 day (Alembic)
- Code changes: ~2 days (junction table refactor, bulk ops scoping, tag-cloud dashboard rewrite)
- Backfill: 4 hours (Python script that read the old
test_case_tagsand copy-on-write to the newtest_case_tags) - Total: ~3.5 engineer-days
Lessons
- "Global state is fine until it isn't" is true; the question is at what scale "isn't" arrives. For us, ~1,200 test cases. For yours, run the math up front.
- Lock scope is a feature of the schema, not of the query. Adding
suite_idto the WHERE clause didn't help when the schema didn't have an index supporting it. - Tag dictionaries that span tenants/suites are a class-of-bug, not just a schema decision. The naming collisions are quiet until they aren't.
- Document deprecated decisions. The next engineer who proposes "let's just use one global tag table" needs to find this ADR.
References
api/models/database.py(current schema, post-reversal)runbooks/incident-2026-04-22-tag-deadlock.md(the lock incident)migrations/0007_per_suite_tags.py(Alembic migration)- ADR-001 (the gold set's tag-routing logic depends on per-suite tags)