Evaluation-Gated Releases for LLM Systems
LLM systems fail differently from conventional software. A change can improve five cases and silently break three, and nothing throws an error. The most reliable defense is a gate: no change ships unless it clears a measured bar.
Freeze a benchmark
Build a fixed set of representative questions with known good answers and expected sources. Freeze it. The moment your benchmark drifts with every change, it stops being a baseline you can trust.
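One way to make "frozen" checkable is to fingerprint the benchmark and refuse to run an evaluation if the fingerprint has changed. The sketch below is an illustration only: the case fields ("question", "answer", "sources") and the function names are assumptions, not part of any particular tool.

```python
import hashlib
import json


def benchmark_fingerprint(cases):
    """Return a SHA-256 hex digest of the benchmark in canonical JSON form.

    Keys are sorted so that dict key order does not affect the digest.
    The order of cases in the list does affect it.
    """
    canonical = json.dumps(cases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(cases, expected_fingerprint):
    """Raise ValueError if the benchmark no longer matches the recorded digest."""
    actual = benchmark_fingerprint(cases)
    if actual != expected_fingerprint:
        raise ValueError(
            f"benchmark changed: expected {expected_fingerprint}, got {actual}"
        )


# Hypothetical example case; a real benchmark would hold many of these.
cases = [
    {
        "question": "What is the refund window?",
        "answer": "30 days",
        "sources": ["policy.md"],
    }
]
recorded = benchmark_fingerprint(cases)  # store this alongside the baseline
assert_frozen(cases, recorded)           # passes while the benchmark is unchanged
```

Storing the digest next to each baseline score means any later edit to the benchmark fails loudly instead of quietly shifting the numbers.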
Freeze the judge too
If you use a model to score outputs, pin the judge model and the exact judging prompt. A moving judge makes every comparison meaningless because you cannot tell whether the system changed or the grader did.
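A pinned judge can be represented as an immutable config whose fingerprint is recorded with every score, so two runs are compared only when the same judge graded both. This is a minimal sketch; the class, its fields, and the model identifier are hypothetical placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class JudgeConfig:
    model: str        # exact, versioned judge model identifier
    prompt: str       # the full judging prompt, verbatim
    temperature: float = 0.0

    def fingerprint(self) -> str:
        """Short digest covering everything that defines the judge."""
        payload = f"{self.model}\n{self.temperature}\n{self.prompt}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if the same judge scored both."""
    return run_a["judge"] == run_b["judge"]


# Hypothetical model name and prompt, for illustration only.
judge = JudgeConfig(
    model="example-judge-v1",
    prompt="Score the answer from 0 to 1 against the reference answer.",
)
baseline = {"recall": 75.2, "judge": judge.fingerprint()}
candidate = {"recall": 76.4, "judge": judge.fingerprint()}
assert comparable(baseline, candidate)
```

If the prompt or model changes, the fingerprint changes with it, and the old baseline has to be re-scored before any comparison is made.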
Know your noise floor
Run the same config at least twice and measure the variance. If two identical runs differ by a point, a one-point improvement is noise, not signal. Define the gate so that the improvement it requires is larger than the noise floor.
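The noise check can be sketched in a few lines: score the same config repeatedly, summarize the spread, and treat a gain as real only if it exceeds that spread. The function names and the sample scores are made up for illustration, and using the max-min spread as the floor is one simple choice among several.

```python
import statistics


def noise_floor(scores):
    """Summarize run-to-run variation for repeated runs of one config."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate noise")
    return {
        "spread": max(scores) - min(scores),  # worst-case gap between runs
        "stdev": statistics.stdev(scores),    # sample standard deviation
    }


def exceeds_noise(baseline, candidate, floor):
    """True only if the candidate beats the baseline by more than the floor."""
    return (candidate - baseline) > floor


# Hypothetical recall scores (percent) from identical runs.
repeat_runs = [75.0, 76.0, 75.5]
floor = noise_floor(repeat_runs)["spread"]

exceeds_noise(75.0, 76.0, floor)  # a one-point gain does not clear a one-point floor
exceeds_noise(75.0, 77.5, floor)  # a 2.5-point gain does
```

More repeat runs give a steadier estimate; two runs show that noise exists but say little about how large it typically is.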
Set tiers before you look at results
Decide the thresholds in advance: ship above X, hold below Y, re-measure in between. Deciding the bar after seeing the numbers is how regressions talk their way into production.
# Decide BEFORE running
SHIP >= 76% recall
HOLD < 74% recall
RE-RUN 74% to <76% recall (within noise, measure again)
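The tiers above could be encoded as constants that are committed before the evaluation runs, with a small function that maps a score to a decision. This is a sketch under that assumption: the constant and function names are illustrative, and a score of exactly 76 is treated as SHIP to match the ">= 76%" rule.

```python
# Thresholds fixed in advance (percent recall), taken from the tiers above.
SHIP_AT = 76.0
HOLD_BELOW = 74.0


def gate(recall_pct: float) -> str:
    """Map a measured recall to a release decision."""
    if recall_pct >= SHIP_AT:
        return "SHIP"
    if recall_pct < HOLD_BELOW:
        return "HOLD"
    return "RE-RUN"  # inside the noise band: measure again before deciding


gate(76.0)  # "SHIP"
gate(75.1)  # "RE-RUN"
gate(73.9)  # "HOLD"
```

Keeping the thresholds in version control makes any later change to the bar a visible, reviewable diff rather than a quiet adjustment.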
A regression is a reason to stop
When a change misses the gate, the answer is not to lower the gate. It is to understand why, fix it, or shelve the change. The discipline is the whole point.