Evaluation-Gated Releases for LLM Systems

LLM systems fail differently from normal software. A change can improve five cases and silently break three, and nothing throws an error. The only defense is a gate: no change ships unless it clears a measured bar.

Freeze a benchmark

Build a fixed set of representative questions with known good answers and expected sources. Freeze it. The moment your benchmark drifts with every change, it stops being a baseline you can trust.

Freeze the judge too

If you use a model to score outputs, pin the judge model and the exact judging prompt. A moving judge makes every comparison meaningless because you cannot tell whether the system changed or the grader did.

Know your noise floor

Run the same config twice and measure the variance. If two identical runs differ by a point, a one-point improvement is noise, not signal. Define the gate above the noise floor.

Set tiers before you look at results

Decide the thresholds in advance: ship above X, hold below Y, re-measure in between. Deciding the bar after seeing the numbers is how regressions talk their way into production.

# Decide BEFORE running
SHIP    >= 76% recall
HOLD    <  74% recall
RE-RUN   74-76%   (within noise, measure again)

A regression is a reason to stop

When a change misses the gate, the answer is not to lower the gate. It is to understand why, fix it, or shelve the change. The discipline is the whole point.