Yash Raj Pandey - Writing

Evaluation-Gated Releases for LLM Systems

Fri, 03 Jul 2026 00:00:00 GMT

LLM systems fail differently from normal software. A change can improve five cases and silently break three, and nothing throws an error. The only defense is a gate: no change ships unless it clears a measured bar.

Freeze a benchmark

Build a fixed set of representative questions with known good answers and expected sources. Freeze it. The moment your benchmark drifts with every change, it stops being a baseline you can trust.
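One way to make "frozen" checkable is to fingerprint the benchmark and fail loudly when the fingerprint changes. This is a minimal sketch, not a prescribed method: the case fields (`question`, `answer`, `sources`) and the helper names are assumptions for illustration.

```python
import hashlib
import json


def benchmark_fingerprint(cases: list[dict]) -> str:
    """Return a SHA-256 digest of the benchmark contents.

    Keys are sorted so that dict key order does not change the digest.
    The order of the cases themselves does matter.
    """
    canonical = json.dumps(
        cases, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(cases: list[dict], expected_digest: str) -> None:
    """Raise if the benchmark no longer matches the recorded digest."""
    actual = benchmark_fingerprint(cases)
    if actual != expected_digest:
        raise ValueError(
            f"benchmark drifted: expected {expected_digest[:12]}, got {actual[:12]}"
        )
```

Record the digest once, commit it next to the benchmark, and call `assert_frozen` at the start of every eval run.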

Freeze the judge too

If you use a model to score outputs, pin the judge model and the exact judging prompt. A moving judge makes every comparison meaningless because you cannot tell whether the system changed or the grader did.
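A sketch of what pinning can look like in code: treat the judge as an immutable config, stamp every run with its fingerprint, and refuse to compare runs whose fingerprints differ. The class, the field names, and the `judge_fingerprint` key are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    model: str                 # exact model identifier, including any version tag
    prompt: str                # the full judging prompt, verbatim
    temperature: float = 0.0

    def fingerprint(self) -> str:
        """Short digest that changes if the model, prompt, or temperature changes."""
        payload = f"{self.model}\n{self.temperature}\n{self.prompt}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs may be compared only if the same judge graded both."""
    return run_a["judge_fingerprint"] == run_b["judge_fingerprint"]
```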

Know your noise floor

Run the same config twice and measure the variance. If two identical runs differ by a point, a one-point improvement is noise, not signal. Define the gate so that the improvement it demands is larger than the noise floor.
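As a rough sketch, the noise floor can be taken as the spread between repeated runs of one config. Max minus min is a deliberately crude measure, and the function names here are mine; with more runs, a standard deviation would be a reasonable substitute.

```python
import statistics


def noise_floor(scores: list[float]) -> float:
    """Spread (max minus min) across repeated runs of the same config."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate noise")
    return max(scores) - min(scores)


def is_signal(baseline_runs: list[float], candidate_runs: list[float]) -> bool:
    """True only if the mean improvement exceeds the larger of the two spreads."""
    floor = max(noise_floor(baseline_runs), noise_floor(candidate_runs))
    delta = statistics.mean(candidate_runs) - statistics.mean(baseline_runs)
    return delta > floor
```

With baseline runs of 75.0 and 76.0, a candidate at 76.0 and 77.0 is one point better on average, which does not clear a one-point floor.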

Set tiers before you look at results

Decide the thresholds in advance: ship above X, hold below Y, re-measure in between. Deciding the bar after seeing the numbers is how regressions talk their way into production.

# Decide BEFORE running
SHIP    >= 76% recall
HOLD    <  74% recall
RE-RUN  74% to < 76% recall   (within noise, measure again)
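The tiers can be encoded so that the thresholds live in code and are committed before any results exist. A minimal sketch using the numbers above; the boundary handling (exactly 76% ships, exactly 74% re-runs) is my reading of the tiers.

```python
from enum import Enum


class Verdict(Enum):
    SHIP = "ship"
    HOLD = "hold"
    RERUN = "re-run"


# Fixed in advance and committed before the eval runs.
SHIP_AT = 76.0      # percent recall
HOLD_BELOW = 74.0   # percent recall


def gate(recall_pct: float) -> Verdict:
    """Map a measured recall percentage to a release decision."""
    if recall_pct >= SHIP_AT:
        return Verdict.SHIP
    if recall_pct < HOLD_BELOW:
        return Verdict.HOLD
    # Between the thresholds the result is within noise: measure again.
    return Verdict.RERUN
```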

A regression is a reason to stop

When a change misses the gate, the answer is not to lower the gate. It is to understand why, fix it, or shelve the change. The discipline is the whole point.

RAG That Holds Up in Production

Fri, 03 Jul 2026 00:00:00 GMT

Most RAG demos look great and most RAG systems quietly disappoint, because the demo never stressed retrieval. The model is rarely the bottleneck. Retrieval and chunking are.

Garbage chunks, garbage answers

Retrieval quality is capped by chunk quality. Documents that parse badly (watermarked PDFs, image-only pages, broken tables) produce chunks the retriever cannot use. Fix ingestion before you tune anything downstream.

Hybrid retrieval beats pure vector

Dense embeddings miss exact terms; lexical search misses paraphrase. Combining dense and sparse (lexical) retrieval catches both. A good embedding model plus hybrid search is a stronger default than either alone.
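One common way to combine the two result lists is reciprocal rank fusion (RRF). It is my choice for this sketch, not something the approach above depends on. It only needs each retriever's ranked document IDs, so the dense and lexical scores never have to share a scale. The constant `k=60` is the value commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed. Ties are broken by document ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]     # from the embedding index
lexical_hits = ["c", "a", "d"]   # from the lexical (sparse) index
fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
```

Documents that both retrievers agree on ("a" and "c" here) rise above documents that only one of them found.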

Rerank, but watch the dilution

A reranker over a candidate set sharpens results, but feeding it too many low-quality candidates can dilute the good ones and add latency. Tune the candidate ceiling deliberately rather than maximizing it.
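A sketch of the candidate ceiling as an explicit parameter. The `score` callable stands in for a real reranker, and `overlap_score` is a toy scorer so the example runs on its own. Both are placeholders, not any particular library's API.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    ceiling: int,
    top_n: int,
) -> list[str]:
    """Rerank at most `ceiling` first-stage candidates and keep the best `top_n`.

    `candidates` is assumed to arrive in first-stage order, best first,
    so the ceiling drops the weakest candidates before the reranker sees them.
    """
    shortlist = candidates[:ceiling]
    ranked = sorted(shortlist, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: number of words shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

Sweep `ceiling` against the frozen benchmark and keep the smallest value past which recall stops improving.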

Cite or it did not happen

In production RAG, an answer without traceable sources is a liability. Return the supporting passages alongside the answer so a human can verify, and so you can debug what the model actually retrieved.
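One way to enforce this is at the type level: make the sources part of the answer object and refuse to build an answer without them. The class and field names below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    doc_id: str    # identifier of the originating document
    passage: str   # the exact text passed to the model


@dataclass(frozen=True)
class CitedAnswer:
    text: str
    sources: tuple[Source, ...]


def build_answer(text: str, retrieved: list[Source]) -> CitedAnswer:
    """Package an answer with its supporting passages, or fail."""
    if not retrieved:
        raise ValueError("refusing to return an answer with no supporting passages")
    return CitedAnswer(text=text, sources=tuple(retrieved))
```

Logging the `CitedAnswer` as a whole also gives you the retrieval trace you need when an answer turns out to be wrong.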

Build the eval before you optimize

You cannot improve what you cannot measure. A fixed benchmark of questions with expected sources, scored on recall and answer accuracy, turns "I think this is better" into "this is 3 points better or it is not shipping."
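Source recall is the easier of the two metrics to compute: for each question, what fraction of the expected sources did retrieval return? A minimal sketch, assuming each case carries an `expected` list and a `retrieved` list of document IDs (field names are mine).

```python
def source_recall(expected: set[str], retrieved: list[str]) -> float:
    """Fraction of the expected sources that appear in the retrieved list."""
    if not expected:
        raise ValueError("each benchmark case needs at least one expected source")
    return len(expected & set(retrieved)) / len(expected)


def benchmark_recall(cases: list[dict]) -> float:
    """Mean source recall across all cases, as a percentage."""
    if not cases:
        raise ValueError("benchmark is empty")
    total = sum(source_recall(set(c["expected"]), c["retrieved"]) for c in cases)
    return 100.0 * total / len(cases)
```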

Self-Hosting Open-Weight LLMs

Fri, 03 Jul 2026 00:00:00 GMT

There is a whole class of work where you cannot send the data to a cloud API: confidential records, regulated environments, anything air-gapped. The good news is that open-weight models have gotten good enough that you do not have to. Here is how I think about running them locally.

Pick the model to fit the hardware, not the other way around

Start from the memory you actually have. A quantized model that fits comfortably in unified memory and runs fast beats a larger one that swaps and crawls. Quantization to roughly 5 or 6 bits per weight (Q5/Q6) usually costs little accuracy for a large memory win.
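The back-of-the-envelope arithmetic is parameters times bits per weight, divided by eight. The sketch below covers the weights only. Real quantized files carry somewhat more than the nominal bits per weight, and the KV cache grows with context length, so the `headroom` fraction is an assumption to tune, not a rule.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(
    params_billions: float,
    bits_per_weight: float,
    memory_gib: float,
    headroom: float = 0.25,
) -> bool:
    """True if the weights leave `headroom` of memory free for KV cache and the OS."""
    needed = weight_memory_gib(params_billions, bits_per_weight)
    return needed <= memory_gib * (1 - headroom)
```

By this estimate an 8B model needs about 14.9 GiB at 16 bits and about 5.6 GiB at 6 bits, which is the difference between not fitting and fitting comfortably on a 16 GiB machine.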

Choose a serving layer on purpose

Ollama is the fastest path to a working local endpoint and great for development. vLLM gives you higher throughput and better batching when you need to serve real concurrent load. They solve different problems; do not default to one out of habit.

Watch the context window: it is where performance goes to die

A model spilling to CPU because the default context window is too large will feel broken even on strong hardware. Set the context length deliberately to what the task needs, and enable flash attention where supported.

# Keep context lean so inference stays on the accelerator
OLLAMA_CONTEXT_LENGTH=4096
OLLAMA_FLASH_ATTENTION=1
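The context length can also be set per request rather than server-wide. A sketch against Ollama's HTTP API using only the standard library, assuming the server is on its default local port. The model name is whatever you have pulled locally, and the helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local address


def build_request(model: str, prompt: str, num_ctx: int = 4096) -> dict:
    """Build a non-streaming generate request with an explicit context length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 4096, timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```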

Keep the application layer hardware-agnostic

Treat the model and the inference backend as swappable. If your app talks to a clean internal interface rather than a specific runtime, you can move from one machine or model to a better one without rewriting everything above it.
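In Python, a small `Protocol` is one way to draw that line. The application depends only on the interface; each runtime gets a thin adapter behind it. The names here (`TextGenerator`, `EchoBackend`, `answer_question`) are hypothetical, and `EchoBackend` exists only so the sketch runs without a model.

```python
from typing import Protocol


class TextGenerator(Protocol):
    """The only surface the application is allowed to depend on."""

    def generate(self, prompt: str, max_tokens: int) -> str: ...


class EchoBackend:
    """Stand-in backend for tests: returns the start of the prompt."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt[:max_tokens]


def answer_question(llm: TextGenerator, question: str) -> str:
    """Application code: knows nothing about which runtime sits behind `llm`."""
    prompt = f"Answer concisely.\n\nQuestion: {question}"
    return llm.generate(prompt, max_tokens=256)
```

Swapping Ollama for vLLM, or one machine for another, then means writing a new adapter with the same `generate` method and changing nothing above it.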

Measure before you trust

Local does not mean unverified. Build a small benchmark of real questions with known good answers and run it whenever you change the model, the quantization, or the serving config. Vibes are not a release gate.