There is a specific kind of optimism that kills LLM features. The prompt works on the third try, the demo lands in the standup, and someone says ship it. A week later it is quietly inventing account numbers in production and nobody can say when it started — because nobody was measuring.
A prompt is not a spec
Traditional code fails loudly: a type error, a thrown exception, a red test. A language model fails softly. It returns something fluent and plausible that happens to be wrong, and it does so non-deterministically, so the same input can pass on Monday and fail on Thursday. A demo proves the feature can work once. It says nothing about how often it works, or what it does at the edges. Treating a successful demo as a guarantee is how teams ship rumors instead of features.
So we invert the order. Before we write the feature, we write the evaluation. We assemble a graded set of real inputs with known-good outputs, define a rubric a second model — or a human — can apply consistently, and only then start iterating on the prompt and the retrieval pipeline. The eval is the spec. If we cannot describe what good looks like precisely enough to grade it, we do not yet understand the feature well enough to build it.
Build the set from reality
Good eval sets come from real traffic, not imagination. We sample production logs for the cases that actually occur, then deliberately salt in the adversarial ones: the ambiguous question, the prompt injection, the input in the wrong language, the request the model should refuse. Each example is labeled once, carefully, and becomes a fixed point. The full suite runs offline on every prompt and model change, and a thinner version runs online against live traffic so we catch the drift the golden set never anticipated.
A feature without an eval is a rumor. You are asserting it works; you just have no way to know.— Protocore · AI engineering
Once the suite exists, it becomes a gate. Every change to a prompt, a model version, or a retrieval index runs the full evaluation in CI, and we hold a regression budget the way other teams hold an error budget: a change that improves one metric while quietly degrading another does not merge. Swapping to a newer, cheaper model stops being a leap of faith and becomes a diff you can read — these ten cases got better, these two got worse, here is the trade-off.
It is slower at the start and far faster afterward. On a document-intelligence system we operate, the eval harness is what let us reach 92% straight-through processing and, more importantly, hold it there across three model upgrades. Features feel less magical to build this way. They also stop surprising us in production, which is the entire point.
Have a system to build?
Tell us the problem. We'll come back with an architecture and a plan.
Start a project