Building Reliable LLM Features: What Production Actually Demands

A working demo takes an afternoon. You point a prompt at an endpoint, paste in a few examples, and it answers. It feels finished. Then real people start using it, with messy inputs and actual stakes, and you find out the demo was the easy part.

I work on systems where reliability isn’t optional, so I’ve had to be deliberate about the layer around the model rather than the model itself. None of what follows is clever. It’s just the set of habits that have kept these features trustworthy, and the ones I wish I’d had in place earlier.

Treat the model like a service that can lie to you

Every other dependency in your system has a contract. A database returns rows or an error. An API returns a status code. A model returns text that reads well whether or not it’s correct, which is the trickiest kind of output to handle, because being wrong looks a lot like being right.

So I wrap a model call the way I’d wrap any flaky external service, and I don’t trust the raw response. Ask for structured output, then check that it actually has the shape you expected before anything downstream sees it.

def extract_fields(text: str) -> Fields | None:
    for attempt in range(2):
        raw = call_model(prompt(text), timeout=10)
        try:
            return Fields.model_validate_json(raw)
        except ValidationError:
            log.warning("invalid model output", attempt=attempt, raw=raw)
    return None  # the caller decides what a miss means

That last line matters more than it looks. The model will miss sometimes. What your system does on a miss is what decides whether people can rely on it.

Decide how you’ll measure it before you tune it

The temptation is to keep editing the wording until the output looks good on the two or three examples you happen to be staring at. I’ve done it. It isn’t measurement, it’s hoping.

Before touching the prompt, I put together a small set of real inputs with known-good answers. Thirty to fifty is enough to be useful, and you usually have an afternoon to build it. Where you can, score with plain assertions: a field is present, a number is in range, a phrase must or must not appear. For the fuzzy cases, like whether a summary is faithful to its source, you can have a model grade the output, but check the grader against your own judgment first so you actually trust it.

Once you have that set, every prompt change becomes a number that moves instead of a vibe. You’ll catch the case where fixing one example quietly breaks four others, which is exactly the failure that ships when you’re eyeballing.

Keep your prompts in version control

A prompt sitting in a string literal that somebody edited during a hotfix is an incident waiting to happen. Treat prompts like the code they are. Keep them in source control, tag each output with the version that produced it so a regression is traceable, and change one thing at a time. If you adjust the prompt and switch models and bump the temperature all at once, you’ve learned nothing when the quality moves.

Watch what it costs and how long it takes

Two model calls chained together feel fine in a demo. Five of them in an agent loop, each waiting on the last, is a twenty-second response and a bill that climbs with every success. A few things that have saved me real money and time: use the smallest model that still passes your eval set instead of reaching for the biggest one out of caution; cache aggressively, because real traffic repeats itself more than you’d expect; and stream the response when a person is waiting, because text appearing as it’s generated feels far faster even when the total time is the same.

Plan for the answer being wrong

Reliability isn’t a high score on your eval set. It’s what the user experiences on the day the model is confidently wrong, and that day comes. So decide that experience on purpose. A clear “try again.” A way to fall back to a non-AI path. An easy correction when the answer is close but off. A person checking anything you can’t undo.

The features I’ve seen people actually trust weren’t the ones with the cleverest prompts. They were the ones built by someone who assumed the model would fail and made sure that failure was survivable. That’s most of the work, and it’s the part the demo never shows you.