Non-Functional Requirements for AI Systems: What Staff Engineers Should Specify

Ask a team what their new AI feature is supposed to do and you’ll get a detailed answer. Ask how accurate it has to be, how slow it’s allowed to get, or what the user sees when it fails, and you usually get a pause. Those are the non-functional requirements, and for an AI feature they aren’t a footnote you fill in later. They’re most of what decides whether the thing is shippable.

For ordinary software, the functional spec mostly stands on its own, because the behavior is predictable. An AI feature is probabilistic, so “what it does” is only half the story. The other half is how well, how fast, how cheaply, and what happens on a bad day. Here’s the checklist I push teams to fill in before they start building.

Accuracy: pick a number, and know the cost of a miss

“It should be accurate” isn’t a requirement. What success rate makes this worth shipping, and how did you decide? The answer depends entirely on what a wrong answer costs. A bad autocomplete suggestion costs nothing. A wrong update to a financial record costs a lot. So the acceptable error rate falls out of the blast radius, and it should be written down, along with how you plan to measure it. If the spec doesn’t say how accuracy will be measured, the target is decoration.
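One way to make "how you plan to measure it" concrete is to check the target against a labeled evaluation set and account for sample size. The sketch below is only an illustration: the function names, the 0.93 target, and the choice of a Wilson score interval are my assumptions, not a prescribed method.

```python
import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def meets_accuracy_target(successes: int, total: int, target: float) -> bool:
    """True only if even the pessimistic end of the interval clears the target."""
    return wilson_lower_bound(successes, total) >= target

# Hypothetical eval runs against a hypothetical 93% target.
small_run_ok = meets_accuracy_target(95, 100, target=0.93)
large_run_ok = meets_accuracy_target(950, 1000, target=0.93)
```

Both runs score 95%, but only the larger one supports the claim that the feature clears 93%. That is the kind of detail a spec can settle up front by stating the eval set size along with the target.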

Latency: budget for the slow responses, not the average

Set a latency target, and set it at the tail: a p95 or p99, not a mean. An average of 800ms means nothing if one request in a hundred takes fifteen seconds, because that slow one is what the user remembers. And be clear about what “fast enough” means in practice. Streaming the answer as it’s generated changes how fast it feels even when the total time is identical, so a streaming feature may need a time-to-first-token target as well as a total-time one. If the spec says “feels responsive,” it should say how you’re getting there.
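To see why the average hides the problem, here is a small sketch that compares the mean with tail percentiles. The sample data, the 2,000ms budget, and the helper names are made up for illustration, and the percentile uses the simple nearest-rank definition.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def within_latency_budget(samples, p99_budget_ms):
    """Check the tail, not the mean, against the budget."""
    return percentile(samples, 99) <= p99_budget_ms

# Illustrative data: 98 quick responses and 2 very slow ones.
latencies_ms = [650] * 98 + [15_000] * 2

mean_ms = statistics.mean(latencies_ms)   # looks tolerable
p50_ms = percentile(latencies_ms, 50)     # looks great
p99_ms = percentile(latencies_ms, 99)     # what the unlucky user sees
budget_ok = within_latency_budget(latencies_ms, p99_budget_ms=2_000)
```

In this example the mean stays under a second while the p99 is fifteen seconds, so a budget written against the mean would pass a feature that its slowest users would call broken.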

Cost: set a ceiling before the bill surprises you

AI features have a real per-request cost, and for anything agentic that cost climbs with difficulty, because the hard cases take more calls. Decide a target cost per request and a rough model for how it scales with usage. “We’ll keep an eye on the bill” is how you discover at the end of the month that the feature loses money every time someone uses it. The ceiling is also what constrains your model choice and your architecture, so set it early.
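A rough cost model can be a few lines of arithmetic. In the sketch below, the token prices, the token counts, the call counts, and the $0.02 ceiling are all hypothetical placeholders; substitute your provider's actual pricing and your own traffic numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostModel:
    input_price_per_1k: float    # hypothetical USD per 1,000 input tokens
    output_price_per_1k: float   # hypothetical USD per 1,000 output tokens

    def call_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost of a single model call."""
        return (input_tokens / 1000 * self.input_price_per_1k
                + output_tokens / 1000 * self.output_price_per_1k)

def request_cost(model: CostModel, calls) -> float:
    """One user request may fan out into several model calls (agentic loops)."""
    return sum(model.call_cost(i, o) for i, o in calls)

def monthly_cost(cost_per_request: float, requests_per_month: int) -> float:
    return cost_per_request * requests_per_month

pricing = CostModel(input_price_per_1k=0.003, output_price_per_1k=0.015)

easy_request = [(1000, 200)]        # one call
hard_request = [(2000, 400)] * 5    # five calls, each with a longer context

ceiling_usd = 0.02                  # hypothetical per-request ceiling
easy_ok = request_cost(pricing, easy_request) <= ceiling_usd
hard_ok = request_cost(pricing, hard_request) <= ceiling_usd
```

With these made-up numbers the easy request fits under the ceiling and the hard one costs ten times as much, which is why the spec should say what mix of easy and hard requests the budget assumes.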

Fallback: decide what the user sees when it fails

The model will fail sometimes — a timeout, a garbage response, a rate limit. The spec has to say what happens then. A clear “try again,” a non-AI path, a cached result: all fine. “We didn’t think about it,” which in practice means a spinner that never resolves, is not. This is the requirement teams skip most often and regret most reliably.
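As a minimal sketch of what "decide what the user sees" can look like in code, the wrapper below gives the model a deadline and routes every failure to an explicit fallback. The names `answer_with_fallback`, `slow_model`, and `try_again_message` are hypothetical, and a real client library would likely offer its own timeout and retry settings that you should prefer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def answer_with_fallback(call_model, prompt, timeout_s, fallback):
    """Return the model's answer, or the fallback's answer on timeout, error, or empty output."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_model, prompt)
    try:
        text = future.result(timeout=timeout_s)  # raises on timeout or model error
        if not isinstance(text, str) or not text.strip():
            raise ValueError("unusable model response")
        return {"source": "model", "text": text}
    except Exception:
        # Timeouts, rate limits, and garbage output all land on the same defined path.
        return {"source": "fallback", "text": fallback(prompt)}
    finally:
        # Don't block the caller; note the worker thread is abandoned, not killed.
        pool.shutdown(wait=False)

def slow_model(prompt):
    """Stand-in for a model call that takes too long."""
    time.sleep(0.3)
    return "late answer"

def try_again_message(prompt):
    """Stand-in for a non-AI path: a clear message, a cached result, and so on."""
    return "That took too long. Please try again."

result = answer_with_fallback(slow_model, "summarize this", 0.05, try_again_message)
```

Returning the `source` alongside the text lets the UI and the logs distinguish a real answer from a fallback, so you can track how often the fallback fires.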

Observability: log enough to know if a change helped

For a probabilistic system, logging isn’t just for debugging — it’s how you measure quality at all. Specify that every request records its input, output, model and prompt version, latency, cost, and whatever the user did next. Without it, you can’t tell whether last week’s change helped or hurt, and bolting it on after launch means throwing away the data you most needed.
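The fields listed above map naturally onto one structured record per request. This is a sketch with field names I chose for illustration; where the record is written (a log pipeline, a table, a file) is left open, and real inputs may need redaction before they are stored.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class RequestLog:
    input_text: str
    output_text: str
    model_version: str
    prompt_version: str
    latency_ms: float
    cost_usd: float
    user_action: Optional[str] = None  # filled in later: "accepted", "edited", "retried", ...
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    logged_at: float = field(default_factory=time.time)

def to_json_line(record: RequestLog) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)

record = RequestLog(
    input_text="Summarize this ticket",
    output_text="Customer reports a login failure.",
    model_version="model-2024-06",   # hypothetical version labels
    prompt_version="summary-v7",
    latency_ms=812.0,
    cost_usd=0.006,
)
record.user_action = "accepted"      # recorded once the user responds
line = to_json_line(record)
```

Keeping the model and prompt versions on every record is what lets you compare last week's change with the week before, instead of guessing.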

Governance: the requirements that aren’t yours to choose

On regulated or enterprise systems, some of these aren’t your decision, so name them in the spec instead of discovering them in an audit. What data is allowed to leave for a model provider, and where does inference run? Can you reconstruct why the system did what it did, months later? When the AI changes a record, is that change traceable and reversible? Designing these in is cheap. Retrofitting them is not.
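For the "traceable and reversible" question, one common pattern is an append-only audit log that stores the old value with every change, so a revert is just another logged change. The sketch below assumes in-memory dictionaries and invented names (`AuditEntry`, `apply_change`, `revert`); a real system would put this in durable storage with its own access controls.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class AuditEntry:
    record_id: str
    field_name: str
    old_value: Any
    new_value: Any
    actor: str     # e.g. which model and prompt version made the change
    reason: str

def apply_change(records, audit_log, record_id, field_name, new_value, actor, reason):
    """Apply a change and append an entry that remembers the previous value."""
    old_value = records[record_id].get(field_name)
    records[record_id][field_name] = new_value
    entry = AuditEntry(record_id, field_name, old_value, new_value, actor, reason)
    audit_log.append(entry)
    return entry

def revert(records, audit_log, entry, actor):
    """Undo a change by applying its old value as a new, logged change."""
    return apply_change(
        records, audit_log, entry.record_id, entry.field_name,
        entry.old_value, actor, reason=f"revert of change by {entry.actor}",
    )

# Hypothetical data for illustration.
records = {"inv-1": {"amount": 100}}
audit_log = []

change = apply_change(records, audit_log, "inv-1", "amount", 120,
                      actor="model:v3/prompt:v7", reason="extracted from invoice")
amount_after_change = records["inv-1"]["amount"]
revert(records, audit_log, change, actor="reviewer:alice")
amount_after_revert = records["inv-1"]["amount"]
```

Because the revert is appended instead of deleting the original entry, the log still shows what the AI did, who undid it, and in what order.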

The functional spec is the easy half, and it’s the half everyone writes. The non-functional half is what determines whether the feature survives contact with real users. If your AI project doesn’t have these written down, that gap is the biggest risk on the project, and nobody’s managing it.