How I Evaluate LLM Output Without a Ground-Truth Dataset

When you ship an AI feature, the honest starting point is usually this: you have a prompt, a feeling that it works, and no data telling you how often it’s actually right. No answer key. And someone still has to decide whether it’s good enough to turn on.

Waiting for a perfectly labeled dataset is how a feature sits in limbo for months. You don’t need one to start. Here’s the progression I use to get from “seems fine” to an actual number.

Start with thirty examples you label yourself

The single most useful thing you can do is sit down for an afternoon and build a small set by hand. Collect thirty to fifty realistic inputs — pulled from real usage if you have it, made up if you don’t — and write down what a good answer looks like for each.

This feels too small to matter. It isn’t. Thirty examples will surface your most common failure modes right away, and they give you the one thing a gut feeling never will: a number that moves when you change something. “Eight of thirty failed” is far more useful than “it seems a bit off sometimes.” Keep the set in version control next to the code. It’s a test suite, not a one-off.
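Here is a minimal sketch of what treating the set as a test suite can look like. Everything named here is my own illustration, not a fixed recipe: `Example`, `run_eval`, and the toy `fake_generate` stand in for your real data and your real model call, and the substring check is just one possible pass condition.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    input: str
    expected: str  # what a good answer looks like, written by hand


def run_eval(
    examples: list[Example],
    generate: Callable[[str], str],
    passes: Callable[[str, Example], bool],
) -> list[Example]:
    """Run every example through the feature and return the ones that failed."""
    failures = []
    for ex in examples:
        output = generate(ex.input)
        if not passes(output, ex):
            failures.append(ex)
    return failures


# Toy stand-ins so the sketch runs end to end (hypothetical data).
EXAMPLES = [
    Example("2 + 2", "4"),
    Example("capital of France", "Paris"),
    Example("3 * 3", "9"),
]


def fake_generate(prompt: str) -> str:
    """Placeholder for the real model call."""
    canned = {
        "2 + 2": "The answer is 4.",
        "capital of France": "It is Paris.",
        "3 * 3": "6",
    }
    return canned[prompt]


def contains_expected(output: str, ex: Example) -> bool:
    """Pass if the expected answer appears anywhere in the output."""
    return ex.expected.lower() in output.lower()


failures = run_eval(EXAMPLES, fake_generate, contains_expected)
print(f"{len(failures)} of {len(EXAMPLES)} failed")
```

Rerun it after every prompt change and compare the count. That is the number that moves.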

Match the check to the task

Not every output needs the same kind of check. Sort the task into one of three buckets.

If the output has a checkable property — valid JSON, a query that runs, an answer from an allowed set — just assert on it. No judgment needed, only code.

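import json

def check(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Valid JSON can still be a list or a bare string, so check the shape too.
    summary = data.get("summary") if isinstance(data, dict) else None
    return isinstance(summary, str) and len(summary) <= 280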

If you have an expected answer, compare against it, but rarely with an exact string match, because the model will phrase things differently and still be right. Check that the key facts are present, or use similarity to flag answers that drifted far from the reference. And if it’s genuinely open-ended — a summary, a rewrite, an explanation — that’s where most people give up on measuring. Don’t. That’s what a model-as-judge is for.
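For the middle bucket, where you have an expected answer, a rough sketch of both checks might look like this. The function names and the 0.5 threshold are my assumptions, and I’m using `difflib.SequenceMatcher` from the standard library as a crude character-level stand-in for similarity. It knows nothing about meaning, so swap in whatever similarity measure you actually trust.

```python
from difflib import SequenceMatcher


def missing_facts(output: str, key_facts: list[str]) -> list[str]:
    """Return the key facts that do not appear in the output (case-insensitive)."""
    text = output.lower()
    return [fact for fact in key_facts if fact.lower() not in text]


def similarity(output: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


def drifted(output: str, reference: str, threshold: float = 0.5) -> bool:
    """Flag outputs that landed far from the reference. The threshold is a guess to tune."""
    return similarity(output, reference) < threshold


answer = "Your refund arrives in 5 business days, back on the original card."
print(missing_facts(answer, ["5 business days", "original card"]))
print(missing_facts("Refunds are quick.", ["5 business days"]))
```

I’d treat a drift flag as a reason to look at the example, not as a verdict. A correct answer phrased very differently will still score low on a string measure like this.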

For the fuzzy stuff, have a model grade it — but check the grader

For the subjective qualities — is this summary faithful to the source, is this answer grounded or did it invent something, is the tone right — you can have a model score the output against a rubric. It works better than people expect, but only if you do two things.

First, score one specific thing at a time. “Rate one to five whether every claim in the summary is supported by the source” beats “rate the quality.” Vague criteria give you vague scores. Second, check the grader against your own labels before you trust it. Run it over the thirty examples you labeled by hand and see whether it agrees with you. If it doesn’t, fix the rubric until it does. This is the part I’ve spent the most time on, building a grader for exactly this kind of quality check, and the calibration step is what separates a grader you can rely on from a second opinion you can’t.
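The calibration step can be as simple as the sketch below. It assumes you already have a pass/fail label from yourself and a one-to-five score from the grader for each example. The `agreement` function and the `pass_at` cutoff are hypothetical names of mine, and the numbers are made up for illustration.

```python
def agreement(
    human_pass: list[bool],
    grader_scores: list[int],
    pass_at: int = 4,
) -> tuple[float, list[int]]:
    """Compare grader scores against hand labels.

    Returns the agreement rate and the indices where the two disagree.
    A grader score of `pass_at` or higher counts as a pass (an assumed cutoff).
    """
    if not human_pass or len(human_pass) != len(grader_scores):
        raise ValueError("need the same non-zero number of labels and scores")
    grader_pass = [score >= pass_at for score in grader_scores]
    disagreements = [
        i for i, (h, g) in enumerate(zip(human_pass, grader_pass)) if h != g
    ]
    rate = (len(human_pass) - len(disagreements)) / len(human_pass)
    return rate, disagreements


# Made-up labels for five examples.
human = [True, True, False, True, False]
scores = [5, 4, 4, 2, 1]

rate, disagreements = agreement(human, scores)
print(f"agreement: {rate:.0%}, disagreements at {disagreements}")
```

The disagreement indices matter more than the rate. Read each one, decide whether you or the rubric was wrong, and adjust accordingly.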

The trap to watch for: graders are biased. They tend to reward longer answers and a confident tone even when the content is wrong. Checking the grader against your hand labels is what catches that.
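One rough way to look for the length bias, as a sketch: pull out the cases where the grader passed something you failed, and see whether those outputs run longer than average. The row fields (`output`, `human_pass`, `grader_score`) and the cutoff are assumptions of mine, and with thirty examples this is a hint to investigate, not a statistic.

```python
from statistics import mean


def length_bias_report(rows: list[dict], pass_at: int = 4) -> dict:
    """Summarize output length for grader false positives versus all outputs.

    Each row is assumed to have 'output' (str), 'human_pass' (bool),
    and 'grader_score' (int). A false positive is grader pass, human fail.
    """
    false_pos = [
        r for r in rows if r["grader_score"] >= pass_at and not r["human_pass"]
    ]
    return {
        "false_positives": len(false_pos),
        "mean_len_all": mean(len(r["output"]) for r in rows),
        "mean_len_false_pos": (
            mean(len(r["output"]) for r in false_pos) if false_pos else None
        ),
    }


# Made-up rows: one short correct answer, one long wrong one, one short wrong one.
rows = [
    {"output": "x" * 10, "human_pass": True, "grader_score": 4},
    {"output": "x" * 100, "human_pass": False, "grader_score": 5},
    {"output": "x" * 10, "human_pass": False, "grader_score": 2},
]

report = length_bias_report(rows)
print(report)
```

If the false positives are consistently the long ones, that is a sign the rubric needs an explicit line telling the grader not to reward length.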

Once you’re live, production is your best dataset

The moment the feature is in front of users, you’re sitting on the best data you’ll ever have, if you capture it. Log every input, output, prompt version, and whatever the user did next: accepted the suggestion, edited it, retried, gave up. Those signals are a continuous, honest read on real quality. Pull the failures and the edited outputs back into your hand-labeled set periodically and it grows on its own, staying closer to what users actually do than anything you’d invent.
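A minimal sketch of that logging, assuming one JSON object per line and a small fixed vocabulary for what the user did next. The `Interaction` fields, the action names, and both helper functions are my own illustration. An in-memory buffer stands in for whatever log sink you actually use.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, TextIO


@dataclass
class Interaction:
    user_input: str
    output: str
    prompt_version: str
    user_action: str  # assumed vocabulary: accepted, edited, retried, abandoned
    logged_at: str


def log_interaction(
    sink: TextIO, user_input: str, output: str, prompt_version: str, user_action: str
) -> None:
    """Append one interaction to the sink as a single JSON line."""
    record = Interaction(
        user_input,
        output,
        prompt_version,
        user_action,
        datetime.now(timezone.utc).isoformat(),
    )
    sink.write(json.dumps(asdict(record)) + "\n")


def review_candidates(lines: Iterable[str]) -> list[dict]:
    """Return logged interactions where the user did anything other than accept."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if r["user_action"] != "accepted"]


log = io.StringIO()  # stand-in for a real log file or table
log_interaction(log, "summarize ticket 1", "Customer wants a refund.", "v3", "accepted")
log_interaction(log, "summarize ticket 2", "Customer is happy.", "v3", "edited")
log_interaction(log, "summarize ticket 3", "", "v3", "abandoned")

candidates = review_candidates(log.getvalue().splitlines())
print([c["user_input"] for c in candidates])
```

The candidates are what I’d label by hand and fold back into the evaluation set.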

None of this needs a research team or a labeling budget. It just needs you to decide to measure instead of guess. The teams that move fast with AI aren’t the ones who skip evaluation. They’re the ones who made it cheap enough to do every time.