5 Evals Every Production LLM Needs

You’ve picked a model. It has good benchmark scores. You’ve done some manual testing and it seems to work. Now what?

Most teams skip straight to deployment. Then they spend the next three months firefighting production issues that proper evaluation would have caught.

Here are the five evaluations that actually matter for production LLMs. None of them are on any leaderboard.

1. Task Completion Rate

This sounds obvious, but most teams don’t measure it properly.

Task completion isn’t “did the model generate a response.” It’s “did the model actually accomplish what the user needed.” These are very different things.

A customer asks about their order status. The model responds with a polite, grammatically correct message that doesn’t actually tell them where their order is. That’s a failed task, even though the response looks fine on the surface.

How to measure it:

Define what “success” means for each task type in your application
Build test cases with clear success criteria, not just expected outputs
Use a separate LLM to judge task completion (cheaper and more consistent than human review at scale)
Track completion rates by task type, not just overall

You’ll often find that your model is great at some tasks and terrible at others. A 90% overall completion rate might hide a 40% rate on your most important task type.

2. Refusal Calibration

When should your model say “I can’t help with that”?

Too many refusals and users get frustrated. Too few and you get liability issues, hallucinations presented as facts, and responses outside your model’s actual capabilities.

The goal is calibrated refusals: the model refuses when it should and doesn’t when it shouldn’t.

Build two test sets:

Should-refuse: Questions outside your scope, requests for capabilities you don’t have, anything that requires information the model doesn’t have access to
Should-not-refuse: Legitimate requests that the model might be overly cautious about

Then measure:

False refusal rate (refused when it shouldn’t have)
False acceptance rate (answered when it should have refused)

Most production issues come from false acceptances. The model confidently answers a question it has no business answering. A customer asks about a policy that doesn’t exist and the model invents one. Someone asks for medical advice and the model provides it.

False refusals are annoying. False acceptances are dangerous.

3. Consistency Under Variation

Users don’t phrase questions the same way every time. Your model needs to give consistent answers regardless of how the question is asked.

Take your core test cases and create 3-5 variations of each:

Different wording, same meaning
Formal vs casual phrasing
With and without typos
Different levels of detail in the question
Questions vs statements (“What’s the return policy?” vs “Tell me about returns”)

Then measure how often the model gives semantically equivalent answers to equivalent questions.

Inconsistency destroys user trust. If a customer gets different answers depending on how they phrase the question, they learn that your system can’t be relied on. They start asking the same question multiple ways to see what they get. That’s a sign your system is failing.

Target: 95%+ consistency on core use cases. Anything less and you’ll hear about it from users.

4. Boundary Behavior

Most failures happen at boundaries. Questions that are almost in scope. Requests that are mostly reasonable with one problematic element. Edge cases that the model has to make judgment calls on.

Build a test set specifically for boundaries:

Scope boundaries: Questions that are adjacent to your use case but not quite in it
Knowledge boundaries: Questions where the model has partial but incomplete information
Policy boundaries: Requests that are mostly fine but have edge case concerns
Capability boundaries: Tasks that are at the limit of what the model can reliably do

What you’re looking for isn’t necessarily right answers. You’re looking for appropriate behavior. Does the model recognize it’s in uncertain territory? Does it hedge appropriately? Does it ask for clarification when needed?

A model that confidently handles boundary cases wrong is more dangerous than one that struggles with them visibly. At least the visible struggle gives users a signal to verify.

5. Adversarial Robustness

Some users will try to break your system. Intentionally or not, they’ll find the prompts that make it misbehave.

You need to find those prompts first.

Test for:

Prompt injection: Can users inject instructions that override your system prompt?
Jailbreaking: Can users get the model to ignore its guidelines?
Information extraction: Can users get the model to reveal system prompts, internal instructions, or information it shouldn’t share?
Manipulation: Can users use social engineering tactics to change the model’s behavior?

The adversarial testing landscape changes constantly. New attack techniques emerge. What worked as a defense last month might not work today. This isn’t a one-time eval. It’s ongoing.

Practical approach:

Start with known attack patterns (there are public datasets)
Add variations specific to your use case
Run red-team exercises with people who actually try to break things
Monitor production for novel attack patterns and add them to your test set

You won’t catch everything. The goal is to catch the obvious stuff before users do and have a process for catching the rest quickly.

Putting It Together

These five evals give you a realistic picture of how your model will behave in production:

Eval	What It Catches	Target
Task Completion	Model doesn’t actually do the job	90%+ on core tasks
Refusal Calibration	Wrong answers presented confidently	<5% false acceptance
Consistency	Different answers to same question	95%+ consistency
Boundary Behavior	Failures on edge cases	Appropriate hedging
Adversarial	Security and safety issues	Block known attacks

None of these are hard to implement. They just require thinking about evaluation differently. Instead of “how smart is this model,” you’re asking “will this model work for my specific use case.”

That’s the question that actually matters.

Running Evals Continuously

One more thing: these aren’t one-time checks.

Models change. Providers update them. Your use cases evolve. User behavior shifts. What passed last month might fail today.

Set up these evals to run:

Before any deployment
After any model or prompt change
On a regular schedule (weekly at minimum)
When you see production issues that might indicate regression

Evaluation isn’t a gate you pass once. It’s a continuous process. The teams that treat it that way are the ones who actually keep their LLMs working in production.

1. Task Completion Rate

2. Refusal Calibration

3. Consistency Under Variation

4. Boundary Behavior

5. Adversarial Robustness

Putting It Together

Running Evals Continuously

Stay ahead of AI governance

Related Articles

The Eval Crisis: Why Most Benchmarks Don't Matter

Eval Debt Will End Careers

Download Resource