You’ve picked a model. It has good benchmark scores. You’ve done some manual testing and it seems to work. Now what?
Most teams skip straight to deployment. Then they spend the next three months firefighting production issues that proper evaluation would have caught.
Here are the five evaluations that actually matter for production LLMs. None of them are on any leaderboard.
1. Task Completion Rate
This sounds obvious, but most teams don’t measure it properly.
Task completion isn’t “did the model generate a response.” It’s “did the model actually accomplish what the user needed.” These are very different things.
A customer asks about their order status. The model responds with a polite, grammatically correct message that doesn’t actually tell them where their order is. That’s a failed task, even though the response looks fine on the surface.
How to measure it:
- Define what “success” means for each task type in your application
- Build test cases with clear success criteria, not just expected outputs
- Use a separate LLM to judge task completion (cheaper and more consistent than human review at scale)
- Track completion rates by task type, not just overall
You’ll often find that your model is great at some tasks and terrible at others. A 90% overall completion rate might hide a 40% rate on your most important task type.
2. Refusal Calibration
When should your model say “I can’t help with that”?
Too many refusals and users get frustrated. Too few and you get liability issues, hallucinations presented as facts, and responses outside your model’s actual capabilities.
The goal is calibrated refusals: the model refuses when it should and doesn’t when it shouldn’t.
Build two test sets:
- Should-refuse: Questions outside your scope, requests for capabilities you don’t have, anything that requires information the model doesn’t have access to
- Should-not-refuse: Legitimate requests that the model might be overly cautious about
Then measure:
- False refusal rate (refused when it shouldn’t have)
- False acceptance rate (answered when it should have refused)
Most production issues come from false acceptances. The model confidently answers a question it has no business answering. A customer asks about a policy that doesn’t exist and the model invents one. Someone asks for medical advice and the model provides it.
False refusals are annoying. False acceptances are dangerous.
3. Consistency Under Variation
Users don’t phrase questions the same way every time. Your model needs to give consistent answers regardless of how the question is asked.
Take your core test cases and create 3-5 variations of each:
- Different wording, same meaning
- Formal vs casual phrasing
- With and without typos
- Different levels of detail in the question
- Questions vs statements (“What’s the return policy?” vs “Tell me about returns”)
Then measure how often the model gives semantically equivalent answers to equivalent questions.
Inconsistency destroys user trust. If a customer gets different answers depending on how they phrase the question, they learn that your system can’t be relied on. They start asking the same question multiple ways to see what they get. That’s a sign your system is failing.
Target: 95%+ consistency on core use cases. Anything less and you’ll hear about it from users.
4. Boundary Behavior
Most failures happen at boundaries. Questions that are almost in scope. Requests that are mostly reasonable with one problematic element. Edge cases that the model has to make judgment calls on.
Build a test set specifically for boundaries:
- Scope boundaries: Questions that are adjacent to your use case but not quite in it
- Knowledge boundaries: Questions where the model has partial but incomplete information
- Policy boundaries: Requests that are mostly fine but have edge case concerns
- Capability boundaries: Tasks that are at the limit of what the model can reliably do
What you’re looking for isn’t necessarily right answers. You’re looking for appropriate behavior. Does the model recognize it’s in uncertain territory? Does it hedge appropriately? Does it ask for clarification when needed?
A model that confidently handles boundary cases wrong is more dangerous than one that struggles with them visibly. At least the visible struggle gives users a signal to verify.
5. Adversarial Robustness
Some users will try to break your system. Intentionally or not, they’ll find the prompts that make it misbehave.
You need to find those prompts first.
Test for:
- Prompt injection: Can users inject instructions that override your system prompt?
- Jailbreaking: Can users get the model to ignore its guidelines?
- Information extraction: Can users get the model to reveal system prompts, internal instructions, or information it shouldn’t share?
- Manipulation: Can users use social engineering tactics to change the model’s behavior?
The adversarial testing landscape changes constantly. New attack techniques emerge. What worked as a defense last month might not work today. This isn’t a one-time eval. It’s ongoing.
Practical approach:
- Start with known attack patterns (there are public datasets)
- Add variations specific to your use case
- Run red-team exercises with people who actually try to break things
- Monitor production for novel attack patterns and add them to your test set
You won’t catch everything. The goal is to catch the obvious stuff before users do and have a process for catching the rest quickly.
Putting It Together
These five evals give you a realistic picture of how your model will behave in production:
| Eval | What It Catches | Target |
|---|---|---|
| Task Completion | Model doesn’t actually do the job | 90%+ on core tasks |
| Refusal Calibration | Wrong answers presented confidently | <5% false acceptance |
| Consistency | Different answers to same question | 95%+ consistency |
| Boundary Behavior | Failures on edge cases | Appropriate hedging |
| Adversarial | Security and safety issues | Block known attacks |
None of these are hard to implement. They just require thinking about evaluation differently. Instead of “how smart is this model,” you’re asking “will this model work for my specific use case.”
That’s the question that actually matters.
Running Evals Continuously
One more thing: these aren’t one-time checks.
Models change. Providers update them. Your use cases evolve. User behavior shifts. What passed last month might fail today.
Set up these evals to run:
- Before any deployment
- After any model or prompt change
- On a regular schedule (weekly at minimum)
- When you see production issues that might indicate regression
Evaluation isn’t a gate you pass once. It’s a continuous process. The teams that treat it that way are the ones who actually keep their LLMs working in production.