Everyone’s deploying LLMs. Few are evaluating them properly.
The gap between “it works in the demo” and “it works in production” is almost always an evaluation gap. Teams ship models that pass the eye test, then spend months firefighting hallucinations, regressions, and edge cases they never saw coming.
The problem isn’t that evaluation is hard. It’s that most teams don’t know what good evaluation looks like - or where they currently stand.
This post introduces a six-level maturity model for LLM evaluation. Use it to figure out where you are, where you need to be, and what it takes to get there.
```mermaid
graph LR
    L0[Level 0<br/>Manual Vibes] --> L1[Level 1<br/>Structured Review]
    L1 --> L2[Level 2<br/>Automated Assertions]
    L2 --> L3[Level 3<br/>Statistical Evaluation]
    L3 --> L4[Level 4<br/>Continuous Pipelines]
    L4 --> L5[Level 5<br/>Adaptive Systems]
    style L0 fill:#ef4444,color:#fff
    style L1 fill:#f97316,color:#fff
    style L2 fill:#eab308,color:#000
    style L3 fill:#22c55e,color:#fff
    style L4 fill:#3b82f6,color:#fff
    style L5 fill:#8b5cf6,color:#fff
```
The Six Levels at a Glance
| Level | Name | How You Evaluate |
|---|---|---|
| 0 | Manual Vibes | “It looks right to me” |
| 1 | Structured Review | Checklists and rubrics |
| 2 | Automated Assertions | Test suites with golden datasets |
| 3 | Statistical Evaluation | Metrics, benchmarks, significance tests |
| 4 | Continuous Pipelines | CI/CD integration, regression detection |
| 5 | Adaptive Systems | Self-improving evals, production feedback |
Most teams think they’re at Level 3. Most teams are actually at Level 1.
Let’s walk through each level.
Level 0: Manual Vibes
What it looks like: Someone runs a few prompts, reads the outputs, and decides if the model is “good enough.” There’s no documentation, no consistency, no record of what was tested.
```python
# Level 0: The entire evaluation process
response = llm.generate("Summarize this document...")
print(response)
# Developer squints at output
# "Yeah, that looks fine. Ship it."
```
How you got here: You’re moving fast. The demo is tomorrow. Evaluation feels like a luxury you’ll add later.
The problem: “Later” never comes. When something breaks in production, you have no baseline to compare against. You don’t know if the model got worse or if this edge case always existed.
Signs you’re at Level 0:
- No written evaluation criteria
- Different people would judge the same output differently
- You can’t reproduce last week’s evaluation
- “It worked when I tested it” is a common phrase
What you’re missing: Any form of systematic quality control. You’re flying blind.
Level 1: Structured Review
What it looks like: You’ve created rubrics. Human reviewers follow documented criteria. There’s a spreadsheet or tool tracking results.
```python
# Level 1: Structured human evaluation
RUBRIC = {
    "accuracy": "Does the response contain factual errors?",
    "completeness": "Does it address all parts of the query?",
    "tone": "Is the tone appropriate for the use case?",
    "hallucination": "Does it claim things not in the source?",
}

def human_review(response, rubric=RUBRIC):
    """Collect a 1-5 score from a human reviewer for each criterion."""
    scores = {}
    for criterion, question in rubric.items():
        scores[criterion] = int(input(f"{question} (1-5): "))
    return scores
```
Capabilities unlocked:
- Consistent evaluation criteria across reviewers
- Historical record of quality assessments
- Ability to identify patterns in failures
The problem: Human review doesn’t scale. Reviewing 100 outputs takes hours. Reviewing 10,000 is impossible. You end up sampling, and sampling means you miss things.
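A quick back-of-the-envelope calculation shows how much sampling hides. The failure rate and traffic volume below are made up for illustration:

```python
# Back-of-the-envelope: how often does a clean-looking sample hide real failures?
# The failure rate and volumes here are hypothetical.
failure_rate = 0.01        # assume 1% of outputs are bad
daily_outputs = 10_000     # assume this much production traffic per day

for sample_size in (50, 100, 500):
    p_sample_looks_clean = (1 - failure_rate) ** sample_size
    failures_unreviewed = failure_rate * (daily_outputs - sample_size)
    print(f"review {sample_size:>3}: {p_sample_looks_clean:.0%} chance the sample "
          f"looks perfect, ~{failures_unreviewed:.0f} failures go unreviewed")
```

At these (hypothetical) rates, a 50-output review has a better-than-even chance of looking completely clean while roughly a hundred failures slip through unseen.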
Signs you’re at Level 1:
- You have evaluation rubrics documented somewhere
- Quality depends heavily on which human is reviewing
- Evaluation happens at milestones, not continuously
- You sample because full coverage is impractical
What you’re missing: Automation. Speed. Coverage.
Level 2: Automated Assertions
What it looks like: You’ve built test suites. Golden datasets with expected outputs. Automated checks that run without human intervention.
```python
# Level 2: Automated test suite
import pytest

GOLDEN_DATASET = [
    {
        "input": "What is the capital of France?",
        "expected_contains": ["Paris"],
        "expected_not_contains": ["Lyon", "Marseille"],
    },
    {
        "input": "Summarize: The cat sat on the mat.",
        "min_length": 5,
        "max_length": 50,
    },
]

@pytest.mark.parametrize("case", GOLDEN_DATASET)
def test_llm_output(case, llm_client):
    response = llm_client.generate(case["input"])
    if "expected_contains" in case:
        for term in case["expected_contains"]:
            assert term.lower() in response.lower()
    if "expected_not_contains" in case:
        for term in case["expected_not_contains"]:
            assert term.lower() not in response.lower()
    if "min_length" in case:
        assert len(response) >= case["min_length"]
    if "max_length" in case:
        assert len(response) <= case["max_length"]
```
Capabilities unlocked:
- Repeatable evaluation with every code change
- Immediate feedback on regressions
- Coverage of known edge cases
- Evaluation becomes part of the development workflow
The problem: Assertions are binary. Pass or fail. But LLM outputs exist on a spectrum. “Contains Paris” doesn’t tell you if the response was actually good. You can pass every assertion and still ship garbage.
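As a made-up illustration, both of the responses below sail through the Level 2 checks for the capital-of-France case; only one of them should ever reach a user:

```python
# Two (invented) responses to "What is the capital of France?". Both pass the
# Level 2 assertions; only the first is actually correct.
good = "The capital of France is Paris."
garbage = ("Paris, France's capital since 1999, is also the capital of the EU "
           "and home to roughly 40 billion people.")

for response in (good, garbage):
    assert "paris" in response.lower()
    assert "lyon" not in response.lower()
    assert "marseille" not in response.lower()
print("Both responses pass -- string assertions can't tell them apart.")
```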
Signs you’re at Level 2:
- You have automated tests for LLM outputs
- Tests check for specific strings or patterns
- You maintain a golden dataset of test cases
- Most tests are pass/fail with no nuance
What you’re missing: Nuance. Statistical rigor. A way to decide how good is “good enough.”
Level 3: Statistical Evaluation
What it looks like: You’re measuring quality with real metrics. Benchmarks against baselines. Statistical significance testing. Understanding variance, not just averages.
```python
# Level 3: Statistical evaluation with metrics
from scipy import stats
import numpy as np

class LLMEvaluator:
    def __init__(self, baseline_scores: list[float]):
        self.baseline = baseline_scores
        self.baseline_mean = np.mean(baseline_scores)
        self.baseline_std = np.std(baseline_scores)

    def evaluate_batch(self, responses: list[str],
                       scorer: callable) -> dict:
        scores = [scorer(r) for r in responses]
        # Statistical comparison to baseline
        t_stat, p_value = stats.ttest_ind(scores, self.baseline)
        return {
            "mean": np.mean(scores),
            "std": np.std(scores),
            "baseline_mean": self.baseline_mean,
            "significant_difference": p_value < 0.05,
            "p_value": p_value,
            "improved": np.mean(scores) > self.baseline_mean,
            "effect_size": (np.mean(scores) - self.baseline_mean)
                           / self.baseline_std,
        }

    def detect_regression(self, new_scores: list[float],
                          threshold: float = 0.1) -> bool:
        """Detect if new scores are significantly worse."""
        effect = (np.mean(new_scores) - self.baseline_mean) / self.baseline_std
        return effect < -threshold
```
Capabilities unlocked:
- Quantified quality improvements (or regressions)
- Confidence intervals around your metrics
- Ability to detect small but meaningful changes
- Data-driven decisions about model updates
The problem: You’re evaluating at discrete points in time. Between evaluations, your model could drift, your data could shift, your prompts could diverge. You find out about problems after they’ve been in production for weeks.
Signs you’re at Level 3:
- You track metrics like BLEU, ROUGE, or custom scores
- You compare against baselines with statistical tests
- You understand variance in your evaluations
- Evaluation still happens in batches, not continuously
What you’re missing: Real-time feedback. Continuous monitoring. Early warning systems.
Level 4: Continuous Pipelines
What it looks like: Evaluation is built into your CI/CD pipeline. Every commit triggers evaluation. Production traffic is continuously sampled and scored. Regressions are caught before they reach users - or immediately after.
```python
# Level 4: Continuous evaluation pipeline
import random
from datetime import datetime
from queue import Queue

import numpy as np

class ContinuousEvalPipeline:
    def __init__(self, config: EvalConfig):
        self.metrics = config.metrics
        self.thresholds = config.thresholds
        self.alert_channels = config.alerts
        self.sample_rate = config.sample_rate
        self.eval_queue = Queue()  # drained by async eval workers

    def evaluate_on_commit(self, commit_sha: str) -> EvalReport:
        """Run full evaluation suite on new commits."""
        model = load_model_at_commit(commit_sha)
        results = {}
        for metric_name, metric_fn in self.metrics.items():
            scores = self.run_eval_suite(model, metric_fn)
            results[metric_name] = {
                "scores": scores,
                "passed": np.mean(scores) >= self.thresholds[metric_name],
                "baseline_comparison": self.compare_to_baseline(scores),
            }
        report = EvalReport(commit_sha, results)
        if report.has_regression:
            self.block_deployment(commit_sha, report)
            self.alert(f"Regression detected in {commit_sha}")
        return report

    def sample_production(self, request: Request,
                          response: Response) -> None:
        """Sample and evaluate production traffic."""
        if random.random() > self.sample_rate:
            return
        # Async evaluation - don't block the response
        self.eval_queue.put({
            "request": request,
            "response": response,
            "timestamp": datetime.utcnow(),
            "model_version": self.current_model_version,
        })

    def detect_drift(self, window_hours: int = 24) -> DriftReport:
        """Detect quality drift over time."""
        recent = self.get_scores(hours=window_hours)
        historical = self.get_scores(hours=window_hours * 7)
        drift = self.calculate_drift(recent, historical)
        if drift.is_significant:
            self.alert(f"Quality drift detected: {drift.summary}")
        return drift
```
Capabilities unlocked:
- Regressions blocked before deployment
- Production quality monitored in real-time
- Drift detection catches slow degradation
- Evaluation metrics visible to the whole team
- Historical trends for capacity planning
The problem: Your evaluation criteria are static. You defined what “good” looks like months ago. But your users evolved. Your use cases expanded. Your evals are testing for yesterday’s problems.
Signs you’re at Level 4:
- Evaluation runs automatically on every commit
- Production traffic is sampled and scored
- Dashboards show quality metrics over time
- Regressions trigger alerts or block deploys
- You have SLOs for model quality
What you’re missing: Adaptivity. Learning from production. Evaluation that evolves with your system.
Level 5: Adaptive Systems
What it looks like: Your evaluation system learns from production feedback. User signals - thumbs up, corrections, escalations - feed back into eval criteria. New failure modes are automatically detected and added to test suites. The system improves itself.
```python
# Level 5: Adaptive evaluation system
class AdaptiveEvalSystem:
    def __init__(self):
        self.eval_model = load_eval_model()
        self.failure_detector = FailurePatternDetector()
        self.criteria_generator = CriteriaGenerator()
        self.test_generator = SyntheticTestGenerator()  # synthetic-case generator

    def ingest_feedback(self, feedback: UserFeedback) -> None:
        """Learn from user feedback signals."""
        if feedback.is_negative:
            # Analyze what went wrong
            failure_pattern = self.failure_detector.analyze(
                feedback.request,
                feedback.response,
                feedback.user_correction,
            )
            if failure_pattern.is_novel:
                # Generate new eval criteria
                new_criterion = self.criteria_generator.create(
                    failure_pattern
                )
                self.add_to_eval_suite(new_criterion)
                # Backfill: check historical data
                self.backfill_evaluation(new_criterion)

    def evolve_metrics(self) -> None:
        """Periodically evolve evaluation metrics based on data."""
        # Analyze correlation between metrics and user satisfaction
        correlations = self.analyze_metric_effectiveness()
        # Deprecate metrics that don't predict user satisfaction
        for metric, corr in correlations.items():
            if corr < 0.3:
                self.deprecate_metric(metric)
                self.alert(f"Metric {metric} deprecated - low correlation")
        # Propose new metrics based on failure patterns
        new_metrics = self.propose_metrics_from_failures()
        for metric in new_metrics:
            self.queue_for_human_review(metric)

    def synthetic_test_generation(self,
                                  failure_mode: str) -> list[TestCase]:
        """Generate synthetic test cases for known failure modes."""
        return self.test_generator.generate(
            failure_mode=failure_mode,
            count=100,
            diversity_threshold=0.8,
        )
```
Capabilities unlocked:
- Evaluation criteria evolve with your product
- Novel failure modes are caught and codified
- Metrics stay correlated with actual user satisfaction
- Test coverage expands automatically
- The system gets smarter over time
Reality check: Level 5 is partially theoretical. Some organizations have elements of this - feedback loops, automated test generation - but fully adaptive evaluation systems are rare. This is the frontier.
Signs you’re approaching Level 5:
- User feedback directly influences eval criteria
- New test cases are generated from production failures
- You measure correlation between evals and user outcomes (see the sketch after this list)
- Your evaluation system has its own evaluation system
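The correlation check is the easiest place to start. A minimal sketch, assuming you can pair each automated eval score with a thumbs-up/thumbs-down signal for the same response (all numbers below are hypothetical):

```python
# Sketch: does an automated eval metric predict user satisfaction?
# Scores and thumbs-up labels are hypothetical paired observations
# for the same responses (1 = thumbs up, 0 = thumbs down).
from scipy import stats

eval_scores    = [0.91, 0.45, 0.78, 0.62, 0.88, 0.30, 0.70, 0.95]
user_thumbs_up = [1,    0,    1,    0,    1,    0,    1,    1]

r, p_value = stats.pearsonr(eval_scores, user_thumbs_up)
print(f"metric vs. user satisfaction: r={r:.2f} (p={p_value:.3f})")
# A metric that doesn't track user outcomes is measuring the wrong thing.
```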
Assessment: Where Are You?
Answer these questions honestly:
Level 0 → 1:
- Do you have written evaluation criteria?
- Can two reviewers evaluate the same output consistently?
- Do you keep records of past evaluations?
Level 1 → 2:
- Do you have automated tests for LLM outputs?
- Can you run evaluations without human involvement?
- Do you maintain a golden dataset?
Level 2 → 3:
- Do you track quantitative metrics (not just pass/fail)?
- Do you compare against baselines with statistical rigor?
- Do you understand variance in your evaluations?
Level 3 → 4:
- Does evaluation run automatically on every commit?
- Do you sample and score production traffic?
- Do you have alerts for quality regressions?
Level 4 → 5:
- Does user feedback influence evaluation criteria?
- Are new failure modes automatically added to test suites?
- Do you measure if your metrics predict user satisfaction?
Your level is the highest where you can check all boxes.
The Path Forward
Moving up the maturity ladder isn’t about boiling the ocean. It’s about incremental improvements that compound.
From 0 to 1 (1-2 weeks): Write down your evaluation criteria. Create a simple rubric. Start tracking results in a spreadsheet. This costs nothing and pays off immediately.
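“Tracking results” doesn’t need tooling on day one. A minimal sketch, assuming a local CSV file (the filename and column layout are arbitrary choices):

```python
# Minimal sketch: append every structured review to a CSV so there is a
# queryable history. The filename and columns are arbitrary.
import csv
from datetime import datetime, timezone

def log_review(prompt: str, response: str, scores: dict[str, int],
               path: str = "eval_log.csv") -> None:
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            prompt,
            response,
            *(f"{k}={v}" for k, v in sorted(scores.items())),
        ])
```

Pair this with the Level 1 rubric above and last week’s evaluation stops being unreproducible.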
From 1 to 2 (2-4 weeks): Pick your top 20 failure cases and write assertions for them. Add them to your test suite. Run them on every PR. You’ll catch regressions before they ship.
From 2 to 3 (1-2 months): Introduce real metrics. Start with something simple - even just average response length or keyword hit rate. Establish baselines. Learn what statistical significance means for your use case.
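A first metric really can be this small. The sketch below scores responses by keyword hit rate and compares a candidate run against a baseline with a t-test; all keywords and scores are made up:

```python
# Sketch: a first real metric (keyword hit rate) plus a baseline comparison.
# Keywords and scores below are hypothetical.
import numpy as np
from scipy import stats

def keyword_hit_rate(response: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the response."""
    hits = sum(kw.lower() in response.lower() for kw in keywords)
    return hits / len(keywords)

baseline_scores  = [0.6, 0.8, 0.7, 0.9, 0.6, 0.8]   # current prompt/model
candidate_scores = [0.8, 0.9, 0.7, 1.0, 0.9, 0.8]   # proposed change

t_stat, p_value = stats.ttest_ind(candidate_scores, baseline_scores)
print(f"baseline={np.mean(baseline_scores):.2f} "
      f"candidate={np.mean(candidate_scores):.2f} p={p_value:.3f}")
```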
From 3 to 4 (2-3 months): Build the pipeline. Integrate evaluation into CI/CD. Sample production traffic. Set up alerting. This is infrastructure work, but it’s what separates hobby projects from production systems.
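The pipeline can start as a single gate script that CI runs after the eval suite and that fails the build on a regression. A minimal sketch, assuming eval scores are written to JSON files and using an arbitrary 0.02 margin:

```python
# Sketch of a CI regression gate: compare candidate scores against a stored
# baseline and fail the build on a drop. Paths and the 0.02 margin are
# assumptions for illustration.
import json
import sys

import numpy as np

def main() -> int:
    with open("eval/baseline_scores.json") as f:
        baseline = json.load(f)
    with open("eval/candidate_scores.json") as f:
        candidate = json.load(f)
    margin = 0.02
    regressed = np.mean(candidate) < np.mean(baseline) - margin
    print(f"baseline={np.mean(baseline):.3f} candidate={np.mean(candidate):.3f}")
    return 1 if regressed else 0  # nonzero exit code blocks the deploy

if __name__ == "__main__":
    sys.exit(main())
```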
From 4 to 5 (ongoing): Start closing feedback loops. Connect user signals to evaluation. Experiment with automated test generation. This is research-grade work - approach it iteratively.
Why This Matters
Here’s the uncomfortable truth: most LLM failures in production aren’t model failures. They’re evaluation failures. The model did exactly what it was going to do - you just didn’t check.
Organizations that invest in evaluation maturity ship faster, not slower. They catch problems early when they’re cheap to fix. They build confidence in their systems. They can actually answer the question “is this model better than the last one?”
The teams struggling with LLM reliability usually aren’t struggling with the LLM. They’re struggling with knowing whether the LLM is working.
Fix your evaluation. The rest follows.
Need to level up your evaluation maturity? Rotascale Eval provides the metrics, pipelines, and feedback loops you need to move up the maturity ladder - without building everything from scratch.
Learn about Rotascale Eval