A financial services client showed us something instructive last month.
Their credit risk agent returned this for a small business loan application:
```json
{
  "applicant_id": "SMB-2024-7891",
  "risk_score": 72,
  "risk_category": "moderate",
  "recommendation": "approve_with_conditions",
  "conditions": ["quarterly_review", "collateral_required"],
  "confidence": 0.94
}
```
Every field present. Every type correct. Every enum value valid. The JSON was impeccable.
The answer was wrong.
The applicant had three recent defaults that the agent never examined. The risk score should have been in the low 30s. The recommendation should have been decline. But the output passed every schema validation check their team had built.
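To make the failure mode concrete, here is a minimal sketch of the kind of check a structured output pipeline performs, using the jsonschema library. Field names and visible values come from the example above; the enum members and ranges beyond those shown are assumptions. The incorrect output passes without complaint:

```python
# Minimal sketch: strict schema validation accepts a semantically wrong output.
# Assumes the `jsonschema` package; enum values beyond the example are guesses.
from jsonschema import validate

RISK_SCHEMA = {
    "type": "object",
    "properties": {
        "applicant_id": {"type": "string"},
        "risk_score": {"type": "integer", "minimum": 0, "maximum": 100},
        "risk_category": {"enum": ["low", "moderate", "high"]},
        "recommendation": {"enum": ["approve", "approve_with_conditions", "decline"]},
        "conditions": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["applicant_id", "risk_score", "risk_category",
                 "recommendation", "confidence"],
    "additionalProperties": False,
}

output = {
    "applicant_id": "SMB-2024-7891",
    "risk_score": 72,                              # should have been in the low 30s
    "risk_category": "moderate",                   # wrong, but a valid enum member
    "recommendation": "approve_with_conditions",   # should have been "decline"
    "conditions": ["quarterly_review", "collateral_required"],
    "confidence": 0.94,
}

validate(instance=output, schema=RISK_SCHEMA)      # raises nothing: fully compliant
print("Schema check passed; semantic correctness unknown.")
```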
They’d spent six months perfecting their structured output pipeline. JSON mode, function calling with strict schemas, response validation middleware. They had 100% schema compliance. They assumed that meant their system was reliable.
It wasn’t.
The Structured Output Hype Cycle
Structured output has been one of the most celebrated improvements in LLM tooling. And for good reason - the progression solved real problems:
2023: Raw text parsing. Regex extraction. Pray the model follows your format. Failure rate: 15-30%.
2024 Q1: JSON mode. Models reliably produce valid JSON. Format failures drop to near zero.
2024 Q2: Function calling with schemas. Models fill in defined parameters. Type safety improves.
2024 Q4: Constrained decoding. Token-level enforcement guarantees schema compliance. Format reliability hits 99.9%+.
2025: Teams declare the structured output problem “solved” and move on.
Each step was a genuine improvement. Each step solved a format problem. None of them solved the semantic problem.
```mermaid
graph LR
    A[Raw Text] -->|"JSON mode"| B[Valid JSON]
    B -->|"Function calling"| C[Schema Compliant]
    C -->|"Constrained decoding"| D[Type Safe]
    D -.->|"???"| E[Semantically Correct]
    style A fill:#dc3545,color:#fff
    style B fill:#fd7e14,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#20c997,color:#fff
    style E fill:#6c757d,color:#fff,stroke-dasharray: 5 5
```
The industry spent two years climbing from raw text to type safety. That’s real progress. But the gap between type safety and semantic correctness is where production systems fail - and no amount of schema engineering will close it.
What Structured Output Actually Guarantees
Let’s be precise about what you get and what you don’t.
| Layer | What It Means | Structured Output Guarantees It? |
|---|---|---|
| Syntactic validity | Output is parseable JSON/XML | Yes |
| Schema compliance | Fields, types, enums match spec | Yes |
| Referential integrity | IDs reference real entities | No |
| Semantic accuracy | Values reflect actual facts | No |
| Logical consistency | Fields don’t contradict each other | No |
| Temporal validity | Information is current | No |
| Policy compliance | Output follows business rules | No |
| Reasoning soundness | Conclusion follows from evidence | No |
Structured output gives you the top two rows. Production reliability requires all eight.
A risk score of 72 is schema-compliant. Whether it’s correct depends on whether the model actually examined the applicant’s credit history, weighed the defaults appropriately, applied the right scoring methodology, and didn’t hallucinate favorable data points. None of that is captured by schema validation.
Three Ways Structured Output Fails You
1. Confident Hallucination in Required Fields
When a schema requires a field, the model must produce a value. If it doesn’t have enough information to produce the right value, it will produce a plausible wrong one.
Required fields don’t tolerate uncertainty. A risk_score field doesn’t accept “I’m not sure.” It accepts a number. So the model gives you a number - and confidence is no indicator of accuracy.
This is worse than a model that refuses to answer. At least refusal is honest. A hallucinated value in a required field is a confident lie wrapped in valid syntax.
In regulated industries, this isn’t a technical curiosity. It’s a compliance incident. A fabricated risk score that triggers an automated lending decision is exactly the kind of failure that draws regulatory scrutiny.
2. Schema-Shaped Drift
Model providers update their models regularly. Each update can subtly shift how the model interprets your schema - not breaking it, but changing the semantics.
We’ve seen this pattern repeatedly: a model update causes risk_category: "moderate" to be assigned to cases that were previously classified as "high". The enum values haven’t changed. The distribution of values has. Schema validation sees nothing wrong.
This is semantic drift wearing a syntactically valid disguise. Your monitoring checks that the output is valid JSON with the right fields. It doesn’t check that “moderate” still means what it meant last month.
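Catching this requires looking at outputs in aggregate rather than one at a time. A minimal sketch of that kind of check, with an assumed baseline distribution and an assumed alert threshold:

```python
# Sketch: detect distribution shift in a schema-valid enum field.
# Every individual output can pass validation while the mix of values drifts.
from collections import Counter

def category_distribution(outputs, field="risk_category"):
    counts = Counter(o[field] for o in outputs)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Baseline frozen from a reviewed sample; the threshold is a tuning assumption.
BASELINE = {"low": 0.30, "moderate": 0.45, "high": 0.25}
DRIFT_THRESHOLD = 0.10

def drifted(recent_outputs):
    current = category_distribution(recent_outputs)
    return total_variation_distance(BASELINE, current) > DRIFT_THRESHOLD

# Toy check: a batch that skews heavily toward "moderate" trips the alarm,
# even though every output in it is schema-valid.
batch = [{"risk_category": "moderate"}] * 8 + [{"risk_category": "low"}] * 2
print(drifted(batch))  # True -> investigate
```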
3. Adversarial Schema Compliance
Prompt injection attacks don’t need to break your schema. They just need to influence the values within it.
An attacker who understands your schema can craft inputs that steer the model toward specific schema-compliant outputs. The output passes every validation check. The values serve the attacker’s intent, not yours.
This is particularly dangerous in financial services, insurance claims, and any domain where schema-compliant output triggers automated downstream actions. An approve/deny decision is binary and schema-valid either way. The question is whether the right one was selected - and schema validation can’t tell you.
Why Better Schemas Won’t Save You
The instinct when confronted with semantic failures is to add more schema constraints. More enums. Tighter ranges. Conditional required fields. Co-occurrence rules.
This is a natural but misguided response.
Adding more schema constraints for semantic reliability is like adding spell-check rules to catch factual errors. You can make the spell-checker arbitrarily sophisticated - it will still never tell you that a correctly spelled sentence is factually wrong.
We’ve seen teams build schemas with 200+ constraints, conditional logic, cross-field validation rules, and custom validators. The schemas become maintenance nightmares. And the fundamental problem remains: the model can satisfy every constraint while getting the answer wrong.
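To give a flavor of what those cross-field rules look like - and why they don't help - here is a hypothetical conditional constraint expressed with jsonschema's if/then support. The incorrect output from the opening example satisfies it completely:

```python
# Sketch of the arms race: a cross-field if/then rule layered onto the schema.
# The wrong answer from the opening example still satisfies every constraint.
from jsonschema import validate

CONDITIONAL_RULE = {
    "type": "object",
    "if": {"properties": {"recommendation": {"const": "approve_with_conditions"}}},
    "then": {
        "properties": {
            "conditions": {"type": "array", "minItems": 1},
            "confidence": {"minimum": 0.8},
        },
        "required": ["conditions"],
    },
}

wrong_but_compliant = {
    "recommendation": "approve_with_conditions",   # should have been "decline"
    "conditions": ["quarterly_review", "collateral_required"],
    "confidence": 0.94,
}

validate(instance=wrong_but_compliant, schema=CONDITIONAL_RULE)  # passes cleanly
```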
Schema complexity grows linearly. The semantic space you’re trying to constrain grows combinatorially. You can’t win this arms race.
The solution isn’t a better schema. It’s a different kind of verification entirely.
What Semantic Reliability Actually Requires
If schema validation is necessary but insufficient, what else do you need? Four capabilities that operate above the schema layer:
Reasoning Capture
You need to know why the model produced each value. Not just what it output, but the chain of reasoning that led there. When a model assigns risk_score: 72, you need the reasoning chain: which data points it examined, how it weighted them, what it considered and rejected.
This is what the AgentOps Flight Recorder provides - chain-of-thought persistence for every decision. When a schema-compliant output is wrong, the reasoning chain shows you where the reasoning went wrong.
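Stripped to its essence, reasoning capture just means persisting a reasoning record alongside every structured output. A minimal sketch follows; the record shape is illustrative, not the Flight Recorder's actual format:

```python
# Sketch: persist a reasoning record next to each structured output so a
# schema-valid answer can be audited later. Record shape is illustrative only.
import json, time, uuid

def record_decision(output: dict, evidence_examined: list[str],
                    reasoning_steps: list[str], path: str = "decisions.jsonl") -> str:
    record = {
        "decision_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "output": output,                        # the schema-valid result
        "evidence_examined": evidence_examined,  # which data points were actually read
        "reasoning_steps": reasoning_steps,      # the chain that led to the values
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["decision_id"]
```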
Policy Enforcement
Business rules, regulatory requirements, and domain logic can’t be expressed in JSON Schema. “Risk scores for applicants with recent defaults must not exceed 45” is a semantic constraint that no schema language can enforce.
This requires a policy engine that operates on the meaning of outputs, not their format. AgentOps implements this through a three-layer OPA-based policy engine - gateway, sidecar, and inline enforcement - that evaluates outputs against business rules in real time.
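As a sketch of what such a rule looks like when it operates on meaning rather than format - written here in plain Python rather than Rego, with assumed field names - note that it has to read the applicant's actual record, which no schema over the output alone can see:

```python
# Sketch of a semantic policy check. The rule consults the applicant's actual
# record (external ground truth), which output-schema validation never touches.
def check_default_policy(output: dict, applicant_record: dict) -> list[str]:
    violations = []
    if applicant_record.get("recent_defaults", 0) > 0 and output["risk_score"] > 45:
        violations.append("risk_score exceeds 45 for an applicant with recent defaults")
    return violations

# The opening example: schema-valid, but the policy rejects it.
output = {"risk_score": 72, "risk_category": "moderate",
          "recommendation": "approve_with_conditions"}
applicant_record = {"recent_defaults": 3}
print(check_default_policy(output, applicant_record))
```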
Runtime Monitoring
Semantic drift doesn’t announce itself. You need continuous monitoring that establishes behavioral baselines and detects when output distributions shift - even when every individual output is schema-valid.
Guardian provides this: 96% detection accuracy for behavioral anomalies, including the subtle distribution shifts that schema validation misses entirely.
Pre-Deployment Evaluation
Before any model touches production, it should be evaluated against semantic test cases, not just schema validation tests. Does the model produce correct risk scores for known scenarios? Does it handle edge cases appropriately? Does it fail gracefully when data is missing?
Eval provides systematic, reproducible evaluation at scale - the kind of semantic testing that catches failures before they reach production.
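A minimal sketch of what a semantic evaluation suite looks like, independent of any particular tooling; the scenarios, expected values, and the assess callable are assumptions for illustration:

```python
# Sketch of a semantic regression suite: known scenarios with known-correct
# answers, run on every model or prompt change. Cases and thresholds are
# illustrative assumptions.
GOLDEN_CASES = [
    {"scenario": "three_recent_defaults", "expect_category": "high",
     "expect_recommendation": "decline", "max_score": 45},
    {"scenario": "clean_history_strong_cashflow", "expect_category": "low",
     "expect_recommendation": "approve"},
]

def run_semantic_suite(assess):                 # assess(scenario) -> schema-valid dict
    failures = []
    for case in GOLDEN_CASES:
        out = assess(case["scenario"])
        ok = (out["risk_category"] == case["expect_category"]
              and out["recommendation"] == case["expect_recommendation"]
              and out["risk_score"] <= case.get("max_score", 100))
        if not ok:
            failures.append((case["scenario"], out))
    return failures                             # empty list == semantically passing
```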
```mermaid
graph TB
    subgraph MOST["What Most Teams Have"]
        A1[JSON Schema Validation] --> A2[Type Checking]
        A2 --> A3[Enum Validation]
    end
    subgraph PROD["What Production Requires"]
        B1[Schema Validation] --> B2[Reasoning Capture]
        B2 --> B3[Policy Enforcement]
        B3 --> B4[Runtime Monitoring]
        B4 --> B5[Semantic Evaluation]
    end
    style MOST fill:#1a1a2e,stroke:#ffc107,color:#fff
    style PROD fill:#1a1a2e,stroke:#20c997,color:#fff
```
The Trust Cascade: Right-Sizing Verification
Not every output needs the same level of semantic verification. The cost of verification should match the risk of the decision.
| Verification Layer | What It Catches | Cost per Check | Apply To |
|---|---|---|---|
| Schema validation | Format errors, type mismatches | ~$0 | 100% of outputs |
| Deterministic rules | Known policy violations, range checks | ~$0 | 100% of outputs |
| Statistical checks | Distribution drift, calibration decay | $0.001 | Sampled (10-20%) |
| Single-agent verification | Reasoning errors, factual inconsistencies | $0.01 | Medium-risk outputs (~15%) |
| Multi-agent tribunal | Adversarial probing, edge cases | $0.03-0.05 | High-risk outputs (~3%) |
This is the Trust Cascade applied to output verification. Low-cost checks catch the majority of issues. Expensive checks are reserved for high-stakes decisions.
The cascade matters because semantic verification isn’t free. Full reasoning verification for every output would be prohibitively expensive. The cascade makes it economically viable - the same principle we apply to AI decision routing applied to verification.
A financial services firm running 500,000 risk assessments per month can’t afford multi-agent verification on every one. But they can’t afford no semantic verification either. The cascade gives them both coverage and economics.
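Here is a sketch of how that routing might look in code, using the tiers from the table above; the sampling rate and thresholds are purely illustrative:

```python
# Sketch of cascade routing: every output gets the cheap checks; only
# risk-bearing outputs escalate to the expensive tiers. Thresholds and the
# sampling rate are assumptions to tune, not fixed product behavior.
import random

def verification_plan(output: dict) -> list[str]:
    plan = ["schema_validation", "deterministic_rules"]   # 100% of outputs
    if random.random() < 0.15:
        plan.append("statistical_checks")                 # sampled 10-20%
    approving = output["recommendation"] in {"approve", "approve_with_conditions"}
    if approving or output["confidence"] < 0.7:
        plan.append("single_agent_verification")          # medium-risk outputs
    if approving and output["risk_score"] >= 60:
        plan.append("multi_agent_tribunal")               # highest-stakes slice
    return plan
```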
Where to Start
If you’re relying on structured output as your reliability strategy, here’s how to close the gap:
- Audit your current failures. Pull 1,000 recent schema-valid outputs and manually evaluate semantic accuracy. Most teams are shocked by what they find. This baseline tells you where you actually are.
- Implement reasoning capture. Before you can verify reasoning, you need to capture it. Add chain-of-thought persistence so every output has an auditable reasoning trail.
- Build semantic evaluation suites. Create test cases with known-correct answers. Run them continuously, not just at deployment. When a model update changes your semantic accuracy, you want to know immediately.
- Deploy policy enforcement. Translate your business rules into enforceable policies that operate on output semantics, not just output format. Start with your highest-risk outputs.
- Establish drift monitoring. Track output distributions over time. When the distribution of risk categories shifts, or confidence scores cluster differently, you want alerts - not surprises.
Structured output solved the format problem. It didn’t solve the reliability problem. The format problem was the easy one.
If your AI governance strategy starts and ends with schema validation, you’re checking that the answer is well-formatted while ignoring whether it’s correct. In regulated industries, that gap is where compliance incidents, customer harm, and institutional risk live.
See how AgentOps closes the gap between schema compliance and semantic reliability →