A financial services client showed us something instructive last month.
Their credit risk agent returned this for a small business loan application:
```json
{
  "applicant_id": "SMB-2024-7891",
  "risk_score": 72,
  "risk_category": "moderate",
  "recommendation": "approve_with_conditions",
  "conditions": ["quarterly_review", "collateral_required"],
  "confidence": 0.94
}
```
Every field present. Every type correct. Every enum value valid. The JSON was impeccable.
The answer was wrong.
The applicant had three recent defaults that the agent never examined. The risk score should have been in the low 30s. The recommendation should have been decline. But the output passed every schema validation check their team had built.
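To make the failure mode concrete, here is a minimal sketch of the kind of check a structured output pipeline performs, using the jsonschema library. Field names and visible values come from the example above; the enum members and ranges beyond those shown are assumptions. The incorrect output passes without complaint:

```python
# Minimal sketch: strict schema validation accepts a semantically wrong output.
# Assumes the `jsonschema` package; enum values beyond the example are guesses.
from jsonschema import validate

RISK_SCHEMA = {
    "type": "object",
    "properties": {
        "applicant_id": {"type": "string"},
        "risk_score": {"type": "integer", "minimum": 0, "maximum": 100},
        "risk_category": {"enum": ["low", "moderate", "high"]},
        "recommendation": {"enum": ["approve", "approve_with_conditions", "decline"]},
        "conditions": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["applicant_id", "risk_score", "risk_category",
                 "recommendation", "confidence"],
    "additionalProperties": False,
}

output = {
    "applicant_id": "SMB-2024-7891",
    "risk_score": 72,                              # should have been in the low 30s
    "risk_category": "moderate",                   # wrong, but a valid enum member
    "recommendation": "approve_with_conditions",   # should have been "decline"
    "conditions": ["quarterly_review", "collateral_required"],
    "confidence": 0.94,
}

validate(instance=output, schema=RISK_SCHEMA)      # raises nothing: fully compliant
print("Schema check passed; semantic correctness unknown.")
```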
They’d spent six months perfecting their structured output pipeline. JSON mode, function calling with strict schemas, response validation middleware. They had 100% schema compliance. They assumed that meant their system was reliable.
It wasn’t.
The Structured Output Hype Cycle
Structured output has been one of the most celebrated improvements in LLM tooling. And for good reason - the progression solved real problems:
2023: Raw text parsing. Regex extraction. Pray the model follows your format. Failure rate: 15-30%.
2024 Q1: JSON mode. Models reliably produce valid JSON. Format failures drop to near zero.
2024 Q2: Function calling with schemas. Models fill in defined parameters. Type safety improves.
2024 Q4: Constrained decoding. Token-level enforcement guarantees schema compliance. Format reliability hits 99.9%+.
2025: Teams declare the structured output problem “solved” and move on.
Each step was a genuine improvement. Each step solved a format problem. None of them solved the semantic problem.
```mermaid
graph LR
    A[Raw Text] -->|"JSON mode"| B[Valid JSON]
    B -->|"Function calling"| C[Schema Compliant]
    C -->|"Constrained decoding"| D[Type Safe]
    D -.->|"???"| E[Semantically Correct]
    style A fill:#dc3545,color:#fff
    style B fill:#fd7e14,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#20c997,color:#fff
    style E fill:#6c757d,color:#fff,stroke-dasharray: 5 5
```
The industry spent two years climbing from raw text to type safety. That’s real progress. But the gap between type safety and semantic correctness is where production systems fail - and no amount of schema engineering will close it.
What Structured Output Actually Guarantees
Let’s be precise about what you get and what you don’t.
| Layer | What It Means | Structured Output Guarantees It? |
|---|---|---|
| Syntactic validity | Output is parseable JSON/XML | Yes |
| Schema compliance | Fields, types, enums match spec | Yes |
| Referential integrity | IDs reference real entities | No |
| Semantic accuracy | Values reflect actual facts | No |
| Logical consistency | Fields don’t contradict each other | No |
| Temporal validity | Information is current | No |
| Policy compliance | Output follows business rules | No |
| Reasoning soundness | Conclusion follows from evidence | No |
Structured output gives you the top two rows. Production reliability requires all eight.
A risk score of 72 is schema-compliant. Whether it’s correct depends on whether the model actually examined the applicant’s credit history, weighed the defaults appropriately, applied the right scoring methodology, and didn’t hallucinate favorable data points. None of that is captured by schema validation.
Three Ways Structured Output Fails You
1. Confident Hallucination in Required Fields
When a schema requires a field, the model must produce a value. If it doesn’t have enough information to produce the right value, it will produce a plausible wrong one.
Required fields don’t tolerate uncertainty. A risk_score field doesn’t accept “I’m not sure.” It accepts a number. So the model gives you a number - and confidence is no indicator of accuracy.
This is worse than a model that refuses to answer. At least refusal is honest. A hallucinated value in a required field is a confident lie wrapped in valid syntax.
In regulated industries, this isn’t a technical curiosity. It’s a compliance incident. A fabricated risk score that triggers an automated lending decision is exactly the kind of failure that draws regulatory scrutiny.
2. Schema-Shaped Drift
Model providers update their models regularly. Each update can subtly shift how the model interprets your schema - not breaking it, but changing the semantics.
We’ve seen this pattern repeatedly: a model update causes risk_category: "moderate" to be assigned to cases that were previously classified as "high". The enum values haven’t changed. The distribution of values has. Schema validation sees nothing wrong.
This is semantic drift wearing a syntactically valid disguise. Your monitoring checks that the output is valid JSON with the right fields. It doesn’t check that “moderate” still means what it meant last month.
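Catching this requires looking at outputs in aggregate rather than one at a time. A minimal sketch of that kind of check, with an assumed baseline distribution and an assumed alert threshold:

```python
# Sketch: detect distribution shift in a schema-valid enum field.
# Every individual output can pass validation while the mix of values drifts.
from collections import Counter

def category_distribution(outputs, field="risk_category"):
    counts = Counter(o[field] for o in outputs)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Baseline frozen from a reviewed sample; the threshold is a tuning assumption.
BASELINE = {"low": 0.30, "moderate": 0.45, "high": 0.25}
DRIFT_THRESHOLD = 0.10

def drifted(recent_outputs):
    current = category_distribution(recent_outputs)
    return total_variation_distance(BASELINE, current) > DRIFT_THRESHOLD

# Toy check: a batch that skews heavily toward "moderate" trips the alarm,
# even though every output in it is schema-valid.
batch = [{"risk_category": "moderate"}] * 8 + [{"risk_category": "low"}] * 2
print(drifted(batch))  # True -> investigate
```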
3. Adversarial Schema Compliance
Prompt injection attacks don’t need to break your schema. They just need to influence the values within it.
An attacker who understands your schema can craft inputs that steer the model toward specific schema-compliant outputs. The output passes every validation check. The values serve the attacker’s intent, not yours.
This is particularly dangerous in financial services, insurance claims, and any domain where schema-compliant output triggers automated downstream actions. An approve/deny decision is binary and schema-valid either way. The question is whether the right one was selected - and schema validation can’t tell you.
Why Better Schemas Won’t Save You
The instinct when confronted with semantic failures is to add more schema constraints. More enums. Tighter ranges. Conditional required fields. Co-occurrence rules.
This is a natural but misguided response.
Adding more schema constraints for semantic reliability is like adding spell-check rules to catch factual errors. You can make the spell-checker arbitrarily sophisticated - it will still never tell you that a correctly spelled sentence is factually wrong.
We’ve seen teams build schemas with 200+ constraints, conditional logic, cross-field validation rules, and custom validators. The schemas become maintenance nightmares. And the fundamental problem remains: the model can satisfy every constraint while getting the answer wrong.
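To give a flavor of what those cross-field rules look like - and why they don't help - here is a hypothetical conditional constraint expressed with jsonschema's if/then support. The incorrect output from the opening example satisfies it completely:

```python
# Sketch of the arms race: a cross-field if/then rule layered onto the schema.
# The wrong answer from the opening example still satisfies every constraint.
from jsonschema import validate

CONDITIONAL_RULE = {
    "type": "object",
    "if": {"properties": {"recommendation": {"const": "approve_with_conditions"}}},
    "then": {
        "properties": {
            "conditions": {"type": "array", "minItems": 1},
            "confidence": {"minimum": 0.8},
        },
        "required": ["conditions"],
    },
}

wrong_but_compliant = {
    "recommendation": "approve_with_conditions",   # should have been "decline"
    "conditions": ["quarterly_review", "collateral_required"],
    "confidence": 0.94,
}

validate(instance=wrong_but_compliant, schema=CONDITIONAL_RULE)  # passes cleanly
```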
Schema complexity grows linearly. The semantic space you’re trying to constrain grows combinatorially. You can’t win this arms race.
The solution isn’t a better schema. It’s a different kind of verification entirely.
What Semantic Reliability Actually Requires
If schema validation is necessary but insufficient, what else do you need? Four capabilities that operate above the schema layer:
Reasoning Capture
You need to know why the model produced each value. Not just what it output, but the chain of reasoning that led there. When a model assigns risk_score: 72, you need the reasoning chain: which data points it examined, how it weighted them, what it considered and rejected.
This is what the AgentOps Flight Recorder provides - chain-of-thought persistence for every decision. When a schema-compliant output is wrong, the reasoning chain shows you where the reasoning went wrong.
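Stripped to its essence, reasoning capture just means persisting a reasoning record alongside every structured output. A minimal sketch follows; the record shape is illustrative, not the Flight Recorder's actual format:

```python
# Sketch: persist a reasoning record next to each structured output so a
# schema-valid answer can be audited later. Record shape is illustrative only.
import json, time, uuid

def record_decision(output: dict, evidence_examined: list[str],
                    reasoning_steps: list[str], path: str = "decisions.jsonl") -> str:
    record = {
        "decision_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "output": output,                        # the schema-valid result
        "evidence_examined": evidence_examined,  # which data points were actually read
        "reasoning_steps": reasoning_steps,      # the chain that led to the values
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["decision_id"]
```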
Policy Enforcement
Business rules, regulatory requirements, and domain logic can’t be expressed in JSON Schema. “Risk scores for applicants with recent defaults must not exceed 45” is a semantic constraint that no schema language can enforce.
This requires a policy engine that operates on the meaning of outputs, not their format. AgentOps implements this through a three-layer OPA-based policy engine - gateway, sidecar, and inline enforcement - that evaluates outputs against business rules in real time.
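As a sketch of what such a rule looks like when it operates on meaning rather than format - written here in plain Python rather than Rego, with assumed field names - note that it has to read the applicant's actual record, which no schema over the output alone can see:

```python
# Sketch of a semantic policy check. The rule consults the applicant's actual
# record (external ground truth), which output-schema validation never touches.
def check_default_policy(output: dict, applicant_record: dict) -> list[str]:
    violations = []
    if applicant_record.get("recent_defaults", 0) > 0 and output["risk_score"] > 45:
        violations.append("risk_score exceeds 45 for an applicant with recent defaults")
    return violations

# The opening example: schema-valid, but the policy rejects it.
output = {"risk_score": 72, "risk_category": "moderate",
          "recommendation": "approve_with_conditions"}
applicant_record = {"recent_defaults": 3}
print(check_default_policy(output, applicant_record))
```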
Runtime Monitoring
Semantic drift doesn’t announce itself. You need continuous monitoring that establishes behavioral baselines and detects when output distributions shift - even when every individual output is schema-valid.
Guardian provides this: 96% detection accuracy for behavioral anomalies, including the subtle distribution shifts that schema validation misses entirely.
Pre-Deployment Evaluation
Before any model touches production, it should be evaluated against semantic test cases, not just schema validation tests. Does the model produce correct risk scores for known scenarios? Does it handle edge cases appropriately? Does it fail gracefully when data is missing?
Eval provides systematic, reproducible evaluation at scale - the kind of semantic testing that catches failures before they reach production.
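A minimal sketch of what a semantic evaluation suite looks like, independent of any particular tooling; the scenarios, expected values, and the assess callable are assumptions for illustration:

```python
# Sketch of a semantic regression suite: known scenarios with known-correct
# answers, run on every model or prompt change. Cases and thresholds are
# illustrative assumptions.
GOLDEN_CASES = [
    {"scenario": "three_recent_defaults", "expect_category": "high",
     "expect_recommendation": "decline", "max_score": 45},
    {"scenario": "clean_history_strong_cashflow", "expect_category": "low",
     "expect_recommendation": "approve"},
]

def run_semantic_suite(assess):                 # assess(scenario) -> schema-valid dict
    failures = []
    for case in GOLDEN_CASES:
        out = assess(case["scenario"])
        ok = (out["risk_category"] == case["expect_category"]
              and out["recommendation"] == case["expect_recommendation"]
              and out["risk_score"] <= case.get("max_score", 100))
        if not ok:
            failures.append((case["scenario"], out))
    return failures                             # empty list == semantically passing
```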
```mermaid
graph TB
    subgraph MOST["What Most Teams Have"]
        A1[JSON Schema Validation] --> A2[Type Checking]
        A2 --> A3[Enum Validation]
    end
    subgraph PROD["What Production Requires"]
        B1[Schema Validation] --> B2[Reasoning Capture]
        B2 --> B3[Policy Enforcement]
        B3 --> B4[Runtime Monitoring]
        B4 --> B5[Semantic Evaluation]
    end
    style MOST fill:#1a1a2e,stroke:#ffc107,color:#fff
    style PROD fill:#1a1a2e,stroke:#20c997,color:#fff
```
The Trust Cascade: Right-Sizing Verification
Not every output needs the same level of semantic verification. The cost of verification should match the risk of the decision.
| Verification Layer | What It Catches | Cost per Check | Apply To |
|---|---|---|---|
| Schema validation | Format errors, type mismatches | ~$0 | 100% of outputs |
| Deterministic rules | Known policy violations, range checks | ~$0 | 100% of outputs |
| Statistical checks | Distribution drift, calibration decay | $0.001 | Sampled (10-20%) |
| Single-agent verification | Reasoning errors, factual inconsistencies | $0.01 | Medium-risk outputs (~15%) |
| Multi-agent tribunal | Adversarial probing, edge cases | $0.03-0.05 | High-risk outputs (~3%) |
This is the Trust Cascade applied to output verification. Low-cost checks catch the majority of issues. Expensive checks are reserved for high-stakes decisions.
The cascade matters because semantic verification isn’t free. Full reasoning verification for every output would be prohibitively expensive. The cascade makes it economically viable - the same principle we apply to AI decision routing applied to verification.
A financial services firm running 500,000 risk assessments per month can’t afford multi-agent verification on every one. But they can’t afford no semantic verification either. The cascade gives them both coverage and economics.
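Here is a sketch of how that routing might look in code, using the tiers from the table above; the sampling rate and thresholds are purely illustrative:

```python
# Sketch of cascade routing: every output gets the cheap checks; only
# risk-bearing outputs escalate to the expensive tiers. Thresholds and the
# sampling rate are assumptions to tune, not fixed product behavior.
import random

def verification_plan(output: dict) -> list[str]:
    plan = ["schema_validation", "deterministic_rules"]   # 100% of outputs
    if random.random() < 0.15:
        plan.append("statistical_checks")                 # sampled 10-20%
    approving = output["recommendation"] in {"approve", "approve_with_conditions"}
    if approving or output["confidence"] < 0.7:
        plan.append("single_agent_verification")          # medium-risk outputs
    if approving and output["risk_score"] >= 60:
        plan.append("multi_agent_tribunal")               # highest-stakes slice
    return plan
```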
Where to Start
If you’re relying on structured output as your reliability strategy, here’s how to close the gap:
- Audit your current failures. Pull 1,000 recent schema-valid outputs and manually evaluate semantic accuracy. Most teams are shocked by what they find. This baseline tells you where you actually are.
- Implement reasoning capture. Before you can verify reasoning, you need to capture it. Add chain-of-thought persistence so every output has an auditable reasoning trail.
- Build semantic evaluation suites. Create test cases with known-correct answers. Run them continuously, not just at deployment. When a model update changes your semantic accuracy, you want to know immediately.
- Deploy policy enforcement. Translate your business rules into enforceable policies that operate on output semantics, not just output format. Start with your highest-risk outputs.
- Establish drift monitoring. Track output distributions over time. When the distribution of risk categories shifts, or confidence scores cluster differently, you want alerts - not surprises.
Structured output solved the format problem. It didn’t solve the reliability problem. The format problem was the easy one.
If your AI governance strategy starts and ends with schema validation, you’re checking that the answer is well-formatted while ignoring whether it’s correct. In regulated industries, that gap is where compliance incidents, customer harm, and institutional risk live.
See how AgentOps closes the gap between schema compliance and semantic reliability →