
Trust and Verify: Testing Agentic Output

The agent built the dashboard overnight. The numbers look clean, the layout is polished, and it ran without errors. Your stakeholder is in the room in 20 minutes. You open the Looker report, spot-check the top-line revenue figure — looks about right — and hit share.

Two hours later, someone finds the error: the agent had assumed order_status = 'completed' was the right filter for revenue recognition. Your team has a slightly different definition — completed plus delivered, minus a specific return code. The agent's numbers were 70% lower than the actual figure. No hallucination (when the model fabricates plausible but incorrect data, such as fake API responses or invented row values), no syntax error, no API failure. Just a reasonable assumption that happened to be wrong for this specific business context.

The agent wrote correct code and made a wrong assumption. That combination is harder to catch than a syntax error — because the output looks right until someone with domain knowledge checks it.

This is the verification problem. And it's harder than it sounds.

Agentic output is output produced by an LLM-powered agent acting on your behalf — the queries, code, dashboards, and pipeline artifacts it generates for you.

By the end of this module, you will be able to:

  • Apply the three-check framework to verify agentic pipeline output
  • Identify the five most common failure modes in agent-generated pipelines
  • Identify the characteristics of effective behavioral tests for non-deterministic agents
  • Describe the trust gradient concept and where to apply it

The three-check verification framework

Agent output requires a different verification approach than traditional code review. The logic can be subtly wrong in ways that aren't visible at the code level — because the error is often in an assumption, not a syntax error. The framework that holds up in practice:

Check 1 — Understand the output. Before looking at a single line of code, ask: can you explain every number in this output from first principles? If a revenue figure came back at $2.3M, do you know what that means? What dates? What filters? What business definition of "revenue"? If you can't explain every number without looking at the code, you haven't verified anything — you've just looked.
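Check 1 reduces to a single aggregate query you can run before reading any pipeline code. A minimal sketch, using sqlite3 with a hypothetical orders table (the table name, columns, and values are illustrative, not from any real schema):

```python
import sqlite3

# Hypothetical orders table; names and values are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_ts TEXT, order_status TEXT, revenue REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("2024-01-01", "completed", 100.0),
    ("2024-01-02", "delivered", 250.0),
    ("2024-01-02", "cancelled", 999.0),  # should not count toward revenue
])

# One query answers the Check 1 questions: what dates, what filter,
# and what the headline number is under YOUR definition of revenue.
row = con.execute("""
    SELECT MIN(order_ts), MAX(order_ts), COUNT(*), SUM(revenue)
    FROM orders
    WHERE order_status IN ('completed', 'delivered')
""").fetchone()
print(row)  # ('2024-01-01', '2024-01-02', 2, 350.0)
```

If you can't predict roughly what this query will return before you run it, you haven't understood the output yet.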

Check 2 — Inspect the code for logic, not syntax. Don't read the code like a linter. Read it like a domain expert: does the logic match what you know to be true about this data? The agent's code will probably be syntactically correct — that's the easy part. The failure modes are semantic: wrong filter, wrong join key, wrong grain of aggregation, wrong assumption about what a column means. Those require domain knowledge to catch, not code review skills.

Check 3 — Check the actual data. Run queries against the underlying data to verify the agent's claims. Look at actual rows, not summaries. Inspect the join: are records being dropped unexpectedly? Check cardinality (the number of distinct values per join key): is a one-to-many join producing a fan-out (where a join multiplies rows unexpectedly when keys are not unique)? Check nulls: is a null-handling assumption silently dropping rows that should be included?
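The row-count, cardinality, and null checks above can be sketched in a few queries. This example plants a duplicate join key deliberately so the fan-out is visible (tables and values are hypothetical):

```python
import sqlite3

# Hypothetical orders/accounts tables; the duplicate account row is
# planted deliberately to show what a fan-out looks like.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, account_id INTEGER);
    CREATE TABLE accounts (account_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 10), (3, NULL);
    INSERT INTO accounts VALUES (10, 'EU'), (10, 'US');
""")

before = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
after = con.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN accounts a ON o.account_id = a.account_id
""").fetchone()[0]
null_keys = con.execute(
    "SELECT COUNT(*) FROM orders WHERE account_id IS NULL"
).fetchone()[0]

# after > before means the join fanned out; null_keys > 0 means an inner
# join is silently dropping rows that may need to be flagged instead.
print(before, after, null_keys)  # 3 4 1
```

Both failure modes are invisible in the summary numbers — only the before/after comparison surfaces them.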

Verification is not optional — it's the skill that makes agentic systems trustworthy. The agent's output is a first draft from a capable but fallible system. Your job is to review it like a senior engineer reviewing a junior engineer's PR: assume it's probably mostly right, but look specifically for the places where an assumption could be wrong.

Why this matters more for agents

The reason verification is harder with agentic systems isn't that agents are less capable than humans at writing code. It's that they're confidently wrong in ways that are hard to detect.

Researchers stress-test agents by making small changes to task wording that don't change the intended meaning. ReliabilityBench found that minor semantic rewording alone reduced single-run success from 96.9% to 88.1% — and that's without the compounding effect of multi-step pipelines. When an agent builds a multi-step pipeline, each step compounds the error surface.

Hallucination rates vary dramatically by domain: in well-structured technical domains with verifiable answers, rates can be substantially lower; in open-ended reasoning about business logic or domain conventions, HALoGEN found rates can reach 86%. Data engineering sits in between — the SQL syntax is verifiable, but the business logic assumptions are not.

| Failure mode | What it looks like | How to catch it |
|---|---|---|
| Wrong filter assumption | Agent uses `status = 'active'` when your definition is `status IN ('active', 'trial')` | Check 1 — verify the business definition of every filter against documented rules |
| Join fan-out | A many-to-many join inflates counts silently | Check 3 — verify cardinality of join keys before and after |
| Null handling mismatch | Agent drops nulls; your standard is to flag them | Check 2 — inspect null handling explicitly against your conventions file |
| Grain mismatch | Agent aggregates at the wrong level of detail | Check 3 — sample output rows and verify grain matches expectation |
| Confident confabulation | Agent states a business rule that doesn't exist in your context stack (docs, prompts, and retrieved knowledge the agent can see) | Check 1 — trace every business logic assumption to a documented source |
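The wrong-filter row is the opening story in miniature. A minimal sketch (table, statuses, and amounts invented for illustration) shows how both filters run cleanly while only one matches the business rule:

```python
import sqlite3

# Table, status values, and amounts are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_status TEXT, return_code TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("completed", None, 300.0),
    ("delivered", None, 600.0),
    ("delivered", "RET-01", 100.0),  # returned: excluded by the team's rule
    ("cancelled", None, 500.0),
])

# The agent's plausible-but-wrong filter vs. the documented definition.
agent_revenue = con.execute(
    "SELECT SUM(amount) FROM orders WHERE order_status = 'completed'"
).fetchone()[0]
team_revenue = con.execute("""
    SELECT SUM(amount) FROM orders
    WHERE order_status IN ('completed', 'delivered') AND return_code IS NULL
""").fetchone()[0]
print(agent_revenue, team_revenue)  # 300.0 900.0 -- no error, just wrong
```

Neither query throws an error or produces an implausible number in isolation; only comparing against the documented definition reveals the gap.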

The most important insight from the hallucination research: agents are systematically overconfident. They don't know what they don't know — they fill gaps with plausible-sounding answers. This is a structural property of how these systems work, not a bug that gets fixed in the next model release.

Behavioral testing

Code review catches structural errors. Behavioral testing catches failure-mode errors — what happens when the agent encounters conditions it wasn't designed for.

The principle: test the agent's response to specific scenarios, not just its default-path output.

## Behavioral test suite: orders_daily pipeline agent

### Test 1: Schema change handling
Setup: Remove a non-nullable field from the upstream schema
Expected: Agent detects schema diff, assesses downstream impact,
opens PR — does NOT auto-deploy
Pass criteria: PR opened with impact assessment; no production deploy;
human notified within step budget (step budget = the maximum
number of LLM calls the agent is allowed before escalating
to a human)

### Test 2: API timeout
Setup: Simulate upstream API timeout at retry 3
Expected: Agent logs failure, escalates to on-call with context —
does NOT retry indefinitely
Pass criteria: Escalation fired within step budget; retry count logged

### Test 3: Malformed data
Setup: Inject records with null values in a required join key
Expected: Agent flags records, applies null-handling convention
(flag, don't drop), continues pipeline
Pass criteria: _data_quality_flag column populated; row count in summary

### Test 4: Volume anomaly
Setup: Upstream produces 30% fewer rows than 7-day rolling average
Expected: Agent flags anomaly, pauses before publish, escalates for review
Pass criteria: Pipeline pauses; escalation includes volume comparison and context
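Test 4's trigger condition can be sketched in a few lines of plain Python. The row counts here are made up; the 30% threshold and 7-day window mirror the test setup:

```python
# Counts are made up; the 30% threshold and 7-day window mirror Test 4.
daily_counts = [1000, 980, 1010, 995, 1005, 990, 1020]  # last 7 days
today = 650

rolling_avg = sum(daily_counts) / len(daily_counts)
drop = (rolling_avg - today) / rolling_avg

ANOMALY_THRESHOLD = 0.30
anomaly = drop >= ANOMALY_THRESHOLD
if anomaly:
    # Expected agent behavior: pause before publish, escalate with context.
    print(f"ANOMALY: today={today}, 7-day avg={rolling_avg:.0f}, drop={drop:.0%}")
```

The behavioral test then asserts on what the agent does once this fires — pause and escalate — not merely that the calculation is correct.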

Characteristics of effective behavioral tests for agentic systems:

  • Scenario-based: each test describes a specific real-world condition the agent may encounter (schema change, API timeout, malformed data, volume anomaly)
  • Explicit setup / expected behavior / pass criteria: every test case states what state to create, what the agent should do, and what observable outcome constitutes a pass
  • Statistical thresholds for non-determinism: pass criteria specify aggregate rates ("passes 19/20 runs"), not single-run binary results
  • Escalation and step-budget boundaries: tests verify that the agent escalates to a human within a defined step limit — not just that it eventually produces output

The reason behavioral testing matters specifically for agents: agents are non-deterministic. The same input can produce different outputs across runs. Testing frameworks designed for deterministic code don't handle this well — a test that passes once doesn't tell you it will pass consistently.

Statistical testing frameworks that account for non-determinism — running multiple test iterations and measuring aggregate pass rates — are more appropriate for production agentic systems, as AgentAssay demonstrates. Your CI/CD pipeline for agent-generated code should include behavioral tests, and your pass criteria should be statistical ("passes 19/20 runs") rather than binary ("passed once").
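A statistical pass criterion is simple to express once you collect one boolean per run. The results list below is a stand-in for 20 real agent runs against the same scenario, which an actual harness would produce by invoking the agent repeatedly:

```python
# The results list stands in for 20 real agent runs of the same scenario;
# a real harness would collect one boolean per invocation of the agent.
results = [True] * 19 + [False]  # e.g. one flaky failure across 20 runs

RUNS = len(results)
REQUIRED_PASSES = 19  # "passes 19/20 runs", not "passed once"
passes = sum(results)
verdict = "PASS" if passes >= REQUIRED_PASSES else "FAIL"
print(f"{passes}/{RUNS} -> {verdict}")  # 19/20 -> PASS
```

The key design choice is that the CI gate consumes the aggregate rate, so a single lucky run can never mark a flaky behavior as reliable.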

For agentic systems, a single passing run is not evidence of reliable behavior. Measure aggregate pass rates across multiple runs — that is the production standard.

Building confidence incrementally

The right mental model for agentic trust is a gradient, not a switch. You don't decide to "trust the agent" one day and hand off full autonomy. You observe behavior over time, measure outcomes, and progressively expand scope as trust accumulates.

The specific trust levels, advancement criteria, and rollback procedures are defined in Module 4: Governance — the architecture where this framework belongs. For verification purposes, the operating principle is: define your criteria before you're under pressure to expand autonomy.

| Trust tier | Observable signal | Expansion criterion |
|---|---|---|
| Supervised | All output reviewed by a human before use | ≥30 consecutive verified runs with zero critical errors |
| Spot-checked | Random sampling of ~20% of outputs | Defect rate below threshold across a 90-day rolling window |
| Autonomous | Alerts only on anomaly or escalation trigger | Defined SLA (service level agreement — team targets for latency, error rate, and freshness) met; rollback procedure tested (the procedure to revert to the last known-good run has been exercised, not just documented) and confirmed |
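The first advancement rule — Supervised to Spot-checked — can be sketched as a check over a run log. The log format and helper name here are hypothetical; only the threshold comes from the table:

```python
# Run-log format and helper name are hypothetical; the 30-run threshold
# mirrors the Supervised -> Spot-checked criterion above.
def eligible_for_spot_checking(runs: list[dict], required: int = 30) -> bool:
    """True if the last `required` runs were all verified, zero critical errors."""
    if len(runs) < required:
        return False
    return all(r["verified"] and r["critical_errors"] == 0 for r in runs[-required:])

run_log = [{"verified": True, "critical_errors": 0} for _ in range(31)]
print(eligible_for_spot_checking(run_log))       # True
print(eligible_for_spot_checking(run_log[:10]))  # False: not enough history
```

Encoding the criterion as an explicit check is the point: advancement becomes a measured outcome, not a judgment call made under deadline pressure.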

Databricks has published a production methodology (coSTAR) that reduced their agent verification cycle from weeks to hours — through structured evaluation frameworks, automated behavioral testing, and incremental trust expansion. The underlying principle — measure, don't assume — applies to any stack.

Exercise: Verify Your Output

Time: 20 minutes

If you have a pipeline from the ADE 101 lab, run all three verification checks on it using the Expeditions verification dialogue below. If you don't have a warehouse or lab project, you can still finish this exercise: walk through the orders_daily worked example to see each check applied end to end, then complete the empty template using your own claims and notes.

Expeditions pipeline verification dialogue (from the ADE 101 lab):

If you completed the ADE 101 lab and built the monthly revenue-by-channel flow, use these prompts to apply all three checks:

Check 1 — understand the output:

Give me a quick summary of what the flow produces. How many distinct
channels are in the output? What's the most recent month in the data,
and which channel had the highest total revenue that month?
Does anything about the distribution look off?

Read the answer and verify it against your expeditions-code-standards.md rule. Does the output apply the revenue filter correctly — cancelled and refunded bookings excluded before any aggregation?

Check 2 — inspect the code:

Show me the SQL for the total revenue calculation. Walk me through
how revenue is computed — what filter is applied to booking status,
and does it match the standard in expeditions-code-standards.md?

Read the SQL as a domain expert, not a linter. Does the filter match your business definition of revenue — bookings that are completed and not cancelled or refunded? Is the grain correct (one row per channel per month)?

Check 3 — verify the data:

Show me the top 3 channels by total revenue for the most recent
full month in the output. Walk me through the revenue calculation
for the top channel — what are the underlying booking records
and how was the total derived?

If any channel shows an unexpectedly low or zero revenue figure, drill deeper:

The revenue for the "direct" channel in January looks low compared
to other months. Show me the underlying booking records for that
channel and month and walk me through how the total was calculated
row by row.

What to notice: If the three checks are working, you'll be able to explain any channel's revenue figure from the underlying booking records — or find the error. The most common: the revenue filter in expeditions-code-standards.md excludes cancelled and refunded bookings before any aggregation, but if the agent missed that rule, revenue will be overcounted. A channel total that can't be traced back to specific qualifying bookings is a semantic error — the exact failure mode this framework is designed to surface before it reaches a stakeholder.

This is the verification sequence: start with aggregate sanity (Check 1), inspect the logic (Check 2), then drill into specific records when something looks off (Check 3). If the reasoning is wrong, you've caught a semantic error before it reaches a stakeholder. If the reasoning is right, you've built intuition about the data.

Worked example — orders_daily pipeline verification:

### Check 1: Understand the output
Claim: "Daily revenue by product category, last 30 days"

Questions to answer before looking at the code:
- Business definition of "revenue"?
→ Verified: order_status IN ('completed', 'delivered') AND return_code IS NULL
- Date range: calendar days or rolling? UTC or local?
→ Verified: calendar day, UTC
- Grain: one row per category per day
→ Expected: ~7 categories × 30 days = ~210 rows

Actual output: 210 rows ✓ | Revenue definition: matches ✓ | Dates: UTC ✓

### Check 2: Inspect the code
Key review points:
- Filter: WHERE order_status IN ('completed', 'delivered') AND return_code IS NULL ✓
- Join key: orders.account_id = accounts.account_id (many-to-one) ✓
- Null handling: null in category mapped to 'Uncategorized' ✓
- Grain: GROUP BY DATE_TRUNC('day', order_ts), product_category ✓

Issue found: Agent assumed shipping_cost is never null; schema allows null.
Fix: Add COALESCE(shipping_cost, 0) — does not affect revenue total
(shipping is not included in the revenue calculation)

### Check 3: Verify the data
Row count before join: 48,231
Row count after join: 48,231 (no fan-out ✓)
Null count in revenue column: 0 (expected) ✓
Revenue range: $12.4M–$18.9M/day (within 7-day historical range ✓)
Sample 10 rows: all grain correct ✓

Result: Pipeline verified. One code fix applied (null coalesce). Ready for PR.

Optional: After completing the copy-paste prompts above, run your own verification using this template (template-only path; no live pipeline required):

### Check 1: Understand the output
Claim:

Questions:
- Business definition of key metric:
- Date range / timezone logic:
- Expected grain and approximate row count:

Verified: Yes / No | Issues found:

### Check 2: Inspect the code
Key review points:
- Filters:
- Join keys and cardinality:
- Null handling vs. team convention:
- Aggregation grain:

Issues found:

### Check 3: Verify the data
Row count before / after joins:
Null counts in key columns:
Value ranges vs. historical:
Sample row inspection:

Result:

Key takeaways
  • Verification has three distinct checks. Understanding the output (business logic), inspecting the code (logic correctness), and checking the data (actual rows, joins, nulls). Most people skip to the middle and miss the first and third — which is where the most important errors live.
  • Agents are systematically overconfident. ReliabilityBench found that minor semantic rewording alone reduces single-run success from 96.9% to 88.1% — multi-step pipelines compound this further. Hallucination rates can reach 86% in open-ended reasoning domains. Verify every significant output before it reaches stakeholders.
  • Trust is a gradient, not a switch. Define criteria for advancing trust levels before you're under pressure to do so. Measured task outcomes over time — not intuition or absence of recent failures — are the appropriate basis for expanding autonomy.

Verification tells you what happened after the agent acted. Governance determines what the agent is allowed to do before it acts — and creates the accountability structures that let you expand autonomy safely.

Next: Governance, Guardrails, and Security →
