Observability: Monitoring Reasoning, Not Just Output
Your agentic system has been running for three months. Pipeline health metrics look great — 99.2% success rate, latency is stable, costs are within budget. Then last Tuesday, a Slack message arrives from a stakeholder: "These recommendations look off. Did something change?" You pull the pipeline logs. The pipelines ran. All the tool calls succeeded. No errors, no alerts. You look at the outputs and they're... subtly worse. Not catastrophically wrong — just less accurate than they were two months ago. The pipeline didn't fail. It just quietly got less good.
Observability for agentic data engineering requires monitoring not just whether pipelines ran, but whether agent reasoning stayed coherent over time — a concern that sits alongside reliability patterns you explored in multi-agent orchestration and production readiness gates.
After this module, you'll be able to:
- Design a four-pillar observability system for a production agentic pipeline
- Analyze agent reasoning traces to identify where decisions degraded
- Build an alerting strategy that catches gradual drift, not just outages
- Debug agentic failures using reasoning traces rather than execution logs alone
This is the observability gap specific to agentic systems: traditional monitoring tells you whether your system ran. It doesn't tell you whether it reasoned correctly, whether its accuracy is trending down, or whether the model it's built on changed in a way that shifted its behavior.
A pipeline that produces wrong outputs on schedule is indistinguishable from a healthy pipeline — until you look at the data.
The four pillars of agentic observability
Traditional pipeline monitoring covers the first pillar. Agentic systems require all four.
| Pillar | What it measures | Key signals |
|---|---|---|
| Pipeline health | Whether the system ran correctly | Success rate, p50 (median) and p99 (99th-percentile) latency, error count, last successful run |
| Reasoning quality | Whether the agent reasoned correctly | Sample accuracy rate (output quality score on a sampled, expert-graded subset), escalation rate, reference task comparison |
| Cost & efficiency | Whether resource use is sustainable | Cost per run, tokens per run, cost trend vs. 30-day average |
| Governance | Whether the agent stayed in scope | Human approval gate hits, out-of-scope actions (target: 0), audit log completeness — an immutable trace record per pipeline run, including inputs, outputs, tool calls, and agent decisions, retained for your required period |
A human approval gate is a checkpoint that requires human sign-off before the agent takes a designated action. An out-of-scope action is any step the agent attempts outside its defined tool permissions or allowed playbook. These signals pair naturally with the control patterns in Governance and Production Readiness.
Pipeline health is table stakes — you already monitor this. The gap is pillars 2 through 4.
Reasoning quality: the hardest pillar to measure
Reasoning quality is what most teams skip, because it's the hardest to instrument. It's also the most important — and the one most likely to surface the subtle degradation that triggered that Slack message.
Chen, Zaharia, and Zou, in their research on LLM performance drift over time, evaluated model outputs against test suites at two points (March and June 2023) and found measurable performance differences between the snapshots, driven by opaque provider-side model updates, even when the user's API parameters and system prompt stayed the same. In live agentic deployments, two additional drift causes compound this: context drift (the real-world data environment drifting out from under the agent's assumptions as schemas change and upstream distributions shift) and prompt staleness (the agent's instructions no longer matching current domain conventions or terminology). You can't assume a system that was accurate three months ago is accurate today.
Three practical approaches to measuring reasoning quality:
- Weekly sampling — randomly sample 5–10% of agent outputs per week and evaluate them against ground truth or expert judgment. This catches gradual decline that aggregate success metrics miss.
- Reference task comparison — maintain a small set of reference inputs with known correct outputs. Run these against the production agent weekly. Degradation on reference tasks before it appears in general outputs is an early warning signal.
- Escalation rate tracking — track how often the agent escalates to human review versus resolves autonomously. An escalation rate that's rising without a corresponding increase in incident complexity signals reasoning degradation.
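The first two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `grade` callable stands in for whatever expert-judgment or ground-truth comparison your team uses, and both function names are ours, not a library API.

```python
import random

def weekly_sample(outputs, rate=0.05, seed=None):
    """Randomly sample a fraction of the week's agent outputs for review.

    `rate` is the sampling fraction (5-10% per the guidance above).
    A fixed `seed` makes a given week's sample reproducible for audits.
    """
    rng = random.Random(seed)
    k = max(1, round(len(outputs) * rate))
    return rng.sample(outputs, k)

def sample_accuracy(sampled, grade):
    """Fraction of sampled outputs that pass grading.

    `grade` is any callable returning True for a correct output; it is
    a placeholder for your expert-review or ground-truth check.
    """
    correct = sum(1 for output in sampled if grade(output))
    return correct / len(sampled)
```

Reference task comparison uses the same `sample_accuracy` helper, just over a fixed set of known-good inputs instead of a random sample, which is what makes week-over-week scores directly comparable.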
Reasoning traces: your primary debugging tool
When an agentic failure happens, the first instinct is to check execution logs. The problem: execution logs tell you what the agent did. They don't tell you why.
Reasoning traces capture the agent's decision process — not just the tool calls it made, but the reasoning steps between them.
Each reasoning step maps to an observable signal. A generic reasoning trace you can log in production has five steps: observation, reasoning, action, validation, and output.
A trace for a schema drift remediation might show:
- Agent reads schema diff
- Agent reasons: "This looks like a field rename — customer_id → cust_id"
- Agent proposes rename fix
- Agent validates fix against staging
- Agent opens PR
If the fix was wrong, the trace shows you step 2 — where the reasoning went off. Without the trace, you'd only know the fix was wrong, not why.
Instrument trace capture before you ship to production. Adding it retroactively after something goes wrong is painful — you're debugging blind until you get instrumentation in place.
- Langfuse — open-source, self-hostable tracing and evaluation for LLM apps; good when you want full control of data residency.
- Arize AI — enterprise-oriented observability and quality monitoring for ML and LLM systems in production.
- OpenTelemetry with emerging LLM semantic conventions — standards-based instrumentation when you already run OTEL and want traces to land in your existing backends.

These are illustrative examples — your team should evaluate tools based on your requirements and vendor agreements.
A minimal trace schema:
```json
{
  "run_id": "ord_daily_20240115_0400",
  "agent": "schema_drift_monitor",
  "input": { "table": "orders_daily", "drift_detected": true },
  "reasoning_steps": [
    { "step": 1, "observation": "customer_id missing from schema", "reasoning": "Field rename detected — customer_id → cust_id in upstream" },
    { "step": 2, "action": "propose_rename_fix", "tool_calls_made": 1 }
  ],
  "output": { "pr_opened": true, "fix_type": "field_rename" },
  "duration_ms": 4200,
  "tokens_used": 1840
}
```
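One way to populate a schema like that is a small in-process recorder. This is a sketch under our own naming assumptions (the class and its methods are illustrative, not a library API); the field names mirror the schema above.

```python
import json
import time

class TraceRecorder:
    """Minimal in-process reasoning-trace recorder (illustrative sketch)."""

    def __init__(self, run_id, agent, input_data):
        self.start = time.monotonic()
        self.trace = {
            "run_id": run_id,
            "agent": agent,
            "input": input_data,
            "reasoning_steps": [],
        }

    def step(self, **fields):
        # Each call records one observation/reasoning/action step,
        # numbered in the order it happened.
        fields["step"] = len(self.trace["reasoning_steps"]) + 1
        self.trace["reasoning_steps"].append(fields)

    def finish(self, output, tokens_used=0):
        # Finalize the trace and serialize it for your log sink.
        self.trace["output"] = output
        self.trace["duration_ms"] = int((time.monotonic() - self.start) * 1000)
        self.trace["tokens_used"] = tokens_used
        return json.dumps(self.trace)
```

In practice you would ship the serialized trace to an immutable store rather than return it, so the governance pillar's audit-log completeness signal has something to count.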
Alerting strategy: trend-based, not threshold-only
Traditional pipeline alerting is threshold-based: error rate exceeds 2%, alert fires. This works for binary failure states. It doesn't work for gradual reasoning degradation.
Agentic systems need two layers of alerting:
| Layer | What it catches | Role |
|---|---|---|
| Threshold alerts | Point-in-time breaches: error rate, latency, cost spike | The standard layer for acute failures and hard limits |
| Trend alerts | Gradual moves: e.g. sample score 94% → 91% → 88% over weeks | Harder to configure; surfaces slow reasoning degradation a single threshold would miss |
A reasoning quality score that drifts down over several weeks often will not cross a fixed threshold until late; trend-style rules are how you catch that pattern at week two instead of after a stakeholder complaint.
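A minimal trend rule for that pattern, assuming you store one sample-accuracy score per week, might look like the following. Strict week-over-week decline is the simplest possible rule; production systems often use smoothing or regression instead.

```python
def declining_for(scores, weeks=3):
    """True if the most recent `weeks` scores are strictly decreasing.

    `scores` is a chronological list of weekly sample-accuracy
    percentages, oldest first. Returns False when there is not yet
    enough history to judge a trend.
    """
    recent = scores[-weeks:]
    if len(recent) < weeks:
        return False
    return all(a > b for a, b in zip(recent, recent[1:]))
```

Against the 94% → 91% → 88% example, this rule fires in week three, while a fixed 85% threshold would still be silent.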
| Alert type | Signal | Urgency | Action |
|---|---|---|---|
| Error rate spike | >2% errors in 30 min | Immediate | Page on-call |
| Cost overrun | >150% of 30-day avg for 3+ runs | Same-day | Investigate tool call volume |
| Reasoning quality drop | Sample score <90% or declining 3+ weeks | Warning | Audit recent outputs, check for provider-side or configuration change |
| Escalation rate increase | >20% above baseline for 1 week | Warning | Review escalated cases for pattern |
| Out-of-scope action | Any action outside defined tool scope | Immediate | Suspend agent pending review |
Organize response expectations with a simple three-tier model. Tier 1 (Informational): log, dashboard, no immediate human action. Tier 2 (Warning): ticket or audit within a business day — e.g. sustained reasoning quality decline or escalation-rate increase. Tier 3 (Critical): page or suspend — e.g. error spikes, cost overruns past policy, or any out-of-scope action. Map each alert row above to the tier that matches its Urgency and Action columns.
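One way to make that mapping explicit and testable is a small routing table. The alert keys and action names here are illustrative, not a standard schema; the tier assignments follow the Urgency column of the table above.

```python
# Illustrative routing of the alert types above onto the three tiers.
ALERT_TIERS = {
    "error_rate_spike":         {"tier": 3, "action": "page_on_call"},
    "out_of_scope_action":      {"tier": 3, "action": "suspend_agent_pending_review"},
    "cost_overrun":             {"tier": 3, "action": "same_day_investigation"},
    "reasoning_quality_drop":   {"tier": 2, "action": "audit_recent_outputs"},
    "escalation_rate_increase": {"tier": 2, "action": "review_escalated_cases"},
    "missing_audit_trace":      {"tier": 1, "action": "log_and_backfill"},
}

def route_alert(alert_type):
    """Return the tier and response action for an alert type.

    Unknown alert types default to Tier 1 so nothing is silently
    dropped; you may prefer to default higher in stricter setups.
    """
    return ALERT_TIERS.get(alert_type, {"tier": 1, "action": "log_only"})
```

Keeping the mapping in one data structure means the on-call runbook and the alerting code can't drift apart without a reviewable diff.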
The key distinction: threshold alerts indicate something broke. Trend alerts indicate something is slowly breaking. Both matter. Only trend alerts catch the subtle failure mode that caused that Slack message.
⏱ 30–45 minutes for the complete exercise; 15–20 minutes for a focused 2-row draft
Design the observability system for your production agent. Paste the prompt below into your assistant of choice and use the reply to refine your monitoring and alerting design.
Optional (Ascend): To stress-test the same logic in your deployment, open Otto there and run the same prompt.
I'm designing an observability system for a production agentic data pipeline. The agent monitors our orders_daily pipeline for schema drift and data quality anomalies. It runs nightly and has been in production for 6 weeks with a 97% success rate.
I need help designing the monitoring and alerting layer. Specifically:
1. What are the three most important metrics I should track beyond basic success/failure rate — and why?
2. Read this short trace: Agent receives query → searches → returns "No results found" even though results exist. At which step did reasoning diverge from execution? What would you instrument to catch this?
3. A pipeline shows rising cost per output but stable latency. Which pillar signals this, and what threshold would trigger investigation?
4. My agent's weekly sample accuracy score dropped from 94% to 91% to 88% over the last 3 weeks. The pipeline is still running successfully. Should I be concerned? What does this pattern typically indicate?
5. How should I tier my alerts? Give me three tiers with specific thresholds and response actions for each.
What to notice: Does the response distinguish between a point-in-time dip and a sustained trend? For alert tiers, check whether threshold breaches and gradual drift get different treatment (threshold for acute spikes, trend rules for slow degradation). A strong answer often gives a proportionate middle tier for sustained score decline — for example, a warning path with a ticket and follow-up within a business day — without treating a multi-week slide like a single bad hour. If the assistant collapses trend and threshold into the same tier, that's the gap this module's alerting framework addresses.
Your turn — design for your pipeline:
| Panel | What pipeline/agent? | Metrics to track | Alert threshold |
|---|---|---|---|
| Example: Reasoning Quality | orders_daily nightly schema drift monitor | Weekly sample accuracy on labeled drift cases (%); reference-task pass rate (%); escalation rate (% of runs vs 90-day baseline) | Sample accuracy < 90% in any week or reference pass rate down 3 consecutive weeks → Warning (audit outputs, check for provider or data drift); escalation rate > 25% above baseline for 1 week → Review escalations for pattern |
| Pipeline Health | | | |
| Reasoning Quality | | | |
| Example: Cost & Efficiency | Same agent | Cost per run vs 30-day rolling average; tokens (or equivalent) per successful output; cost per labeled correct output if you have weekly samples | Cost > 150% of 30-day avg for 3+ runs → same-day investigate tool volume and prompt length; cost per output up 20%+ for 2 weeks while latency flat → trend ticket (check for retrieval bloat or redundant tool calls) |
| Cost & Efficiency | | | |
| Example: Governance | Same agent | Human approval gate hits (count and % of runs); out-of-scope action attempts (target 0); audit log completeness (% of runs with immutable trace IDs) | Any out-of-scope attempt → Critical, suspend pending review; approval gate denials spiking vs baseline → Warning, review policy vs actual tasks; missing audit/trace for a run → informational Tier 1 with same-day backfill |
| Governance | | | |
Rubric: Each row should include: at least 1 signal, at least 1 threshold or trigger, and at least 1 action. A strong answer for Pipeline Health might track record count variance (threshold: >15% drop triggers investigation) with a data quality audit response.
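The rubric's record-count example can be expressed as a one-function check. The function and parameter names are illustrative; the 15% threshold matches the rubric's example.

```python
def record_count_drop(today, baseline_avg, threshold=0.15):
    """True if today's record count fell more than `threshold`
    below the rolling baseline average.

    `baseline_avg` would typically be a 30-day rolling mean of
    daily record counts; a non-positive baseline yields False
    rather than a division error.
    """
    if baseline_avg <= 0:
        return False
    return (baseline_avg - today) / baseline_avg > threshold
```

A True result would trigger the data quality audit the rubric describes, rather than an automatic fix.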
- Traditional monitoring tells you if it ran. Agentic observability tells you if it reasoned correctly. Add a second monitoring layer covering reasoning quality, not just execution success. LLM behavior changes over time — assume it and monitor for it.
- Reasoning traces are your primary debugging tool. Without them, agentic failures are opaque. Instrument trace capture before you ship to production, not after something goes wrong.
- Alert on trends, not just thresholds. Reasoning quality degradation is gradual. Weekly sampling, reference task comparison, and escalation rate tracking surface the slow drift that aggregate success metrics miss.
Your system is observable, your agents are monitored, and you have a clear picture of what's working. The next production challenge isn't technical, it's organizational: driving adoption across your team and your company, and building the business case that sustains it.
Next: Adoption Roadmap →
Additional Reading
- Research on LLM performance drift over time — Chen, Zaharia, and Zou evaluated model outputs against test suites at two points — March and June 2023 — and found measurable differences between snapshots driven by opaque provider-side updates; useful background for why reasoning quality monitoring matters beyond pipeline health.
- Production ML monitoring patterns — Foundational framing for monitoring systems that produce probabilistic outputs — the conceptual bridge from traditional pipeline monitoring to agentic observability.
- MAST: a taxonomy of failure modes in multi-agent systems — 14 failure modes in multi-agent systems, including the partial failure and silent degradation patterns that observability is designed to surface.