Observability: Monitoring Reasoning, Not Just Output
Your agentic system has been running for three months. Pipeline health metrics look great — 99.2% success rate, latency is stable, costs are within budget. Then last Tuesday, a Slack message arrives from a stakeholder: "These recommendations look off. Did something change?" You pull the pipeline logs. The pipelines ran. All the tool calls succeeded. No errors, no alerts. You look at the outputs and they're... subtly worse. Not catastrophically wrong — just less accurate than they were two months ago. The pipeline didn't fail. It just quietly got less good.
Observability for agentic data engineering requires monitoring not just whether pipelines ran, but whether agent reasoning stayed coherent over time — a concern that sits alongside reliability patterns you explored in multi-agent orchestration and production readiness gates.
After this module, you'll be able to:
- Design a four-pillar observability system for a production agentic pipeline
- Analyze agent reasoning traces to identify where decisions degraded
- Build an alerting strategy that catches gradual drift, not just outages
- Debug agentic failures using reasoning traces rather than execution logs alone
This is the observability gap specific to agentic systems: traditional monitoring tells you whether your system ran. It doesn't tell you whether it reasoned correctly, whether its accuracy is trending down, or whether the model it's built on changed in a way that shifted its behavior.
A pipeline that produces wrong outputs on schedule is indistinguishable from a healthy pipeline — until you look at the data.
The four pillars of agentic observability
Traditional pipeline monitoring covers the first pillar. Agentic systems require all four.
| Pillar | What it measures | Key signals |
|---|---|---|
| Pipeline health | Whether the system ran correctly | Success rate, p50 (median) and p99 (99th-percentile) latency, error count, last successful run |
| Reasoning quality | Whether the agent reasoned correctly | Sample accuracy rate (output quality score on a sampled, expert-graded subset), escalation rate, reference task comparison |
| Cost & efficiency | Whether resource use is sustainable | Cost per run, tokens per run, cost trend vs. 30-day average |
| Governance | Whether the agent stayed in scope | Human approval gate hits, out-of-scope actions (target: 0), audit log completeness — an immutable trace record per pipeline run, including inputs, outputs, tool calls, and agent decisions, retained for your required period |
A human approval gate is a checkpoint that requires human sign-off before the agent takes a designated action. An out-of-scope action is any step the agent attempts outside its defined tool permissions or allowed playbook. These signals pair naturally with the control patterns in Governance and Production Readiness.
Pipeline health is table stakes — you already monitor this. The gap is pillars 2 through 4.
Reasoning quality: the hardest pillar to measure
Reasoning quality is what most teams skip, because it's the hardest to instrument. It's also the most important — and the one most likely to surface the subtle degradation that triggered that Slack message.
Chen, Zaharia, and Zou, in their research on LLM performance drift over time, evaluated model outputs against test suites at two points (March and June 2023) and found measurable performance differences between the snapshots, driven by opaque provider-side model updates, even when the user's API parameters and system prompt stayed the same. In live agentic deployments, two additional drift causes compound this: context drift (the real-world data environment drifting out from under the agent's assumptions as schemas change and upstream distributions shift) and prompt staleness (the agent's instructions no longer matching current domain conventions or terminology). You can't assume a system that was accurate three months ago is accurate today.
Three practical approaches to measuring reasoning quality:
- Weekly sampling — randomly sample 5–10% of agent outputs per week and evaluate them against ground truth or expert judgment. This catches gradual decline that aggregate success metrics miss.
- Reference task comparison — maintain a small set of reference inputs with known correct outputs. Run these against the production agent weekly. Degradation on reference tasks before it appears in general outputs is an early warning signal.
- Escalation rate tracking — track how often the agent escalates to human review versus resolves autonomously. An escalation rate that's rising without a corresponding increase in incident complexity signals reasoning degradation.
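The first two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `grade` callable stands in for whatever expert-judgment or ground-truth comparison your team uses, and both function names are ours, not a library API.

```python
import random

def weekly_sample(outputs, rate=0.05, seed=None):
    """Randomly sample a fraction of the week's agent outputs for review.

    `rate` is the sampling fraction (5-10% per the guidance above).
    A fixed `seed` makes a given week's sample reproducible for audits.
    """
    rng = random.Random(seed)
    k = max(1, round(len(outputs) * rate))
    return rng.sample(outputs, k)

def sample_accuracy(sampled, grade):
    """Fraction of sampled outputs that pass grading.

    `grade` is any callable returning True for a correct output; it is
    a placeholder for your expert-review or ground-truth check.
    """
    correct = sum(1 for output in sampled if grade(output))
    return correct / len(sampled)
```

Reference task comparison uses the same `sample_accuracy` helper, just over a fixed set of known-good inputs instead of a random sample, which is what makes week-over-week scores directly comparable.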
Reasoning traces: your primary debugging tool
When an agentic failure happens, the first instinct is to check execution logs. The problem: execution logs tell you what the agent did. They don't tell you why.
Reasoning traces capture the agent's decision process — not just the tool calls it made, but the reasoning steps between them.
Each reasoning step maps to an observable signal. A generic reasoning trace you can log in production has five steps: observation, reasoning, action, validation, and output.
A trace for a schema drift remediation might show:
- Agent reads schema diff
- Agent reasons: "This looks like a field rename — customer_id → cust_id"
- Agent proposes rename fix
- Agent validates fix against staging
- Agent opens PR
If the fix was wrong, the trace shows you step 2 — where the reasoning went off. Without the trace, you'd only know the fix was wrong, not why.
Instrument trace capture before you ship to production. Adding it retroactively after something goes wrong is painful — you're debugging blind until you get instrumentation in place.
- Langfuse — open-source, self-hostable tracing and evaluation for LLM apps; good when you want full control of data residency.
- Arize AI — enterprise-oriented observability and quality monitoring for ML and LLM systems in production.
- OpenTelemetry with emerging LLM semantic conventions — standards-based instrumentation when you already run OTEL and want traces to land in your existing backends.

These are illustrative examples — your team should evaluate tools based on your requirements and vendor agreements.
A minimal trace schema:
```json
{
  "run_id": "ord_daily_20240115_0400",
  "agent": "schema_drift_monitor",
  "input": { "table": "orders_daily", "drift_detected": true },
  "reasoning_steps": [
    { "step": 1, "observation": "customer_id missing from schema", "reasoning": "Field rename detected — customer_id → cust_id in upstream" },
    { "step": 2, "action": "propose_rename_fix", "tool_calls_made": 1 }
  ],
  "output": { "pr_opened": true, "fix_type": "field_rename" },
  "duration_ms": 4200,
  "tokens_used": 1840
}
```
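One way to populate a schema like that is a small in-process recorder. This is a sketch under our own naming assumptions (the class and its methods are illustrative, not a library API); the field names mirror the schema above.

```python
import json
import time

class TraceRecorder:
    """Minimal in-process reasoning-trace recorder (illustrative sketch)."""

    def __init__(self, run_id, agent, input_data):
        self.start = time.monotonic()
        self.trace = {
            "run_id": run_id,
            "agent": agent,
            "input": input_data,
            "reasoning_steps": [],
        }

    def step(self, **fields):
        # Each call records one observation/reasoning/action step,
        # numbered in the order it happened.
        fields["step"] = len(self.trace["reasoning_steps"]) + 1
        self.trace["reasoning_steps"].append(fields)

    def finish(self, output, tokens_used=0):
        # Finalize the trace and serialize it for your log sink.
        self.trace["output"] = output
        self.trace["duration_ms"] = int((time.monotonic() - self.start) * 1000)
        self.trace["tokens_used"] = tokens_used
        return json.dumps(self.trace)
```

In practice you would ship the serialized trace to an immutable store rather than return it, so the governance pillar's audit-log completeness signal has something to count.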
Alerting strategy: trend-based, not threshold-only
Traditional pipeline alerting is threshold-based: error rate exceeds 2%, alert fires. This works for binary failure states. It doesn't work for gradual reasoning degradation.
Agentic systems need two layers of alerting:
| Layer | What it catches | Role |
|---|---|---|
| Threshold alerts | Point-in-time breaches: error rate, latency, cost spike | The standard layer for acute failures and hard limits |
| Trend alerts | Gradual moves: e.g. sample score 94% → 91% → 88% over weeks | Harder to configure; surfaces slow reasoning degradation a single threshold would miss |
A reasoning quality score that drifts down over several weeks often will not cross a fixed threshold until late; trend-style rules are how you catch that pattern at week two instead of after a stakeholder complaint.
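A minimal trend rule for that pattern, assuming you store one sample-accuracy score per week, might look like the following. Strict week-over-week decline is the simplest possible rule; production systems often use smoothing or regression instead.

```python
def declining_for(scores, weeks=3):
    """True if the most recent `weeks` scores are strictly decreasing.

    `scores` is a chronological list of weekly sample-accuracy
    percentages, oldest first. Returns False when there is not yet
    enough history to judge a trend.
    """
    recent = scores[-weeks:]
    if len(recent) < weeks:
        return False
    return all(a > b for a, b in zip(recent, recent[1:]))
```

Against the 94% → 91% → 88% example, this rule fires in week three, while a fixed 85% threshold would still be silent.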
| Alert type | Signal | Urgency | Action |
|---|---|---|---|
| Error rate spike | >2% errors in 30 min | Immediate | Page on-call |
| Cost overrun | >150% of 30-day avg for 3+ runs | Same-day | Investigate tool call volume |
| Reasoning quality drop | Sample score <90% or declining 3+ weeks | Warning | Audit recent outputs, check for provider-side or configuration change |
| Escalation rate increase | >20% above baseline for 1 week | Warning | Review escalated cases for pattern |
| Out-of-scope action | Any action outside defined tool scope | Immediate | Suspend agent pending review |
Organize response expectations with a simple three-tier model. Tier 1 (Informational): log, dashboard, no immediate human action. Tier 2 (Warning): ticket or audit within a business day — e.g. sustained reasoning quality decline or escalation-rate increase. Tier 3 (Critical): page or suspend — e.g. error spikes, cost overruns past policy, or any out-of-scope action. Map each alert row above to the tier that matches its Urgency and Action columns.
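One way to make that mapping explicit and testable is a small routing table. The alert keys and action names here are illustrative, not a standard schema; the tier assignments follow the Urgency column of the table above.

```python
# Illustrative routing of the alert types above onto the three tiers.
ALERT_TIERS = {
    "error_rate_spike":         {"tier": 3, "action": "page_on_call"},
    "out_of_scope_action":      {"tier": 3, "action": "suspend_agent_pending_review"},
    "cost_overrun":             {"tier": 3, "action": "same_day_investigation"},
    "reasoning_quality_drop":   {"tier": 2, "action": "audit_recent_outputs"},
    "escalation_rate_increase": {"tier": 2, "action": "review_escalated_cases"},
    "missing_audit_trace":      {"tier": 1, "action": "log_and_backfill"},
}

def route_alert(alert_type):
    """Return the tier and response action for an alert type.

    Unknown alert types default to Tier 1 so nothing is silently
    dropped; you may prefer to default higher in stricter setups.
    """
    return ALERT_TIERS.get(alert_type, {"tier": 1, "action": "log_only"})
```

Keeping the mapping in one data structure means the on-call runbook and the alerting code can't drift apart without a reviewable diff.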
The key distinction: threshold alerts indicate something broke. Trend alerts indicate something is slowly breaking. Both matter. Only trend alerts catch the subtle failure mode that caused that Slack message.
⏱ 30–45 minutes for the complete exercise; 15–20 minutes for a focused 2-row draft
Design the observability system for your production agent. Paste the prompt below into your assistant of choice and use the reply to refine your monitoring and alerting design.
Optional (Ascend): To stress-test the same logic in your deployment, open Otto there and run the same prompt.
I'm designing an observability system for a production agentic data pipeline. The agent monitors our orders_daily pipeline for schema drift and data quality anomalies. It runs nightly and has been in production for 6 weeks with a 97% success rate.
I need help designing the monitoring and alerting layer. Specifically:
1. What are the three most important metrics I should track beyond basic success/failure rate — and why?
2. Read this short trace: Agent receives query → searches → returns "No results found" even though results exist. At which step did reasoning diverge from execution? What would you instrument to catch this?
3. A pipeline shows rising cost per output but stable latency. Which pillar signals this, and what threshold would trigger investigation?
4. My agent's weekly sample accuracy score dropped from 94% to 91% to 88% over the last 3 weeks. The pipeline is still running successfully. Should I be concerned? What does this pattern typically indicate?
5. How should I tier my alerts? Give me three tiers with specific thresholds and response actions for each.
What to notice: Does the response distinguish between a point-in-time dip and a sustained trend? For alert tiers, check whether threshold breaches and gradual drift get different treatment (threshold for acute spikes, trend rules for slow degradation). A strong answer often gives a proportionate middle tier for sustained score decline — for example, a warning path with a ticket and follow-up within a business day — without treating a multi-week slide like a single bad hour. If the assistant collapses trend and threshold into the same tier, that's the gap this module's alerting framework addresses.
Your turn — design for your pipeline:
| Panel | What pipeline/agent? | Metrics to track | Alert threshold |
|---|---|---|---|
| Example: Reasoning Quality | orders_daily nightly schema drift monitor | Weekly sample accuracy on labeled drift cases (%); reference-task pass rate (%); escalation rate (% of runs vs 90-day baseline) | Sample accuracy < 90% in any week or reference pass rate down 3 consecutive weeks → Warning (audit outputs, check for provider or data drift); escalation rate > 25% above baseline for 1 week → Review escalations for pattern |
| Pipeline Health | | | |
| Reasoning Quality | | | |
| Example: Cost & Efficiency | Same agent | Cost per run vs 30-day rolling average; tokens (or equivalent) per successful output; cost per labeled correct output if you have weekly samples | Cost > 150% of 30-day avg for 3+ runs → same-day investigate tool volume and prompt length; cost per output up 20%+ for 2 weeks while latency flat → trend ticket (check for retrieval bloat or redundant tool calls) |
| Cost & Efficiency | | | |
| Example: Governance | Same agent | Human approval gate hits (count and % of runs); out-of-scope action attempts (target 0); audit log completeness (% of runs with immutable trace IDs) | Any out-of-scope attempt → Critical, suspend pending review; approval gate denials spiking vs baseline → Warning, review policy vs actual tasks; missing audit/trace for a run → informational Tier 1 with same-day backfill |
| Governance | | | |
Rubric: Each row should include: at least 1 signal, at least 1 threshold or trigger, and at least 1 action. A strong answer for Pipeline Health might track record count variance (threshold: >15% drop triggers investigation) with a data quality audit response.
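The rubric's record-count example can be expressed as a one-function check. The function and parameter names are illustrative; the 15% threshold matches the rubric's example.

```python
def record_count_drop(today, baseline_avg, threshold=0.15):
    """True if today's record count fell more than `threshold`
    below the rolling baseline average.

    `baseline_avg` would typically be a 30-day rolling mean of
    daily record counts; a non-positive baseline yields False
    rather than a division error.
    """
    if baseline_avg <= 0:
        return False
    return (baseline_avg - today) / baseline_avg > threshold
```

A True result would trigger the data quality audit the rubric describes, rather than an automatic fix.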
- Traditional monitoring tells you if it ran. Agentic observability tells you if it reasoned correctly. Add a second monitoring layer covering reasoning quality, not just execution success. LLM behavior changes over time — assume it and monitor for it.
- Reasoning traces are your primary debugging tool. Without them, agentic failures are opaque. Instrument trace capture before you ship to production, not after something goes wrong.
- Alert on trends, not just thresholds. Reasoning quality degradation is gradual. Weekly sampling, reference task comparison, and escalation rate tracking surface the slow drift that aggregate success metrics miss.
Your system is observable, your agents are monitored, and you have a clear picture of what's working. The next production challenge isn't technical, it's organizational: driving adoption across your team and your company, and building the business case that sustains it.
Next: Adoption Roadmap →
Additional Reading
- Research on LLM performance drift over time — Chen, Zaharia, and Zou evaluated model outputs against test suites at two points — March and June 2023 — and found measurable differences between snapshots driven by opaque provider-side updates; useful background for why reasoning quality monitoring matters beyond pipeline health.
- Production ML monitoring patterns — Foundational framing for monitoring systems that produce probabilistic outputs — the conceptual bridge from traditional pipeline monitoring to agentic observability.
- MAST: a taxonomy of failure modes in multi-agent systems — 14 failure modes in multi-agent systems, including the partial failure and silent degradation patterns that observability is designed to surface.