
You open your AI coding tool and type: "Write me a dbt model that joins orders to customers and calculates 30-day lifetime value (LTV)." It produces SQL. You run it. It works. Confident, you try something harder: "Look at the orders_daily pipeline. It's been failing intermittently. Fix it." The result this time is plausible-sounding SQL that references columns that don't exist in your schema. When you point this out, the model apologizes and tries again with different column names — still wrong.

The tool wasn't broken. It was doing exactly what it was designed to do: generate text that looks like a reasonable continuation of your prompt. It doesn't know your schema, your pipeline's execution history, or what "failing intermittently" actually looks like in your logs. For the first task, the training data was enough. For the second, the gap between "knows SQL patterns" and "understands your production environment" became the problem. Closing that gap is what turning an LLM into an agent is actually about.

How Agents Work — From LLM Calls to Agentic Systems

By the end of this module, you will be able to: (1) describe the six architectural layers that turn an LLM into an agent, (2) explain why non-determinism changes how you test agentic systems, (3) identify four recurring failure modes and their design countermeasures, (4) apply the agent trace exercise to a real pipeline failure.

New here?

This module builds on What Is ADE. If you haven't read it yet, start there — it covers the foundational definitions and the automation spectrum this module extends.

An LLM is a text-in, text-out function. You send a string of tokens (the text chunks the model counts for input, output, and billing — often roughly three-quarters of a word each in English); you receive a string of tokens. The model is stateless between calls: it has no memory of previous conversations, no awareness of what's running in your environment, no access to anything that wasn't in its training data or in the prompt window you sent it.

The output is probabilistic, not deterministic. Even at temperature zero, the setting that minimizes sampling randomness, there is meaningful variation in outputs. A 2024 study on LLM output consistency, measuring performance on repeated identical prompts at temperature zero, found accuracy varying by up to 15% across runs, with a best-to-worst spread of up to 70%. This isn't a bug; it reflects how the underlying sampling process works. But it has real implications for how you design, test, and govern agent behavior.

Math, not minds. An LLM samples from a probability distribution over tokens given the input context. It doesn't reason, plan, or understand in any human sense. The words "think," "decide," and "understand" are metaphors. Useful ones — but metaphors. Teams that internalize this build more reliable systems.

Build up in layers

An agent isn't a fundamentally different thing from an LLM call — it's what happens when you wrap LLM calls with structure that makes them useful for multi-step work in the real world.

The clearest way to understand how the layers connect isn't a static list — it's a data flow. Start with the simplest possible thing and watch what changes as each layer is added.

Layer 1 — The raw LLM: A text-in, text-out function. Tokens go in; a completion comes out. The model is stateless between calls — no memory of your previous conversation, no awareness of your environment, no access to anything outside the prompt window you sent it.

From model to harness (Layers 2–4)

Each of the next three layers adds structure to what flows in — or reshapes how the output flows out. Together, they form the harness that turns a raw model into a domain-specific actor.

Layer 2 — System message: Before your prompt reaches the model, a standing instruction block is prepended — every call, without exception. This is where you encode the agent's role, its constraints, and the standards it operates under. A data engineering agent's system message might specify: investigate before acting, never modify downstream consumers without surfacing them first, follow team SQL conventions. It's the job description that never changes between turns.

Layer 3 — Structured output: The model's response is forced into a defined schema — JSON, XML, or a typed format. This matters because the next consumer of the output isn't a human reading it; it's code routing the response to a tool call. Without schema enforcement, parsing is unpredictable — and a parsing failure in the middle of a multi-step chain is a silent failure mode.

Layer 4 — Tool calls: Now the agent can act in the world. When the model outputs a structured tool-call specification, the harness intercepts it and runs the actual function: read a file, execute a SQL query, call an API, post a Slack notification. The result flows back into the next prompt as new context. This is the moment the system gains the ability to touch your production environment — and where tool scope decisions begin to matter significantly.

Layers 2–4 — system message, structured output, tool calls — form the harness: standing context, output discipline, and reach into the world. Without these three layers in place, there is nothing yet to loop over.
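A minimal sketch of how these three layers compose in a single turn. Everything here is illustrative, not a real SDK: `SYSTEM_MESSAGE`, `call_model`, and `read_logs` are hypothetical stand-ins so the sketch runs on its own.

```python
import json

# Layer 2: a standing instruction block, prepended to every call.
SYSTEM_MESSAGE = (
    "You are a data engineering agent. Investigate before acting. "
    "Surface downstream consumers before modifying anything."
)

def read_logs(pipeline_name: str, n_lines: int) -> str:
    # Stub tool: a real implementation would query your log store.
    return f"last {n_lines} lines of {pipeline_name}"

TOOLS = {"read_logs": read_logs}

def call_model(messages):
    # Stand-in for an LLM API call; returns a structured tool-call request.
    return json.dumps({"tool": "read_logs",
                       "args": {"pipeline_name": "orders_daily", "n_lines": 50}})

def run_turn(user_prompt: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": user_prompt}]
    # Layer 3: the reply must parse into a known schema, or we fail loudly
    # instead of silently mis-routing mid-chain.
    reply = json.loads(call_model(messages))
    if reply.get("tool") not in TOOLS:
        raise ValueError(f"unknown tool: {reply.get('tool')}")
    # Layer 4: the harness, not the model, executes the tool.
    return TOOLS[reply["tool"]](**reply["args"])

print(run_turn("orders_daily failed — investigate"))
```

The key design point: the model only ever emits a description of an action; the harness validates it against a known tool registry before anything touches the environment.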

From harness to loop (Layers 5–6)

Layers 1–4 shape a single call. Layers 5–6 are what turn a sequence of calls into a system that handles multi-step work.

Layer 5 — The agent loop: The model observes its current context (including tool results from the last step), decides what action to take next, executes that action via a tool call, observes the new result, and repeats. This cycle — observe → decide → act — continues until a stopping condition fires: the goal is reached, the step budget is exhausted, or the model determines it needs a human. The "for loop with an LLM call" formulation describes this exactly: the loop is the agent.
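The "for loop with an LLM call" can be sketched in a few lines. Here `decide` and `act` are scripted stubs so the loop runs standalone; in a real agent, `decide` is the LLM call and `act` is tool execution.

```python
# Minimal sketch of the observe → decide → act loop, with a step budget.

def decide(context):
    # Stand-in for the LLM call: decide the next action from current context.
    if "logs" not in context:
        return ("read_logs", None)
    return ("done", "root cause: schema drift in upstream table")

def act(action):
    # Stub tool execution; the result becomes the next observation.
    return "ERROR: column customer_ltv not found"

def agent_loop(goal, max_steps=10):
    context = {"goal": goal}
    for _ in range(max_steps):             # the loop *is* the agent
        action, result = decide(context)   # decide
        if action == "done":               # stopping condition: goal reached
            return result
        context["logs"] = act(action)      # act, then observe the result
    return "escalate: step budget exhausted"  # stopping condition: budget

print(agent_loop("diagnose orders_daily failure"))
```

Note that both stopping conditions from the text appear explicitly: a goal-reached return and a step budget that escalates to a human instead of looping forever.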

Layer 6 — Multi-agent: For complex tasks, you decompose the work across specialized agents: an investigator agent reads logs and diagnoses the failure; a proposer agent drafts the fix; a validator agent runs it against staging. Each is its own loop; a coordinator routes between them. The architecture provides two things a single loop can't: parallel execution on independent subtasks, and error isolation — a failure in one agent doesn't cascade into the others.
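A coordinator over the three specialized agents might look like the sketch below. The agents are plain functions here (a real system would give each its own loop, tools, and system message), and the routing is sequential for simplicity; the error-isolation point is the `try`/`except` around each stage.

```python
# Hypothetical coordinator: all agent names and return values are illustrative.

def investigator(task):
    return {"diagnosis": "upstream schema change dropped a column"}

def proposer(findings):
    return {"fix_sql": "ALTER TABLE orders ADD COLUMN customer_ltv FLOAT"}

def validator(fix):
    return {"staging_run": "passed"}

def coordinate(task):
    results = {}
    for name, agent, arg_key in [("investigate", investigator, None),
                                 ("propose", proposer, "investigate"),
                                 ("validate", validator, "propose")]:
        try:
            arg = task if arg_key is None else results[arg_key]
            results[name] = agent(arg)
        except Exception as exc:
            # Error isolation: record the failure and stop routing,
            # rather than letting it cascade into downstream agents.
            results[name] = {"error": str(exc)}
            break
    return results

print(coordinate("orders_daily failed")["validate"]["staging_run"])
```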

Layers 1–6 at a glance

| Layer | Description | Example |
| --- | --- | --- |
| 1 — LLM | Raw text-in, text-out model; everything above this is harness | Tokens in the prompt window become a completion |
| 2 — System message | Standing instructions prepended to every call: role, constraints, standards | "You are a data engineering agent that investigates pipeline failures" |
| 3 — Structured output | Enforce a structured format (JSON / schema) so output is machine-parseable | Parser routes the model's reply into the next tool step |
| 4 — Tool calls | Actions that change state: read files, run queries, call APIs, notify | Run a warehouse query or post an update |
| 5 — Agent loop | Observe → decide → act → repeat until resolved or a stopping condition fires | The "for loop with an LLM call" pattern |
| 6 — Multi-agent | Specialized agents coordinated in parallel; errors can be isolated per agent | Investigator, fix proposer, and staging validator |

The agentic harness defines behavior

Two products can implement all six layers with identical underlying models and produce dramatically different outcomes. What separates a capable data engineering agent from a capable chatbot isn't the model — it's the harness: the specific context it starts each turn with, the specific tools it's authorized to call, and the specific events that set it in motion.

This is why "just use the model" misses the point. A frontier model with access to your lineage graph, schema history, execution logs, and team standards will diagnose a pipeline failure with a specificity that a general-purpose chat interface can't — not because the model is smarter, but because the inputs are richer. The harness is what makes domain expertise available to the model.

How this works in Ascend

Otto is an AI agent built into the platform. The platform surfaces context about your pipelines — including run history, lineage, and execution state — so Otto can reason about your specific data environment rather than generic best practices. The harness is pre-built around the data stack — which is what makes the hands-on lab work the way it does. The principle applies regardless of platform: the richer the harness, the more useful the agent.

The practical implication: when evaluating or building an agentic system, audit the harness before the model. What does the agent know before it acts? What can it do? What's it not allowed to do? How is it activated? The answers map directly to the three design decisions Context, Tools, and Triggers addresses in depth.

Non-determinism is real

Back to the 15% accuracy variation finding: even with temperature set to zero, repeated identical prompts to the same model produce meaningfully different outputs, as research on LLM output consistency has shown. The variation is affected by framing, ordering, and the specific examples present in the prompt.

For single-turn tasks — "write me a query," "explain this error" — this level of variation is manageable. For a multi-step agent loop operating on production data, it accumulates. Variation introduced at step 3 of a 10-step chain doesn't stay isolated — it shapes the context for every subsequent step, so small errors early compound into large failures at the end.

Non-determinism changes how you test and trust

You cannot test an agentic system the way you test deterministic code. A single successful run doesn't mean the system will succeed reliably. Design for it: use structured output enforcement to reduce surface area for variation, build checkpoints where intermediate state is inspected before proceeding, set step budgets that escalate to humans when chains run long, and run agents through representative scenarios repeatedly before putting them in production paths. "It worked once" is not a test.
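One concrete consequence: test harnesses for agents assert on pass *rates* across repeated runs, not on a single pass. A sketch of that pattern, where `run_agent_step` is a hypothetical stand-in that deterministically "fails" one run in ten to mimic variance:

```python
def run_agent_step(run_id: int) -> bool:
    # Hypothetical stand-in for one agent run; True means it produced a
    # valid result. Fails on 1 in 10 runs to mimic non-determinism.
    return run_id % 10 != 0

def pass_rate(n_runs: int = 100) -> float:
    passes = sum(run_agent_step(run_id) for run_id in range(n_runs))
    return passes / n_runs

rate = pass_rate()
# Gate releases on a threshold over many runs: "it worked once" is not a test.
assert rate >= 0.8, f"pass rate {rate:.0%} below the 80% release gate"
print(f"pass rate over 100 runs: {rate:.0%}")
```

The 80% gate here is an arbitrary example threshold; the point is that the acceptance criterion is statistical, the way you would gate a flaky integration suite.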

What can go wrong

Multi-agent systems fail at rates that will surprise you if you've only tested single-agent behavior. The MAST taxonomy found reported task non-success rates across open-source multi-agent systems ranging from 41% to 86.7%. The lesson isn't a specific number; it's that failure rates in multi-agent evals can be very high and vary sharply by setup. Research identifying recurring failure archetypes and Microsoft's production-derived AI failure taxonomy each highlight patterns worth understanding, all of which good harness design directly addresses:

| Failure type | What it looks like | Design countermeasure |
| --- | --- | --- |
| Premature action | Agent acts before sufficiently grounding its understanding — proposes a fix based on misdiagnosis | Require investigation before action; enforce explicit reasoning trace |
| Over-helpfulness | Agent fills in missing context with confident-sounding confabulation (hallucination — generating plausible-sounding but incorrect output) | Load required context explicitly; treat missing context as a stopping condition |
| Context pollution | Irrelevant or distracting information in the context window degrades decision quality | Use retrieval over full-context dumps; structure context by priority |
| Fragile execution | Reasoning degrades under cognitive load — long chains, many tools, complex dependencies | Set step budgets; use multi-agent decomposition for complex tasks |

All four are addressable. None of them are reasons not to build. They're reasons to build with observability, checkpoints, and explicit step budgets from the start.

What agents can do today

Single-turn tool use is the most reliable capability. Top-performing models on the Berkeley Function-Calling Leaderboard consistently demonstrate strong accuracy on combined tool-use tasks — check the live leaderboard for current figures, as model rankings shift frequently. That's a meaningful reliability floor for production use, especially with retry logic and human review gates.

Multi-step reasoning is where things get both more interesting and more prone to failure. The teams with the most reliable agentic systems in production share a common design pattern:

| Capability class | Reliability profile | Design pattern |
| --- | --- | --- |
| Single-turn tool use | High accuracy; retry logic handles edge cases | Start here; add human review gates for production paths |
| Multi-step investigation | More variable; bounded scope helps | Agent-led investigation with explicit stopping conditions |
| Autonomous production changes | Elevated failure risk without guardrails | Human approval required; expand scope progressively as trust builds |

Exercise: Watch the Loop

⏱ 10 minutes

The observe-decide-act loop is easier to understand once you've seen it run. This prompt asks an LLM to demonstrate it — and the response itself is an example of the loop in action.

Open any LLM — Claude, ChatGPT, or Gemini work well — and paste this:

Act as a data pipeline monitoring agent. You have access to these tools:
- read_logs(pipeline_name, n_lines) — returns the last n log lines
- query_schema(table_name) — returns column names and types
- check_lineage(component_name) — returns upstream and downstream dependencies
- open_pr(title, description) — opens a pull request
- notify_slack(channel, message) — sends a Slack message

Your task: The orders_daily pipeline just failed. Walk me through your reasoning loop step by step — what you observe, what you decide, what tool you call, and what you observe after each result. Show the actual tool calls you'd make and what each one returns. Stop when you've either resolved the issue or determined you need human input.

What to notice: Each tool result in the response feeds the next decision — that's the observe-decide-act loop. Notice two things in particular: does the agent investigate before proposing a fix, or jump straight to a solution (premature action)? And when it hits an ambiguous result, does it ask a clarifying question or fill in the gap with a confident-sounding assumption (over-helpfulness)? Both are failure modes named in this module — seeing them appear in a live response makes them concrete.

Key takeaways
  • An agent is a for loop with an LLM call. The harness — system message, structured output, tools, triggers, guardrails — is what makes it useful for data engineering work. Two products using the same model can produce dramatically different results based on the harness alone.
  • Non-determinism is a feature and a risk. Up to 15% accuracy variation even at temperature zero, per research on LLM output consistency, means testing requires repeated runs, checkpoints, and human review gates — not just one successful pass.
  • Multi-step failure rates are real. The MAST taxonomy found non-success rates ranging from 41% to 86.7% across open-source multi-agent systems. The lesson is that failure rates in multi-agent evals can be very high and vary sharply by setup — addressable through the four failure-mode design countermeasures, but only with deliberate infrastructure from the start.
You'll use this in practice

The six-layer harness model gets applied directly in Context, Tools, and Triggers →, where you'll design the context stack, tool scope, and trigger configuration for a production data engineering agent.

You now know what's inside the loop. The next module answers the more important question: what goes in before the loop starts — and what keeps it from running off a cliff.

Next: Context, Tools, and Triggers →

Additional Reading

  • Berkeley Function-Calling Leaderboard — The most practical benchmark for evaluating agent tool-use reliability. Use this to compare models on the specific task class you're building for, not on general benchmarks.
  • MAST: A Multi-Agent System Taxonomy — Identifies recurring failure patterns across open-source multi-agent systems with non-success rates ranging from 41% to 86.7%. Essential reading for anyone designing multi-agent data engineering systems.
  • Failure archetypes in LLM agents — Recurring failure archetypes with concrete examples — premature action, over-helpfulness, context pollution, fragile execution. Practical for designing countermeasures.
  • Microsoft's production AI failure taxonomy — Production-derived failure taxonomy with an enterprise deployment perspective; complements the MAST academic taxonomy.
  • LLM output consistency at temperature zero — The empirical basis for the 15% accuracy variation finding. Critical for anyone designing test suites for agentic systems.