
Multi-Agent Orchestration

You now have specialized agents: one watching your ingestion pipeline for schema drift, one running quality checks against your transformation models, one monitoring pipeline schedule health, one routing alerts to Slack. Each works well in isolation. Agentic data engineering means using AI agents as first-class operators on your data stack, not just copilots for one-off queries, and the next step is connecting those agents through multi-agent orchestration. This is where a surprising number of production systems run into trouble.

The problem isn't that multi-agent systems don't work. It's that the failure modes are different from anything in traditional pipeline engineering. A single agent that fails throws an error. A multi-agent system can fail partially — some agents complete successfully while others silently produce wrong outputs — and the failure propagates in ways that are hard to detect and harder to debug. Orchestration done well multiplies the leverage of your individual agents. Orchestration done poorly creates a system that's more fragile than any of its components.

By the end of this module, you will be able to:

  • Select the right coordination pattern for a given task type and complexity
  • Interpret compound reliability math and estimate a reasonable per-agent reliability target before chaining
  • Identify concrete handoff validations that catch incomplete or degraded outputs before they propagate
  • Choose a state synchronization pattern suited to concurrent writes, shared state, and audit history (see contention risk in State synchronization)

Four coordination patterns

Different coordination patterns suit different task types. Choosing the wrong one doesn't just hurt performance — it creates coordination complexity you'll spend months managing.

Sequential handoff

One agent completes its task and passes its output to the next. Simple, auditable, and easy to debug. Each agent's output is the next agent's input.

Best for

Tasks with clear dependencies where each step needs the prior step's output before it can begin. Most pipeline incident response workflows fit this pattern.
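
A sequential handoff can be sketched in a few lines. The two stage functions below are hypothetical stand-ins for real agents; the point is the shape, where each agent's output dict becomes the next agent's input:

```python
# Minimal sequential-handoff sketch; stage names are illustrative.

def diff_schema(event: dict) -> dict:
    # Agent A: detect what changed (stubbed for illustration)
    return {"changed_fields": event.get("fields", [])}

def assess_impact(diff: dict) -> dict:
    # Agent B: assess impact from the prior agent's output
    return {"impact": len(diff["changed_fields"])}

def run_chain(event: dict) -> dict:
    result = event
    for agent in (diff_schema, assess_impact):
        result = agent(result)  # output of one step is input to the next
    return result

print(run_chain({"fields": ["customer_id", "order_ts"]}))  # → {'impact': 2}
```

The linear loop is what makes this pattern auditable: every intermediate `result` can be logged or checkpointed at the handoff boundary.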

Parallel fan-out / fan-in

One orchestrator (the central controller that dispatches tasks to sub-agents and collects results) dispatches multiple agents simultaneously. Each works on an independent subtask. Results are collected and synthesized.

Best for

Diagnostic tasks where multiple independent data points are needed simultaneously — root cause analysis, multi-source data quality checks.
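
The fan-out/fan-in shape can be sketched with a thread pool. The three investigator functions below are hypothetical stubs standing in for agent calls, and the return values are illustrative:

```python
# Fan-out/fan-in sketch: dispatch independent investigators
# concurrently, then merge their findings for synthesis.
from concurrent.futures import ThreadPoolExecutor

def trace_lineage():
    return {"source_table": "orders_api.customers"}  # illustrative value

def check_source():
    return {"api_change": None}  # None = explicit "no change detected"

def count_impact():
    return {"affected_models": 7}  # illustrative count

def fan_out_fan_in():
    investigators = [trace_lineage, check_source, count_impact]
    with ThreadPoolExecutor(max_workers=len(investigators)) as pool:
        futures = [pool.submit(fn) for fn in investigators]  # fan-out
        results = [f.result() for f in futures]
    merged = {}
    for r in results:  # fan-in: one place merges all findings
        merged.update(r)
    return merged
```

Because the subtasks are independent, failures stay isolated to one future, and the fan-in step is a natural place to validate completeness before synthesis.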

Hierarchical (orchestrator → specialists)

An orchestrator agent decomposes a complex task and dispatches it to specialist agents based on the nature of each subtask. The orchestrator synthesizes results and may re-route based on what it learns.

Best for

Complex tasks where the decomposition itself requires judgment — the orchestrator doesn't know all the subtasks upfront. Incident triage that may require different specialist combinations depending on what the investigation reveals.

Event-driven mesh

Agents subscribe to event streams and act when relevant events arrive. No central orchestrator — each agent decides whether to act based on the event.

Best for

High-frequency, continuous monitoring scenarios where events arrive constantly and the set of relevant agents isn't known in advance. Schema change monitoring across a large pipeline portfolio.

The reliability math

Before you build a multi-agent system, do the reliability math. This calculation is where most orchestration plans quietly break.

Compound reliability

Compound reliability — the combined success rate when multiple agents run in sequence, calculated by multiplying each agent's individual reliability rate.

Individual agent reliability compounds when agents are chained. If five agents each have 85% individual reliability and are chained sequentially, the system-level reliability is:

0.85 × 0.85 × 0.85 × 0.85 × 0.85 ≈ 0.44 (44%)

Individual reliability   3-agent chain   5-agent chain   8-agent chain
85%                      61%             44%             27%
90%                      73%             59%             43%
95%                      86%             77%             66%
97%                      91%             86%             78%
99%                      97%             95%             92%
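
These figures are straightforward to reproduce. A quick sketch under the same independence assumption:

```python
# Compound reliability for a chain of n independent agents,
# each with per-attempt reliability r, is simply r ** n.
def chain_reliability(r: float, n: int) -> float:
    return r ** n

for r in (0.85, 0.90, 0.95, 0.97, 0.99):
    row = [round(chain_reliability(r, n), 2) for n in (3, 5, 8)]
    print(f"{r:.0%}: {row}")

# 0.85 ** 5 ≈ 0.4437 — the 44% figure quoted above
```

Running the loop reproduces each row of the table, which makes it easy to plug in your own measured per-agent reliability and chain length before committing to a design.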

Under the independence and first-attempt assumptions used in this lesson, a system that succeeds less than half the time is incompatible with typical production SLOs. The math is unforgiving, and it doesn't care how good each individual agent is in isolation.

Production chaining threshold

A common heuristic is 97%+ individual reliability before chaining under independence assumptions — at 97% per agent, five chained agents yield about 0.97^5 ≈ 86% end-to-end reliability. Treat that as a starting benchmark, not a universal guarantee; your acceptable bar depends on blast radius, retries, and validation gates.

Below that threshold, compounding failure dominates: the table above shows how fast long chains fall apart.

Reliability here means the rate at which the agent returns a correct, complete output on the first attempt, without retries. Single-attempt benchmarks tend to overestimate production reliability; measure across multiple runs in shadow mode before using a figure to set SLOs (the reliability targets your system is operationally required to meet).

The practical consequence: measure individual agent reliability in shadow mode (running the agent in parallel with production, observing outputs without acting on them) before chaining. An agent that looks reliable in manual spot-checks often has lower measured reliability in production — and the compounding effect means even a small gap matters.

Partial failures fail silently

The most dangerous failure mode in multi-agent systems is not total failure — it's partial failure.

When one agent in a chain produces degraded output without crashing, downstream agents process that output and produce their own outputs, which may also be degraded. The system appears to work. The data is wrong.

What partial failure looks like:

  • Agent A returns a response that's syntactically valid but semantically incorrect — a schema diff that misses one field
  • Agent B receives this and assesses impact based on an incomplete diff — the impact assessment is wrong, but plausible
  • Agent C proposes a fix based on the incomplete impact — the fix is partial
  • Agent D deploys a partial fix — the pipeline still fails downstream, but in a different way

No errors were thrown. Every agent returned a result. The system logged success.

Three structural defenses
  • Output validation at every handoff — each agent validates the structure and completeness of its input before processing. A schema diff that's missing required fields should throw an error, not get passed forward.
  • State checkpoints — write intermediate state to a persistent store at each handoff. If the chain fails, you have the intermediate outputs for debugging. You can also detect silent degradation by comparing intermediate outputs to expected baselines.
  • Explicit completeness signals — require agents to return not just an answer, but a confidence or completeness signal. "I found 3 affected tables" is better than just the table list — the downstream agent can validate the count against its own lineage query.
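
A minimal sketch of these defenses at a single handoff, with hypothetical field names (a real checkpoint store would be a database or object store, not a temp file):

```python
# Sketch: validate structure, cross-check the completeness signal,
# and checkpoint validated state before passing it downstream.
import json
import os
import tempfile

REQUIRED_FIELDS = {"affected_tables", "table_count"}

def validate_handoff(payload: dict) -> dict:
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        # Defense 1: reject incomplete input instead of passing it on
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    # Defense 3: the explicit completeness signal must agree with the data
    if payload["table_count"] != len(payload["affected_tables"]):
        raise ValueError("completeness signal disagrees with payload")
    # Defense 2: checkpoint the validated intermediate state
    path = os.path.join(tempfile.gettempdir(), "handoff_checkpoint.json")
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload
```

The key design choice is that validation runs at the receiving side of the handoff: a degraded upstream output raises an error here instead of silently propagating.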

State synchronization

Contention risk — the likelihood that two agents will attempt to write to the same shared state simultaneously, producing conflicts or data corruption.

Multi-agent systems share state — a task queue, a log of completed work, a registry of active agents. How that state is managed determines whether the system is reliable under concurrent load. The wrong pattern shows up first as flaky writes or inconsistent reads, which is why Observability and Scaling both assume you have made explicit choices here about contention and ownership.

Optimistic locking
  Data-engineering anchor: the same idea as checking a row version or updated_at in a metadata table before you commit — avoiding blind overwrites when two jobs touch the same record.
  In practice: read → compute → write only if the version still matches; on conflict, retry with fresh state.

Event sourcing
  Data-engineering anchor: the same shape as append-only run logs, audit tables, or lineage events — the log is the source of truth; current state is derived by replaying or projecting.
  In practice: agents append events; readers rebuild state; you get history, replay, and debugging without destructive updates.

Agent-scoped state
  Data-engineering anchor: like split ownership of datasets or namespaces so no one shares a mutable "hot" table — coordination happens through contracts (schemas, APIs) instead of contested writes.
  In practice: each agent owns its store; others interact via structured handoffs or messages, not a shared writable document.
If the jargon feels unfamiliar

Treat locking like merge conflicts on a shared config: two writers, one wins, the other refreshes and retries. Treat event sourcing like immutable job history or DQ findings streams you never UPDATE in place. Treat agent-scoped state like keeping team-owned tables separate and syncing only through agreed exports.

Optimistic locking — each agent reads shared state, processes it, and writes back with a version check. If another agent wrote in the meantime, the write fails and the agent retries. Works well for low-contention scenarios.
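
A minimal sketch of the read → compute → version-checked write loop, using an in-memory store for illustration (a real system would check a version or updated_at column in a database):

```python
# Optimistic-locking sketch: writes succeed only if the version
# read earlier is still current; otherwise the agent retries.
class VersionedStore:
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        if expected_version != self.version:
            return False  # another agent wrote in the meantime
        self.value, self.version = value, self.version + 1
        return True

def update_with_retry(store, fn, max_retries=3):
    for _ in range(max_retries):
        value, version = store.read()        # read current state
        if store.write(fn(value), version):  # write iff version matches
            return True
        # conflict: loop re-reads fresh state and retries
    return False
```

A bounded retry count matters: under heavy contention this pattern degrades into repeated retries, which is the signal to move to event sourcing or agent-scoped state.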

Event sourcing — instead of mutating shared state, each agent appends events to an immutable log. State is derived from the event history. Naturally handles concurrent writes, provides a complete audit trail, and makes it easy to replay or debug. The right choice when multiple agents write to the same shared log simultaneously.
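
A minimal event-sourcing sketch; the event shape, agent name, and projection are all illustrative:

```python
# Event-sourcing sketch: agents append immutable events to a log;
# current state is derived by replaying the log, never updated in place.
log = []

def append(agent, event_type, payload):
    log.append({"agent": agent, "type": event_type, **payload})

def project_null_rate():
    # Derive current state from history; latest observation wins on replay
    state = None
    for event in log:
        if event["type"] == "null_rate_observed":
            state = event["rate"]
    return state

append("quality_agent", "null_rate_observed", {"rate": 0.001})
append("quality_agent", "null_rate_observed", {"rate": 0.083})
print(project_null_rate())  # → 0.083
```

Because nothing is overwritten, the full history stays available for audit and replay: debugging a bad projection means re-running it over the same log.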

Agent-scoped state — each agent owns its own state store. Coordination happens through well-defined interfaces — structured handoffs, event subscriptions — rather than through shared mutable state. The safest pattern for avoiding race conditions, at the cost of more explicit interface design.

When orchestration helps vs. hurts in agentic data engineering

Multi-agent orchestration isn't always the right answer. One preprint (arXiv:2601.14652; Ke et al., 2026) proposes MAS-Orchestra—a training-time framework that treats multi-agent orchestration as holistic function-calling reinforcement learning—and MASBENCH, a controlled benchmark that varies tasks along depth, horizon, breadth, parallelism, and robustness. The authors report that multi-agent gains are not universal: they depend on task structure, verification protocols, and orchestrator and subagent capabilities. That work is LLM multi-agent research rather than a direct study of production data-pipeline orchestration, but the task-dependent takeaway is a useful pattern reference when you choose coordination approaches here.

Task type                                    How multi-agent fares                                        Recommended approach
Sequential dependency chain, single domain   Adds coordination overhead without parallelization benefit   Single agent with multi-step reasoning
Parallel independent subtasks                Strong fit: fan-out/fan-in shines                            Multi-agent (fan-out/fan-in pattern above)
Complex decomposition with judgment          Strong fit: hierarchical orchestrator                        Multi-agent (hierarchical pattern above)
Simple lookup or transformation              Massive overhead for minimal gain                            Direct tool call or traditional pipeline
High-frequency monitoring                    Event-driven mesh works well at volume                       Rule-based alerting for simple thresholds

The decision criterion: does the task have independent subtasks that benefit from parallelization, or require specialist knowledge that exceeds what a single agent can hold in context? If yes, multi-agent orchestration earns its overhead. If not, a single well-designed agent is almost always faster, cheaper, and easier to debug.

Worked example — orders_daily quality incident:

When a customer_id null rate alert fires on orders_daily (baseline: <0.1%, current: 8.3%), a hybrid pattern fits: a hierarchical orchestrator frames the incident, then parallel fan-out to three specialists with fan-in at synthesis (one place merges lineage, source, and impact).

Each handoff must return a specific, verifiable output — not vague "findings":

  • Lineage Agent → Synthesis: specific source table name (e.g. orders_api.customers)
  • Source Agent → Synthesis: timestamp of last API change, or explicit "no change detected"
  • Impact Agent → Synthesis: count of affected downstream models
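
One way to make those handoffs verifiable is to type them. The dataclasses below are a sketch, with field names invented to mirror the bullets above:

```python
# Hypothetical handoff contract for the synthesis step: each specialist
# returns a typed, verifiable finding instead of narrative "findings".
from dataclasses import dataclass
from typing import Optional

@dataclass
class LineageFinding:
    source_table: str                 # e.g. "orders_api.customers"

@dataclass
class SourceFinding:
    api_change_at: Optional[str]      # ISO timestamp of the last API
                                      # change, or None for an explicit
                                      # "no change detected"

@dataclass
class ImpactFinding:
    affected_model_count: int         # verifiable against a lineage query
```

Every field is something the synthesis step can cross-check against data it could query itself, which is what makes silent degradation detectable at fan-in.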
Exercise: Design an Orchestration for a Quality Incident

⏱ 10–15 minutes

Use an AI assistant you already use for work (IDE assistant, chat product, or in-product assistant if your platform provides one). Ask it to reason through the coordination pattern for an orders_daily incident and surface the reliability tradeoffs you would face in production.

Optional: Otto setup

If you use Ascend with Otto, agentic AI can use managed Bedrock models or your own OpenAI / Azure OpenAI keys — see AI providers for setup. The exercise works with any assistant; only provider configuration differs.

No assistant access?

Use the same rubric as a written self-assessment: sketch your pattern choice, reliability math, handoff contract, and silent-failure controls in a doc or shared ticket. No paste required.

Strong answer checklist (self-check):

  • Names a coordination pattern consistent with this module (e.g. hybrid hierarchical orchestrator + parallel fan-out with fan-in at synthesis for this scenario).
  • States how compound reliability applies if agents were chained vs run in parallel, using the module’s framing.
  • Lists concrete handoff fields (e.g. source table name, API change timestamp or explicit “none,” affected downstream count) rather than vague “findings.”
  • Describes how you would detect or bound silent failure (validation at receive, completeness signals, checkpoints, or human gate).

Then paste this:

I'm designing a multi-agent system to investigate quality incidents in our orders_daily pipeline. When a null rate alert fires on customer_id (baseline: <0.1%, current: 8.3%), I need three investigations to run simultaneously: trace lineage to the source table, check whether the upstream orders API changed recently, and count how many downstream models are affected.

Which coordination pattern from sequential handoff, parallel fan-out/fan-in, hierarchical orchestration, and event-driven mesh best fits, and why? If the answer combines ideas (for example orchestrator plus parallel specialists), say so explicitly. Walk me through the reliability math if each individual agent has 95% reliability and the work were chained end-to-end vs parallelized before synthesis. Then list what specific output each agent must return at its handoff to the synthesis step, and how you would catch silent partial failure if one investigator degrades without erroring.

What to notice: You might see a hybrid answer (orchestrator plus parallel fan-out into a synthesizer) — compare that to the worked example above. You might see different handoff field names; check whether they are verifiable against data you could query (table name, timestamp, counts), as in this module’s pattern, rather than narrative-only summaries.

Key takeaways
  • Match the coordination pattern to the task. Sequential handoff for dependent steps. Fan-out for parallel investigation. Hierarchical for complex decomposition. Event-driven for high-frequency monitoring. The wrong pattern adds coordination overhead without benefit.
  • Do the reliability math before you build. At 85% individual reliability, five chained agents yield only 44% system reliability — a common heuristic is about 97%+ per agent before long sequential chains, but treat that as a benchmark to sanity-check plans, not a guarantee.
  • Design for partial failure from the start. Output validation at every handoff, state checkpoints at each step, and explicit completeness signals are the structural defenses. Partial failures don't throw errors — they propagate silently until you find them in the data.

Your agents are orchestrated and running. But how do you know they're still performing the way they were six weeks ago? Agent reasoning degrades in ways that metrics dashboards don't catch. This module sits in the broader ADE 301 — Production arc; when you are ready to gate releases and runbooks, pair these orchestration choices with Production readiness. Use the quiz below to continue to Observability, where we cover what to monitor when agents are in production.

State synchronization mapping — reference for the last quiz question
  • Low contention on one shared mutable record: overlapping writes to the same row or aggregate are rare.
  • Many concurrent appenders: many writers add events to one authoritative, append-only timeline without destructive overwrites.
  • No shared writable store: agents do not co-own one mutable table; coordination is through messages, APIs, or exports only.

Reliable orchestration is the foundation — next, learn how to observe what agents actually do.

Next: Observability →

Additional Reading

Custom agents

When your platform supports user-defined agents, Introducing Custom Agents: Build Your Own AI Specialists in Minutes is a feature overview with example YAML configs. The orchestration patterns above apply on any stack that supports multiple agents.