Context, Tools, and Triggers — The Three Building Blocks
You've inherited a broken pipeline: the nightly Airflow DAG has been red for two hours. You open your agent interface and type: "Fix the broken pipeline."
The agent gets to work. Forty-five minutes later, you have a result — maybe after a flurry of Slack threads you did not have to write. The pipeline runs. And then you look more carefully — the agent has rewritten the entire ingestion flow from scratch. It dropped the incremental load logic your team spent two sprints tuning. It replaced your partner team's carefully negotiated schema contract with a new one the downstream consumers have never seen. The job runs, yes. The data is wrong.
Four words of context — "Fix the broken pipeline" — for a decision that touched every downstream consumer. This is what thin context looks like at production scale.
This didn't happen because the agent was dumb. It happened because "fix the broken pipeline" told it almost nothing useful: not what "fix" meant in your environment, not which pipeline, not what constraints applied, not what actions were available and which were off-limits. The agent did exactly what it was designed to do — it resolved ambiguity with its best guess. The problem was yours: you handed a capable actor the wrong inputs and were surprised by the output.
The difference between an ADE system that does something useful and one that does something destructive almost always comes down to three design decisions: the context it starts with, the tools it can act with, and the triggers that set it in motion.
By the end of this module, you'll be able to: (1) design a context stack for a data engineering agent, (2) define minimum tool scope for a production use case, (3) identify the right trigger types for a given class of pipeline problem, (4) explain why benchmark accuracy overstates production reliability.
The three building blocks
Every well-designed agentic system is defined by these three elements:
- Context — everything the agent knows before it acts
- Tools — every action the agent is capable of taking
- Triggers — every event that causes the agent to activate
In code terms, an agent is implementable as a for loop around an LLM call. Context, tools, and triggers are what you put into that loop each iteration: what the model sees, what actions it may take, and what event starts the next turn.
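The "for loop around an LLM call" framing can be made concrete in a few lines of Python. This is an illustrative sketch, not any particular framework's API: `call_llm`, the message shapes, and the tool functions are all stand-ins for whatever model client and actions your stack provides.

```python
def run_agent(trigger_event, context, tools, call_llm, max_turns=10):
    """Minimal agent loop: the model sees context, may request tool
    calls, and the loop ends when it produces a final answer.
    `call_llm` and `tools` are placeholders for your own stack."""
    messages = [
        {"role": "system", "content": context},        # what the agent knows
        {"role": "user", "content": trigger_event},    # what set it in motion
    ]
    for _ in range(max_turns):
        reply = call_llm(messages)                     # one LLM call per turn
        if reply.get("tool") is None:                  # no action requested: done
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])  # take the action
        messages.append({"role": "tool", "content": str(result)})
    return "max turns reached"                         # hard stop, a guardrail
```

Note how each of the three building blocks appears as a parameter: thin context, an overly broad `tools` dict, or a noisy trigger all degrade the same loop in different ways.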
Get these three right, and you have a system that reliably handles a class of work that used to require human attention. Get them wrong — especially context and tool scope — and you have something that can do real damage quickly. This isn't a reason not to build; it's a reason to build deliberately.
The rest of this module walks through each one in depth, then shows how they work together in a concrete end-to-end example. The most costly design mistake is building a capable agent on a thin context stack — capability without calibration.
Context — everything the agent knows
Context isn't a modifier on agent quality — it is agent quality. The quality of the decisions your agent makes is bounded, almost entirely, by the quality of the information it starts with.
The same model, with richer context, will make better decisions, catch more edge cases, and stay aligned with your team's standards. With thin context, it will hallucinate, generalize from training data, and fill in the gaps with its best guess. "Best guess" against production data with real downstream consequences is not where you want to be.
Here's what a real context stack looks like for a data engineering agent:
- System instructions: The agent's standing rules and persona. What role is it playing? What is it never allowed to do? These are your non-negotiables, set once and applied to every interaction.
- Rules and conventions: Your team's standards, encoded as files the agent reads before acting, such as `code_standards_python.md`, `operations_scheduling.md`, and `data_contract_v2.md`. If a standard lives only in someone's head, the agent doesn't know it exists. If it's a file in the repository the agent can reference, it's part of the context stack.
- Schema metadata and lineage: What does this dataset look like? Where did it come from? What are the field types, nullability constraints, known upstream quirks? Without this, an agent working on a transformation is guessing at the data model.
- Data profiles: Distributions, null rates, cardinality, historical ranges — what "normal" looks like for this dataset. This is how an agent tells the difference between a legitimate upstream change and something that's actually broken.
- Past decisions — if logged: Agents don't retain memory across invocations by default; each run starts with a fresh context window — the running total of text the model can hold in mind at once, measured in tokens. Making past reasoning available requires deliberately writing structured decision logs to a store (a database, a file, a vector index) and retrieving relevant records into context when the agent activates. Platforms that support this gain compounding value over time — but it doesn't happen automatically.
- Business rules: The "why" behind data requirements. Why does this column never contain nulls in production even though the schema allows them? Why is this join key a composite instead of a surrogate? Business rules that aren't written down become invisible to every system that isn't a person who was in the room.
If a standard lives only in someone's head, the agent doesn't know it exists. Context stacks are only as complete as what's written down and accessible to the agent.
The practical implication: before you write a single prompt, ask yourself what the agent would need to know to make the decision you'd make. Then make sure that information is actually available to it — not just somewhere in a Confluence page, but accessible, structured, and legible to the agent.
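The "past decisions" category in particular has to be built deliberately, since agents start each run with a fresh context window. A minimal sketch of a decision log, assuming a JSON-lines file as the store (a real system would more likely use a database or vector index; all names here are illustrative):

```python
import json
import time

def log_decision(path, pipeline, summary, action):
    """Append one structured decision record as a JSON line."""
    record = {"ts": time.time(), "pipeline": pipeline,
              "summary": summary, "action": action}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def recall_decisions(path, pipeline, limit=5):
    """Retrieve the most recent decisions for one pipeline,
    ready to be injected into the agent's context at activation."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    matches = [r for r in records if r["pipeline"] == pipeline]
    return matches[-limit:]
```

The design point isn't the storage format; it's that *write at decision time, retrieve at activation time* is an explicit step you have to engineer, or the compounding value never materializes.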
Loading all six context categories simultaneously sounds straightforward until you're pulling 50,000 lines of execution history plus a full lineage graph. Frontier models now support very large context windows, and hitting those limits isn't even the main risk. Simply appending more context can actively hurt decision quality: research on long-context performance ("Lost in the Middle") found that accuracy degrades significantly when relevant information sits in the middle of a long context, and is highest when the most important information appears at the beginning or end. More context doesn't mean better results; it sometimes means the right signal is buried.
The discipline of context engineering includes knowing what to leave out: retrieve the most relevant logs rather than all logs, summarize rather than append, prioritize the information most likely to affect the specific decision at hand. At production scale, retrieval-augmented approaches (fetching relevant documents before the model generates a response) that surface the right context on demand are typically more cost-efficient for high-volume, targeted lookups than passing full context windows — meaning context engineering is as much a cost decision as a quality decision.
Context engineering vs. prompt engineering
Most conversations about working with AI agents focus on prompt engineering: the specific language you use when you make a request. It matters. But it's the smallest part of the leverage.
Context engineering is the broader discipline — giving the system access to useful information via rules, tools, and configuration that apply across every interaction. A well-engineered context stack means every agent interaction starts from a strong baseline, regardless of how the prompt is written. Prompt engineering optimizes individual interactions. Context engineering optimizes the system.
| | Prompt engineering | Context engineering |
|---|---|---|
| Scope | One interaction | Every interaction |
| Lever | Wording of the request | Rules, metadata, tools, configuration |
| Compounds? | No — each prompt optimizes one turn | Yes — each standards file improves all future turns |
| Payoff timeline | Immediate | Weeks and months |
In practice: once your basic scaffolding is in place, spending most of a week writing and organizing your team's standards into structured reference files delivers more durable value than the same time iterating on prompt wording. Both matter — but if forced to choose, context stack investment compounds; individual prompt refinements don't.
Writing better intent specifications
When you do write a prompt, the single biggest lever is clarity about outcome, not steps. The agent will figure out the steps. You need to specify what success looks like, what the constraints are, and what the agent should not do.
Here's the same request, written two ways:
# Vague — don't do this
Fix the broken pipeline.
# Specific — do this
The `orders_daily` pipeline failed on the most recent run. Investigate the root cause.
If the upstream schema has changed, identify the affected components downstream and
propose a patch. Flag any downstream consumers that will need a coordinated update.
Do not modify any downstream consumer configurations directly.
The second prompt is longer. It's also dramatically more likely to produce a useful result — because the agent now knows what it's working on, what it's allowed to do, what it's not, and what the output should look like.
A few principles that consistently improve intent specification:
- Clarity — specify what outcome you want, not what steps to take
- Role — give the agent a clear function: investigator, not fixer; proposer, not deployer
- Examples — one concrete example of correct output is worth several paragraphs of description
- Calibration — treat the first prompt as a hypothesis, not a final answer; iterate from what you observe
In Ascend, Otto surfaces lineage, metadata, and execution context within your workspace, using user-granted access, so you don't have to stitch together lineage and run metadata by hand before each turn. The platform draws on the unified metadata, lineage, and run history it maintains, while teams author standards files and business rules to complete the context stack. The architectural goal is the same regardless of platform; the variable is how much you build versus configure.
Context tells the agent what to know. But knowing and doing are different things — which is where tools come in.
Tools — what the agent can act on
Without tools, an agent can only talk. It can analyze, suggest, explain, and plan — all inside the conversation window. The moment you give it tools, it can act in the world: read a file, run a query, commit code, call an API, send a notification.
The tools available to a data engineering agent typically include:
| Tool type | Example | Risk if unscoped |
|---|---|---|
| File operations | Read, edit, create source files — components, transformations, configs | Overwrite production configs or schema contracts |
| Query execution | Run SQL against dev/staging to validate logic | Execute against production without row limits or guardrails |
| Git operations | Read commit history, diff changes, open pull requests | Merge to main without review; force-push history rewrites |
| Code execution | Run Python or shell scripts, execute data quality checks | Run destructive scripts in the wrong environment |
| Metadata inspection | Read schema definitions, lineage graphs, dependency maps | Expose sensitive catalog data outside authorized roles |
| External integrations via Model Context Protocol (MCP) | Notification systems, ticketing tools, data catalogs, external APIs | Fire production alerts or open tickets without human review |
That last row deserves more explanation. MCP — the Model Context Protocol — is an open protocol announced by Anthropic in November 2024 as "a new standard for connecting AI assistants to the systems where data lives," and the community implementation count is growing fast (GitHub: model-context-protocol topic).
MCP standardizes how agents connect to external services across three primitives: tools (actions the agent can invoke), resources (data the server exposes directly into context — schema definitions, catalog entries, lineage data), and prompts (reusable templates). Instead of writing custom integration code for every combination, you implement an MCP server once per service.
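The three primitives can be illustrated with a toy server-side registry. To be clear, this is a conceptual sketch and not the real MCP SDK: an actual server speaks JSON-RPC per the MCP specification, and you'd use an official SDK for your language in practice. All names and the resource URI below are invented for illustration.

```python
class ToyMCPServer:
    """Conceptual sketch of MCP's three primitives; not the real
    protocol, which runs over JSON-RPC with a defined handshake."""
    def __init__(self):
        self.tools = {}       # actions the agent can invoke
        self.resources = {}   # data exposed directly into context
        self.prompts = {}     # reusable templates

    def tool(self, name):
        """Decorator that registers a callable as a named tool."""
        def register(fn):
            self.tools[name] = fn
            return fn
        return register

    def call_tool(self, name, **kwargs):
        return self.tools[name](**kwargs)

server = ToyMCPServer()

# A resource: schema data the server exposes into agent context.
server.resources["schema://orders_daily"] = {
    "order_id": "string", "order_status": "string", "amount": "decimal",
}

# A tool: an action the agent can invoke (hypothetical ticketing call).
@server.tool("open_ticket")
def open_ticket(title):
    return f"TICKET-001: {title}"
```

The point of the separation: resources flow *into* context (the agent reads them), while tools flow *out* as actions (the agent invokes them), and your scoping policy can treat the two very differently.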
Having an MCP server for an external service doesn't automatically make it available to your agent. Your agent runtime also needs to support MCP as a client. Verify client support in your specific stack before depending on MCP in production — ecosystem coverage is growing fast but uneven.
The hardest MCP decision isn't technical — it's a policy question: which external connections should the agent have, and under what conditions?
Tool scope is a risk decision
Every tool you give an agent is a potential action in the world. An agent with read access to a staging database is a different risk profile than one with write access to production. An agent that can open pull requests is a different risk profile than one that can merge and deploy them.
The practical guidance: scope tools to the minimum necessary for the task, and keep humans in the loop for any tool that touches production state or has irreversible consequences. "The agent could do this" and "the agent should have access to do this" are different questions. When you're designing an agent for a new use case, list every tool it will need — and explicitly decide what it won't have access to. The list of what the agent can't do is as important as the list of what it can do. A well-scoped tool list is a forcing function for clearer agent design: if you can't decide what the agent shouldn't do, you don't yet have a clear picture of what it should.
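Minimum scope can be enforced in code rather than by convention. A sketch of an allowlist wrapper that rejects unlisted tools outright and routes production-touching actions through a human-approval callback; the tool names and the shape of the approval hook are illustrative assumptions:

```python
class ScopedToolbox:
    """Wraps raw tools with an explicit allowlist and a
    human-approval gate for actions marked as high-risk."""
    def __init__(self, tools, allowed, needs_approval, approve):
        self.tools = tools                         # name -> callable
        self.allowed = set(allowed)                # what the agent CAN do
        self.needs_approval = set(needs_approval)  # production-touching actions
        self.approve = approve                     # human-in-the-loop callback

    def call(self, name, **kwargs):
        if name not in self.allowed:
            # The "can't do" list is enforced, not just documented.
            raise PermissionError(f"tool not in scope: {name}")
        if name in self.needs_approval and not self.approve(name, kwargs):
            return {"status": "blocked", "reason": "awaiting human approval"}
        return {"status": "ok", "result": self.tools[name](**kwargs)}
```

Writing the `allowed` and `needs_approval` sets forces exactly the design conversation described above: if you can't fill them in, the agent's job isn't defined yet.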
Tool reliability in multi-step pipelines
Individual tool calls are reasonably reliable: leading models achieve strong but variable scores on tool-calling benchmarks — see current rankings from Berkeley's Function-Calling Leaderboard for the latest numbers across models and task types. In multi-step pipelines, that's not the number that matters. Errors compound.
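The compounding is easy to quantify: if each step succeeds independently with probability p, a chain of n steps completes with probability p^n. Even a strong per-call success rate erodes quickly at realistic pipeline depths (the independence assumption is a simplification, but the direction of the effect holds):

```python
def chain_reliability(p, n):
    """Probability that an n-step chain completes when each
    tool call independently succeeds with probability p."""
    return p ** n

# A 98%-reliable tool call, 20 steps deep, completes only ~2/3 of the time.
print(round(chain_reliability(0.98, 20), 3))   # → 0.668
```

This is the arithmetic behind checkpoints and human escalation: shortening the unverified chain raises end-to-end reliability more than marginally improving any single call.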
Benchmark scores overstate production-like reliability — you need to test under input variation and simulated faults, not only headline leaderboard numbers. ReliabilityBench found that minor rewording of tasks alone produced material accuracy drops before any infrastructure stress, and that standard single-run benchmark scores significantly overstated reliability under simulated production conditions. Combining semantic variation with simulated tool and API failures produced further degradation.
Research analyzing agent failure archetypes from simulation-based evaluation identifies four recurring patterns:
- Premature action without grounding — acting before verifying the current state
- Over-helpfulness substituting missing entities — fabricating a missing data element rather than asking for clarification
- Context pollution from distractor data — irrelevant context overriding relevant signals
- Fragile execution under load — degraded reasoning when the task complexity or context size increases
Every item on this list is an argument for design review before deployment — not more prompting after the fact.
All four failure modes trace back to context and tool design, which is why those two decisions have such outsized impact on system reliability. The implication for tool design: build checkpoints, log intermediate states, and escalate to humans when the reasoning chain gets long.
Context and tools define an agent's operating envelope. Triggers are what set it in motion.
Triggers — when and why the agent activates
A trigger is the event that tells an agent it's time to act. Without a trigger, an agent is a capability waiting to be invoked. With the right trigger, it's a system that handles a class of problems automatically, without requiring a human to notice and respond.
The main trigger types in data engineering contexts:
- User prompts: Direct invocation — you type a request, the agent responds. Useful for exploratory and one-off tasks, but requires a human to be in the loop.
- Schedules: Time-based triggers. Unlike cron, which executes a fixed script on a timer, a scheduled agent can reason about what it finds — interpreting ambiguous conditions, drafting responses, and choosing between courses of action without pre-programmed branches.
- Data arrival events: A new partition lands in the warehouse, a file appears in a watched location, an API pushes a webhook. The data arriving is the trigger.
- Pipeline failures: The job fails; the agent investigates. This is the category that most immediately addresses the "3am alert problem."
- Schema changes: An upstream API changes its response format; the agent detects the diff and starts the impact assessment before a human even knows the change happened.
- Quality alerts: A data quality check fails — unexpected nulls, distribution shift, referential integrity violation — and the agent begins root-cause analysis rather than just filing an alert.
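Wiring these trigger types up can be as simple as a dispatch layer that maps each event class to the agent entry point, and context assembly, appropriate for it. A sketch under the assumption that events arrive as dicts with a `kind` field; the handler bodies are placeholders for real agent activations:

```python
def route_event(event, handlers, default=None):
    """Map a raw event to the agent activation that should handle it.
    `handlers` maps event kinds (pipeline_failure, schema_change, ...)
    to callables that assemble context and invoke the agent."""
    handler = handlers.get(event["kind"], default)
    if handler is None:
        # An unrouted event should fail loudly, not vanish.
        raise ValueError(f"no trigger configured for {event['kind']}")
    return handler(event)

# Placeholder activations; real ones would assemble context and run the agent.
handlers = {
    "pipeline_failure": lambda e: f"investigate {e['pipeline']}",
    "schema_change":    lambda e: f"assess impact on {e['pipeline']}",
    "quality_alert":    lambda e: f"root-cause {e['check']}",
}
```

Starting with one `kind`, the one that pages people most often, and adding others incrementally keeps the trigger surface auditable.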
The best triggers are the ones that currently wake you up at 3am. If there's a category of alert that reliably requires someone to be interrupted, context-switch, dig through logs, and make a decision — that's a trigger worth engineering. The agent doesn't need to solve the problem autonomously every time. A trigger that wakes the agent instead of waking you, and surfaces the right information so you can make a faster decision, is already a significant win.
With the right trigger architecture in place, entire categories of interrupt work — schema drift, quality gate failures, incremental pipeline repairs — can transition from reactive human response to structured agent workflows. The agent handles the diagnostic loop; you handle the judgment calls.
Putting it together
Here's how context, tools, and triggers work together in a realistic scenario.
At 03:47 UTC, an upstream API deploys a new version that renames a field. Your orders_daily pipeline runs at 04:00, hits the mismatch, and fails.
1. Trigger fires — The `orders_daily` pipeline fails at 04:00 UTC. The failure event activates the agent.
2. Context assembled — Before acting, the agent reads: the last 5 execution logs for this pipeline, git history of the failed component (in Ascend, a reusable data transformation or connector in a flow — see Your first pipeline), component metadata and downstream consumers, the last-known-good schema definition, and `code_standards_python.md` for parser conventions.
3. Agent decides — It identifies the cause: an upstream API renamed a field. It plans a targeted parser update, not a full rewrite.
4. Tools invoked — The code editor updates the parser. The query runner validates the fix in staging. The notification tool opens a PR and sends a Slack summary to `#data-oncall`.
Result: Parser updated. Fix validated. PR opened with full context. No one's phone buzzed at 3:47am. When the team comes online, the PR review takes three minutes.
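The diagnostic core of this scenario, spotting a renamed field, reduces to comparing the last-known-good schema against what the source now returns. A minimal sketch, with field names invented for illustration:

```python
def schema_diff(known_good, observed):
    """Compare field sets; a rename shows up as a pairing of one
    missing field and one unexpected field."""
    missing = sorted(set(known_good) - set(observed))
    unexpected = sorted(set(observed) - set(known_good))
    return {"missing": missing, "unexpected": unexpected}

known_good = ["order_id", "order_status", "amount"]
observed = ["order_id", "status", "amount"]   # upstream renamed a field
print(schema_diff(known_good, observed))
# → {'missing': ['order_status'], 'unexpected': ['status']}
```

The check itself is trivial; the leverage comes from the context stack that makes `known_good` available to the agent at 03:47 without a human fetching it.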
Try it yourself (⏱ 10 minutes)
The module's core claim is that context quality is the highest-leverage variable in agent behavior. This prompt makes that visible by showing the same request handled two ways — with thin context and with rich context.
Open any LLM — Claude, ChatGPT, or Gemini work well — and paste this:
I'm going to give you two versions of the same request. For each, tell me what you would produce and what the likely outcome would be.
Request A:
"Fix the broken orders_daily pipeline."
Request B:
"The orders_daily pipeline failed on this morning's run at 04:00 UTC. The error in the logs is: KeyError: 'order_status'. The pipeline reads from a source API that deploys new versions weekly. Investigate the root cause — check whether an upstream field was renamed or removed. If you find the cause, propose a targeted fix to the parser only. Do not modify any downstream transforms. Flag all downstream components that consume order_status for a coordinated review. Before proposing any fix, confirm the list of affected downstream components."
What's the difference in what you'd produce from A vs. B, and why?
What to notice: For Request A, notice how the LLM resolves the ambiguity — it fills in the blanks with something that sounds reasonable. That's exactly what an agent does, at machine speed, across production systems. Request B's constraints aren't just politeness — each one ("do not modify downstream transforms," "confirm before proposing") is a guardrail that limits scope and creates a verification step before action. The difference between the two responses is a live demonstration of why context engineering is the highest-leverage skill in ADE.
- Context engineering is the highest-leverage skill in ADE. The quality of the context stack — the rules, metadata, lineage, standards, and historical reasoning available to the agent — determines the quality of its decisions. More context isn't always better; knowing what to leave out is part of the discipline.
- Tools separate a chat-only assistant from an agent. Scope them to the minimum necessary. Every tool is a potential action in the world — the list of what the agent can't do is as important as the list of what it can.
- Benchmark scores overstate production-like reliability — evaluate under input variation and simulated faults, not headline numbers alone. ReliabilityBench found material accuracy drops from minor task rewording alone and large gaps between standard single-run benchmark scores and simulated production conditions.
- Good triggers connect agents to real events. The best place to start is whatever category of problem is currently waking someone up at 3am — even partial automation that surfaces the right context is faster than manual log-diving.
You now have a working framework for any agent: what it knows, what it can do, and what sets it in motion. The harder question is where to point it first, and how the same context, tools, and triggers logic applies differently depending on where you are in the data lifecycle. The next module maps context, tools, and triggers across the full DataOps lifecycle, from ingest to modernization, so you can identify which parts of your team's work are the highest-value candidates for an agentic approach.
Next: Agents Across the DataOps Lifecycle →
Additional Reading
- Model Context Protocol — Announcement — The original MCP launch post from Anthropic: the open spec, reference SDKs, and initial Claude Desktop server integrations.
- MCP Specification Documentation — The protocol's three core primitives — tools, resources, and prompts — with implementation details for server and client developers.
- Lost in the Middle: How Language Models Use Long Contexts — The empirical basis for context ordering and retrieval discipline: performance degrades when relevant information appears in the middle of long contexts.
- ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions — Minor task rewording and simulated production conditions reveal systematic gaps between leaderboard scores and real-world agent reliability; essential reading before designing evaluation plans.
- coSTAR: How We Ship AI Agents at Databricks Fast Without Breaking Things — Databricks Blog, Mar 2026. The coSTAR methodology for building and testing production agents at Databricks: LLM-judge alignment, scenario-driven evaluation, and continuous refinement using MLflow.