What's the Concept?
When data engineering and AI engineering teams ship things separately, the friction lives at the boundary: an agent that worked yesterday breaks today because a column got renamed; a new agent capability is blocked because the data team doesn't know what shape it needs.
The fix is to make the boundary explicit. The set of tools the agent can call — their names, parameters, return shapes, performance and freshness guarantees — is a versioned interface between the pipeline and the agent. Both sides commit to it. Breaking changes go through deprecation; capabilities are added in versioned form.
How It Works
The contract has five parts, each worth writing down:
1. Tool inventory. A canonical list of the tools the agent can call. For each: name, one-paragraph description, schema, owning team. It lives as a YAML file in the repo or a Confluence/Notion page, but either way it is the source of truth, not an informally maintained doc. For example:
```yaml
# agent_tools/billing_agent.yaml
agent_id: billing_agent
version: 1.4.0
owning_team: agents
tools:
  - name: get_billing_context
    description: Fetch a customer's billing summary.
    input_schema: { ... }
    output_schema:
      type: object
      properties:
        customer:
          type: object
          properties: { customer_id, email, plan_name, spend_last_90d_cents, ... }
    freshness_sla: 1 hour
    cost_ceiling_bytes: 10000000   # 10 MB
    owning_team: platform-data
  - name: search_similar_tickets
    description: Find tickets similar to a query for this customer.
    input_schema: { ... }
    output_schema:
      type: array
      items: { ... }
    freshness_sla: 24 hours
    cost_ceiling_bytes: 50000000   # 50 MB
    owning_team: platform-data
```
2. Freshness contract. Each tool declares how stale its data can be: "refreshed hourly" vs. "near-real-time, sub-30-second lag" vs. "daily batch, latest available is yesterday's close of day." The agent's prompt can reference this, and the tool response should include a `_refreshed_at` timestamp, as in the sketch below.
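A sketch of what that looks like on the agent side, with illustrative payload values (the `dw.billing_summary_hourly` source name is made up) and the 1-hour SLA from the inventory above:
```python
from datetime import datetime, timedelta, timezone

# Illustrative get_billing_context response carrying its own freshness metadata.
response = {
    "customer": {"customer_id": "c_123", "plan_name": "pro", "spend_last_90d_cents": 48200},
    "_refreshed_at": "2025-06-02T14:05:00+00:00",
    "_source": "dw.billing_summary_hourly",
}

def is_within_sla(resp: dict, sla: timedelta = timedelta(hours=1)) -> bool:
    # Compare the payload's own timestamp against the tool's declared freshness SLA.
    refreshed = datetime.fromisoformat(resp["_refreshed_at"])
    return datetime.now(timezone.utc) - refreshed <= sla
```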
3. Schema versioning. Output schemas are explicit. Adding fields is non-breaking; renaming or removing fields requires a new tool version. The deprecation policy is written: "v1.x is supported for 90 days after v2 ships."
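A minimal sketch of that distinction using hypothetical Pydantic models (the `delinquent` and `plan_tier` fields are invented for illustration):
```python
from pydantic import BaseModel

# v1 output model for get_billing_context, mirroring the inventory sketch above.
class BillingContextV1(BaseModel):
    customer_id: str
    plan_name: str
    spend_last_90d_cents: int

# v1.1: adding an optional field is non-breaking; existing consumers ignore it.
class BillingContextV1_1(BillingContextV1):
    delinquent: bool | None = None

# Renaming plan_name to plan_tier is breaking, so it ships as a new tool
# version and v1.x stays supported through the 90-day deprecation window.
class BillingContextV2(BaseModel):
    customer_id: str
    plan_tier: str
    spend_last_90d_cents: int
```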
4. Cost ceiling. Each tool declares a maximum bytes-billed per call. The implementation enforces it (BigQuery's maximum_bytes_billed). Agent-side rate limiting on top of that gives a cost-per-conversation bound.
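A sketch of the data-side enforcement, assuming a BigQuery-backed tool; the helper name is illustrative:
```python
from google.cloud import bigquery

client = bigquery.Client()

def run_tool_query(sql: str, cost_ceiling_bytes: int) -> list[dict]:
    # BigQuery rejects the query outright if it would bill more than the
    # ceiling declared in the tool inventory, so a runaway scan can't happen.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=cost_ceiling_bytes)
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row) for row in rows]

# e.g. run_tool_query("SELECT ... FROM billing_summary WHERE ...", cost_ceiling_bytes=10000000)
```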
5. Failure modes. What does each tool return when the lookup fails? Empty array? Null object? Error code? Pick a convention and stick to it. Surprising error shapes are the most common cause of "the agent went off the rails."
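One workable convention, sketched below (the envelope shape is an assumption, not a standard): every tool returns the same wrapper, so a failure never produces a surprise shape.
```python
from typing import Any

def tool_envelope(data: Any = None, error: str | None = None) -> dict:
    # One consistent shape for every tool response, success or failure.
    return {
        "ok": error is None,
        "data": data,    # None on failure, never a half-populated object
        "error": error,  # stable machine-readable code, e.g. "customer_not_found"
    }

# Success and failure parse the same way on the agent side:
found = tool_envelope(data={"customer_id": "c_123", "plan_name": "pro"})
missing = tool_envelope(error="customer_not_found")
```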
Why It Matters
- Coordination cost drops. With the contract written, the data team can change implementations without the agent team noticing. The agent team can build against the contract before all the data lands.
- Drift becomes detectable. Every release that touches a tool runs a contract test: "does the live response match the declared schema?" (see the sketch after this list). Schema drift can no longer ship silently.
- Rollback is cheap. A bad new tool version can be hidden behind a feature flag and rolled back without touching the agent code. Just point the registry back at v1.
- New agents come up faster. A second agent that needs the same data uses the same tools. Tools are reusable across agents.
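What such a contract test can look like, as a sketch: it loads the inventory YAML above and validates a live (staging) response against each declared output schema. `call_tool_staging` is a hypothetical helper, and the placeholder `{ ... }` schemas in the example file would need to be filled in for this to pass.
```python
import pytest
import yaml
from jsonschema import ValidationError, validate

def load_inventory(path: str = "agent_tools/billing_agent.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def call_tool_staging(tool_name: str) -> dict:
    # Hypothetical helper: invoke the live tool in staging with fixture inputs.
    raise NotImplementedError

@pytest.mark.parametrize("tool", load_inventory()["tools"], ids=lambda t: t["name"])
def test_live_response_matches_contract(tool):
    response = call_tool_staging(tool["name"])
    try:
        validate(instance=response, schema=tool["output_schema"])
    except ValidationError as exc:
        pytest.fail(f"{tool['name']} drifted from its declared schema: {exc.message}")
```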
Key Technical Details
- The output schema is the most important part. If it's not strict, downstream parsing has to handle every variation. Use JSON Schema or Pydantic models; reject responses that don't validate before they reach the agent (see the sketch after this list).
- Treat tool inventory as code. PRs that modify it require review from both the data and agent teams. CI runs the schema validation tests.
- Tool descriptions are part of the model's prompt, so they're billed on every request. Keep them tight but unambiguous: "tight enough that the model knows when to call it, loose enough that the description still makes sense to a human."
- Always include `_refreshed_at` and `_source` in the response. The agent can include them in its answer if confidence demands it.
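A minimal sketch of that validation gate with Pydantic v2, using the field names from the inventory above (the aliases map the underscore-prefixed metadata keys):
```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

class BillingCustomer(BaseModel):
    customer_id: str
    email: str
    plan_name: str
    spend_last_90d_cents: int

class BillingContextResponse(BaseModel):
    customer: BillingCustomer
    refreshed_at: datetime = Field(alias="_refreshed_at")
    source: str = Field(alias="_source")

def parse_tool_response(raw: dict) -> BillingContextResponse:
    # Reject anything that violates the contract before the model ever sees it.
    try:
        return BillingContextResponse.model_validate(raw)
    except ValidationError as exc:
        raise RuntimeError(f"get_billing_context violated its contract: {exc}") from exc
```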
Common Misconceptions
"We can iterate later." Tools without a contract become legacy in weeks. Each fix to one agent breaks another; each new agent re-invents fields. Write the YAML on day one — five lines saves a quarter.
"The model handles missing fields." It does, usually, but unpredictably. Defining the schema means the agent's behavior is predictable, which is the actual product property you care about.
"Tools are owned by the agent team." The implementation might be, but the data shape inside the tool is the data team's responsibility. Co-ownership, with the contract as the boundary, works best.
Connections to Other Concepts
- Module 06 — Pipeline orchestration uses the contract to schedule refreshes that meet the freshness SLA.
- Module 07 — Monitoring tracks compliance with the contract: schema validation rates, freshness lag, cost-per-call.
- Module 08 — The capstone implements a small but complete contract end-to-end.
Further Reading
- Andrew Jones, Driving Data Quality with Data Contracts (Packt, 2023) — The book that made "data contracts" a first-class engineering term. The producer-consumer pattern with SLA + schema is exactly what this lesson applies to agent tools.
- Anthropic, "Building Effective Agents" (Dec 2024) — Frames the prompt-chaining / routing / orchestrator-worker patterns in terms of the tool interfaces they need. https://www.anthropic.com/research/building-effective-agents
- Anthropic, "Tool use with Claude" docs — Reference for tool-schema design; updated alongside model releases.
- OpenAI Function Calling and Structured Outputs docs — Equivalent on OpenAI's side; the contract pattern is identical.
- Vertex AI Agent Builder docs — Google's first-party agent framework; same tool-call mechanics.
- JSON Schema / Pydantic / OpenAPI — Pick whichever your stack already uses; the contract belongs in a schema language you can validate at runtime, not free prose.