What's the Concept?

The most common source for an agent's data is a third-party API: Stripe billing, Salesforce records, a partner's product catalog, an internal microservice that holds the source of truth. None of these are warehouses. You can't query them directly without rate-limit pain. The job is to copy their data — on a schedule that matches your freshness requirement — into bronze, where everything else can take over.

The naive version is a Python script in a cron job. The production version handles pagination, rate limits, retries, schema drift, late-arriving data, and provenance. The shape is otherwise identical.

How It Works

A batch puller's anatomy:

# Pseudo-code for a typical batch puller
def pull_orders(since: datetime) -> int:
    """Pull orders from the source API into GCS bronze."""
    bucket = "myco-lake-bronze"
    prefix = f"source=salesforce/entity=orders/ingestion_date={today()}"
 
    cursor = None
    page = 0
    written = 0
    while True:
        resp = http.get(
            "https://api.salesforce.com/orders",
            params={"updated_since": since, "cursor": cursor, "limit": 500},
            headers=auth_headers(),
            timeout=30,
        )
        resp.raise_for_status()           # let retries handle 5xx
        rows = resp.json()["data"]
        if not rows:
            break
 
        # Land raw — do not transform here
        gcs_write_jsonl(
            bucket=bucket,
            path=f"{prefix}/page={page:05d}.jsonl.gz",
            rows=rows,
            metadata={"source_cursor": cursor, "pulled_at": now_iso()},
        )
        written += len(rows)
 
        cursor = resp.json().get("next_cursor")
        if not cursor:
            break
        page += 1
 
    return written

Key choices visible above:

  • Land JSONL, gzipped, partitioned by ingestion_date. Bronze is append-only; the partition key is the date of ingestion, not the date of the event (those are different and both matter).
  • Watermark on updated_since. The puller pulls only what's changed since its last run, plus a small overlap window for safety. This is how you keep batch jobs cheap as the dataset grows.
  • Don't transform. Every transformation belongs downstream of bronze. Save the raw payload and a few _meta fields. That's it.

For GCP, the typical home for this puller is Cloud Run (HTTP service) triggered by Cloud Scheduler, or for heavier jobs, a Cloud Composer Airflow DAG. Both write to the same GCS bucket using the same path convention.

Why It Matters

  • APIs are the dominant data source for SaaS-shaped businesses. Most agent stacks ingest from at least one API; many ingest from five or ten.
  • Batch is "good enough" for most freshness requirements. Hourly batches give sub-hour freshness, which is fine for the vast majority of agent use cases.
  • Done right, batch is cheap. A Cloud Run puller pulling 100k rows hourly costs single-digit dollars per month.

Key Technical Details

  • Make every GCS write idempotent — the deterministic object path is your idempotency key. Re-running the same job should overwrite the same file, not create a parallel one.
  • Respect rate limits with backoff (exponential, capped at ~60s). Most APIs publish a Retry-After header; honor it.
  • Keep watermarks in a small BigQuery table (ingest_watermarks) or Firestore. Don't store them in environment variables — they need to survive deploys.
  • Standard timeout / retry budget: connect 5s, read 30s, three retries, total request budget ≤2 minutes.
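The backoff rule above can be sketched as a small helper. Returning the delay (rather than sleeping) keeps it testable; the jitter term is a common addition the text doesn't mandate:

```python
import random
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait before retry `attempt` (0-based).

    Exponential backoff capped at 60s; a server-provided Retry-After
    header always wins.
    """
    if retry_after is not None:
        return float(retry_after)
    base = min(2 ** attempt, 60)
    return base + random.uniform(0, 1)  # jitter avoids synchronized retries
```

Call it between attempts: `time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))`.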

Common Misconceptions

"Just full-load every time." Works at small scale; becomes infeasible at production scale. Watermarked incremental pulls are the production pattern.

"Schema validation belongs in the puller." No — validate in the bronze→silver step. The puller's job is to capture, not judge. Schema drift should produce a silver-layer alert, not a dropped record.

Connections to Other Concepts

Further Reading

  • Google Cloud, "Build a data pipeline with Cloud Run + Cloud Scheduler" tutorial.
  • Singer.io spec — A common protocol for incremental API extraction, worth reading even if you don't use it.
  • Fivetran / Airbyte docs — Hosted versions of this same pattern, useful for understanding the production behaviors you'll need.