What's the Concept?

For most batch transformations, BigQuery SQL is enough. The query plans, the parallelism, the scaling — BigQuery handles all of it. You write SELECT, it scans terabytes, you get an answer.

There are cases where SQL isn't enough:

  • Streaming with stateful windows. Counting events in a five-minute sliding window across a continuous stream. SQL can express this, but the streaming semantics get hairy (see the sliding-window sketch below).
  • Per-row enrichment via external services. Calling an embedding API for every row, with backpressure and retry. Doable in SQL with ML.GENERATE_EMBEDDING, but for non-Vertex APIs you need code.
  • Massive joins across multiple streams. Joining two real-time event streams on a key, with watermarks and late data handling.
  • Complex JSON unflattening. Parsing deeply nested payloads that don't fit BigQuery's struct semantics neatly.

Dataflow is built for these. It's a managed runner for Apache Beam, a unified batch + streaming framework. You write a pipeline in Python or Java; Dataflow runs it on autoscaling workers.
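
To make the first bullet concrete, here is roughly what a five-minute sliding-window count looks like in Beam. It's a sketch, separate from the pipeline below; the event_type key and the one-minute slide period are illustrative assumptions.

import json
import apache_beam as beam
from apache_beam.transforms import window

def count_per_type(events):
    """Count events per type in five-minute windows that advance every minute."""
    return (
        events
        | "Parse"  >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        # 300-second windows emitted every 60 seconds; each event falls into five overlapping windows.
        | "Window" >> beam.WindowInto(window.SlidingWindows(size=300, period=60))
        | "Key"    >> beam.Map(lambda r: (r["event_type"], 1))
        | "Count"  >> beam.CombinePerKey(sum)
    )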

How It Works

A Dataflow pipeline that enriches a Pub/Sub stream with embeddings and lands the result in BigQuery:

import json
from datetime import datetime, timezone

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
 
def call_embedding_api(row: dict) -> dict:
    """Call Vertex AI Embeddings on the row's text content."""
    from google.cloud import aiplatform
    client = aiplatform.gapic.PredictionServiceClient(...)
    resp = client.predict(...)
    row["embedding"] = resp.predictions[0]["embeddings"]["values"]
    # Stamp processing time so the _processed_at column declared in the sink schema is populated.
    row["_processed_at"] = datetime.now(timezone.utc).isoformat()
    return row
 
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/myco-prod/subscriptions/tickets-sub"
        )
        | "ParseJSON"      >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "Filter"         >> beam.Filter(lambda r: r.get("status") == "created")
        | "Embed"          >> beam.Map(call_embedding_api)
        | "WindowFiveMin"  >> beam.WindowInto(beam.window.FixedWindows(300))
        | "ToBigQuery"     >> beam.io.WriteToBigQuery(
            table="myco.silver.tickets_embedded",
            schema="ticket_id:STRING,content:STRING,embedding:FLOAT64,_processed_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
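
To send this to Dataflow instead of running it locally, the runner and GCP settings go into the same PipelineOptions. A minimal sketch, with placeholder project, region, and bucket names:

options = PipelineOptions(
    runner="DataflowRunner",   # the default DirectRunner runs the same code locally for tests
    project="myco-prod",                           # placeholder project
    region="us-central1",
    temp_location="gs://myco-dataflow-tmp/tmp",    # placeholder staging bucket
    streaming=True,
)

Because PipelineOptions also reads command-line flags, the same settings can be passed at launch time as --runner, --project, --region, and --temp_location instead of being hard-coded.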

What Dataflow does for you:

  • Autoscaling. It spins workers up and down based on the input rate. A burst from Pub/Sub gets more workers; quiet hours scale to one.
  • Watermarks. Streaming windows have correct "this window is closed" semantics even with late-arriving data.
  • Retries with backpressure. A failing embedding call backs off and retries; Pub/Sub messages aren't acked until processing succeeds.
  • Unified batch + stream. The same Beam code runs in batch mode against a historical GCS dataset for backfills.
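
To make the last bullet concrete, a batch backfill could reuse the Embed step and the BigQuery sink while swapping the Pub/Sub source for newline-delimited JSON files on GCS. A sketch, assuming the bucket path below is a placeholder and call_embedding_api is the function defined earlier:

with beam.Pipeline(options=PipelineOptions()) as p:   # no streaming=True: a plain batch run
    (
        p
        | "ReadBackfill" >> beam.io.ReadFromText("gs://myco-raw/tickets/2024-*.jsonl")
        | "ParseJSON"    >> beam.Map(json.loads)
        | "Filter"       >> beam.Filter(lambda r: r.get("status") == "created")
        | "Embed"        >> beam.Map(call_embedding_api)
        | "ToBigQuery"   >> beam.io.WriteToBigQuery(
            table="myco.silver.tickets_embedded",
            schema="ticket_id:STRING,content:STRING,embedding:FLOAT64,_processed_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )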

Why It Matters

  • Streaming is genuinely different. Batch SQL has a clean concept of "the data" — you query whatever's there. Streaming needs windows, watermarks, exactly-once semantics. Dataflow handles all of this without you writing it.
  • External API calls in a pipeline are non-trivial. Rate limits, retries, batching, parallelism — Beam's transforms and the Dataflow runner handle most of it (see the batching sketch after this list). Doing the same in a Cloud Run script means re-implementing distributed semantics.
  • Beam pipelines are portable. Same code, different runner (Flink, Spark) — useful if you ever leave GCP, or run the same pipeline locally for tests.
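
One way to handle the rate-limit and batching point from the second bullet: group rows before calling the API so a single request carries many texts. A sketch built on Beam's BatchElements; the batch sizes, the rows PCollection, and the embed_texts helper are illustrative assumptions rather than a real client:

class EmbedBatch(beam.DoFn):
    """Enrich rows in batches: one API call per batch instead of one per row."""
    def process(self, batch):
        texts = [row["content"] for row in batch]
        vectors = embed_texts(texts)   # hypothetical helper wrapping the embeddings API
        for row, vector in zip(batch, vectors):
            row["embedding"] = vector
            yield row

enriched = (
    rows   # upstream PCollection of parsed ticket dicts
    | "Batch" >> beam.BatchElements(min_batch_size=10, max_batch_size=100)
    | "Embed" >> beam.ParDo(EmbedBatch())
)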

Key Technical Details

  • Dataflow is priced per worker-hour, scaled by vCPUs and GB of RAM. A streaming job at 1 vCPU runs about $35/month sustained.
  • Use Dataflow Flex Templates for parameterized pipelines you launch from Composer or Cloud Run. Templates separate "the code" from "the run."
  • Dead-letter sinks catch records that consistently fail — write them to a GCS quarantine bucket for manual review (see the tagged-output sketch after this list).
  • Streaming jobs run forever. Define explicit upgrade and shutdown procedures; "kill and restart" loses in-flight messages without proper draining.
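
A sketch of the dead-letter pattern from the third bullet, using Beam's tagged outputs: rows whose enrichment raises an exception go to a side output instead of retrying forever and stalling the pipeline. Here rows is an assumed upstream PCollection, and the quarantine sink is left abstract because the right choice (windowed GCS files, a dead-letter table, a Pub/Sub topic) varies by job:

class EmbedOrDeadLetter(beam.DoFn):
    """Route rows whose enrichment raises to a 'dead' side output."""
    def process(self, row):
        try:
            yield call_embedding_api(row)
        except Exception as err:
            # Tagged output keeps poison rows out of the main BigQuery write.
            yield beam.pvalue.TaggedOutput("dead", {"row": json.dumps(row), "error": str(err)})

results = rows | "Embed" >> beam.ParDo(EmbedOrDeadLetter()).with_outputs("dead", main="ok")
# results.ok   -> the WriteToBigQuery sink shown earlier
# results.dead -> the GCS quarantine path (or another dead-letter sink)

For the shutdown point in the last bullet, gcloud dataflow jobs drain JOB_ID stops pulling from the source, finishes what's in flight, and only then terminates; a plain cancel drops whatever was mid-processing.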

Common Misconceptions

"Use Dataflow for everything." Dataflow has cold-start cost and operational complexity that BigQuery doesn't. For batch SQL, BigQuery is simpler. For sparse events, Cloud Run is cheaper. Dataflow is the right answer specifically when SQL or simple Cloud Run aren't enough.

"Dataflow replaces BigQuery." They complement each other. Dataflow processes; BigQuery stores and queries. The output of a Dataflow pipeline is usually a BigQuery table.

"Streaming means low latency." Dataflow streaming pipelines have ~10-second end-to-end latency in steady state — fine for most agent uses, not "real-time" in the sub-second sense. For sub-second you need to drop windowing entirely.

Connections to Other Concepts

Further Reading

  • Google Cloud, "Dataflow overview" + "Apache Beam programming guide."
  • Tyler Akidau, Slava Chernyak, Reuven Lax, "Streaming Systems" (O'Reilly) — co-authored by the team that built Dataflow; the canonical streaming reference.
  • "Dataflow Cookbook" GitHub repo — Reference pipelines for common patterns.