What's the Concept?

When your source of truth is a production database — Postgres holding customer accounts, MySQL holding orders — you have two ways to bring it into the lake. You can issue periodic SELECT * FROM customers WHERE updated_at > ? queries (batch). Or you can read the database's own change log and replay every row mutation as it happens (CDC).

CDC is the production answer when you need near-real-time data in the warehouse without putting query load on the OLTP database. The database is already writing every change to a write-ahead log for its own crash recovery. CDC tools tail that log and republish the events.

How It Works

On GCP, the canonical CDC service is Datastream. It's a managed agent that:

  1. Connects to your Postgres / MySQL / Oracle source via a private connection.
  2. Reads the WAL (write-ahead log) — pgoutput plugin for Postgres, binlog for MySQL.
  3. Publishes change events to a destination, typically GCS bronze or BigQuery directly.
   Postgres production     Datastream                BigQuery
   ────────────────         ─────────                 ────────
   COMMIT  ──── WAL ────▶  tail  ────▶  events  ──▶  bronze.orders_raw
                                                     (one row per change)

                                                          │ silver step

                                                     silver.orders
                                                     (one row per order,
                                                      latest version)

Each CDC event has at least:

  • An operation: INSERT, UPDATE, or DELETE.
  • A primary key.
  • The new row state (and often the previous row state).
  • A commit timestamp (the LSN/binlog position is the authoritative ordering).

The bronze table is append-only and contains every change. The silver table collapses the change history into the current state per entity, using ROW_NUMBER() OVER (PARTITION BY id ORDER BY commit_ts DESC) or a similar merge pattern.

Why It Matters

  • Zero load on the source database. Reading the WAL is something the DB does anyway for replication. CDC piggybacks on that infrastructure.
  • Captures deletes, which batch can't. A SELECT * query has no way to notice a row that's been deleted. CDC explicitly emits delete events.
  • Sub-minute freshness, end-to-end. Datastream + BigQuery typically lands changes within 30–60 seconds of commit.
  • Audit trail comes free. Because bronze is append-only, you have a complete history of every row mutation — useful for debugging and for time-travel queries the agent might need.

Key Technical Details

  • Postgres requires wal_level=logical and at least one replication slot. The slot retains WAL until it's consumed — if Datastream goes down, the source's disk usage grows. Monitor this.
  • MySQL needs binlog_format=ROW and at least 7 days of binlog retention. Otherwise a Datastream pause can lose data.
  • DDL changes (schema migrations) need careful handling. Datastream can detect column adds; column drops and type changes usually break the stream and need manual recovery.
  • Cost model is per-GB processed. For a moderately busy OLTP database, expect $50–200/month.

Common Misconceptions

"CDC means I don't need batch." You still might. CDC is great for tables that change frequently; for slow-changing reference data, a nightly batch is simpler and equally fresh enough.

"CDC is real-time replication." It's near-real-time. There's a small lag from commit to lake — usually under a minute — and that lag can spike during high write volume or replication-slot back-pressure.

"CDC will replace my warehouse ETL." No — CDC fills bronze. Silver and gold modeling are still entirely your job; CDC just makes the input fresher.

Connections to Other Concepts

Further Reading

  • Google Cloud, "Datastream overview" docs.
  • Martin Kleppmann, "Designing Data-Intensive Applications" (O'Reilly) — chapter 11 on log-based replication is the best primer on what's happening inside Datastream.
  • Debezium documentation — the open-source equivalent; reading their docs gives a clearer mental model of CDC mechanics.