Data Governance From Day One

What's the Concept?

Governance gets a bad reputation because it's usually introduced after a compliance incident, by a committee, as a process tax. Done at the start, it's much cheaper: a few labels on buckets, an IAM convention, a retention rule, and a one-page document of "who owns what." That's all most early-stage agent pipelines need to stay safe.

Four pieces matter:

Ownership — Every dataset has a named owning team.
Classification — Each dataset is tagged with a sensitivity level (public, internal, confidential, restricted).
Access — IAM grants follow ownership + classification, not personal request.
Retention — Each dataset has a deletion or archive policy.

Get those four written down and applied, and the rest of governance is incremental.

How It Works

The minimum implementation on GCP:

Labels on buckets and BigQuery datasets capture ownership and classification:

labels = {
  owner: "platform-data",
  sensitivity: "confidential",
  source: "stripe",
  pii: "true",
  cost_center: "engineering",
}

These labels show up in billing exports and Cloud Asset Inventory queries, so cost and ownership can be sliced by any label key. They're free, queryable, and durable.
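Labels only help if every resource carries them, so a small check in CI can catch gaps early. The following is a minimal sketch, not a definitive implementation: the required-key set is an assumed house convention taken from the example above, and the regexes accept only the ASCII subset of what GCP allows (lowercase letters, digits, underscores, hyphens, up to 63 characters, keys starting with a letter).

```python
import re

# Assumed house convention: every bucket and dataset carries these keys.
REQUIRED_KEYS = {"owner", "sensitivity", "source", "pii", "cost_center"}
SENSITIVITY_LEVELS = {"public", "internal", "confidential", "restricted"}

# ASCII subset of the GCP label rules (GCP also permits international characters).
KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def validate_labels(labels: dict) -> list:
    """Return a list of problems; an empty list means the labels pass."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(labels)):
        problems.append(f"missing required label: {key}")
    for key, value in labels.items():
        if not KEY_RE.fullmatch(key):
            problems.append(f"invalid label key: {key!r}")
        if not VALUE_RE.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    sensitivity = labels.get("sensitivity")
    if sensitivity is not None and sensitivity not in SENSITIVITY_LEVELS:
        problems.append(f"unknown sensitivity level: {sensitivity!r}")
    return problems
```

Calling validate_labels on the label set shown above returns an empty list; a resource missing sensitivity, or with an uppercase owner value, yields one message per problem.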

IAM at the project + bucket level follows a small set of named groups and service accounts:

data-engineer@ — write to bronze + silver, query everything.
data-analyst@ — query silver and gold, no writes.
agent-service-account@ — query gold only, no writes, scoped to specific tables.
pii-reviewer@ — granted manually, time-bounded, for any access to PII-tagged datasets.

No one gets roles/owner on the project after initial setup. Personal access is via group membership, audited weekly.
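The access rules above can be written down as a small decision table, which is handy for reviewing a proposed grant before it becomes an IAM binding. This is a sketch under stated assumptions: the group keys are shortened placeholders for the real group addresses, the "tier.table" naming and the AGENT_TABLES allowlist are hypothetical, and the pii-reviewer role is modeled as a time-bounded grant per principal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Which tiers each group may read or write, mirroring the list above.
POLICY = {
    "data-engineer": {"read": {"bronze", "silver", "gold"}, "write": {"bronze", "silver"}},
    "data-analyst": {"read": {"silver", "gold"}, "write": set()},
    "agent-service-account": {"read": {"gold"}, "write": set()},
}

# Hypothetical allowlist: the agent may read only these gold tables.
AGENT_TABLES = frozenset({"gold.customer_summary"})


@dataclass(frozen=True)
class PiiGrant:
    """A manually issued, time-bounded pii-reviewer grant."""
    principal: str
    expires_at: datetime  # timezone-aware


def is_allowed(group, action, table, *, pii=False, principal="", grants=(), now=None):
    """Decide whether `group` may perform `action` ("read"/"write") on `table`."""
    now = now or datetime.now(timezone.utc)
    tier = table.split(".", 1)[0]
    rules = POLICY.get(group)
    if rules is None or tier not in rules.get(action, set()):
        return False
    if group == "agent-service-account" and table not in AGENT_TABLES:
        return False
    if pii:
        # PII-tagged data additionally needs an unexpired reviewer grant.
        return any(g.principal == principal and g.expires_at > now for g in grants)
    return True
```

Keeping the table this small is the point: if a request cannot be expressed as a row in POLICY plus an optional PII grant, it probably needs a conversation rather than a one-off binding.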

Retention is set at three points:

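GCS lifecycle rules move bronze objects to a colder storage class and delete objects older than the retention window.
BigQuery partition expiration (timePartitioning.expirationMs on a table, or defaultPartitionExpirationMs on a dataset), so old partitions delete themselves.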
A documented schedule for any silver/gold table containing PII (often 13 months to honor GDPR-style requests).
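For the GCS piece, the lifecycle configuration is a small JSON document. The sketch below generates one; the "bronze/" prefix and the 30-day and 395-day thresholds (roughly 13 months) are illustrative assumptions, not values from a real pipeline.

```python
import json


def bronze_lifecycle(archive_after_days: int, delete_after_days: int,
                     prefix: str = "bronze/") -> dict:
    """Build a GCS lifecycle config: archive first, delete later."""
    if archive_after_days >= delete_after_days:
        raise ValueError("archive threshold must be earlier than delete threshold")
    return {
        "rule": [
            {
                # Move objects to the ARCHIVE storage class once they age out of hot use.
                "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
                "condition": {"age": archive_after_days, "matchesPrefix": [prefix]},
            },
            {
                # Delete objects once they pass the retention window.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days, "matchesPrefix": [prefix]},
            },
        ]
    }


def write_lifecycle_file(path: str, config: dict) -> None:
    """Write the config as JSON so it can be applied to a bucket."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
```

A file written this way can typically be applied with gcloud storage buckets update and its --lifecycle-file flag; check the exact invocation against the current gcloud documentation before relying on it.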

A datasets doc in the repo lists every silver/gold table: purpose, owner, refresh cadence, retention, downstream consumers (including which agent uses it).
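A datasets doc drifts unless something checks it. One way to keep it honest is to compare it against the tables that actually exist. The sketch below assumes the doc has already been parsed into a dict keyed by table name; the field names and the shape of the catalog are hypothetical.

```python
# Fields every silver/gold entry is expected to fill in (assumed names).
REQUIRED_FIELDS = ("purpose", "owner", "refresh", "retention", "consumers")


def check_catalog(catalog: dict, deployed_tables: set) -> list:
    """Compare the datasets doc with deployed tables and report drift."""
    problems = []
    for table in sorted(deployed_tables - set(catalog)):
        problems.append(f"{table}: deployed but not documented")
    for table in sorted(set(catalog) - deployed_tables):
        problems.append(f"{table}: documented but not deployed")
    for table in sorted(catalog):
        entry = catalog[table]
        for field in REQUIRED_FIELDS:
            # Treat missing keys and empty values the same way.
            if not entry.get(field):
                problems.append(f"{table}: missing {field}")
    return problems
```

Run on every pull request, a check like this makes an undocumented table a failing build instead of a surprise during an audit.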

Why It Matters

Audits become a SELECT. With labels in place, "which datasets contain PII?" is a one-line query against the asset inventory. Without labels, it's a multi-week project.
Onboarding speeds up. New engineers find what's available, who owns it, and how to access it without asking around.
Production incidents stay contained. When a service account leaks, you know exactly which tables it could read. The blast radius is bounded.
Cost attribution works. Labeling by cost_center and owner means the billing export tells you, row by row, who's spending what.
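To make the audit claim concrete, here is a sketch of the "which datasets contain PII?" question answered over an exported inventory. The record shape is a simplified, hypothetical stand-in for a Cloud Asset Inventory export (real exports nest labels deeper inside the resource data), and the asset names are made up.

```python
def assets_with_label(assets, key, value):
    """Return the sorted names of assets whose label `key` equals `value`."""
    return sorted(
        asset["name"]
        for asset in assets
        if asset.get("labels", {}).get(key) == value
    )


# Hypothetical, simplified inventory records.
EXAMPLE_ASSETS = [
    {
        "name": "//storage.googleapis.com/bronze-stripe",
        "asset_type": "storage.googleapis.com/Bucket",
        "labels": {"owner": "platform-data", "pii": "true"},
    },
    {
        "name": "//bigquery.googleapis.com/projects/demo/datasets/gold",
        "asset_type": "bigquery.googleapis.com/Dataset",
        "labels": {"owner": "platform-data", "pii": "false"},
    },
    {
        # An unlabeled resource: invisible to the audit, which is the real risk.
        "name": "//storage.googleapis.com/scratch",
        "asset_type": "storage.googleapis.com/Bucket",
    },
]
```

Here assets_with_label(EXAMPLE_ASSETS, "pii", "true") returns only the bronze-stripe bucket. The unlabeled scratch bucket never matches, which is why label coverage has to be enforced rather than assumed.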

Key Technical Details

BigQuery supports column-level access control via policy tags (Data Catalog). For PII columns specifically (emails, phone numbers, names), attach a policy tag and grant access on the tag, not the column. Even SELECT * respects the policy: by default, a caller without access to a tagged column gets an access-denied error rather than the data.
VPC Service Controls (VPC-SC) add a network-perimeter layer on top of IAM. For regulated workloads (HIPAA, PCI), this is the standard.
Cloud DLP can scan GCS and BigQuery for PII; useful as a periodic audit, not as a real-time gate.
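Admin Activity audit logs are always on; Data Access audit logs are off by default for most services (BigQuery is the exception, where they are always enabled) and worth turning on for buckets and other services that hold sensitive datasets.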
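Data Access logging is configured through the auditConfigs section of a project's IAM policy. The sketch below adds the relevant entries to a policy that has already been fetched as a dict; fetching and writing the policy back is left out, and the helper name is hypothetical.

```python
import copy


def enable_data_access_logs(policy: dict, service: str,
                            log_types=("DATA_READ", "DATA_WRITE")) -> dict:
    """Return a copy of `policy` with Data Access logs enabled for `service`."""
    updated = copy.deepcopy(policy)  # never mutate the caller's policy
    configs = updated.setdefault("auditConfigs", [])
    entry = next((c for c in configs if c.get("service") == service), None)
    if entry is None:
        entry = {"service": service, "auditLogConfigs": []}
        configs.append(entry)
    log_configs = entry.setdefault("auditLogConfigs", [])
    existing = {c.get("logType") for c in log_configs}
    for log_type in log_types:
        # Only append log types that are not already enabled.
        if log_type not in existing:
            log_configs.append({"logType": log_type})
    return updated
```

The function is idempotent, so it can run on every deploy: applying it to a policy that already has the entries returns an equal policy. A typical call would pass "storage.googleapis.com" as the service for buckets holding sensitive data.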

Common Misconceptions

"We're too small to need governance." The marginal cost at small scale is one afternoon. The marginal cost at large scale is six months. Pick.

"IAM groups are enough." They're necessary but not sufficient. Groups handle who can access; labels and tags handle what they're accessing. You need both.

"Cloud DLP redacts our data." Cloud DLP can detect PII and produce redacted copies, but it's a separate pipeline you have to wire up. It's not automatic. We'll cover that in Module 07.

Connections to Other Concepts

IAM and Security for Agent Data Paths — Deep dive on the IAM patterns.
Handling PII and Redaction Pipelines — DLP, tokenization, and redaction.
Schema-on-Read vs Schema-on-Write — Governance is easier when silver schemas are explicit.