What's the Concept?

A bucket is a flat namespace; "folders" are just /-delimited prefixes. The way you choose those prefixes is the most consequential decision you'll make about your lake, because it determines:

  • Which downstream tools can read partitions without scanning the whole bucket.
  • How easy it is to add a new source without rewriting consumers.
  • How permissions split between teams and use cases.
  • How lifecycle rules apply (you can target by prefix).

The canonical layout is source → entity → time partition → file. Once committed, every ingester writes to that shape, and every downstream consumer expects it.

How It Works

The exact path convention used throughout this course:

gs://<bucket>/source=<src>/entity=<ent>/ingestion_date=<YYYY-MM-DD>/<file>

Each component has a job:

  • source=<src> — Which external system this came from. stripe, salesforce, intercom, internal-app-events. Stable forever; rename = breaking change.
  • entity=<ent> — What kind of object. charges, customers, tickets. Maps 1:1 with the downstream silver tables.
  • ingestion_date=<YYYY-MM-DD> — When we landed it. Not the event timestamp inside the data — those will disagree, and that's important to track.
  • <file> — Whatever the ingester wrote. Common: page=00000.jsonl.gz, part-<uuid>.parquet, <batch-id>.csv.
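The convention above is mechanical enough to encode in a few lines. A minimal sketch (the helper names build_bronze_path / parse_bronze_path are illustrative, not part of the course code):

```python
import re
from datetime import date

# Accepts only the canonical key order: source -> entity -> ingestion_date.
# A shuffled order fails to parse, which is exactly what we want.
PATH_RE = re.compile(
    r"^gs://(?P<bucket>[^/]+)"
    r"/source=(?P<source>[a-z0-9_-]+)"
    r"/entity=(?P<entity>[a-z0-9_-]+)"
    r"/ingestion_date=(?P<ingestion_date>\d{4}-\d{2}-\d{2})"
    r"/(?P<file>[^/]+)$"
)

def build_bronze_path(bucket: str, source: str, entity: str,
                      ingestion_date: date, filename: str) -> str:
    """Render the canonical source -> entity -> time partition -> file path."""
    return (f"gs://{bucket}/source={source}/entity={entity}"
            f"/ingestion_date={ingestion_date.isoformat()}/{filename}")

def parse_bronze_path(path: str) -> dict:
    """Split a path back into its components; reject anything off-convention."""
    m = PATH_RE.match(path)
    if m is None:
        raise ValueError(f"not a valid bronze path: {path}")
    return m.groupdict()
```

Having every ingester call one shared builder, and every consumer one shared parser, is what makes the layout a contract rather than a habit.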

Why ingestion_date and not event_date? Because ingestion is the only date the ingester unambiguously knows. Event date lives inside the payload and might be missing, malformed, or reflect a different time zone. Late-arriving data — common in CDC and webhooks — would scramble an event-date partition. By partitioning on ingestion_date, you keep bronze append-only and time-ordered by when your system actually received the data.

For sources where partitioning by event date matters downstream (which is most of them), the silver layer re-partitions correctly. That's a refinement-layer concern, not a bronze one.

Why It Matters

  • Queryability without enumeration. BigQuery, Dataflow, and dbt can read gs://bucket/source=stripe/entity=charges/ingestion_date=2026-05-*/ directly. No metadata service needed.
  • Backfills become a gcloud storage cp away. Need to reprocess April? You know exactly which files to point a job at.
  • Permissions follow prefixes. A consumer that only needs Stripe data gets an IAM condition scoped to the source=stripe/ prefix and can't see anything else.
  • Lifecycle rules work cleanly. "Move anything older than 90 days to Coldline" applies at the bucket level and respects the partition layout automatically.
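The lifecycle point is concrete enough to show. Below is a sketch of the JSON that gcloud storage buckets update --lifecycle-file accepts, built as a Python dict; the 90-day threshold and the stripe-only prefix are illustrative choices, not requirements of the course:

```python
import json

# "Move Stripe bronze data older than 90 days to Coldline."
# matchesPrefix works because the layout puts source first.
lifecycle = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 90, "matchesPrefix": ["source=stripe/"]},
        }
    ]
}

lifecycle_json = json.dumps(lifecycle, indent=2)
```

Write lifecycle_json to a file and apply it with gcloud storage buckets update gs://<bucket> --lifecycle-file=<file>. Had the layout been date-first, a per-source rule like this would be impossible to express as a single prefix.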

Key Technical Details

  • Keep partition values URL-safe and lowercase. Avoid spaces, capital letters, and special characters in source= and entity= — they'll bite you in shell scripts.
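Enforcing the URL-safe, lowercase rule at write time is cheaper than debugging a shell script later. A minimal sanitizer sketch (the function name is illustrative):

```python
import re

SAFE_VALUE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")  # lowercase, URL-safe

def sanitize_partition_value(value: str) -> str:
    """Lowercase and replace shell-hostile characters so the value is
    safe to embed in a source= or entity= path component."""
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", value.strip().lower()).strip("-")
    if not SAFE_VALUE.match(cleaned):
        raise ValueError(f"cannot derive a safe partition value from {value!r}")
    return cleaned
```

Running new source and entity names through one sanitizer keeps the whole bucket consistent with the convention.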
  • File size sweet spot: 100 MB to 1 GB per file. Smaller files waste per-file overhead in BigQuery; larger ones limit read parallelism. Ingesters that produce many tiny files should compact on a schedule.
  • Format preference inside the partition: gzipped JSONL for record-shaped data, Snappy-compressed Parquet for analytical-shaped data. Skip CSV when you can.
  • Avoid entity=*/ingestion_date=*/source=*/… (key order shuffled). BigQuery's external-table partition discovery requires the same key order in every path.

Common Misconceptions

"Just put the date first; that's how Hive does it." Hive-style is key=value, but the order of keys is up to you. Putting source first keeps related data together for tools that scan by prefix; lifecycle rules and IAM both benefit. Date-first works but groups unrelated sources together.

"I'll figure out the layout when I need to." This is the worst regret. A bad layout chosen early ossifies — every job written against it makes the migration harder. Spend the hour up front.

Connections to Other Concepts

Further Reading

  • Google Cloud, "BigQuery external tables and Hive-style partitioning" docs.
  • Spark documentation on partition pruning — explains why this layout makes downstream reads cheap across many tools.
  • Maxime Beauchemin, "Functional Data Engineering" essay — background on why immutable, partition-keyed data is the right shape.