What's the Concept?

A bucket is a flat namespace; "folders" are just /-delimited prefixes. The way you choose those prefixes is the most consequential decision you'll make about your lake, because it determines:

  • Which downstream tools can read partitions without scanning the whole bucket.
  • How easy it is to add a new source without rewriting consumers.
  • How permissions split between teams and use cases.
  • How lifecycle rules apply (you can target by prefix).

The canonical layout is source → entity → time partition → file. Once committed, every ingester writes to that shape, and every downstream consumer expects it.

How It Works

The exact path convention used throughout this course:

gs://<bucket>/source=<src>/entity=<ent>/ingestion_date=<YYYY-MM-DD>/<file>

Each component has a job:

  • source=<src> — Which external system this came from. stripe, salesforce, intercom, internal-app-events. Stable forever; rename = breaking change.
  • entity=<ent> — What kind of object. charges, customers, tickets. Maps 1:1 with the downstream silver tables.
  • ingestion_date=<YYYY-MM-DD> — When we landed it. Not the event timestamp inside the data — those will disagree, and that's important to track.
  • <file> — Whatever the ingester wrote. Common: page=00000.jsonl.gz, part-<uuid>.parquet, <batch-id>.csv.
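The convention above is mechanical enough to encode in a few lines. A minimal sketch (the helper names build_bronze_path / parse_bronze_path are illustrative, not part of the course code):

```python
import re
from datetime import date

# Accepts only the canonical key order: source -> entity -> ingestion_date.
# A shuffled order fails to parse, which is exactly what we want.
PATH_RE = re.compile(
    r"^gs://(?P<bucket>[^/]+)"
    r"/source=(?P<source>[a-z0-9_-]+)"
    r"/entity=(?P<entity>[a-z0-9_-]+)"
    r"/ingestion_date=(?P<ingestion_date>\d{4}-\d{2}-\d{2})"
    r"/(?P<file>[^/]+)$"
)

def build_bronze_path(bucket: str, source: str, entity: str,
                      ingestion_date: date, filename: str) -> str:
    """Render the canonical source -> entity -> time partition -> file path."""
    return (f"gs://{bucket}/source={source}/entity={entity}"
            f"/ingestion_date={ingestion_date.isoformat()}/{filename}")

def parse_bronze_path(path: str) -> dict:
    """Split a path back into its components; reject anything off-convention."""
    m = PATH_RE.match(path)
    if m is None:
        raise ValueError(f"not a valid bronze path: {path}")
    return m.groupdict()
```

Having every ingester call one shared builder, and every consumer one shared parser, is what makes the layout a contract rather than a habit.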

Why ingestion_date and not event_date? Because ingestion is the only date the ingester unambiguously knows. Event date lives inside the payload and might be missing, malformed, or reflect a different time zone. Late-arriving data — common in CDC and webhooks — would scramble an event-date partition. By partitioning on ingestion_date, you keep bronze append-only and time-ordered by when your system actually received the data.

For sources where partitioning by event date matters downstream (which is most of them), the silver layer re-partitions correctly. That's a refinement-layer concern, not a bronze one.

Why It Matters

  • Queryability without enumeration. BigQuery, Dataflow, and dbt can read gs://bucket/source=stripe/entity=charges/ingestion_date=2026-05-*/ directly. No metadata service needed.
  • Backfills become a gcloud storage cp away. Need to reprocess April? You know exactly which files to point a job at.
  • Permissions follow prefixes. A consumer that only needs Stripe data gets an IAM condition scoped to the source=stripe/ prefix and can't see anything else.
  • Lifecycle rules work cleanly. "Move anything older than 90 days to Coldline" applies at the bucket level and respects the partition layout automatically.
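The lifecycle point is concrete enough to show. Below is a sketch of the JSON that gcloud storage buckets update --lifecycle-file accepts, built as a Python dict; the 90-day threshold and the stripe-only prefix are illustrative choices, not requirements of the course:

```python
import json

# "Move Stripe bronze data older than 90 days to Coldline."
# matchesPrefix works because the layout puts source first.
lifecycle = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 90, "matchesPrefix": ["source=stripe/"]},
        }
    ]
}

lifecycle_json = json.dumps(lifecycle, indent=2)
```

Write lifecycle_json to a file and apply it with gcloud storage buckets update gs://<bucket> --lifecycle-file=<file>. Had the layout been date-first, a per-source rule like this would be impossible to express as a single prefix.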

Key Technical Details

  • Keep partition values URL-safe and lowercase. Avoid spaces, capital letters, and special characters in source= and entity= — they'll bite you in shell scripts.
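Enforcing the URL-safe, lowercase rule at write time is cheaper than debugging a shell script later. A minimal sanitizer sketch (the function name is illustrative):

```python
import re

SAFE_VALUE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")  # lowercase, URL-safe

def sanitize_partition_value(value: str) -> str:
    """Lowercase and replace shell-hostile characters so the value is
    safe to embed in a source= or entity= path component."""
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", value.strip().lower()).strip("-")
    if not SAFE_VALUE.match(cleaned):
        raise ValueError(f"cannot derive a safe partition value from {value!r}")
    return cleaned
```

Running new source and entity names through one sanitizer keeps the whole bucket consistent with the convention.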
  • File size sweet spot: 100 MB to 1 GB per file. Smaller files waste per-file overhead in BigQuery; larger ones limit read parallelism. Ingesters that produce many tiny files should compact on a schedule.
  • Format preference inside the partition: gzipped JSONL for record-shaped data, Snappy-compressed Parquet for analytical-shaped data. Skip CSV when you can.
  • Avoid entity=*/ingestion_date=*/source=*/… (key order shuffled). BigQuery's external-table partition discovery requires the same key order in every path.

Common Misconceptions

"Just put the date first; that's how Hive does it." Hive-style is key=value, but the order of keys is up to you. Putting source first keeps related data together for tools that scan by prefix; lifecycle rules and IAM both benefit. Date-first works but groups unrelated sources together.

"I'll figure out the layout when I need to." This is the worst regret. A bad layout chosen early ossifies — every job written against it makes the migration harder. Spend the hour up front.

Connections to Other Concepts

Further Reading

  • Google Cloud, "BigQuery external tables and Hive-style partitioning" docs.
  • Spark documentation on partition pruning — explains why this layout makes downstream reads cheap across many tools.
  • Maxime Beauchemin, "Functional Data Engineering" essay — background on why immutable, partition-keyed data is the right shape.