What's the Concept?
A bucket is a flat namespace; "folders" are just /-delimited prefixes. The way you choose those prefixes is the most consequential decision you'll make about your lake, because it determines:
- Which downstream tools can read partitions without scanning the whole bucket.
- How easy it is to add a new source without rewriting consumers.
- How permissions split between teams and use cases.
- How lifecycle rules apply (you can target by prefix).
The canonical layout is source → entity → time partition → file. Once committed, every ingester writes to that shape, and every downstream consumer expects it.
How It Works
The exact path convention used throughout this course:
`gs://<bucket>/source=<src>/entity=<ent>/ingestion_date=<YYYY-MM-DD>/<file>`

Each component has a job:
- `source=<src>` — Which external system this came from: `stripe`, `salesforce`, `intercom`, `internal-app-events`. Stable forever; rename = breaking change.
- `entity=<ent>` — What kind of object: `charges`, `customers`, `tickets`. Maps 1:1 with the downstream silver tables.
- `ingestion_date=<YYYY-MM-DD>` — When we landed it. Not the event timestamp inside the data — those will disagree, and that's important to track.
- `<file>` — Whatever the ingester wrote. Common: `page=00000.jsonl.gz`, `part-<uuid>.parquet`, `<batch-id>.csv`.
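To make the convention concrete, here is a minimal sketch of a path builder an ingester might use. The function name `build_bronze_path` and the bucket name are hypothetical, not part of the course code:

```python
from datetime import date

def build_bronze_path(bucket: str, source: str, entity: str,
                      ingestion_date: date, filename: str) -> str:
    """Assemble an object path following source -> entity -> ingestion_date -> file."""
    return (
        f"gs://{bucket}/source={source}/entity={entity}/"
        f"ingestion_date={ingestion_date.isoformat()}/{filename}"
    )

path = build_bronze_path("my-lake", "stripe", "charges",
                         date(2026, 5, 14), "page=00000.jsonl.gz")
print(path)
# gs://my-lake/source=stripe/entity=charges/ingestion_date=2026-05-14/page=00000.jsonl.gz
```

Centralizing path construction in one helper is itself part of the discipline: if every ingester calls the same function, nobody can accidentally shuffle the key order.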
Why ingestion_date and not event_date? Because ingestion is the only date the ingester unambiguously knows. Event date lives inside the payload and might be missing, malformed, or reflect a different time zone. Late-arriving data — common in CDC and webhooks — would scramble an event-date partition. By partitioning on ingestion_date, you keep bronze append-only and time-ordered by what actually happened in your system.
For sources where partitioning by event date matters downstream (which is most of them), the silver layer re-partitions correctly. That's a refinement-layer concern, not a bronze one.
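The ingestion-versus-event distinction is easy to see with a late-arriving record. A sketch, using a made-up webhook payload (the field names are illustrative):

```python
from datetime import date, datetime

# A late-arriving webhook: the event happened on May 12,
# but we only received (ingested) it on May 14.
record = {"id": "ch_123", "created": "2026-05-12T23:59:02+00:00", "amount": 4200}

ingestion_date = date(2026, 5, 14)  # the only date the ingester knows for certain
event_date = datetime.fromisoformat(record["created"]).date()  # lives in the payload

# Bronze partitions on ingestion_date; silver re-partitions on event_date later.
partition = f"ingestion_date={ingestion_date.isoformat()}"
print(partition, "vs event date", event_date)  # the two dates disagree, by design
```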
Why It Matters
- Queryability without enumeration. BigQuery, Dataflow, and dbt can read `gs://bucket/source=stripe/entity=charges/ingestion_date=2026-05-*/` directly. No metadata service needed.
- Backfills become a `gcloud storage cp` away. Need to reprocess April? You know exactly which files to point a job at.
- Permissions follow prefixes. A consumer that only needs Stripe data gets IAM grants on `source=stripe/*` and can't see anything else.
- Lifecycle rules work cleanly. "Move anything older than 90 days to Coldline" applies at the bucket level and respects the partition layout automatically.
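The backfill point above boils down to glob-matching on the layout. A sketch with a hard-coded listing standing in for a real bucket (in practice you would enumerate objects with a client library such as `google-cloud-storage` and a prefix filter):

```python
from fnmatch import fnmatch

# Hypothetical object listing, as returned by a bucket scan.
keys = [
    "source=stripe/entity=charges/ingestion_date=2026-04-03/part-a1.parquet",
    "source=stripe/entity=charges/ingestion_date=2026-04-17/part-b2.parquet",
    "source=stripe/entity=charges/ingestion_date=2026-05-01/part-c3.parquet",
    "source=salesforce/entity=accounts/ingestion_date=2026-04-09/batch-1.csv",
]

# "Reprocess April's Stripe charges" is a single glob over the layout.
pattern = "source=stripe/entity=charges/ingestion_date=2026-04-*/*"
april_files = [k for k in keys if fnmatch(k, pattern)]
print(april_files)
```

Note how the Salesforce object and the May object fall out naturally: the path itself encodes everything the backfill job needs to know.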
Key Technical Details
- Keep partition values URL-safe and lowercase. Avoid spaces, capital letters, and special characters in `source=` and `entity=` — they'll bite you in shell scripts.
- File size sweet spot: 100 MB to 1 GB per file. Smaller wastes BigQuery overhead; bigger slows parallel reads. Ingesters that produce many tiny files should compact on a schedule.
- Format preference inside the partition: gzipped JSONL for record-shaped data, Snappy-compressed Parquet for analytical-shaped data. Skip CSV when you can.
- Avoid `entity=*/ingestion_date=*/source=*/…` (key order shuffled). BigQuery's external-table partition discovery requires a consistent key order.
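The URL-safe rule is cheap to enforce before a write ever happens. A sketch of a pre-flight check an ingester could run; the exact character set is an assumption, tighten it to taste:

```python
import re

# Allowed partition values: lowercase letters, digits, hyphens, underscores.
SAFE_VALUE = re.compile(r"^[a-z0-9_-]+$")

def check_partition_value(key: str, value: str) -> None:
    """Reject values that would break shell globs or URL handling downstream."""
    if not SAFE_VALUE.match(value):
        raise ValueError(f"{key}={value!r} is not URL-safe lowercase")

check_partition_value("source", "internal-app-events")   # fine
try:
    check_partition_value("entity", "Customer Records")  # space + capitals
except ValueError as err:
    print(err)
```

Failing fast at ingest time is far cheaper than renaming millions of objects after a consumer's glob silently misses them.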
Common Misconceptions
"Just put the date first; that's how Hive does it." Hive-style is key=value, but the order of keys is up to you. Putting source first keeps related data together for tools that scan by prefix; lifecycle rules and IAM both benefit. Date-first works but groups unrelated sources together.
"I'll figure out the layout when I need to." This is the worst regret. A bad layout chosen early ossifies — every job written against it makes the migration harder. Spend the hour up front.
Connections to Other Concepts
- `01-cloud-storage-as-a-lake` — The bucket-level setup.
- `03-schema-on-read-vs-on-write` — How layout interacts with schema discovery.
- Course `04-refinement-in-bigquery/04-incremental-and-idempotent-pipelines` — Layout is what makes incremental refinement tractable.
Further Reading
- Google Cloud, "BigQuery external tables and Hive-style partitioning" docs.
- Spark documentation on partition pruning — explains why this layout makes downstream reads cheap across many tools.
- Maxime Beauchemin, "Functional Data Engineering" essay — Background on why immutable, partition-keyed data is the right shape.