Data Lakehouse Architecture
A Data Lakehouse is an architecture that combines the cheap, flexible storage of a data lake (S3, ADLS, GCS) with the ACID transactions, schema enforcement, and fast SQL of a data warehouse. The technical breakthrough is open table formats — Apache Iceberg, Delta Lake, Apache Hudi — which sit on top of Parquet files in object storage and provide warehouse-like semantics (transactions, time travel, schema evolution, performant queries) without locking data into a proprietary engine. The strategic appeal: store data once in open formats, query it from any engine (Spark, Trino, Snowflake, Databricks SQL, DuckDB), and avoid vendor lock-in. The trade-off vs a pure cloud warehouse (Snowflake, BigQuery): more flexibility and lower storage cost, but more engineering complexity to operate well. The lakehouse is now the dominant architecture for new data platforms at scale (>5 PB).
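To make "warehouse semantics on top of Parquet in object storage" concrete, here is a minimal PySpark + Apache Iceberg sketch. It is a sketch under assumptions, not a reference setup: the catalog name (demo), bucket path, and table names are illustrative, and it assumes the iceberg-spark-runtime JAR is on the classpath.

```python
# Minimal sketch: Iceberg tables on object storage from PySpark.
# Assumptions: iceberg-spark-runtime is on the classpath; the catalog name,
# bucket path, and table names below are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales")

# Warehouse-like semantics on plain Parquet files: schema enforcement and
# atomic, snapshot-based commits.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_ts    TIMESTAMP
    ) USING iceberg
""")

spark.sql("""
    INSERT INTO demo.sales.orders
    VALUES (1, 42, 99.90, TIMESTAMP '2024-01-15 10:00:00')
""")  # each commit produces a new table snapshot

# Time travel: query the table as of its first snapshot.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.sales.orders.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.sql(f"SELECT * FROM demo.sales.orders VERSION AS OF {first_snapshot}").show()
```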
The Trap
The trap is adopting a lakehouse architecture for a 50-person company with 10 TB of data because the engineering blogs say it's the future. At that scale, Snowflake or BigQuery will be cheaper, faster, and dramatically simpler to operate than a self-managed Iceberg + Spark + Trino stack. The other trap: choosing a table format (Iceberg vs Delta vs Hudi) based on which engineering blog is loudest, then realizing 18 months in that the format doesn't integrate well with your downstream consumers. The most expensive failure is the 'lakehouse' that's actually just an S3 bucket of Parquet files with no table format, no governance, no ACID — i.e., a swamp wearing a lakehouse t-shirt.
What to Do
Apply a scale + use-case test before adopting lakehouse architecture. (1) Below ~1 PB and ~50 sources: a cloud warehouse (Snowflake/BigQuery) is almost always cheaper and faster — skip the lakehouse. (2) 1-5 PB, or 50-200 sources, or significant ML/data science workloads on raw data: hybrid (warehouse for BI + open formats for data science). (3) 5+ PB, or strong vendor-lock-in concerns, or polyglot engine requirements: full lakehouse with Iceberg/Delta. Then choose the table format based on your dominant engine (Delta if Databricks-centric, Iceberg if multi-engine / Snowflake / Trino, Hudi if heavy CDC/streaming workloads). Invest in a catalog (Unity, Polaris, AWS Glue) and governance from day one — without these, the lakehouse becomes a swamp.
Formula
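There is no closed-form formula here, but the scale + use-case test above reduces to a simple decision rule. A minimal sketch in Python, where the thresholds mirror the tiers listed and all names are illustrative rather than a standard API:

```python
# Minimal sketch of the scale + use-case test as a decision rule.
# Thresholds mirror the tiers above; the names are illustrative, not a standard API.
from dataclasses import dataclass


@dataclass
class PlatformProfile:
    data_volume_tb: float          # total managed data volume, in TB
    source_count: int              # number of source systems
    heavy_ml_on_raw_data: bool     # significant ML / data science on raw data
    needs_multi_engine: bool       # Spark + Trino + Flink + ML frameworks, etc.
    vendor_lock_in_concern: bool   # strategic concern about proprietary engines


def recommend_architecture(p: PlatformProfile) -> str:
    """Return a coarse architecture tier for the given platform profile."""
    if p.data_volume_tb >= 5_000 or p.needs_multi_engine or p.vendor_lock_in_concern:
        return "full lakehouse (Iceberg/Delta on object storage)"
    if p.data_volume_tb >= 1_000 or p.source_count >= 50 or p.heavy_ml_on_raw_data:
        return "hybrid: warehouse for BI + open table format for data science"
    return "cloud warehouse (Snowflake/BigQuery), skip the lakehouse"


# Example: the 8 TB, 25-source SaaS company from the knowledge check below.
profile = PlatformProfile(
    data_volume_tb=8,
    source_count=25,
    heavy_ml_on_raw_data=False,
    needs_multi_engine=False,
    vendor_lock_in_concern=False,
)
print(recommend_architecture(profile))  # cloud warehouse, skip the lakehouse
```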
In Practice
Netflix, Apple, Pinterest, and Shopify run massive Iceberg-based lakehouses. Netflix is a particularly well-documented case: they created Iceberg specifically because the Hive table format was breaking at their scale (hundreds of PB, thousands of concurrent queries). Iceberg delivered the schema evolution, hidden partitioning, and atomic writes that Hive couldn't. Today Netflix runs hundreds of petabytes on Iceberg, queried from Spark, Trino, Flink, and Pinot — one storage layer, many engines. Without Iceberg, Netflix would have had to either commit to a proprietary warehouse (expensive at PB scale) or accept the limitations of Hive (which were causing real production incidents). The decisive insight: at hyperscale, the cost difference between proprietary warehouse compute and an open-format lakehouse is hundreds of millions per year.
Pro Tips
- 01
Choosing a table format is a 5-year commitment. Iceberg is winning the multi-engine race (now supported by Snowflake, BigQuery, Databricks, Trino, Spark, Flink). Delta has the best Databricks experience but weaker non-Databricks support. Hudi excels at CDC/streaming but has narrower adoption. Pick based on your engine future, not blog volume.
- 02
The catalog matters as much as the table format. Without a strong catalog (Unity Catalog, Apache Polaris, AWS Glue), a lakehouse devolves into ungoverned files. The catalog is what enforces schemas, permissions, lineage, and consistency across engines. Plan catalog architecture before storage architecture.
- 03
Cloud warehouse vendors (Snowflake, BigQuery, Databricks SQL) have all added lakehouse interop with open formats — meaning the binary 'warehouse vs lakehouse' choice is dissolving. The pragmatic 2024+ architecture is often: warehouse engine for BI + open table format for storage + multi-engine read for ML. You get warehouse simplicity AND open format flexibility.
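As a small illustration of that multi-engine pattern: an Iceberg table written by Spark can be read directly by a second engine such as DuckDB. This is a hedged sketch; the bucket path is illustrative, DuckDB's iceberg extension behavior varies by version, and S3 credentials still need to be configured separately.

```python
# Minimal sketch: read an Iceberg table from a second engine (DuckDB).
# Assumes the table was written by another engine (e.g., the Spark example earlier),
# that the duckdb iceberg/httpfs extensions are available, and that S3 credentials
# are configured; the path below is illustrative.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// access

# Point DuckDB at the Iceberg table's location; no Spark cluster required.
rows = con.execute("""
    SELECT count(*) AS order_count
    FROM iceberg_scan('s3://example-bucket/warehouse/sales/orders')
""").fetchall()
print(rows)
```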
Myth vs Reality
Myth
“Lakehouses always replace data warehouses”
Reality
For most companies under ~1 PB, a cloud warehouse is faster, cheaper, and simpler. Lakehouses become economically dominant only at large scale (5+ PB) or when polyglot engines are required (Spark + Trino + Flink + ML frameworks). Below that, the operational overhead of a lakehouse exceeds the cost savings versus Snowflake or BigQuery.
Myth
“An S3 bucket of Parquet files is a lakehouse”
Reality
Without an open table format (Iceberg/Delta/Hudi) and a catalog, a Parquet-on-S3 setup has no ACID transactions, no schema evolution, no time travel, and no query optimization — it's a data lake (and likely a swamp). The 'house' part of 'lakehouse' is what the table format and catalog provide. Skipping them is the most common form of fake lakehouse adoption.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge — answer the challenge or try the live scenario.
Knowledge Check
A 200-person Series B SaaS company has 25 source systems and ~8 TB of total data growing to ~50 TB in 3 years. The CTO is excited about adopting a Databricks-based lakehouse. What's the right answer?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Storage Architecture by Data Volume
Industry architecture norms 2024 across enterprise data platforms
<10 TB
Cloud warehouse (Snowflake, BigQuery)
10 TB - 1 PB
Cloud warehouse, optional Iceberg interop
1-5 PB
Hybrid warehouse + open format
>5 PB
Full lakehouse (Iceberg/Delta) on object storage
Source: https://www.databricks.com/glossary/data-lakehouse
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Netflix
2018-present
Netflix created Apache Iceberg specifically because the Hive table format was breaking at their scale. With hundreds of PB across thousands of tables, Hive's lack of atomic writes, slow partition listing, and schema evolution problems were causing real production incidents. Iceberg introduced hidden partitioning, snapshots, time travel, and atomic operations — and Netflix open-sourced it in 2018. Today Netflix runs hundreds of petabytes on Iceberg, queried by Spark, Trino, Flink, and Pinot from one storage layer. Iceberg has since been adopted by Apple, Pinterest, Shopify, Snowflake, and most major data platforms.
Data on Iceberg
Hundreds of PB
Query Engines
Spark, Trino, Flink, Pinot
Open-Sourced
2018
Industry Adoption
Now de facto multi-engine standard
At hyperscale, open table formats deliver compounding value: cost savings, engine flexibility, and freedom from any one vendor. The lakehouse is the architectural answer for the largest data estates in the world.
Uber
2017-present
Uber created Apache Hudi to handle the unique requirements of incremental data lake updates from CDC streams. With hundreds of PB and constant updates from operational systems (trips, payments, user state), Uber needed a lakehouse table format that could handle upserts efficiently. Hudi provides incremental queries, record-level upserts (sketched after this case), and time travel on top of Parquet/ORC files. Today Uber runs much of its analytics and ML data platform on Hudi, with Spark, Presto, Hive, and Flink as compute engines. Hudi has been adopted by ByteDance, Walmart, and Robinhood for similar CDC-heavy lakehouse use cases.
Data on Hudi
Hundreds of PB
Use Case Strength
CDC, upserts, streaming
Open-Sourced
2017
Architecture Type
Streaming-first lakehouse
The right table format depends on your dominant workload. Iceberg for batch + multi-engine. Delta for Databricks-centric. Hudi for CDC and streaming-heavy. The choice locks you in for years — analyze workload first, blogs second.
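To make the upsert pattern from the Uber case concrete, here is a minimal PySpark sketch of a Hudi record-level upsert. It assumes the hudi-spark bundle is on the classpath; the path, table name, and field names are illustrative.

```python
# Minimal sketch: record-level upsert into a Hudi table from PySpark.
# Assumes the hudi-spark bundle is on the classpath; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A CDC-style micro-batch: one update to an existing trip, one brand-new trip.
updates = spark.createDataFrame(
    [
        ("trip-001", "completed", 18.40, "2024-01-15 10:05:00"),
        ("trip-999", "started", 0.00, "2024-01-15 10:06:00"),
    ],
    ["trip_id", "status", "fare", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # record identity
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
    "hoodie.datasource.write.operation": "upsert",
    # Non-partitioned table to keep the sketch small.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

# Upsert: existing trip_ids are rewritten in place, new ones are inserted.
# (The very first write to a new path is typically done with mode("overwrite").)
(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://example-bucket/lake/trips")
)

# Read back the current snapshot of the table.
spark.read.format("hudi").load("s3a://example-bucket/lake/trips").show()
```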
Hypothetical: Series B SaaS
2022
A 180-person SaaS company with 12 TB of data adopted a self-managed Iceberg + Spark + Trino lakehouse after the CTO returned from a conference. The migration took 11 months and required hiring 2 platform engineers. After 18 months operating the lakehouse, total cost (engineering + infrastructure) was 2.4x what Snowflake would have cost for the same workload. BI users complained about slower dashboards. The team migrated back to Snowflake at month 22, retaining Iceberg only for a small data science workload. Total opportunity cost: ~$2.5M and 22 months.
Data Volume
12 TB
Migration Time
11 months
Total Cost vs Snowflake
2.4x
Eventual Outcome
Migrated back at month 22
Lakehouse complexity is justified by scale. At 12 TB, the operational overhead of a self-managed lakehouse dwarfs any storage cost savings. Architecture must match scale, not aspiration.
Decision scenario
The Lakehouse Migration Pitch
You're CTO at a 1,400-person retailer. Current state: on-prem Hadoop cluster with ~1.5 PB of data, increasingly unreliable. The Databricks team pitches a Delta-based lakehouse ($1.6M/year). The Snowflake team pitches their warehouse with Iceberg interop ($2.4M/year). Your data team is 35 people: 20 on BI/analytics, 15 on data engineering and platform. You have a 6-month deadline before the Hadoop cluster reaches end of vendor support.
Data Volume
1.5 PB
Workloads
BI + ML + ad-hoc analytics
Data Team Size
35
Deadline
6 months
Budget Range
$1.6M - $2.4M/year
Decision 1
The CFO wants the cheapest option (Databricks lakehouse). The BI team wants Snowflake because it 'just works' for SQL analysts. The data science team wants Databricks for Spark and ML. You have to choose one to meet the deadline.
Choose Snowflake despite higher cost — BI team familiarity reduces migration risk and meets the 6-month deadline confidently.
Choose Databricks lakehouse with Delta as the table format. Migrate BI workloads to Databricks SQL Warehouse and ML to Spark, both reading the same Delta tables under Unity Catalog. Accept the harder upskill curve to unify the data layer. ✓ Optimal
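A minimal sketch of the "one Delta table, two access paths" idea behind that choice. It assumes delta-spark is configured on the cluster; the schema, table, and column names are illustrative, and in the scenario the SQL path would run on a SQL warehouse endpoint rather than in the same session.

```python
# Minimal sketch: one Delta table serving both BI-style SQL and ML feature reads.
# Assumes delta-spark is available; schema, table, and column names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-shared-layer-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("CREATE SCHEMA IF NOT EXISTS retail")

# One governed Delta table acts as the shared storage layer.
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail.orders (
        order_id   BIGINT,
        store_id   INT,
        amount     DOUBLE,
        order_date DATE
    ) USING delta
""")

# BI path: SQL aggregation (in the scenario, served from a SQL warehouse endpoint).
spark.sql("""
    SELECT store_id, sum(amount) AS revenue
    FROM retail.orders
    GROUP BY store_id
""").show()

# ML path: the same table loaded as a DataFrame for feature engineering.
features = spark.table("retail.orders").groupBy("store_id").count()
features.show()
```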
Beyond the concept
Turn Data Lakehouse Architecture into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.