Data Lake — What It Is and When to Use One

What a Data Lake Is

A data lake is a centralised storage repository that holds raw data at scale (structured tables, semi-structured JSON and CSV files, columnar files such as Parquet, unstructured documents, images, and log files), all in their original format, without requiring a schema to be defined before storage. You ingest first and define structure at query time (schema-on-read). The contrast with a data warehouse is important: a warehouse imposes schema-on-write, meaning you define exactly which columns and types you expect before data arrives, and data must be transformed to fit.
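A minimal sketch of schema-on-read, using only the Python standard library. The sample records, the READ_SCHEMA mapping, and the read_with_schema helper are all hypothetical names chosen for illustration; the point is only that the raw lines are stored untouched and the reader decides which fields and types it wants.

```python
import json

# Raw events stored exactly as they arrived; no schema was enforced at write time.
RAW_LINES = [
    '{"user_id": "u1", "event": "click", "price": "9.99"}',
    '{"user_id": "u2", "event": "view"}',
    '{"user_id": "u3", "event": "purchase", "price": 19.5, "coupon": "X1"}',
]

# Hypothetical read-time schema: field name -> (type caster, default when missing).
READ_SCHEMA = {
    "user_id": (str, None),
    "event": (str, None),
    "price": (float, 0.0),
}

def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema chosen by the reader."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, (cast, default) in schema.items():
            value = record.get(name, default)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows

rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a different schema (for example, one that keeps the coupon field) without any change to what is stored.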

The value of keeping raw data is that you can reprocess it as requirements change. An analyst who needs a new dimension from raw event logs can compute it from source without waiting for a pipeline to be built and backfilled. A machine learning team can access all historical user events as features, not just the pre-aggregated summaries that fit in a reporting table.

Data lakes are typically built on object storage: Amazon S3, Azure Data Lake Storage Gen2 (ADLS Gen2), or Google Cloud Storage. Standard-tier storage typically costs on the order of a few cents per GB per month, and archive tiers cost fractions of a cent, making it economical to retain years of raw data.

Data Lake vs Data Warehouse — Real Differences

The debate is frequently framed as a binary choice. It is not. Most mature data platforms use both.

Dimension	Data Lake	Data Warehouse
Schema	Defined at read time	Defined at write time
Data types	Any (raw, semi-structured, unstructured)	Structured/tabular
Cost per TB	Very low (object storage)	Higher (managed warehouse compute + storage)
Query performance	Slower on raw data without optimisation	Fast for SQL analytics
Flexibility	High — schema can evolve without migration	Lower — schema changes require ETL updates
Best for	ML, raw event data, archive, exploration	BI reports, dashboards, structured analytics

When a data lake wins: you need to store data whose schema is not yet defined, you need ML feature access to raw events, you have very high data volumes where warehouse storage costs are prohibitive, or you need to support multiple downstream consumers with different transformation requirements.

When a data warehouse wins: you need fast, reliable SQL queries for business reporting, your data is well-structured and schema-stable, and end users are analysts who need dashboards and ad-hoc SQL, not ML engineers who need raw features.

In practice, the most common pattern is a data lake as the source of truth (raw ingestion layer) feeding a data warehouse (curated, transformed layer for analytics). The lake is the ground truth; the warehouse is the serving layer.

Core Architecture

A data lake has four functional layers:

Ingestion

Data arrives from source systems via batch (daily exports, ETL jobs) or streaming (Kafka, Kinesis, Pub/Sub). The ingestion layer lands data in a raw zone — often a dedicated S3 prefix or container — without transformation. Ingestion tools include Apache Kafka, AWS Glue, Azure Data Factory, Fivetran, and custom Spark streaming jobs.
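As a rough sketch of landing data in a raw zone without transformation, the example below writes a payload byte-for-byte under a dated prefix. A local temporary directory stands in for an object store, and the land_raw function, the "crm" source name, and the ingest_date=... path convention are assumptions made for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path

def land_raw(lake_root: Path, source: str, ingest_date: date,
             filename: str, payload: bytes) -> Path:
    """Write the payload unchanged under raw/<source>/ingest_date=YYYY-MM-DD/."""
    target_dir = lake_root / "raw" / source / f"ingest_date={ingest_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no cleaning: the raw zone keeps source bytes
    return target

# A temporary local directory stands in for a bucket or container.
lake_root = Path(tempfile.mkdtemp())
payload = b'{"order_id": 1, "total": "12.50"}\n'
landed = land_raw(lake_root, "crm", date(2024, 5, 1), "orders.jsonl", payload)
print(landed.relative_to(lake_root))
```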

Storage

Object storage is the foundation: S3, ADLS Gen2, or GCS. Data is stored in columnar formats where possible — Parquet and ORC are standard for structured data because they compress well and enable column pruning (only reading the columns a query needs, not full rows). The storage layer is organised by zone: raw (untouched source data), curated (cleaned, validated), and serving (modelled, aggregated for specific use cases).
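Column pruning can be pictured with plain Python lists. This is a toy model, not how Parquet or ORC are implemented: the to_columnar and total_amount helpers and the sample orders are hypothetical, and the only point is that a columnar layout lets a query touch one column and skip the rest.

```python
# Row-oriented layout: each record holds every field together.
ROWS = [
    {"order_id": 1, "country": "DE", "amount": 20.0, "notes": "gift wrap"},
    {"order_id": 2, "country": "FR", "amount": 35.5, "notes": ""},
    {"order_id": 3, "country": "DE", "amount": 12.25, "notes": "express"},
]

def to_columnar(records):
    """Pivot records into one list per column, the way a columnar file groups values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_amount(columns):
    """Only the 'amount' column is read; the other columns are never touched."""
    return sum(columns["amount"])

columns = to_columnar(ROWS)
total = total_amount(columns)
print(total)
```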

Processing

Distributed compute processes data at scale. Apache Spark is the standard engine: it runs on EMR (AWS), Databricks (multi-cloud), HDInsight (Azure), or Dataproc (GCP). Spark handles batch transformations, streaming through Structured Streaming, and ML training on large datasets. Databricks is among the most widely adopted commercial platforms for Spark-based lake processing.

Cataloging and Metadata

Without a catalog, a data lake becomes a data swamp — nobody knows what’s in it. AWS Glue Data Catalog, Apache Atlas, Databricks Unity Catalog, and Microsoft Purview all solve this: they track what datasets exist, their schemas, where they came from (lineage), and who can access them. A functioning catalog is not optional for a production data lake.
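To make the catalog's responsibilities concrete, here is a toy in-memory catalog that records schema, ownership, lineage, and readers. It does not reflect the API of any of the products named above; DatasetEntry, Catalog, and the dataset names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    location: str
    schema: dict                                   # column name -> type name
    owner: str
    upstream: list = field(default_factory=list)   # direct lineage parents
    readers: set = field(default_factory=set)      # principals allowed to read

class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def lineage(self, name: str) -> list:
        """Return all upstream dataset names, nearest first."""
        seen, queue = [], list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen

    def can_read(self, name: str, principal: str) -> bool:
        return principal in self._entries[name].readers

catalog = Catalog()
catalog.register(DatasetEntry("raw.orders", "raw/orders/", {"payload": "string"}, "ingest-team"))
catalog.register(DatasetEntry("curated.orders", "curated/orders/",
                              {"order_id": "int", "amount": "double"}, "data-eng",
                              upstream=["raw.orders"], readers={"analyst"}))
catalog.register(DatasetEntry("serving.daily_revenue", "serving/daily_revenue/",
                              {"day": "date", "revenue": "double"}, "analytics",
                              upstream=["curated.orders"], readers={"analyst", "bi-tool"}))
print(catalog.lineage("serving.daily_revenue"))
```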

Lakehouse: Why Delta Lake and Iceberg Emerged

The original data lake had a serious production problem: no ACID transactions. If a write job failed halfway through, you had partial data with no rollback. Concurrent reads and writes could return inconsistent results. Updates and deletes (necessary for GDPR compliance, among other things) were painful or impossible without full rewrites.

Delta Lake (open source, originated at Databricks) and Apache Iceberg (originated at Netflix, now broadly adopted) solved this by adding a transaction log on top of object storage. Every write is recorded as a commit; reads always see a consistent snapshot; a failed write never commits, so its partial output stays invisible to readers. Both formats also add time travel (querying the table as it was at an earlier retained version) and schema evolution, and Iceberg additionally supports partition evolution without full rewrites.
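The transaction log idea can be sketched with a toy table on the local filesystem. This is a deliberately simplified illustration and not the Delta Lake or Iceberg protocol: ToyTable, its file naming, and the fail_before_commit flag are hypothetical. Data files are written first, a numbered log entry is created last, and readers only trust files that the log references.

```python
import json
import tempfile
from pathlib import Path

class ToyTable:
    def __init__(self, root: Path):
        self.root = root
        self.log_dir = root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def commit(self, rows, fail_before_commit=False):
        versions = self._versions()
        next_version = versions[-1] + 1 if versions else 0
        data_file = self.root / f"part-{next_version:05d}.json"
        data_file.write_text(json.dumps(rows))
        if fail_before_commit:
            # The data file exists, but no log entry points at it.
            raise RuntimeError("simulated crash before commit")
        # Exclusive create ("x") fails if another writer already claimed this version.
        with open(self.log_dir / f"{next_version:05d}.json", "x") as log_entry:
            json.dump({"add": data_file.name}, log_entry)
        return next_version

    def read(self, as_of=None):
        """Return rows from committed files only, optionally as of an earlier version."""
        rows = []
        for version in self._versions():
            if as_of is not None and version > as_of:
                break
            entry = json.loads((self.log_dir / f"{version:05d}.json").read_text())
            rows.extend(json.loads((self.root / entry["add"]).read_text()))
        return rows

table = ToyTable(Path(tempfile.mkdtemp()))
table.commit([{"id": 1}])
try:
    table.commit([{"id": 2}], fail_before_commit=True)
except RuntimeError:
    pass  # the partial write never became visible
table.commit([{"id": 3}])
print(table.read())          # current snapshot
print(table.read(as_of=0))   # time travel to version 0
```

Real table formats add much more (removals, schema metadata, checkpointing, conflict resolution), but the ordering of "write data, then commit a log entry" is the part that makes partial writes harmless.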

This combination — the flexibility of a data lake with the reliability of a data warehouse — is what “lakehouse” describes. Databricks and Snowflake both use this term, though they approach it from different directions (Databricks from the lake, Snowflake from the warehouse).

For any new data lake built in 2024 or later, Delta Lake or Iceberg is the default choice. Building on raw Parquet without a table format is building in reliability debt.

Common Use Cases

ML feature stores. ML models need features: aggregated, normalised representations of raw events. A data lake with a processing layer is the natural place to compute and store these features at scale. The lake holds raw events; Spark jobs compute features; a feature store (Feast, Tecton, or a custom solution) serves them to training and inference pipelines.
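A small sketch of the "raw events in, features out" step, in plain Python rather than Spark so that it stays self-contained. The event fields, the feature names, and compute_user_features are assumptions for illustration; a production job would run the same aggregation as a distributed query.

```python
from collections import defaultdict

# Hypothetical raw events as they might sit in the lake's raw zone.
EVENTS = [
    {"user_id": "u1", "event": "view", "amount": 0.0},
    {"user_id": "u1", "event": "purchase", "amount": 20.0},
    {"user_id": "u2", "event": "view", "amount": 0.0},
    {"user_id": "u1", "event": "purchase", "amount": 5.5},
]

def compute_user_features(events):
    """Aggregate raw events into one feature row per user."""
    totals = defaultdict(lambda: {"event_count": 0, "purchase_count": 0, "total_spend": 0.0})
    for event in events:
        row = totals[event["user_id"]]
        row["event_count"] += 1
        if event["event"] == "purchase":
            row["purchase_count"] += 1
            row["total_spend"] += event["amount"]
    # Derived feature: share of a user's events that were purchases.
    for row in totals.values():
        row["purchase_rate"] = row["purchase_count"] / row["event_count"]
    return dict(totals)

features = compute_user_features(EVENTS)
print(features["u1"])
```

Because the raw events are retained, a new feature can be added later by changing this function and recomputing over history.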

Raw event data at scale. Clickstream, application logs, IoT sensor readings — data that arrives at high volume and whose full analysis requirements aren’t known in advance. Object storage handles the volume; schema-on-read handles the flexibility.

Multi-source consolidation. An enterprise with 30 source systems (CRM, ERP, e-commerce platform, support ticketing, ad platforms) needs a single place where all raw data lands before transformation. A data lake is that landing zone — source systems write there, and multiple downstream consumers (warehouse, ML platform, reporting) read from it with different transformations.

Archive and compliance. Regulatory requirements often mandate retaining raw transaction or communication data for 5–10 years. Object storage costs make this affordable. A data lake doubles as the compliance archive.

Build Considerations

Data catalog from day one. Teams that defer the catalog end up with an ungoverned lake that no one trusts. Implement cataloging (schema registration, ownership, lineage) as part of the initial build, not as a later addition.

Data quality checks at ingestion. Great Expectations, Soda, or custom validation jobs should run at the ingestion layer, so that problems are caught when data arrives rather than discovered months later when a downstream model starts producing bad predictions.
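A custom validation job can be as simple as the sketch below. It does not use Great Expectations or Soda; the rule set (REQUIRED_FIELDS, a non-negative numeric amount) and the function names are hypothetical. Rejected records are kept together with their reasons so they can be quarantined rather than silently dropped.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")  # hypothetical rule set

def check_record(record):
    """Return a list of rule violations; an empty list means the record passes."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if record.get(name) is None]
    amount = record.get("amount")
    if amount is not None:
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            problems.append("amount is not numeric")
        elif amount < 0:
            problems.append("amount is negative")
    return problems

def validate_batch(records):
    """Split a batch into accepted records and (record, problems) rejects."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected

batch = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 2, "customer_id": None, "amount": 5.0},
    {"order_id": 3, "customer_id": "c3", "amount": -1},
    {"order_id": 4, "customer_id": "c4", "amount": "12"},
]
accepted, rejected = validate_batch(batch)
for record, problems in rejected:
    print(record["order_id"], problems)
```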

Partitioning strategy. Badly partitioned data (or no partitioning) means Spark jobs scan entire datasets for queries that should only touch a small slice. Partition by the columns most commonly used in filters: date is almost always one; an entity ID is often a second, provided its cardinality is low enough that partitions do not fragment into many small files.
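Partition pruning can be illustrated with key=value directory names, a convention many engines understand. The table name, the event_date and region keys, and the helper functions are hypothetical; the sketch only shows that a filter on a partition column can discard whole directories from their paths alone, before any file is opened.

```python
def partition_path(table, event_date, region):
    """Build a key=value style partition path for one slice of the table."""
    return f"{table}/event_date={event_date}/region={region}"

def partition_values(path):
    """Extract partition keys and values from the path segments."""
    return dict(segment.split("=", 1) for segment in path.split("/") if "=" in segment)

def prune(paths, **wanted):
    """Keep only partitions whose values match every requested filter."""
    return [
        path for path in paths
        if all(partition_values(path).get(key) == value for key, value in wanted.items())
    ]

PATHS = [
    partition_path("events", day, region)
    for day in ("2024-05-01", "2024-05-02")
    for region in ("eu", "us")
]
selected = prune(PATHS, event_date="2024-05-02")
print(selected)
```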

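Access control. Column-level and row-level security matter, especially for PII. Unity Catalog and AWS Lake Formation provide both, while ADLS ACLs operate at the file and directory level, so finer-grained control there has to come from the query or catalog layer. All of these require upfront design decisions about which teams get access to what.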
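The effect of row-level and column-level rules can be sketched as a function applied to query results. This is not how any of the named products implement or expose their policies; the POLICIES table, the role names, and apply_policy are hypothetical, and a real system would enforce this inside the engine rather than in application code.

```python
# Hypothetical policy table: which columns are masked and which rows are visible per role.
POLICIES = {
    "analyst": {
        "masked_columns": {"email"},
        "row_filter": lambda row: row["country"] == "DE",
    },
    "admin": {
        "masked_columns": set(),
        "row_filter": lambda row: True,
    },
}

def apply_policy(rows, role):
    """Drop rows the role may not see and mask columns it may not read."""
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue
        visible.append({
            name: ("***" if name in policy["masked_columns"] else value)
            for name, value in row.items()
        })
    return visible

CUSTOMERS = [
    {"id": 1, "country": "DE", "email": "a@example.com"},
    {"id": 2, "country": "FR", "email": "b@example.com"},
]
analyst_view = apply_policy(CUSTOMERS, "analyst")
admin_view = apply_policy(CUSTOMERS, "admin")
print(analyst_view)
```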

Common Mistakes

The swamp problem. A data lake without governance, cataloging, and data quality is a data swamp: data lands, nobody knows what it is, quality is unknown, and teams stop trusting it. The fix is governance discipline from the start — not a technology purchase.

Using the lake for structured reporting. If business users need fast SQL queries on well-structured data, put that data in a warehouse. Running Spark jobs or Athena queries against raw Parquet files for every dashboard refresh is slow, expensive, and unreliable. The lake is for storage and processing; the warehouse is for serving.

No schema evolution plan. Source systems change their schemas. Without a table format (Delta/Iceberg) and a schema evolution policy, a schema change in a source system can silently corrupt downstream tables or fail ingestion jobs.
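One possible shape for a schema evolution policy is to compare the incoming schema with the registered one before ingesting, and to let only additive changes through automatically. The schemas, diff_schemas, and is_additive below are hypothetical, and the "additive changes are safe" rule is an assumption that a given team may want to tighten or relax.

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings and classify the differences."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in set(old) & set(new) if old[name] != new[name]),
    }

def is_additive(diff):
    """Treat a change as safe only if nothing was removed or retyped (an assumed policy)."""
    return not diff["removed"] and not diff["changed"]

REGISTERED = {"order_id": "int", "amount": "double", "currency": "string"}
INCOMING_OK = {"order_id": "int", "amount": "double", "currency": "string", "channel": "string"}
INCOMING_BAD = {"order_id": "string", "amount": "double"}

ok_diff = diff_schemas(REGISTERED, INCOMING_OK)
bad_diff = diff_schemas(REGISTERED, INCOMING_BAD)
print(ok_diff, is_additive(ok_diff))
print(bad_diff, is_additive(bad_diff))
```

A pipeline could ingest when the check passes and otherwise stop and alert the owning team, which turns a silent corruption into an explicit decision.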