From Warehouses to Lakehouses: The Journey

Historically, database technologies came with their data and compute capabilities coupled. If you wanted to use the query capabilities of a specific database, you had to copy/move the data into that database. The same applied to Data Warehouses, which are essentially databases on steroids. You had to put the data in to get some insights out.

Most databases have specialised query engines that work with their proprietary file formats, and they require you to push data into them by interacting with that engine (except DuckDB, which is spectacular). Your data ends up isolated behind a wall, with no access to it besides the query engine itself. You can’t query that data by any other means, and for that engine to perform well you either have to tune it to death or throw money at it. You have to learn its language to talk to it, to ask questions about your very own data.

Enter Data Lakes

That changed a lot with Data Lakes. They gave us the ability to keep our data separate from our computation engines, which allowed us to invest in different types of computation clusters for different job types. Do you need more CPU power during your ETL processes for one hour every day? You can have it. Do you want a specialised AI cluster that runs on GPUs? You can have that, too.

You don’t even need to copy your data somewhere else. Your data is available in your lake in a readable file format, and you don’t need a specialised runtime to read it. You can write simple code to read the data and write it back. You can write an Azure Function in C# to read/write that information. You can write the same function with Python and process it using Pandas data frames. The sky is the limit.
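As a rough sketch of that flexibility (the storage account, containers, and paths are made up, and the adlfs package is assumed so Pandas can resolve abfss:// URLs), reading and writing lake data with plain Python might look like this:

```python
# A minimal sketch, not a drop-in implementation: read a Parquet file
# straight out of an Azure Data Lake with Pandas, transform it, and write
# it back. The storage account, containers, and paths are made up, and the
# adlfs package is assumed to be installed (credentials are assumed to come
# from the environment).
import pandas as pd

storage_options = {"account_name": "mylakeaccount"}  # hypothetical account

# Read raw sales data directly from the lake -- no database engine involved.
sales = pd.read_parquet(
    "abfss://raw@mylakeaccount.dfs.core.windows.net/sales/2024/sales.parquet",
    storage_options=storage_options,
)

# Any plain-Python transformation works on the DataFrame.
daily_totals = sales.groupby("order_date", as_index=False)["amount"].sum()

# Write the result back to a curated zone in the same lake.
daily_totals.to_parquet(
    "abfss://curated@mylakeaccount.dfs.core.windows.net/sales/daily_totals.parquet",
    storage_options=storage_options,
)
```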

[Image: Midjourney's version of Azure SQL Database]

Into Delta Lakes

Data Lakes were great, but they didn’t offer the same level of comfort in data pipelines as databases did. The food was on the table, but the taste was stale. You didn’t feel comfortable with it, and adoption was slow. Spark made it insanely easy to process that data (and you had SQL), but something wasn’t clicking into place.

That’s where Delta Lake came in: you could have an experience that equalled a database or a data warehouse and still keep your data separate. You could have transaction support, merge capabilities, schema checks on write, and more. That rivalled Data Warehouses in processing capabilities, and it was cheaper. It all started with Databricks’ Delta file format, but the industry quickly filled that gap with Apache Hudi and Apache Iceberg as well.
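To give a flavour of those database-like capabilities, here is a rough PySpark sketch of a transactional Delta Lake merge (upsert) against a table stored directly in the lake. The paths and column names are invented for illustration, and a Spark session configured with the delta-spark package is assumed:

```python
# A rough sketch of Delta Lake's transactional merge (upsert) with PySpark.
# Paths, table locations, and column names are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-merge-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Incoming changes, e.g. read from a raw zone of the lake (hypothetical path).
updates = spark.read.parquet("/lake/raw/customers_changes")

# The existing Delta table sitting directly on the lake (hypothetical path).
customers = DeltaTable.forPath(spark, "/lake/curated/customers")

# Transactional upsert: the schema is checked on write, and readers never
# see a half-applied merge.
(
    customers.alias("c")
    .merge(updates.alias("u"), "c.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```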

Still, the industry didn’t move towards Delta Lake right away. Why? It still required hiring people with skills the companies didn’t have. They had to hire Python developers who knew how to use Spark and work with DataFrames and the like. That wasn’t very palatable for an industry that had SQL as its primary language. Even though you could use Hive tables and query them with SQL within the Spark cluster, it still wasn’t the same experience.

Then the Lakehouses entered the picture.

Enter Lakehouses

The premise was simple: you could define a Warehouse on top of your Data Lake without needing to move the data. You could adopt it easily: you could work with SQL, define tables and materialised views, and define facts and dimensions without knowing a single line of Python. You could hand the underlying infrastructure to your infra department and go crazy with it. You don’t have to move your data, it’s cheaper, and you can create more than one Lakehouse if you need to. When people need to access your data, you can ask them to bring their own clusters.
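As an illustration of that SQL-first experience (table names, paths, and columns are invented, and a Delta-enabled Spark session like the one in the merge sketch above is assumed), defining Lakehouse tables and views can look roughly like this:

```python
# A small sketch of the SQL-first Lakehouse experience using Spark SQL.
# Table names, paths, and columns are hypothetical; the `spark` session is
# assumed to be configured with Delta Lake support.

# Define a Delta table for the fact data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        order_id     BIGINT,
        customer_id  BIGINT,
        order_date   DATE,
        amount       DECIMAL(18, 2)
    )
    USING DELTA
""")

# Load raw Parquet files from the lake into the table.
spark.sql("""
    INSERT INTO sales_fact
    SELECT order_id, customer_id, order_date, amount
    FROM parquet.`/lake/raw/sales`
""")

# Analysts can work purely in SQL from here on (a plain view is used here
# for simplicity).
spark.sql("""
    CREATE OR REPLACE VIEW daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY order_date
""")
```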

What is a Lakehouse?
Lakehouses are basically Data Warehouses built on top of Data Lakes, using primarily Spark-based data processing technologies. They make data processing, reporting, and analytics a breeze, but there’s nothing groundbreaking about them.

Lakehouses didn’t just level the playing field with Data Warehouses, they went beyond it. It was now easier to implement data analytics, with less risk, than to try it with a SQL Server Data Warehouse and fail three times.

Conclusion

Are Lakehouses perfect? No.

Is this the final destination? No.

Does it make reporting and analysis easier? Yes.

Lakehouses still don’t give you the same first-class data warehousing capabilities that Synapse Dedicated SQL Pool (a.k.a. SQL Data Warehouse) gives, but they mitigate that by letting you write custom code that doesn’t depend solely on SQL runtimes.

It may not be exactly the same experience, but it certainly offers a lot of advantages.

Harun Legoz

I’m a cloud solutions architect with a coffee obsession. I’ve been building apps and data platforms for over 18 years, and I also blog about Azure & Microsoft Fabric. Feel free to say hi on Twitter/X!
