Big Data comes with a big problem: Big Storage. Choosing how and where to store the massive amounts of information needed for large-scale analytics is a technical and economic question with many possible answers.
Nowadays, organizations aren’t restricted to traditional data warehouses; their options also include data lakes and, more recently, data lakehouses. Here’s how those work.
What is a data lake?
The term “data lake” was coined in 2010, according to Dataversity, by James Dixon, founder and former CTO of business intelligence firm Pentaho. The intent of a data lake is to store Big Data so voluminous that it can’t easily be organized or navigated with SQL tools. Since the value of the information derives predominantly from its sheer quantity, it made economic sense to come up with a different architecture, one that required relatively fewer resources to run.
“In the early days, when you had on-prem data...you’re talking about having to pay for really high-performing computers with lots of storage,” Alex Merced, developer advocate at data lake analytics platform Dremio, told IT Brew. That setup also meant expanding storage could force an organization to pay for processing power it didn’t actually need at the time.
Apache Hadoop, a framework whose distributed file system spreads data across cluster nodes of commodity hardware, enabled an alternative structure built on cheaper computers focused on storage, Merced said.
Where a traditional data warehouse stores information in a hierarchy of files and folders, a data lake is designed to hold large amounts of unstructured, semi-structured, or structured data in its raw, native format. Data lakes use a flat architecture built on distributed file storage, such as Hadoop’s HDFS, or on object storage.
Since lakes use a schema-on-read approach, data doesn’t need processing or a formal structure at the time it’s written. That means data lakes scale well and have a low cost to operate, but the information often requires additional expertise to read.
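To make the schema-on-read idea concrete, here’s a minimal sketch using PySpark; the bucket path, field names, and choice of tooling are illustrative assumptions, not details from the article. Raw files land in the lake untouched, and a structure is applied only when someone reads them:

```python
# Minimal schema-on-read sketch with PySpark (illustrative only; the path
# and field names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events were written to the lake as-is: no schema was enforced at write time.
raw_path = "s3a://example-lake/raw/events/"

# The schema is supplied only now, at read time, by the consumer who knows
# what shape they need. A different consumer could read the same files
# with a different schema.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

events = spark.read.schema(schema).json(raw_path)
events.printSchema()
```

The same raw files can serve multiple teams with different schemas, which is where the flexibility (and the need for extra expertise) comes from.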
Use cases for data lakes are diverse, but they’re ideal for data analytics or data science projects that involve vast amounts of information, as well as predictive modeling and machine learning.
“The number one use case for having a data lake is just to make sure that you have retention of the data in the easiest, lowest-cost form possible,” Merced told IT Brew. Data lakes are also useful for flexibility, as they remove the need to copy and sync data to multiple data warehouses performing different specialized tasks.
One size doesn’t fit all
The architecture of a given data lake can vary widely, and environments can be on-prem, hybrid, or cloud-native. While many data lakes rely on Hadoop, others use object storage services like Amazon S3 or similar offerings from Azure and Google, and any of these can run alongside different analytics engines or database technologies.
Merced said some key innovations in the history of data lakes were Apache Hive, a Big Data tool that allows extraction and analytics via SQL-like queries; unified analytics engine Apache Spark; and Apache Iceberg, a high-performance format for analytics tables. Cloud data lakes also allow decoupling of storage and compute requirements.
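As a rough sketch of what Hive-style, SQL-based analytics over a lake looks like in practice (the storage path, view name, and columns below are hypothetical, and Spark SQL stands in here for the broader family of engines), files sitting in object storage can be registered and queried directly:

```python
# Sketch of SQL-style analytics over lake files with PySpark
# (assumed setup; bucket, view name, and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql-demo").getOrCreate()

# Parquet files in object storage become queryable as a view...
orders = spark.read.parquet("s3a://example-lake/curated/orders/")
orders.createOrReplaceTempView("orders")

# ...and analysts can work in plain SQL, the style of access Hive pioneered.
daily_totals = spark.sql("""
    SELECT order_date, SUM(total) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_totals.show()
```

Because the engine and the storage are separate pieces, the compute can be scaled up for a heavy query and torn down afterward, which is the decoupling cloud data lakes allow.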
Lakes and lakehouses
A “data lakehouse” is a newer concept that combines the cheap, scalable storage of a data lake with the data management capabilities of a data warehouse. Lakehouses apply warehouse-like structure and schema to the kinds of raw information data lakes were designed to host, allowing centralized, on-demand storage of data from multiple sources.
Oliver Ratzesberger, VP of product data and analytics at Google Cloud, told IT Brew that lakehouses arose from the realization that the larger a data lake becomes, the harder it becomes to locate any given information. The data flowing in also wasn’t sanity checked, leading to errors.
Lakehouses use three layers: a storage layer, a staging layer that catalogs data objects with metadata, and a semantic layer where users actually work with the data. They can also perform ACID (atomicity, consistency, isolation, and durability) transactions on massive workloads, guaranteeing data integrity.
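As a sketch of what an ACID transaction on a lakehouse can look like, the snippet below assumes an Apache Iceberg table and a Spark session already configured with an Iceberg catalog; the catalog, table, and column names are hypothetical. The point is that the whole upsert either applies or doesn’t:

```python
# Sketch of an atomic upsert on a lakehouse table (assumes a Spark session
# already wired to an Apache Iceberg catalog named "lake"; table and column
# names are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-acid-demo").getOrCreate()

# The MERGE below runs as a single ACID transaction: concurrent readers see
# either the table as it was before, or the fully merged result, never a
# half-applied mix.
spark.sql("""
    MERGE INTO lake.sales.orders AS target
    USING lake.sales.orders_updates AS updates
    ON target.order_id = updates.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```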
“These things have started to come together again where the warehouse and the lakehouse and the data lake are all starting to merge into one suite of tools,” Ratzesberger said, pointing to support for SQL, semi-structured processing, and algorithmic access where necessary. With universal metadata, tools like Google’s BigQuery can now handle data ranging from fully structured to completely unstructured, he added.
“If you look at a modern digital company, you will quickly find yourself in the spectrum of highly structured data—because it’s transactional data—all the way to behavioral data, where customers are just expressing their behaviors through doing something, all the way to unstructured data, where they send you an image of something or an email,” Ratzesberger said.