The Who, What, and Why of Data Lake Table Formats

May 16, 2024 48 min Free

Description

This presentation explains what a data lakehouse is and traces the evolution of data storage from traditional databases to data lakes and then to data lakehouses. It focuses on three primary table formats (Apache Iceberg, Apache Hudi, and Delta Lake), detailing their architectures, how they manage data and metadata, and their respective strengths and weaknesses. Topics include ACID transactions, schema evolution, copy-on-write vs. merge-on-read strategies, Z-ordering for data layout optimization, and partition evolution. The speaker also touches on format interoperability and the role these formats play in enabling modern data analytics and AI/ML workloads.