# All things Delta!
## What is Delta Lake?
Delta Lake is an open-source storage layer that brings reliability to data lakes by adding transactional guarantees on top of data stored in cloud object storage (AWS S3, Azure Storage, and GCS). It provides ACID transactions, data versioning, and rollback capabilities, and it lets you handle both batch and streaming data in a unified way.
Delta tables are built on top of this storage layer and provide a table abstraction, making it easy to work with large-scale structured data using SQL and the DataFrame API.
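As a minimal sketch of that table abstraction, the following PySpark snippet assumes a Spark session configured with Delta Lake support (e.g. Databricks, or the open-source delta-spark package locally); the path `/tmp/delta/events` is hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake support is already configured on this session
# (e.g. Databricks, or delta-spark on a local Spark install).
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a small DataFrame as a Delta table at a hypothetical path.
df = spark.range(0, 5).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back with the DataFrame API...
spark.read.format("delta").load("/tmp/delta/events").show()

# ...or query the same data with SQL.
spark.sql("SELECT COUNT(*) AS n FROM delta.`/tmp/delta/events`").show()
```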
## What are Delta Tables?
The Delta table is the default data table format in Databricks and is a feature of the open-source Delta Lake framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches.
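For example, a Delta table can be registered by name and loaded in large batches; this sketch assumes the same Delta-enabled Spark session as above, and the table name `events` is hypothetical:

```python
# Register a named Delta table in the metastore (hypothetical name).
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        payload  STRING
    ) USING DELTA
""")

# Ingest a large batch; streaming writes could target the same table.
batch = spark.range(0, 100).selectExpr("id AS event_id", "'batch' AS payload")
batch.write.format("delta").mode("append").saveAsTable("events")
```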
## What are Lakeflow Declarative Pipelines?
Lakeflow Declarative Pipelines manages the flow of data between many Delta tables, simplifying ETL development and management for data engineers. The pipeline is the main unit of execution, and the framework offers declarative pipeline development, improved data reliability, and cloud-scale production operations.
Users can perform both batch and streaming operations on the same table, and the data is immediately available for querying. You define the transformations to perform on your data, and Lakeflow Declarative Pipelines manages task orchestration, cluster management, monitoring, data quality, and error handling. Its enhanced autoscaling can handle spiky, unpredictable streaming workloads.
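As a rough sketch of this declarative style, the following assumes it runs inside a Lakeflow Declarative Pipelines (formerly Delta Live Tables) pipeline on Databricks, where the `dlt` module and a `spark` session are provided; the source path and column names are hypothetical:

```python
import dlt

# Bronze table: ingest raw files incrementally with Auto Loader.
@dlt.table(comment="Raw events ingested from cloud storage.")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/raw/events")  # hypothetical source path
    )

# Silver table: the pipeline creates and keeps this up to date,
# dropping rows that violate the declared expectation.
@dlt.table(comment="Cleaned events.")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
def clean_events():
    return dlt.read_stream("raw_events").select("event_id", "payload")
```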
## Delta Tables vs DLT
A Delta table is a way to store data in tables, whereas Lakeflow Declarative Pipelines (formerly Delta Live Tables, or DLT) lets you describe how data flows between those tables declaratively. It is a declarative framework that manages many Delta tables, creating them and keeping them up to date. In short, Delta tables are a data table architecture, while Lakeflow Declarative Pipelines is a data pipeline framework.
## Delta Lake Transaction Log
The transaction log is a single source of truth: it tracks every change users make to the table and is the mechanism through which Delta Lake guarantees atomicity.
The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features (a short sketch follows the list):
- ACID transactions
- Scalable metadata handling
- Time travel
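For illustration, this sketch reads the log-backed history of a table and time-travels to an earlier version, assuming the hypothetical `/tmp/delta/events` table from the earlier example and the delta-spark `DeltaTable` API:

```python
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/delta/events")

# Each write is a new commit in the transaction log; history() reads it.
dt.history().select("version", "timestamp", "operation").show()

# Time travel: read the table as of an earlier commit.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()
```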
For more detail, see the GitHub Delta Transaction Protocol.