Data Engineering and Cloud Platform Knowledge Base
This repository is where I keep track of what I'm learning and experimenting with across data engineering, distributed systems, and cloud platforms. It's part study notes, part code lab, and part reference guide: essentially a place to capture concepts as I work through them and to revisit later when I need a refresher.
I focus on technologies like Apache Spark, Kafka/Redpanda, Databricks, AWS, and Azure, along with supporting tools and practices that are important for building reliable data systems. You'll find a mix of summaries, deep dives into tricky concepts, performance tuning notes, and small experiments that test how things work in practice.
The main goal of this repo is to strengthen my own understanding, but I also hope it can be useful to anyone else navigating similar topics. Think of it as a learning log that balances hands-on exploration with professional best practices - something between a personal notebook and a practical guide.
Data Engineering Knowledge Base
Airflow
- What Is Airflow
- Airflow Vs Databricks Jobs
- Building Blocks Airflow
- Airflow Arch Metadata Db
- Lifecycle Of Dag Run
- Airflow Scheduler Deepdive
- Index
Azure
- Azure Integration Databricks
- Azure Portal Subscriptions Resourcegroups
- Azure Cli Scenarios
- Azure Powershell Scenarios
- Azure Arm Templates
- Azure Bicep Templates
- Overview Of Azure Storage
- Blob Storage Fundamentals
- Adls Gen2 Overview
- Azure Rbac Acl
- Azure Types Of Storage
- Azure Storage Replication Strategies
- Soft Delete Pitr Azure Storage
- Azure Shared Access Signature
- Azure Lifetime Management Policies
- Eventgrid Integration Azure
- Azure Encryption Standards
- Azure Private Endpoints
- Cross Region Replication Azure
- Azure Storage Rest Api
- Introduction Azure Data Factory
- Azure Data Factory Vs Synapse
- Azure Data Factory Architecture
- Adf Triggers Intro
- Adf Parameters
- Index
Data-formats
Databricks
- Azure Databricks Uc Creation
- Databricks Uc Introduction
- Databricks Managed External Tables Hive
- Uc Managed External Tables
- Uc External Location Storage Credentials
- Databricks Managed Location Catalog Schema Level
- Ctas Deep Clone Shallow Clone Databricks
- Rbac Custom Roles Serviceprincipals
- Deletion Vectors Delta Lake
- Liquid Clustering Delta Lake
- Concurrency Liquid Clustering
- Copy Into Databricks
- Autoloader Databricks
- 12.1 Autoloader Databricks Schema Inference
- Intro Databricks Lakeflow Declarative Pipelines
- Dlt Batch Vs Streaming Workloads
- Dlt Data Storage Checkpoints
- Databricks Secret Scopes
- Databricks Controlplane Dataplane
- Databricks Dlt Code Walkthrough
- Databricks Serverless Compute
- Databricks Warehouses
- Databricks Lakehouse Federation
- Databricks Metrics Views
- Databricks Streaming Materialized Views Sql
- Databricks Cli Setup
- Index
Docs-deep-dive
Databricks
- What Is Lakehouse
- Lakehouse Vs Delta Lake Vs Warehouse
- All Delta Things Databricks
- High Level Architecture
- Databricks Acid Guarantees
- Databricks Medallion Architecture
- Databricks Single Source Of Truth Arch
- Databricks Scope Of Lakehouse Arch
- Databricks Architecture Guiding Principles
- Databricks Objects Catalogs
- Databricks Objects Volumes Tables
- Databricks Views
- Databricks Governed Tags
- Databricks Connecting To Cloud Object Storage Intro
- Databricks Managed Storage Location Hierarchy
- Databricks Service Credentials
- Databricks Connecting To Managed Ingestion Sources Intro
- Databricks Query Federation
- Image
Scenarios
Adf
Databricks
Kafka
- Why Closed Segments Files Open
- How Does Producer Guarantee Exactly Once
- Does Seq No Remain Same After Producer Goes Down
- What Happens When Reelection Happens
- Give Walkthrough Of Leader Epoch Log Truncation
- Explain Diff Dirtyratio Dirtybackgroundratio
- Are Kafka Consumers Thread Safe
- Is Retention Ms Defined Partition Level
- Difference Btwn Sticky Cooperative Sticky Assignor
- How Does Kafka Ensure Partial Idempotence
- How Does Kafka Know Which Messages Are Processed Not Just Read
- Index
Spark
Spark
- Spark Architecture Yarn
- Spark Driver Oom
- Types Of Memory Spark
- Spark Dynamic Partition Pruning
- Spark Salting Technique
- What Is Spark
- Why Apache Spark
- Hadoop Vs Spark
- Spark Ecosystem
- Spark Architecture
- Schema In Spark
- Handling Corrupt Records Spark
- Spark Transformations Actions
- Spark Dag Lazy Eval
- Spark Json Data
- Spark Sql Engine
- Spark Rdd
- Spark Writing Data Disk
- Spark Partitioning Bucketing
- Spark Session Vs Context
- Spark Job Stage Task
- Spark Transformations
- Spark Union Vs Unionall
- Spark Repartition Vs Coalesce
- Spark Case When
- Spark Unique Sorted Records
- Spark Agg Functions
- Spark Group By
- Spark Joins Intro
- Spark Join Strategies
- Spark Window Functions
- Spark Memory Management
- Spark Executor Oom
- Spark Submit Command
- Spark Deployment Modes
- Spark Adaptive Query Execution
- Spark Dynamic Resource Allocation
- Spark Dynamic Partition Pruning
- Spark Executor Tuning
- Index
Streaming
Architecture
Flink
- Introduction To Flink
- Jobmanager In Flink
- Taskmanager In Flink
- Slots Vs Cores Flink Tm
- Operator Chaining Flink
- Stateful Streaming Flink Pt1
- Processing Vs Event Time Flink
- Checkpoints Savepoints Flink
- Stateful Upgrades Flink
Kafka
- Kafka Kraft Setup
- Kafka Broker Properties
- Topic Default Properties
- Kafka Hardware Considerations
- Kafka Configuring Clusters Broker Consideration
- Kafka Broker Os Tuning
- Kafka Os Tuning Dirty Page Handling
- Kafka File Descriptors Overcommit Memory
- Kafka Production Concerns
- Kafka Message Types
- Kafka Configuring Producers Pt1
- Kafka Configuring Producers Pt2
- Kafka Serializers Avro Pt1
- Kafka Serializers Avro Pt2
- Kafka Partitions
- Kafka Headers
- Kafka Interceptors
- Kafka Quotas And Throttling
- Kafka Consumer Eager And Cooperative Rebalance
- Kafka Consumer Static Partitioning
- Kafka Poll Loop
- Kafka Configuring Consumers Pt1
- Kafka Configuring Consumers Pt2
- Kafka Partition Assignment Strategies
- Kafka Commits Offsets Intro
- Kafka Types Of Commits
- Kafka Rebalance Listeners
- Kafka Consuming Records With Spec Offset
- Kafka Exiting Consumer Poll Loop
- Kafka Deserialisers
- Kafka Standalone Consumers
- Kafka Internals Zookeeper
- Kafka Raft Consensus Protocol
- Kafka Controller Quorum
- Kafka Replication Concepts
- Kafka Insync Outofsync Replicas
- Kafka Request Processing Pt1
- Kafka Request Processing Pt2 Produce Requests
- Kafka Fetch Requests Pt1
- Kafka Fetch Requests Pt2
- Kafka Physical Storage Introduction
- Kafka Tiered Storage
- Kafka Partition Allocation
- Kafka File Formats Intro
- Kafka Message Batch Headers
- Kafka Indexes
- Kafka Compaction
- Kafka Tombstoning Records
- Kafka Reliability Guarantees
- Kafka Replication Procedures
- Kafka Broker Config Replication Factor
- Kafka Broker Configuration Unclean Leader Election
- Kafka Log Truncation On Out Of Sync Leader
- Kafka Keeping Replicas In Sync
- Kafka Using Producers Reliable System Scenarios
- Kafka Producer Retries Additional Error Handling
- Kafka Using Consumers In Reliable System Intro
- Kafka Important Consumer Properties Intro
- Kafka Consumer Properties Pt2
- Kafka Explicitly Committing Offsets Pt1
- Kafka Explicitly Committing Offsets Pt2
- Kafka Validating Configuration
- Kafka Monitoring In Production
- Index