Data Engineering and Cloud Platform Knowledge Base

This repository is where I keep track of what I'm learning and experimenting with across data engineering, distributed systems, and cloud platforms. It's part study notes, part code lab, and part reference guide: a place to capture concepts as I work through them and revisit them later when I need a refresher.

I focus on technologies like Apache Spark, Kafka/Redpanda, Databricks, AWS, and Azure, along with supporting tools and practices that are important for building reliable data systems. You'll find a mix of summaries, deep dives into tricky concepts, performance tuning notes, and small experiments that test how things work in practice.

The main goal of this repo is to strengthen my own understanding, but I also hope it can be useful to anyone else navigating similar topics. Think of it as a learning log that balances hands-on exploration with professional best practices - something between a personal notebook and a practical guide.

📑 Data Engineering Knowledge Base

Airflow

  1. What Is Airflow
  2. Airflow Vs Databricks Jobs
  3. Building Blocks Airflow
  4. Airflow Arch Metadata Db
  5. Lifecycle Of Dag Run
  6. Airflow Scheduler Deepdive
  7. Index

Azure

  1. Azure Integration Databricks
  2. Azure Portal Subscriptions Resourcegroups
  3. Azure Cli Scenarios
  4. Azure Powershell Scenarios
  5. Azure Arm Templates
  6. Azure Bicep Templates
  7. Overview Of Azure Storage
  8. Blob Storage Fundamentals
  9. Adls Gen2 Overview
  10. Azure Rbac Acl
  11. Azure Types Of Storage
  12. Azure Storage Replication Strategies
  13. Soft Delete Pitr Azure Storage
  14. Azure Shared Access Signature
  15. Azure Lifecycle Management Policies
  16. Eventgrid Integration Azure
  17. Azure Encryption Standards
  18. Azure Private Endpoints
  19. Cross Region Replication Azure
  20. Azure Storage Rest Api
  21. Introduction Azure Data Factory
  22. Azure Data Factory Vs Synapse
  23. Azure Data Factory Architecture
  24. Adf Triggers Intro
  25. Adf Parameters
  26. Index

Data-formats

  1. Data Format Deep Dive Pt1
  2. Parquet Format Internals

Databricks

  1. Azure Databricks Uc Creation
  2. Databricks Uc Introduction
  3. Databricks Managed External Tables Hive
  4. Uc Managed External Tables
  5. Uc External Location Storage Credentials
  6. Databricks Managed Location Catalog Schema Level
  7. Ctas Deep Clone Shallow Clone Databricks
  8. Rbac Custom Roles Serviceprincipals
  9. Deletion Vectors Delta Lake
  10. Liquid Clustering Delta Lake
  11. Concurrency Liquid Clustering
  12. Copy Into Databricks
  13. Autoloader Databricks
  14. 12.1 Autoloader Databricks Schema Inference
  15. Intro Databricks Lakeflow Declarative Pipelines
  16. Dlt Batch Vs Streaming Workloads
  17. Dlt Data Storage Checkpoints
  18. Databricks Secret Scopes
  19. Databricks Controlplane Dataplane
  20. Databricks Dlt Code Walkthrough
  21. Databricks Serverless Compute
  22. Databricks Warehouses
  23. Databricks Lakehouse Federation
  24. Databricks Metrics Views
  25. Databricks Streaming Materialized Views Sql
  26. Databricks Cli Setup
  27. Index

Docs-deep-dive

Databricks

  1. What Is Lakehouse
  2. Lakehouse Vs Delta Lake Vs Warehouse
  3. All Delta Things Databricks
  4. High Level Architecture
  5. Databricks Acid Guarantees
  6. Databricks Medallion Architecture
  7. Databricks Single Source Of Truth Arch
  8. Databricks Scope Of Lakehouse Arch
  9. Databricks Architecture Guiding Principles
  10. Databricks Objects Catalogs
  11. Databricks Objects Volumes Tables
  12. Databricks Views
  13. Databricks Governed Tags
  14. Databricks Connecting To Cloud Object Storage Intro
  15. Databricks Managed Storage Location Hierarchy
  16. Databricks Service Credentials
  17. Databricks Connecting To Managed Ingestion Sources Intro
  18. Databricks Query Federation
  19. Image

Scenarios

  1. Index

Adf

  1. Architectures

Databricks

  1. Aws Reference Arch Databricks
  2. Reference Architectures Pt1
  3. Index

Kafka

  1. Why Closed Segments Files Open
  2. How Does Producer Guarantee Exactly Once
  3. Does Seq No Remain Same After Producer Goes Down
  4. What Happens When Reelection Happens
  5. Give Walkthrough Of Leader Epoch Log Truncation
  6. Explain Diff Dirtyratio Dirtybackgroundratio
  7. Are Kafka Consumers Thread Safe
  8. Is Retention Ms Defined Partition Level
  9. Difference Btwn Sticky Cooperative Sticky Assignor
  10. How Does Kafka Ensure Partial Idempotence
  11. How Does Kafka Know Which Messages Are Processed Not Just Read
  12. Index

Spark

  1. Smj Spill To Disk Q1
  2. Smj Spill To Disk Q2
  3. Smj Output During Spill Q3
  4. Cross Vs Broadcast Join
  5. Index

Spark

  1. Spark Architecture Yarn
  2. Spark Driver Oom
  3. Types Of Memory Spark
  4. Spark Dynamic Partition Pruning
  5. Spark Salting Technique
  6. What Is Spark
  7. Why Apache Spark
  8. Hadoop Vs Spark
  9. Spark Ecosystem
  10. Spark Ecosystem
  11. Spark Architecture
  12. Schema In Spark
  13. Handling Corrupt Records Spark
  14. Spark Transformations Actions
  15. Spark Dag Lazy Eval
  16. Spark Json Data
  17. Spark Sql Engine
  18. Spark Rdd
  19. Spark Writing Data Disk
  20. Spark Partitioning Bucketing
  21. Spark Session Vs Context
  22. Spark Job Stage Task
  23. Spark Transformations
  24. Spark Union Vs Unionall
  25. Spark Repartition Vs Coalesce
  26. Spark Case When
  27. Spark Unique Sorted Records
  28. Spark Agg Functions
  29. Spark Group By
  30. Spark Joins Intro
  31. Spark Join Strategies
  32. Spark Window Functions
  33. Spark Memory Management
  34. Spark Executor Oom
  35. Spark Submit Command
  36. Spark Deployment Modes
  37. Spark Adaptive Query Execution
  38. Spark Dynamic Resource Allocation
  39. Spark Dynamic Partition Pruning
  40. Spark Executor Tuning
  41. Index

Streaming

  1. Index

Architecture

  1. Use Cases Streaming
  2. Redpanda Vs Kafka Arch Differences
  3. Redpanda Architecture In Depth Pt1
  4. Index

Flink

  1. Introduction To Flink
  2. Jobmanager In Flink
  3. Taskmanager In Flink
  4. Slots Vs Cores Flink Tm
  5. Operator Chaining Flink
  6. Stateful Streaming Flink Pt1
  7. Processing Vs Event Time Flink
  8. Checkpoints Savepoints Flink
  9. Stateful Upgrades Flink

Kafka

  1. Kafka Kraft Setup
  2. Kafka Broker Properties
  3. Topic Default Properties
  4. Kafka Hardware Considerations
  5. Kafka Configuring Clusters Broker Consideration
  6. Kafka Broker Os Tuning
  7. Kafka Os Tuning Dirty Page Handling
  8. Kafka File Descriptors Overcommit Memory
  9. Kafka Production Concerns
  10. Kafka Message Types
  11. Kafka Configuring Producers Pt1
  12. Kafka Configuring Producers Pt2
  13. Kafka Serializers Avro Pt1
  14. Kafka Serializers Avro Pt2
  15. Kafka Partitions
  16. Kafka Headers
  17. Kafka Interceptors
  18. Kafka Quotas And Throttling
  19. Kafka Consumer Eager And Cooperative Rebalance
  20. Kafka Consumer Static Partitioning
  21. Kafka Poll Loop
  22. Kafka Configuring Consumers Pt1
  23. Kafka Configuring Consumers Pt2
  24. Kafka Partition Assignment Strategies
  25. Kafka Commits Offsets Intro
  26. Kafka Types Of Commits
  27. Kafka Rebalance Listeners
  28. Kafka Consuming Records With Spec Offset
  29. Kafka Exiting Consumer Poll Loop
  30. Kafka Deserialisers
  31. Kafka Standalone Consumers
  32. Kafka Internals Zookeeper
  33. Kafka Raft Consensus Protocol
  34. Kafka Controller Quorum
  35. Kafka Replication Concepts
  36. Kafka Insync Outofsync Replicas
  37. Kafka Request Processing Pt1
  38. Kafka Request Processing Pt2 Produce Requests
  39. Kafka Fetch Requests Pt1
  40. Kafka Fetch Requests Pt2
  41. Kafka Physical Storage Introduction
  42. Kafka Tiered Storage
  43. Kafka Partition Allocation
  44. Kafka File Formats Intro
  45. Kafka Message Batch Headers
  46. Kafka Indexes
  47. Kafka Compaction
  48. Kafka Tombstoning Records
  49. Kafka Reliability Guarantees
  50. Kafka Replication Procedures
  51. Kafka Broker Config Replication Factor
  52. Kafka Broker Configuration Unclean Leader Election
  53. Kafka Log Truncation On Out Of Sync Leader
  54. Kafka Keeping Replicas In Sync
  55. Kafka Using Producers Reliable System Scenarios
  56. Kafka Producer Retries Additional Error Handling
  57. Kafka Using Consumers In Reliable System Intro
  58. Kafka Important Consumer Properties Intro
  59. Kafka Consumer Properties Pt2
  60. Kafka Explicitly Committing Offsets Pt1
  61. Kafka Explicitly Committing Offsets Pt2
  62. Kafka Validating Configuration
  63. Kafka Monitoring In Production
  64. Index