Data Engineering and Cloud Platform Knowledge Base

This repository is where I keep track of what I'm learning and experimenting with across data engineering, distributed systems, and cloud platforms. It's part study notes, part code lab, and part reference guide: a place to capture concepts as I work through them and to revisit them later when I need a refresher.

I focus on technologies like Apache Spark, Kafka/Redpanda, Databricks, AWS, and Azure, along with supporting tools and practices that are important for building reliable data systems. You'll find a mix of summaries, deep dives into tricky concepts, performance tuning notes, and small experiments that test how things work in practice.

The main goal of this repo is to strengthen my own understanding, but I also hope it can be useful to anyone else navigating similar topics. Think of it as a learning log that balances hands-on exploration with professional best practices - something between a personal notebook and a practical guide.

📑 Data Engineering Knowledge Base

Azure

  1. Azure Integration Databricks
  2. Azure Portal Subscriptions Resourcegroups
  3. Azure Cli Scenarios
  4. Azure Powershell Scenarios
  5. Azure Arm Templates
  6. Azure Bicep Templates
  7. Overview Of Azure Storage
  8. Blob Storage Fundamentals
  9. Adls Gen2 Overview
  10. Azure Rbac Acl
  11. Azure Types Of Storage
  12. Azure Storage Replication Strategies
  13. Soft Delete Pitr Azure Storage
  14. Azure Shared Access Signature
  15. Azure Lifetime Management Policies
  16. Eventgrid Integration Azure
  17. Azure Encryption Standards
  18. Azure Private Endpoints
  19. Cross Region Replication Azure
  20. Azure Storage Rest Api
  21. Introduction Azure Data Factory
  22. Azure Data Factory Vs Synapse
  23. Azure Data Factory Architecture
  24. Adf Triggers Intro
  25. Adf Parameters
  26. Index

Data-formats

  1. Data Format Deep Dive Pt1
  2. Parquet Format Internals

Databricks

  1. Azure Databricks Uc Creation
  2. Databricks Uc Introduction
  3. Databricks Managed External Tables Hive
  4. Uc Managed External Tables
  5. Uc External Location Storage Credentials
  6. Databricks Managed Location Catalog Schema Level
  7. Ctas Deep Clone Shallow Clone Databricks
  8. Rbac Custom Roles Serviceprincipals
  9. Deletion Vectors Delta Lake
  10. Liquid Clustering Delta Lake
  11. Concurrency Liquid Clustering
  12. Copy Into Databricks
  13. Autoloader Databricks
  14. 12.1 Autoloader Databricks Schema Inference
  15. Intro Databricks Lakeflow Declarative Pipelines
  16. Dlt Batch Vs Streaming Workloads
  17. Dlt Data Storage Checkpoints
  18. Databricks Secret Scopes
  19. Databricks Controlplane Dataplane
  20. Databricks Dlt Code Walkthrough
  21. Databricks Serverless Compute
  22. Databricks Warehouses
  23. Databricks Lakehouse Federation
  24. Databricks Metrics Views
  25. Databricks Streaming Materialized Views Sql
  26. Databricks Cli Setup
  27. Index

Docs-deep-dive

Databricks

  1. What Is Lakehouse
  2. Lakehouse Vs Delta Lake Vs Warehouse
  3. All Delta Things Databricks
  4. High Level Architecture
  5. Databricks Acid Guarantees
  6. Databricks Medallion Architecture
  7. Databricks Single Source Of Truth Arch
  8. Databricks Scope Of Lakehouse Arch
  9. Databricks Architecture Guiding Principles

Scenarios

  1. Index

Databricks

  1. Aws Reference Arch Databricks
  2. Reference Architectures Pt1
  3. Index

Spark

  1. Smj Spill To Disk Q1
  2. Smj Spill To Disk Q2
  3. Smj Output During Spill Q3
  4. Cross Vs Broadcast Join
  5. Index

Spark

  1. Spark Architecture Yarn
  2. Spark Driver Oom
  3. Types Of Memory Spark
  4. Spark Dynamic Partition Pruning
  5. Spark Salting Technique
  6. What Is Spark
  7. Why Apache Spark
  8. Hadoop Vs Spark
  9. Spark Ecosystem
  10. Spark Ecosystem
  11. Spark Architecture
  12. Schema In Spark
  13. Handling Corrupt Records Spark
  14. Spark Transformations Actions
  15. Spark Dag Lazy Eval
  16. Spark Json Data
  17. Spark Sql Engine
  18. Spark Rdd
  19. Spark Writing Data Disk
  20. Spark Partitioning Bucketing
  21. Spark Session Vs Context
  22. Spark Job Stage Task
  23. Spark Transformations
  24. Spark Union Vs Unionall
  25. Spark Repartition Vs Coalesce
  26. Spark Case When
  27. Spark Unique Sorted Records
  28. Spark Agg Functions
  29. Spark Group By
  30. Spark Joins Intro
  31. Spark Join Strategies
  32. Spark Window Functions
  33. Spark Memory Management
  34. Spark Executor Oom
  35. Spark Submit Command
  36. Spark Deployment Modes
  37. Spark Adaptive Query Execution
  38. Spark Dynamic Resource Allocation
  39. Spark Dynamic Partition Pruning
  40. Index

Streaming

  1. Index

Architecture

  1. Use Cases Streaming
  2. Redpanda Vs Kafka Arch Differences
  3. Redpanda Architecture In Depth Pt1
  4. Index

Kafka

  1. Kafka Kraft Setup
  2. Kafka Broker Properties
  3. Topic Default Properties
  4. Kafka Hardware Considerations
  5. Kafka Configuring Clusters Broker Consideration
  6. Kafka Broker Os Tuning
  7. Kafka Os Tuning Dirty Page Handling
  8. Kafka File Descriptors Overcommit Memory
  9. Kafka Production Concerns
  10. Kafka Message Types
  11. Kafka Configuring Producers Pt1
  12. Kafka Configuring Producers Pt2
  13. Kafka Serializers Avro Pt1
  14. Kafka Serializers Avro Pt2
  15. Kafka Partitions
  16. Kafka Headers
  17. Kafka Interceptors
  18. Kafka Quotas And Throttling
  19. Kafka Consumer Eager And Cooperative Rebalance
  20. Kafka Consumer Static Partitioning
  21. Kafka Poll Loop
  22. Kafka Configuring Consumers Pt1
  23. Kafka Configuring Consumers Pt2
  24. Kafka Partition Assignment Strategies
  25. Kafka Commits Offsets Intro
  26. Kafka Types Of Commits
  27. Index