Data Engineering and Cloud Platform Knowledge Base
This repository is where I keep track of what I'm learning and experimenting with across data engineering, distributed systems, and cloud platforms. It's part study notes, part code lab, and part reference guide: a place to capture concepts as I work through them and to revisit them later when I need a refresher.
I focus on technologies like Apache Spark, Kafka/Redpanda, Databricks, AWS, and Azure, along with supporting tools and practices that are important for building reliable data systems. You'll find a mix of summaries, deep dives into tricky concepts, performance tuning notes, and small experiments that test how things work in practice.
The main goal of this repo is to strengthen my own understanding, but I also hope it can be useful to anyone else navigating similar topics. Think of it as a learning log that balances hands-on exploration with professional best practices - something between a personal notebook and a practical guide.
Data Engineering Knowledge Base
Azure
- Azure Integration Databricks
- Azure Portal Subscriptions Resource Groups
- Azure Cli Scenarios
- Azure Powershell Scenarios
- Azure Arm Templates
- Azure Bicep Templates
- Overview Of Azure Storage
- Blob Storage Fundamentals
- Adls Gen2 Overview
- Azure Rbac Acl
- Azure Types Of Storage
- Azure Storage Replication Strategies
- Soft Delete Pitr Azure Storage
- Azure Shared Access Signature
- Azure Lifetime Management Policies
- Eventgrid Integration Azure
- Azure Encryption Standards
- Azure Private Endpoints
- Cross Region Replication Azure
- Azure Storage Rest Api
- Introduction Azure Data Factory
- Azure Data Factory Vs Synapse
- Azure Data Factory Architecture
- Adf Triggers Intro
- Adf Parameters
- Index
Data-formats
Databricks
- Azure Databricks Uc Creation
- Databricks Uc Introduction
- Databricks Managed External Tables Hive
- Uc Managed External Tables
- Uc External Location Storage Credentials
- Databricks Managed Location Catalog Schema Level
- Ctas Deep Clone Shallow Clone Databricks
- Rbac Custom Roles Service Principals
- Deletion Vectors Delta Lake
- Liquid Clustering Delta Lake
- Concurrency Liquid Clustering
- Copy Into Databricks
- Autoloader Databricks
- Autoloader Databricks Schema Inference
- Intro Databricks Lakeflow Declarative Pipelines
- Dlt Batch Vs Streaming Workloads
- Dlt Data Storage Checkpoints
- Databricks Secret Scopes
- Databricks Controlplane Dataplane
- Databricks Dlt Code Walkthrough
- Databricks Serverless Compute
- Databricks Warehouses
- Databricks Lakehouse Federation
- Databricks Metrics Views
- Databricks Streaming Materialized Views Sql
- Databricks Cli Setup
- Index
Docs-deep-dive
Databricks
- What Is Lakehouse
- Lakehouse Vs Delta Lake Vs Warehouse
- All Delta Things Databricks
- High Level Architecture
- Databricks Acid Guarantees
- Databricks Medallion Architecture
- Databricks Single Source Of Truth Arch
- Databricks Scope Of Lakehouse Arch
- Databricks Architecture Guiding Principles
Scenarios
Databricks
Spark
Spark
- Spark Architecture Yarn
- Spark Driver Oom
- Types Of Memory Spark
- Spark Dynamic Partition Pruning
- Spark Salting Technique
- What Is Spark
- Why Apache Spark
- Hadoop Vs Spark
- Spark Ecosystem
- Spark Architecture
- Schema In Spark
- Handling Corrupt Records Spark
- Spark Transformations Actions
- Spark Dag Lazy Eval
- Spark Json Data
- Spark Sql Engine
- Spark Rdd
- Spark Writing Data Disk
- Spark Partitioning Bucketing
- Spark Session Vs Context
- Spark Job Stage Task
- Spark Transformations
- Spark Union Vs Unionall
- Spark Repartition Vs Coalesce
- Spark Case When
- Spark Unique Sorted Records
- Spark Agg Functions
- Spark Group By
- Spark Joins Intro
- Spark Join Strategies
- Spark Window Functions
- Spark Memory Management
- Spark Executor Oom
- Spark Submit Command
- Spark Deployment Modes
- Spark Adaptive Query Execution
- Spark Dynamic Resource Allocation
- Spark Dynamic Partition Pruning
- Index
Streaming
Architecture
Kafka
- Kafka Kraft Setup
- Kafka Broker Properties
- Topic Default Properties
- Kafka Hardware Considerations
- Kafka Configuring Clusters Broker Consideration
- Kafka Broker Os Tuning
- Kafka Os Tuning Dirty Page Handling
- Kafka File Descriptors Overcommit Memory
- Kafka Production Concerns
- Kafka Message Types
- Kafka Configuring Producers Pt1
- Kafka Configuring Producers Pt2
- Kafka Serializers Avro Pt1
- Kafka Serializers Avro Pt2
- Kafka Partitions
- Kafka Headers
- Kafka Interceptors
- Kafka Quotas And Throttling
- Kafka Consumer Eager And Cooperative Rebalance
- Kafka Consumer Static Partitioning
- Kafka Poll Loop
- Kafka Configuring Consumers Pt1
- Kafka Configuring Consumers Pt2
- Kafka Partition Assignment Strategies
- Kafka Commits Offsets Intro
- Kafka Types Of Commits
- Index