Kafka Controller Quorum Concepts
Kafka Controller Architecture
1. Background: ZooKeeper’s role in the existing architecture
In the original Kafka architecture, ZooKeeper played two key roles:
- Controller election – ZooKeeper was used to elect which Kafka broker would act as the controller (sketched in the example after this list).
  - Each broker tried to create an ephemeral znode (say, /controller).
  - The broker that successfully created it became the controller.
  - If the controller failed, ZooKeeper automatically deleted its ephemeral znode, triggering a new election.
- Metadata storage – ZooKeeper stored all cluster-level metadata, including:
  - Registered brokers and their endpoints.
  - Topics and their configurations.
  - Partition assignments (which broker holds which replica).
  - ISR (in-sync replicas) sets.
  - Topic-level and broker-level configuration overrides.
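To make the election mechanism concrete, here is a minimal Python sketch of ephemeral-znode election using the kazoo ZooKeeper client. The znode payload and broker ID below are purely illustrative; Kafka's real election stores a richer JSON payload and has more careful retry and epoch handling.

```python
# Minimal sketch of ZooKeeper-based controller election with the kazoo client.
# Paths and payloads are illustrative, not Kafka's exact on-wire format.
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

def try_become_controller(zk: KazooClient, broker_id: int) -> bool:
    """Attempt to become controller by creating the ephemeral /controller znode."""
    try:
        # Ephemeral: the znode disappears automatically if this session dies.
        zk.create("/controller", str(broker_id).encode(), ephemeral=True)
        return True          # we won the election
    except NodeExistsError:
        return False         # another broker is already the controller

zk = KazooClient(hosts="localhost:2181")
zk.start()

if not try_become_controller(zk, broker_id=7):
    # Brokers that lost watch the znode; its deletion triggers a re-election.
    @zk.DataWatch("/controller")
    def on_controller_change(data, stat):
        if stat is None:                      # znode deleted -> controller died
            try_become_controller(zk, broker_id=7)
```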
The controller, once elected, used ZooKeeper as the backing store while managing:
- Leader elections for partitions.
- Topic creation and deletion.
- Replica reassignment.
- Partition rebalancing.
This setup meant the controller and ZooKeeper were deeply intertwined — the controller executed cluster logic, but the durable state lived in ZooKeeper.
However, as Kafka clusters scaled, this model hit limitations: inconsistent state propagation, ZooKeeper bottlenecks, and long failover times.
2. The vision for the new controller architecture
The new controller design (introduced as part of the KRaft mode, short for Kafka Raft Metadata mode) aims to completely remove ZooKeeper from Kafka and bring metadata management inside Kafka itself.
The core idea behind the redesign is:
Kafka already uses a log-based architecture for user data. Why not represent cluster metadata the same way?
The log-based principle
In Kafka, a log is a sequence of ordered events. Each new event appends to the end, defining a single, total order of state changes. Consumers read from the log, replaying events to reconstruct state.
By applying this model to metadata:
- The cluster’s metadata becomes a stream of events (e.g., “Topic created,” “Partition 3 leader changed,” “Broker 7 joined”).
- Every controller and broker can replay this log to reach the exact same metadata state.
- This guarantees a consistent and ordered view of cluster metadata across all nodes.
This log-based metadata design aligns Kafka’s internal control plane with its existing data plane philosophy — state as an event stream.
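As a toy illustration of this principle, the Python sketch below folds an ordered event log into a metadata snapshot. The event shapes are made up and far simpler than Kafka's actual metadata records, but the key property holds: any node replaying the same log ends up in the same state.

```python
# Toy metadata event log (invented record shapes, for illustration only).
events = [
    {"type": "BrokerRegistered", "broker_id": 7},
    {"type": "TopicCreated", "topic": "orders", "partitions": 3},
    {"type": "LeaderChanged", "topic": "orders", "partition": 2, "leader": 7},
]

def replay(log):
    """Fold the event log into a metadata snapshot; same log => same state."""
    state = {"brokers": set(), "topics": {}, "leaders": {}}
    for event in log:
        if event["type"] == "BrokerRegistered":
            state["brokers"].add(event["broker_id"])
        elif event["type"] == "TopicCreated":
            state["topics"][event["topic"]] = event["partitions"]
        elif event["type"] == "LeaderChanged":
            state["leaders"][(event["topic"], event["partition"])] = event["leader"]
    return state

print(replay(events))
# {'brokers': {7}, 'topics': {'orders': 3}, 'leaders': {('orders', 2): 7}}
```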
3. Raft-based controller quorum: the foundation
In the new architecture, controller nodes form a Raft quorum. Raft is a consensus algorithm designed for managing replicated logs safely across distributed nodes.
Here’s how it works conceptually in Kafka’s controller quorum:
a. Metadata log
- The metadata log is a Kafka-internal, Raft-replicated log that records every metadata change event:
  - Topic creation/deletion
  - Partition addition
  - ISR changes
  - Broker registrations
  - Configuration changes
  - Leader elections
- Every change that was once written to ZooKeeper is now appended as an event to this metadata log.
b. Raft roles
Within the Raft quorum, controller nodes can play three roles:
- Leader (active controller) – the single node that handles all write operations and coordinates the cluster.
- Followers (standby controllers) – replicate the metadata log from the leader and maintain an up-to-date copy.
- Candidate – a transient role during elections when a leader fails or steps down.
c. Leader election without ZooKeeper
- Raft provides internal leader election.
- Controllers elect one among themselves as the active controller (the Raft leader).
- No external system (like ZooKeeper) is needed — Raft’s consensus mechanism ensures only one leader is active at a time.
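The voting rule at the heart of this election can be sketched in a few lines of Python. This is a deliberately simplified model: real Raft also compares log completeness, uses randomized election timeouts, has candidates vote for themselves, and handles term bumps and RPC failures, none of which appear here.

```python
# Simplified sketch of Raft-style voting among controllers.
from typing import Dict

class Voter:
    def __init__(self) -> None:
        self.voted_for: Dict[int, int] = {}   # term -> candidate id

    def request_vote(self, term: int, candidate_id: int) -> bool:
        # Grant at most one vote per term; this is what prevents two
        # leaders from being elected in the same term.
        if term not in self.voted_for:
            self.voted_for[term] = candidate_id
        return self.voted_for[term] == candidate_id

def run_election(candidate_id: int, term: int, voters: list) -> bool:
    votes = sum(v.request_vote(term, candidate_id) for v in voters)
    return votes > len(voters) // 2           # majority wins

quorum = [Voter(), Voter(), Voter()]          # a 3-node controller quorum
print(run_election(candidate_id=1, term=5, voters=quorum))  # True
print(run_election(candidate_id=2, term=5, voters=quorum))  # False: votes for term 5 already cast
```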
d. Replication and durability
- Every metadata change is appended to the metadata log on the leader.
- The leader replicates it to the follower controllers.
- Once a majority (quorum) of controllers acknowledge the entry, it’s committed.
- This guarantees:
  - Consistency (all controllers eventually converge on the same log).
  - Durability (metadata changes survive crashes, as they are replicated).
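The commit rule itself is simple majority arithmetic. The sketch below uses made-up offsets and a three-node quorum to show when an entry counts as committed.

```python
# Toy sketch of the quorum commit rule (offsets and quorum size are illustrative).
def is_committed(entry_offset, follower_offsets, quorum_size):
    """An entry is committed once a majority of the quorum (leader included)
    has replicated up to and including its offset."""
    # The leader always has the entry locally, so it counts as one ack.
    acks = 1 + sum(1 for off in follower_offsets if off >= entry_offset)
    return acks > quorum_size // 2

# 3-node quorum: leader plus followers that have replicated up to offsets 41 and 37.
print(is_committed(40, follower_offsets=[41, 37], quorum_size=3))  # True  (2 of 3 acks)
print(is_committed(42, follower_offsets=[41, 37], quorum_size=3))  # False (only the leader has it)
```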
4. The “active controller”
The Raft leader — called the active controller — performs the same logical duties the old ZooKeeper-based controller did, but now:
- It writes all changes to the metadata log.
- It handles RPC requests from brokers for:
  - Topic creation/deletion.
  - Partition reassignment.
  - Leadership updates.
  - Configuration updates.
Because every metadata event is replicated across the Raft quorum, followers always stay hot — they continuously replay and apply updates from the leader.
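A hypothetical, heavily simplified handler for a topic-creation request shows this ordering: the active controller appends an event to the metadata log, waits for the quorum to commit it, and only then answers. Names like append_to_metadata_log and wait_until_committed are illustrative stand-ins, not Kafka's internal APIs.

```python
metadata_log = []          # stands in for the Raft-replicated metadata log
committed_offset = -1      # highest offset known to be replicated to a majority

def append_to_metadata_log(event: dict) -> int:
    metadata_log.append(event)
    return len(metadata_log) - 1

def wait_until_committed(offset: int) -> None:
    global committed_offset
    # In the real system this waits for follower acknowledgements;
    # here we simply simulate the commit advancing.
    committed_offset = max(committed_offset, offset)

def handle_create_topic(name: str, partitions: int) -> dict:
    # 1. The active controller turns the request into a metadata event.
    offset = append_to_metadata_log(
        {"type": "TopicCreated", "topic": name, "partitions": partitions}
    )
    # 2. It answers only after the quorum has committed the event, so the
    #    standby controllers have already replicated the change.
    wait_until_committed(offset)
    return {"topic": name, "metadata_offset": offset}

print(handle_create_topic("payments", partitions=6))
```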
5. Fast failover and recovery
In the old architecture:
- When a controller failed, a new one had to read all metadata from ZooKeeper — a slow process for large clusters.
- Then it had to propagate this state to all brokers, which also took time.
In the new Raft-based design:
- All follower controllers already have the latest metadata log replicated locally.
- When the active controller fails:
  - Raft automatically elects a new leader (within milliseconds).
  - The new leader already has the entire metadata state (no full reload needed).
  - It immediately resumes handling cluster operations.
This eliminates the lengthy reloading period and drastically improves cluster availability.
6. Metadata distribution to brokers (MetadataFetch API)
Old method: Push-based updates
- The old controller “pushed” metadata updates to brokers via direct RPCs, after learning of changes through ZooKeeper watches.
- This push model was asynchronous and could result in inconsistent or delayed state across brokers.
New method: Pull-based updates
- In the new design, brokers fetch metadata from the active controller.
- Kafka introduces a MetadataFetch API, similar to how brokers fetch messages from topic partitions.
How it works:
1. Each broker maintains a metadata offset – the position in the metadata log it has processed.
2. Brokers periodically issue a MetadataFetchRequest to the active controller, asking: “Give me any metadata updates after offset X.”
3. The controller replies with the delta – only the new metadata events since offset X.
4. The broker applies these updates and advances its local metadata offset.
This mechanism:
- Guarantees brokers stay in sync with the controller.
- Reduces unnecessary data transfer (only incremental updates are sent).
- Simplifies consistency — every broker’s metadata state corresponds to a specific log offset.
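The following Python sketch models that pull loop; fetch_metadata_delta is a stand-in for the MetadataFetch RPC, not a real Kafka API.

```python
def fetch_metadata_delta(controller_log: list, after_offset: int) -> list:
    """Controller side: return only the events after the broker's offset."""
    return controller_log[after_offset + 1 :]

class Broker:
    def __init__(self) -> None:
        self.metadata_offset = -1          # last metadata log offset applied
        self.state: list = []              # locally applied metadata events

    def poll_metadata(self, controller_log: list) -> None:
        delta = fetch_metadata_delta(controller_log, self.metadata_offset)
        for event in delta:                # apply incrementally, in log order
            self.state.append(event)
            self.metadata_offset += 1

controller_log = [{"type": "TopicCreated", "topic": "orders"},
                  {"type": "LeaderChanged", "topic": "orders", "partition": 0}]
broker = Broker()
broker.poll_metadata(controller_log)       # first poll: fetches both events
controller_log.append({"type": "BrokerRegistered", "broker_id": 9})
broker.poll_metadata(controller_log)       # second poll: fetches only the new event
print(broker.metadata_offset)              # 2
```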
7. Broker persistence and faster startup
In the new design:
- Brokers persist metadata to their local disk (not just hold it in memory).
- This means that if a broker restarts:
  - It already has most of the metadata up to its last fetched offset.
  - It simply asks the active controller for new updates since that offset.
This enables very fast broker startup, even in clusters with millions of partitions — a major improvement over the previous need to reload and rebuild all metadata state from ZooKeeper or the controller each time.
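A rough sketch of what this looks like: the broker persists a snapshot of its applied metadata together with the last applied offset, and on restart it resumes fetching from that offset instead of rebuilding from scratch. The file name and JSON layout are assumptions for illustration; Kafka's actual on-disk metadata format differs.

```python
import json
from pathlib import Path

CACHE = Path("metadata-cache.json")        # hypothetical local metadata cache

def save_cache(offset: int, state: dict) -> None:
    CACHE.write_text(json.dumps({"offset": offset, "state": state}))

def load_cache() -> tuple[int, dict]:
    if CACHE.exists():
        cached = json.loads(CACHE.read_text())
        return cached["offset"], cached["state"]
    return -1, {}                          # cold start: replay everything

# On shutdown (or periodically), persist what has been applied so far.
save_cache(offset=1_204_567, state={"topics": 42})

# On restart, resume from the cached offset; only events after this offset
# need to be fetched from the active controller.
offset, state = load_cache()
print(f"resuming metadata fetch after offset {offset}")
```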
8. Summary of improvements
| Area | Old ZooKeeper-Based Controller | New Raft-Based Controller (KRaft) |
|---|---|---|
| Metadata storage | In ZooKeeper (znodes) | In Kafka’s internal Raft metadata log |
| Controller election | ZooKeeper ephemeral znode | Raft consensus within controller quorum |
| Controller failover | Slow — full reload from ZooKeeper | Fast — followers already replicated metadata |
| State consistency | Eventually consistent; updates propagated asynchronously via ZooKeeper and controller pushes | Strongly consistent via the replicated metadata log |
| Metadata propagation | Controller pushes updates | Brokers fetch updates via MetadataFetch API |
| Broker startup | Requires metadata reload | Quick — brokers replay local metadata log |
| Operational complexity | Kafka + ZooKeeper to manage | Kafka only (ZooKeeper removed) |
| Scalability | Limited by ZooKeeper’s structure | Scales with Kafka’s native log architecture |
9. Architectural elegance and conceptual unification
The new controller design has a conceptual elegance: Kafka now uses the same fundamental abstraction — an ordered, replicated log — for everything.
- User data is stored as event logs in Kafka topics.
- System metadata is stored as an event log in the Raft-based metadata log.
Both are:
- Append-only
- Ordered
- Replayable
- Consistent
- Replicated across multiple nodes
This unification simplifies reasoning, improves fault tolerance, and leverages Kafka’s own proven architectural strengths (log replication, offset tracking, incremental consumption).
10. Operational impact
For cluster operators:
- There’s no ZooKeeper to manage anymore — simplifying deployments and reducing moving parts.
- Controller failovers are now typically sub-second, so metadata operations see only a brief interruption.
- Kafka becomes a self-contained distributed system, which simplifies monitoring, tuning, and upgrades.
For developers and clients:
- Metadata is always consistent across brokers.
- Leadership changes propagate predictably and quickly.
- Administrative operations (topic creation, reconfiguration) are more reliable and faster.
11. The big picture — Kafka’s evolution to a single self-governing distributed system
The shift to a Raft-based controller quorum represents the final step in Kafka’s long-term plan to:
- Eliminate external dependencies (ZooKeeper).
- Achieve linear scalability in both data and metadata planes.
- Provide predictable performance for massive clusters (millions of partitions).
- Simplify the mental and operational model for users.
In short, the new controller makes Kafka a fully self-contained distributed system with:
- Internal consensus (Raft)
- Unified log abstraction
- Fast failover
- Deterministic metadata propagation
- Simpler, more reliable operation at scale
TL;DR
The Kafka community replaced the old ZooKeeper-based controller with a new Raft-based controller quorum (KRaft) because it allows Kafka to:
- Manage metadata internally in a replicated log.
- Elect leaders using Raft instead of ZooKeeper.
- Guarantee consistent metadata across all brokers.
- Eliminate slow controller restarts and reloading phases.
- Scale to millions of partitions with fast failover and simple operations.
Essentially, Kafka now manages both data and metadata the same way — as ordered, replicated event logs — giving it the reliability, scalability, and simplicity that ZooKeeper could never provide.