Kafka Raft Consensus Protocol
KRaft: Kafka's New Raft-Based Controller🔗
Below is a complete, detailed explanation of why the Kafka community decided to replace the controller, covering the practical and technical motivations in depth and showing how those issues drove the design of the replacement (KRaft / controller quorum).
High-level motivation🔗
At a conceptual level, the controller is the component that owns cluster metadata: broker membership, topic and partition configuration, leader assignments, ISR membership, and other cluster-scoped state. The historical controller implementation relied on ZooKeeper as the authoritative metadata store and used asynchronous interactions with brokers and ZooKeeper. Over time, that architectural coupling exposed hard limits and a set of correctness and operational problems that became increasingly painful as Kafka was pushed to support many more partitions and larger clusters. The decision to replace the controller arose from the need for:
- Better scalability (support far larger partition counts with predictable performance).
- Stronger metadata correctness and determinism (fewer subtle, hard-to-detect inconsistency windows).
- Faster recovery and bootstrap (reduce time-to-availability after controller restart).
- Simpler operational model (no longer require operators to run and tune a separate distributed system, ZooKeeper).
- Cleaner ownership and simpler architecture (eliminate split responsibilities and duplication of control paths).
Each of these is discussed below in technical detail.
1) Metadata write/read asymmetry → inconsistency windows🔗
The symptom🔗
In the ZooKeeper-backed model, the controller writes metadata updates to ZooKeeper synchronously (so they are durably recorded), but then sends metadata updates to brokers asynchronously. In parallel, brokers receive metadata updates from ZooKeeper asynchronously (they watch znodes and react when changes happen). Because writing to ZooKeeper and the propagation of those writes to brokers are decoupled and asynchronous, there exists a window of time when ZooKeeper, the controller, and brokers have different views of the cluster state.
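To make the window concrete, here is a minimal, self-contained sketch (not Kafka code; the class and variable names are illustrative) of the propagation model just described: one synchronous write to the canonical store, followed by asynchronous pushes to brokers, with a client observing divergent views in between.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MetadataPropagationSketch {
    // Canonical store (standing in for ZooKeeper): the controller updates it synchronously.
    static final Map<String, Integer> canonicalLeader = new ConcurrentHashMap<>();
    // Each broker's cached view of the leader: updated asynchronously, some later than others.
    static final Map<Integer, Integer> brokerView = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        String partition = "orders-0";
        canonicalLeader.put(partition, 1);
        brokerView.put(1, 1);
        brokerView.put(2, 1);

        // Controller moves leadership of orders-0 from broker 1 to broker 2.
        canonicalLeader.put(partition, 2);                 // synchronous, durable write
        ExecutorService notifier = Executors.newSingleThreadExecutor();
        notifier.submit(() -> {                            // asynchronous push to brokers
            sleep(50);  brokerView.put(1, 2);              // broker 1 learns after ~50 ms
            sleep(100); brokerView.put(2, 2);              // broker 2 learns after ~150 ms
        });

        // A client asking different brokers inside the window gets different answers.
        sleep(20);
        System.out.printf("canonical=%d broker1=%d broker2=%d%n",
                canonicalLeader.get(partition), brokerView.get(1), brokerView.get(2));

        notifier.shutdown();
        notifier.awaitTermination(1, TimeUnit.SECONDS);
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Running it prints the canonical leader as broker 2 while both broker caches still answer broker 1, which is exactly the divergence window discussed below.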
Why this matters🔗
- When metadata (leader assignments, ISR, topic config) is durably stored only in ZooKeeper but not yet known to all brokers, producers or consumers hitting different brokers may get inconsistent responses. A producer might be told the leader is broker X by one broker while another broker still believes the leader is broker Y (see the client-side sketch after this list).
- These divergence windows create hard-to-detect edge cases: race conditions where the controller believes a transition is complete while some brokers still operate with stale metadata, potentially causing failed writes, misrouted requests, or even split-brain-like behaviors for leadership decisions.
- Detecting, reproducing, and reasoning about these races is difficult because the events are asynchronous and orderings can differ per broker and per network partition.
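From a client's perspective, that divergence window surfaces as retriable errors: a produce request sent to a broker that is no longer the leader fails (in current Kafka versions with NOT_LEADER_OR_FOLLOWER), after which the client refreshes its metadata and retries. The snippet below, with placeholder broker addresses and topic name, shows only the standard producer settings that mask such windows; it is not part of the controller redesign itself.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Retriable errors such as NOT_LEADER_OR_FOLLOWER trigger a metadata refresh and a retry.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"));
        }
    }
}
```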
Root cause🔗
The root cause is the split control plane: ZooKeeper is the persistent canonical store, the controller writes to it synchronously, and brokers watch and react asynchronously. That split increases complexity and leaves correctness to the careful orchestration of asynchronous events.
2) Controller restart cost — reading entire metadata set from ZooKeeper🔗
The symptom🔗
Whenever the controller process restarts or a new controller is elected, it must reconstruct the global cluster metadata by reading znodes for all topics, partitions, replicas, ISRs, broker registrations, and more. For large clusters with many partitions, that can be a lot of data and many ZooKeeper requests.
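As a rough back-of-envelope illustration (the partition counts, per-read latency, and pipelining factor are assumptions for the sake of the arithmetic, not measurements), the number of ZooKeeper reads needed to rebuild controller state grows roughly linearly with topic and partition count:

```java
// Back-of-envelope sketch of why controller failover cost grew with partition count under the
// ZooKeeper model. The znode layout in the comments matches what Kafka used (/brokers/ids,
// /brokers/topics/<t>, /brokers/topics/<t>/partitions/<p>/state); the latency and pipelining
// numbers are assumptions for illustration only.
public class ControllerBootstrapCost {
    public static void main(String[] args) {
        int topics = 5_000;
        int partitionsPerTopic = 20;
        long partitions = (long) topics * partitionsPerTopic;   // 100,000 partitions

        // Roughly: one read for the broker list, one per topic for the replica assignment,
        // and one per partition for the leader/ISR state znode.
        long zkReads = 1L + topics + partitions;

        double assumedReadLatencyMs = 0.5;                      // illustrative per-read cost
        double pipelinedRequestsInFlight = 10;                  // async/pipelined reads
        double seconds = zkReads * assumedReadLatencyMs / pipelinedRequestsInFlight / 1000.0;

        System.out.printf("~%,d ZooKeeper reads, on the order of %.1f s to rebuild state%n",
                zkReads, seconds);
    }
}
```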
Why this matters🔗
- Reading and validating all that state is I/O-heavy and, even with pipelined async ZooKeeper calls, can take seconds or longer. That delay prolongs the time the cluster is without a fully functioning controller.
- During that time the cluster cannot perform new metadata changes or reliably complete leader elections, reducing the cluster’s ability to respond to broker failures and increasing downtime or unavailability windows.
- The restart cost scales with partition count; therefore growth in partitions causes longer outages or longer convergence times.
Root cause🔗
ZooKeeper was not designed to serve as a high-volume metadata store holding very large numbers of tiny znodes that must be read in bulk for reconstruction. The controller's need to rebuild its full in-memory maps on each restart exposed this scalability limit.
3) Fragmented metadata ownership and inconsistent paths for operations🔗
The symptom🔗
As features were added over the years, Kafka's metadata ownership became fragmented: some operations were mediated by the controller, some could be handled by any broker, and some were applied directly to ZooKeeper by tools or admin APIs (for example, older admin tooling could write topic and configuration znodes straight into ZooKeeper via its --zookeeper option, bypassing the controller entirely). This produced inconsistent semantics and tight coupling between components.
Why this matters🔗
- Different code paths handling metadata increase the cognitive and operational burden on both developers and operators. Bugs and race conditions can arise when different components make assumptions about who “owns” the truth for a particular operation.
- The mix of direct ZooKeeper writes and broker/controller-mediated writes complicated reasoning about atomicity and ordering of metadata updates.
Root cause🔗
The architecture evolved organically; ZooKeeper was used both for coordination and as the single source of truth. Over time that created multiple interaction patterns and divergence in responsibility.
4) Operational burden of running ZooKeeper as an additional distributed system🔗
The symptom🔗
ZooKeeper is a separate distributed system with its own deployment, scaling, and tuning requirements: ensemble sizing, disk and network tuning, monitoring, and handling of session timeouts and ephemeral znodes. Running Kafka therefore meant operating two distributed systems.
Why this matters🔗
- Running and debugging ZooKeeper failures adds complexity for teams deploying Kafka in production. Knowledge of ZooKeeper internals (for tuning session timeouts, understanding znode behavior, diagnosing quorum issues) became part of running Kafka reliably.
- It increases the total operational surface area and failure modes operators must handle.
Root cause🔗
The early design choice to use ZooKeeper as the canonical metadata store meant Kafka could not avoid the operational responsibilities imposed by ZooKeeper.
How these problems motivated replacing the controller🔗
Taken together, the above problems pointed to a fundamental architectural mismatch between the level of scale and dynamism Kafka wanted to support and the existing controller + ZooKeeper model. The key goals for a replacement were to:
- Co-locate metadata management with Kafka itself, removing the separate ZooKeeper dependency for metadata.
- Introduce a log-based, consensus-backed metadata store within Kafka that provides a single, strongly-consistent source of truth with linearizable semantics for metadata changes.
- Enable a controller quorum (multiple controller nodes cooperating via a replicated metadata log) so controller failures don’t require full metadata re-read and can be handled with predictable leader election among controllers.
- Allow atomic metadata changes via durable, append-only metadata logs, which makes ordering and replication deterministic and easier to reason about.
- Reduce restart time and bootstrap cost by persisting metadata in an internal replicated log; a newly elected controller can pick up the tail of the log and resume without re-reading thousands of znodes.
- Simplify operations by removing the need to manage a separate ZooKeeper cluster for metadata; operators manage only Kafka and its internal quorum (see the configuration sketch after this list).
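As a concrete illustration of that simplified model, a minimal KRaft-mode configuration for a node acting as both broker and controller looks roughly like the sketch below. Host names, ports, and paths are placeholders, and quorum configuration details have evolved across Kafka releases, so treat this as the shape of the configuration rather than a copy-paste recipe. Note that no zookeeper.connect setting appears anywhere.

```properties
# Minimal illustrative KRaft configuration for a combined broker+controller node.
# Values are placeholders; consult the official Kafka docs for the release you run.
process.roles=broker,controller
node.id=1

# Static controller quorum: node.id@host:port for each controller voter.
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093

listeners=PLAINTEXT://broker1:9092,CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT

# The log directory must be formatted with kafka-storage.sh before first start.
log.dirs=/var/lib/kafka/data
```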
Concrete design directions that followed🔗
The replacement work (often discussed as KRaft and the controller quorum design) therefore implemented:
- An internal metadata log (append-only) replicated by a controller quorum using an internal consensus protocol.
- A controller leader and its followers behave like any other replicated service: the leader appends metadata changes, followers replicate them, and a record is committed once a quorum of controllers has it (see the sketch after this list).
- Atomic, ordered, and durable metadata modifications that are applied in the same order everywhere, removing the ZooKeeper async propagation window and eliminating the split control plane.
- Faster failover since new controllers do not need to rehydrate state by reading many small znodes; they replay/consume the metadata log and are already in sync or can catch up quickly.
- Unified ownership of metadata, removing the split between controller-managed and broker-managed updates and offering a single authoritative path for all cluster metadata changes.
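The following is a deliberately simplified, non-Kafka sketch of the core idea in the first two bullets above: a leader appends records to an append-only metadata log, a record counts as committed once a majority of the controller quorum (leader included) has replicated it, and a newly elected controller rebuilds its state by replaying the committed prefix. Real KRaft adds leader epochs, fetch-based replication, and snapshots, none of which are modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class MetadataLogSketch {
    record MetadataRecord(long offset, String payload) {}

    static final int QUORUM_SIZE = 3;                    // e.g. three controller nodes
    static final int MAJORITY = QUORUM_SIZE / 2 + 1;

    final List<MetadataRecord> log = new ArrayList<>();  // leader's append-only metadata log
    long highWatermark = -1;                             // highest committed offset

    long append(String payload) {                        // leader-side append
        long offset = log.size();
        log.add(new MetadataRecord(offset, payload));
        return offset;
    }

    // Called as replication acknowledgements arrive; acks includes the leader itself.
    void maybeAdvanceHighWatermark(long offset, int acks) {
        if (acks >= MAJORITY && offset > highWatermark) {
            highWatermark = offset;                      // the record is now committed
        }
    }

    public static void main(String[] args) {
        MetadataLogSketch quorum = new MetadataLogSketch();
        long o = quorum.append("PartitionChangeRecord{topic=orders, partition=0, leader=2}");
        quorum.maybeAdvanceHighWatermark(o, 2);          // leader + one follower = majority of 3
        System.out.println("committed up to offset " + quorum.highWatermark);

        // A newly elected controller does not re-read external state: it replays the committed
        // prefix of this same log (plus a snapshot) to rebuild its in-memory metadata image.
        quorum.log.stream()
              .filter(r -> r.offset() <= quorum.highWatermark)
              .forEach(r -> System.out.println("replay " + r.payload()));
    }
}
```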
Summary🔗
In short, the decision to replace the controller came from a confluence of practical and architectural problems: asynchronous gaps between ZooKeeper and brokers that produced inconsistent metadata windows, slow and expensive controller bootstraps that scaled poorly, fragmented metadata ownership, and the extra burden of operating ZooKeeper as a separate system. Replacing the controller with a quorum-backed, log-based metadata system internal to Kafka addresses these issues by providing a single, durable, ordered metadata log, faster and more deterministic controller failover, and a simplified operational model for users. These changes were necessary to scale Kafka’s metadata plane to the partition counts and operational expectations of modern, large-scale deployments.