Replication in Kafka
Kafka’s Replication Architecture: Deep Dive
Apache Kafka’s replication model underpins its durability, availability, and fault tolerance guarantees. At its core, replication ensures that partition data is copied across multiple brokers, so that even if a broker fails, Kafka can continue to serve reads and writes without losing data.
1. Topics, Partitions, and Replicas
a) Topic
A topic is a named stream of records (e.g., "payments", "orders", "user-activity").
It is a logical category for messages.
b) Partition
Each topic is split into partitions — append-only, ordered logs that Kafka distributes across brokers. Each partition:
- Is stored on multiple brokers (for redundancy).
- Has an offset sequence (0, 1, 2, …) identifying message order.
- Is replicated n times (replication.factor = n).
c) Replicas
Each replica is a copy of a partition’s data. Replicas are spread across brokers for fault isolation.
Example:
Topic payments has 3 partitions and a replication factor of 3.
| Partition | Leader Broker | Followers |
|---|---|---|
| payments-0 | Broker 1 | Broker 2, 3 |
| payments-1 | Broker 2 | Broker 1, 3 |
| payments-2 | Broker 3 | Broker 1, 2 |
This ensures that every partition has multiple identical copies on different brokers.
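As a sketch, a topic with this layout could be created with Kafka’s standard CLI (the bootstrap address is illustrative, and the leader/follower placement shown above is decided by the cluster, not by the command):
kafka-topics.sh --create --topic payments --partitions 3 --replication-factor 3 --bootstrap-server broker1:9092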
2. Types of Replicas
a) Leader Replica
- Each partition has exactly one leader replica at a given time.
- All produce requests (writes) are handled by the leader.
- The leader appends new messages to its local log and coordinates replication to followers.
- Clients can consume from the leader (by default).
Kafka enforces strong ordering and consistency by ensuring that all writes flow through the leader.
b) Follower Replicas
- All other replicas are followers.
- Followers replicate data asynchronously from the leader.
- They continuously fetch new messages from the leader’s log and append them to their own local copy.
- Followers stay in sync with the leader as closely as possible.
If the leader fails, a follower can be promoted to leader, preserving availability.
3. The ISR (In-Sync Replica) Set
The ISR (in-sync replica set) is a dynamic list of replicas that are fully caught up with the leader.
A replica is considered in-sync if:
- It is alive (not crashed).
- It has replicated the leader’s data up to the latest committed offset.
- It keeps fetching from the leader and does not fall behind for longer than the configured lag window (replica.lag.time.max.ms).
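This lag window is a broker-level setting; as an illustrative server.properties entry (the value shown is the default in recent Kafka versions):
replica.lag.time.max.ms=30000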
a) Leader election and ISR
When the leader fails, only replicas in the ISR are eligible to become the new leader — unless the broker is configured to allow unclean leader elections (unclean.leader.election.enable=true), which can cause data loss.
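For reference, the setting looks like this (false is the default; it can also be overridden per topic):
unclean.leader.election.enable=false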
b) High-Water Mark (HWM)
- The HWM is the highest offset that is committed, i.e., replicated to all replicas in the ISR.
- Consumers can only read messages at or below the HWM. For example, if the leader has appended records through offset 120 but the slowest ISR replica has only replicated through offset 99, then offsets up to 99 are readable, and offsets 100–120 become visible only once the ISR catches up.
- The leader periodically shares the current HWM with followers (it is included in fetch responses).
4. The Replication Protocol (Leader → Follower)
The replication protocol governs how followers fetch data from leaders.
- Each follower sends periodic FetchRequests to its leader, requesting data starting from its last known offset.
- The leader responds with a batch of messages and the current high-water mark (HWM).
- The follower appends the data to its local log and updates its offset tracking.
- The follower’s next fetch request, which starts at a higher offset, implicitly acknowledges what it has replicated so far.
- The leader tracks follower progress and updates the ISR accordingly.
This design ensures:
- Asynchronous replication (low latency for producers).
- High throughput (batch replication, pipelined fetches).
- Consistency at read time (consumers read only committed messages).
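A few broker-side settings shape this fetch loop; the values below are simply the usual defaults, shown for illustration:
replica.fetch.wait.max.ms=500
replica.fetch.min.bytes=1
replica.fetch.max.bytes=1048576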
5. Fault Tolerance and Leader Election
When a broker (leader) fails:
- The controller detects broker failure (via heartbeats or ZooKeeper/KRaft metadata).
- It determines which partitions lost their leaders.
- For each affected partition, the controller selects a new leader from the ISR list.
- It updates the cluster metadata and informs all brokers.
- Followers learn the new leader and resume replication.
This ensures fast recovery — typically sub-second in modern Kafka versions.
Reading from Follower Replicas (KIP-392)
Originally, Kafka clients could only consume from leaders. KIP-392 (“Allow consumers to fetch from follower replicas”) changed that, adding rack-aware replica selection and enabling cross-data-center efficiency.
1. Motivation
Large, geo-distributed Kafka deployments face a problem:
- Producers and consumers in multiple data centers (or racks).
- All consumers reading from the leader causes cross-network traffic if the leader is remote.
For example, a topic’s leader may reside in Data Center A, but many consumers live in Data Center B. Every read requires cross-DC data transfer, inflating bandwidth costs and latency.
The solution: Allow consumers to read from nearest followers that are in-sync replicas.
2. Core Mechanism
a) The client.rack setting (consumer side)
- Each consumer identifies its physical or logical location:
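For example, a consumer running in rack or availability zone us-east-1 (an illustrative name) would set:
client.rack=us-east-1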
b) The replica.selector.class setting (broker side)
- Defines how brokers choose which replica to serve read requests from.
- Default: LeaderSelector – always read from the leader (the only behavior before KIP-392).
- RackAwareReplicaSelector – prefer replicas whose rack.id matches the consumer’s client.rack.
- Custom implementation – you can plug in your own logic by implementing the ReplicaSelector interface (e.g., based on latency, load, or region).
c) Rack-awareness in brokers
Each broker’s server.properties must include:
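For example, a broker located in us-east-1 would set:
broker.rack=us-east-1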
This allows Kafka to match client and broker locations.
3. How “read from follower” works internally
- The consumer sends a FetchRequest to the cluster.
- The broker receiving the request examines:
  - the consumer’s client.rack, and
  - the available replicas (leader + followers) and their rack.ids.
- Based on the configured replica.selector.class, the broker chooses which replica to serve the read from.
- The chosen replica (often a follower) serves data only up to its high-water mark, ensuring it does not return uncommitted messages.
- If the follower is slightly behind, the consumer sees data a few milliseconds later than it would from the leader.
4. Consistency and Safety Guarantees
a) Commit visibility
- Only committed messages (those below the HWM) are visible on followers.
- Followers fetch the leader’s HWM along with each data batch.
- Consumers therefore never see uncommitted messages, maintaining exactly-once and at-least-once semantics.
b) Potential delay
- There’s a small propagation delay between when data is committed on the leader and when the follower becomes aware of the new HWM.
- Thus, reading from followers may slightly increase consumer lag compared to the leader.
- This is a trade-off: lower cross-network cost at the expense of slightly higher latency.
c) Failure scenarios
- If the leader fails, one of the followers (already serving reads) can quickly be promoted to leader with minimal data loss (since it’s in-sync).
- Consumers automatically re-route based on updated metadata.
5. Benefits of Read-from-Follower (RFF)
| Benefit | Description |
|---|---|
| Reduced network cost | Local consumers can fetch from local replicas instead of remote leaders. |
| Improved read scalability | Load is distributed across replicas rather than all reads hitting the leader. |
| Faster geo-local access | Latency improves when consumers fetch from nearby brokers. |
| Better resource utilization | Followers now share read load, balancing CPU and disk usage. |
6. Trade-offs and Considerations
| Consideration | Description |
|---|---|
| Increased replication delay | Followers may lag the leader slightly; consumers see updates later. |
| Consistency vs latency | Followers serve only committed data, so low-latency uncommitted reads aren’t possible. |
| Complex client routing | Consumers need rack info; misconfiguration can cause suboptimal routing. |
| Operational complexity | Monitoring ISR size and replica lag becomes more important. |
| Only works for fetch (read) | Produce (write) requests still go to the leader. |
7. Practical Example Configuration
Broker side:
broker.id=1
broker.rack=us-east-1
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
Consumer side:
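client.rack=us-east-1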
The consumer will now fetch from the nearest in-sync replica on the same rack (us-east-1).
The Bigger Picture: Why This Matters
- Scalability — As Kafka clusters reach thousands of brokers, distributing read load becomes essential.
- Geo-awareness — Multi-region Kafka deployments (via MirrorMaker 2 or Confluent Cluster Linking) benefit from rack-aware fetching.
- Performance — Decreased inter-rack traffic reduces latency spikes caused by cross-zone routing.
- Reliability — Because followers can now serve reads safely, Kafka becomes more resilient to temporary leader overloads or network partitions.
Future Directions (beyond KIP-392)
- Adaptive Replica Selection: Ongoing discussions explore dynamic selection based on network latency and broker load (not just rack IDs).
- Tiered Storage Integration: With Kafka’s tiered storage (KIP-405), replicas may fetch historical data from cloud/object storage, and follower reads might integrate with tiered logs.
- Cross-cluster follower reads: Future designs could allow follower fetches from remote clusters in multi-region architectures.
Summary
| Concept | Description |
|---|---|
| Leader Replica | Handles all writes; default source for reads. |
| Follower Replica | Copies data from leader; eligible for leadership during failover. |
| ISR Set | Replicas fully caught up with leader; only ISR members can become new leaders. |
| High-Water Mark | Offset of last committed record replicated to all ISR members. |
| KIP-392 | Adds support for reading from follower replicas based on rack awareness. |
| client.rack | Consumer config indicating client location. |
| replica.selector.class | Broker config determining how replicas are chosen for read requests. |
| Guarantee | Only committed messages (≤ HWM) are served, preserving reliability. |
| Trade-off | Slight extra delay compared to reading from the leader. |
In essence, Kafka’s replication system ensures that each partition’s data is durable, consistent, and available across brokers. And with KIP-392’s “read-from-follower”, Kafka’s architecture evolves beyond fault tolerance — towards network efficiency, geo-awareness, and horizontal scalability for global-scale deployments.