Replication in Kafka🔗


Kafka’s Replication Architecture: Deep Dive🔗

Apache Kafka’s replication model underpins its durability, availability, and fault tolerance guarantees. At its core, replication ensures that partition data is copied across multiple brokers, so that even if a broker fails, Kafka can continue to serve reads and writes without losing data.


1. Topics, Partitions, and Replicas🔗

a) Topic🔗

A topic is a named stream of records (e.g., "payments", "orders", "user-activity"). It is a logical category for messages.

b) Partition🔗

Each topic is split into partitions — append-only, ordered logs that Kafka distributes across brokers. Each partition:

  • Is stored on multiple brokers (for redundancy).
  • Has an offset sequence (0, 1, 2, …) identifying message order.
  • Is replicated n times (replication.factor = n).

c) Replicas🔗

Each replica is a copy of a partition’s data. Replicas are spread across brokers for fault isolation.

Example: Topic payments has 3 partitions and a replication factor of 3.

| Partition  | Leader Broker | Followers    |
|------------|---------------|--------------|
| payments-0 | Broker 1      | Brokers 2, 3 |
| payments-1 | Broker 2      | Brokers 1, 3 |
| payments-2 | Broker 3      | Brokers 1, 2 |

This ensures that every partition has multiple identical copies on different brokers.
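
As a minimal sketch, the topic in this example could be created programmatically with the Java AdminClient. The bootstrap address below is a placeholder assumption; the topic name and counts match the example above.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

// Sketch: create "payments" with 3 partitions, each replicated onto 3 brokers.
public class CreatePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3 (one leader + two followers per partition)
            NewTopic payments = new NewTopic("payments", 3, (short) 3);
            admin.createTopics(List.of(payments)).all().get();
        }
    }
}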


2. Types of Replicas🔗

a) Leader Replica🔗

  • Each partition has exactly one leader replica at a given time.
  • All produce requests (writes) are handled by the leader.
  • The leader appends new messages to its local log and coordinates replication to followers.
  • By default, consumers also fetch (read) from the leader.

Kafka enforces strong ordering and consistency by ensuring that all writes flow through the leader.
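
A minimal producer sketch of this write path, assuming a broker at localhost:9092 and the payments topic from earlier. With acks=all the leader acknowledges a record only after the in-sync replicas have replicated it.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

// Sketch: every send() is routed to the current leader of the record's partition.
public class LeaderWriteExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the full ISR, not just the leader

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "amount=19.99"),
                    (metadata, exception) -> {
                        if (exception == null) {
                            System.out.printf("committed at partition %d, offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}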

b) Follower Replicas🔗

  • All other replicas are followers.
  • Followers replicate data asynchronously from the leader.
  • They continuously fetch new messages from the leader’s log and append them to their own local copy.
  • Followers stay in sync with the leader as closely as possible.

If the leader fails, a follower can be promoted to leader, preserving availability.


3. The ISR (In-Sync Replica) Set🔗

The ISR (in-sync replica set) is a dynamic list of replicas that are fully caught up with the leader.

A replica is considered in-sync if:

  1. It is alive (its broker has not crashed and maintains its session with the cluster).
  2. It has caught up to the leader’s log end offset (not merely the last committed offset).
  3. It keeps sending fetch requests to the leader and has not fallen behind for longer than the configured window (replica.lag.time.max.ms).

a) Leader election and ISR🔗

When the leader fails, only replicas in the ISR are eligible to become the new leader — unless the broker is configured to allow unclean leader elections (unclean.leader.election.enable=true), which can cause data loss.
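
As a hedged sketch of how these settings are commonly tightened, the Java AdminClient can apply topic-level configs such as unclean.leader.election.enable and min.insync.replicas (the latter works together with acks=all producers). The topic name and bootstrap address are assumptions.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Sketch: keep unclean elections off and require 2 in-sync replicas per acknowledged write.
public class IsrSafetyConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
            Collection<AlterConfigOp> ops = List.of(
                    // Only ISR members may ever become leader (no data-loss elections).
                    new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                            AlterConfigOp.OpType.SET),
                    // With acks=all, at least 2 replicas must have the record before it is acknowledged.
                    new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}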

b) High-Water Mark (HWM)🔗

  • The HWM marks the point up to which every replica in the ISR has replicated the log; messages below it are considered committed.
  • Consumers can only read messages below the HWM.
  • The leader propagates the current HWM to followers in its fetch responses.
  • Example: if the leader has appended offsets 0–104 but the slowest ISR member has only replicated offsets 0–99, the HWM is 100 and offsets 100–104 are not yet visible to consumers.

4. The Replication Protocol (Leader → Follower)🔗

The replication protocol governs how followers fetch data from the leader; a minimal sketch of the cycle appears after the lists below.

  1. Each follower sends periodic FetchRequests to its leader, requesting data starting from its last known offset.
  2. The leader responds with a batch of messages and the current high-water mark (HWM).
  3. The follower appends the data to its local log and updates its offset tracking.
  4. The follower acknowledges receipt.
  5. The leader tracks follower progress and updates the ISR accordingly.

This design ensures:

  • Asynchronous replication (low latency for producers).
  • High throughput (batch replication, pipelined fetches).
  • Consistency at read time (consumers read only committed messages).
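
The following is an illustrative, in-memory sketch of the fetch cycle above (not Kafka’s actual implementation). It shows the high-water mark advancing to the lowest log end offset across the in-sync replicas.

import java.util.ArrayList;
import java.util.List;

// Toy model: a replica is an append-only log plus a high-water mark.
public class ReplicationProtocolSketch {

    static class Replica {
        final List<String> log = new ArrayList<>();
        int highWaterMark = 0;
    }

    public static void main(String[] args) {
        Replica leader = new Replica();
        List<Replica> followers = List.of(new Replica(), new Replica());

        // 1. Producer writes are appended to the leader's log.
        leader.log.addAll(List.of("m0", "m1", "m2", "m3"));

        // 2. Each follower fetches from its own log-end offset and appends locally.
        for (Replica f : followers) {
            f.log.addAll(leader.log.subList(f.log.size(), leader.log.size()));
        }

        // 3. The leader advances the HWM to the smallest log-end offset in the ISR:
        //    everything below it is replicated everywhere and therefore committed.
        int hwm = leader.log.size();
        for (Replica f : followers) hwm = Math.min(hwm, f.log.size());
        leader.highWaterMark = hwm;

        // 4. Followers learn the new HWM from the next fetch response.
        followers.forEach(f -> f.highWaterMark = leader.highWaterMark);

        System.out.println("Consumers may read offsets 0.." + (leader.highWaterMark - 1));
    }
}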

5. Fault Tolerance and Leader Election🔗

When a broker (leader) fails:

  1. The controller detects the broker failure (via heartbeats or ZooKeeper/KRaft metadata).
  2. It determines which partitions lost their leaders.
  3. For each affected partition, the controller selects a new leader from the ISR.
  4. It updates the cluster metadata and informs all brokers.
  5. Followers learn of the new leader and resume replicating from it.

This ensures fast recovery — typically sub-second in modern Kafka versions.
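
To observe the outcome of such an election, the current leader and ISR of each partition can be inspected with the Java AdminClient. A minimal sketch, assuming the payments topic, a placeholder bootstrap address, and a reasonably recent Kafka clients library:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.List;
import java.util.Properties;

// Sketch: print the leader, replica set, and ISR for every partition of "payments".
public class DescribePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            TopicDescription description = admin.describeTopics(List.of("payments"))
                    .allTopicNames().get().get("payments");
            for (TopicPartitionInfo p : description.partitions()) {
                System.out.printf("partition %d: leader=%s, replicas=%s, isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}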


Reading from Follower Replicas (KIP-392)🔗

Originally, Kafka clients could only consume from partition leaders. KIP-392 (“Allow consumers to fetch from closest replica”) changed that, adding rack-aware replica selection and enabling cross-data-center efficiency.


1. Motivation🔗

Large, geo-distributed Kafka deployments face a problem:

  • Producers and consumers in multiple data centers (or racks).
  • Forcing every consumer to read from the leader generates cross-network traffic whenever that leader is remote.

For example, a topic’s leader may reside in Data Center A, but many consumers live in Data Center B. Every read requires cross-DC data transfer, inflating bandwidth costs and latency.

The solution: allow consumers to read from the nearest in-sync replica, even if it is a follower.


2. Core Mechanism🔗

a) The client.rack setting (consumer side)🔗

  • Each consumer identifies its physical or logical location:

client.rack=us-east-2

  • This informs the brokers where the client is located.

b) The replica.selector.class setting (broker side)🔗

  • Defines how brokers choose which replica serves a consumer’s fetch (read) requests.
  • By default it is unset, so all reads go to the leader (the pre-KIP-392 behavior). To enable rack-aware reads, set:

replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

  • The available options:

  • Unset (default) – always read from the leader, as before KIP-392.
  • RackAwareReplicaSelector – prefer a replica whose broker.rack matches the consumer’s client.rack.
  • Custom implementation – plug in your own logic by implementing the ReplicaSelector interface (e.g., based on latency, load, or region).

c) Rack-awareness in brokers🔗

Each broker’s server.properties must declare its rack, for example:

broker.rack=us-east-2

This allows Kafka to match client and broker locations.


3. How “read from follower” works internally🔗

  1. The consumer sends a FetchRequest to the cluster.
  2. The broker receiving the request examines the consumer’s client.rack and the available replicas (leader + followers) together with their broker.rack values.
  3. Based on the configured replica.selector.class, the broker chooses which replica should serve the read.
  4. The chosen replica (often a follower) serves data only up to its high-water mark, so it never returns uncommitted messages.
  5. If the follower is slightly behind, the consumer sees new data a few milliseconds later than it would from the leader.

4. Consistency and Safety Guarantees🔗

a) Commit visibility🔗

  • Only committed messages (those below the HWM) are visible on followers.
  • Followers fetch the leader’s HWM along with each data batch.
  • Consumers therefore never see uncommitted messages, maintaining exactly-once and at-least-once semantics.

b) Potential delay🔗

  • There’s a small propagation delay between when data is committed on the leader and when the follower becomes aware of the new HWM.
  • Thus, reading from followers may slightly increase consumer lag compared to the leader.
  • This is a trade-off: lower cross-network cost at the expense of slightly higher latency.

c) Failure scenarios🔗

  • If the leader fails, a follower (possibly one already serving reads) can quickly be promoted to leader without losing committed data, since it is in the ISR.
  • Consumers automatically re-route based on updated metadata.

5. Benefits of Read-from-Follower (RFF)🔗

| Benefit | Description |
|---|---|
| Reduced network cost | Local consumers can fetch from local replicas instead of remote leaders. |
| Improved read scalability | Read load is distributed across replicas rather than all reads hitting the leader. |
| Faster geo-local access | Latency improves when consumers fetch from nearby brokers. |
| Better resource utilization | Followers share the read load, balancing CPU and disk usage. |

6. Trade-offs and Considerations🔗

| Consideration | Description |
|---|---|
| Increased replication delay | Followers may lag the leader slightly; consumers see updates later. |
| Consistency vs. latency | Followers serve only committed data, so low-latency reads of not-yet-committed records aren’t possible. |
| Complex client routing | Consumers need rack info; misconfiguration can cause suboptimal routing. |
| Operational complexity | Monitoring ISR size and replica lag becomes more important. |
| Only works for fetch (read) | Produce (write) requests still go to the leader. |

7. Practical Example Configuration🔗

Broker side:🔗

broker.id=1
broker.rack=us-east-1
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

Consumer side:🔗

client.rack=us-east-1
fetch.max.bytes=1048576
enable.auto.commit=true

The consumer will now fetch from the nearest in-sync replica on the same rack (us-east-1).
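
The same consumer-side settings can also be expressed in Java client code. A minimal sketch, in which the group id, topic, and bootstrap address are assumptions:

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

// Sketch: with client.rack set and a rack-aware selector on the brokers,
// fetches are served by a nearby in-sync replica when one exists.
public class RackAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-readers");         // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1"); // same value as the nearby brokers' broker.rack

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}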


The Bigger Picture: Why This Matters🔗

  1. Scalability — As Kafka clusters reach thousands of brokers, distributing read load becomes essential.
  2. Geo-awareness — Multi-region Kafka deployments (via MirrorMaker 2 or Confluent Cluster Linking) benefit from rack-aware fetching.
  3. Performance — Decreased inter-rack traffic reduces latency spikes caused by cross-zone routing.
  4. Reliability — Because followers can now serve reads safely, Kafka becomes more resilient to temporary leader overloads or network partitions.

Future Directions (beyond KIP-392)🔗

  • Adaptive Replica Selection: Ongoing discussions explore dynamic selection based on network latency and broker load (not just rack IDs).
  • Tiered Storage Integration: With Kafka’s tiered storage (KIP-405), replicas may fetch historical data from cloud/object storage, and follower reads might integrate with tiered logs.
  • Cross-cluster follower reads: Future designs could allow follower fetches from remote clusters in multi-region architectures.

Summary🔗

| Concept | Description |
|---|---|
| Leader replica | Handles all writes; default source for reads. |
| Follower replica | Copies data from the leader; eligible for leadership during failover. |
| ISR set | Replicas fully caught up with the leader; only ISR members can become the new leader (unless unclean election is enabled). |
| High-water mark | Point up to which all ISR members have replicated; only records below it are committed and readable. |
| KIP-392 | Adds support for reading from follower replicas based on rack awareness. |
| client.rack | Consumer config indicating the client’s location. |
| replica.selector.class | Broker config determining how the replica for a read request is chosen. |
| Guarantee | Only committed messages (below the HWM) are served, preserving reliability. |
| Trade-off | Slight extra delay compared to reading from the leader. |

In essence, Kafka’s replication system ensures that each partition’s data is durable, consistent, and available across brokers. And with KIP-392’s “read-from-follower”, Kafka’s architecture evolves beyond fault tolerance — towards network efficiency, geo-awareness, and horizontal scalability for global-scale deployments.