In-depth explanation of leader/follower replication, ISR, and related behavior
Basic mechanics: Fetch requests, offsets, and how the leader tracks progress
- Each partition has one leader replica and zero or more follower replicas. Producers write only to the leader; consumers usually read from the leader (though replication uses the same Fetch RPC).
- Followers maintain background fetcher threads that send Fetch requests to the leader. A Fetch request asks for messages starting at a particular offset (the next offset the follower needs).
- Fetch requests are always for a contiguous sequence starting at that offset. Because requests are ordered and offsets are monotonic, the leader can infer exactly which messages each follower has already received: if a follower requests offset N, the follower has successfully fetched (and therefore stored) all messages up to offset N-1 (see the sketch after this list).
- Two useful offset concepts:
  - Log End Offset (LEO) or Log End: the next offset a replica would assign to a newly appended message (i.e., last offset + 1); it indicates how many messages are present at that replica.
  - Follower’s Fetch Offset: the offset the follower requests; the leader uses this to know how far the follower is behind.
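To make this concrete, here is a minimal sketch in Java of the bookkeeping described above. It is an illustrative model, not Kafka’s broker code: the class, method, and field names are invented, and real brokers track considerably more state per follower.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model: the leader infers follower progress from the offsets
// followers ask for, because a request for offset N implies offsets 0..N-1
// are already stored on that follower.
public class LeaderProgressModel {
    // Log End Offset (LEO): next offset the leader would assign.
    private long leaderLogEndOffset = 0L;
    // Latest fetch offset seen from each follower (broker id -> offset).
    private final Map<Integer, Long> followerFetchOffsets = new HashMap<>();

    // Append one message on the leader; LEO advances by one.
    public long append() {
        return leaderLogEndOffset++;
    }

    // Record a Fetch request: the requested offset is itself a progress report.
    public void onFetchRequest(int followerId, long fetchOffset) {
        followerFetchOffsets.put(followerId, fetchOffset);
    }

    // Messages the follower still needs: leader LEO minus its fetch offset.
    public long lagInMessages(int followerId) {
        return leaderLogEndOffset - followerFetchOffsets.getOrDefault(followerId, 0L);
    }

    public static void main(String[] args) {
        LeaderProgressModel leader = new LeaderProgressModel();
        for (int i = 0; i <= 100; i++) leader.append();    // offsets 0..100, LEO = 101
        leader.onFetchRequest(1, 101);                     // follower 1 is fully caught up
        leader.onFetchRequest(2, 91);                      // follower 2 has offsets 0..90
        System.out.println("Follower 1 lag: " + leader.lagInMessages(1)); // 0
        System.out.println("Follower 2 lag: " + leader.lagInMessages(2)); // 10
    }
}
```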
In-Sync Replicas (ISR)
- The ISR is the set of replicas that are considered sufficiently up-to-date with the leader to be eligible for leader election.
- A replica is in the ISR if it has been actively fetching and has not lagged behind the leader by more than configured thresholds (see below).
- Only replicas in the ISR are candidates for becoming leader if the current leader fails. This prevents electing a replica that lacks recent writes (which would cause data loss relative to acknowledged writes).
How “out of sync” is determined
- Kafka uses timing and progress to decide whether a follower is out of sync:
  - If a follower stops sending Fetch requests for longer than a configured time window, or
  - If a follower continues fetching but fails to catch up to the leader (i.e., its fetched offset remains behind the leader’s LEO) for too long, then the controller will remove the follower from the ISR.
- The primary configuration that controls this timeout is:
  - replica.lag.time.max.ms: if a follower’s fetched offset hasn’t advanced to the leader’s latest offset within this window, it can be removed from the ISR (a simplified sketch of this check follows this list).
  - Other related timeouts and settings include replica fetcher timeouts and request timeouts, but replica.lag.time.max.ms is the main one controlling ISR membership.
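As a rough sketch of how such a check can work (assumed, simplified logic; not the broker’s actual implementation), the leader can remember the last time each follower was fully caught up and compare that timestamp against replica.lag.time.max.ms:

```java
// Simplified sketch of a replica.lag.time.max.ms style check. Both failure
// modes above (no fetches at all, or fetches that never reach the leader's
// LEO) show up as a stale lastCaughtUpTimeMs.
public class IsrLagCheck {
    private final long replicaLagTimeMaxMs;
    private long lastCaughtUpTimeMs;

    public IsrLagCheck(long replicaLagTimeMaxMs, long nowMs) {
        this.replicaLagTimeMaxMs = replicaLagTimeMaxMs;
        this.lastCaughtUpTimeMs = nowMs;
    }

    // Called when the leader handles a Fetch request from this follower.
    public void onFetch(long fetchOffset, long leaderLogEndOffset, long nowMs) {
        if (fetchOffset >= leaderLogEndOffset) {
            lastCaughtUpTimeMs = nowMs; // follower has everything the leader has right now
        }
    }

    // Out of sync if the follower has not been fully caught up within the window.
    public boolean isOutOfSync(long nowMs) {
        return nowMs - lastCaughtUpTimeMs > replicaLagTimeMaxMs;
    }

    public static void main(String[] args) {
        IsrLagCheck follower = new IsrLagCheck(30_000, 0); // 30 s window, as an example
        follower.onFetch(90, 101, 10_000);                 // still fetching, but behind
        System.out.println(follower.isOutOfSync(45_000));  // true: not caught up within 30 s
    }
}
```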
Configuration parameters and safety-related flags
- replica.lag.time.max.ms
  - Controls how long a follower can lag (or be inactive) before being removed from the ISR. Shorter values make the ISR strict (followers must stay very current); longer values are more tolerant but increase the risk of electing a stale replica.
- min.insync.replicas (topic-level or broker default)
  - When producers use acks=all (i.e., require acknowledgement from all in-sync replicas), the broker enforces that at least min.insync.replicas replicas are in the ISR; otherwise the broker rejects writes. This prevents acknowledged writes from having too few replicas.
- acks (producer setting)
  - acks=0, acks=1, and acks=all control client durability semantics. acks=all combined with min.insync.replicas gives the strongest durability guarantee (see the configuration sketch after this list).
- unclean.leader.election.enable
  - If false (recommended for durability), Kafka will not allow a follower that is not in the ISR to be promoted to leader. If true, a non-ISR replica can become leader to improve availability at the cost of possible data loss (because the new leader might lack some acknowledged messages).
- replica.fetch.max.bytes, fetcher thread settings, request timeouts, etc.
  - Control throughput and replication performance, and thus indirectly affect whether followers can keep up and remain in the ISR.
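A hedged example of wiring these settings up with the standard Java clients is shown below: it sets acks=all on a producer and raises min.insync.replicas to 2 on an existing topic via the Admin client (incrementalAlterConfigs). The broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class DurabilityConfigExample {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092"; // placeholder broker address
        String topic = "orders";             // placeholder topic name

        // Producer side: acks=all waits for the full ISR, subject to min.insync.replicas.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // get() surfaces "not enough replicas" errors if the ISR is below the minimum.
            producer.send(new ProducerRecord<>(topic, "key", "value")).get();
        }

        // Topic side: require at least 2 in-sync replicas before acks=all writes succeed.
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(adminProps)) {
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic);
            AlterConfigOp setMinIsr = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(resource, List.of(setMinIsr));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```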
Leader election, data loss, and trade-offs
- If the leader fails:
  - The controller selects a new leader from the ISR (if any). Because ISR members have replicated all messages up to the leader’s committed point, electing from the ISR preserves acknowledged data.
  - If the ISR is empty and unclean.leader.election.enable=true, Kafka may elect a stale follower as leader; this recovers availability faster but can cause data loss (some previously acknowledged writes might be missing). A simplified election sketch follows this list.
- Trade-offs:
  - Durability-first: keep unclean.leader.election.enable=false, set min.insync.replicas >= 2 for replication factor >= 3, and use acks=all. This increases the risk of temporary unavailability (if too many replicas fail or are removed from the ISR) but prevents data loss.
  - Availability-first: allow unclean leader election (true) to continue serving reads/writes even when ISR members are not available; higher availability, higher risk of data loss.
  - The replica.lag.time.max.ms setting influences both availability and durability: very small values cause replicas to be dropped from the ISR quickly (which can block writes if min.insync.replicas is not met); very large values risk electing a leader that is far behind.
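The election policy sketched below is a simplified model of the behavior described above (invented method names; the real controller also deals with leader epochs, preferred replicas, and more): prefer a live ISR member, fall back to any live replica only when unclean election is enabled, otherwise leave the partition offline.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class LeaderElectionSketch {

    // assignedReplicas: the partition's replica list in preference order.
    // isr: current in-sync replicas. live: brokers currently reachable.
    public static Optional<Integer> electLeader(List<Integer> assignedReplicas,
                                                Set<Integer> isr,
                                                Set<Integer> live,
                                                boolean uncleanElectionEnabled) {
        // Clean election: a live ISR member preserves all acknowledged data.
        for (int replica : assignedReplicas) {
            if (live.contains(replica) && isr.contains(replica)) {
                return Optional.of(replica);
            }
        }
        // Unclean election: any live replica may be promoted, possibly losing
        // acknowledged messages it never replicated.
        if (uncleanElectionEnabled) {
            for (int replica : assignedReplicas) {
                if (live.contains(replica)) {
                    return Optional.of(replica);
                }
            }
        }
        // Otherwise the partition stays offline until an ISR member returns.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Integer> replicas = List.of(1, 2, 3);
        // Leader (broker 1) just failed; broker 3 had already dropped out of the ISR.
        System.out.println(electLeader(replicas, Set.of(2), Set.of(2, 3), false)); // Optional[2]
        System.out.println(electLeader(replicas, Set.of(), Set.of(3), false));     // Optional.empty
        System.out.println(electLeader(replicas, Set.of(), Set.of(3), true));      // Optional[3], data loss possible
    }
}
```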
Life-cycle of ISR changes (expansion and shrinkage)
- Replica falls behind → removed from ISR
  - If a follower stops fetching or lags beyond replica.lag.time.max.ms, the controller marks it out of sync and removes it from the ISR.
  - If min.insync.replicas is configured and the ISR size drops below min.insync.replicas, writes requiring acks=all are rejected until the ISR grows again (see the sketch after this list).
- Replica catches up → re-added to ISR
  - When a replica catches up (its fetched offset reaches the leader’s LEO, or comes within acceptable bounds) and remains active, the controller re-adds it to the ISR.
  - Re-adding a replica to the ISR is important because it restores redundancy and allows that replica to be a leader candidate again.
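The write-rejection rule in the first half of this list can be expressed as a small check, sketched below under simplified assumptions (the real broker returns a NotEnoughReplicas error rather than the generic exception used here; acks=all is carried on the wire as -1):

```java
import java.util.Set;

public class MinInsyncReplicasCheck {

    // Reject acks=all writes while the ISR is smaller than min.insync.replicas.
    public static void checkWriteAllowed(Set<Integer> isr, int minInsyncReplicas, short acks) {
        boolean requiresFullIsr = (acks == -1); // acks=all is sent as -1
        if (requiresFullIsr && isr.size() < minInsyncReplicas) {
            throw new IllegalStateException(
                "Not enough in-sync replicas: ISR size " + isr.size()
                + " < min.insync.replicas " + minInsyncReplicas);
        }
    }

    public static void main(String[] args) {
        checkWriteAllowed(Set.of(1, 2), 2, (short) -1);     // accepted: two ISR members remain
        try {
            checkWriteAllowed(Set.of(1, 2), 3, (short) -1); // rejected until the ISR grows again
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```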
Replication internals: fetcher threads, order guarantees, and HWM
- Replication uses the same Fetch protocol as consumers; follower fetchers request messages starting at a given offset and apply them in order.
- Ordering: Because fetches are for ordered offsets, followers replicate messages in the same order as the leader.
- High Watermark (HW), or Replica High Watermark (rHW):
  - The leader tracks a high watermark, which is the highest offset known to be replicated to all in-sync replicas. Consumers are only allowed to read messages up to the high watermark; this prevents reading messages that some replicas might not have persisted (see the sketch after this list).
  - When followers replicate messages and the leader sees those offsets replicated across the ISR, it advances the high watermark.
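A minimal sketch of that rule, assuming the leader tracks each replica’s LEO (its own directly, followers’ from their fetch offsets): the high watermark is the minimum LEO across the ISR, so it only advances once every ISR member has the data.

```java
import java.util.Map;
import java.util.Set;

public class HighWatermarkSketch {

    // High watermark = minimum Log End Offset across the in-sync replicas.
    public static long highWatermark(Map<Integer, Long> leoByReplica, Set<Integer> isr) {
        long hw = Long.MAX_VALUE;
        for (int replica : isr) {
            hw = Math.min(hw, leoByReplica.getOrDefault(replica, 0L));
        }
        return isr.isEmpty() ? 0L : hw;
    }

    public static void main(String[] args) {
        // Leader (1) has LEO 101; follower 2 is caught up, follower 3 only reached offset 90.
        Map<Integer, Long> leo = Map.of(1, 101L, 2, 101L, 3, 91L);
        System.out.println(highWatermark(leo, Set.of(1, 2, 3))); // 91: consumers can read offsets 0..90
        System.out.println(highWatermark(leo, Set.of(1, 2)));    // 101 once replica 3 leaves the ISR
    }
}
```

The same numbers reappear in the concrete scenario below.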
Example scenario (concrete)
- Partition has replication factor = 3: Leader L, Followers F1, F2. ISR = {L, F1, F2}.
- Leader appends messages up to offset 100 (LEO = 101).
- F1 fetches up to offset 100; F2 fetches only up to offset 90 because of network slowness.
- If F2 doesn’t fetch any new offsets for longer than replica.lag.time.max.ms, the controller removes F2 from the ISR → ISR becomes {L, F1}.
- If min.insync.replicas = 2 and the producer uses acks=all, writes still succeed because L and F1 are in the ISR. If min.insync.replicas = 3, new writes requiring acks=all will be rejected until F2 returns.
- If L fails now, the controller elects the new leader from the ISR; since L itself has failed, F1 becomes the leader. Because F2 is not in the ISR, it cannot become leader (so no data loss relative to committed offsets).
- If F2 had been promoted instead (unclean election enabled), the messages at offsets 91–100 would be lost, since F2 never fetched them.
Operational implications and what to monitor
Monitor these metrics to understand replication health:
- ISR size per partition: sudden drops indicate followers are falling out-of-sync.
- Follower fetch lag: LEO - follower_offset for each follower over time (shows how many messages behind the follower is).
- Fetch request latency and errors: high latencies or repeated errors cause followers to lag.
- Under-replicated partitions: partitions with fewer replicas than replication factor (indicates durability risk).
- Controller logs: show when replicas are removed/added to ISR and when leader elections occur.
- Broker CPU, disk I/O, and network: resource saturation causes replication lag.
Operational alerts often include:
- Partitions with ISR size < configured threshold.
- Under-replicated partition count > 0.
- Frequent leader elections for a partition.
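The sketch below reads per-partition ISR size with the Java Admin client and flags partitions whose ISR is smaller than their replica set; the topic name and broker address are placeholders, and describeTopics(...).allTopicNames() assumes a reasonably recent kafka-clients version.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class ReplicationHealthCheck {
    public static void main(String[] args) throws Exception {
        String topic = "orders"; // placeholder topic name
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            TopicDescription description =
                    admin.describeTopics(List.of(topic)).allTopicNames().get().get(topic);

            for (TopicPartitionInfo partition : description.partitions()) {
                int replicationFactor = partition.replicas().size();
                int isrSize = partition.isr().size();
                boolean underReplicated = isrSize < replicationFactor;

                System.out.printf("partition=%d leader=%s isr=%d/%d underReplicated=%b%n",
                        partition.partition(),
                        partition.leader() == null ? "none" : partition.leader().idString(),
                        isrSize, replicationFactor, underReplicated);
            }
        }
    }
}
```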
Best practices and recommendations
- Use replication factor >= 3 for production topic partitions to tolerate broker failures.
- Use acks=all for producers that need strong durability, and set min.insync.replicas to at least 2 (for RF=3); see the topic-creation sketch after this list.
- Keep unclean.leader.election.enable=false for durability-critical data (accept temporary unavailability rather than data loss).
- Tune replica.lag.time.max.ms to balance responsiveness against tolerance for transient delays. Start with sensible defaults and adapt to your environment’s latency characteristics.
- Ensure broker/network resources (I/O, CPU, NIC) are sufficient so followers can keep up: replication is network- and disk-bound.
- Monitor the metrics above and test failure scenarios (broker restart, network partition) in a staging environment.
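A sketch of creating a topic with these recommendations baked in (placeholder topic name, partition count, and broker address; producers still need acks=all as shown earlier):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3, plus durability-oriented topic configs.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of(
                            TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2",
                            TopicConfig.UNCLEAN_LEADER_ELECTION_ENABLE_CONFIG, "false"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```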
Failure modes to understand
- Network partition isolates broker: follower or leader cannot fetch → follower is removed from ISR or leader is isolated and elections occur.
- Broker crash and restart: replicas on that broker fall behind until restart and catch-up; during this period they may be out of ISR.
- Slow disks or GC pauses: cause long replication delays and ISR shrinkage.
- Unclean leader election allowed: rapid availability recovery at cost of possible data loss.
Summary (short)
- Followers fetch from leaders with ordered Fetch requests; the fetch offset tells the leader exactly how far each follower has replicated.
- The ISR contains replicas considered sufficiently up-to-date; only ISR members are eligible for leader election (unless unclean elections are allowed).
- replica.lag.time.max.ms controls how long a replica can lag before being removed from the ISR; min.insync.replicas and the producer acks setting determine durability guarantees.
- Choices are trade-offs between availability and durability; monitoring replication lag, ISR membership, and under-replicated partitions is essential.