Give me a walkthrough of leader epochs and how they help with log truncation🔗
🧱 1. The problem these concepts solve🔗
Imagine this scenario:
- You have a topic with replication factor = 3.

So for each partition, you have:
- 1 leader replica (accepts writes)
- 2 follower replicas (replicate from the leader)
Let’s say the partition’s replicas are on Brokers A, B, and C.
Now the producer sends messages to Broker A (the leader).
The followers (B and C) fetch those messages from A to stay in sync.
Everything is fine — until something goes wrong.
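To make the setup concrete, here is a minimal sketch using the kafka-python client (the topic name `orders`, the broker address, and the client library are assumptions for illustration; any Kafka client would do). Note that `acks=1` means the leader acknowledges a write before the followers have copied it, which is exactly what allows the failure described next.

```python
# Sketch: create a 1-partition topic with replication factor 3 and produce to it.
# Assumes a local multi-broker cluster and the kafka-python package; names are illustrative.
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="orders", num_partitions=1, replication_factor=3)])

# acks=1: the leader replica acknowledges as soon as it has written the message,
# without waiting for followers B and C to replicate it.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=1)
for msg in (b"M1", b"M2", b"M3"):
    producer.send("orders", msg)
producer.flush()
```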
⚠️ 2. The failure scenario🔗
Let’s say the producer sends three messages (M1, M2, and M3), and the replication happens like this:
| Message | Broker A (Leader) | Broker B (Follower) | Broker C (Follower) |
|---|---|---|---|
| M1 | ✅ written | ✅ replicated | ✅ replicated |
| M2 | ✅ written | ✅ replicated | ❌ not yet replicated |
| M3 | ✅ written | ❌ not yet replicated | ❌ not yet replicated |
Then suddenly:
Broker A (the leader) crashes.
Kafka controller needs to pick a new leader.
3. What happens next — leader election🔗
Kafka now elects one of the followers as the new leader.
Say Broker B becomes the new leader.
But look carefully at the data table above:
| Message | Broker A | Broker B |
|---|---|---|
| M1 | ✅ | ✅ |
| M2 | ✅ | ✅ |
| M3 | ✅ | ❌ |
→ Broker A had M3, but Broker B does not.
This means Broker B’s log is shorter — it’s missing M3.
Now we have divergence:
- Broker A’s log = [M1, M2, M3]
- Broker B’s log = [M1, M2]
So… 👉 What happens when Broker A comes back up?
It still thinks it’s got the “latest data” (including M3). But the cluster’s new truth (after failover) says:
The partition leader is now B, and the official log ends at M2.
If we don’t fix this, clients could read inconsistent data (some see M3, some don’t). That’s a consistency violation.
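Here is the same divergence as plain data (a tiny illustrative sketch, not real Kafka state), which shows why offsets alone are not enough to resolve it:

```python
# Snapshot of the diverged logs after the failover (illustrative data only).
log_a = ["M1", "M2", "M3"]   # old leader A: crashed holding an unreplicated M3
log_b = ["M1", "M2"]         # new leader B: the cluster's source of truth

# Offsets alone are ambiguous: offset 2 is "M3" on A, but "the next write" on B.
# A needs to know which leadership generation wrote each record to decide that
# its offset 2 must be discarded rather than kept.
assert log_a[:len(log_b)] == log_b and len(log_a) > len(log_b)
```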
Kafka solves this using two key mechanisms:
- Leader epochs
- Log truncation
🔢 4. What is a “Leader Epoch”?🔗
A leader epoch is a number that Kafka increments every time a new leader is elected for a partition.
You can think of it like a “version number” for leadership of a partition.
Example:
| Leader Epoch | Leader Broker | Description |
|---|---|---|
| 0 | Broker A | First leader |
| 1 | Broker B | After A crashes |
| 2 | Broker A | If A later becomes leader again |
Each message that’s written to the log is tagged with the current leader epoch.
So if Broker A (epoch 0) wrote M3, the message might be represented as something like (offset=2, leader_epoch=0, value=M3).
When Broker B becomes the leader, the partition moves to leader epoch 1.
Now any new messages written by B will have epoch=1.
So if Broker B writes a new message M4, it is stored as (offset=2, leader_epoch=1, value=M4).
That’s how Kafka can tell which leader wrote which messages.
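A minimal sketch of that tagging (the record layout here is simplified; real Kafka stores the leader epoch in the record batch header rather than one field per message):

```python
from collections import namedtuple

# Simplified record: in real Kafka the leader epoch lives in the record batch header.
Record = namedtuple("Record", ["offset", "leader_epoch", "value"])

# Broker A's log, written while it was leader in epoch 0.
log_a = [
    Record(offset=0, leader_epoch=0, value="M1"),
    Record(offset=1, leader_epoch=0, value="M2"),
    Record(offset=2, leader_epoch=0, value="M3"),  # never replicated before A crashed
]

# Broker B's log after it becomes leader in epoch 1 and accepts M4.
log_b = [
    Record(offset=0, leader_epoch=0, value="M1"),
    Record(offset=1, leader_epoch=0, value="M2"),
    Record(offset=2, leader_epoch=1, value="M4"),  # same offset as A's M3, newer epoch
]
```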
✂️ 5. What is “Log Truncation”?🔗
When Broker A (the old leader) comes back online, it needs to catch up with the current leader (Broker B).
But remember — A still has M3 (epoch 0) that B never had.
Now the cluster says “Broker B’s log is the source of truth” (epoch 1).
When Broker A connects to Broker B to sync again, it compares its log with Broker B’s.
They match up to offset 1 (M1, M2), but Broker A has extra data (M3) that B doesn’t have — and B’s epoch is newer.
So Kafka makes Broker A truncate (delete) M3 from its log, because it was written by an old leader epoch that is no longer valid.
👉 This deletion is called log truncation.
After truncation:
- Broker A’s log = [M1, M2]
- Broker B’s log = [M1, M2]

→ Both are consistent again.
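As a sketch of the core idea (not Kafka's actual code path, which uses the OffsetsForLeaderEpoch exchange described in section 7), a recovering replica can keep only the prefix on which it agrees with the leader, comparing both offset and epoch, and drop the rest:

```python
def truncate_to_match(follower_log, leader_log):
    """Keep only the prefix of follower_log whose (offset, epoch) pairs also
    appear in leader_log; drop everything after the divergence point.
    Records are (offset, leader_epoch, value) tuples. Illustrative only."""
    keep = 0
    for ours, theirs in zip(follower_log, leader_log):
        if ours[:2] != theirs[:2]:   # offset or epoch differs -> logs diverged here
            break
        keep += 1
    return follower_log[:keep]

log_a = [(0, 0, "M1"), (1, 0, "M2"), (2, 0, "M3")]   # old leader A
log_b = [(0, 0, "M1"), (1, 0, "M2")]                 # new leader B (epoch 1)

log_a = truncate_to_match(log_a, log_b)
print(log_a)   # [(0, 0, 'M1'), (1, 0, 'M2')]  -> M3 is gone
```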
🧭 6. Step-by-step example with epochs and truncation🔗
| Step | Action | Leader Epoch | Leader | Log State |
|---|---|---|---|---|
| 1 | Start | 0 | Broker A | A: [ ] |
| 2 | Producer sends M1, M2, M3 | 0 | Broker A | A: [M1, M2, M3] |
| 3 | Followers replicate partially | 0 | Broker A | B: [M1, M2] |
| 4 | Broker A crashes | — | — | — |
| 5 | Broker B becomes leader | 1 | Broker B | B: [M1, M2] |
| 6 | Producer sends M4 | 1 | Broker B | B: [M1, M2, M4] |
| 7 | Broker A recovers (as a follower) | 1 | Broker B | A: [M1, M2, M3] |
| 8 | Broker A compares logs with leader (B) | 1 | Broker B | A has M3 from older epoch |
| 9 | Kafka detects conflict (epoch mismatch) | 1 | Broker B | M3 (epoch 0) invalid |
| 10 | Broker A truncates M3, then fetches from B | 1 | Broker B | A: [M1, M2] → [M1, M2, M4] |
Now the logs are consistent again.
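The same timeline end to end, as a small runnable simulation (purely illustrative; records are simplified (offset, epoch, value) tuples):

```python
# End-to-end simulation of steps 1-10.
log_a, log_b = [], []
epoch = 0                                   # step 1: Broker A is leader in epoch 0

for i, m in enumerate(["M1", "M2", "M3"]):  # step 2: producer sends M1..M3 to A
    log_a.append((i, epoch, m))
log_b = log_a[:2]                           # step 3: B has only replicated M1, M2

# steps 4-5: A crashes, B becomes leader, the epoch is bumped
epoch = 1
log_b.append((len(log_b), epoch, "M4"))     # step 6: producer sends M4 to B

# steps 7-10: A recovers, sees that its (offset 2, epoch 0) record conflicts with
# B's (offset 2, epoch 1), truncates it, then re-fetches the rest from B.
diverge = next((i for i, (a, b) in enumerate(zip(log_a, log_b)) if a[:2] != b[:2]),
               min(len(log_a), len(log_b)))
log_a = log_a[:diverge] + log_b[diverge:]

print(log_a == log_b)                       # True -> [M1, M2, M4] on both brokers
```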
⚙️ 7. How Kafka knows where to truncate🔗
Each broker maintains a small internal map called the Leader Epoch Cache, which records, for each leader epoch it has seen, the offset at which that epoch started (i.e., the offset of the first message written under that epoch).
Example (Broker B’s cache after the failover):
| Leader Epoch | Start Offset |
|---|---|
| 0 | 0 |
| 1 | 2 |
When a follower reconnects, it asks the current leader for the end offset of the follower’s latest epoch. The leader answers from this cache (epoch 0 ends where epoch 1 begins, at offset 2), so the follower knows exactly where its log diverges and truncates everything from that offset onward.
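A sketch of that lookup (modelled on Kafka's OffsetsForLeaderEpoch request from KIP-101; the data and function here are simplified for illustration):

```python
# On the new leader (Broker B): leader epoch -> start offset of that epoch.
leader_epoch_cache_b = {0: 0,   # epoch 0 began at offset 0 (M1)
                        1: 2}   # epoch 1 began at offset 2 (M4)
log_end_offset_b = 3            # the next offset B would write

def end_offset_for_epoch(cache, log_end_offset, requested_epoch):
    """Leader-side answer to 'where does requested_epoch end?': the start offset
    of the next-higher epoch, or the log end offset if there is none."""
    later_starts = [start for epoch, start in cache.items() if epoch > requested_epoch]
    return min(later_starts) if later_starts else log_end_offset

# Broker A recovers and asks: "what is the end offset of epoch 0 (my latest epoch)?"
truncate_to = end_offset_for_epoch(leader_epoch_cache_b, log_end_offset_b, 0)
print(truncate_to)  # 2 -> A truncates everything at offset >= 2, i.e. drops M3
```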
🔐 8. Why this is critical for consistency🔗
This combination ensures Kafka’s data consistency and durability guarantees:
| Guarantee | Mechanism |
|---|---|
| No old leader can reintroduce stale messages | Leader epochs prevent writing with outdated epochs |
| All replicas converge to same log | Log truncation ensures followers delete invalid data |
| Clients never see “ghost messages” (like M3 above) | New leader only serves committed data |
| Exactly-once semantics survive failover | Idempotent producers rely on consistent logs |
🧩 9. Summary — Leader Epoch vs Log Truncation🔗
| Concept | Description | Purpose |
|---|---|---|
| Leader Epoch | Incremented every time a new leader is elected for a partition. Each message stores its leader epoch. | Identifies which leader wrote which messages; helps detect divergent logs. |
| Log Truncation | Process where a replica deletes messages written by an old leader that don’t exist in the new leader’s log. | Ensures consistency between replicas after failover. |
✅ 10. In simple words🔗
- A leader epoch is a “generation number” for each partition leader. Every new leader = new epoch.
- Log truncation is what Kafka does to remove messages that were written by a previous leader but didn’t get fully replicated before the failover.
Together, they ensure that after a failover, all brokers agree on exactly which messages exist — no duplicates, no lost messages, no ghosts.