Skip to content

Give me a walkthrough of the Leader Epoch and How it helps in Log Truncation?🔗


🧱 1. The problem these concepts solve🔗

Imagine this scenario:

  • You have a topic with replication factor = 3.
  • So for each partition, you have:

  • 1 leader replica (accepts writes)

  • 2 follower replicas (replicate from the leader)

Let’s say the partition’s replicas are on brokers:

Leader → Broker A
Followers → Broker B, Broker C

Now the producer sends messages to Broker A (the leader).

The followers (B and C) fetch those messages from A to stay in sync.

Everything is fine — until something goes wrong.


⚠️ 2. The failure scenario🔗

Let’s say the producer sends three messages:

M1, M2, M3

and the replication happens like this:

Message Broker A (Leader) Broker B (Follower) Broker C (Follower)
M1 ✅ written ✅ replicated ✅ replicated
M2 ✅ written ✅ replicated ❌ not yet replicated
M3 ✅ written ❌ not yet replicated ❌ not yet replicated

Then suddenly:

Broker A (the leader) crashes.

Kafka controller needs to pick a new leader.


3. What happens next — leader election🔗

Kafka now elects one of the followers as the new leader.

Say Broker B becomes the new leader.

But look carefully at the data table above:

Message Broker A Broker B
M1
M2
M3

→ Broker A had M3, but Broker B does not.

This means Broker B’s log is shorter — it’s missing M3.

Now we have divergence:

  • Broker A’s log = [M1, M2, M3]
  • Broker B’s log = [M1, M2]

So… 👉 What happens when Broker A comes back up?

It still thinks it’s got the “latest data” (including M3). But the cluster’s new truth (after failover) says:

The partition leader is now B, and the official log ends at M2.

If we don’t fix this, clients could read inconsistent data (some see M3, some don’t). That’s a consistency violation.

Kafka solves this using two key mechanisms:

  • Leader epochs
  • Log truncation

🔢 4. What is a “Leader Epoch”?🔗

A leader epoch is a number that Kafka increments every time a new leader is elected for a partition.

You can think of it like a “version number” for leadership of a partition.

Example:

Leader Epoch Leader Broker Description
0 Broker A First leader
1 Broker B After A crashes
2 Broker A If A later becomes leader again

Each message that’s written to the log is tagged with the current leader epoch.

So if Broker A (epoch 0) wrote M3, the message might be represented as:

<M3, epoch=0, offset=2>

When Broker B becomes the leader, it moves to:

Leader epoch = 1

Now any new messages written by B will have epoch=1.

So if Broker B writes a new message M4:

<M4, epoch=1, offset=3>

That’s how Kafka can tell which leader wrote which messages.


✂️ 5. What is “Log Truncation”?🔗

When Broker A (the old leader) comes back online, it needs to catch up with the current leader (Broker B).

But remember — A still has M3 (epoch 0) that B never had.

Now the cluster says “Broker B’s log is the source of truth” (epoch 1).

When Broker A connects to Broker B to sync again, it compares its log with Broker B’s.

They match up to offset 1 (M1, M2), but Broker A has extra data (M3) that B doesn’t have — and B’s epoch is newer.

So Kafka makes Broker A truncate (delete) M3 from its log, because it was written by an old leader epoch that is no longer valid.

👉 This deletion is called log truncation.

After truncation:

  • Broker A’s log = [M1, M2]
  • Broker B’s log = [M1, M2] → Both are consistent again.

🧭 6. Step-by-step example with epochs and truncation🔗

Step Action Leader Epoch Leader Log State
1 Start 0 Broker A A: [ ]
2 Producer sends M1, M2, M3 0 Broker A A: [M1, M2, M3]
3 Followers replicate partially 0 Broker A B: [M1, M2]
4 Broker A crashes
5 Broker B becomes leader 1 Broker B B: [M1, M2]
6 Producer sends M4 1 Broker B B: [M1, M2, M4]
7 Broker A recovers Broker A A: [M1, M2, M3]
8 Broker A compares logs with leader (B) A has M3 from older epoch
9 Kafka detects conflict (epoch mismatch) M3 (epoch 0) invalid
10 Broker A truncates M3 A: [M1, M2, M4] after sync

Now the logs are consistent again.


⚙️ 7. How Kafka knows where to truncate🔗

Each broker maintains a small internal map called the Leader Epoch Cache, which records:

(epoch → start_offset)

Example:

Epoch 0 → starts at offset 0
Epoch 1 → starts at offset 3

When a follower detects it has messages beyond the current leader’s epoch range, it uses this cache to know exactly where the new leader’s log begins — and truncates everything after that.


🔐 8. Why this is critical for consistency🔗

This combination ensures Kafka’s data consistency and durability guarantees:

Guarantee Mechanism
No old leader can reintroduce stale messages Leader epochs prevent writing with outdated epochs
All replicas converge to same log Log truncation ensures followers delete invalid data
Clients never see “ghost messages” (like M3 above) New leader only serves committed data
Exactly-once semantics survive failover Idempotent producers rely on consistent logs

🧩 9. Summary — Leader Epoch vs Log Truncation🔗

Concept Description Purpose
Leader Epoch Incremented every time a new leader is elected for a partition. Each message stores its leader epoch. Identifies which leader wrote which messages; helps detect divergent logs.
Log Truncation Process where a replica deletes messages written by an old leader that don’t exist in the new leader’s log. Ensures consistency between replicas after failover.

✅ 10. In simple words🔗

  • A leader epoch is a “generation number” for each partition leader. Every new leader = new epoch.

  • A log truncation is what Kafka does to remove messages that were written by a previous leader but didn’t get fully replicated before the failover.

Together, they ensure that after a failover, all brokers agree on exactly which messages exist — no duplicates, no lost messages, no ghosts.