Scenarios Of Potential Data Loss in Producers🔗
1. Kafka reliability depends on both broker and producer configuration🔗
Kafka’s fault-tolerance and durability model is shared responsibility:
- Brokers ensure replicated storage, consistent logs, and controlled leader election.
- Producers control acknowledgment behavior and error handling.
If either side is misconfigured, data can still be lost — even in a “perfect” cluster.
2. Broker reliability setup (the good part)🔗
In both examples, the brokers are configured properly for reliability:
- Replication factor = 3: Each partition has three replicas on different brokers.
- Unclean leader election disabled (`unclean.leader.election.enable=false`):
  - Kafka will not promote an out-of-sync replica to leader.
  - This ensures that only replicas with the latest committed data can become leaders.
  - Prevents data loss due to “dirty” leader elections.
So from the broker’s point of view, everything is safe and consistent. But reliability also depends on how the producer sends messages and interprets acknowledgments.
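For illustration, here is a minimal sketch of how a topic with these broker-side guarantees might be created using the Java AdminClient. The bootstrap address, topic name, and partition count are placeholders, and `min.insync.replicas=2` is an assumed companion setting (it becomes relevant with `acks=all`, discussed below).

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ReliableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: each partition lives on three brokers.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of(
                            // Never promote an out-of-sync replica to leader.
                            "unclean.leader.election.enable", "false",
                            // Assumed setting: require at least 2 in-sync replicas for acks=all writes.
                            "min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```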
3. Scenario 1 — Producer sends with acks=1🔗
This is one of the most common (and dangerous) misconfigurations.
What happens:🔗
- Producer sends a message to the partition leader.
- Leader writes it to its local log (in memory or disk buffer).
- Leader immediately responds “Success” to the producer.
- Followers haven’t yet replicated the message.
- Leader crashes before followers can copy the message.
- A follower is elected as new leader — but it never received the message.
Result:🔗
- The message is lost, because no surviving replica had it.
- The producer thinks the message was stored successfully.
- The consumer never sees the message.
- The system remains consistent (no consumer sees phantom data), but the producer’s data is gone.
Why it happens:🔗
acks=1 tells Kafka:
“Acknowledge once the leader writes the message — I don’t need confirmation that replicas have it.”
This sacrifices durability for latency. If the leader fails immediately after acknowledging, the message vanishes.
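As a minimal sketch (not a recommendation), this is roughly what the risky configuration looks like with the Java producer API; the broker address, topic, key, and value are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksOneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Only the partition leader must confirm the write; followers may lag behind.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "created"),
                    (metadata, exception) -> {
                        // A successful callback here only means the LEADER wrote the record.
                        // If the leader crashes before followers copy it, the record is gone.
                        if (exception != null) {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}
```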
Takeaway🔗
- Broker replication can’t protect you if producers don’t wait for replication confirmation.
- `acks=1` = high throughput, low durability.
- `acks=all` = slower, but guarantees durability once enough in-sync replicas have the data.
4. Scenario 2 — Producer uses acks=all (better)🔗
Now we use a more reliable setting: `acks=all`, with `min.insync.replicas=2` set on the topic.
This means a write is only considered successful when at least two in-sync replicas have acknowledged it.
But reliability can still fail if the producer mishandles transient errors.
What happens:🔗
- Producer sends a message to the leader.
- At that moment, the leader crashes before responding.
- Kafka hasn’t yet elected a new leader.
- Broker responds with `LeaderNotAvailable` (or possibly a timeout error).
- The producer gets an exception.
If the producer is not programmed to retry, it simply moves on. The message was never committed to Kafka — so it’s lost forever.
Why this happens:🔗
The broker side was fine — replication worked correctly. The issue is on the producer side, which failed to:
- Recognize the transient nature of the error.
- Retry the write once the new leader was elected.
What the producer should have done:🔗
Implement idempotent, retryable writes:
- `acks=all`
- `retries=Integer.MAX_VALUE`
- `enable.idempotence=true`
- `delivery.timeout.ms=120000`
- `max.in.flight.requests.per.connection=1`
This ensures that:
- Transient network or leader election errors are retried.
- Duplicate messages aren’t created (thanks to idempotent producer IDs).
- The producer waits long enough for recovery.
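In Java, applying this configuration might look roughly like the sketch below. The broker address, topic, and payload are placeholders; blocking on `get()` is used only to make the acknowledgment explicit.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Durability: wait until all in-sync replicas have the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient errors (e.g. leader re-election) until delivery.timeout.ms expires.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
        // Idempotence prevents duplicates when a retry follows a lost acknowledgment.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() surfaces only the errors that survived the internal retry loop.
            RecordMetadata md = producer.send(
                    new ProducerRecord<>("payments", "order-42", "created")).get();
            System.out.printf("committed to %s-%d at offset %d%n",
                    md.topic(), md.partition(), md.offset());
        }
    }
}
```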
5. Difference between producer-side and broker-side data loss🔗
| Type of failure | Root cause | Broker fault? | Producer fault? | Data lost? |
|---|---|---|---|---|
| acks=1, leader crash | Message acknowledged by leader only, not replicated | ❌ | ✅ | ✅ |
| LeaderNotAvailable, no retry | Producer gave up during leader election | ❌ | ✅ | ✅ |
| Broker crash, ISR replication loss | Unclean leader election allowed | ✅ | ❌ | ✅ |
| Network delay / GC pauses | Temporary ISR flapping | ⚠️ | ❌ | Possible transient loss |
6. Best practices for producer reliability🔗
To ensure no acknowledged message is lost, configure producers carefully:
Core reliability configs:🔗
- `acks=all`
- `retries=Integer.MAX_VALUE`
- `enable.idempotence=true`
- `max.in.flight.requests.per.connection=1`
- `delivery.timeout.ms=120000`
- `linger.ms=5`
- `batch.size=32768`
- `acks=all`: waits for acknowledgment from all in-sync replicas.
- `enable.idempotence=true`: prevents duplicates during retries.
- `retries`: keeps retrying until success or timeout.
- `max.in.flight.requests.per.connection=1`: maintains ordering guarantees.
- `delivery.timeout.ms`: overall time window for retries.
Optional safety layer:🔗
Use producer transaction APIs (initTransactions(), beginTransaction(), commitTransaction()) if you require exactly-once semantics (EOS).
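A minimal sketch of that transactional flow, following the pattern from the KafkaProducer documentation; the `transactional.id`, topics, and payloads are placeholder values:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Setting a transactional.id enables transactions (and implies idempotence + acks=all).
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-writer-1"); // placeholder

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "order-42", "created"));
            producer.send(new ProducerRecord<>("payment-events", "order-42", "CREATED"));
            // Both records become visible to read_committed consumers atomically.
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: the producer cannot continue, so just close it.
            producer.close();
        } catch (KafkaException e) {
            // Abortable error: neither record is exposed to read_committed consumers.
            producer.abortTransaction();
        }
        producer.close();
    }
}
```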
7. Summary🔗
| Concept | Description |
|---|---|
| Broker replication | Protects data after it’s committed; cannot protect uncommitted leader-only writes. |
| acks=1 risk | Leader crash before replication causes silent data loss. |
| acks=all advantage | Waits for all in-sync replicas; ensures durability once acknowledged. |
| Error handling | Producers must retry transient errors to avoid message loss. |
| Idempotent producers | Prevent duplicates during retries; ensure exactly-once delivery semantics. |
| Unclean leader election (disabled) | Prevents stale data from replacing newer data after failure. |
In essence:🔗
Kafka brokers can be configured for strong durability guarantees, but those guarantees only apply to committed messages — data successfully replicated to the required number of in-sync replicas.
If producers don’t:
- Wait for all replicas (`acks=all`), and
- Retry after transient errors (especially during leader re-election),
then data can still be lost before it ever becomes committed — even in a perfectly tuned cluster.
True reliability in Kafka requires end-to-end correctness — from producer configuration and retry logic to broker replication and consumer offsets.
Next, let's go through a detailed, technical explanation of how `linger.ms` works in Kafka producers, and how it fits into the reliability and performance model described above.
1. Context: Producer batching and efficiency🔗
When a Kafka producer sends messages, it doesn’t necessarily transmit them immediately, one by one, over the network. Instead, it batches messages together per partition before sending them to the broker.
Batching helps:
- Reduce network overhead (fewer TCP requests).
- Increase throughput (more messages per request).
- Improve compression efficiency (since larger batches compress better).
The producer collects messages in a buffer (a memory region managed by the producer client), and then sends them either:
- When the batch is full (reaches `batch.size` in bytes), or
- When a certain amount of time passes (`linger.ms`).
2. What linger.ms does🔗
linger.ms controls how long the producer waits before sending a batch of messages, even if it’s not full.
Definition:
`linger.ms` = the maximum time (in milliseconds) that the producer will wait for additional messages before sending the current batch to the broker.
So linger.ms adds an intentional, small delay to allow more messages to accumulate in the batch.
Example🔗
If linger.ms=0 (the default):
- The producer sends messages as soon as possible, as soon as there’s at least one record ready to send.
- Batching still happens, but only if multiple messages arrive very close together (within the same poll loop or CPU cycle).
If linger.ms=5:
- The producer will wait up to 5 milliseconds before sending the batch.
- If enough messages arrive during that interval to fill the batch, they are sent together.
- If not, the batch is sent at the end of 5 milliseconds anyway.
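As a rough sketch, these batching knobs are ordinary producer configs; the values below are illustrative (5 ms, 32 KB), and the topic and broker address are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait up to 5 ms for more records before shipping a partially filled batch.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        // Ship a batch immediately once it reaches 32 KB, even before 5 ms elapse.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() returns immediately; records accumulate into per-partition batches.
                producer.send(new ProducerRecord<>("clickstream", "user-" + i, "click"));
            }
        } // close() flushes any batch that is still waiting on linger.ms
    }
}
```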
3. The relationship between linger.ms and batch.size🔗
Both linger.ms and batch.size control batching behavior, but in different ways:
| Parameter | Description | Effect |
|---|---|---|
| batch.size | Maximum number of bytes per batch (per partition). | When full, the batch is sent immediately, even before linger.ms expires. |
| linger.ms | Maximum wait time before sending a batch. | Forces a flush after this delay, even if the batch isn't full. |
You can think of them as two triggers: whichever condition is met first causes the batch to be sent.
4. Why this improves performance🔗
Without linger.ms, each message might be sent in its own request if messages arrive sporadically — this creates a high number of small network requests.
By introducing a short linger period (typically 5–10 milliseconds):
- You allow the producer to group more messages together.
- Network utilization improves dramatically.
- Compression becomes more efficient (e.g., GZIP or Snappy work better with larger batches).
- CPU and broker load decrease because there are fewer total requests to process.
The result is higher throughput and lower overall system overhead.
5. Trade-offs and impact on latency🔗
| Setting | Throughput | End-to-End Latency | Use Case |
|---|---|---|---|
| linger.ms = 0 | Low | Lowest | Real-time, ultra-low-latency use cases |
| linger.ms = 5–10 | High | Slightly higher (adds a few ms) | Most streaming / analytics workloads |
| linger.ms = 50+ | Very high | Noticeable delay | Bulk ingestion, ETL, or log aggregation |
So, increasing linger.ms improves throughput, but introduces a small additional delay — bounded by the configured time.
For most workloads, values between 5 and 10 ms give a good balance of latency and efficiency.
6. Reliability considerations🔗
linger.ms affects when data is sent, not how reliably it’s acknowledged.
Reliability still depends on:
- `acks`
- `min.insync.replicas`
- `enable.idempotence`
- `retries`
However, linger.ms interacts indirectly with reliability in these ways:
- Larger batches mean fewer requests, so fewer chances for network errors per message.
- But if the producer crashes before the batch is sent, those buffered messages are lost (they were never transmitted).
To mitigate this:
- Keep `linger.ms` modest.
- Ensure the producer has `acks=all` and `enable.idempotence=true` to guarantee delivery once the batch is sent.
7. How linger.ms fits into the full producer send pipeline🔗
A simplified flow:
- The producer application calls `producer.send(record)`.
- The record is placed in a partition-specific buffer.
- The producer I/O thread monitors these buffers:
  - If the batch is full (`batch.size`), it is sent immediately.
  - If the batch is not full, it waits up to `linger.ms` milliseconds.
- When either limit is reached, the batch is sent as one request to the broker.
- The producer waits for acknowledgment based on `acks`.
So, linger.ms effectively throttles the send frequency of batches without affecting the acknowledgment semantics.
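To make that distinction concrete, here is a small sketch (placeholder topic and broker address) showing that `send()` returns once the record is buffered, while the callback fires only after the broker acknowledges the batch according to `acks`; `flush()` is one way to push out batches that are still lingering:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendPipelineDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Returns as soon as the record lands in its partition's in-memory batch.
            producer.send(new ProducerRecord<>("clickstream", "user-1", "click"),
                    (metadata, exception) -> {
                        // Runs later, only after the batch was sent and acknowledged
                        // according to acks (here: all in-sync replicas).
                        if (exception == null) {
                            System.out.println("acked at offset " + metadata.offset());
                        }
                    });
            // flush() forces buffered batches out without waiting for linger.ms to expire
            // and blocks until their acknowledgments (or failures) come back.
            producer.flush();
        }
    }
}
```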
8. Typical production configurations🔗
| Setting (typical value) | Goal | Effect |
|---|---|---|
| acks=all | Strong durability | Require all ISR replicas to acknowledge |
| enable.idempotence=true | Exactly-once delivery | Avoid duplicates when retrying |
| batch.size=32 KB – 128 KB | Efficient batching | Increase throughput |
| linger.ms=5 – 10 | Balanced latency and throughput | Allow small message accumulation |
| retries=Integer.MAX_VALUE | Retry on transient errors | Ensure delivery during leader election |
Together, these settings give high throughput, strong durability, and low probability of message loss.
9. Summary🔗
| Concept | Description |
|---|---|
| Purpose | Controls how long the producer waits before sending a batch of messages, even if it isn’t full. |
| Default | linger.ms=0 — send immediately. |
| Effect | Higher values improve throughput and compression; lower values reduce latency. |
| Interaction | Works with batch.size; batch is sent when either limit is hit. |
| Best Practice | Set between 5–10 ms for typical workloads; higher only for bulk ETL or non-real-time ingestion. |
In summary🔗
linger.ms allows the producer to trade a few milliseconds of delay for significantly improved throughput and network efficiency.
It doesn’t affect message durability — that’s handled by acks, replication, and idempotence — but it does help Kafka scale efficiently by reducing request overhead.