Kafka: Producer Retries and Additional Error Handling
1. Two categories of error handling
Kafka producer reliability is built on the idea that some errors can be retried safely, and others cannot. When a producer sends a message to a broker, the broker responds with either:
- A success acknowledgment, or
- An error code.
The producer API classifies these into two main categories:
| Error Type | Description | What Happens |
|---|---|---|
| Retriable errors | Temporary problems that may succeed after retrying. | Producer automatically retries. |
| Non-retriable errors | Permanent configuration or authorization problems. | Producer raises an exception to the client code. |
2. Examples of retriable errors
These typically occur due to transient cluster events: temporary network failures, leader elections, or broker restarts. They can be retried safely without losing correctness.
| Error Code | Cause | Typical Resolution |
|---|---|---|
| LEADER_NOT_AVAILABLE | The partition's leader broker just failed; a new one is being elected. | Retry after the new leader is established. |
| NOT_ENOUGH_REPLICAS | Too few replicas are currently in sync (fewer than min.insync.replicas). | Retry once the ISR stabilizes. |
| NETWORK_EXCEPTION | Transient network glitch between producer and broker. | Retry automatically. |
| REQUEST_TIMED_OUT | Broker did not respond in time. | Retry after backoff. |
When these occur, the producer client can (and should) retry sending the same message, either automatically (handled internally by KafkaProducer) or manually if custom logic is needed.
3. Examples of non-retriable errors
These represent permanent problems that will not resolve by simply retrying.
| Error Code | Description | Why retrying won't help |
|---|---|---|
| INVALID_CONFIG | Producer or topic configuration mismatch. | Misconfiguration needs manual correction. |
| TOPIC_AUTHORIZATION_FAILED | The producer is not authorized to write to the topic. | Requires security policy change. |
| UNKNOWN_TOPIC_OR_PARTITION | The topic doesn't exist and auto-creation is disabled. | Topic must be created. (The Java client treats this code as retriable because its metadata may simply be stale, but retries cannot succeed until the topic exists.) |
| MESSAGE_TOO_LARGE | The message size exceeds broker limits. | Message must be adjusted. |
For these, the producer fails the send immediately and passes the exception to the application through the send callback or the returned future. The client must handle or log it; retries are futile.
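The split between the two categories can be sketched as a small lookup. This is a minimal illustration only: the `ProduceError` enum and its `isRetriable()` method are hypothetical names invented here, not part of the Kafka client. In the real Java client, the equivalent check is usually whether the exception is an instance of `org.apache.kafka.common.errors.RetriableException`.

```java
/** Hypothetical model of the two error categories; not a Kafka client class. */
enum ProduceError {
    // Transient conditions: a later attempt may succeed.
    LEADER_NOT_AVAILABLE(true),
    NOT_ENOUGH_REPLICAS(true),
    NETWORK_EXCEPTION(true),
    REQUEST_TIMED_OUT(true),
    // Permanent conditions: someone has to change configuration, ACLs, or the message.
    INVALID_CONFIG(false),
    TOPIC_AUTHORIZATION_FAILED(false),
    MESSAGE_TOO_LARGE(false);

    private final boolean retriable;

    ProduceError(boolean retriable) {
        this.retriable = retriable;
    }

    /** True if resending the same record unchanged can reasonably be expected to succeed. */
    boolean isRetriable() {
        return retriable;
    }
}
```

For example, `ProduceError.REQUEST_TIMED_OUT.isRetriable()` is true, while `ProduceError.MESSAGE_TOO_LARGE.isRetriable()` is false.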
4. How automatic retries work
Kafka's producer client library automatically retries retriable errors without application intervention.
You control retry behavior via two key settings:
| Config | Default | Description |
|---|---|---|
| retries | 2147483647 (Integer.MAX_VALUE) | Maximum number of retry attempts. Effectively infinite by default. |
| delivery.timeout.ms | 120000 (2 minutes) | Maximum total time (across all retries) for the producer to attempt delivery before giving up. |
This means:
- The producer retries indefinitely, but only within the delivery.timeout.ms window.
- If a message cannot be acknowledged by the broker within that window, the producer stops retrying and fails the record with a timeout exception, delivered through the send callback or future.
Important note:
Retries happen asynchronously on the producer's background I/O thread; the producer batches and resends records efficiently, without blocking your application threads.
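The interaction between the retry count and the time budget can be illustrated with a simplified simulation. This is a sketch under stated assumptions, not the client's actual algorithm: `RetryBudgetSketch` and `deliver` are hypothetical names, time is simulated rather than measured, the backoff is fixed, and batching delays are ignored.

```java
import java.util.function.IntPredicate;

/** Simplified model of the retry budget; not the Kafka client's implementation. */
final class RetryBudgetSketch {

    /**
     * Returns the 1-based attempt number that succeeded, or -1 if the record failed
     * because either the retry count or the time budget ran out.
     * Assumes attemptDurationMs + retryBackoffMs is positive.
     */
    static int deliver(int retries, long retryBackoffMs, long deliveryTimeoutMs,
                       long attemptDurationMs, IntPredicate attemptSucceeds) {
        long elapsedMs = 0;
        int attempt = 0;
        while (true) {
            attempt++;
            elapsedMs += attemptDurationMs;      // simulated wait for the broker response
            if (elapsedMs > deliveryTimeoutMs) {
                return -1;                       // time budget exhausted: give up
            }
            if (attemptSucceeds.test(attempt)) {
                return attempt;                  // acknowledged
            }
            if (attempt - 1 >= retries) {
                return -1;                       // no retries left
            }
            elapsedMs += retryBackoffMs;         // wait before the next attempt
        }
    }
}
```

With the default-like values `deliver(Integer.MAX_VALUE, 100, 120000, 50, a -> a == 3)`, the third attempt succeeds well inside the window. With a predicate that never succeeds, the loop ends only because the simulated time exceeds the delivery timeout, which is the behavior the two bullets above describe.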
5. How retries can cause duplicates
While retries solve temporary errors, they also introduce a new risk: duplicate writes.
Consider this sequence:
- Producer sends a message to broker.
- Broker writes the message successfully.
- Broker sends an acknowledgment.
- The acknowledgment is lost due to a network error.
- Producer assumes the send failed and retries.
- Broker receives the retry and writes the same message again.
Now the topic log contains two copies of the same message: same payload, different offsets.
Without safeguards:
This is "at-least-once delivery": every message is stored at least once, possibly more than once.
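The lost-acknowledgment sequence above can be reproduced with a toy partition log. This is an illustrative sketch only: `LostAckSketch` is a hypothetical in-memory stand-in, not a Kafka class, and the `ackLost` flag simply simulates the network dropping the broker's response.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy partition log with no duplicate detection; a stand-in, not a Kafka class. */
final class LostAckSketch {
    private final List<String> log = new ArrayList<>();

    /** Broker side: every append is written and gets the next offset. */
    long append(String payload) {
        log.add(payload);
        return log.size() - 1;
    }

    /** Producer side: send once, then resend if the acknowledgment never arrives. */
    void send(String payload, boolean ackLost) {
        append(payload);          // steps 1-2: the first write succeeds on the broker
        if (ackLost) {            // steps 3-4: the acknowledgment is lost in transit
            append(payload);      // steps 5-6: the retry is written a second time
        }
    }

    List<String> log() {
        return log;
    }
}
```

After `send("order-42", true)`, the log holds the payload at offsets 0 and 1, which is exactly the duplicate the sequence describes.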
6. How enable.idempotence=true fixes this
Enabling idempotence transforms producer behavior from at-least-once to exactly-once (per partition, within a producer session) by adding deduplication metadata to every record batch.
When enable.idempotence=true:
- Each producer gets a unique Producer ID (PID), assigned by a broker when the producer initializes.
- Each message batch sent includes:
  - The PID
  - A monotonic sequence number (maintained per partition)
- The broker tracks the last sequence number it has processed for each PID and partition.
When the producer retries a batch:
- If the broker already has a batch with the same PID and sequence number, it silently discards the duplicate and still acknowledges the request.
- The message is written exactly once, even if sent multiple times.
This mechanism ensures:
- Retries never create duplicates.
- Ordering is preserved per partition.
- Exactly-once delivery within the producer session.
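The broker-side check can be sketched for a single partition as follows. This is a simplified model, not the broker's real implementation: `IdempotentPartitionSketch` is a hypothetical name, the log is an in-memory list, and the real broker additionally rejects gaps in the sequence, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-partition log that deduplicates by (PID, sequence); not a Kafka class. */
final class IdempotentPartitionSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<Long, Integer> lastSequenceByPid = new HashMap<>();

    /** Returns true if the batch was appended, false if it was recognized as a duplicate. */
    boolean append(long pid, int sequence, String payload) {
        Integer last = lastSequenceByPid.get(pid);
        if (last != null && sequence <= last) {
            return false;                         // already written: drop the retry
        }
        log.add(payload);
        lastSequenceByPid.put(pid, sequence);     // remember the newest sequence for this PID
        return true;
    }

    List<String> log() {
        return log;
    }
}
```

Appending `(pid=7, sequence=0)` twice writes the payload once; the second call returns false and the log is unchanged. A different PID with the same sequence number is treated as a separate producer and is appended normally.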
7. Retry logic and timing behavior
The Kafka producer's retry process is governed by several key configurations:
| Config | Function |
|---|---|
| retries | Number of retry attempts (default effectively infinite). |
| retry.backoff.ms | Wait time before retrying a failed send (default 100ms). |
| delivery.timeout.ms | Total time allowed from initial send to final acknowledgment before giving up. |
| max.in.flight.requests.per.connection | Maximum number of unacknowledged requests per broker connection; setting this to 1 ensures strict ordering during retries. With idempotence enabled, ordering is also preserved for values up to 5. |
| enable.idempotence | Enables deduplication and exactly-once guarantees for retries. |
The combination of these settings controls how long the producer will persist in resending a record and whether retries are safe from duplication.
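As an illustration, the settings in the table could be assembled like this. It is a minimal sketch: `ProducerConfigSketch` and `reliableProducerProps` are hypothetical names, the string serializers are an assumption about the record types, and `acks=all` is added because idempotence requires it. The resulting `Properties` object is what would typically be passed to the `KafkaProducer` constructor.

```java
import java.util.Properties;

/** Hypothetical helper that collects the retry-related producer settings. */
final class ProducerConfigSketch {

    static Properties reliableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: keys and values are plain strings.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("acks", "all");                       // required by idempotence
        props.setProperty("enable.idempotence", "true");        // deduplicate retried batches
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("retry.backoff.ms", "100");           // pause between attempts
        props.setProperty("delivery.timeout.ms", "120000");     // overall time budget
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }
}
```

Most of these values match the defaults described above; spelling them out mainly documents the intent and guards against a different default in an older client version.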
8. Developer-handled errors
There are still errors that the developer must handle explicitly. These typically occur:
- When the producer gives up after exceeding delivery.timeout.ms.
- When non-retriable errors are raised.
- When the application needs custom logic, such as logging, DLQ (dead-letter queue), or alerting.
The producer API surfaces these through:
- Futures returned by producer.send(record), or
- Callback functions, such as:
```java
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // handle error: log, retry, send to DLQ
    } else {
        // message successfully acknowledged
    }
});
```
This is where your application decides whether to retry, log, or route the failed record elsewhere.
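The future-based route can be sketched in the same spirit. This is an assumption-laden illustration: `SendOutcomeSketch` and `awaitOutcome` are hypothetical names, the dead-letter queue is just an in-memory list, and any `Future` stands in for the one returned by `producer.send(record)`. What it relies on is standard `Future` behavior: `get()` throws an `ExecutionException` whose cause is the underlying failure.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical helper that blocks on a send result and routes failures to a DLQ list. */
final class SendOutcomeSketch {

    static String awaitOutcome(Future<?> sendResult, String payload, List<String> deadLetters) {
        try {
            sendResult.get();                      // blocks until acknowledged or failed
            return "acked";
        } catch (ExecutionException e) {
            deadLetters.add(payload);              // stand-in for publishing to a real DLQ topic
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // preserve the interrupt for callers
            return "interrupted";
        }
    }
}
```

Blocking on every send gives up the throughput benefit of asynchronous batching, so this style tends to suit low-volume or must-confirm writes; the callback form shown earlier is usually the better fit for high-throughput paths.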
9. Summary
| Concept | Description |
|---|---|
| Retriable errors | Temporary (e.g., LEADER_NOT_AVAILABLE, NETWORK_EXCEPTION); can be retried safely. |
| Non-retriable errors | Permanent (e.g., INVALID_CONFIG, TOPIC_AUTHORIZATION_FAILED); require manual intervention. |
| Automatic retries | Producer retries retriable errors transparently within delivery.timeout.ms. |
| Duplicates risk | Without idempotence, retries may insert duplicate messages. |
| Idempotence (enable.idempotence=true) | Adds PID and sequence metadata for exactly-once delivery within a producer session. |
| Developer responsibility | Handle non-retriable errors, log/report failures, or implement DLQ. |
In essence:
Kafka's producer automatically retries transient errors to ensure at-least-once delivery, but to achieve exactly-once delivery (within a session) and avoid duplicates, you must set enable.idempotence=true (the default since Kafka 3.0).
Beyond that, you must still handle irrecoverable errors and timeout conditions in your application logic; the producer can't fix those automatically.