Dead Letter Queue (DLQ) Pattern
A Dead Letter Queue (DLQ) is a secondary message queue that receives messages that could not be successfully processed by a consumer after a configured number of attempts. It acts as a safety valve: isolating failed, malformed, or permanently unprocessable messages so they cannot block healthy traffic, exhaust consumer threads, or silently disappear.
The DLQ is not just an error bin โ it is a reliability contract. It guarantees that no message is silently dropped and that every failure can be investigated, fixed, and replayed.
- New learners โ start at The Post Office Analogy and Core Concepts.
- Senior engineers โ jump to Retry Strategy Deep Dive, Alternatives Comparison, Ordering vs. Throughput Trade-off, or Production Playbooks.
The Post Office Analogyโ
Imagine a high-volume post office. Letters flow down a conveyor belt to automated sorting machines. Most letters are stamped, sorted, and routed correctly.
But occasionally a letter arrives that the machine cannot handle:
- The address is illegible (malformed payload โ bad JSON, missing required fields).
- The recipient no longer exists at that address (invalid entity reference โ message references a deleted order).
- The envelope is physically damaged (deserialization error โ corrupted bytes).
- The letter requires a signature but no one is home after 3 attempts (transient failure that exhausted retries).
Without a DLQ: The sorting machine gets stuck on the unprocessable letter. It tries the same letter over and over. The conveyor belt backs up. Healthy mail piles up behind it. The entire post office grinds to a halt โ a poison pill scenario.
With a DLQ: The machine ejects the unprocessable letter into a labeled "Undeliverable Mail" bin after N failed attempts. The main belt keeps moving at full speed. At the end of the day, a human postal worker (the on-call engineer) inspects the bin, diagnoses each letter, fixes the issue, and re-routes the letter for another delivery attempt.
The conveyor belt is your message queue. The sorting machine is your consumer. The undeliverable bin is your DLQ. The re-routing is DLQ redrive.
How It Works: The Full Lifecycleโ
Step-by-Stepโ
1. Message published. Producer sends a message to the main queue. Everything looks normal.
2. Consumer receives and fails. The consumer dequeues the message, attempts to process it, and throws an exception (malformed data, downstream timeout, logic bug).
3. Negative acknowledgement (NACK). The consumer signals failure. In Kafka this means not committing the offset. In SQS this means letting the visibility timeout expire. In RabbitMQ this means calling channel.basicNack().
4. Backoff and retry. The broker waits for the configured backoff period, then re-delivers the message. This repeats up to maxAttempts times, with exponential backoff between attempts.
5. DLQ routing. After all retry attempts are exhausted, the broker automatically routes the message to the configured DLQ. The message is no longer on the main queue โ it cannot block further processing.
6. Alert fires. Monitoring detects the DLQ depth > 0 and pages the on-call engineer.
7. Diagnose and fix. The engineer inspects the DLQ message, identifies the root cause (schema mismatch? null pointer? downstream service bug?), and deploys a fix.
8. Redrive. Once the consumer code is fixed, the engineer triggers a redrive โ moving DLQ messages back to the main queue for reprocessing.
Core Conceptsโ
Poison Pills vs. Transient Failuresโ
The most important distinction in DLQ design. The retry strategy and DLQ routing differ completely depending on the failure type.
| Failure Type | Examples | Correct Response |
|---|---|---|
| Transient | DB connection timeout, downstream HTTP 503, thread pool exhausted | Retry with exponential backoff + jitter. DO NOT route to DLQ immediately. |
| Permanent (Poison Pill) | Malformed JSON, null required field, invalid enum value, schema mismatch | Route to DLQ after N attempts โ retrying will never succeed |
| Business Logic Error | Insufficient funds, order already cancelled, invalid state transition | Depends: either DLQ for human review, or send compensating event |
| Idempotency Duplicate | Message already processed (at-least-once redelivery) | Detect and discard โ do NOT DLQ or retry |
// โ
Good: distinguish failure types before deciding how to handle
@KafkaListener(topics = "orders")
public void processOrder(Order order) {
try {
orderService.process(order);
} catch (JsonProcessingException e) {
// Permanent โ malformed payload, retrying will never help
// Throw to trigger DLQ routing immediately (after configured maxAttempts)
throw new DeserializationException("Malformed order payload", e);
} catch (OrderAlreadyProcessedException e) {
// Idempotency duplicate โ discard silently, no DLQ
log.warn("Duplicate order event, skipping: {}", order.getId());
// Do NOT throw โ ACK the message to remove it from the queue
} catch (InventoryServiceUnavailableException e) {
// Transient โ retry is appropriate, will eventually succeed
// Throw to trigger retry with backoff
throw new RetryableException("Inventory service temporarily unavailable", e);
}
}
Visibility Timeout (SQS) / Acknowledgement Timeout (RabbitMQ)โ
When a consumer takes a message, the broker hides it from other consumers for a visibility timeout period. This gives the consumer time to process and acknowledge it.
Message dequeued by Consumer A
โ Message hidden for visibility_timeout = 30s
โ Consumer A processing...
IF Consumer A ACKs within 30s:
โ Message deleted from queue โ
IF Consumer A crashes or takes > 30s:
โ Visibility timeout expires
โ Message becomes visible again โ delivered to Consumer B
โ receiveCount incremented
โ If receiveCount > maxReceiveCount โ routed to DLQ
The visibility timeout saturation trap:
Processing time average: 25 seconds
Visibility timeout configured: 30 seconds
โ Under heavy load, Consumer A takes 28 seconds (GC pause, slow DB)
โ At second 30: message becomes visible โ Consumer B picks it up
โ Consumer A finishes at second 32 โ also tries to ACK
โ Both consumers process the same message โ DUPLICATE PROCESSING
Fix: Set visibility timeout โฅ 6ร average processing time
Average = 25s โ set visibility timeout = 150s
Max Receive Count / Max Attemptsโ
maxReceiveCount = 3 means:
Attempt 1: fails โ receiveCount = 1
Attempt 2: fails โ receiveCount = 2
Attempt 3: fails โ receiveCount = 3 โ route to DLQ
Choosing the right maxReceiveCount:
- Too low (1โ2): Transient failures (brief network blip, momentary DB unavailability) send messages to DLQ prematurely โ these would have succeeded on attempt 3.
- Too high (10+): Poison pills spend excessive time blocking consumer threads across many retries before reaching the DLQ. The main queue backs up.
- Recommended range: 3โ5 for most systems. Use 5 for external API calls (more transient failure possibility), 3 for internal service calls.
Alternatives and When to Choose Whatโ
A DLQ is not the only error-handling strategy. Understanding the full landscape prevents over-engineering and ensures the right tool for each failure scenario.
Comparison Matrixโ
| Strategy | Delivery Guarantee | Message Preserved? | Ordering Preserved? | Operational Overhead | Best For |
|---|---|---|---|---|---|
| DLQ | At-least-once | โ Yes | โ ๏ธ Broken on failure | Low | Most production systems; diagnosable failures |
| Discard on failure | At-most-once | โ No | โ Yes | None | Non-critical events (click tracking, metrics) |
| Infinite retry | Exactly-once attempt | โ Yes | โ Yes | None upfront | โ Almost never โ causes consumer starvation |
| Retry topic (Kafka) | At-least-once | โ Yes | โ ๏ธ Partial | Medium | High-throughput Kafka; non-blocking retries |
| Pause partition | Exactly-once ordering | โ Yes | โ Yes | High | Strict ordering required (financial ledgers) |
| Compensating event | At-least-once | โ Via event | โ Yes | Medium | Business-logic failures needing rollback |
| Circuit breaker | At-least-once + backpressure | โ Yes | โ ๏ธ Partial | Medium | Downstream service unavailability |
1. Discard on Failure (Drop the Message)โ
Simply log the error and acknowledge (ACK) the message regardless of outcome. It is permanently gone.
@KafkaListener(topics = "click-events")
public void processClickEvent(ClickEvent event) {
try {
analyticsService.record(event);
} catch (Exception e) {
// Non-critical: losing a click event is acceptable
log.warn("Failed to record click event {}, discarding: {}", event.getId(), e.getMessage());
// Do NOT rethrow โ Kafka will ACK and move on
}
}
Choose discard when: The message is non-critical (UI analytics, A/B test impressions, feature flag telemetry). The cost of investigation and redrive exceeds the value of the lost event. Never discard financial transactions, order state changes, or any message that represents an irreversible business event.
2. Retry Topic Pattern (Kafka Non-Blocking Retry)โ
Instead of blocking the main partition on failure, the message is routed to a series of retry topics with increasing delays. After exhausting all retry topics, it goes to the DLQ topic.
Main topic: orders
โ Fail โ orders.retry.2s (retry after 2 second delay)
โ Fail โ orders.retry.30s (retry after 30 second delay)
โ Fail โ orders.retry.5m (retry after 5 minute delay)
โ Fail โ orders.DLQ (give up โ route to DLQ)
@Configuration
public class KafkaRetryConfig {
@Bean
public CommonErrorHandler nonBlockingRetryHandler(KafkaTemplate<Object, Object> template) {
// Routes to: topic.retry.2000, topic.retry.30000, topic.retry.300000, topic.DLQ
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
(record, ex) -> {
// Distinguish retryable vs. non-retryable โ skip retries for poison pills
if (ex.getCause() instanceof DeserializationException) {
return new TopicPartition(record.topic() + ".DLQ", record.partition());
}
return new TopicPartition(record.topic() + ".retry.2000", record.partition());
});
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(2_000L);
backOff.setMultiplier(15.0); // 2s โ 30s โ 300s
backOff.setMaxInterval(300_000L);
return new DefaultErrorHandler(recoverer, backOff);
}
}
Choose retry topics when: You are using Kafka and cannot afford to block the main partition during retries. High-throughput consumers where a single slow retry holds up the entire partition.
3. Pause Partition (Kafka Strict Ordering)โ
On failure, pause the Kafka partition โ stop consuming from it entirely until the issue is resolved. No messages are skipped, reordered, or DLQ'd.
@Component
@RequiredArgsConstructor
public class OrderingAwareConsumer {
private final KafkaListenerEndpointRegistry registry;
private final AlertService alertService;
@KafkaListener(topics = "account-transactions", groupId = "ledger-processor",
id = "ledger-listener")
public void process(ConsumerRecord<String, Transaction> record) {
try {
ledgerService.applyTransaction(record.value());
} catch (Exception e) {
log.error("FATAL: Failed to process transaction at offset {}, partition {} โ PAUSING",
record.offset(), record.partition());
// Pause this specific partition โ no further messages consumed
MessageListenerContainer container = registry.getListenerContainer("ledger-listener");
container.pause();
// Page on-call immediately โ human must intervene
alertService.pageOnCall("Ledger partition paused โ manual intervention required",
record.topic(), record.partition(), record.offset());
throw e; // Do not ACK โ offset not committed
}
}
}
Choose pause partition when: Message ordering is an absolute requirement (banking ledgers, account balance updates, inventory adjustments where sequence matters). DLQ + redrive would destroy the ordering guarantee and potentially corrupt business state.
4. Compensating Eventโ
For business-logic failures that require domain-level rollback rather than technical retry.
@KafkaListener(topics = "payment-requests")
public void processPayment(PaymentRequest request) {
try {
paymentService.charge(request);
} catch (InsufficientFundsException e) {
// This is a business failure, not a technical failure
// Do NOT DLQ โ publish a compensating event instead
eventPublisher.publish(new PaymentFailedEvent(
request.getOrderId(),
request.getCustomerId(),
PaymentFailureReason.INSUFFICIENT_FUNDS,
e.getMessage()
));
// ACK the original message โ it was processed correctly (just failed business logic)
}
}
Choose compensating events when: The failure is a business-domain outcome, not a technical error. Downstream services need to react to the failure (e.g., order service must cancel the order when payment fails).
Implementation: Spring Boot and Kafkaโ
Production-Grade Configurationโ
@Configuration
@EnableKafka
@RequiredArgsConstructor
public class KafkaDlqConfig {
private final KafkaTemplate<Object, Object> kafkaTemplate;
@Bean
public CommonErrorHandler errorHandler() {
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
kafkaTemplate,
// Route to <topic>.DLQ on the same partition for traceability
(record, ex) -> new TopicPartition(record.topic() + ".DLQ", record.partition())
);
// Enrich DLQ message with diagnostic headers
recoverer.setHeadersFunction((record, ex) -> {
Map<String, Object> headers = new LinkedHashMap<>();
headers.put("dlq-original-topic", record.topic());
headers.put("dlq-original-partition", String.valueOf(record.partition()));
headers.put("dlq-original-offset", String.valueOf(record.offset()));
headers.put("dlq-exception-class", ex.getClass().getName());
headers.put("dlq-exception-message", ex.getMessage());
headers.put("dlq-failed-at", Instant.now().toString());
headers.put("dlq-consumer-group", "order-processor");
return new RecordHeaders(
headers.entrySet().stream()
.map(e -> new RecordHeader(e.getKey(),
e.getValue().toString().getBytes()))
.toList()
);
});
// Exponential backoff: 1s โ 2s โ 4s โ 8s (4 total attempts, then DLQ)
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(1_000L);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(30_000L);
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// Classify non-retryable exceptions โ route to DLQ immediately without retries
handler.addNotRetryableExceptions(
JsonProcessingException.class, // Malformed JSON โ retrying won't help
DeserializationException.class, // Schema mismatch
IllegalArgumentException.class, // Invalid business data
NullPointerException.class // Programming error โ fix the code, not the retry
);
// Classify retryable exceptions explicitly
handler.addRetryableExceptions(
TransientDataAccessException.class, // DB transient error
ResourceAccessException.class, // Network timeout to downstream service
OptimisticLockingFailureException.class // Concurrency conflict
);
return handler;
}
}
DLQ Consumer (Inspector + Reprocess)โ
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderDlqConsumer {
private final ObjectMapper objectMapper;
private final DlqIncidentRepository incidentRepository;
private final MeterRegistry meterRegistry;
@KafkaListener(
topics = "orders.DLQ",
groupId = "orders-dlq-inspector",
// Consume from beginning on startup โ don't miss pre-existing DLQ messages
properties = {"auto.offset.reset=earliest"}
)
public void inspect(ConsumerRecord<String, String> record) {
String originalTopic = getHeader(record, "dlq-original-topic");
String exceptionClass = getHeader(record, "dlq-exception-class");
String exceptionMsg = getHeader(record, "dlq-exception-message");
String failedAt = getHeader(record, "dlq-failed-at");
log.error("DLQ message received | topic={} partition={} offset={} | error={}: {}",
originalTopic,
getHeader(record, "dlq-original-partition"),
getHeader(record, "dlq-original-offset"),
exceptionClass,
exceptionMsg);
// Persist DLQ incident for dashboard + audit trail
DlqIncident incident = DlqIncident.builder()
.originalTopic(originalTopic)
.payload(record.value())
.exceptionClass(exceptionClass)
.exceptionMessage(exceptionMsg)
.failedAt(Instant.parse(failedAt))
.status(DlqStatus.PENDING)
.build();
incidentRepository.save(incident);
// Increment counter for Grafana alerting
meterRegistry.counter("dlq.messages.received",
"topic", originalTopic,
"exception", exceptionClass).increment();
}
private String getHeader(ConsumerRecord<?, ?> record, String key) {
Header header = record.headers().lastHeader(key);
return header != null ? new String(header.value()) : "unknown";
}
}
Redrive Service (Replay DLQ โ Main Queue)โ
@Service
@RequiredArgsConstructor
@Slf4j
public class DlqRedriveService {
private final KafkaTemplate<String, String> kafkaTemplate;
private final DlqIncidentRepository incidentRepository;
/**
* Redrive a single DLQ incident back to the original topic.
* Called manually by an engineer after the root cause is fixed.
*/
@Transactional
public void redriveIncident(UUID incidentId) {
DlqIncident incident = incidentRepository.findById(incidentId)
.orElseThrow(() -> new IncidentNotFoundException(incidentId));
if (incident.getStatus() != DlqStatus.PENDING) {
throw new InvalidIncidentStateException(
"Can only redrive PENDING incidents, current status: " + incident.getStatus());
}
try {
// Send back to the original topic
kafkaTemplate.send(incident.getOriginalTopic(), incident.getPayload()).get();
incident.setStatus(DlqStatus.REDRIVEN);
incident.setRedrivenAt(Instant.now());
log.info("Redriven DLQ incident {} to topic {}", incidentId, incident.getOriginalTopic());
} catch (Exception e) {
incident.setStatus(DlqStatus.REDRIVE_FAILED);
incident.setLastError(e.getMessage());
throw new RedriveException("Redrive failed for incident " + incidentId, e);
} finally {
incidentRepository.save(incident);
}
}
/**
* Bulk redrive โ redrive all PENDING incidents for a given topic.
* Use after deploying a fix that addresses a batch of failures.
*/
@Transactional
public RedriveResult bulkRedrive(String topic, int batchSize) {
List<DlqIncident> pending = incidentRepository
.findPendingByTopicOrderByFailedAt(topic, PageRequest.of(0, batchSize));
int succeeded = 0, failed = 0;
for (DlqIncident incident : pending) {
try {
redriveIncident(incident.getId());
succeeded++;
} catch (Exception e) {
log.error("Failed to redrive incident {}: {}", incident.getId(), e.getMessage());
failed++;
}
}
return new RedriveResult(topic, succeeded, failed, pending.size());
}
}
Implementation: RabbitMQโ
Queue Declaration with DLX (Dead Letter Exchange)โ
RabbitMQ routes failed messages via a Dead Letter Exchange (DLX) โ a regular exchange configured as the dead letter destination for another queue.
@Configuration
public class RabbitMqDlqConfig {
// โโ Main infrastructure โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
@Bean
public TopicExchange mainExchange() {
return new TopicExchange("orders.exchange");
}
@Bean
public TopicExchange deadLetterExchange() {
return new TopicExchange("orders.dlx"); // DLX โ routes to DLQ
}
@Bean
public Queue mainQueue() {
return QueueBuilder.durable("orders.queue")
// Any rejected/expired message routes to the DLX
.withArgument("x-dead-letter-exchange", "orders.dlx")
.withArgument("x-dead-letter-routing-key", "orders.dead")
// Optional: per-message TTL โ message goes to DLQ if not consumed in 24h
.withArgument("x-message-ttl", 86_400_000L)
// Optional: max queue length before head is dead-lettered
.withArgument("x-max-length", 100_000)
.build();
}
@Bean
public Queue deadLetterQueue() {
return QueueBuilder.durable("orders.dlq")
// DLQ has a longer retention than the main queue
.withArgument("x-message-ttl", 1_209_600_000L) // 14 days
.build();
}
// โโ Bindings โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
@Bean
public Binding mainBinding() {
return BindingBuilder.bind(mainQueue())
.to(mainExchange())
.with("orders.#");
}
@Bean
public Binding dlqBinding() {
return BindingBuilder.bind(deadLetterQueue())
.to(deadLetterExchange())
.with("orders.dead");
}
}
RabbitMQ Consumer with Manual Acknowledgementโ
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderRabbitConsumer {
private final OrderService orderService;
private final MeterRegistry meterRegistry;
@RabbitListener(queues = "orders.queue")
public void consume(Message message, Channel channel,
@Header(AmqpHeaders.DELIVERY_TAG) long deliveryTag) throws IOException {
String payload = new String(message.getBody());
try {
Order order = objectMapper.readValue(payload, Order.class);
orderService.process(order);
// โ
ACK โ message successfully processed, remove from queue
channel.basicAck(deliveryTag, false);
meterRegistry.counter("orders.processed").increment();
} catch (JsonProcessingException e) {
// Permanent failure โ malformed JSON, retrying won't help
// requeue=false โ message routes to DLX โ DLQ
log.error("Malformed order payload, routing to DLQ: {}", e.getMessage());
channel.basicNack(deliveryTag, false, false); // false = do NOT requeue
meterRegistry.counter("orders.dlq.routed", "reason", "deserialization_error").increment();
} catch (TransientServiceException e) {
// Transient failure โ requeue for retry
// requeue=true โ message goes back to the head of the queue
log.warn("Transient failure processing order, requeueing: {}", e.getMessage());
channel.basicNack(deliveryTag, false, true); // true = requeue
// Note: RabbitMQ spring-retry handles exponential backoff before requeue
}
}
}
# application.yml โ RabbitMQ retry with exponential backoff
spring:
rabbitmq:
listener:
simple:
acknowledge-mode: manual # Manual ACK gives full control
retry:
enabled: true
max-attempts: 4
initial-interval: 1000ms # 1s first retry
multiplier: 2.0 # 1s โ 2s โ 4s โ DLQ
max-interval: 30000ms
stateless: false # Stateful retry tracks per-message state
Implementation: AWS SQSโ
Terraform Configurationโ
# Main queue with DLQ redrive policy
resource "aws_sqs_queue" "orders" {
name = "orders-queue"
visibility_timeout_seconds = 150 # 6ร average processing time of 25s
message_retention_seconds = 345600 # 4 days
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
maxReceiveCount = 4 # 4 delivery attempts before DLQ
})
}
resource "aws_sqs_queue" "orders_dlq" {
name = "orders-queue-dlq"
message_retention_seconds = 1209600 # 14 days โ longer than main queue
}
# CloudWatch alarm โ alert the moment DLQ receives any message
resource "aws_cloudwatch_metric_alarm" "orders_dlq_not_empty" {
alarm_name = "orders-dlq-not-empty"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 60
statistic = "Sum"
threshold = 0
alarm_description = "Messages in DLQ indicate consumer failures requiring investigation"
alarm_actions = [aws_sns_topic.on_call_alerts.arn]
dimensions = {
QueueName = aws_sqs_queue.orders_dlq.name
}
}
Spring Boot SQS Consumerโ
@Component
@RequiredArgsConstructor
@Slf4j
public class SqsOrderConsumer {
private final OrderService orderService;
private final SqsTemplate sqsTemplate;
@SqsListener(value = "${aws.sqs.orders-queue-url}", acknowledgementMode = AcknowledgementMode.MANUAL)
public void consume(Order order, Acknowledgement ack,
@Header("ApproximateReceiveCount") int receiveCount,
@Header("MessageId") String messageId) {
log.info("Processing order {} (attempt {})", order.getId(), receiveCount);
try {
orderService.process(order);
ack.acknowledge(); // โ
Delete from SQS
} catch (NonRetryableException e) {
// Permanent failure โ let the message exhaust its receiveCount
// and route to DLQ naturally (do not ACK)
log.error("Permanent failure for order {}, attempt {}/{}: {}",
order.getId(), receiveCount, 4, e.getMessage());
// Not acknowledging โ visibility timeout expires โ message reappears
// After maxReceiveCount (4) it moves to DLQ automatically
} catch (Exception e) {
// Transient failure โ same as above, but log differently
log.warn("Transient failure for order {}, attempt {}/{}: {}",
order.getId(), receiveCount, 4, e.getMessage());
}
}
}
SQS Redrive (AWS CLI + Java SDK)โ
# List messages in the DLQ for inspection
aws sqs receive-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue-dlq \
--attribute-names All \
--message-attribute-names All \
--max-number-of-messages 10
# Start a redrive task โ moves all DLQ messages back to the source queue
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:us-east-1:123456789012:orders-queue-dlq \
--destination-arn arn:aws:sqs:us-east-1:123456789012:orders-queue \
--max-number-of-messages-per-second 10 # Throttle to avoid overwhelming the consumer
// Programmatic SQS redrive via AWS SDK v2
@Service
@RequiredArgsConstructor
public class SqsRedriveService {
private final SqsClient sqsClient;
@Value("${aws.sqs.orders-dlq-arn}")
private String dlqArn;
@Value("${aws.sqs.orders-queue-arn}")
private String sourceQueueArn;
public String startRedrive(int messagesPerSecond) {
StartMessageMoveTaskResponse response = sqsClient.startMessageMoveTask(r -> r
.sourceArn(dlqArn)
.destinationArn(sourceQueueArn)
.maxNumberOfMessagesPerSecond(messagesPerSecond)
);
log.info("Started SQS redrive task: {}", response.taskHandle());
return response.taskHandle();
}
public MessageMoveTaskStatus checkRedriveStatus(String taskHandle) {
ListMessageMoveTasksResponse response = sqsClient.listMessageMoveTasks(r ->
r.sourceArn(dlqArn));
return response.results().stream()
.filter(t -> t.taskHandle().equals(taskHandle))
.findFirst()
.map(t -> new MessageMoveTaskStatus(
t.status(), t.approximateNumberOfMessagesMoved(),
t.approximateNumberOfMessagesToMove()))
.orElseThrow();
}
}
Senior Deep Diveโ
1. The Ordering vs. Throughput Trade-offโ
Routing a failed message to the DLQ breaks the processing order for messages that follow it. For most systems this is acceptable. For some, it is catastrophic.
Main queue: [TXN-100 balance=1000] โ [TXN-101 withdraw=500] โ [TXN-102 withdraw=700]
Scenario: TXN-101 fails and routes to DLQ
TXN-102 is processed: balance=1000, withdraw=700 โ balance=300
Later: TXN-101 is redriven and processed: balance=300, withdraw=500 โ balance=-200 โ
This would be APPROVED โ but should have been REJECTED (balance was only 300 at the time)
The correct final balance should be 1000 - 500 - 700 = -200 REJECTED
Result: Data corruption from out-of-order replay
Decision framework:
| System Type | Ordering Requirement | Recommended Strategy |
|---|---|---|
| Notification service | โ None | DLQ โ order doesn't matter |
| Search index sync | โ None (idempotent upserts) | DLQ โ last write wins |
| E-commerce order status | โ ๏ธ Soft (within order) | DLQ + per-aggregate ordering via partition key |
| Bank account ledger | โ Strict | Pause partition โ no DLQ |
| Inventory adjustments | โ Strict | Pause partition or saga compensation |
Kafka partition key for per-aggregate ordering:
// Ensure all events for the same aggregate go to the same partition
// โ Ordering is preserved per aggregate, DLQ is safe
kafkaTemplate.send(
new ProducerRecord<>(
"orders",
order.getOrderId().toString(), // partition key = order ID
order
)
);
// Events for orderId=123 always go to the same partition โ in-order processing
// A failure on orderId=123 does NOT affect orderId=456 on a different partition
2. Idempotent DLQ Consumersโ
When a message is redriven from the DLQ, the consumer processes it again. If the first processing attempt partially succeeded (e.g., the DB write succeeded but the downstream API call failed), a naive consumer will try to create the same record again.
// โ Non-idempotent consumer โ safe for first delivery, unsafe for redrive
@KafkaListener(topics = "orders")
public void process(Order order) {
orderRepository.save(order); // If first attempt: succeeds
// If redriven: DUPLICATE KEY VIOLATION โ
paymentService.charge(order);
}
// โ
Idempotent consumer โ safe for both first delivery and redrive
@KafkaListener(topics = "orders")
@Transactional
public void process(ConsumerRecord<String, Order> record) {
String eventId = record.topic() + "-" + record.partition() + "-" + record.offset();
// Check if already successfully processed (from first delivery attempt)
if (processedEventRepository.existsById(eventId)) {
log.info("Event {} already processed, skipping (idempotent)", eventId);
return; // ACK and move on
}
Order order = record.value();
orderRepository.save(order); // Idempotent: upsert by business ID
paymentService.charge(order); // Idempotent: check if charge already exists
// Record successful processing โ prevents duplicate on redrive
processedEventRepository.save(new ProcessedEvent(eventId, Instant.now()));
}
3. Schema Evolution and the DLQ Spike Patternโ
One of the most common causes of mass DLQ spikes in production: a schema change that breaks existing consumers.
Deploy new Order schema โ OrderV2 adds a required field: `currency`
โ
Legacy consumers still receive OrderV1 messages (in-flight, not yet drained)
โ
Deserialization fails for every OrderV1 message
โ
DLQ depth spikes from 0 to 100,000 in minutes
โ
Alert fires โ engineers investigate
Defense in layers:
Layer 1 โ Schema Registry compatibility check:
# Register the new schema and check backward compatibility BEFORE deploying
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
-H "Content-Type: application/json" \
-d @order_schema_v2.avsc
# {"is_compatible":false} โ DO NOT DEPLOY โ will break existing consumers
Layer 2 โ Never remove optional fields; only add:
// โ Breaking change โ removes a field consumers rely on
public class OrderV2 {
private UUID id;
private BigDecimal totalAmount;
// private String customerId; โ REMOVED โ breaks consumers that read this field
}
// โ
Backward compatible โ adds an optional field with a default
public class OrderV2 {
private UUID id;
private BigDecimal totalAmount;
private String customerId;
private String currency = "USD"; // New field โ nullable/defaulted
}
Layer 3 โ Classify deserialization errors as non-retryable:
// Route schema errors directly to DLQ without wasting retries
handler.addNotRetryableExceptions(
DeserializationException.class,
JsonMappingException.class,
InvalidFormatException.class
);
Layer 4 โ DLQ spike runbook:
1. Check DLQ message headers for "dlq-exception-class"
2. If DeserializationException:
a. Pull a sample DLQ message payload
b. Compare schema against current consumer version
c. Identify the breaking field
d. Rollback consumer OR deploy a schema-tolerant version
e. Redrive DLQ messages after fix is deployed
3. If NullPointerException or IllegalArgumentException:
a. Check if a new required field was added to the message
b. Identify the producer that sends incomplete payloads
c. Fix producer to include the field
d. Redrive DLQ after fix
4. DLQ Metrics, Alerting, and Dashboardsโ
The DLQ should be empty under normal operating conditions. Any non-zero depth is a production incident.
@Component
@RequiredArgsConstructor
public class DlqMetricsCollector {
private final DlqIncidentRepository incidentRepository;
private final MeterRegistry registry;
@Scheduled(fixedDelay = 10_000)
public void collectMetrics() {
// 1. DLQ depth per topic โ alert on ANY value > 0
Map<String, Long> depthByTopic = incidentRepository
.countPendingGroupedByTopic();
depthByTopic.forEach((topic, count) ->
registry.gauge("dlq.depth", Tags.of("topic", topic), count));
// 2. Age of oldest pending DLQ message โ alert if > 30 minutes
incidentRepository.findOldestPendingByTopic().forEach((topic, oldest) -> {
long ageMinutes = Duration.between(oldest.getFailedAt(), Instant.now()).toMinutes();
registry.gauge("dlq.oldest.message.age.minutes",
Tags.of("topic", topic), ageMinutes);
});
// 3. DLQ ingestion rate โ spike indicates a breaking change
registry.summary("dlq.ingestion.rate"); // Updated by DlqConsumer per message
}
}
Prometheus alerting rules:
groups:
- name: dlq_alerts
rules:
# P1: Any message in DLQ โ immediate investigation required
- alert: DLQNotEmpty
expr: dlq_depth > 0
for: 1m
labels:
severity: critical
annotations:
summary: "DLQ has {{ $value }} pending messages on topic {{ $labels.topic }}"
runbook: "https://wiki.company.com/runbooks/dlq-investigation"
# P2: Old unresolved DLQ messages approaching expiration
- alert: DLQMessageApproachingExpiry
expr: dlq_oldest_message_age_minutes > 180 # 3 hours
labels:
severity: warning
annotations:
summary: "DLQ messages on {{ $labels.topic }} are {{ $value }} minutes old"
# P2: DLQ spike โ many messages arriving in a short window (schema break?)
- alert: DLQIngestionSpike
expr: rate(dlq_messages_received_total[5m]) > 10
labels:
severity: warning
annotations:
summary: "DLQ receiving {{ $value }} messages/sec โ possible schema change"
5. The DLQ Across Broker Technologiesโ
The DLQ concept is universal, but implementation details differ across brokers.
| Aspect | Kafka | RabbitMQ | AWS SQS |
|---|---|---|---|
| DLQ mechanism | Separate topic (orders.DLQ) written to by consumer | Dead Letter Exchange (DLX) โ Dead Letter Queue | Separate SQS queue via redrive policy |
| Routing trigger | Application-level (DeadLetterPublishingRecoverer) | Broker-level (NACK with requeue=false, TTL, max-length) | Broker-level (maxReceiveCount exceeded) |
| Message ordering | Preserved within partition | Not guaranteed after DLQ | Not guaranteed (Standard queue) |
| DLQ message metadata | Added as Kafka headers by recoverer | Added as message properties by RabbitMQ | Available via ApproximateReceiveCount, MessageId attributes |
| Redrive mechanism | Re-publish to original topic (manual or automated) | Re-queue via shovel plugin or management API | StartMessageMoveTask API |
| Retention | Configurable topic retention (days) | Configurable queue TTL | Up to 14 days |
| FIFO ordering support | โ Per-partition | โ | โ FIFO queues |
| Multi-level DLQ | โ Retry topics โ DLQ | โ Multiple DLX hops | โ Single level |
Kafka-specific: multi-level retry topic chain:
orders (main topic)
โ fail attempt 1
orders.retry.1s (1 second delay via consumer pause)
โ fail attempt 2
orders.retry.30s (30 second delay)
โ fail attempt 3
orders.retry.5m (5 minute delay)
โ fail attempt 4
orders.DLQ (no more retries โ engineer must intervene)
// Consumer for each retry topic applies the configured delay
@KafkaListener(topics = "orders.retry.1s", groupId = "orders-retry-1s")
public void retryAfter1s(ConsumerRecord<String, Order> record) {
long publishedAt = Long.parseLong(
new String(record.headers().lastHeader("retry-publish-time").value()));
long delayMs = 1_000L - (System.currentTimeMillis() - publishedAt);
if (delayMs > 0) {
Thread.sleep(delayMs); // Enforce the delay
}
// Re-route to main topic for reprocessing
kafkaTemplate.send("orders", record.key(), record.value());
}
6. Production Incident Runbookโ
A structured decision tree for diagnosing and resolving DLQ incidents.
๐จ ALERT: DLQ depth > 0 for topic "orders"
Step 1: Triage (< 5 minutes)
โโ Single message? โ Likely a bad payload from a specific producer
โ Inspect message, find the producer, fix input
โ
โโ Batch of messages? โ Check deployment history
โ โโ Recent consumer deploy? โ Schema change? Roll back consumer and check
โ โโ Recent producer deploy? โ New field? Missing field? Check schema compatibility
โ
โโ Continuous stream? โ Consumer bug or downstream service outage
โโ Check consumer error logs for exception class
โโ Check downstream service health (DB, payment API, etc.)
โโ If downstream outage โ circuit breaker open?
Step 2: Contain (< 15 minutes)
โโ If consumer crashing on all messages โ pause consumer, stop the bleeding
โโ If downstream outage โ let retry backoff absorb; check circuit breaker
โโ If schema mismatch โ deploy schema-tolerant consumer version immediately
Step 3: Fix root cause (varies)
โโ Code bug โ deploy fix
โโ Schema mismatch โ update schema with backward compatibility
โโ Bad producer data โ fix producer, backfill correct data
โโ Downstream outage โ wait for recovery or implement fallback
Step 4: Redrive (after fix is confirmed)
โโ Inspect a sample DLQ message โ confirm it would now process successfully
โโ Redrive at throttled rate (10 msg/sec) โ watch error rate on main consumer
โโ Increase rate if error rate stays low โ full bulk redrive
โโ Monitor DLQ depth โ should return to 0
Step 5: Post-mortem
โโ Why did this message fail? (root cause)
โโ Why did we not catch this in testing? (process gap)
โโ What will prevent this class of failure? (prevention)
Interview Decision Matrixโ
| Scenario | Recommended Strategy | Why |
|---|---|---|
| General microservice event processing | DLQ with exponential backoff (3โ5 retries) | Balances retry for transient failures, safety net for permanent ones |
| Financial ledger / account balance updates | Pause partition โ no DLQ | Ordering is a business invariant; DLQ breaks it |
| Search index synchronization | DLQ โ safe | Consumers are idempotent upserts; ordering doesn't matter |
| Non-critical analytics events | Discard on failure | Investigation cost exceeds value of a lost click event |
| Schema mismatch causing mass failures | DLQ + immediate non-retryable classification | No retries on deserialization errors; inspect and fix schema first |
| Downstream service temporarily down | Exponential backoff without DLQ (3 retries) | Transient โ will recover; DLQ introduces unnecessary manual work |
| Business logic failure (insufficient funds) | Compensating event โ no DLQ | Technical retry will always fail; domain-level response is correct |
| Kafka high-throughput, cannot block partition | Retry topics โ DLQ | Non-blocking retry preserves main partition throughput |
"A DLQ is a reliability guarantee, not just an error bin. Its job is to ensure that no message is silently dropped. When a consumer encounters a permanent failure โ malformed payload, schema mismatch, invalid state โ retrying will never succeed. The message must be isolated into the DLQ so the main queue keeps flowing. After fixing the root cause, we redrive the DLQ messages back to the main queue. The two things I always make sure to get right: classifying failures as retryable vs. non-retryable so we don't waste time retrying poison pills, and making consumers idempotent so redrives are safe even if the first delivery partially succeeded."
"For a financial ledger where message ordering is an absolute business invariant, a DLQ is the wrong pattern. If TXN-101 fails and routes to the DLQ while TXN-102 is processed, the account balance is computed in the wrong order โ potentially approving a withdrawal that should have been rejected. For these systems, the correct pattern is to pause the Kafka partition on failure, page on-call immediately, and block all further processing until the issue is manually resolved. Throughput takes a back seat to correctness."
Further Readingโ
- Spring Kafka โ Error Handling and Retries โ Official Spring Kafka docs covering
DefaultErrorHandler,DeadLetterPublishingRecoverer, and retry configuration. - AWS SQS โ Dead Letter Queues โ AWS documentation on DLQ setup, redrive policy, and the
StartMessageMoveTaskAPI. - RabbitMQ โ Dead Letter Exchanges โ Official RabbitMQ docs on DLX routing arguments, per-message TTL, and queue-length-based dead lettering.
- Confluent โ Error Handling in Kafka Streams โ Covers DLQ integration patterns for Kafka Streams applications.
- Designing Data-Intensive Applications โ Chapter 11 (Stream Processing) โ Kleppmann's treatment of exactly-once semantics, fault tolerance, and ordering guarantees; foundational context for DLQ trade-offs.
- Transactional Outbox Pattern โ The companion pattern: ensuring messages are reliably published to the queue in the first place, so the DLQ only handles genuine processing failures rather than publish failures.
- Resilience4j Documentation โ Circuit breakers, retry policies, and bulkheads that work alongside DLQs to handle transient failures without unnecessary DLQ routing.