Skip to main content

Dead Letter Queue (DLQ) Pattern

A Dead Letter Queue (DLQ) is a secondary message queue that receives messages that could not be successfully processed by a consumer after a configured number of attempts. It acts as a safety valve: isolating failed, malformed, or permanently unprocessable messages so they cannot block healthy traffic, exhaust consumer threads, or silently disappear.

The DLQ is not just an error bin โ€” it is a reliability contract. It guarantees that no message is silently dropped and that every failure can be investigated, fixed, and replayed.

Who this guide is for

The Post Office Analogyโ€‹

Imagine a high-volume post office. Letters flow down a conveyor belt to automated sorting machines. Most letters are stamped, sorted, and routed correctly.

But occasionally a letter arrives that the machine cannot handle:

  • The address is illegible (malformed payload โ€” bad JSON, missing required fields).
  • The recipient no longer exists at that address (invalid entity reference โ€” message references a deleted order).
  • The envelope is physically damaged (deserialization error โ€” corrupted bytes).
  • The letter requires a signature but no one is home after 3 attempts (transient failure that exhausted retries).

Without a DLQ: The sorting machine gets stuck on the unprocessable letter. It tries the same letter over and over. The conveyor belt backs up. Healthy mail piles up behind it. The entire post office grinds to a halt โ€” a poison pill scenario.

With a DLQ: The machine ejects the unprocessable letter into a labeled "Undeliverable Mail" bin after N failed attempts. The main belt keeps moving at full speed. At the end of the day, a human postal worker (the on-call engineer) inspects the bin, diagnoses each letter, fixes the issue, and re-routes the letter for another delivery attempt.

The conveyor belt is your message queue. The sorting machine is your consumer. The undeliverable bin is your DLQ. The re-routing is DLQ redrive.


How It Works: The Full Lifecycleโ€‹

Step-by-Stepโ€‹

1. Message published. Producer sends a message to the main queue. Everything looks normal.

2. Consumer receives and fails. The consumer dequeues the message, attempts to process it, and throws an exception (malformed data, downstream timeout, logic bug).

3. Negative acknowledgement (NACK). The consumer signals failure. In Kafka this means not committing the offset. In SQS this means letting the visibility timeout expire. In RabbitMQ this means calling channel.basicNack().

4. Backoff and retry. The broker waits for the configured backoff period, then re-delivers the message. This repeats up to maxAttempts times, with exponential backoff between attempts.

5. DLQ routing. After all retry attempts are exhausted, the broker automatically routes the message to the configured DLQ. The message is no longer on the main queue โ€” it cannot block further processing.

6. Alert fires. Monitoring detects the DLQ depth > 0 and pages the on-call engineer.

7. Diagnose and fix. The engineer inspects the DLQ message, identifies the root cause (schema mismatch? null pointer? downstream service bug?), and deploys a fix.

8. Redrive. Once the consumer code is fixed, the engineer triggers a redrive โ€” moving DLQ messages back to the main queue for reprocessing.


Core Conceptsโ€‹

Poison Pills vs. Transient Failuresโ€‹

The most important distinction in DLQ design. The retry strategy and DLQ routing differ completely depending on the failure type.

Failure TypeExamplesCorrect Response
TransientDB connection timeout, downstream HTTP 503, thread pool exhaustedRetry with exponential backoff + jitter. DO NOT route to DLQ immediately.
Permanent (Poison Pill)Malformed JSON, null required field, invalid enum value, schema mismatchRoute to DLQ after N attempts โ€” retrying will never succeed
Business Logic ErrorInsufficient funds, order already cancelled, invalid state transitionDepends: either DLQ for human review, or send compensating event
Idempotency DuplicateMessage already processed (at-least-once redelivery)Detect and discard โ€” do NOT DLQ or retry
// โœ… Good: distinguish failure types before deciding how to handle
@KafkaListener(topics = "orders")
public void processOrder(Order order) {
try {
orderService.process(order);
} catch (JsonProcessingException e) {
// Permanent โ€” malformed payload, retrying will never help
// Throw to trigger DLQ routing immediately (after configured maxAttempts)
throw new DeserializationException("Malformed order payload", e);
} catch (OrderAlreadyProcessedException e) {
// Idempotency duplicate โ€” discard silently, no DLQ
log.warn("Duplicate order event, skipping: {}", order.getId());
// Do NOT throw โ€” ACK the message to remove it from the queue
} catch (InventoryServiceUnavailableException e) {
// Transient โ€” retry is appropriate, will eventually succeed
// Throw to trigger retry with backoff
throw new RetryableException("Inventory service temporarily unavailable", e);
}
}

Visibility Timeout (SQS) / Acknowledgement Timeout (RabbitMQ)โ€‹

When a consumer takes a message, the broker hides it from other consumers for a visibility timeout period. This gives the consumer time to process and acknowledge it.

Message dequeued by Consumer A
โ†’ Message hidden for visibility_timeout = 30s
โ†’ Consumer A processing...

IF Consumer A ACKs within 30s:
โ†’ Message deleted from queue โœ…

IF Consumer A crashes or takes > 30s:
โ†’ Visibility timeout expires
โ†’ Message becomes visible again โ†’ delivered to Consumer B
โ†’ receiveCount incremented
โ†’ If receiveCount > maxReceiveCount โ†’ routed to DLQ

The visibility timeout saturation trap:

Processing time average: 25 seconds
Visibility timeout configured: 30 seconds

โ†’ Under heavy load, Consumer A takes 28 seconds (GC pause, slow DB)
โ†’ At second 30: message becomes visible โ€” Consumer B picks it up
โ†’ Consumer A finishes at second 32 โ€” also tries to ACK
โ†’ Both consumers process the same message โ†’ DUPLICATE PROCESSING

Fix: Set visibility timeout โ‰ฅ 6ร— average processing time
Average = 25s โ†’ set visibility timeout = 150s

Max Receive Count / Max Attemptsโ€‹

maxReceiveCount = 3 means:
Attempt 1: fails โ†’ receiveCount = 1
Attempt 2: fails โ†’ receiveCount = 2
Attempt 3: fails โ†’ receiveCount = 3 โ†’ route to DLQ

Choosing the right maxReceiveCount:

  • Too low (1โ€“2): Transient failures (brief network blip, momentary DB unavailability) send messages to DLQ prematurely โ€” these would have succeeded on attempt 3.
  • Too high (10+): Poison pills spend excessive time blocking consumer threads across many retries before reaching the DLQ. The main queue backs up.
  • Recommended range: 3โ€“5 for most systems. Use 5 for external API calls (more transient failure possibility), 3 for internal service calls.

Alternatives and When to Choose Whatโ€‹

A DLQ is not the only error-handling strategy. Understanding the full landscape prevents over-engineering and ensures the right tool for each failure scenario.

Comparison Matrixโ€‹

StrategyDelivery GuaranteeMessage Preserved?Ordering Preserved?Operational OverheadBest For
DLQAt-least-onceโœ… Yesโš ๏ธ Broken on failureLowMost production systems; diagnosable failures
Discard on failureAt-most-onceโŒ Noโœ… YesNoneNon-critical events (click tracking, metrics)
Infinite retryExactly-once attemptโœ… Yesโœ… YesNone upfrontโŒ Almost never โ€” causes consumer starvation
Retry topic (Kafka)At-least-onceโœ… Yesโš ๏ธ PartialMediumHigh-throughput Kafka; non-blocking retries
Pause partitionExactly-once orderingโœ… Yesโœ… YesHighStrict ordering required (financial ledgers)
Compensating eventAt-least-onceโœ… Via eventโœ… YesMediumBusiness-logic failures needing rollback
Circuit breakerAt-least-once + backpressureโœ… Yesโš ๏ธ PartialMediumDownstream service unavailability

1. Discard on Failure (Drop the Message)โ€‹

Simply log the error and acknowledge (ACK) the message regardless of outcome. It is permanently gone.

@KafkaListener(topics = "click-events")
public void processClickEvent(ClickEvent event) {
try {
analyticsService.record(event);
} catch (Exception e) {
// Non-critical: losing a click event is acceptable
log.warn("Failed to record click event {}, discarding: {}", event.getId(), e.getMessage());
// Do NOT rethrow โ€” Kafka will ACK and move on
}
}

Choose discard when: The message is non-critical (UI analytics, A/B test impressions, feature flag telemetry). The cost of investigation and redrive exceeds the value of the lost event. Never discard financial transactions, order state changes, or any message that represents an irreversible business event.


2. Retry Topic Pattern (Kafka Non-Blocking Retry)โ€‹

Instead of blocking the main partition on failure, the message is routed to a series of retry topics with increasing delays. After exhausting all retry topics, it goes to the DLQ topic.

Main topic: orders
โ†’ Fail โ†’ orders.retry.2s (retry after 2 second delay)
โ†’ Fail โ†’ orders.retry.30s (retry after 30 second delay)
โ†’ Fail โ†’ orders.retry.5m (retry after 5 minute delay)
โ†’ Fail โ†’ orders.DLQ (give up โ€” route to DLQ)
@Configuration
public class KafkaRetryConfig {

@Bean
public CommonErrorHandler nonBlockingRetryHandler(KafkaTemplate<Object, Object> template) {
// Routes to: topic.retry.2000, topic.retry.30000, topic.retry.300000, topic.DLQ
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
(record, ex) -> {
// Distinguish retryable vs. non-retryable โ€” skip retries for poison pills
if (ex.getCause() instanceof DeserializationException) {
return new TopicPartition(record.topic() + ".DLQ", record.partition());
}
return new TopicPartition(record.topic() + ".retry.2000", record.partition());
});

ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(2_000L);
backOff.setMultiplier(15.0); // 2s โ†’ 30s โ†’ 300s
backOff.setMaxInterval(300_000L);

return new DefaultErrorHandler(recoverer, backOff);
}
}

Choose retry topics when: You are using Kafka and cannot afford to block the main partition during retries. High-throughput consumers where a single slow retry holds up the entire partition.


3. Pause Partition (Kafka Strict Ordering)โ€‹

On failure, pause the Kafka partition โ€” stop consuming from it entirely until the issue is resolved. No messages are skipped, reordered, or DLQ'd.

@Component
@RequiredArgsConstructor
public class OrderingAwareConsumer {

private final KafkaListenerEndpointRegistry registry;
private final AlertService alertService;

@KafkaListener(topics = "account-transactions", groupId = "ledger-processor",
id = "ledger-listener")
public void process(ConsumerRecord<String, Transaction> record) {
try {
ledgerService.applyTransaction(record.value());
} catch (Exception e) {
log.error("FATAL: Failed to process transaction at offset {}, partition {} โ€” PAUSING",
record.offset(), record.partition());

// Pause this specific partition โ€” no further messages consumed
MessageListenerContainer container = registry.getListenerContainer("ledger-listener");
container.pause();

// Page on-call immediately โ€” human must intervene
alertService.pageOnCall("Ledger partition paused โ€” manual intervention required",
record.topic(), record.partition(), record.offset());

throw e; // Do not ACK โ€” offset not committed
}
}
}

Choose pause partition when: Message ordering is an absolute requirement (banking ledgers, account balance updates, inventory adjustments where sequence matters). DLQ + redrive would destroy the ordering guarantee and potentially corrupt business state.


4. Compensating Eventโ€‹

For business-logic failures that require domain-level rollback rather than technical retry.

@KafkaListener(topics = "payment-requests")
public void processPayment(PaymentRequest request) {
try {
paymentService.charge(request);
} catch (InsufficientFundsException e) {
// This is a business failure, not a technical failure
// Do NOT DLQ โ€” publish a compensating event instead
eventPublisher.publish(new PaymentFailedEvent(
request.getOrderId(),
request.getCustomerId(),
PaymentFailureReason.INSUFFICIENT_FUNDS,
e.getMessage()
));
// ACK the original message โ€” it was processed correctly (just failed business logic)
}
}

Choose compensating events when: The failure is a business-domain outcome, not a technical error. Downstream services need to react to the failure (e.g., order service must cancel the order when payment fails).


Implementation: Spring Boot and Kafkaโ€‹

Production-Grade Configurationโ€‹

@Configuration
@EnableKafka
@RequiredArgsConstructor
public class KafkaDlqConfig {

private final KafkaTemplate<Object, Object> kafkaTemplate;

@Bean
public CommonErrorHandler errorHandler() {
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
kafkaTemplate,
// Route to <topic>.DLQ on the same partition for traceability
(record, ex) -> new TopicPartition(record.topic() + ".DLQ", record.partition())
);

// Enrich DLQ message with diagnostic headers
recoverer.setHeadersFunction((record, ex) -> {
Map<String, Object> headers = new LinkedHashMap<>();
headers.put("dlq-original-topic", record.topic());
headers.put("dlq-original-partition", String.valueOf(record.partition()));
headers.put("dlq-original-offset", String.valueOf(record.offset()));
headers.put("dlq-exception-class", ex.getClass().getName());
headers.put("dlq-exception-message", ex.getMessage());
headers.put("dlq-failed-at", Instant.now().toString());
headers.put("dlq-consumer-group", "order-processor");
return new RecordHeaders(
headers.entrySet().stream()
.map(e -> new RecordHeader(e.getKey(),
e.getValue().toString().getBytes()))
.toList()
);
});

// Exponential backoff: 1s โ†’ 2s โ†’ 4s โ†’ 8s (4 total attempts, then DLQ)
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(1_000L);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(30_000L);

DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);

// Classify non-retryable exceptions โ€” route to DLQ immediately without retries
handler.addNotRetryableExceptions(
JsonProcessingException.class, // Malformed JSON โ€” retrying won't help
DeserializationException.class, // Schema mismatch
IllegalArgumentException.class, // Invalid business data
NullPointerException.class // Programming error โ€” fix the code, not the retry
);

// Classify retryable exceptions explicitly
handler.addRetryableExceptions(
TransientDataAccessException.class, // DB transient error
ResourceAccessException.class, // Network timeout to downstream service
OptimisticLockingFailureException.class // Concurrency conflict
);

return handler;
}
}

DLQ Consumer (Inspector + Reprocess)โ€‹

@Component
@RequiredArgsConstructor
@Slf4j
public class OrderDlqConsumer {

private final ObjectMapper objectMapper;
private final DlqIncidentRepository incidentRepository;
private final MeterRegistry meterRegistry;

@KafkaListener(
topics = "orders.DLQ",
groupId = "orders-dlq-inspector",
// Consume from beginning on startup โ€” don't miss pre-existing DLQ messages
properties = {"auto.offset.reset=earliest"}
)
public void inspect(ConsumerRecord<String, String> record) {
String originalTopic = getHeader(record, "dlq-original-topic");
String exceptionClass = getHeader(record, "dlq-exception-class");
String exceptionMsg = getHeader(record, "dlq-exception-message");
String failedAt = getHeader(record, "dlq-failed-at");

log.error("DLQ message received | topic={} partition={} offset={} | error={}: {}",
originalTopic,
getHeader(record, "dlq-original-partition"),
getHeader(record, "dlq-original-offset"),
exceptionClass,
exceptionMsg);

// Persist DLQ incident for dashboard + audit trail
DlqIncident incident = DlqIncident.builder()
.originalTopic(originalTopic)
.payload(record.value())
.exceptionClass(exceptionClass)
.exceptionMessage(exceptionMsg)
.failedAt(Instant.parse(failedAt))
.status(DlqStatus.PENDING)
.build();
incidentRepository.save(incident);

// Increment counter for Grafana alerting
meterRegistry.counter("dlq.messages.received",
"topic", originalTopic,
"exception", exceptionClass).increment();
}

private String getHeader(ConsumerRecord<?, ?> record, String key) {
Header header = record.headers().lastHeader(key);
return header != null ? new String(header.value()) : "unknown";
}
}

Redrive Service (Replay DLQ โ†’ Main Queue)โ€‹

@Service
@RequiredArgsConstructor
@Slf4j
public class DlqRedriveService {

private final KafkaTemplate<String, String> kafkaTemplate;
private final DlqIncidentRepository incidentRepository;

/**
* Redrive a single DLQ incident back to the original topic.
* Called manually by an engineer after the root cause is fixed.
*/
@Transactional
public void redriveIncident(UUID incidentId) {
DlqIncident incident = incidentRepository.findById(incidentId)
.orElseThrow(() -> new IncidentNotFoundException(incidentId));

if (incident.getStatus() != DlqStatus.PENDING) {
throw new InvalidIncidentStateException(
"Can only redrive PENDING incidents, current status: " + incident.getStatus());
}

try {
// Send back to the original topic
kafkaTemplate.send(incident.getOriginalTopic(), incident.getPayload()).get();
incident.setStatus(DlqStatus.REDRIVEN);
incident.setRedrivenAt(Instant.now());
log.info("Redriven DLQ incident {} to topic {}", incidentId, incident.getOriginalTopic());
} catch (Exception e) {
incident.setStatus(DlqStatus.REDRIVE_FAILED);
incident.setLastError(e.getMessage());
throw new RedriveException("Redrive failed for incident " + incidentId, e);
} finally {
incidentRepository.save(incident);
}
}

/**
* Bulk redrive โ€” redrive all PENDING incidents for a given topic.
* Use after deploying a fix that addresses a batch of failures.
*/
@Transactional
public RedriveResult bulkRedrive(String topic, int batchSize) {
List<DlqIncident> pending = incidentRepository
.findPendingByTopicOrderByFailedAt(topic, PageRequest.of(0, batchSize));

int succeeded = 0, failed = 0;
for (DlqIncident incident : pending) {
try {
redriveIncident(incident.getId());
succeeded++;
} catch (Exception e) {
log.error("Failed to redrive incident {}: {}", incident.getId(), e.getMessage());
failed++;
}
}

return new RedriveResult(topic, succeeded, failed, pending.size());
}
}

Implementation: RabbitMQโ€‹

Queue Declaration with DLX (Dead Letter Exchange)โ€‹

RabbitMQ routes failed messages via a Dead Letter Exchange (DLX) โ€” a regular exchange configured as the dead letter destination for another queue.

@Configuration
public class RabbitMqDlqConfig {

// โ”€โ”€ Main infrastructure โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
@Bean
public TopicExchange mainExchange() {
return new TopicExchange("orders.exchange");
}

@Bean
public TopicExchange deadLetterExchange() {
return new TopicExchange("orders.dlx"); // DLX โ€” routes to DLQ
}

@Bean
public Queue mainQueue() {
return QueueBuilder.durable("orders.queue")
// Any rejected/expired message routes to the DLX
.withArgument("x-dead-letter-exchange", "orders.dlx")
.withArgument("x-dead-letter-routing-key", "orders.dead")
// Optional: per-message TTL โ€” message goes to DLQ if not consumed in 24h
.withArgument("x-message-ttl", 86_400_000L)
// Optional: max queue length before head is dead-lettered
.withArgument("x-max-length", 100_000)
.build();
}

@Bean
public Queue deadLetterQueue() {
return QueueBuilder.durable("orders.dlq")
// DLQ has a longer retention than the main queue
.withArgument("x-message-ttl", 1_209_600_000L) // 14 days
.build();
}

// โ”€โ”€ Bindings โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
@Bean
public Binding mainBinding() {
return BindingBuilder.bind(mainQueue())
.to(mainExchange())
.with("orders.#");
}

@Bean
public Binding dlqBinding() {
return BindingBuilder.bind(deadLetterQueue())
.to(deadLetterExchange())
.with("orders.dead");
}
}

RabbitMQ Consumer with Manual Acknowledgementโ€‹

@Component
@RequiredArgsConstructor
@Slf4j
public class OrderRabbitConsumer {

private final OrderService orderService;
private final MeterRegistry meterRegistry;

@RabbitListener(queues = "orders.queue")
public void consume(Message message, Channel channel,
@Header(AmqpHeaders.DELIVERY_TAG) long deliveryTag) throws IOException {
String payload = new String(message.getBody());
try {
Order order = objectMapper.readValue(payload, Order.class);
orderService.process(order);

// โœ… ACK โ€” message successfully processed, remove from queue
channel.basicAck(deliveryTag, false);
meterRegistry.counter("orders.processed").increment();

} catch (JsonProcessingException e) {
// Permanent failure โ€” malformed JSON, retrying won't help
// requeue=false โ†’ message routes to DLX โ†’ DLQ
log.error("Malformed order payload, routing to DLQ: {}", e.getMessage());
channel.basicNack(deliveryTag, false, false); // false = do NOT requeue
meterRegistry.counter("orders.dlq.routed", "reason", "deserialization_error").increment();

} catch (TransientServiceException e) {
// Transient failure โ€” requeue for retry
// requeue=true โ†’ message goes back to the head of the queue
log.warn("Transient failure processing order, requeueing: {}", e.getMessage());
channel.basicNack(deliveryTag, false, true); // true = requeue
// Note: RabbitMQ spring-retry handles exponential backoff before requeue
}
}
}
# application.yml โ€” RabbitMQ retry with exponential backoff
spring:
rabbitmq:
listener:
simple:
acknowledge-mode: manual # Manual ACK gives full control
retry:
enabled: true
max-attempts: 4
initial-interval: 1000ms # 1s first retry
multiplier: 2.0 # 1s โ†’ 2s โ†’ 4s โ†’ DLQ
max-interval: 30000ms
stateless: false # Stateful retry tracks per-message state

Implementation: AWS SQSโ€‹

Terraform Configurationโ€‹

# Main queue with DLQ redrive policy
resource "aws_sqs_queue" "orders" {
name = "orders-queue"
visibility_timeout_seconds = 150 # 6ร— average processing time of 25s
message_retention_seconds = 345600 # 4 days

redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
maxReceiveCount = 4 # 4 delivery attempts before DLQ
})
}

resource "aws_sqs_queue" "orders_dlq" {
name = "orders-queue-dlq"
message_retention_seconds = 1209600 # 14 days โ€” longer than main queue
}

# CloudWatch alarm โ€” alert the moment DLQ receives any message
resource "aws_cloudwatch_metric_alarm" "orders_dlq_not_empty" {
alarm_name = "orders-dlq-not-empty"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 60
statistic = "Sum"
threshold = 0
alarm_description = "Messages in DLQ indicate consumer failures requiring investigation"
alarm_actions = [aws_sns_topic.on_call_alerts.arn]

dimensions = {
QueueName = aws_sqs_queue.orders_dlq.name
}
}

Spring Boot SQS Consumerโ€‹

@Component
@RequiredArgsConstructor
@Slf4j
public class SqsOrderConsumer {

private final OrderService orderService;
private final SqsTemplate sqsTemplate;

@SqsListener(value = "${aws.sqs.orders-queue-url}", acknowledgementMode = AcknowledgementMode.MANUAL)
public void consume(Order order, Acknowledgement ack,
@Header("ApproximateReceiveCount") int receiveCount,
@Header("MessageId") String messageId) {
log.info("Processing order {} (attempt {})", order.getId(), receiveCount);

try {
orderService.process(order);
ack.acknowledge(); // โœ… Delete from SQS
} catch (NonRetryableException e) {
// Permanent failure โ€” let the message exhaust its receiveCount
// and route to DLQ naturally (do not ACK)
log.error("Permanent failure for order {}, attempt {}/{}: {}",
order.getId(), receiveCount, 4, e.getMessage());
// Not acknowledging โ†’ visibility timeout expires โ†’ message reappears
// After maxReceiveCount (4) it moves to DLQ automatically
} catch (Exception e) {
// Transient failure โ€” same as above, but log differently
log.warn("Transient failure for order {}, attempt {}/{}: {}",
order.getId(), receiveCount, 4, e.getMessage());
}
}
}

SQS Redrive (AWS CLI + Java SDK)โ€‹

# List messages in the DLQ for inspection
aws sqs receive-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue-dlq \
--attribute-names All \
--message-attribute-names All \
--max-number-of-messages 10

# Start a redrive task โ€” moves all DLQ messages back to the source queue
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:us-east-1:123456789012:orders-queue-dlq \
--destination-arn arn:aws:sqs:us-east-1:123456789012:orders-queue \
--max-number-of-messages-per-second 10 # Throttle to avoid overwhelming the consumer
// Programmatic SQS redrive via AWS SDK v2
@Service
@RequiredArgsConstructor
public class SqsRedriveService {

private final SqsClient sqsClient;

@Value("${aws.sqs.orders-dlq-arn}")
private String dlqArn;

@Value("${aws.sqs.orders-queue-arn}")
private String sourceQueueArn;

public String startRedrive(int messagesPerSecond) {
StartMessageMoveTaskResponse response = sqsClient.startMessageMoveTask(r -> r
.sourceArn(dlqArn)
.destinationArn(sourceQueueArn)
.maxNumberOfMessagesPerSecond(messagesPerSecond)
);

log.info("Started SQS redrive task: {}", response.taskHandle());
return response.taskHandle();
}

public MessageMoveTaskStatus checkRedriveStatus(String taskHandle) {
ListMessageMoveTasksResponse response = sqsClient.listMessageMoveTasks(r ->
r.sourceArn(dlqArn));

return response.results().stream()
.filter(t -> t.taskHandle().equals(taskHandle))
.findFirst()
.map(t -> new MessageMoveTaskStatus(
t.status(), t.approximateNumberOfMessagesMoved(),
t.approximateNumberOfMessagesToMove()))
.orElseThrow();
}
}

Senior Deep Diveโ€‹

1. The Ordering vs. Throughput Trade-offโ€‹

Routing a failed message to the DLQ breaks the processing order for messages that follow it. For most systems this is acceptable. For some, it is catastrophic.

Main queue: [TXN-100 balance=1000] โ†’ [TXN-101 withdraw=500] โ†’ [TXN-102 withdraw=700]

Scenario: TXN-101 fails and routes to DLQ
TXN-102 is processed: balance=1000, withdraw=700 โ†’ balance=300

Later: TXN-101 is redriven and processed: balance=300, withdraw=500 โ†’ balance=-200 โŒ
This would be APPROVED โ€” but should have been REJECTED (balance was only 300 at the time)
The correct final balance should be 1000 - 500 - 700 = -200 REJECTED

Result: Data corruption from out-of-order replay

Decision framework:

System TypeOrdering RequirementRecommended Strategy
Notification serviceโŒ NoneDLQ โ€” order doesn't matter
Search index syncโŒ None (idempotent upserts)DLQ โ€” last write wins
E-commerce order statusโš ๏ธ Soft (within order)DLQ + per-aggregate ordering via partition key
Bank account ledgerโœ… StrictPause partition โ€” no DLQ
Inventory adjustmentsโœ… StrictPause partition or saga compensation

Kafka partition key for per-aggregate ordering:

// Ensure all events for the same aggregate go to the same partition
// โ†’ Ordering is preserved per aggregate, DLQ is safe
kafkaTemplate.send(
new ProducerRecord<>(
"orders",
order.getOrderId().toString(), // partition key = order ID
order
)
);
// Events for orderId=123 always go to the same partition โ†’ in-order processing
// A failure on orderId=123 does NOT affect orderId=456 on a different partition

2. Idempotent DLQ Consumersโ€‹

When a message is redriven from the DLQ, the consumer processes it again. If the first processing attempt partially succeeded (e.g., the DB write succeeded but the downstream API call failed), a naive consumer will try to create the same record again.

// โŒ Non-idempotent consumer โ€” safe for first delivery, unsafe for redrive
@KafkaListener(topics = "orders")
public void process(Order order) {
orderRepository.save(order); // If first attempt: succeeds
// If redriven: DUPLICATE KEY VIOLATION โŒ
paymentService.charge(order);
}

// โœ… Idempotent consumer โ€” safe for both first delivery and redrive
@KafkaListener(topics = "orders")
@Transactional
public void process(ConsumerRecord<String, Order> record) {
String eventId = record.topic() + "-" + record.partition() + "-" + record.offset();

// Check if already successfully processed (from first delivery attempt)
if (processedEventRepository.existsById(eventId)) {
log.info("Event {} already processed, skipping (idempotent)", eventId);
return; // ACK and move on
}

Order order = record.value();
orderRepository.save(order); // Idempotent: upsert by business ID
paymentService.charge(order); // Idempotent: check if charge already exists

// Record successful processing โ€” prevents duplicate on redrive
processedEventRepository.save(new ProcessedEvent(eventId, Instant.now()));
}

3. Schema Evolution and the DLQ Spike Patternโ€‹

One of the most common causes of mass DLQ spikes in production: a schema change that breaks existing consumers.

Deploy new Order schema โ€” OrderV2 adds a required field: `currency`
โ†“
Legacy consumers still receive OrderV1 messages (in-flight, not yet drained)
โ†“
Deserialization fails for every OrderV1 message
โ†“
DLQ depth spikes from 0 to 100,000 in minutes
โ†“
Alert fires โ€” engineers investigate

Defense in layers:

Layer 1 โ€” Schema Registry compatibility check:

# Register the new schema and check backward compatibility BEFORE deploying
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
-H "Content-Type: application/json" \
-d @order_schema_v2.avsc
# {"is_compatible":false} โ†’ DO NOT DEPLOY โ€” will break existing consumers

Layer 2 โ€” Never remove optional fields; only add:

// โŒ Breaking change โ€” removes a field consumers rely on
public class OrderV2 {
private UUID id;
private BigDecimal totalAmount;
// private String customerId; โ† REMOVED โ€” breaks consumers that read this field
}

// โœ… Backward compatible โ€” adds an optional field with a default
public class OrderV2 {
private UUID id;
private BigDecimal totalAmount;
private String customerId;
private String currency = "USD"; // New field โ€” nullable/defaulted
}

Layer 3 โ€” Classify deserialization errors as non-retryable:

// Route schema errors directly to DLQ without wasting retries
handler.addNotRetryableExceptions(
DeserializationException.class,
JsonMappingException.class,
InvalidFormatException.class
);

Layer 4 โ€” DLQ spike runbook:

1. Check DLQ message headers for "dlq-exception-class"
2. If DeserializationException:
a. Pull a sample DLQ message payload
b. Compare schema against current consumer version
c. Identify the breaking field
d. Rollback consumer OR deploy a schema-tolerant version
e. Redrive DLQ messages after fix is deployed

3. If NullPointerException or IllegalArgumentException:
a. Check if a new required field was added to the message
b. Identify the producer that sends incomplete payloads
c. Fix producer to include the field
d. Redrive DLQ after fix

4. DLQ Metrics, Alerting, and Dashboardsโ€‹

The DLQ should be empty under normal operating conditions. Any non-zero depth is a production incident.

@Component
@RequiredArgsConstructor
public class DlqMetricsCollector {

private final DlqIncidentRepository incidentRepository;
private final MeterRegistry registry;

@Scheduled(fixedDelay = 10_000)
public void collectMetrics() {
// 1. DLQ depth per topic โ€” alert on ANY value > 0
Map<String, Long> depthByTopic = incidentRepository
.countPendingGroupedByTopic();
depthByTopic.forEach((topic, count) ->
registry.gauge("dlq.depth", Tags.of("topic", topic), count));

// 2. Age of oldest pending DLQ message โ€” alert if > 30 minutes
incidentRepository.findOldestPendingByTopic().forEach((topic, oldest) -> {
long ageMinutes = Duration.between(oldest.getFailedAt(), Instant.now()).toMinutes();
registry.gauge("dlq.oldest.message.age.minutes",
Tags.of("topic", topic), ageMinutes);
});

// 3. DLQ ingestion rate โ€” spike indicates a breaking change
registry.summary("dlq.ingestion.rate"); // Updated by DlqConsumer per message
}
}

Prometheus alerting rules:

groups:
- name: dlq_alerts
rules:
# P1: Any message in DLQ โ€” immediate investigation required
- alert: DLQNotEmpty
expr: dlq_depth > 0
for: 1m
labels:
severity: critical
annotations:
summary: "DLQ has {{ $value }} pending messages on topic {{ $labels.topic }}"
runbook: "https://wiki.company.com/runbooks/dlq-investigation"

# P2: Old unresolved DLQ messages approaching expiration
- alert: DLQMessageApproachingExpiry
expr: dlq_oldest_message_age_minutes > 180 # 3 hours
labels:
severity: warning
annotations:
summary: "DLQ messages on {{ $labels.topic }} are {{ $value }} minutes old"

# P2: DLQ spike โ€” many messages arriving in a short window (schema break?)
- alert: DLQIngestionSpike
expr: rate(dlq_messages_received_total[5m]) > 10
labels:
severity: warning
annotations:
summary: "DLQ receiving {{ $value }} messages/sec โ€” possible schema change"

5. The DLQ Across Broker Technologiesโ€‹

The DLQ concept is universal, but implementation details differ across brokers.

AspectKafkaRabbitMQAWS SQS
DLQ mechanismSeparate topic (orders.DLQ) written to by consumerDead Letter Exchange (DLX) โ†’ Dead Letter QueueSeparate SQS queue via redrive policy
Routing triggerApplication-level (DeadLetterPublishingRecoverer)Broker-level (NACK with requeue=false, TTL, max-length)Broker-level (maxReceiveCount exceeded)
Message orderingPreserved within partitionNot guaranteed after DLQNot guaranteed (Standard queue)
DLQ message metadataAdded as Kafka headers by recovererAdded as message properties by RabbitMQAvailable via ApproximateReceiveCount, MessageId attributes
Redrive mechanismRe-publish to original topic (manual or automated)Re-queue via shovel plugin or management APIStartMessageMoveTask API
RetentionConfigurable topic retention (days)Configurable queue TTLUp to 14 days
FIFO ordering supportโœ… Per-partitionโŒโœ… FIFO queues
Multi-level DLQโœ… Retry topics โ†’ DLQโœ… Multiple DLX hopsโŒ Single level

Kafka-specific: multi-level retry topic chain:

orders (main topic)
โ†“ fail attempt 1
orders.retry.1s (1 second delay via consumer pause)
โ†“ fail attempt 2
orders.retry.30s (30 second delay)
โ†“ fail attempt 3
orders.retry.5m (5 minute delay)
โ†“ fail attempt 4
orders.DLQ (no more retries โ€” engineer must intervene)
// Consumer for each retry topic applies the configured delay
@KafkaListener(topics = "orders.retry.1s", groupId = "orders-retry-1s")
public void retryAfter1s(ConsumerRecord<String, Order> record) {
long publishedAt = Long.parseLong(
new String(record.headers().lastHeader("retry-publish-time").value()));
long delayMs = 1_000L - (System.currentTimeMillis() - publishedAt);

if (delayMs > 0) {
Thread.sleep(delayMs); // Enforce the delay
}
// Re-route to main topic for reprocessing
kafkaTemplate.send("orders", record.key(), record.value());
}

6. Production Incident Runbookโ€‹

A structured decision tree for diagnosing and resolving DLQ incidents.

๐Ÿšจ ALERT: DLQ depth > 0 for topic "orders"

Step 1: Triage (< 5 minutes)
โ”Œโ”€ Single message? โ†’ Likely a bad payload from a specific producer
โ”‚ Inspect message, find the producer, fix input
โ”‚
โ”œโ”€ Batch of messages? โ†’ Check deployment history
โ”‚ โ”œโ”€ Recent consumer deploy? โ†’ Schema change? Roll back consumer and check
โ”‚ โ””โ”€ Recent producer deploy? โ†’ New field? Missing field? Check schema compatibility
โ”‚
โ””โ”€ Continuous stream? โ†’ Consumer bug or downstream service outage
โ”œโ”€ Check consumer error logs for exception class
โ”œโ”€ Check downstream service health (DB, payment API, etc.)
โ””โ”€ If downstream outage โ†’ circuit breaker open?

Step 2: Contain (< 15 minutes)
โ”œโ”€ If consumer crashing on all messages โ†’ pause consumer, stop the bleeding
โ”œโ”€ If downstream outage โ†’ let retry backoff absorb; check circuit breaker
โ””โ”€ If schema mismatch โ†’ deploy schema-tolerant consumer version immediately

Step 3: Fix root cause (varies)
โ”œโ”€ Code bug โ†’ deploy fix
โ”œโ”€ Schema mismatch โ†’ update schema with backward compatibility
โ”œโ”€ Bad producer data โ†’ fix producer, backfill correct data
โ””โ”€ Downstream outage โ†’ wait for recovery or implement fallback

Step 4: Redrive (after fix is confirmed)
โ”œโ”€ Inspect a sample DLQ message โ€” confirm it would now process successfully
โ”œโ”€ Redrive at throttled rate (10 msg/sec) โ€” watch error rate on main consumer
โ”œโ”€ Increase rate if error rate stays low โ†’ full bulk redrive
โ””โ”€ Monitor DLQ depth โ†’ should return to 0

Step 5: Post-mortem
โ”œโ”€ Why did this message fail? (root cause)
โ”œโ”€ Why did we not catch this in testing? (process gap)
โ””โ”€ What will prevent this class of failure? (prevention)

Interview Decision Matrixโ€‹

ScenarioRecommended StrategyWhy
General microservice event processingDLQ with exponential backoff (3โ€“5 retries)Balances retry for transient failures, safety net for permanent ones
Financial ledger / account balance updatesPause partition โ€” no DLQOrdering is a business invariant; DLQ breaks it
Search index synchronizationDLQ โ€” safeConsumers are idempotent upserts; ordering doesn't matter
Non-critical analytics eventsDiscard on failureInvestigation cost exceeds value of a lost click event
Schema mismatch causing mass failuresDLQ + immediate non-retryable classificationNo retries on deserialization errors; inspect and fix schema first
Downstream service temporarily downExponential backoff without DLQ (3 retries)Transient โ€” will recover; DLQ introduces unnecessary manual work
Business logic failure (insufficient funds)Compensating event โ€” no DLQTechnical retry will always fail; domain-level response is correct
Kafka high-throughput, cannot block partitionRetry topics โ†’ DLQNon-blocking retry preserves main partition throughput
Interview Phrasing โ€” DLQ Fundamentals

"A DLQ is a reliability guarantee, not just an error bin. Its job is to ensure that no message is silently dropped. When a consumer encounters a permanent failure โ€” malformed payload, schema mismatch, invalid state โ€” retrying will never succeed. The message must be isolated into the DLQ so the main queue keeps flowing. After fixing the root cause, we redrive the DLQ messages back to the main queue. The two things I always make sure to get right: classifying failures as retryable vs. non-retryable so we don't waste time retrying poison pills, and making consumers idempotent so redrives are safe even if the first delivery partially succeeded."

Interview Phrasing โ€” When NOT to DLQ

"For a financial ledger where message ordering is an absolute business invariant, a DLQ is the wrong pattern. If TXN-101 fails and routes to the DLQ while TXN-102 is processed, the account balance is computed in the wrong order โ€” potentially approving a withdrawal that should have been rejected. For these systems, the correct pattern is to pause the Kafka partition on failure, page on-call immediately, and block all further processing until the issue is manually resolved. Throughput takes a back seat to correctness."


Further Readingโ€‹