Transactional Outbox Pattern
- New learners β start at The Dual-Write Problem and The Coffee Shop Analogy to build intuition.
- Senior engineers β jump to How the Relay Works Internally, CDC vs Polling Deep Dive, Ordering Semantics, or Production Observability.
The Dual-Write Problemβ
Every event-driven microservice eventually hits the same wall:
@Transactional
public Order placeOrder(CreateOrderCommand cmd) {
Order order = orderRepository.save(Order.from(cmd)); // Write #1: database
kafkaTemplate.send("order-events", new OrderPlaced(order)); // Write #2: Kafka
return order;
}
This code looks innocent but has a fatal concurrency bug. There are two independent writes to two different systems, and there is no way to make them atomic together:
| Scenario | DB Write | Kafka Write | Result |
|---|---|---|---|
| Both succeed | β | β | Correct |
| DB fails, Kafka succeeds | β | β | Ghost event β event for an order that doesn't exist |
| DB succeeds, Kafka fails | β | β | Silent drop β order exists but no downstream service knows |
| Process crashes between the two writes | β | β | Same as above |
The third scenario β DB succeeds, Kafka fails β is the most dangerous and most common. Your order service confirms the order to the user, but inventory was never reserved, the email was never sent, and payment was never triggered.
The Coffee Shop Analogyβ
Imagine a coffee shop. When you order, the barista:
- Writes your order on a paper ticket (the outbox table β in the same action as taking the order)
- Sticks it on the order board (publishes to Kafka β done by a separate relay process)
Even if the board catches fire (Kafka is down), the paper ticket is still there. A new barista (relay restart) can read the ticket and re-post it to the board when it's back up.
The key insight: the ticket (outbox record) and the order (business record) are written in one pen stroke β atomically. The board (Kafka) gets updated eventually but reliably.
The Transactional Outbox Patternβ
The Transactional Outbox Pattern solves the Dual-Write Problem by writing the event payload to the same database as the business data, in the same local ACID transaction. A separate relay process then reads the outbox and publishes events to the message broker.
Guarantee: If the database transaction commits, the event will eventually be published to Kafka. If the transaction rolls back, no event is published. There is no window where the order exists but the event is missing.
Database Schemaβ
PostgreSQL Outbox Tableβ
CREATE TABLE outbox_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(255) NOT NULL, -- e.g., 'Order', 'Payment'
aggregate_id VARCHAR(255) NOT NULL, -- e.g., order UUID
event_type VARCHAR(255) NOT NULL, -- e.g., 'OrderPlaced'
payload JSONB NOT NULL, -- full event payload
topic VARCHAR(255) NOT NULL, -- target Kafka topic
partition_key VARCHAR(255), -- for ordering (e.g., orderId)
status VARCHAR(50) NOT NULL DEFAULT 'PENDING',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
published_at TIMESTAMPTZ,
retry_count INT NOT NULL DEFAULT 0,
last_error TEXT,
idempotency_key VARCHAR(255) UNIQUE -- prevents duplicate inserts
);
-- Fast index for the relay process (only scans PENDING rows)
CREATE INDEX CONCURRENTLY idx_outbox_pending
ON outbox_events (created_at ASC)
WHERE status = 'PENDING';
-- Index for cleanup jobs (purge old PUBLISHED records)
CREATE INDEX idx_outbox_published_at
ON outbox_events (published_at)
WHERE status = 'PUBLISHED';
WHERE status = 'PENDING' makes the index only cover unpublished rows β which is the tiny minority of rows in a mature system. The relay's SELECT query scans only this index, not the full table, giving O(1) relay performance regardless of total table size.
Application Write Phase (Spring Boot)β
@Service
@RequiredArgsConstructor
@Slf4j
public class OrderService {
private final OrderRepository orderRepository;
private final OutboxRepository outboxRepository;
private final ObjectMapper objectMapper;
@Transactional // local ACID β both writes succeed or both fail
public Order placeOrder(CreateOrderCommand cmd) {
// Business operation
Order order = Order.create(cmd);
orderRepository.save(order);
// Write event to outbox IN THE SAME TRANSACTION
OutboxEvent event = OutboxEvent.builder()
.aggregateType("Order")
.aggregateId(order.getId().toString())
.eventType("OrderPlaced")
.topic("order-events")
.partitionKey(order.getId().toString()) // ensures ordering per order
.payload(serialize(new OrderPlacedEvent(order)))
.idempotencyKey("order-placed:" + order.getId())
.build();
outboxRepository.save(event);
log.info("Order {} created, outbox event {} queued", order.getId(), event.getId());
return order;
}
private String serialize(Object obj) {
try {
return objectMapper.writeValueAsString(obj);
} catch (JsonProcessingException e) {
throw new IllegalStateException("Serialization failed", e);
}
}
}
How the Relay Works Internallyβ
Strategy A: Polling Publisher (SKIP LOCKED)β
A scheduled background job queries for PENDING rows and publishes them to Kafka. The critical detail is FOR UPDATE SKIP LOCKED β this prevents multiple relay instances from picking up the same events simultaneously.
Timeline:
t=0 Relay instance 1 queries: SELECT ... WHERE status='PENDING' FOR UPDATE SKIP LOCKED
β Gets rows [E1, E2, E3] (locked)
t=0 Relay instance 2 queries: SELECT ... WHERE status='PENDING' FOR UPDATE SKIP LOCKED
β Gets rows [E4, E5, E6] (different rows β instance 1's rows are SKIPped)
t=1 Instance 1 publishes E1 to Kafka, marks PUBLISHED
t=1 Instance 2 publishes E4 to Kafka, marks PUBLISHED
β Both instances work in parallel without contention β
@Component
@RequiredArgsConstructor
@Slf4j
public class OutboxPollingRelay {
private final OutboxRepository outboxRepository;
private final KafkaTemplate<String, String> kafkaTemplate;
private final OutboxMetrics metrics;
@Scheduled(fixedDelay = 500) // poll every 500ms
@Transactional
public void pollAndPublish() {
List<OutboxEvent> batch = outboxRepository.findPendingForUpdate(100);
if (batch.isEmpty()) return;
log.debug("Processing batch of {} outbox events", batch.size());
long startMs = System.currentTimeMillis();
for (OutboxEvent event : batch) {
try {
// Synchronous send β confirms broker accepted before marking as published
kafkaTemplate.send(
event.getTopic(),
event.getPartitionKey(),
event.getPayload()
).get(5, TimeUnit.SECONDS); // timeout after 5s
event.markPublished();
outboxRepository.save(event);
metrics.recordPublished(event.getEventType());
} catch (Exception e) {
log.error("Failed to publish outbox event {}: {}", event.getId(), e.getMessage());
event.recordFailure(e.getMessage());
outboxRepository.save(event);
metrics.recordFailure(event.getEventType());
break; // Stop batch on failure to preserve ordering per partition key
}
}
metrics.recordBatchDuration(System.currentTimeMillis() - startMs);
}
}
public interface OutboxRepository extends JpaRepository<OutboxEvent, UUID> {
@Query(value = """
SELECT *
FROM outbox_events
WHERE status = 'PENDING'
ORDER BY created_at ASC
LIMIT :#{#pageable.pageSize}
FOR UPDATE SKIP LOCKED
""", nativeQuery = true)
List<OutboxEvent> findPendingForUpdate(int limit);
}
Polling trade-offs:
| Aspect | Detail |
|---|---|
| Latency | 0β500ms additional latency (poll interval) |
| Database load | Constant read queries every 500ms per relay instance |
| Complexity | Simple β no additional infrastructure |
| Throughput | Scales horizontally by adding relay instances |
| Best for | Low-to-medium volume (< 1,000 events/sec) |
Strategy B: Change Data Capture (CDC) via Debeziumβ
Debezium reads the PostgreSQL Write-Ahead Log (WAL) directly β the same internal mechanism PostgreSQL uses for replication. When a row is inserted into outbox_events, Debezium reads the WAL entry and streams it to Kafka Connect immediately β typically in < 50ms.
PostgreSQL WAL Stream:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β LSN 00001/A1B2C3: INSERT outbox_events β
β id=uuid-123, event_type=OrderPlaced, β
β aggregate_id=order-456, status=PENDING β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Debezium reads WAL
Kafka Connect
β publishes to Kafka topic
order-events (Kafka partition 3)
{
"name": "outbox-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres-host",
"database.port": "5432",
"database.user": "debezium_user",
"database.password": "${file:/kafka/connect-secrets:postgres.password}",
"database.dbname": "myapp",
"slot.name": "debezium_outbox_slot",
"publication.name": "debezium_publication",
"table.include.list": "public.outbox_events",
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.route.topic.replacement": "${routedByValue}-events",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.type": "event_type",
"heartbeat.interval.ms": "10000",
"slot.drop.on.stop": "false"
}
}
Setting up PostgreSQL for logical replication:
-- postgresql.conf must have:
wal_level = logical
max_replication_slots = 5 -- one per Debezium connector
max_wal_senders = 5
-- Create a Debezium user with replication privileges
CREATE USER debezium_user WITH REPLICATION LOGIN PASSWORD 'secure_pass';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
-- Create publication (what Debezium watches)
CREATE PUBLICATION debezium_publication FOR TABLE outbox_events;
CDC trade-offs:
| Aspect | Detail |
|---|---|
| Latency | Sub-50ms β reads WAL nearly in real time |
| Database load | Minimal β reads WAL stream, not polling queries |
| Complexity | Requires Kafka Connect + Debezium + PostgreSQL logical replication |
| Ordering | Preserves WAL order β events published in exact insert order |
| Best for | High volume (> 1,000 events/sec), strict latency requirements |
CDC vs Polling Publisher Deep Diveβ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMPARISON β
βββββββββββββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββββββββββββ€
β Dimension β Polling β CDC (Debezium) β
βββββββββββββββββββββββΌβββββββββββββββββββΌββββββββββββββββββββββββ€
β Latency β 0β500ms β < 50ms β
β DB Load β Queries every β WAL stream (low CPU) β
β β 500ms β β
β Infrastructure β Just app code β Kafka Connect + β
β β β Debezium + WAL config β
β Scaling β Horizontal β Single reader per β
β β (SKIP LOCKED) β connector (Kafka β
β β β Connect partitioning) β
β Ordering β Per poll-batch β Exact WAL order β
β Failure recovery β Restart relay β Debezium resumes from β
β β β last LSN in slot β
β Exactly-once? β No (at-least- β No (at-least-once) β
β β once) β β
β Best TPS ceiling β ~1,000/sec β ~50,000/sec β
βββββββββββββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββββββββββββ
Neither polling nor CDC guarantees exactly-once delivery to Kafka. Your consumers must handle duplicates using idempotency keys. See the Deduplication Guide for implementation patterns.
Ordering Semantics & Partitioningβ
Kafka preserves order within a partition. To ensure ordered delivery for a specific entity (e.g., all events for orderId=123 arrive in order), all events for that entity must go to the same partition:
// Use orderId as the partition key β Kafka hashes it to a consistent partition
OutboxEvent event = OutboxEvent.builder()
.topic("order-events")
.partitionKey(order.getId().toString()) // β critical for ordering
.payload(serialize(new OrderPlaced(order)))
.build();
What happens without a partition key:
- Events for the same order may land on different Kafka partitions
- Consumers may process
OrderCancelledbeforeOrderPlaced(out of order) - Downstream services corrupt state
What to use as partition key:
| Entity Type | Partition Key | Why |
|---|---|---|
| Order events | orderId | All order lifecycle events stay in order |
| Customer events | customerId | Customer profile/preference events stay ordered |
| Payment events | paymentId or orderId | Depends on consumer needs |
| Inventory events | productId | Stock changes per product stay ordered |
Cleanup & Table Managementβ
The outbox_events table grows indefinitely without a cleanup strategy. Published events are safe to delete after a retention window.
@Component
@RequiredArgsConstructor
@Slf4j
public class OutboxCleanupJob {
private final OutboxRepository outboxRepository;
@Scheduled(cron = "0 0 2 * * ?") // 2am daily
@Transactional
public void cleanupPublishedEvents() {
Instant cutoff = Instant.now().minus(Duration.ofDays(7)); // keep 7 days
int deleted = outboxRepository.deletePublishedBefore(cutoff);
log.info("Cleaned up {} published outbox events older than {}", deleted, cutoff);
}
@Scheduled(cron = "0 */30 * * * ?") // every 30 minutes
@Transactional
public void retryFailedEvents() {
// Re-queue events that have been PENDING for too long (relay may have died)
Instant staleThreshold = Instant.now().minus(Duration.ofMinutes(5));
List<OutboxEvent> staleEvents = outboxRepository
.findStalePendingEvents(staleThreshold, 50);
log.warn("Found {} stale outbox events, re-queuing", staleEvents.size());
// They will be picked up by the next polling cycle
}
}
public interface OutboxRepository extends JpaRepository<OutboxEvent, UUID> {
@Modifying
@Query("DELETE FROM OutboxEvent e WHERE e.status = 'PUBLISHED' AND e.publishedAt < :cutoff")
int deletePublishedBefore(@Param("cutoff") Instant cutoff);
@Query("SELECT e FROM OutboxEvent e WHERE e.status = 'PENDING' AND e.createdAt < :threshold ORDER BY e.createdAt ASC")
List<OutboxEvent> findStalePendingEvents(@Param("threshold") Instant threshold, Pageable page);
}
Production Observabilityβ
@Component
@RequiredArgsConstructor
public class OutboxMetrics {
private final MeterRegistry registry;
public void recordPublished(String eventType) {
registry.counter("outbox.published", "event_type", eventType).increment();
}
public void recordFailure(String eventType) {
registry.counter("outbox.publish_failed", "event_type", eventType).increment();
}
public void recordBatchDuration(long durationMs) {
registry.timer("outbox.relay.batch_duration").record(durationMs, TimeUnit.MILLISECONDS);
}
// Called from a scheduled job
public void recordPendingCount() {
long count = outboxRepository.countByStatus("PENDING");
registry.gauge("outbox.pending_count", count);
}
}
Key dashboards and alerts:
| Metric | Alert Threshold | Meaning |
|---|---|---|
outbox.pending_count | > 1,000 | Relay is falling behind or has died |
outbox.pending_count stable > 5 min | Any | Relay is stuck β check logs |
outbox.publish_failed rate | > 0 for > 2 min | Kafka broker connectivity issue |
outbox.relay.batch_duration p99 | > 2s | Kafka or DB performance degradation |
Events with retry_count > 5 | Any | Persistent failure β check event payload and topic config |
Summary Comparison: 2PC vs Saga vs Outboxβ
| Pattern | Consistency | Availability | Latency | Complexity | Best For |
|---|---|---|---|---|---|
| 2PC (XA) | Strong (ACID) | Low (blocking) | High (sync round trips) | Medium | Collocated JDBC resources, < 500 TPS |
| Saga (Orchestration) | Eventual | High | Low (async) | High | Complex multi-service workflows with compensation |
| Saga (Choreography) | Eventual | High | Low (async) | Medium | Simple linear event chains between loosely coupled services |
| Transactional Outbox | Eventual (at-least-once) | High | Very low (relay async) | LowβMedium | Reliable event publishing from any service |
Use the Outbox Pattern inside each Saga step to make event publishing within each step reliable. The Outbox solves "how do I reliably publish an event?". The Saga solves "how do I coordinate a multi-step workflow?". They work together.
Senior Interview Questionsβ
Q: The Transactional Outbox guarantees at-least-once delivery. How do downstream consumers handle duplicates?β
A: Consumers must be idempotent. For each event, the consumer checks if it has already been processed using the event's id or a business-level idempotency key. Common approaches:
INSERT INTO processed_events (event_id) ... ON CONFLICT DO NOTHINGβ DB-backed dedup.Redis SETNX event:{id}with a TTL β fast path check before processing.- Upsert the target entity using the aggregate ID β makes the operation naturally idempotent.
Q: What is a replication slot in PostgreSQL and why does Debezium need it?β
A: A replication slot is a cursor in PostgreSQL's WAL stream. It tracks how far a consumer has read, and prevents PostgreSQL from discarding WAL segments that the consumer hasn't yet processed. Debezium creates a slot and maintains its position (LSN β Log Sequence Number). If Debezium restarts, it resumes reading from its last LSN, guaranteeing no events are missed. The risk: if Debezium falls behind, PostgreSQL cannot discard old WAL β disk usage grows. Always monitor pg_replication_slots lag in production.
Q: What happens if the relay publishes an event to Kafka but then crashes before marking it as PUBLISHED in the database?β
A: On restart, the relay will find the event still in PENDING state and publish it again. Kafka will receive a duplicate. This is why both the relay (via SKIP LOCKED) and downstream consumers (via idempotency keys) must be designed for at-least-once semantics. The event's id should be used as the Kafka message key or included in the payload as the idempotency key.
Q: How would you handle outbox events for multi-tenant systems where each tenant has its own schema?β
A: Options:
- Single shared outbox table with a
tenant_idcolumn β simple but requires partitioning strategy. - Per-tenant outbox table with dynamic table routing in the relay β flexible but complex.
- Per-tenant Debezium connectors using
table.include.listper schema β fully isolated but operationally expensive at scale.
Most teams use option 1 with tenant_id as part of the partition key for Kafka, ensuring per-tenant ordering.
See Alsoβ
- Saga Pattern β Use Outbox inside each Saga step for reliable event publishing
- Two-Phase Commit (2PC) β Synchronous alternative to understand the trade-offs
- Change Data Capture (CDC) β The broader CDC pattern that Debezium implements
- Dead Letter Queue (DLQ) β Handling permanently failed relay events
- Scaling Reads β How outbox pattern / event-driven invalidation keeps caches consistent with database state
- Scaling Writes β Using async write pipelines and transactional deduplication to scale ingestion