Saga Pattern (Distributed Workflows)
- New learners โ start at The Problem: Why Can't We Just Use a Database Transaction? and The Travel Booking Analogy to understand the core concept.
- Senior engineers โ jump to Saga State Machine Design, Idempotency & Exactly-Once Semantics, Choreography vs Orchestration Trade-offs, or Production Failure Handling.
The Problem: Why Can't We Just Use a Database Transaction?
In a monolith with one database, a @Transactional annotation is enough. If any step fails, the database rolls everything back โ atomically, durably, and consistently.
In a microservices architecture, each service owns its own database. There is no single database that spans all services. When an e-commerce checkout needs to:
- Create an order in
order-service(PostgreSQL) - Reserve stock in
inventory-service(MySQL) - Charge the customer in
payment-service(Stripe API) - Notify via
notification-service(SendGrid API)
...there is no BEGIN TRANSACTION that spans all four. If Step 3 (payment) fails after Steps 1 and 2 have succeeded, you have a partially committed operation โ the customer's stock is reserved and an order exists, but no payment has been taken. The system is now inconsistent.
The Travel Booking Analogy
Imagine booking a trip: flight + hotel + car rental. Each is booked with a different company.
- You book the flight โ
- You book the hotel โ
- You try to book the car rental โ โ no cars available at that price.
What do you do? You cancel the hotel and cancel the flight โ but you call each company individually to reverse the booking. There's no one transaction that rolls all three back.
This is exactly a Saga: a sequence of steps where each step has a compensating action that undoes its effect if a later step fails.
Saga Pattern Overview
A Saga decomposes a distributed transaction into a sequence of local ACID transactions โ one per service. Each local transaction:
- Updates its own database (local ACID guarantee)
- Publishes an event or sends a command to trigger the next step
T1 (Order Created) โโโบ T2 (Stock Reserved) โโโบ T3 (Payment Charged) โโโบ Complete โ
If a step fails, compensating transactions execute in reverse:
T1 (Order) โโโบ T2 (Inventory) โโโบ T3 (Payment FAIL โ)
โ
โโโ C2 (Release Stock) โโโโโโโโ
C1 (Cancel Order) โโโโโโโโโโโโโโโโโ
What Are Compensating Transactions?
A database rollback erases changes as if they never happened. A compensating transaction is a new, forward-moving operation that semantically reverses the effect of a previous step.
| Step | Forward Transaction | Compensation |
|---|---|---|
| T1: Create Order | INSERT INTO orders (status='PENDING') | UPDATE orders SET status='CANCELLED' |
| T2: Reserve Stock | UPDATE inventory SET reserved = reserved + 1 | UPDATE inventory SET reserved = reserved - 1 |
| T3: Charge Card | POST to Stripe /charges | POST to Stripe /refunds |
Notice that T1's compensation does not DELETE the order โ it marks it CANCELLED. This preserves the audit trail, which is essential for financial systems.
Saga Coordination Styles
There are two fundamentally different ways to coordinate a Saga:
| Choreography | Orchestration | |
|---|---|---|
| Control | Distributed โ each service decides what to do next | Centralized โ one orchestrator drives all steps |
| Communication | Events (async, reactive) | Commands (directed instructions) |
| Visibility | Low โ workflow is implicit in event chains | High โ workflow is explicit in orchestrator code |
| Coupling | Services coupled to event contracts | Services coupled to orchestrator |
| Failure handling | Complex โ each service must handle compensation events | Centralized โ orchestrator coordinates compensations |
| Best for | Simple, linear, well-understood flows | Complex, branching workflows with error recovery |
Choreography (Event-Driven)
In choreography, there is no central controller. Each service listens for events from other services and publishes events to trigger subsequent steps. The workflow emerges from how events propagate through the system.
Happy Path Code (Inventory Service โ Choreography)
@Component
@RequiredArgsConstructor
@Slf4j
public class InventoryChoreographer {
private final InventoryRepository inventoryRepository;
private final OutboxRepository outboxRepository; // use Outbox for reliable publishing
@KafkaListener(topics = "order-events", groupId = "inventory-service")
@Transactional // local ACID: reserve stock AND write outbox atomically
public void onOrderCreated(OrderCreatedEvent event) {
log.info("Reserving stock for order {}", event.getOrderId());
try {
inventoryRepository.reserve(event.getOrderId(), event.getItems());
// Write success event to outbox (same DB transaction)
outboxRepository.save(OutboxEvent.builder()
.aggregateId(event.getOrderId())
.eventType("StockReserved")
.payload(new StockReservedEvent(event.getOrderId(), event.getAmount()))
.build());
} catch (InsufficientStockException e) {
log.warn("Insufficient stock for order {}: {}", event.getOrderId(), e.getMessage());
// Write failure event to outbox (same DB transaction)
outboxRepository.save(OutboxEvent.builder()
.aggregateId(event.getOrderId())
.eventType("StockReservationFailed")
.payload(new StockReservationFailedEvent(event.getOrderId(), e.getMessage()))
.build());
}
}
// Compensation: triggered by PaymentFailed event
@KafkaListener(topics = "payment-events", groupId = "inventory-service")
@Transactional
public void onPaymentFailed(PaymentFailedEvent event) {
log.info("Releasing stock reservation for order {} due to payment failure", event.getOrderId());
inventoryRepository.release(event.getOrderId());
outboxRepository.save(OutboxEvent.builder()
.aggregateId(event.getOrderId())
.eventType("StockReleased")
.payload(new StockReleasedEvent(event.getOrderId()))
.build());
}
}
Pros of Choreography:
- No single point of failure (no central coordinator)
- Services are loosely coupled โ each only knows its own event contracts
- Naturally maps to event-driven architectures
Cons of Choreography:
- Workflow is invisible โ to understand the full flow you must read every service
- Compensation chains are hard to reason about โ compensation events must fan out correctly
- Cyclic dependency risk โ services can end up listening to each other's events in cycles
- Debugging requires distributed tracing across every service
Orchestration (Central Coordinator)
In orchestration, a dedicated Orchestrator service acts as the brain. It explicitly issues commands to each participant, waits for responses, and decides what to do next โ including triggering compensations.
Orchestrator Code Example
@Service
@RequiredArgsConstructor
@Slf4j
public class OrderSagaOrchestrator {
private final SagaStateRepository sagaStateRepository;
private final InventoryClient inventoryClient;
private final PaymentClient paymentClient;
private final OrderRepository orderRepository;
@Transactional
public void startSaga(String sagaId, CreateOrderCommand cmd) {
SagaState state = SagaState.builder()
.id(sagaId)
.status(SagaStatus.STARTED)
.orderId(cmd.getOrderId())
.build();
sagaStateRepository.save(state);
executeStep(sagaId, cmd);
}
@Transactional
public void executeStep(String sagaId, CreateOrderCommand cmd) {
SagaState state = sagaStateRepository.findById(sagaId).orElseThrow();
try {
switch (state.getStatus()) {
case STARTED -> {
log.info("[{}] Step 1: Reserving stock", sagaId);
inventoryClient.reserve(cmd.getOrderId(), cmd.getItems());
state.transition(SagaStatus.STOCK_RESERVED);
sagaStateRepository.save(state);
// Continue to next step
executeStep(sagaId, cmd);
}
case STOCK_RESERVED -> {
log.info("[{}] Step 2: Processing payment", sagaId);
paymentClient.charge(cmd.getOrderId(), cmd.getPaymentInfo());
state.transition(SagaStatus.PAYMENT_PROCESSED);
sagaStateRepository.save(state);
executeStep(sagaId, cmd);
}
case PAYMENT_PROCESSED -> {
log.info("[{}] Saga complete", sagaId);
orderRepository.markCompleted(cmd.getOrderId());
state.transition(SagaStatus.COMPLETED);
sagaStateRepository.save(state);
}
}
} catch (InventoryException e) {
log.error("[{}] Stock reservation failed: {}", sagaId, e.getMessage());
state.transition(SagaStatus.FAILED);
sagaStateRepository.save(state);
orderRepository.markFailed(cmd.getOrderId(), "Stock unavailable");
} catch (PaymentException e) {
log.error("[{}] Payment failed: {}. Starting compensation.", sagaId, e.getMessage());
state.transition(SagaStatus.COMPENSATING);
sagaStateRepository.save(state);
compensate(sagaId, cmd, e.getMessage());
}
}
@Transactional
private void compensate(String sagaId, CreateOrderCommand cmd, String reason) {
try {
log.info("[{}] Compensation Step 1: Releasing stock", sagaId);
inventoryClient.release(cmd.getOrderId(), cmd.getItems());
log.info("[{}] Compensation Step 2: Cancelling order", sagaId);
orderRepository.markFailed(cmd.getOrderId(), reason);
SagaState state = sagaStateRepository.findById(sagaId).orElseThrow();
state.transition(SagaStatus.CANCELLED);
sagaStateRepository.save(state);
log.info("[{}] Saga cancelled successfully", sagaId);
} catch (Exception compensationEx) {
log.error("[{}] COMPENSATION FAILED: {}", sagaId, compensationEx.getMessage());
SagaState state = sagaStateRepository.findById(sagaId).orElseThrow();
state.transition(SagaStatus.MANUAL_INTERVENTION_REQUIRED);
state.setFailureReason(compensationEx.getMessage());
sagaStateRepository.save(state);
// Alert operations team
}
}
}
Saga State Machine Design
A production saga must model its lifecycle as a formal state machine to prevent invalid transitions and enable replay/recovery.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SAGA STATE MACHINE โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
STARTED โโโบ STOCK_RESERVED โโโบ PAYMENT_PROCESSED โโโบ COMPLETED โ
โ โ
โ โโโโบ COMPENSATING โโโบ CANCELLED โ
โ โ
โโโโบ FAILED โ
โโโโบ MANUAL_INTERVENTION_REQUIRED โ ๏ธ
public enum SagaStatus {
STARTED,
STOCK_RESERVED,
PAYMENT_PROCESSED,
COMPLETED,
COMPENSATING,
CANCELLED,
FAILED,
MANUAL_INTERVENTION_REQUIRED;
public boolean isTerminal() {
return this == COMPLETED || this == CANCELLED
|| this == FAILED || this == MANUAL_INTERVENTION_REQUIRED;
}
public boolean isForwardTransitionValid(SagaStatus next) {
return switch (this) {
case STARTED -> next == STOCK_RESERVED || next == FAILED;
case STOCK_RESERVED -> next == PAYMENT_PROCESSED || next == COMPENSATING;
case PAYMENT_PROCESSED -> next == COMPLETED || next == COMPENSATING;
case COMPENSATING -> next == CANCELLED || next == MANUAL_INTERVENTION_REQUIRED;
default -> false; // terminal states cannot transition
};
}
}
@Entity
@Table(name = "saga_state")
@Data
@Builder
public class SagaState {
@Id private String id;
@Enumerated(EnumType.STRING)
private SagaStatus status;
private String orderId;
private String failureReason;
private int compensationAttempts;
private Instant createdAt;
private Instant updatedAt;
public void transition(SagaStatus next) {
if (!status.isForwardTransitionValid(next)) {
throw new InvalidSagaTransitionException(
"Cannot transition from " + status + " to " + next);
}
this.status = next;
this.updatedAt = Instant.now();
}
}
Why Store Saga State in a Database?
- Durability โ if the orchestrator crashes mid-saga, it can resume from the last known state on restart.
- Observability โ you can query all running sagas and their current step.
- Manual intervention โ operators can inspect and manually advance stuck sagas.
- Replay โ you can re-drive a saga from any intermediate state (e.g., after fixing a downstream service).
Idempotency & Exactly-Once Semantics
In a saga, each step may be retried (due to network errors, timeouts, or crashes). Every saga step must be idempotent โ executing it twice must produce the same result as executing it once.
Strategy 1: Idempotency Keys in Downstream Services
// Inventory service โ idempotent reserve endpoint
@PostMapping("/inventory/reserve")
public ResponseEntity<Void> reserve(@RequestBody ReserveCommand cmd) {
// Check if we already processed this saga step
if (reservationRepository.existsBySagaIdAndStep(cmd.getSagaId(), "reserve")) {
log.info("Duplicate reserve request for saga {}, ignoring", cmd.getSagaId());
return ResponseEntity.ok().build(); // idempotent โ return success
}
inventoryRepository.reserve(cmd.getOrderId(), cmd.getItems());
reservationRepository.save(new Reservation(cmd.getSagaId(), "reserve", cmd.getOrderId()));
return ResponseEntity.ok().build();
}
Strategy 2: Outbox Pattern for Event Publishing
When a service handles a saga step and publishes an event, these must be atomic โ either both happen or neither does:
@Transactional
public void reserveStock(ReserveCommand cmd) {
// 1. Reserve stock (business operation)
inventoryRepository.reserve(cmd.getOrderId(), cmd.getItems());
// 2. Write event to outbox IN THE SAME TRANSACTION
outboxRepository.save(OutboxEvent.of("StockReserved", cmd.getOrderId()));
// If this transaction commits: both stock reservation AND outbox event are durable
// If it fails: neither happens โ perfectly safe to retry
}
Without the Outbox Pattern, you risk: stock reserved but event never published โ saga hangs forever.
Choreography vs Orchestration Trade-offs
- Choreography
- Orchestration
When to use:
- Small number of services (2โ4) with linear, well-understood flows
- High autonomy is more important than visibility
- Event-driven architecture is already in place
Watch out for:
- Saga creep โ as requirements grow, the implicit workflow becomes increasingly hard to follow
- Compensation fan-out โ when T3 fails, it must publish a failure event that triggers C2 and C1. What if those compensation events are lost?
- Testing difficulty โ end-to-end testing requires running all services simultaneously
When to use:
- Complex, branching workflows (conditional steps, parallel steps)
- You need centralized visibility and auditability
- Compensations involve multiple services
- Regulatory/compliance requirements need a clear audit trail
Watch out for:
- Orchestrator as SPOF โ the orchestrator must be highly available (multi-instance with distributed locking or event sourcing)
- God service โ orchestrators can accumulate too much business logic; keep them as workflow coordinators, not business logic containers
- Tight coupling โ the orchestrator knows the full workflow; adding a new step requires modifying the orchestrator
Production Failure Handling
Exponential Backoff with Jitter for Step Retries
@Component
@RequiredArgsConstructor
@Slf4j
public class SagaStepRetryExecutor {
public <T> T executeWithRetry(String stepName, Supplier<T> step, int maxAttempts) {
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return step.get();
} catch (TransientException e) {
if (attempt == maxAttempts) {
throw new SagaStepExhaustedException(
"Step " + stepName + " failed after " + maxAttempts + " attempts", e);
}
long delayMs = calculateBackoff(attempt);
log.warn("Step {} attempt {}/{} failed. Retrying in {}ms",
stepName, attempt, maxAttempts, delayMs);
sleep(delayMs);
}
}
throw new IllegalStateException("Unreachable");
}
private long calculateBackoff(int attempt) {
long baseDelay = 500L; // 500ms base
long exponential = (long) Math.pow(2, attempt) * baseDelay; // 1s, 2s, 4s, 8s
long jitter = (long) (Math.random() * baseDelay); // ยฑ500ms jitter
return Math.min(exponential + jitter, 30_000L); // cap at 30s
}
}
Escalation Playbook
When compensation itself fails (e.g., the inventory service is down while you're trying to release a reservation):
Level 1 โ Automatic Retry (0โ5 minutes)
โโโ Exponential backoff with jitter: 1s, 2s, 4s, 8s, 16s
โโโ Log each retry attempt with sagaId, step, attempt number
Level 2 โ Dead Letter Queue (5 minutes)
โโโ Move failed saga to DLQ for isolated inspection
โโโ Prevent blocking healthy saga traffic
Level 3 โ Manual Intervention (ops alert)
โโโ Saga status: MANUAL_INTERVENTION_REQUIRED
โโโ PagerDuty/alerting fires
โโโ Admin dashboard shows saga state, failed step, error message
โโโ Operator can manually: retry a step, skip a step, force cancel
Level 4 โ Business Resolution
โโโ If compensation is truly impossible (e.g., payment was sent to bank):
create a manual financial adjustment record
involve finance/ops team for out-of-band resolution
@Service
@RequiredArgsConstructor
public class SagaEscalationService {
private final SagaStateRepository sagaStateRepository;
private final AlertingService alertingService;
private final DlqPublisher dlqPublisher;
public void escalate(String sagaId, String step, Exception cause) {
SagaState state = sagaStateRepository.findById(sagaId).orElseThrow();
int attempts = state.incrementCompensationAttempts();
if (attempts <= 5) {
// Level 1: automatic retry (handled by retry executor)
return;
}
if (attempts == 6) {
// Level 2: DLQ
dlqPublisher.publish(new FailedSagaMessage(sagaId, step, cause.getMessage()));
}
if (attempts >= 6) {
// Level 3: alert ops
state.transition(SagaStatus.MANUAL_INTERVENTION_REQUIRED);
state.setFailureReason(step + ": " + cause.getMessage());
sagaStateRepository.save(state);
alertingService.alert(Alert.critical(
"Saga compensation failed - manual intervention required",
Map.of("sagaId", sagaId, "step", step, "error", cause.getMessage())
));
}
}
}
Observability: Monitoring Sagas in Production
@Component
@RequiredArgsConstructor
public class SagaMetrics {
private final MeterRegistry registry;
public void recordSagaStarted(String sagaType) {
registry.counter("saga.started", "type", sagaType).increment();
}
public void recordSagaCompleted(String sagaType, Duration duration) {
registry.timer("saga.duration", "type", sagaType, "result", "success")
.record(duration);
}
public void recordSagaFailed(String sagaType, String step) {
registry.counter("saga.failed", "type", sagaType, "step", step).increment();
}
public void recordCompensationStarted(String sagaType) {
registry.counter("saga.compensation.started", "type", sagaType).increment();
}
public void recordManualIntervention(String sagaType, String step) {
registry.counter("saga.manual_intervention", "type", sagaType, "step", step)
.increment();
}
}
Key metrics to alert on:
| Metric | Alert When | What It Means |
|---|---|---|
saga.failed rate | > 1% of started sagas | Downstream service degradation or data issue |
saga.manual_intervention | Any occurrence | Critical โ requires immediate ops response |
saga.duration p99 | > 30s for typical flows | Downstream service slowdown or deadlock |
saga.compensation.started rate | > 0.5% of sagas | Payment/inventory failures increasing |
Sagas stuck in COMPENSATING > 5 min | Any | Compensation service unreachable |
Senior Interview Questions
Q: What is the difference between a saga and a distributed transaction?
A: A distributed transaction (like 2PC) achieves strong ACID atomicity across multiple databases through a synchronous locking protocol. All participants either commit or roll back together. A saga achieves eventual consistency โ each step commits locally, and failure is handled by compensating transactions (semantically reversing earlier steps). Sagas trade ACID guarantees for scalability, availability, and independence from distributed locking.
Q: Can a saga leave the system in a temporarily inconsistent state?
A: Yes, by design. While a saga is executing, the system is in an intermediate state (e.g., stock reserved but payment not yet confirmed). This is tolerated because the saga will either complete (reaching a consistent state) or fully compensate (returning to a consistent starting state). External observers may briefly see intermediate states โ this is why saga-based systems must carefully design their read models and exposure windows.
Q: How do you handle idempotency in saga steps?
A: Every saga step must be idempotent because retries are inevitable. Strategies include:
- Idempotency keys in the DB โ check if the step already executed for this sagaId before processing.
- Outbox Pattern โ write the event and business state change in one atomic local transaction, preventing the case where the event is published but the DB change is lost (or vice versa).
- Upsert semantics โ use
ON CONFLICT DO NOTHINGor similar to make the operation safe to repeat.
Q: What happens if the orchestrator crashes mid-saga?
A: If saga state is persisted durably (in a database), the orchestrator can recover by reading its current state on restart. A recovery job queries all sagas in non-terminal states and re-drives them from their last persisted step. This is why sagaStateRepository.save(state) must happen before each network call to a downstream service โ ensuring the state is durable before the action is taken.
Q: How do you design saga compensations for operations that can't be reversed?
A: For truly irreversible operations (e.g., a payment was sent to a bank, an email was delivered, a 3PL warehouse shipped a parcel):
- Don't compensate technically โ create a business-level adjustment record (e.g., a refund record, a return request).
- Pivot the saga state to
MANUAL_INTERVENTION_REQUIREDand alert operations. - Document in the system which operations are "pivot transactions" (steps after which compensation is impossible) and architect sagas to place them at the end.
See Also
- Two-Phase Commit (2PC) โ Synchronous alternative; understand why sagas exist
- Transactional Outbox Pattern โ Reliable event publishing inside saga steps
- Dead Letter Queue (DLQ) โ Handling permanently failed saga messages
- Microservices Patterns โ Broader microservices architecture context