Two-Phase Commit (2PC) & Three-Phase Commit (3PC)
- New learners β start at The Problem: Why Is This Hard? and The Restaurant Analogy to build intuition before looking at how 2PC works.
- Senior engineers β jump to 2PC Internals & WAL, Failure Modes, XA Transactions, or When to Use 2PC vs Alternatives.
The Problem: Why Is This Hard?β
Imagine you are building an e-commerce checkout service. When a customer places an order, your system needs to:
- Debit
$100from the customer's payment database (Postgres on server A) - Reserve
1 unitof stock in the inventory database (MySQL on server B) - Create an order record in the orders database (Postgres on server C)
If you do these three steps in sequence and the second step crashes, you've charged the customer's card but not reserved stock. You now have inconsistency across your system boundaries.
A single local database transaction doesn't help here β you have three separate databases that each know nothing about each other.
:::note The Core Challenge
A local BEGIN; ...; COMMIT; gives you ACID guarantees within one database. But there is no built-in mechanism to make that atomicity span across multiple, independent databases.
:::
The Restaurant Analogyβ
Think of 2PC like a restaurant hosting a group dinner. The event coordinator (Coordinator) calls every table captain (Participant) before the dinner starts:
Phase 1 β Calling Around (Prepare):
"Can everyone commit to attending the dinner this Saturday?" Each captain checks with their table and says "Yes, we'll be there" (and marks it in their calendar), or "No, sorry, conflicts".
Phase 2 β Final Confirmation (Commit or Cancel):
If everyone said yes β The coordinator sends the final confirmation: "Dinner is on! See you Saturday." Each captain now makes it official. If anyone said no β The coordinator calls everyone back: "Dinner is cancelled." Everyone erases the reservation.
The key insight: No one actually shows up until everyone has confirmed. This is atomic commitment across multiple parties.
Two-Phase Commit (2PC)β
Two-Phase Commit (2PC) is a synchronous protocol designed to achieve atomic transaction commits across multiple independent resources such as database nodes or message brokers.
The protocol relies on a central Coordinator and multiple Participants.
Mechanics of 2PCβ
Phase 1: Prepare (Voting Round)β
- The coordinator sends a
PREPAREmessage to all participants along with the transaction ID. - Each participant performs all transaction work up to the commit boundary:
- Writes undo log entries (to roll back if aborted).
- Acquires all necessary row/table locks.
- Flushes changes to its Write-Ahead Log (WAL) on durable storage.
- Participants vote:
- VOTE YES β "I have done my work, acquired locks, and flushed to durable storage. I am ready to commit."
- VOTE NO β "I encountered an error (constraint violation, timeout, disk failure). Abort the transaction."
Once a participant votes YES, it is obligated to commit if the coordinator later sends a COMMIT. It cannot unilaterally abort. This is what makes 2PC blocking.
Phase 2: Commit or Abortβ
Happy path (all votes YES):
- Coordinator writes the COMMIT decision to its own transaction log first (durable write β this is the point of no return).
- Coordinator broadcasts
COMMITto all participants. - Each participant applies the pending changes, releases locks, writes a commit record, and sends an
ACK.
Abort path (any vote NO or timeout):
- Coordinator writes an
ABORTdecision to its log. - Coordinator broadcasts
ABORTto all participants. - Each participant uses its undo log to roll back changes, releases locks, and sends an
ACK.
2PC Internals & WALβ
What happens inside a Participant during Prepare?β
Participant WAL during Phase 1:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β [LSN 1001] BEGIN txn=T1 β
β [LSN 1002] UPDATE accounts SET balance=balance-100 β
β WHERE id=42 (old_val=500, new_val=400) β
β [LSN 1003] PREPARE txn=T1 xid=2PC-global-id β
β β WAL fsync() called here β
β β Row locks held, transaction "in-doubt" β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The participant is now in an in-doubt state. It holds locks and has durable undo data, but hasn't committed or rolled back. It waits for the coordinator's Phase 2 decision.
Coordinator Transaction Logβ
The coordinator's log is the single most critical piece of state:
Coordinator Log:
[T1] STARTED β 2025-01-15 10:00:00.001
[T1] PREPARED β participants=[DB-A, DB-B, DB-C]
[T1] ALL VOTED YES β 10:00:00.050
[T1] DECISION: COMMIT β durable fsync here (point of no return)
[T1] COMMIT SENT to DB-A β 10:00:00.055
[T1] COMMIT SENT to DB-B β 10:00:00.057
[T1] ACK from DB-A β 10:00:00.060
[T1] ACK from DB-B β 10:00:00.062
[T1] COMPLETED
The DECISION: COMMIT log entry is the point of no return. Once this is written durably, the coordinator will retry sending COMMIT messages until all participants acknowledge β even across coordinator restarts.
Failure Modes In-Depthβ
1. Participant Crashes Before Votingβ
[Coordinator] -PREPAREβ [DB-A] β VOTE YES
[Coordinator] -PREPAREβ [DB-B] π₯ (DB-B crashes before responding)
Coordinator action: Wait for timeout β ABORT
DB-A action: Receives ABORT β rolls back β releases locks β
DB-B action: On restart, sees no PREPARE in WAL β no-op β
Result: Clean abort. No inconsistency.
2. Participant Crashes After Voting YES (The Blocking Problem)β
[DB-A] sent VOTE YES β holds locks β waits...
[DB-A] π₯ crashes mid-wait
On recovery, DB-A sees its WAL:
[PREPARE txn=T1] β in-doubt state
No COMMIT or ABORT record
DB-A MUST contact the coordinator to learn the decision.
Until it does, DB-A holds its locks β potentially blocking all readers/writers.
How PostgreSQL handles this: Uses pg_prepared_xacts to persist in-doubt transactions across restarts. A recovery process queries the coordinator for the decision.
3. Coordinator Crashes After Writing COMMIT Decisionβ
Coordinator log: [T1] DECISION: COMMIT β written durably
Coordinator π₯ crashes before sending COMMIT to participants
Participants DB-A, DB-B: Still in in-doubt state, holding locks β
On coordinator restart:
Recovery reads log: sees DECISION: COMMIT for T1
Re-sends COMMIT to all participants
Participants check their WAL β see PREPARE β apply COMMIT β
Result: Eventually consistent, but participants are blocked for the duration of the coordinator's recovery window (seconds to minutes).
4. Network Partition Between Coordinator and Participantsβ
[Coordinator] sent COMMIT to DB-A β
[Coordinator] βββ network partition βββ [DB-B]
DB-A: Committed β
DB-B: Still in-doubt, holding locks β (can never resolve without coordinator)
This is the fundamental unsolvability of 2PC under network partitions. A participant that cannot reach the coordinator is permanently blocked β it cannot unilaterally commit or abort, because:
- If it commits and the decision was ABORT β inconsistency.
- If it aborts and the decision was COMMIT β inconsistency.
Spring Boot 2PC Code Representationβ
A logical coordinator managing participants in a 2PC workflow:
@Service
@RequiredArgsConstructor
@Slf4j
public class TwoPhaseCommitCoordinator {
private final List<TransactionParticipant> participants;
private final CoordinatorLogRepository coordinatorLog;
public void executeTransaction(String txnId, List<Operation> operations) {
List<TransactionParticipant> preparedParticipants = new ArrayList<>();
coordinatorLog.record(txnId, "STARTED");
try {
// βββ Phase 1: Prepare βββββββββββββββββββββββββββββββββββββββ
for (Operation op : operations) {
TransactionParticipant participant = getParticipant(op);
log.info("Sending PREPARE to {} for txn {}", participant.getName(), txnId);
if (!participant.prepare(txnId, op)) {
throw new PrepareFailedException(
"Participant " + participant.getName() + " voted NO");
}
preparedParticipants.add(participant);
}
// βββ All voted YES: write durable commit decision ββββββββββ
coordinatorLog.record(txnId, "DECISION:COMMIT"); // point of no return
// βββ Phase 2: Commit ββββββββββββββββββββββββββββββββββββββββ
for (TransactionParticipant participant : preparedParticipants) {
retryUntilSuccess(() -> participant.commit(txnId)); // must commit
}
coordinatorLog.record(txnId, "COMPLETED");
} catch (PrepareFailedException e) {
coordinatorLog.record(txnId, "DECISION:ABORT");
// Abort all that voted YES
for (TransactionParticipant prepared : preparedParticipants) {
try {
prepared.abort(txnId);
} catch (Exception abortEx) {
log.error("ABORT failed for {}, txn {}. Manual intervention required.",
prepared.getName(), txnId, abortEx);
// In production: alert ops team, update monitoring
}
}
throw new TransactionException("Transaction " + txnId + " aborted", e);
}
}
private void retryUntilSuccess(Runnable action) {
// In production: retry with exponential backoff
// The commit MUST eventually reach every participant
action.run();
}
}
Limitations of 2PCβ
| Limitation | Description | Impact |
|---|---|---|
| Blocking Protocol | Participants hold locks from Phase 1 until they receive the Phase 2 decision | Reduces throughput, increases tail latency, causes lock amplification under load |
| Coordinator SPOF | Coordinator crash after writing COMMIT but before notifying participants leaves them in-doubt holding locks | Requires coordinator HA (clustering/leader election) |
| Network Partition | A partitioned participant cannot safely commit or abort without the coordinator's decision | Indefinite blocking β violates availability |
| Not Cloud-Native | 2PC requires persistent, stateful connections and XA drivers that don't work with HTTP-based microservices or cloud databases | Only viable for collocated JDBC resources |
| Performance | Two synchronous network round trips + two durable fsyncs per participant for every transaction | Practical ceiling of ~100β500 TPS per coordinator |
XA Transactionsβ
The XA Standard (X/Open DTP) is the industry specification for implementing 2PC across heterogeneous resources. It defines:
- XA Resource β any system that can participate in a distributed transaction (databases, message brokers).
- Transaction Manager (TM) β the coordinator that manages the XA lifecycle.
- Resource Manager (RM) β each participant (e.g., PostgreSQL, ActiveMQ).
Java XA with Spring Boot (JTA/Atomikos)β
// application.yml β using Atomikos JTA transaction manager
spring:
jta:
enabled: true
atomikos:
datasource:
primary:
xa-data-source-class-name: org.postgresql.xa.PGXADataSource
xa-properties:
serverName: primary-db
databaseName: orders
secondary:
xa-data-source-class-name: com.mysql.cj.jdbc.MysqlXADataSource
xa-properties:
serverName: inventory-db
databaseName: inventory
@Service
@Transactional // JTA @Transactional β spans BOTH datasources
public class OrderService {
@Autowired private OrderRepository orderRepo; // DataSource: primary-db
@Autowired private InventoryRepository inventoryRepo; // DataSource: secondary-db
public Order placeOrder(CreateOrderCommand cmd) {
// Both of these writes are wrapped in a single XA transaction
// Atomikos will coordinate 2PC between PostgreSQL and MySQL
Order order = orderRepo.save(Order.from(cmd));
inventoryRepo.reserve(cmd.getProductId(), cmd.getQuantity());
return order;
// On method exit: Atomikos calls xa.prepare() on both DBs,
// then xa.commit() if both return XA_OK
}
}
XA transactions are notoriously slow and operationally complex. The prepare() β commit() round trips add 5β20ms per transaction. Most modern architectures avoid XA entirely in favor of the Saga Pattern or Transactional Outbox Pattern.
Three-Phase Commit (3PC)β
Three-Phase Commit (3PC) is an evolution of 2PC that adds a pre-commit phase to eliminate the blocking problem under coordinator failure.
Phase 1 Phase 2 Phase 3
[CanCommit] βββΊ [PreCommit] βββΊ [DoCommit]
(Voting) (Lock + Promise) (Final Apply)
How 3PC Eliminates Blockingβ
The key insight: 3PC allows participants to make a unilateral decision on timeout:
| Phase | Coordinator Crashes | Participant Action on Timeout |
|---|---|---|
| After CanCommit | Before PreCommit sent | Participants can safely ABORT (nobody is locked yet) |
| After PreCommit | Before DoCommit sent | Participants can safely COMMIT (they know everyone acknowledged PreCommit) |
This is possible because PreCommit is a quorum acknowledgment β every participant that receives PreCommit knows every other participant also received it. So committing unilaterally is safe.
3PC Sequenceβ
Why 3PC is Not Used in Practiceβ
- Still not partition-tolerant. In the event of a network partition during PreCommit, participants on different sides of the partition can independently decide differently (one aborts, one commits) β leading to inconsistency.
- Three round trips instead of two β significantly higher tail latency.
- Complexity β implementation is far more complex with little practical benefit given that Paxos/Raft-based consensus is now available for coordinator HA.
Verdict: 3PC is a theoretical improvement. Almost no production database systems implement 3PC. The modern solution to coordinator SPOF is making the coordinator itself highly available via Paxos/Raft (as Google Spanner does with its TrueTime-based coordinators).
When to Use 2PC vs Alternativesβ
| Situation | Recommendation | Reason |
|---|---|---|
| Two JDBC datasources in the same JVM, < 200 TPS | 2PC (XA via Atomikos/Narayana) | Simplest correct solution for collocated resources |
| Microservices with independent HTTP-based APIs | Saga Pattern | 2PC requires stateful connections; HTTP is stateless |
| High-throughput writes (> 1000 TPS) | Outbox Pattern + CDC | 2PC blocking is prohibitive at scale |
| Cross-database event publishing | Transactional Outbox | Atomic write to event store inside local transaction |
| Financial ledger with strict ACID guarantees | 2PC or Google Spanner | Spanner gives 2PC semantics with distributed coordinator HA |
Senior Interview Questionsβ
Q: A participant has voted YES in Phase 1. The coordinator then crashes. What happens?β
A: The participant is in an in-doubt state. It holds its locks and cannot proceed. It must periodically poll the coordinator (or a designated recovery coordinator) to learn the decision. Modern databases like PostgreSQL expose pg_prepared_xacts for exactly this purpose. During the outage window, any thread needing those locked rows will block β potentially indefinitely. This is why 2PC is not suitable for high-contention or long-lived transactions.
Q: What is the "point of no return" in 2PC and why does it matter?β
A: The point of no return is when the coordinator durably writes the COMMIT decision to its own transaction log. Before this, the coordinator can safely abort. After this, the coordinator is obligated to drive every participant to commit β it will retry forever if necessary. The durability of this log entry is what makes the protocol correct even under coordinator crashes.
Q: How does XA differ from logical 2PC in application code?β
A: XA is a standardized protocol (X/Open DTP) that provides a C API (xa_prepare, xa_commit, xa_rollback) implemented by each Resource Manager (database driver). A JTA Transaction Manager (Atomikos, Narayana) acts as the coordinator and calls these APIs. Logical 2PC (as in the application-level code above) is a custom implementation that mirrors the same concepts without using the XA standard β useful for resources that don't support XA.
Q: What is lock amplification in 2PC and how does it impact production systems?β
A: Lock amplification means that a single distributed transaction holds locks in multiple databases simultaneously from the end of Phase 1 until all Phase 2 ACKs are received. If you have 10 operations, you're holding 10 locks across 10 participants for the entire Phase 2 round trip time. Under concurrent load, this causes lock waits to cascade β transactions waiting for Phase 2 to complete block other transactions, increasing average latency non-linearly. At high TPS, a single slow Phase 2 participant can back-pressure the entire system.
See Alsoβ
- Saga Pattern β Eventual consistency alternative to 2PC for microservices
- Transactional Outbox Pattern β Reliable event publishing without distributed transactions
- Data Consistency Deep Dive β Isolation levels, MVCC, WAL internals