Distributed Systems
Fallacies of Distributed Computingโ
Never assume:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Consensus Problemโ
How do N nodes agree on a single value despite failures?
Used for: leader election, distributed commits, replicated state machines.
Paxos (Classic)โ
Complex, hard to implement. Foundation of many systems.
Raft (Understandable)โ
Used by: etcd, Consul, CockroachDB.
All nodes start as Followers
If no heartbeat received โ become Candidate โ request votes
Majority votes โ become Leader
Leader sends heartbeats + log entries to Followers
Leader Election:
Follower timeout (150โ300ms) โ RequestVote RPC
First to get majority wins
Split vote โ timeout + retry with new term
Log Replication:
Client โ Leader (append entry to log)
โ Send to all Followers (AppendEntries RPC)
โ Majority ACK โ Mark committed
โ Reply to client
Leader Election Patternsโ
Zookeeper Ephemeral Nodesโ
All candidates create ephemeral sequential znode: /election/node-000N
Node with smallest number = leader
On leader failure โ znode deleted โ next node becomes leader
Kubernetes โ Only one leader with Leaseโ
# Leader election via lease object
# Spring Integration or custom via k8s client
// Spring Integration Leader Election
@Bean
public LeaderInitiator leaderInitiator(LockRegistry lockRegistry) {
return new LeaderInitiator(lockRegistry, new DefaultCandidate("my-service", "my-role"));
}
@EventListener
public void onLeadershipGranted(OnGrantedEvent event) {
log.info("This node is now the leader");
startLeaderOnlyTask();
}
@EventListener
public void onLeadershipRevoked(OnRevokedEvent event) {
log.info("Leadership revoked");
stopLeaderOnlyTask();
}
Vector Clocksโ
Track causality across distributed systems without synchronized clocks.
Initial: A=[0,0,0], B=[0,0,0], C=[0,0,0]
A sends event: A=[1,0,0]
AโB message: B receives โ B=[1,1,0]
AโC message: C receives โ C=[1,0,1]
BโC message: C receives โ C=[max(1,1), max(0,1), max(1,1)] = [1,1,1]
Causality: if VC(a) < VC(b) for all components โ a happened-before b
if neither VC(a) < VC(b) nor VC(b) < VC(a) โ concurrent
Used by: Amazon DynamoDB, Riak (for conflict detection).
Two-Phase Commit (2PC)โ
Two-Phase Commit (2PC) is a distributed transaction protocol that ensures atomic commit across multiple participants.
For complete sequence diagrams, failure mode analyses (e.g., blocking coordinators, network partitions), and a comparison with Three-Phase Commit (3PC) and the Saga Pattern, see the dedicated Two-Phase Commit (2PC) & Three-Phase Commit (3PC) Guide and the Saga Pattern Guide.
CAP Theorem in Real Systemsโ
CAP fundamentals are documented in Architecture Fundamentals. In practice, partition tolerance is non-negotiable, so production choices are usually CP vs AP by workload.
Applied Decision Guideโ
- Payments, inventory, ledger: lean CP for correctness on critical writes
- Feeds, analytics, recommendations: lean AP for availability and low latency
- Hybrid systems often expose CP writes and AP reads in different endpoints
Senior Tradeoff Exampleโ
For N=3 replicas:
W=2, R=2gives stronger read freshness but lower availability under partitionW=1, R=1improves availability but increases stale-read risk
Distributed Locking Essentialsโ
Beginner Viewโ
Distributed locks ensure only one worker performs a critical operation at a time (for example, one scheduler instance runs a monthly billing job).
Senior Deep Diveโ
Leases are safer than forever locks: each lock has TTL and requires renewal.
Critical safety concept: fencing tokens.
- Lock service returns monotonically increasing token
- Downstream storage accepts writes only from highest token
- Prevents stale leader from writing after lease expiry
Worker A gets token 41 (lease expires)
Worker B gets token 42
If A wakes up late, storage rejects token 41 writes
See dedicated guide: Distributed Locking.
Beyond Crash Faults: BFT Overviewโ
Raft/Paxos assume crash faults (nodes fail-stop). Byzantine Fault Tolerance (BFT) handles arbitrary or malicious behavior.
Senior Viewโ
- Crash fault model: typically
2f + 1nodes tolerateffailures - Byzantine model: typically
3f + 1nodes toleratefByzantine nodes - BFT adds communication rounds and signature overhead
Use BFT only when trust boundaries require it (multi-organization consensus, adversarial environments).
See dedicated guide: Advanced Consensus and BFT.
Gossip Protocolโ
Nodes periodically share information with random peers. Information spreads like a virus.
Round 1: A knows X โ A tells B, C
Round 2: B knows X โ B tells D, E; C tells F, G
Round 3: All nodes know X
Properties:
- Fault-tolerant (no central coordinator)
- Eventually consistent
- Used by: Cassandra (membership), Redis Cluster, Consul
Failure Detectorsโ
Heartbeat + Timeoutโ
Every 5s: Node A sends heartbeat to B
If B doesn't hear from A in 15s โ A is suspected failed
Challenge: Cannot distinguish slow from dead (network partition vs node crash).
Phi Accrual Failure Detector (Cassandra)โ
Instead of binary alive/dead, outputs a suspicion level ฯ (phi):
- ฯ = 1: ~10% chance of failure
- ฯ = 10: ~99.99% chance of failure
- Application sets threshold (e.g., ฯ > 8 โ mark suspect)
Consistency Patterns in Practiceโ
Read-Your-Writes via Sticky Readsโ
// After write, route subsequent reads to primary for N seconds
public User getUser(Long userId, String sessionToken) {
boolean recentWrite = recentWriteCache.contains(userId);
if (recentWrite) {
return primaryRepo.findById(userId); // Strong consistency
}
return replicaRepo.findById(userId); // Eventual consistency
}
Monotonic Read Consistencyโ
Always read from the same replica in a session.
// Session affinity: bind user to replica by userId hash
public DataSource selectReplica(Long userId) {
int replicaIndex = (int)(userId % replicas.size());
return replicas.get(replicaIndex);
}
Distributed Transactions Comparisonโ
| Approach | Availability | Consistency | Complexity |
|---|---|---|---|
| 2PC | Low (blocking) | Strong | Medium |
| 3PC | Medium | Strong | High |
| Saga (Orchestration) | High | Eventual | Medium |
| Saga (Choreography) | High | Eventual | High (debugging) |
| TCC (Try-Confirm-Cancel) | High | Strong (conceptually) | High |
Idempotency Keys (Distributed)โ
// Distributed idempotency with Redis
public <T> T executeIdempotent(String key, Supplier<T> operation, Duration ttl) {
String result = redis.opsForValue().get("idem:" + key);
if (result != null) {
return deserialize(result, operationType);
}
T value = operation.get();
// SET NX (only if not exists) prevents race condition
redis.opsForValue().setIfAbsent("idem:" + key, serialize(value), ttl);
return value;
}
Network Partitions & Split-Brainโ
Data Center A โโโรโโโ Data Center B
(network cut)
A: "I'm the leader"
B: "I'm the leader"
โ Both accept writes โ divergent state (split-brain)
Solutions:
- Quorum: Only side with majority can elect leader
- Fencing: External authority invalidates old leader's token
- Pause-minority: Smaller partition stops accepting writes
Interview Questionsโ
Q: What is the consensus problem? What algorithms solve it?โ
A: Consensus means multiple nodes agree on one sequence of decisions despite failures. Paxos, Raft, and various BFT protocols solve different fault/trust models.
Q: Explain Raft leader election in plain English.โ
A: When followers stop hearing heartbeats, they start an election and request votes for a new term. A node winning majority becomes leader and starts sending heartbeats.
Q: What is a vector clock and how does it detect causal ordering?โ
A: A vector clock tracks per-node event counters in updates. Comparing vectors shows whether one event happened-before another or if they are concurrent conflicts.
Q: What is Two-Phase Commit? What are its failure modes?โ
A: Two-Phase Commit (2PC) coordinates distributed resource updates by voting to prepare before executing a commit. Its primary failure modes are coordinator failure (leaving participants blocked holding locks) and network partitions. See the Two-Phase Commit Guide for details.
Q: How does a gossip protocol work? What is it used for?โ
A: Nodes periodically exchange state with random peers, and updates spread probabilistically through the cluster. It is used for membership, health signals, and configuration dissemination.
Q: What is split-brain syndrome and how do you prevent it?โ
A: Split-brain is when partitions each believe they are primary and accept conflicting writes. Prevent it with quorum rules, fencing, and single-writer leadership.
Q: How do you build a distributed system that is available during a network partition?โ
A: Favor AP behavior for selected operations, allow local writes, and reconcile conflicts later. Classify which paths can be eventually consistent versus requiring strong consistency.
Q: What is the difference between a failure detector and a consensus algorithm?โ
A: Failure detector guesses node liveness; consensus establishes agreed decisions despite uncertain liveness. One provides signals, the other provides safety/ordering guarantees.
Q: How does Zookeeper achieve distributed coordination?โ
A: ZooKeeper uses a quorum-backed ordered log and ephemeral/sequential znodes for locks, leader election, and config. Session semantics and watches enable reliable coordination workflows.
Q: What are the fallacies of distributed computing and why do they matter?โ
A: They are false assumptions like "network is reliable" and "latency is zero." Ignoring them leads to fragile designs that fail under normal production conditions.