Skip to main content

Distributed Systems


Fallacies of Distributed Computingโ€‹

Never assume:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Consensus Problemโ€‹

How do N nodes agree on a single value despite failures?

Used for: leader election, distributed commits, replicated state machines.

Paxos (Classic)โ€‹

Complex, hard to implement. Foundation of many systems.

Raft (Understandable)โ€‹

Used by: etcd, Consul, CockroachDB.

All nodes start as Followers
If no heartbeat received โ†’ become Candidate โ†’ request votes
Majority votes โ†’ become Leader
Leader sends heartbeats + log entries to Followers

Leader Election:

Follower timeout (150โ€“300ms) โ†’ RequestVote RPC
First to get majority wins
Split vote โ†’ timeout + retry with new term

Log Replication:

Client โ†’ Leader (append entry to log)
โ†’ Send to all Followers (AppendEntries RPC)
โ†’ Majority ACK โ†’ Mark committed
โ†’ Reply to client

Leader Election Patternsโ€‹

Zookeeper Ephemeral Nodesโ€‹

All candidates create ephemeral sequential znode: /election/node-000N
Node with smallest number = leader
On leader failure โ†’ znode deleted โ†’ next node becomes leader

Kubernetes โ€” Only one leader with Leaseโ€‹

# Leader election via lease object
# Spring Integration or custom via k8s client
// Spring Integration Leader Election
@Bean
public LeaderInitiator leaderInitiator(LockRegistry lockRegistry) {
return new LeaderInitiator(lockRegistry, new DefaultCandidate("my-service", "my-role"));
}

@EventListener
public void onLeadershipGranted(OnGrantedEvent event) {
log.info("This node is now the leader");
startLeaderOnlyTask();
}

@EventListener
public void onLeadershipRevoked(OnRevokedEvent event) {
log.info("Leadership revoked");
stopLeaderOnlyTask();
}

Vector Clocksโ€‹

Track causality across distributed systems without synchronized clocks.

Initial: A=[0,0,0], B=[0,0,0], C=[0,0,0]

A sends event: A=[1,0,0]
Aโ†’B message: B receives โ†’ B=[1,1,0]
Aโ†’C message: C receives โ†’ C=[1,0,1]
Bโ†’C message: C receives โ†’ C=[max(1,1), max(0,1), max(1,1)] = [1,1,1]

Causality: if VC(a) < VC(b) for all components โ†’ a happened-before b
if neither VC(a) < VC(b) nor VC(b) < VC(a) โ†’ concurrent

Used by: Amazon DynamoDB, Riak (for conflict detection).


Two-Phase Commit (2PC)โ€‹

Two-Phase Commit (2PC) is a distributed transaction protocol that ensures atomic commit across multiple participants.

For complete sequence diagrams, failure mode analyses (e.g., blocking coordinators, network partitions), and a comparison with Three-Phase Commit (3PC) and the Saga Pattern, see the dedicated Two-Phase Commit (2PC) & Three-Phase Commit (3PC) Guide and the Saga Pattern Guide.


CAP Theorem in Real Systemsโ€‹

CAP fundamentals are documented in Architecture Fundamentals. In practice, partition tolerance is non-negotiable, so production choices are usually CP vs AP by workload.

Applied Decision Guideโ€‹

  • Payments, inventory, ledger: lean CP for correctness on critical writes
  • Feeds, analytics, recommendations: lean AP for availability and low latency
  • Hybrid systems often expose CP writes and AP reads in different endpoints

Senior Tradeoff Exampleโ€‹

For N=3 replicas:

  • W=2, R=2 gives stronger read freshness but lower availability under partition
  • W=1, R=1 improves availability but increases stale-read risk

Distributed Locking Essentialsโ€‹

Beginner Viewโ€‹

Distributed locks ensure only one worker performs a critical operation at a time (for example, one scheduler instance runs a monthly billing job).

Senior Deep Diveโ€‹

Leases are safer than forever locks: each lock has TTL and requires renewal.

Critical safety concept: fencing tokens.

  • Lock service returns monotonically increasing token
  • Downstream storage accepts writes only from highest token
  • Prevents stale leader from writing after lease expiry
Worker A gets token 41 (lease expires)
Worker B gets token 42
If A wakes up late, storage rejects token 41 writes

See dedicated guide: Distributed Locking.


Beyond Crash Faults: BFT Overviewโ€‹

Raft/Paxos assume crash faults (nodes fail-stop). Byzantine Fault Tolerance (BFT) handles arbitrary or malicious behavior.

Senior Viewโ€‹

  • Crash fault model: typically 2f + 1 nodes tolerate f failures
  • Byzantine model: typically 3f + 1 nodes tolerate f Byzantine nodes
  • BFT adds communication rounds and signature overhead

Use BFT only when trust boundaries require it (multi-organization consensus, adversarial environments).

See dedicated guide: Advanced Consensus and BFT.


Gossip Protocolโ€‹

Nodes periodically share information with random peers. Information spreads like a virus.

Round 1: A knows X โ†’ A tells B, C
Round 2: B knows X โ†’ B tells D, E; C tells F, G
Round 3: All nodes know X

Properties:

  • Fault-tolerant (no central coordinator)
  • Eventually consistent
  • Used by: Cassandra (membership), Redis Cluster, Consul

Failure Detectorsโ€‹

Heartbeat + Timeoutโ€‹

Every 5s: Node A sends heartbeat to B
If B doesn't hear from A in 15s โ†’ A is suspected failed

Challenge: Cannot distinguish slow from dead (network partition vs node crash).

Phi Accrual Failure Detector (Cassandra)โ€‹

Instead of binary alive/dead, outputs a suspicion level ฯ† (phi):

  • ฯ† = 1: ~10% chance of failure
  • ฯ† = 10: ~99.99% chance of failure
  • Application sets threshold (e.g., ฯ† > 8 โ†’ mark suspect)

Consistency Patterns in Practiceโ€‹

Read-Your-Writes via Sticky Readsโ€‹

// After write, route subsequent reads to primary for N seconds
public User getUser(Long userId, String sessionToken) {
boolean recentWrite = recentWriteCache.contains(userId);
if (recentWrite) {
return primaryRepo.findById(userId); // Strong consistency
}
return replicaRepo.findById(userId); // Eventual consistency
}

Monotonic Read Consistencyโ€‹

Always read from the same replica in a session.

// Session affinity: bind user to replica by userId hash
public DataSource selectReplica(Long userId) {
int replicaIndex = (int)(userId % replicas.size());
return replicas.get(replicaIndex);
}

Distributed Transactions Comparisonโ€‹

ApproachAvailabilityConsistencyComplexity
2PCLow (blocking)StrongMedium
3PCMediumStrongHigh
Saga (Orchestration)HighEventualMedium
Saga (Choreography)HighEventualHigh (debugging)
TCC (Try-Confirm-Cancel)HighStrong (conceptually)High

Idempotency Keys (Distributed)โ€‹

// Distributed idempotency with Redis
public <T> T executeIdempotent(String key, Supplier<T> operation, Duration ttl) {
String result = redis.opsForValue().get("idem:" + key);
if (result != null) {
return deserialize(result, operationType);
}

T value = operation.get();

// SET NX (only if not exists) prevents race condition
redis.opsForValue().setIfAbsent("idem:" + key, serialize(value), ttl);
return value;
}

Network Partitions & Split-Brainโ€‹

Data Center A โ†โ”€โ”€ร—โ”€โ”€โ†’ Data Center B
(network cut)

A: "I'm the leader"
B: "I'm the leader"
โ†’ Both accept writes โ†’ divergent state (split-brain)

Solutions:

  • Quorum: Only side with majority can elect leader
  • Fencing: External authority invalidates old leader's token
  • Pause-minority: Smaller partition stops accepting writes

Interview Questionsโ€‹

Q: What is the consensus problem? What algorithms solve it?โ€‹

A: Consensus means multiple nodes agree on one sequence of decisions despite failures. Paxos, Raft, and various BFT protocols solve different fault/trust models.

Q: Explain Raft leader election in plain English.โ€‹

A: When followers stop hearing heartbeats, they start an election and request votes for a new term. A node winning majority becomes leader and starts sending heartbeats.

Q: What is a vector clock and how does it detect causal ordering?โ€‹

A: A vector clock tracks per-node event counters in updates. Comparing vectors shows whether one event happened-before another or if they are concurrent conflicts.

Q: What is Two-Phase Commit? What are its failure modes?โ€‹

A: Two-Phase Commit (2PC) coordinates distributed resource updates by voting to prepare before executing a commit. Its primary failure modes are coordinator failure (leaving participants blocked holding locks) and network partitions. See the Two-Phase Commit Guide for details.

Q: How does a gossip protocol work? What is it used for?โ€‹

A: Nodes periodically exchange state with random peers, and updates spread probabilistically through the cluster. It is used for membership, health signals, and configuration dissemination.

Q: What is split-brain syndrome and how do you prevent it?โ€‹

A: Split-brain is when partitions each believe they are primary and accept conflicting writes. Prevent it with quorum rules, fencing, and single-writer leadership.

Q: How do you build a distributed system that is available during a network partition?โ€‹

A: Favor AP behavior for selected operations, allow local writes, and reconcile conflicts later. Classify which paths can be eventually consistent versus requiring strong consistency.

Q: What is the difference between a failure detector and a consensus algorithm?โ€‹

A: Failure detector guesses node liveness; consensus establishes agreed decisions despite uncertain liveness. One provides signals, the other provides safety/ordering guarantees.

Q: How does Zookeeper achieve distributed coordination?โ€‹

A: ZooKeeper uses a quorum-backed ordered log and ephemeral/sequential znodes for locks, leader election, and config. Session semantics and watches enable reliable coordination workflows.

Q: What are the fallacies of distributed computing and why do they matter?โ€‹

A: They are false assumptions like "network is reliable" and "latency is zero." Ignoring them leads to fragile designs that fail under normal production conditions.