
Architecture Fundamentals

CAP Theorem

A distributed data store can guarantee at most 2 of these 3 properties at the same time:

Property | Description
Consistency | Every read receives the most recent write or an error
Availability | Every request receives a (non-error) response, without guarantee it's the most recent
Partition Tolerance | The system continues operating despite network partitions

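In practice, network partitions cannot be ruled out in a distributed system, so the real choice is what to give up while a partition lasts: availability (CP) or consistency (AP).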

System Type | Examples | Trade-off
CP | HBase, ZooKeeper, etcd | Returns an error or times out during a partition
AP | Cassandra, DynamoDB, CouchDB | May return stale data during a partition
CA (not realistic in distributed systems) | Traditional single-node RDBMS | Assumes no partition
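
The CP and AP rows can be illustrated with a toy model. This is a minimal sketch, not how any of the listed systems is implemented; the `Replica` class, its `mode` and `partitioned` fields, and `PartitionError` are hypothetical names chosen for illustration.

```python
class PartitionError(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest value."""


class Replica:
    """Toy replica (hypothetical): models only the read path during a partition."""

    def __init__(self, mode: str, value: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = value        # last value this replica has seen
        self.partitioned = False  # True when cut off from its peers

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency over availability: refuse rather than risk a stale answer.
            raise PartitionError("cannot reach a quorum")
        # AP replica (or a healthy CP replica): answer from local state.
        return self.value


cp = Replica("CP", "v1")
ap = Replica("AP", "v1")
cp.partitioned = ap.partitioned = True

print("AP replica answered:", ap.read())  # local state, possibly stale
try:
    cp.read()
except PartitionError as exc:
    print("CP replica refused:", exc)
```

The AP replica stays available but cannot know whether "v1" is still current; the CP replica gives up availability to avoid that risk.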

Consistency Models (Weakest → Strongest)

Eventual  →  Monotonic Read  →  Read-Your-Writes  →  Causal  →  Sequential  →  Linearizable  →  Strict

Model | Guarantee | Example
Eventual | Replicas converge eventually if no new writes arrive | DNS, shopping carts
Read-Your-Writes | You always see your own writes | User profile update
Causal | Causally related operations are seen in the same order by everyone | Comments/replies
Linearizable | Behaves as if there were a single copy of the data, with operations taking effect in real-time order | Bank balance

Availability Numbers

Availability | Downtime/year | Downtime/month
99% ("two nines") | 3.65 days | 7.3 hours
99.9% ("three nines") | 8.76 hours | 43.8 minutes
99.99% ("four nines") | 52.6 minutes | 4.4 minutes
99.999% ("five nines") | 5.26 minutes | 26 seconds
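
The downtime figures follow directly from the availability fraction. A minimal sketch, assuming a 365-day year and an average month of 365/12 days; the function and constant names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600         # assumption: 365-day year
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumption: average month


def downtime_seconds(availability: float, period_seconds: float) -> float:
    """Allowed downtime in seconds for an availability fraction such as 0.999."""
    return (1.0 - availability) * period_seconds


for a in (0.99, 0.999, 0.9999, 0.99999):
    per_year_h = downtime_seconds(a, SECONDS_PER_YEAR) / 3600
    per_month_min = downtime_seconds(a, SECONDS_PER_MONTH) / 60
    print(f"{a:.3%}: {per_year_h:8.2f} h/year, {per_month_min:8.2f} min/month")
```

Each extra nine divides the downtime budget by ten, which is why the jump from three to four nines usually requires automated failover rather than manual recovery.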

Availability in series (every component must be up, so the total is lower than the weakest link):

Total = SLA_A × SLA_B  →  0.999 × 0.999 ≈ 0.998 (99.8%)

Availability in parallel (redundancy, assuming replicas fail independently):

Total = 1 − (1 − SLA)^N  →  1 − (0.001)^2 = 0.999999 (99.9999%)
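
The series and parallel formulas above can be checked numerically. A minimal sketch; `series` and `parallel` are illustrative helper names, and the parallel case assumes failures are independent, which correlated outages (shared rack, shared deploy) violate in practice.

```python
from math import prod


def series(*slas: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(slas)


def parallel(sla: float, n: int) -> float:
    """At least one of n independent replicas must be up."""
    return 1.0 - (1.0 - sla) ** n


print(f"{series(0.999, 0.999):.4%}")         # 99.8001%
print(f"{parallel(0.999, 2):.4%}")           # 99.9999%
print(f"{series(0.999, 0.999, 0.999):.4%}")  # 99.7003%
```

The last line is the three-dependency case: three 99.9% services in series leave roughly 99.7% effective availability.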

Latency Reference Numbers

Operation | Latency
L1 cache reference | ~1 ns
Main memory reference | ~100 ns
SSD random read | ~100 µs
HDD seek | ~10 ms
Network RTT (same datacenter) | ~0.5 ms
Network RTT (cross-continent) | ~150 ms

Key Architectural Trade-offs

Latency vs Throughput

  • Latency: Time for one request to complete
  • Throughput: Requests processed per second
  • Caching can improve both; queuing and batching improve throughput at the cost of per-request latency

Stateful vs Stateless Services

Aspect | Stateless | Stateful
Scaling | Easy (add instances) | Hard (session affinity needed)
Failure recovery | Easy | Requires state replication
Examples | REST APIs, workers | WebSocket servers, databases

Synchronous vs Asynchronous

Aspect | Sync | Async
Simplicity | Higher | Lower
Coupling | Tight | Loose
Failure isolation | Lower | Higher
Use case | User-facing reads | Background processing

Data Partitioning Strategies

Horizontal Partitioning (Sharding)

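  • Range-based: user_id 1 to 1,000,000 → shard1, 1,000,001 to 2,000,000 → shard2. Risk of hot spots.
  • Hash-based: shard = hash(key) % N. Even distribution, but changing N remaps most keys, so rebalancing is hard.
  • Directory-based: A lookup service maps each key to a shard. Flexible, but the lookup service is a single point of failure unless it is replicated.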
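
The rebalancing cost of hash-based sharding can be measured directly. A minimal sketch, assuming SHA-256 as the stable hash and synthetic `user:<i>` keys; `shard_for` is an illustrative name, not an API of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here instead.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user:{i}" for i in range(10_000)]

# Count how many keys land on a different shard when growing from 4 to 5 shards.
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when N goes from 4 to 5")
```

With plain modulo placement, most keys change shard when N changes, which is the rebalancing cost noted in the list above.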

Vertical Partitioning

  • Split table columns: keep hot columns separate from cold columns
  • Example: user_core(id, name, email) + user_profile(id, bio, avatar, ...)

Replication Strategies

Strategy | Pros | Cons
Single-leader | Simple, strong consistency when reads go to the leader | Write bottleneck, failover complexity
Multi-leader | Geographic writes, higher write throughput | Conflict resolution needed
Leaderless (Dynamo-style) | High availability, no single point of failure | Eventual consistency, quorum complexity

Quorum (N replicas, W write acknowledgements, R read responses)

  • Strong consistency: W + R > N, so every read quorum overlaps every write quorum in at least one replica
  • Common: N=3, W=2, R=2
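
The W + R > N rule is a pigeonhole argument: if the two quorums together need more replicas than exist, they must share at least one. A minimal sketch that checks the rule against brute-force enumeration; both function names are illustrative.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when W + R > N, the condition from the list above."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Check every possible write set against every possible read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1), (5, 3, 3), (5, 2, 3)]:
    rule = quorums_overlap(n, w, r)
    assert rule == overlap_by_enumeration(n, w, r)
    print(f"N={n} W={w} R={r}: quorums overlap = {rule}")
```

N=3, W=1, R=1 is the fast but weak setting: a read may hit a replica that no acknowledged write has reached yet.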

Common Failure Modes

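  • Cascading failures: A failing or slow service overloads others (for example through retries or redistributed load), so the failure spreads
  • Split-brain: A network partition leaves two nodes each acting as leader
  • Thundering herd: A cache miss on a hot key causes N simultaneous DB hits
  • Head-of-line blocking: One slow request at the front of a queue delays everything behind it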
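
The thundering herd case can be reproduced and mitigated in a few lines. The sketch below uses request coalescing (sometimes called single-flight), which is one common mitigation and not something the list above prescribes; `CoalescingCache` and `slow_db_read` are hypothetical names, and the sleep stands in for a slow query.

```python
import threading
import time


class CoalescingCache:
    """Cache in which one caller loads a missing key while the others wait."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._lock = threading.Lock()  # guards _values and _key_locks

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check: another thread may have loaded the value while we waited.
            with self._lock:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)  # only one thread per key gets here
            with self._lock:
                self._values[key] = value
            return value


db_calls = []  # one entry per simulated database hit


def slow_db_read(key):
    db_calls.append(key)  # list.append is atomic in CPython
    time.sleep(0.05)      # simulate a slow query
    return f"value-for-{key}"


cache = CoalescingCache(slow_db_read)
threads = [threading.Thread(target=cache.get, args=("hot-key",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("concurrent requests:", len(threads), "database hits:", len(db_calls))
```

Without the per-key lock, all 20 threads would miss the cache at once and each would query the database.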

Interview Questions

  1. What is the CAP theorem and what trade-off do most modern databases make?
  2. Explain the difference between linearizability and eventual consistency. Give a use case for each.
  3. A service has 99.9% uptime. You depend on 3 such services in series. What's your effective uptime?
  4. When would you choose a CP system over an AP system?
  5. What is the difference between horizontal and vertical scaling?
  6. How does replication lag affect your system, and how do you mitigate it?