Skip to main content

Chapter 1: Reliable, Scalable, and Maintainable Applications

The Big Idea

Most modern applications are data-intensive, not compute-intensive. The bottleneck is rarely the CPU — it's the amount of data, how fast it changes, and how complex it is to manage. Systems are built from well-known building blocks:

  • Databases — store and retrieve data
  • Caches — remember expensive computations
  • Search indexes — query data by keyword or filter
  • Stream processors — handle async messaging between services
  • Batch processors — crunch large amounts of accumulated data

The challenge is knowing how to combine these tools reliably and efficiently. This chapter introduces the three properties that every data system must balance.


🔒 Reliability

A system is reliable if it continues to work correctly even when things go wrong.

"Things going wrong" are called faults. A system that handles faults gracefully is fault-tolerant (or resilient). Note the distinction:

  • A fault is a component deviating from spec
  • A failure is the whole system stopping to provide service

We want to tolerate faults before they cause failures.

Types of Faults

1. Hardware Faults

Hard disks crash, RAM fails, power cuts out. In a data center with thousands of machines, hardware faults are a daily occurrence.

Mitigation:

  • RAID disk configuration
  • Dual power supplies and hot-swap CPUs
  • Full machine redundancy (rolling upgrades)

Cloud platforms like AWS are designed to prioritize flexibility and elasticity over single-machine reliability, so multi-machine redundancy is the norm.

2. Software Errors

These are harder to predict and tend to cause correlated failures — many nodes crash at the same time.

Examples:

  • A bug that crashes every instance when a specific input arrives (e.g., Linux kernel bug triggered by June 30 leap second)
  • A runaway process consuming shared resources (CPU, memory, disk)
  • A cascading failure where service A depends on service B which is slow

Mitigation: Careful thinking about assumptions, thorough testing, process isolation, crash-and-restart, monitoring, and alerting.

3. Human Errors

Studies show operator mistakes are the leading cause of outages — not hardware.

Mitigation:

  • Design APIs and admin interfaces that make it hard to do the wrong thing
  • Decouple environments: sandbox for experimentation, production with real data
  • Allow quick recovery: easy rollback, gradual rollout (canary deployments)
  • Detailed monitoring: performance metrics and error rates (observability)

How Important Is Reliability?

Even non-critical applications must be reliable. If an e-commerce site goes down for 10 minutes, you lose sales and customer trust. "Move fast and break things" is a product philosophy — not a systems engineering one.


📈 Scalability

Scalability is a system's ability to cope with increased load.

Scalability is not a yes/no property. You ask: "If the system grows in X way, what are our options for handling that growth?"

Describing Load: Load Parameters

Load is described using load parameters — numbers that characterize how hard the system is working. Examples:

  • Requests per second to a web server
  • Ratio of reads to writes in a database
  • Number of active users in a chat room
  • Cache hit rate

Twitter example: The hard part of Twitter's architecture is the fan-out problem. When a user with 30 million followers posts a tweet, do you:

  1. Write once, read dynamically (query followers on read)? — cheap writes, expensive reads
  2. Pre-compute ("fan-out on write") to each follower's home timeline cache? — expensive writes, cheap reads

Twitter settled on approach 2 for most users, with approach 1 for high-follower accounts.

Describing Performance

Once you've described load, you can ask:

  • If load increases but resources stay the same, how does performance degrade?
  • How much do you need to increase resources to keep performance stable?

For batch systems, the key metric is throughput (records processed per second). For online systems, it's response time.

Percentiles Over Averages

The mean response time is misleading — use percentiles:

PercentileMeaning
p50 (median)Half of requests are faster than this
p9595% of requests are faster than this
p9999% of requests are faster than this
p99999.9% of requests are faster than this

High percentiles (p95, p99, p999) are called tail latencies. They matter because:

  • Your slowest requests often affect your most valuable customers (heavy users with the most data)
  • In a microservices call chain, a single slow backend makes the whole chain slow (head-of-line blocking)

SLOs and SLAs are commonly defined using percentiles: "response time p99 < 200ms."

Approaches for Coping with Load

Vertical Scaling (Scale Up)

Move to a more powerful machine. Simpler, but has limits and can be expensive.

Horizontal Scaling (Scale Out)

Distribute load across many smaller machines. More complex but cheaper and practically unlimited.

Elastic Systems

Automatically add/remove resources as load fluctuates. Good for unpredictable load spikes.

tip

There is no universal "magic scaling architecture." The right approach depends on your specific load parameters. An architecture designed for 100K req/s and 1 KB payloads looks completely different from one designed for 3 req/min and 2 GB payloads.


🔧 Maintainability

The majority of software cost is in ongoing maintenance — not initial development.

Three design principles for maintainable systems:

1. Operability — Making Life Easy for Operations

Operations teams keep the system running. Make their life easy by:

  • Providing visibility into system behavior (health checks, metrics, logs, tracing)
  • Automating repetitive tasks
  • Avoiding dependencies on individual machines (for rolling restarts)
  • Providing good documentation and an understandable operational model
  • Making default behavior sensible but allowing overrides

2. Simplicity — Managing Complexity

As systems grow, they become complex and hard to reason about. Symptoms of complexity:

  • Explosion of state space
  • Tight coupling between modules
  • Tangled dependencies
  • Inconsistent naming and terminology
  • Hacks to work around performance issues

Abstraction is the most powerful tool for managing complexity. A good abstraction hides implementation details behind a clean interface (e.g., SQL hides the storage engine from developers).

3. Evolvability — Making Change Easy

Requirements change constantly. Your system must adapt to:

  • New use cases
  • Regulatory requirements
  • Business strategy changes
  • Changing user expectations

Evolvability is closely tied to simplicity — simpler systems are easier to modify. Agile practices and TDD help at the code level; at the architecture level, the tools discussed throughout this book help.


Summary

PropertyQuestion to askKey techniques
ReliabilityDoes it work correctly when things go wrong?Fault tolerance, redundancy, testing, monitoring
ScalabilityCan it handle growing load?Load parameters, percentiles, horizontal scaling
MaintainabilityIs it easy to operate and evolve?Operability, simplicity, evolvability

These three properties are the lens through which the entire book evaluates every system and design decision. Keep them in mind as you read on.