Capacity Planning & Estimation

The goal is order-of-magnitude accuracy, not precision. Round aggressively.

Essential Numbers to Memorize

Data Sizes

Unit	Bytes
KB	10^3
MB	10^6
GB	10^9
TB	10^12
PB	10^15

Time Conversions

Period	Seconds
1 minute	60
1 hour	3,600
1 day	86,400 ≈ 10^5
1 month	~2.6M ≈ 2.5 × 10^6
1 year	~31.5M ≈ 3 × 10^7

Common Throughput Rules of Thumb

Technology	Throughput
Single RDBMS (Postgres/MySQL)	~1,000–5,000 QPS
Single Redis instance	~100,000 QPS
Single Kafka partition	~10 MB/s write
Single HTTP server (Spring Boot)	~5,000–20,000 RPS
CDN edge node	~100,000+ RPS

Estimation Framework

Step 1 — Clarify Scale

DAU (Daily Active Users)
Read/write ratio
Peak multiplier (usually 2–5× average)

Step 2 — Estimate QPS

avg QPS = DAU × requests_per_user / seconds_per_day
peak QPS = avg QPS × peak_multiplier

Example — Twitter-like feed:

DAU = 100M
Avg requests/user/day = 10 (reads) + 1 (write)
Avg read QPS  = 100M × 10 / 100,000 = 10,000 QPS
Avg write QPS = 100M × 1  / 100,000 = 1,000 QPS
Peak read QPS = 10,000 × 3 = 30,000 QPS
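
The two formulas translate directly into code. The following is a minimal sketch rather than a prescribed tool: the function names `avg_qps` and `peak_qps` are illustrative, and the default divisor is the rounded 10^5 seconds per day used in the example above.

```python
SECONDS_PER_DAY_ROUNDED = 100_000  # 86,400 rounded up for mental math

def avg_qps(dau: int, requests_per_user_per_day: float,
            seconds_per_day: int = SECONDS_PER_DAY_ROUNDED) -> float:
    """Average queries per second over a whole day."""
    return dau * requests_per_user_per_day / seconds_per_day

def peak_qps(average: float, peak_multiplier: float) -> float:
    """Peak QPS, assuming traffic peaks at a fixed multiple of the average."""
    return average * peak_multiplier

read_qps = avg_qps(100_000_000, 10)   # 100M DAU, 10 reads/user/day
write_qps = avg_qps(100_000_000, 1)   # 100M DAU, 1 write/user/day
print(read_qps, write_qps, peak_qps(read_qps, 3))
```

With the inputs from the example this prints `10000.0 1000.0 30000.0`. Passing `seconds_per_day=86_400` instead shows how little the rounding changes the order of magnitude.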

Step 3 — Estimate Storage

storage = daily_writes × record_size × retention_days

Example — Tweet storage:

Tweets/day = 1,000 QPS × 86,400 = ~86M tweets/day
Record size = 280 bytes text + 200 bytes metadata ≈ 500 bytes
Storage/day = 86M × 500B = ~43 GB/day
5-year storage = 43 GB × 365 × 5 ≈ 78 TB
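
The same arithmetic as a small sketch, using integer math so nothing is lost to rounding. The helper name `storage_bytes` is hypothetical, and the result is raw data only: replication, indexes, and compression are deliberately left out.

```python
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12

def storage_bytes(daily_writes: int, record_size_bytes: int, retention_days: int) -> int:
    """Raw storage before replication, indexes, or compression."""
    return daily_writes * record_size_bytes * retention_days

tweets_per_day = 1_000 * 86_400  # 1,000 write QPS sustained for one day
per_day = storage_bytes(tweets_per_day, 500, 1)
five_years = storage_bytes(tweets_per_day, 500, 365 * 5)

print(f"{per_day / BYTES_PER_GB:.1f} GB/day")
print(f"{five_years / BYTES_PER_TB:.1f} TB over 5 years")
```

This prints `43.2 GB/day` and `78.8 TB over 5 years`, which agrees with the hand estimate to within rounding.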

Step 4 — Estimate Bandwidth

bandwidth = QPS × avg_payload_size

Example — Image upload service:

Write QPS = 1,000
Avg image size = 2 MB
Write bandwidth = 1,000 × 2 MB = 2 GB/s  ← needs chunking + CDN!
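
Network links are usually quoted in bits per second, so it helps to convert the byte figure before comparing it with a NIC. A minimal sketch, with hypothetical helper names:

```python
def bandwidth_bytes_per_sec(qps: float, avg_payload_bytes: float) -> float:
    """Sustained bandwidth in bytes per second."""
    return qps * avg_payload_bytes

def to_gbit_per_sec(bytes_per_sec: float) -> float:
    """Convert bytes/s to gigabits/s (decimal units)."""
    return bytes_per_sec * 8 / 1e9

upload = bandwidth_bytes_per_sec(1_000, 2 * 10**6)  # 1,000 QPS x 2 MB
print(f"{upload / 1e9:.0f} GB/s = {to_gbit_per_sec(upload):.0f} Gbit/s")
```

This prints `2 GB/s = 16 Gbit/s`. That is more than a single 10 Gbit/s link can carry, which is why the upload path has to be spread across machines.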

Step 5 — Estimate Memory (for caching)

cache_size = hot_data_fraction × total_dataset_size

Example — URL shortener (60M URLs, 20% hot):

Avg URL size = 100 bytes
Total = 60M × 100 = 6 GB
Cache (20%) = 1.2 GB  ← fits in one Redis node
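
A sketch of the cache formula. The `overhead_factor` parameter is an assumption added here, not part of the formula above: in-memory stores need extra bytes per key for pointers and bookkeeping, and the 2.0 used below is a placeholder, not a measured Redis figure.

```python
def cache_size_bytes(total_items: int, avg_item_bytes: int,
                     hot_fraction: float, overhead_factor: float = 1.0) -> float:
    """Memory needed to hold the hot fraction of a dataset."""
    total_dataset = total_items * avg_item_bytes
    return total_dataset * hot_fraction * overhead_factor

raw = cache_size_bytes(60_000_000, 100, 0.20)
# Placeholder overhead; measure the real per-key cost before relying on it.
padded = cache_size_bytes(60_000_000, 100, 0.20, overhead_factor=2.0)
print(f"{raw / 1e9:.1f} GB raw, {padded / 1e9:.1f} GB with 2x overhead")
```

Even with the assumed doubling, the hot set stays in the low single-digit gigabytes, so the single-node conclusion holds.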

Worked Examples

URL Shortener (bit.ly)

100M DAU, one request per user per day, 1% write, 99% read
Write QPS: 100M × 0.01 / 86400 ≈ 12 QPS
Read QPS: 100M × 0.99 / 86400 ≈ 1,150 QPS
Storage: 12 QPS × 86400 × 365 × 5 years × 100 bytes ≈ 190 GB
Cache: top 20% URLs = small, single Redis node sufficient
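
Recomputing a worked example in code is a quick way to catch unit slips. This sketch assumes one request per user per day, which is what the QPS arithmetic implies; the constant names are illustrative.

```python
DAU = 100_000_000
SECONDS_PER_DAY = 86_400
REQUESTS_PER_USER_PER_DAY = 1  # assumption implied by the QPS arithmetic
WRITE_PERCENT = 1              # 1% writes, 99% reads
URL_RECORD_BYTES = 100

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY
writes_per_day = requests_per_day * WRITE_PERCENT // 100
reads_per_day = requests_per_day - writes_per_day

write_qps = writes_per_day / SECONDS_PER_DAY
read_qps = reads_per_day / SECONDS_PER_DAY
five_year_bytes = writes_per_day * 365 * 5 * URL_RECORD_BYTES

print(f"write ~{write_qps:.0f} QPS, read ~{read_qps:.0f} QPS")
print(f"5-year storage ~{five_year_bytes / 1e9:.1f} GB")
```

Unrounded, five years of URLs comes to 182.5 GB, a fraction of a terabyte. Starting from the rounded 12 QPS gives about 190 GB instead, which is the same order of magnitude.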

Video Streaming (YouTube-like)

1B DAU, 0.1% upload, 10 views/user/day, avg 5 min video (~300 MB)
Uploads/day: 1B × 0.001 = 1M uploads
Upload bandwidth: 1M × 300 MB / 86400 ≈ 3.5 GB/s average ingest (view traffic is far larger and needs a massive CDN)
Storage (with 3 encoding resolutions): 1M × 300 MB × 3 ≈ 900 TB/day
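
With numbers this large it is easy to land on the wrong prefix, so a small formatter is worth having. The sketch below uses a hypothetical `human_bytes` helper with decimal units, and assumes roughly 300 MB per 5-minute video, as in the arithmetic above.

```python
def human_bytes(n: float) -> str:
    """Format a byte count with decimal (power-of-1000) units."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n < 1000 or unit == "PB":
            return f"{n:.1f} {unit}"
        n /= 1000

DAU = 1_000_000_000
uploads_per_day = DAU // 1000  # 0.1% of users upload once a day
VIDEO_BYTES = 300 * 10**6      # assumed ~300 MB per 5-minute video
ENCODINGS = 3

ingest_per_sec = uploads_per_day * VIDEO_BYTES / 86_400
stored_per_day = uploads_per_day * VIDEO_BYTES * ENCODINGS

print("upload bandwidth:", human_bytes(ingest_per_sec) + "/s")
print("storage per day: ", human_bytes(stored_per_day))
```

The averaged upload rate comes out at 3.5 GB/s and daily storage at 900.0 TB.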

Chat App (WhatsApp-like)

500M DAU, 40 messages/user/day, avg 100 bytes/message
Message QPS: 500M × 40 / 86400 ≈ 230,000 msg/s
Storage: 500M × 40 × 100 = 2 TB/day
With 30-day retention: 60 TB
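
The chat numbers as a sketch. With a fixed retention window, stored data stops growing once the oldest messages start expiring, so the retained figure is a steady-state size rather than a growth rate. Constant names are illustrative.

```python
DAU = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
MESSAGE_BYTES = 100
RETENTION_DAYS = 30

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
message_qps = messages_per_day / 86_400
bytes_per_day = messages_per_day * MESSAGE_BYTES
# Steady state: each new day of data replaces the day that expires.
retained_bytes = bytes_per_day * RETENTION_DAYS

print(f"{message_qps:,.0f} msg/s")
print(f"{bytes_per_day / 1e12:.0f} TB/day, {retained_bytes / 1e12:.0f} TB retained")
```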

Capacity Planning Checklist

Estimated peak QPS (read & write)
Storage growth rate (daily, annually)
Network bandwidth requirement
Number of servers needed (QPS / throughput_per_server)
Cache size requirement
Database sizing (RAM for working set)
Identify bottleneck component

Number of Servers Estimation

servers = peak_QPS / QPS_per_server

For Spring Boot service at ~10,000 RPS per instance:

30,000 peak QPS → 3 instances + headroom → 5 instances
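
One way to make "headroom" concrete is a target utilization per instance. In this sketch the 60% target is an assumed value, chosen because it reproduces the 5 instances above; `servers_needed` is a hypothetical helper.

```python
def servers_needed(peak_qps: int, qps_per_server: int,
                   target_utilization_pct: int = 100) -> int:
    """Instances needed so each one runs at or below the target utilization."""
    numerator = peak_qps * 100
    denominator = qps_per_server * target_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-numerator // denominator)

print(servers_needed(30_000, 10_000))      # bare minimum, no headroom
print(servers_needed(30_000, 10_000, 60))  # each instance at <= 60% load
```

This prints `3` and then `5`.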

Interview Questions

Q: Estimate the QPS and storage for a Twitter-like service with 300M DAU.

A: Start with actions per user per day, convert to average and peak QPS (for example, a peak factor of 5–10×), then apply the read/write split. For storage, multiply daily write volume by retention and replication/compression factors.

Q: You need to store 1M images per day. How much storage do you need in 5 years?

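A: Compute 1,000,000 × 365 × 5 ≈ 1.8 billion images and multiply by the average image size; at 2 MB per image (the size used in Step 4) that is about 3.65 PB raw. Then add replica or erasure-coding overhead and metadata/index space. Include a growth buffer for derivative sizes (thumbnails/transcodes).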

Q: A feature requires reading 1 KB per request at 50,000 RPS. What's the bandwidth? Can a single server handle it?

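A: Raw egress is about 50,000 × 1 KB = 50 MB/s (≈ 400 Mbit/s) before protocol overhead, which fits on a 1 Gbit/s NIC and easily on 10 Gbit/s. The request rate is the harder constraint: 50,000 RPS exceeds the ~5,000–20,000 RPS rule of thumb for a single HTTP server above, and headroom, TLS, and tail latency push further toward horizontal scaling.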

Q: How would you estimate the number of servers needed for a new service?

A: Derive peak QPS and per-request CPU/memory/IO cost from benchmarks, then calculate instances at target utilization (for example 60%-70%). Add redundancy for failures, maintenance, and regional capacity.
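
The steps in this answer can be sketched as sizing for the scarcest resource and then adding spares. Everything numeric below is hypothetical: the per-request costs stand in for benchmark results, the instance capacities are made up for illustration, and 65% sits in the middle of the 60%-70% range mentioned above.

```python
import math

def instances_for(peak_qps: float, cost_per_request: dict, capacity_per_instance: dict,
                  target_utilization: float = 0.65, spare_instances: int = 1) -> int:
    """Size for the scarcest resource, then add spare instances for failures."""
    needed = 0
    for resource, cost in cost_per_request.items():
        demand = peak_qps * cost  # resource units consumed per second at peak
        usable = capacity_per_instance[resource] * target_utilization
        needed = max(needed, math.ceil(demand / usable))
    return needed + spare_instances

# Hypothetical benchmark results: CPU-milliseconds and MB of network I/O per request.
cost = {"cpu_ms": 2.0, "net_mb": 0.05}
# Hypothetical per-second capacity: 8 cores = 8,000 CPU-ms, about 1,000 MB/s of network.
capacity = {"cpu_ms": 8_000, "net_mb": 1_000}

print(instances_for(30_000, cost, capacity))
```

With these made-up inputs CPU is the bottleneck (12 instances against 3 for network), so the sketch prints `13` once the spare is added. Regional capacity would multiply this per region.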

Q: What's the working set size of a database and why does it matter for memory planning?

A: Working set is the hot subset accessed frequently enough to justify memory residency. If it fits in RAM, latency is stable; if not, random disk I/O and cache churn increase p99.

Q: How do you estimate cache hit rate and what affects it?

A: Use access distribution (often Zipf-like), TTL, and cache size relative to hot keys to model hit rate. Key cardinality growth, invalidation frequency, and burstiness strongly affect real outcomes.
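
As a rough illustration of the Zipf-based model, the sketch below computes the hit rate of an idealized cache that always holds the most popular keys. This is an upper bound under stated assumptions: static popularity, Zipf exponent 1.0, and no TTL expiry or invalidation. A real LRU cache with churn would do worse. The function name is hypothetical.

```python
def zipf_hit_rate(cached_keys: int, total_keys: int, exponent: float = 1.0) -> float:
    """Share of requests served by a cache holding the top `cached_keys` keys,
    when the key at popularity rank r is requested in proportion to 1 / r**exponent."""
    def weight_sum(n: int) -> float:
        return sum(1.0 / rank**exponent for rank in range(1, n + 1))
    return weight_sum(cached_keys) / weight_sum(total_keys)

TOTAL_KEYS = 100_000
for cached in (1_000, 5_000, 20_000):
    rate = zipf_hit_rate(cached, TOTAL_KEYS)
    print(f"cache {cached / TOTAL_KEYS:.0%} of keys -> hit rate {rate:.0%}")
```

Under these assumptions, caching 1%, 5%, and 20% of keys yields hit rates of about 62%, 75%, and 87%. Returns diminish quickly, and the result is sensitive to the exponent, so it is worth fitting to real access logs.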
