
Capacity Planning & Estimation

The goal is order-of-magnitude accuracy, not precision. Round aggressively.


Essential Numbers to Memorize

Data Sizes

| Unit | Bytes |
|------|-------|
| KB | 10^3 |
| MB | 10^6 |
| GB | 10^9 |
| TB | 10^12 |
| PB | 10^15 |

Time Conversions

| Period | Seconds |
|--------|---------|
| 1 minute | 60 |
| 1 hour | 3,600 |
| 1 day | ~86,400 ≈ 10^5 |
| 1 month | ~2.6M ≈ 2.5 × 10^6 |
| 1 year | ~31.5M ≈ 3 × 10^7 |

Common Throughput Rules of Thumb

| Technology | Throughput |
|------------|------------|
| Single RDBMS (Postgres/MySQL) | ~1,000–5,000 QPS |
| Single Redis instance | ~100,000 QPS |
| Single Kafka partition | ~10 MB/s write |
| Single HTTP server (Spring Boot) | ~5,000–20,000 RPS |
| CDN edge node | ~100,000+ RPS |

Estimation Framework

Step 1: Clarify Scale

  • DAU (Daily Active Users)
  • Read/write ratio
  • Peak multiplier (usually 2–5× average)

Step 2: Estimate QPS

avg QPS = DAU × requests_per_user / seconds_per_day
peak QPS = avg QPS × peak_multiplier

Example (Twitter-like feed):

DAU = 100M
Avg requests/user/day = 10 (reads) + 1 (write)
Avg read QPS = 100M × 10 / 100,000 = 10,000 QPS
Avg write QPS = 100M × 1 / 100,000 = 1,000 QPS
Peak read QPS = 10,000 × 3 = 30,000 QPS
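The two QPS formulas can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `average_qps` and `peak_qps` are hypothetical, and the sketch uses the exact 86,400 seconds per day rather than the rounded 10^5, so its results come out about 16% higher than the hand calculation.

```python
SECONDS_PER_DAY = 86_400  # exact value; mental math usually rounds this to 10^5


def average_qps(dau: int, requests_per_user_per_day: float) -> float:
    """Average requests per second, spread evenly over one day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_multiplier: float) -> float:
    """Scale the daily average by an assumed peak multiplier."""
    return avg_qps * peak_multiplier


# Inputs taken from the Twitter-like feed example.
read_avg = average_qps(100_000_000, 10)   # about 11,574
write_avg = average_qps(100_000_000, 1)   # about 1,157
read_peak = peak_qps(read_avg, 3)         # about 34,722

print(f"avg read QPS  ~ {read_avg:,.0f}")
print(f"avg write QPS ~ {write_avg:,.0f}")
print(f"peak read QPS ~ {read_peak:,.0f}")
```

The gap between 10,000 and roughly 11,574 is well inside order-of-magnitude accuracy, which is why rounding the day to 10^5 seconds is acceptable here.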

Step 3: Estimate Storage

storage = daily_writes × record_size × retention_days

Example (Tweet storage):

Tweets/day = 1,000 QPS × 86,400 = ~86M tweets/day
Record size = 280 bytes text + 200 bytes metadata ≈ 500 bytes
Storage/day = 86M × 500 B = ~43 GB/day
5-year storage = 43 GB × 365 × 5 ≈ 78 TB
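As a quick check of the storage formula, the sketch below recomputes the tweet example with integer arithmetic. The helper name `storage_bytes` is hypothetical, and the figure ignores replication, indexes, and compression.

```python
TB = 10**12


def storage_bytes(writes_per_day: int, record_size_bytes: int, retention_days: int) -> int:
    """Raw storage = daily writes x record size x retention, in bytes."""
    return writes_per_day * record_size_bytes * retention_days


tweets_per_day = 1_000 * 86_400          # 1,000 write QPS sustained for a day
five_years = storage_bytes(tweets_per_day, 500, 365 * 5)

print(f"5-year raw storage ~ {five_years / TB:.1f} TB")
```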

Step 4: Estimate Bandwidth

bandwidth = QPS × avg_payload_size

Example (Image upload service):

Write QPS = 1,000
Avg image size = 2 MB
Write bandwidth = 1,000 × 2 MB = 2 GB/s ← needs chunking + CDN!
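The bandwidth formula is a single multiplication, but it is often useful to also see what that rate means per day. The sketch below does both for the image upload example. The helper name `bandwidth_bytes_per_s` is hypothetical, and the daily figure assumes the 1,000 QPS rate is sustained around the clock.

```python
MB = 10**6
GB = 10**9
TB = 10**12
SECONDS_PER_DAY = 86_400


def bandwidth_bytes_per_s(qps: int, avg_payload_bytes: int) -> int:
    """Bandwidth = requests per second x average payload size."""
    return qps * avg_payload_bytes


write_bw = bandwidth_bytes_per_s(1_000, 2 * MB)
ingest_per_day = write_bw * SECONDS_PER_DAY  # assumes a constant rate all day

print(f"write bandwidth ~ {write_bw / GB:.0f} GB/s")
print(f"daily ingest    ~ {ingest_per_day / TB:.1f} TB/day")
```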

Step 5: Estimate Memory (for caching)

cache_size = hot_data_fraction × total_dataset_size

Example (URL shortener, 60M URLs, 20% hot):

Avg URL size = 100 bytes
Total = 60M × 100 = 6 GB
Cache (20%) = 1.2 GB ← fits in one Redis node
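A minimal sketch of the cache sizing rule, using the URL shortener numbers. The helper name `cache_size_bytes` is hypothetical, and the hot fraction is passed as an integer percentage to keep the arithmetic exact. It counts payload bytes only and ignores any per-key overhead the cache itself adds, so treat the result as a lower bound.

```python
GB = 10**9


def cache_size_bytes(total_items: int, avg_item_bytes: int, hot_percent: int) -> int:
    """Payload bytes needed to hold the hot fraction of the dataset."""
    total_bytes = total_items * avg_item_bytes
    return total_bytes * hot_percent // 100


cache = cache_size_bytes(60_000_000, 100, 20)
print(f"cache payload ~ {cache / GB:.1f} GB")
```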

Worked Examples

URL Shortener (bit.ly)

  • 100M DAU, 1% write, 99% read
  • Write QPS: 100M × 0.01 / 86400 ≈ 12 QPS
  • Read QPS: 100M × 99 × 0.01 / 86400 ≈ 1,150 QPS
  • Storage: 12 QPS × 86400 × 365 × 5 years × 100 bytes ≈ 190 GB (about 0.2 TB)
  • Cache: top 20% URLs = small, single Redis node sufficient
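The URL shortener figures can be re-derived from the stated inputs as follows. This sketch assumes one request per user per day (which is what the formulas above imply) and works from the exact number of daily writes rather than the rounded 12 QPS, so the storage total lands slightly lower than a calculation that rounds first.

```python
SECONDS_PER_DAY = 86_400
GB = 10**9

dau = 100_000_000
# ASSUMPTION: one request per user per day, split 1% writes / 99% reads.
writes_per_day = dau * 1 // 100
reads_per_day = dau * 99 // 100

write_qps = writes_per_day / SECONDS_PER_DAY
read_qps = reads_per_day / SECONDS_PER_DAY

record_bytes = 100
storage_5y = writes_per_day * 365 * 5 * record_bytes

print(f"write QPS ~ {write_qps:.0f}")
print(f"read QPS  ~ {read_qps:,.0f}")
print(f"5-year storage ~ {storage_5y / GB:.1f} GB")
```

Five years of 100-byte records at about a million writes per day stays in the hundreds of gigabytes, comfortably within a single database node.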

Video Streaming (YouTube-like)

  • 1B DAU, 0.1% upload, 10 views/user/day, avg 5 min video
  • Uploads/day: 1B × 0.001 = 1M uploads
  • Upload bandwidth: 1M × 300 MB / 86400 ≈ 3.5 GB/s average ingest (view traffic is far larger and is what needs a massive CDN)
  • Storage (with 3 encoding resolutions): 1M × 300 MB × 3 ≈ 900 TB/day
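The sketch below recomputes the video numbers from the stated inputs and adds a rough egress figure for views. Two things here are assumptions rather than facts from the example: that a 5-minute video averages 300 MB, and that every view streams the full file (real viewers often watch only part of a video, and lower resolutions are smaller).

```python
SECONDS_PER_DAY = 86_400
MB = 10**6
GB = 10**9
TB = 10**12

dau = 1_000_000_000
uploads_per_day = dau // 1_000          # 0.1% of users upload
video_bytes = 300 * MB                  # ASSUMPTION: ~300 MB per 5-minute video

upload_bw = uploads_per_day * video_bytes / SECONDS_PER_DAY
storage_per_day = uploads_per_day * video_bytes * 3   # 3 encoded resolutions

# ASSUMPTION: each of the 10 views per user streams the full 300 MB.
views_per_day = dau * 10
view_bw = views_per_day * video_bytes / SECONDS_PER_DAY

print(f"upload ingest ~ {upload_bw / GB:.1f} GB/s")
print(f"new storage   ~ {storage_per_day / TB:.0f} TB/day")
print(f"view egress   ~ {view_bw / TB:.1f} TB/s (upper bound)")
```

Ingest is in the GB/s range while egress under these assumptions is in the tens of TB/s, which is why the read path dominates the design.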

Chat App (WhatsApp-like)

  • 500M DAU, 40 messages/user/day, avg 100 bytes/message
  • Message QPS: 500M × 40 / 86400 ≈ 230,000 msg/s
  • Storage: 500M × 40 × 100 = 2 TB/day
  • With 30-day retention: 60 TB
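The chat numbers follow the same pattern. The short sketch below recomputes them and also derives the average write bandwidth, which the bullets do not state. It counts message bodies only, with no metadata, indexes, or replication.

```python
SECONDS_PER_DAY = 86_400
MB = 10**6
TB = 10**12

dau = 500_000_000
messages_per_day = dau * 40
message_bytes = 100

msg_per_s = messages_per_day / SECONDS_PER_DAY
write_bw = msg_per_s * message_bytes          # bytes per second, average
storage_per_day = messages_per_day * message_bytes
storage_30d = storage_per_day * 30

print(f"messages/s      ~ {msg_per_s:,.0f}")
print(f"write bandwidth ~ {write_bw / MB:.0f} MB/s")
print(f"storage/day     ~ {storage_per_day / TB:.0f} TB")
print(f"30-day storage  ~ {storage_30d / TB:.0f} TB")
```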

Capacity Planning Checklist

  • Estimated peak QPS (read & write)
  • Storage growth rate (daily, annually)
  • Network bandwidth requirement
  • Number of servers needed (QPS / throughput_per_server)
  • Cache size requirement
  • Database sizing (RAM for working set)
  • Identify bottleneck component

Number of Servers Estimation

servers = peak_QPS / QPS_per_server

For Spring Boot service at ~10,000 RPS per instance:

30,000 peak QPS → 3 instances + headroom → 5 instances
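One way to make "plus headroom" concrete is to size each instance at a target utilization below 100%. The sketch below is an illustration under that assumption: the function name `servers_needed` and the 60% default are hypothetical choices, picked because 60% reproduces the jump from 3 to 5 instances.

```python
def servers_needed(peak_qps: int, qps_per_server: int, target_utilization_pct: int = 100) -> int:
    """Ceiling of peak_qps / (qps_per_server * utilization), in integer arithmetic."""
    usable_x100 = qps_per_server * target_utilization_pct  # usable capacity, scaled by 100
    demand_x100 = peak_qps * 100
    return (demand_x100 + usable_x100 - 1) // usable_x100  # integer ceiling division


bare_minimum = servers_needed(30_000, 10_000)        # no headroom
with_headroom = servers_needed(30_000, 10_000, 60)   # ASSUMPTION: run each instance at 60%

print(f"bare minimum: {bare_minimum} instances")
print(f"at 60% target utilization: {with_headroom} instances")
```

This count covers steady-state load only; spare instances for failures or deployments would be added on top.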

Interview Questions

Q: Estimate the QPS and storage for a Twitter-like service with 300M DAU.

A: Start with actions per user per day, convert to average and peak QPS (for example peak factor 5-10x), then apply read/write split. For storage, multiply daily write volume by retention and replication/compression factors.

Q: You need to store 1M images per day. How much storage do you need in 5 years?

A: Compute 1,000,000 × 365 × 5 × average image size, then add replicas/erasure overhead and metadata/index space. Include growth buffer for derivative sizes (thumbnails/transcodes).
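To put a number on this answer, the sketch below plugs in two values that the question does not give, so both are assumptions: a 2 MB average image (borrowed from the image upload example earlier) and 3 full replicas. Metadata, indexes, and thumbnails are left out.

```python
TB = 10**12
PB = 10**15

images_per_day = 1_000_000
avg_image_bytes = 2 * 10**6   # ASSUMPTION: 2 MB average image
replication_factor = 3        # ASSUMPTION: three full copies, no erasure coding

raw_5y = images_per_day * 365 * 5 * avg_image_bytes
replicated_5y = raw_5y * replication_factor

print(f"raw 5-year storage ~ {raw_5y / PB:.2f} PB")
print(f"with {replication_factor}x replication ~ {replicated_5y / PB:.2f} PB")
```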

Q: A feature requires reading 1 KB per request at 50,000 RPS. What's the bandwidth? Can a single server handle it?

A: Raw egress is about 50,000 × 1 KB ≈ 50 MB/s before protocol overhead. A single server might handle it on modern NICs, but headroom, TLS, and tail latency usually require horizontal scaling.
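Network links are rated in bits per second, so it helps to convert before comparing against a NIC. The sketch below does that conversion; the 10 Gbit/s link speed is an assumed value for illustration, not something the question specifies.

```python
rps = 50_000
payload_bytes = 1_000                      # 1 KB, decimal

bytes_per_s = rps * payload_bytes
mbit_per_s = bytes_per_s * 8 // 10**6      # bytes -> bits, then to megabits

link_mbit_per_s = 10_000                   # ASSUMPTION: a 10 Gbit/s NIC
link_share_pct = mbit_per_s * 100 / link_mbit_per_s

print(f"payload egress ~ {bytes_per_s // 10**6} MB/s = {mbit_per_s} Mbit/s")
print(f"share of assumed 10 Gbit/s link ~ {link_share_pct:.0f}%")
```

Raw bandwidth is a small share of such a link, so the limiting factor at 50,000 RPS is more likely per-request CPU work than the network.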

Q: How would you estimate the number of servers needed for a new service?

A: Derive peak QPS and per-request CPU/memory/IO cost from benchmarks, then calculate instances at target utilization (for example 60%-70%). Add redundancy for failures, maintenance, and regional capacity.

Q: What's the working set size of a database and why does it matter for memory planning?

A: Working set is the hot subset accessed frequently enough to justify memory residency. If it fits in RAM, latency is stable; if not, random disk I/O and cache churn increase p99.

Q: How do you estimate cache hit rate and what affects it?

A: Use access distribution (often Zipf-like), TTL, and cache size relative to hot keys to model hit rate. Key cardinality growth, invalidation frequency, and burstiness strongly affect real outcomes.
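A rough way to turn this into a number is to model key popularity as a Zipf distribution and assume an idealized cache that always holds the most popular keys. The sketch below does that. The function name `zipf_hit_rate` and the exponent of 1.0 are assumptions, and the model ignores TTLs, invalidation, and burstiness, so real hit rates will usually be lower.

```python
def zipf_hit_rate(total_keys: int, cached_keys: int, exponent: float = 1.0) -> float:
    """Hit rate of an ideal cache holding the `cached_keys` most popular keys.

    Key at popularity rank r is requested with probability proportional to 1 / r**exponent.
    """
    weights = [1.0 / rank**exponent for rank in range(1, total_keys + 1)]
    return sum(weights[:cached_keys]) / sum(weights)


# Caching the top 20% of one million keys gives a hit rate of about 0.89 in this model.
hit_rate = zipf_hit_rate(1_000_000, 200_000)
print(f"modelled hit rate ~ {hit_rate:.2f}")

# Under the same model, a much smaller cache still captures most traffic.
small_cache = zipf_hit_rate(1_000_000, 10_000)
print(f"top 1% of keys    ~ {small_cache:.2f}")
```

The skew is what makes a small cache effective: the more uniform the access pattern (a lower exponent), the closer the hit rate falls toward the plain fraction of keys cached.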