Capacity Planning & Estimation
The goal is order-of-magnitude accuracy, not precision. Round aggressively.
Essential Numbers to Memorize
Data Sizes
| Unit | Bytes |
|---|---|
| KB | 10^3 |
| MB | 10^6 |
| GB | 10^9 |
| TB | 10^12 |
| PB | 10^15 |
Time Conversions
| Period | Seconds |
|---|---|
| 1 minute | 60 |
| 1 hour | 3,600 |
| 1 day | 86,400 ≈ 10^5 |
| 1 month | ~2.6M ≈ 2.5 × 10^6 |
| 1 year | ~31.5M ≈ 3 × 10^7 |
Common Throughput Rules of Thumb
| Technology | Throughput |
|---|---|
| Single RDBMS (Postgres/MySQL) | ~1,000–5,000 QPS |
| Single Redis instance | ~100,000 QPS |
| Single Kafka partition | ~10 MB/s write |
| Single HTTP server (Spring Boot) | ~5,000–20,000 RPS |
| CDN edge node | ~100,000+ RPS |
Estimation Framework
Step 1: Clarify Scale
- DAU (Daily Active Users)
- Read/write ratio
- Peak multiplier (usually 2–5× average)
Step 2: Estimate QPS
avg QPS = DAU × requests_per_user / seconds_per_day
peak QPS = avg QPS × peak_multiplier
Example (Twitter-like feed):
DAU = 100M
Avg requests/user/day = 10 (reads) + 1 (write)
Avg read QPS = 100M × 10 / 100,000 = 10,000 QPS (rounding 86,400 s/day up to 10^5)
Avg write QPS = 100M × 1 / 100,000 = 1,000 QPS
Peak read QPS = 10,000 × 3 = 30,000 QPS
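The Step 2 formulas can be sketched as a small helper. This is a minimal illustration, not a definitive tool: the function name `estimate_qps` and its parameters are hypothetical, and the default peak multiplier of 3 and the rounded 100,000 seconds per day are taken from the example above.

```python
# Seconds per day rounded up from 86,400 to 10^5, as in the example above.
SECONDS_PER_DAY_ROUNDED = 100_000


def estimate_qps(dau, requests_per_user_per_day, peak_multiplier=3,
                 seconds_per_day=SECONDS_PER_DAY_ROUNDED):
    """Return (average QPS, peak QPS) for one request type."""
    avg_qps = dau * requests_per_user_per_day / seconds_per_day
    return avg_qps, avg_qps * peak_multiplier


# Twitter-like feed: 100M DAU, 10 reads and 1 write per user per day.
read_avg, read_peak = estimate_qps(dau=100_000_000, requests_per_user_per_day=10)
write_avg, write_peak = estimate_qps(dau=100_000_000, requests_per_user_per_day=1)

print(f"read:  avg {read_avg:,.0f} QPS, peak {read_peak:,.0f} QPS")
print(f"write: avg {write_avg:,.0f} QPS, peak {write_peak:,.0f} QPS")
```

Passing `seconds_per_day=86_400` gives the unrounded figures, which differ by about 15%; that is well inside order-of-magnitude accuracy.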
Step 3: Estimate Storage
storage = daily_writes × record_size × retention_days
Example (tweet storage):
Tweets/day = 1,000 QPS × 86,400 = ~86M tweets/day
Record size = 280 bytes text + 200 bytes metadata ≈ 500 bytes
Storage/day = 86M × 500 B = ~43 GB/day
5-year storage = 43 GB × 365 × 5 ≈ 78 TB
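The storage formula is easy to get wrong by a factor of 1,000 when converting units by hand, so a short script that keeps everything in bytes until the final formatting step can help. The helper names `estimate_storage_bytes` and `human_bytes` are hypothetical, and the sketch assumes decimal units (1 KB = 10^3 bytes) as in the Data Sizes table.

```python
def estimate_storage_bytes(daily_writes, record_size_bytes, retention_days):
    """storage = daily_writes x record_size x retention_days, in bytes."""
    return daily_writes * record_size_bytes * retention_days


def human_bytes(n):
    """Format a byte count using decimal units (1 KB = 10^3 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n < 1000 or unit == "PB":
            return f"{n:.1f} {unit}"
        n /= 1000


# Tweet storage: 1,000 writes/s, ~500 bytes per record.
tweets_per_day = 1_000 * 86_400
per_day = estimate_storage_bytes(tweets_per_day, 500, retention_days=1)
five_years = estimate_storage_bytes(tweets_per_day, 500, retention_days=365 * 5)

print("per day: ", human_bytes(per_day))
print("5 years: ", human_bytes(five_years))
```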
Step 4: Estimate Bandwidth
bandwidth = QPS × avg_payload_size
Example (image upload service):
Write QPS = 1,000
Avg image size = 2 MB
Write bandwidth = 1,000 × 2 MB = 2 GB/s → needs chunking + CDN!
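Network capacity is usually quoted in bits per second, so it helps to convert the byte figure before comparing it with link speeds. The sketch below does that conversion for the image upload example; the function names and the 10 Gbit/s link size are assumptions chosen for illustration.

```python
import math


def bandwidth_bytes_per_s(qps, avg_payload_bytes):
    """bandwidth = QPS x avg_payload_size, in bytes per second."""
    return qps * avg_payload_bytes


def to_gbit_per_s(bytes_per_s):
    """Convert bytes/s to Gbit/s (8 bits per byte, 10^9 bits per Gbit)."""
    return bytes_per_s * 8 / 1e9


# Image upload service: 1,000 writes/s at 2 MB each.
bw = bandwidth_bytes_per_s(qps=1_000, avg_payload_bytes=2_000_000)
gbit = to_gbit_per_s(bw)

# Hypothetical link size, ignoring protocol overhead and headroom.
LINK_GBIT = 10
links_needed = math.ceil(gbit / LINK_GBIT)

print(f"{bw / 1e9:.1f} GB/s = {gbit:.0f} Gbit/s, at least {links_needed} x {LINK_GBIT} Gbit/s links")
```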
Step 5: Estimate Memory (for caching)
cache_size = hot_data_fraction × total_dataset_size
Example (URL shortener, 60M URLs, 20% hot):
Avg URL size = 100 bytes
Total = 60M × 100 bytes = 6 GB
Cache (20%) = 1.2 GB → fits in one Redis node
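A minimal sketch of the cache sizing check follows. The 16 GB node size is a hypothetical figure, not a Redis limit, and the estimate ignores per-key overhead, which in practice adds to the raw payload size.

```python
def cache_size_bytes(item_count, avg_item_bytes, hot_fraction_pct):
    """cache_size = hot_data_fraction x total_dataset_size, in bytes."""
    total = item_count * avg_item_bytes
    return total * hot_fraction_pct // 100


# Hypothetical usable memory on one cache node (an assumption for illustration).
NODE_MEMORY_BYTES = 16 * 10**9

# URL shortener: 60M URLs at ~100 bytes each, 20% hot.
needed = cache_size_bytes(item_count=60_000_000, avg_item_bytes=100, hot_fraction_pct=20)
fits = needed <= NODE_MEMORY_BYTES

print(f"cache needed: {needed / 1e9:.1f} GB, fits in one node: {fits}")
```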
Worked Examples
URL Shortener (bit.ly)
- 100M DAU, 1% write, 99% read (assuming one request per user per day)
- Write QPS: 100M × 0.01 / 86,400 ≈ 12 QPS
- Read QPS: 100M × 0.99 / 86,400 ≈ 1,150 QPS
- Storage: 12 QPS × 86,400 × 365 × 5 years × 100 bytes ≈ 190 GB
- Cache: the top 20% of URLs is small; a single Redis node is sufficient
Video Streaming (YouTube-like)
- 1B DAU, 0.1% upload, 10 views/user/day, avg 5 min video (~300 MB per upload)
- Uploads/day: 1B × 0.001 = 1M uploads
- Upload bandwidth: 1M × 300 MB / 86,400 ≈ 3.5 GB/s on average (needs distributed ingest; the CDN serves the much larger view traffic)
- Storage (with 3 encoding resolutions): 1M × 300 MB × 3 ≈ 900 TB/day
Chat App (WhatsApp-like)
- 500M DAU, 40 messages/user/day, avg 100 bytes/message
- Message QPS: 500M × 40 / 86,400 ≈ 230,000 msg/s
- Storage: 500M × 40 × 100 bytes = 2 TB/day
- With 30-day retention: 60 TB
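Recomputing the three worked examples with every quantity held in base units (bytes, seconds) is a cheap way to catch a slipped prefix such as GB versus TB. The script below is a sketch under the stated inputs; all variable names are hypothetical, and it assumes one request per user per day for the URL shortener and 300 MB per uploaded video.

```python
SECONDS_PER_DAY = 86_400


def per_second(daily_total):
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


# URL shortener: 100M DAU, 1% writes, 99% reads, 100 bytes per record.
url_writes_per_day = 100_000_000 * 1 // 100
url_reads_per_day = 100_000_000 * 99 // 100
url_write_qps = per_second(url_writes_per_day)
url_read_qps = per_second(url_reads_per_day)
url_storage_5y = url_writes_per_day * 365 * 5 * 100  # bytes

# Video streaming: 1B DAU, 0.1% upload, 300 MB per video, 3 encodings.
video_uploads_per_day = 1_000_000_000 // 1_000
video_upload_bytes_per_s = per_second(video_uploads_per_day * 300_000_000)
video_storage_per_day = video_uploads_per_day * 300_000_000 * 3  # bytes

# Chat app: 500M DAU, 40 messages per user per day, 100 bytes per message.
chat_msgs_per_day = 500_000_000 * 40
chat_qps = per_second(chat_msgs_per_day)
chat_storage_per_day = chat_msgs_per_day * 100  # bytes
chat_storage_30d = chat_storage_per_day * 30

print(f"URL:   {url_write_qps:.0f} write QPS, {url_read_qps:.0f} read QPS, "
      f"{url_storage_5y / 1e9:.0f} GB over 5 years")
print(f"Video: {video_upload_bytes_per_s / 1e9:.1f} GB/s ingest, "
      f"{video_storage_per_day / 1e12:.0f} TB/day")
print(f"Chat:  {chat_qps:,.0f} msg/s, {chat_storage_per_day / 1e12:.0f} TB/day, "
      f"{chat_storage_30d / 1e12:.0f} TB for 30 days")
```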
Capacity Planning Checklist
- Estimated peak QPS (read & write)
- Storage growth rate (daily, annually)
- Network bandwidth requirement
- Number of servers needed (QPS / throughput_per_server)
- Cache size requirement
- Database sizing (RAM for working set)
- Identify bottleneck component
Number of Servers Estimation
servers = peak_QPS / QPS_per_server
For a Spring Boot service at ~10,000 RPS per instance:
30,000 peak QPS → 3 instances + headroom → 5 instances
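One way to make the "headroom" step explicit is to size against a target utilization rather than against each instance's maximum throughput. The sketch below assumes a 60% target, which happens to reproduce the 5 instances above; the function name and the default are illustrative assumptions, and integer arithmetic is used to avoid floating-point rounding at the boundary.

```python
def servers_needed(peak_qps, qps_per_server, target_utilization_pct=60):
    """Round up peak_qps / (qps_per_server x target utilization)."""
    usable_x100 = qps_per_server * target_utilization_pct
    # Ceiling division on integers: (a + b - 1) // b
    return (peak_qps * 100 + usable_x100 - 1) // usable_x100


# Spring Boot service at ~10,000 RPS per instance, 30,000 peak QPS.
print(servers_needed(30_000, 10_000, target_utilization_pct=100))  # no headroom
print(servers_needed(30_000, 10_000, target_utilization_pct=60))   # with headroom
```

Instances kept purely for failure tolerance (for example N+1) would be added on top of this figure.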
Interview Questions
Q: Estimate the QPS and storage for a Twitter-like service with 300M DAU.
A: Start with actions per user per day, convert to average and peak QPS (for example, a peak factor of 2–5×), then apply the read/write split. For storage, multiply daily write volume by retention and by replication/compression factors.
Q: You need to store 1M images per day. How much storage do you need in 5 years?
A: Compute 1,000,000 × 365 × 5 × average image size, then add replica/erasure-coding overhead and metadata/index space. Include a growth buffer for derivative sizes (thumbnails/transcodes).
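The answer above can be turned into numbers once the unknowns are pinned down. The question does not give an image size, so the sketch below assumes 2 MB per image, 3 replicas, and a 20% allowance for thumbnails and other derivatives; all three values are illustrative assumptions to be replaced with real measurements.

```python
IMAGES_PER_DAY = 1_000_000
DAYS = 365 * 5

# Assumptions for illustration only; the question does not specify them.
AVG_IMAGE_BYTES = 2_000_000   # 2 MB per image
DERIVATIVE_OVERHEAD_PCT = 20  # thumbnails, resized copies
REPLICATION_FACTOR = 3        # full copies kept of every object

raw_bytes = IMAGES_PER_DAY * DAYS * AVG_IMAGE_BYTES
with_derivatives = raw_bytes * (100 + DERIVATIVE_OVERHEAD_PCT) // 100
total_bytes = with_derivatives * REPLICATION_FACTOR

print(f"raw originals:        {raw_bytes / 1e15:.2f} PB")
print(f"with derivatives:     {with_derivatives / 1e15:.2f} PB")
print(f"with replication x{REPLICATION_FACTOR}:  {total_bytes / 1e15:.2f} PB")
```

Metadata and index space are left out here; at a few hundred bytes per image they are small next to the image payload.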
Q: A feature requires reading 1 KB per request at 50,000 RPS. What's the bandwidth? Can a single server handle it?
A: Raw egress is about 50,000 × 1 KB ≈ 50 MB/s (roughly 400 Mbit/s) before protocol overhead. A single server with a modern NIC can likely handle it, but headroom, TLS, and tail latency usually call for horizontal scaling.
Q: How would you estimate the number of servers needed for a new service?
A: Derive peak QPS and per-request CPU/memory/IO cost from benchmarks, then calculate instances at target utilization (for example 60%-70%). Add redundancy for failures, maintenance, and regional capacity.
Q: What's the working set size of a database and why does it matter for memory planning?
A: Working set is the hot subset accessed frequently enough to justify memory residency. If it fits in RAM, latency is stable; if not, random disk I/O and cache churn increase p99.
Q: How do you estimate cache hit rate and what affects it?
A: Use access distribution (often Zipf-like), TTL, and cache size relative to hot keys to model hit rate. Key cardinality growth, invalidation frequency, and burstiness strongly affect real outcomes.
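As a rough model of the answer above, the hit rate of an idealized cache that always holds the most popular keys can be computed directly from a Zipf distribution. This sketch assumes independent requests, a static popularity ranking, and no TTL or invalidation effects, so it gives an optimistic upper bound rather than a prediction; the function name and parameters are hypothetical.

```python
def zipf_hit_rate(total_keys, cached_keys, exponent=1.0):
    """Fraction of requests served when the top `cached_keys` keys are cached.

    Key popularity at rank r is proportional to 1 / r**exponent.
    """
    weight_total = 0.0
    weight_cached = 0.0
    for rank in range(1, total_keys + 1):
        weight = 1.0 / rank ** exponent
        weight_total += weight
        if rank <= cached_keys:
            weight_cached += weight
    return weight_cached / weight_total


# 100,000 distinct keys; vary the fraction of keys that fits in the cache.
TOTAL_KEYS = 100_000
for pct in (1, 10, 20):
    rate = zipf_hit_rate(TOTAL_KEYS, TOTAL_KEYS * pct // 100)
    print(f"cache holds top {pct:>2}% of keys: hit rate ~{rate:.0%}")
```

Under this model the hit rate rises quickly for the first few percent of keys and then flattens, which is why a cache much smaller than the dataset can still absorb most reads. A lower exponent (a flatter distribution) reduces the benefit.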