Observability & Monitoring
"Observability is not about what you know to look for. It's about being able to ask questions you haven't thought of yet."
The Three Pillars
| Pillar | What | Tool Examples |
|---|---|---|
| Metrics | Numeric measurements over time | Prometheus, Micrometer, Datadog |
| Logs | Timestamped text records of events | ELK Stack, Loki, CloudWatch |
| Traces | Request journey across services | Jaeger, Zipkin, AWS X-Ray |
Metrics
The Four Golden Signals (Google SRE)
| Signal | Description | Example Metric |
|---|---|---|
| Latency | Time to serve a request | http_request_duration_seconds |
| Traffic | Request rate | http_requests_total |
| Errors | Error rate | http_errors_total / http_requests_total |
| Saturation | Resource utilization | jvm_memory_used_bytes / jvm_memory_max_bytes |
RED Method (Services)
- Rate: Requests per second
- Error rate: % of failed requests
- Duration: Latency distribution (p50, p95, p99)
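The three RED numbers can be derived from raw request samples. The sketch below is illustrative only: `RedSummary`, `Sample`, and the nearest-rank percentile method are assumptions made for this example, not part of Micrometer or Prometheus, which normally compute these values from counters and histograms.

```java
import java.util.List;

/** Minimal RED summary over one observation window; all names are illustrative. */
final class RedSummary {

    /** One observed request: its latency and whether it failed. */
    record Sample(double latencyMs, boolean failed) {}

    /** Rate: requests per second over the window. */
    static double ratePerSecond(List<Sample> samples, double windowSeconds) {
        return samples.size() / windowSeconds;
    }

    /** Errors: percentage of failed requests (0 when there was no traffic). */
    static double errorPercent(List<Sample> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long failed = samples.stream().filter(Sample::failed).count();
        return 100.0 * failed / samples.size();
    }

    /** Duration: nearest-rank percentile for p in (0, 100]. */
    static double percentile(List<Sample> samples, double p) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.stream().mapToDouble(Sample::latencyMs).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        List<Sample> window = List.of(
            new Sample(12, false), new Sample(48, false), new Sample(950, true));
        System.out.printf("rate=%.2f/s errors=%.1f%% p99=%.0fms%n",
            ratePerSecond(window, 60), errorPercent(window), percentile(window, 99));
    }
}
```

With only a handful of samples, p95 and p99 collapse onto the slowest request, which is one reason percentiles are usually computed from histograms over larger windows.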
USE Method (Resources)
- Utilization: % of time resource is busy
- Saturation: Queue length waiting for resource
- Errors: Count of error events
Spring Boot Observability Setup
<!-- pom.xml -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: ${spring.application.name} # Tag all metrics with app name
  tracing:
    sampling:
      probability: 0.1 # 10% sampling in prod
Custom Metrics
@Service
public class OrderService {
private final Counter orderCounter;
private final Timer orderTimer;
private final Gauge pendingOrdersGauge;
public OrderService(MeterRegistry registry, OrderRepository repo) {
this.orderCounter = Counter.builder("orders.created")
.description("Total orders created")
.tag("region", "us-east")
.register(registry);
this.orderTimer = Timer.builder("orders.processing.duration")
.description("Order processing time")
.publishPercentiles(0.5, 0.95, 0.99)
.register(registry);
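// Gauge: reports the current value each time it is scraped.
// Assigned to the final field declared above so the class compiles.
this.pendingOrdersGauge = Gauge.builder("orders.pending", repo, r -> r.countByStatus("PENDING"))
.description("Pending orders in queue")
.register(registry);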
}
public Order createOrder(CreateOrderRequest req) {
return orderTimer.record(() -> {
Order order = processOrder(req);
orderCounter.increment();
return order;
});
}
}
Logging Best Practices
Structured Logging (JSON)
// Use SLF4J with Logback: JSON output for ELK/Loki
// kv(...) comes from logstash-logback-encoder:
// import static net.logstash.logback.argument.StructuredArguments.kv;
@Slf4j
@Service
public class PaymentService {
    public void processPayment(Payment payment) {
        long startNanos = System.nanoTime();
        log.info("Processing payment",
            kv("paymentId", payment.getId()),
            kv("amount", payment.getAmount()),
            kv("userId", payment.getUserId()),
            kv("status", payment.getStatus()));
        try {
            // ... processing ...
            log.info("Payment processed successfully",
                kv("paymentId", payment.getId()),
                kv("durationMs", (System.nanoTime() - startNanos) / 1_000_000));
        } catch (Exception e) {
            // Pass the exception as the last argument so the stack trace is logged
            log.error("Payment processing failed",
                kv("paymentId", payment.getId()),
                kv("error", e.getMessage()),
                e);
        }
    }
}
<!-- logback-spring.xml: JSON output -->
<configuration>
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<root level="INFO">
<appender-ref ref="JSON"/>
</root>
</configuration>
Log Levels
| Level | Use For |
|---|---|
| TRACE | Very fine-grained, loop iterations |
| DEBUG | Debugging information, method entry/exit |
| INFO | Business events, startup, key state changes |
| WARN | Unexpected but handled situations |
| ERROR | Failures requiring attention |
Correlation IDs
// Add trace/correlation ID to all logs via MDC
@Component
public class CorrelationFilter extends OncePerRequestFilter {
@Override
protected void doFilterInternal(HttpServletRequest req, HttpServletResponse response,
FilterChain chain) throws ServletException, IOException {
String correlationId = req.getHeader("X-Correlation-ID");
if (correlationId == null) correlationId = UUID.randomUUID().toString();
MDC.put("correlationId", correlationId);
response.setHeader("X-Correlation-ID", correlationId);
try {
chain.doFilter(req, response);
} finally {
MDC.clear();
}
}
}
// All subsequent log statements automatically include correlationId
Distributed Tracing
// Spring Boot 3 + Micrometer Tracing (auto-configures)
// Trace context automatically propagated via HTTP headers (W3C TraceContext)
@Service
public class OrderService {
@Autowired private Tracer tracer;
public Order processOrder(CreateOrderCommand cmd) {
Span span = tracer.nextSpan().name("process-order").start();
try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
span.tag("orderId", cmd.getOrderId().toString());
span.tag("userId", cmd.getUserId().toString());
inventoryClient.reserve(cmd); // Trace context propagated automatically
paymentClient.charge(cmd);
return orderRepository.save(new Order(cmd));
} catch (Exception e) {
span.error(e);
throw e;
} finally {
span.end();
}
}
}
Trace Visualization
Request: POST /orders (traceId: abc123)
└─ OrderService.processOrder (12ms)
   ├─ InventoryService.reserve (3ms) → HTTP GET inventory-service/items/reserve
   ├─ PaymentService.charge (7ms) → HTTP POST payment-service/charges
   └─ DB: INSERT orders (2ms)
SLO / SLA / SLI
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | What you measure | 99th percentile latency = 120ms |
| SLO (Objective) | Target for SLI | p99 latency < 200ms, 99.9% of the time |
| SLA (Agreement) | Legal contract | If SLO violated → customer credit |
| Error Budget | Allowed downtime | 99.9% SLO = 8.76 hours/year downtime allowed |
Error Budget Policy
Monthly error budget: 99.9% = 43.8 minutes downtime
If budget consumed < 50%: Deploy freely, take risks
If budget consumed 50-75%: Review before deploying
If budget consumed > 75%: Freeze non-critical deploys
If budget exhausted: Incident response only
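The policy above reduces to a small amount of arithmetic. The following sketch assumes an availability SLO expressed as a fraction (0.999) and treats the boundaries as 50% and 75% inclusive of the review band; the class, enum, and method names are hypothetical.

```java
/** Sketch of the error budget policy above; class and method names are illustrative. */
final class ErrorBudgetPolicy {

    enum Action { DEPLOY_FREELY, REVIEW_BEFORE_DEPLOY, FREEZE_NON_CRITICAL, INCIDENT_RESPONSE_ONLY }

    /** Downtime allowed by an availability SLO (e.g. 0.999) over a window of the given length. */
    static double allowedDowntimeMinutes(double sloTarget, double windowDays) {
        return (1.0 - sloTarget) * windowDays * 24 * 60;
    }

    /** Maps the consumed share of the budget (0.0 to 1.0 and beyond) to a policy action. */
    static Action decide(double consumedFraction) {
        if (consumedFraction >= 1.0) return Action.INCIDENT_RESPONSE_ONLY;
        if (consumedFraction > 0.75) return Action.FREEZE_NON_CRITICAL;
        if (consumedFraction >= 0.50) return Action.REVIEW_BEFORE_DEPLOY;
        return Action.DEPLOY_FREELY;
    }

    public static void main(String[] args) {
        double budget = allowedDowntimeMinutes(0.999, 365.0 / 12); // average month
        double downtimeSoFar = 30.0;                               // hypothetical figure
        System.out.printf("budget=%.1f min, action=%s%n", budget, decide(downtimeSoFar / budget));
    }
}
```

Using an average month (365/12 days) reproduces the 43.8-minute figure; a strict 30-day window gives 43.2 minutes.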
Alerting
Alert Anatomy
# Prometheus alerting rule
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service so the ratio is errors / all requests.
        # Dividing the raw series would only match series with identical labels.
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 5m # Must be true for 5 min before firing
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: SlowP99
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
Alert Fatigue Prevention
- Only alert on user impact (not noise)
- Use symptom-based alerts over cause-based
- Every alert should be actionable
- Avoid flapping alerts (use the `for` clause)
- Escalation policy: warning → critical → page on-call
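To see why a hold period suppresses flapping, here is a small state machine that mimics the idea behind the Prometheus `for` clause: the condition must stay true for the whole hold period before the alert fires, and any false evaluation resets the clock. It is a simplified sketch with hypothetical names, not Prometheus's actual implementation.

```java
/** Sketch of "for"-style hold semantics: fire only after the condition holds continuously. */
final class ForClauseAlert {

    enum State { INACTIVE, PENDING, FIRING }

    private final long holdSeconds;
    private long pendingSince = -1; // -1 means the condition is not currently true

    ForClauseAlert(long holdSeconds) {
        this.holdSeconds = holdSeconds;
    }

    /** Feeds one rule evaluation: the evaluation time and whether the expression was true. */
    State evaluate(long nowSeconds, boolean conditionTrue) {
        if (!conditionTrue) {
            pendingSince = -1;          // any gap resets the hold period
            return State.INACTIVE;
        }
        if (pendingSince < 0) {
            pendingSince = nowSeconds;  // condition just became true
        }
        return (nowSeconds - pendingSince >= holdSeconds) ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseAlert alert = new ForClauseAlert(300); // comparable to "for: 5m"
        System.out.println(alert.evaluate(0, true));
        System.out.println(alert.evaluate(60, false));
        System.out.println(alert.evaluate(120, true));
        System.out.println(alert.evaluate(420, true));
    }
}
```

A condition that flips on and off every 30 seconds never leaves the pending state, so it never notifies anyone.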
Health Checks
// Custom health indicator
@Component
public class DatabaseHealthIndicator extends AbstractHealthIndicator {
@Autowired private DataSource dataSource;
@Override
protected void doHealthCheck(Health.Builder builder) {
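try (Connection conn = dataSource.getConnection()) {
// isValid returns false on failure instead of throwing, so check the result (2s timeout)
if (conn.isValid(2)) {
builder.up()
.withDetail("database", "PostgreSQL")
.withDetail("connectionPool", getPoolStats()); // getPoolStats(): application-specific helper, not shown
} else {
builder.down().withDetail("database", "PostgreSQL");
}
} catch (Exception e) {
builder.down().withException(e);
}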
}
}
# Kubernetes liveness/readiness probes
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Runbook Template
Every alert should have a runbook:
## Alert: HighErrorRate
**Severity**: Critical
**SLO Impact**: Error budget burning at 5x rate
### Diagnosis Steps
1. Check `http_errors_total` by endpoint: which endpoint is failing?
2. Check recent deployments: was there a recent deploy?
3. Check DB connection pool: is the pool exhausted?
4. Check downstream services: is a dependency down?
### Remediation
- If bad deploy: rollback with `kubectl rollout undo deployment/api`
- If DB issue: check `pg_stat_activity` for long-running queries
- If dependency down: enable circuit breaker fallback
### Escalation
- After 15 min unresolved: page backend team lead
Interview Questions
Q: What are the three pillars of observability? How do they differ?
A: Metrics show aggregate trends, logs capture discrete events, and traces connect request paths across services. Together they answer what is failing, where, and why.
Q: What are the four golden signals? Why those four?
A: Latency, traffic, errors, and saturation. They cover user experience, load, correctness, and capacity pressure, which are the main failure dimensions.
Q: What is the difference between SLI, SLO, and SLA?
A: SLI is a measured reliability metric, SLO is the internal target for that metric, and SLA is the external contractual commitment. Breaching SLA has business/legal consequences.
Q: What is an error budget and how should it affect engineering decisions?
A: Error budget is allowable unreliability under the SLO. When budget burns fast, prioritize reliability work and slow risky feature releases.
Q: How do you implement distributed tracing in a Spring Boot microservices system?
A: Use OpenTelemetry instrumentation, propagate traceparent across sync and async calls, and export spans to a tracing backend. Add key span attributes (endpoint, tenant, DB call) for diagnosis.
Q: What's the difference between liveness and readiness probes in Kubernetes?
A: Liveness decides when to restart a stuck container; readiness decides whether it should receive traffic. A pod can be alive but temporarily not ready.
Q: How do you prevent alert fatigue?
A: Alert on actionable symptoms tied to SLO impact, deduplicate noisy signals, and tune thresholds with burn-rate policies. Route ownership clearly and retire stale alerts.
Q: What should you log? What should you not log?
A: Log state transitions, failures, and contextual identifiers needed for triage. Do not log secrets, raw PII, or high-cardinality noise with little diagnostic value.
Q: What is structured logging and why is it better than plain text logs?
A: Structured logs store fields as key-value JSON for reliable querying and correlation. They improve aggregation, filtering, and automated analysis compared to free text parsing.
Q: How would you debug a latency issue that only affects the p99 of requests?
A: Slice by endpoint, tenant, region, and dependency to isolate outliers, then inspect traces for long-tail hops. Typical causes are lock contention, queue buildup, GC pauses, and retries.
OpenTelemetry: The Unified Observability Standard
Building Microservices Chapter 10 covers the evolution from per-service instrumentation to standardized, vendor-neutral telemetry pipelines, which is exactly what OpenTelemetry addresses.
OpenTelemetry (OTel) is the CNCF project that unifies the three pillars under a single vendor-neutral SDK and protocol (OTLP). It replaces the fragmented landscape of Zipkin clients, Prometheus clients, and Logback appenders with a single instrumentation API.
OTel Architecture
┌────────────────────────────────────────────────────────────┐
│                      Application Code                      │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ OTel SDK (Traces + Metrics + Logs)                   │  │
│  │ - Auto-instrumentation (HTTP, DB, gRPC, Kafka)       │  │
│  │ - Manual spans via Tracer API                        │  │
│  └──────────────────────────┬───────────────────────────┘  │
└─────────────────────────────┼──────────────────────────────┘
                              │ OTLP (gRPC or HTTP)
                              ▼
                  ┌───────────────────────┐
                  │    OTel Collector     │
                  │ (Receiver → Processor │
                  │      → Exporter)      │
                  └───────────┬───────────┘
                   ┌──────────┼──────────┐
                   ▼          ▼          ▼
                Jaeger   Prometheus     Loki
               (Traces)   (Metrics)    (Logs)
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s # How often memory usage is checked; must be set
    limit_mib: 400
  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic-fallback
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
exporters:
  # Recent Collector releases no longer ship the dedicated jaeger exporter.
  # Jaeger accepts OTLP directly, so traces are sent with the otlp exporter.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true # Assumes plaintext traffic inside the cluster
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Auto-Instrumentation vs. Manual Instrumentation
| Approach | Coverage | Control | Overhead |
|---|---|---|---|
| Auto (Java Agent) | HTTP, JDBC, Kafka, Redis, gRPC automatically | Low | ~3-5% CPU overhead |
| Manual Spans | Business-level operations (checkout, payment) | Full | Requires code changes |
| Hybrid (recommended) | Auto for infrastructure, manual for business events | Optimal | Balanced |
// Hybrid instrumentation: auto-instruments HTTP, then add business span
@Service
public class CheckoutService {
@Autowired private Tracer tracer;
public CheckoutResult checkout(Cart cart) {
// Auto-instrumented: HTTP calls to inventory, payment services
// Manual: wrap the business transaction for end-to-end visibility
Span checkoutSpan = tracer.spanBuilder("checkout.process")
.setAttribute("cart.total", cart.getTotal())
.setAttribute("cart.items", cart.getItemCount())
.setAttribute("user.tier", cart.getUserTier())
.startSpan();
try (Scope scope = checkoutSpan.makeCurrent()) {
InventoryResult inv = inventoryClient.reserve(cart); // auto-traced
PaymentResult pay = paymentClient.charge(cart); // auto-traced
return orderRepository.save(new Order(cart, inv, pay)); // auto-traced
} catch (Exception e) {
checkoutSpan.setStatus(StatusCode.ERROR, e.getMessage());
checkoutSpan.recordException(e);
throw e;
} finally {
checkoutSpan.end();
}
}
}
Tracing Sampling Strategies: Deep Dive
Building Microservices distinguishes between capturing everything (cost-prohibitive at scale) and sampling intelligently to retain diagnostic signal for failures and outliers.
Head-Based vs. Tail-Based Sampling
| Strategy | Decision Point | Pros | Cons | Use Case |
|---|---|---|---|---|
| Head-Based | At trace start, before any data collected | Low overhead, simple | Blindly samples; misses rare errors | High-volume, low-criticality APIs |
| Tail-Based | After the full trace is collected | Captures all errors + slow traces | Requires buffering entire traces | Payment, checkout, critical paths |
| Adaptive/Dynamic | Rate-adjusted based on current traffic | Balances cost + signal | More complex to configure | General production workloads |
| Priority-Based | Caller sets sampling.priority header | Useful for QA/debugging | Can be abused to inflate storage | Debug sessions, specific user traces |
Tail-Based Sampling Decision Logic
Trace completes (all spans collected in collector)
│
├─ Any span has ERROR status? → ALWAYS KEEP
│
├─ Root span duration > 500ms? → ALWAYS KEEP
│
├─ Trace includes payment/auth service? → KEEP 20%
│
└─ Otherwise → KEEP 1%
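The decision tree above can be written as a short function over a completed trace. This is a sketch under stated assumptions: the `Span` record, the service names `payment` and `auth`, and the hash-based bucketing are all hypothetical and only illustrate the ordering of the rules, not how the Collector's tail_sampling processor is implemented.

```java
import java.util.List;

/** Sketch of the tail-sampling decision tree; types and service names are illustrative. */
final class TailSamplingDecision {

    record Span(String service, boolean error, boolean root, long durationMs) {}

    /** Returns the keep probability for a completed trace, checking rules in priority order. */
    static double keepProbability(List<Span> trace) {
        if (trace.stream().anyMatch(Span::error)) {
            return 1.0;   // any error: always keep
        }
        if (trace.stream().anyMatch(s -> s.root() && s.durationMs() > 500)) {
            return 1.0;   // slow root span: always keep
        }
        if (trace.stream().anyMatch(s -> s.service().equals("payment") || s.service().equals("auth"))) {
            return 0.20;  // critical services: keep 20%
        }
        return 0.01;      // everything else: keep 1%
    }

    /** Deterministic keep/drop: the trace ID hash is mapped to [0, 1) and compared. */
    static boolean keep(String traceId, List<Span> trace) {
        double bucket = Math.floorMod(traceId.hashCode(), 10_000) / 10_000.0;
        return bucket < keepProbability(trace);
    }

    public static void main(String[] args) {
        List<Span> failed = List.of(new Span("checkout", true, true, 40));
        System.out.println(keep("abc123", failed));
    }
}
```

Deriving the bucket from the trace ID rather than a random number means repeated evaluations of the same trace give the same answer.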
Head-Based Sampling Configuration
# Spring Boot + OTel: probabilistic head-based
management:
  tracing:
    sampling:
      probability: 0.05 # 5% of all traces
# OR: rate-based (max N traces/sec regardless of traffic)
# Uses token bucket internally
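Probabilistic head-based sampling works across services only if every service reaches the same decision for the same trace. One common way to get that property is to derive the decision from the trace ID itself. The sketch below assumes 32-character hexadecimal trace IDs; `TraceIdRatioSampler` is a hypothetical name and the code is similar in spirit to, but not a copy of, the ratio samplers shipped with tracing SDKs.

```java
/** Head-based sampling sketch: the decision is a pure function of the trace ID. */
final class TraceIdRatioSampler {

    /** Returns true if the trace should be recorded at the given probability (0.0 to 1.0). */
    static boolean shouldSample(String traceIdHex, double probability) {
        if (probability >= 1.0) return true;
        if (probability <= 0.0) return false;
        // Treat the low 64 bits of the 128-bit trace ID as a uniformly random number.
        String low16 = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low16, 16) >>> 1; // now in [0, 2^63)
        long threshold = (long) (probability * Long.MAX_VALUE);
        return value < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(shouldSample(traceId, 0.05));
    }
}
```

Because no state is shared, two services configured with the same probability keep or drop the same traces.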
Tail-Based Sampling: Collector-Side
# Requires OTel Collector with tail_sampling processor
# Spans from the same trace are buffered in memory for `decision_wait` seconds
processors:
  tail_sampling:
    decision_wait: 30s # Wait for full trace before deciding
    num_traces: 50000 # Buffer up to 50k concurrent traces
    expected_new_traces_per_sec: 10
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: keep-specific-routes
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/checkout", "/api/payment"]
      - name: base-rate
        type: probabilistic
        probabilistic:
          sampling_percentage: 2 # Keep 2% of normal traces
Propagation Formats Comparison
| Format | Header | Use Case |
|---|---|---|
| W3C TraceContext | traceparent, tracestate | Modern standard, cross-vendor |
| B3 Multi-Header | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled | Zipkin legacy |
| B3 Single-Header | b3 | Compact B3 |
| Jaeger | uber-trace-id | Jaeger native |
# OTel SDK: configure propagators
OTEL_PROPAGATORS=tracecontext,baggage
# For Zipkin-compatible systems:
OTEL_PROPAGATORS=b3multi,tracecontext
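For orientation, a W3C `traceparent` value has four dash-separated lowercase hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags, where the lowest flag bit means "sampled". The parser below is a strict sketch for the version-00 layout only; the class and record names are hypothetical, and production code would normally rely on the propagators in the tracing SDK.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Strict parser/formatter sketch for the version-00 traceparent layout. */
final class TraceParent {

    private static final Pattern FORMAT =
        Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    record Parsed(String traceId, String parentSpanId, boolean sampled) {}

    /** Returns the parsed header, or empty if it is malformed or uses all-zero IDs. */
    static Optional<Parsed> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String spanId = m.group(3);
        if (version.equals("ff") || traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            return Optional.empty(); // values the specification marks as invalid
        }
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) == 1;
        return Optional.of(new Parsed(traceId, spanId, sampled));
    }

    /** Formats a version-00 header carrying only the sampled flag. */
    static String format(Parsed p) {
        return "00-" + p.traceId() + "-" + p.parentSpanId() + "-" + (p.sampled() ? "01" : "00");
    }

    public static void main(String[] args) {
        System.out.println(parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
    }
}
```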
Log Aggregation Pipeline: Architecture Comparison
Building Microservices Chapter 10 emphasizes that in a microservices system with dozens of services, centralized log aggregation is not optional: it is a prerequisite for effective diagnosis.
ELK Stack vs. PLG Stack
| Dimension | ELK (Elasticsearch + Logstash + Kibana) | PLG (Promtail + Loki + Grafana) |
|---|---|---|
| Storage model | Full-text index of all log fields | Index only metadata labels; compress raw log text |
| Cost | High (indexes every field) | Low (10x cheaper per GB) |
| Query speed | Fast full-text search across all fields | Fast label queries; slower full-text (grep-like) |
| Schema | Schema-on-write (mapping required) | Schema-on-read (no upfront schema) |
| Alerting | Kibana alerts / Elastalert | Grafana AlertManager (integrated with Prometheus) |
| Best for | Rich full-text log analysis, security audit | Kubernetes-native, cost-sensitive high-volume logs |
Loki Label Design (Critical for Performance)
# ✅ Good: Low-cardinality labels
{app="order-service", env="prod", region="us-east-1"}
# ❌ Bad: High-cardinality labels (kills Loki performance)
{request_id="abc123", user_id="user-456789"}
# Never put request/user IDs in Loki labels; put them in the log line itself
Log Pipeline: Kubernetes → Loki
# Promtail DaemonSet: scrapes all pod logs from /var/log/pods/
# Automatically adds Kubernetes metadata as labels
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
spec:
  selector: # Required by apps/v1; the app label is an illustrative choice
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      containers:
        - name: promtail
          image: grafana/promtail:latest # Pin a specific version in production
          args:
            - -config.file=/etc/promtail/config.yml
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes: # Backing volume for the varlog mount above
        - name: varlog
          hostPath:
            path: /var/log
# promtail config.yml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - docker: {} # Parse Docker JSON log format
      - json: # Extract fields from structured log
          expressions:
            level: level
            traceId: traceId
            service: service
      - labels: # Promote to Loki labels (low-cardinality only)
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339
Correlation Between Pillars
User reports slow checkout
│
├─ [Grafana] Metric dashboard: p99 checkout latency spike at 14:23
│
├─ [Loki] Log query: {service="checkout"} |= "ERROR"
│   → Found: "PaymentService timeout after 30s"
│
└─ [Jaeger] Trace query: traceId=abc123
    → Found: PaymentService span = 30.1s
       └─ DB query span = 29.8s
          └─ Missing index on payment_methods.user_id
SLO Burn Rate Alerting: Senior Deep Dive
Multi-Window Burn Rate
Simple threshold alerts (e.g., "error rate > 1%") suffer from alert lag: they fire late and miss short spikes. Multi-window burn rate alerting from the Google SRE Workbook addresses both problems.
Burn rate = how fast the error budget is being consumed relative to normal.
Error Budget = 1 - SLO target
For 99.9% SLO: error budget = 0.1% = 1440 minutes/day × 0.001 = 1.44 min/day of errors
Burn Rate of 1 = consuming budget at exactly the SLO rate (will exhaust in 30 days)
Burn Rate of 5 = will exhaust budget in 6 days
Burn Rate of 14.4 = will exhaust budget in about 2 days, i.e. 50 hours (CRITICAL)
Burn Rate of 36 = will exhaust budget in 20 hours (PAGE NOW)
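The burn-rate figures follow from two divisions. The sketch below assumes a 30-day SLO window, as the list above does; `BurnRate` and its method names are hypothetical.

```java
/** Burn-rate arithmetic for an SLO measured over a rolling window; names are illustrative. */
final class BurnRate {

    /** Burn rate = observed error ratio divided by the error budget (1 - SLO target). */
    static double burnRate(double errorRatio, double sloTarget) {
        return errorRatio / (1.0 - sloTarget);
    }

    /** Hours until the whole budget is gone if the burn rate stays constant. */
    static double hoursToExhaustion(double burnRate, double sloWindowDays) {
        return sloWindowDays * 24.0 / burnRate;
    }

    /** Fraction of the budget consumed by burning at the given rate for windowHours. */
    static double budgetConsumed(double burnRate, double windowHours, double sloWindowDays) {
        return burnRate * windowHours / (sloWindowDays * 24.0);
    }

    public static void main(String[] args) {
        double rate = burnRate(0.0144, 0.999); // 1.44% errors against a 99.9% SLO
        System.out.printf("burn=%.1fx, exhausted in %.0f h, 1h consumes %.0f%% of budget%n",
            rate, hoursToExhaustion(rate, 30), 100 * budgetConsumed(rate, 1, 30));
    }
}
```

For example, a burn rate of 14.4 exhausts a 30-day budget in 720 / 14.4 = 50 hours and consumes 2% of it in a single hour, while a burn rate of 5 lasts 144 hours (6 days).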
Multi-Window Alert Matrix
Budget shares are what each window consumes at the threshold burn rate, assuming a 30-day SLO window.
| Severity | Short Window | Long Window | Burn Rate | Response |
|---|---|---|---|---|
| Page (Critical) | 1h (2% of budget) | 5h (10% of budget) | ≥ 14.4x | Immediate page |
| Ticket (High) | 6h (5% of budget) | 3d (60% of budget) | ≥ 6x | Next business hour |
| Warning | 3d (30% of budget) | n/a | ≥ 3x | Track and monitor |
# Prometheus: Multi-window burn rate alert for 99.9% SLO
groups:
  - name: slo-checkout
    rules:
      # Fast burn: detect high burn over short window
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            rate(checkout_errors_total[1h]) / rate(checkout_requests_total[1h])
            > 14.4 * 0.001 # 14.4x burn rate × 0.1% error budget
          ) and (
            rate(checkout_errors_total[5h]) / rate(checkout_requests_total[5h])
            > 14.4 * 0.001
          )
        for: 2m
        labels:
          severity: page
          slo: checkout_availability
        annotations:
          summary: "Checkout SLO burning at >14.4x rate: page on-call"
          runbook_url: https://wiki/runbooks/checkout-slo
      # Slow burn: detect consistent moderate burn
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            rate(checkout_errors_total[6h]) / rate(checkout_requests_total[6h])
            > 6 * 0.001
          ) and (
            rate(checkout_errors_total[3d]) / rate(checkout_requests_total[3d])
            > 6 * 0.001
          )
        for: 1h
        labels:
          severity: ticket
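The `and` between the two windows is the core of the technique: the short window shows the problem is happening now, the long window shows it has lasted long enough to matter. A minimal sketch of that condition, with hypothetical names and error ratios passed in as plain numbers:

```java
/** Sketch of the two-window condition used by the rules above; names are illustrative. */
final class MultiWindowBurnAlert {

    /**
     * Fires only when both windows exceed the burn-rate threshold.
     * The error ratios are errors / requests measured over each window.
     */
    static boolean shouldFire(double shortWindowErrorRatio, double longWindowErrorRatio,
                              double burnRateThreshold, double sloTarget) {
        double ratioThreshold = burnRateThreshold * (1.0 - sloTarget);
        return shortWindowErrorRatio > ratioThreshold && longWindowErrorRatio > ratioThreshold;
    }

    public static void main(String[] args) {
        // Sustained problem: both windows are above 14.4 x 0.1% = 1.44%
        System.out.println(shouldFire(0.020, 0.016, 14.4, 0.999));
        // Brief spike: the short window is high but the long window is not
        System.out.println(shouldFire(0.050, 0.002, 14.4, 0.999));
    }
}
```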
Error Budget Tracking Dashboard
# Grafana panel: Error Budget Remaining
expr: |
  1 - (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  ) / 0.001 # Normalize by error budget (0.1%)
# Interpretation:
# 1.0 = full budget remaining (no errors this month)
# 0.5 = half budget consumed
# 0.0 = budget exhausted: the SLO is on the edge of being missed
# < 0 = SLO already missed for this window (an SLA breach only if the contract uses the same target)
Metric Cardinality: The Silent Killer
High cardinality in metrics is one of the most common causes of Prometheus OOM crashes and Datadog bill explosions.
What Is Cardinality?
Every unique label combination creates a separate time series.
metric_name{label1="val1", label2="val2"} → 1 time series
| Label | Cardinality | Impact |
|---|---|---|
| env (prod/staging/dev) | 3 values | Acceptable |
| http_method (GET/POST/PUT/DELETE) | 4 values | Acceptable |
| http_status_code (200/404/500...) | ~15 values | Acceptable |
| endpoint as raw path (/users/12345/orders) | Thousands of paths | DANGEROUS |
| user_id | Millions | CATASTROPHIC |
| trace_id | Unique per request | CATASTROPHIC |
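Because every label combination is its own series, the worst-case series count for a metric is the product of the distinct values of each label. A small sketch (hypothetical class name; the label counts are the ones from the table, with one million assumed for `user_id`):

```java
import java.util.Map;

/** Worst-case time-series estimate for one metric; names are illustrative. */
final class SeriesEstimate {

    /** Multiplies the number of distinct values per label; throws on long overflow. */
    static long worstCaseSeries(Map<String, Long> distinctValuesPerLabel) {
        long total = 1;
        for (long count : distinctValuesPerLabel.values()) {
            total = Math.multiplyExact(total, count);
        }
        return total;
    }

    public static void main(String[] args) {
        long safe = worstCaseSeries(Map.of("env", 3L, "http_method", 4L, "http_status_code", 15L));
        long risky = worstCaseSeries(Map.of(
            "env", 3L, "http_method", 4L, "http_status_code", 15L, "user_id", 1_000_000L));
        System.out.println(safe + " series vs " + risky + " series");
    }
}
```

Three low-cardinality labels give at most 180 series; adding a single `user_id` label with a million values raises the ceiling to 180 million.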
Cardinality Trap: Dynamic Labels
// ❌ BAD: userId creates millions of time series
Counter.builder("api.calls")
    .tag("userId", userId) // NEVER do this
    .tag("orderId", orderId.toString()) // NEVER do this
    .register(registry);
// ✅ GOOD: low-cardinality labels only
Counter.builder("api.calls")
    .tag("service", "order-service")
    .tag("endpoint", "/api/orders") // normalized route, never the raw path with IDs
    .tag("status", "success")
    .register(registry);
// Put high-cardinality data in traces/logs, not metrics
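Prometheus Cardinality Limits
# prometheus.yml: protect against cardinality explosions
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: order-service
    # Limit per scrape target: the scrape is rejected if it returns more samples
    # (50000 is an illustrative value)
    sample_limit: 50000
    metric_relabel_configs:
      # Drop the high-cardinality label before ingestion.
      # labeldrop matches label names; the remaining labels must still be unique per series.
      - regex: user_id
        action: labeldrop
# Note: TSDB storage settings are command-line flags (--storage.tsdb.*), not prometheus.yml keys,
# and they do not cap the number of series.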
Distributed Tracing: Context Propagation Across Async Boundaries
HTTP Propagation (Automatic)
Spring Boot 3 + Micrometer Tracing auto-propagates traceparent via HTTP headers for clients created from the auto-configured builders (RestTemplateBuilder, WebClient.Builder), so no propagation code is needed.
// Outbound HTTP: traceparent header added automatically
// (Feign clients additionally need the feign-micrometer module on the classpath)
@FeignClient(name = "payment-service")
public interface PaymentClient {
@PostMapping("/charges")
PaymentResult charge(@RequestBody ChargeRequest req);
// The tracing instrumentation adds: traceparent: 00-traceId-spanId-01
}
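Kafka Propagation (Manual)
Kafka does not use HTTP headers, so trace context travels in Kafka record headers. Spring Kafka can handle this when observation is enabled on the KafkaTemplate and the listener container; the code here shows the manual equivalent with the OpenTelemetry API.
// Producer: inject trace context into Kafka headers
@Service
public class OrderEventPublisher {
    @Autowired private KafkaTemplate<String, OrderEvent> kafkaTemplate;
    @Autowired private OpenTelemetry openTelemetry;

    public void publish(OrderEvent event) {
        ProducerRecord<String, OrderEvent> record =
            new ProducerRecord<>("order-events", event.getOrderId(), event);
        // Inject the current context (traceparent, baggage) into the record headers
        openTelemetry.getPropagators().getTextMapPropagator().inject(
            Context.current(), record.headers(),
            (headers, key, value) -> headers.add(key, value.getBytes(StandardCharsets.UTF_8)));
        kafkaTemplate.send(record);
    }
}
// Consumer: extract trace context from Kafka headers
// TextMapGetter has two methods, so it cannot be written as a lambda
private static final TextMapGetter<Headers> KAFKA_GETTER = new TextMapGetter<>() {
    @Override
    public Iterable<String> keys(Headers headers) {
        List<String> keys = new ArrayList<>();
        headers.forEach(h -> keys.add(h.key()));
        return keys;
    }
    @Override
    public String get(Headers headers, String key) {
        Header h = headers.lastHeader(key);
        return h != null ? new String(h.value(), StandardCharsets.UTF_8) : null;
    }
};

@KafkaListener(topics = "order-events")
public void onOrderEvent(ConsumerRecord<String, OrderEvent> record) {
    // Restore trace context from Kafka headers
    Context extractedContext = openTelemetry.getPropagators()
        .getTextMapPropagator()
        .extract(Context.current(), record.headers(), KAFKA_GETTER);
    Span span = tracer.spanBuilder("kafka.order-events.consume")
        .setParent(extractedContext)
        .setSpanKind(SpanKind.CONSUMER)
        .startSpan();
    try (Scope scope = span.makeCurrent()) {
        // ... process event
    } finally {
        span.end(); // Always end the span, or it is never exported
    }
}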
Thread Pool / CompletableFuture Propagation
// ❌ Context is LOST when crossing thread boundaries
CompletableFuture.supplyAsync(() -> paymentClient.charge(req)); // Trace context lost!
// ✅ Wrap executor with OTel context propagation
ExecutorService contextAwareExecutor =
    Context.taskWrapping(Executors.newFixedThreadPool(10));
CompletableFuture.supplyAsync(
    () -> paymentClient.charge(req),
    contextAwareExecutor // Context propagated automatically
);
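What a context-propagating executor does can be shown with plain JDK classes. In this sketch a `ThreadLocal<String>` stands in for the tracing context (an assumption for illustration; real code would use the tracing library's own context and wrapper), and `propagating` captures the value on the submitting thread and installs it on the worker thread for the duration of the task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain-JDK sketch of context propagation across a thread pool; names are illustrative. */
final class ContextPropagationSketch {

    /** Stand-in for a tracing context; a real service would use the tracing library's context. */
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Wraps an executor so each task sees the trace ID of the thread that submitted it. */
    static Executor propagating(Executor delegate) {
        return task -> {
            String captured = TRACE_ID.get(); // read on the submitting thread
            delegate.execute(() -> {
                String previous = TRACE_ID.get();
                TRACE_ID.set(captured);       // install on the worker thread
                try {
                    task.run();
                } finally {
                    // Restore the worker's previous state so pooled threads do not leak context
                    if (previous == null) TRACE_ID.remove(); else TRACE_ID.set(previous);
                }
            });
        };
    }

    /** Returns {trace ID seen without wrapping, trace ID seen with wrapping}. */
    static String[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        TRACE_ID.set("abc123");
        try {
            String plain = CompletableFuture
                .supplyAsync(() -> String.valueOf(TRACE_ID.get()), pool).join();
            String wrapped = CompletableFuture
                .supplyAsync(() -> String.valueOf(TRACE_ID.get()), propagating(pool)).join();
            return new String[] {plain, wrapped};
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("unwrapped=" + seen[0] + ", wrapped=" + seen[1]);
    }
}
```

The restore step in the `finally` block matters with pooled threads: without it, a worker would carry one request's context into the next unrelated task.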
Advanced Alerting Strategies
Symptom-Based vs. Cause-Based Alerts
| Type | Example | Problem |
|---|---|---|
| Cause-based | "CPU > 80%" | Too many false positives; CPU spikes without user impact |
| Cause-based | "JVM GC pause > 500ms" | Not always user-visible |
| Symptom-based | "p99 checkout latency > 2s" | Direct user impact |
| Symptom-based | "Checkout error rate > 1%" | Direct user impact |
Rule: Alert on symptoms. Investigate causes. Symptoms = SLO breach indicators.
Alerting Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Flapping alerts | Fire/resolve every 30s | Use the `for` clause (must be true for N minutes) |
| Alert without runbook | On-call doesn't know what to do | Every alert must link to a runbook |
| Too many alerts | Alert fatigue → ignored | Ruthlessly prune non-actionable alerts |
| Alert on infrastructure not user impact | CPU alert fires, users unaffected | SLO-driven alerting |
| Shared inbox | No clear owner | Route to service team, not shared channel |
| Wide alert windows | 15-minute error rates miss 2-minute outages | Multi-window: 1m + 5m + 15m |
Alert Routing Strategy
# Alertmanager routing: route by service ownership
route:
  group_by: ['service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: default
  routes:
    - match:
        service: checkout
      receiver: checkout-team
      routes:
        - match:
            severity: page
          receiver: checkout-oncall # PagerDuty/OpsGenie
    - match:
        service: payment
      receiver: payments-team
receivers:
  - name: default # Catch-all for unrouted alerts; notifier configuration omitted
  - name: checkout-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/...
        channel: '#checkout-alerts'
  - name: checkout-oncall
    pagerduty_configs:
      - service_key: <integration-key>
        description: "{{ .CommonAnnotations.summary }}"
  - name: payments-team # Referenced by the payment route; notifier configuration omitted
Profiling: The Fourth Pillar
Profiling captures CPU flame graphs and memory allocation data, complementing the three pillars.
Continuous Profiling Tools
| Tool | Type | Language Support | Integration |
|---|---|---|---|
| Pyroscope | CPU, memory, goroutines | Java, Go, Python, Node | OTel-native, Grafana |
| Async Profiler | CPU, allocation, lock | JVM | JVM agent |
| Parca | CPU | Go, any (DWARF) | Kubernetes-native |
| eBPF-based (Pixie) | System-level, no instrumentation | Any language | Kubernetes |
Flame Graph Interpretation
main (1000ms total)
├── checkout (600ms, 60%)
│   ├── inventoryClient.reserve (400ms, 40%) ← HOT PATH
│   │   └── HTTP + JSON deserialization (390ms) ← Bottleneck
│   └── orderRepository.save (200ms, 20%)
└── paymentClient.charge (400ms, 40%)
    └── network wait (380ms) ← Slow external call
A wide flat bar = hot path (CPU-bound).
A deep thin chain = latency bound (waiting, I/O, network).
Observability Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Log-only observability | Unstructured logs are hard to correlate at scale | Add metrics + traces |
| Metrics without context | "Error rate 5%": which service, endpoint, tenant? | Always add service/endpoint labels |
| 100% trace sampling | Costs explode at 10k RPS | Head-based 5-10% + tail-based for errors |
| No correlation ID | Cannot link log ↔ trace ↔ metric for same request | Propagate traceId in all three pillars |
| Alerting on averages | Average hides long tails | Always alert on p95/p99 |
| Short retention for traces | Can't diagnose bugs reported days later | 14-day trace retention minimum |
| No canary observability | Deploy new version, no idea if degraded | Always compare canary vs stable metrics |
| Alert without owner | Alert fires, nobody acts | Every alert must have team label and runbook |
Observability Maturity Model
Building Microservices describes organizations evolving from reactive fire-fighting to proactive SLO-driven reliability engineering.
| Level | Capabilities | Typical Tooling |
|---|---|---|
| Level 1: Basic | Container logs, basic health checks | kubectl logs, basic Prometheus |
| Level 2: Instrumented | Structured logs, RED metrics per service, distributed traces | ELK/PLG, Prometheus+Grafana, Jaeger |
| Level 3: SLO-Driven | Defined SLOs per service, error budgets, burn-rate alerts | Sloth/SLO generators, multi-window alerts |
| Level 4: Predictive | Anomaly detection, capacity forecasting, auto-remediation | ML-based alerting, Prometheus + ML |
Spring Boot Observability: Full Production Setup
// Complete observability configuration for a Spring Boot 3 microservice
@Configuration
public class ObservabilityConfig {
    // Skip observations (and therefore spans) for actuator endpoints
    @Bean
    public ObservationPredicate excludeActuatorObservations() {
        return (name, context) -> {
            if (context instanceof ServerRequestObservationContext serverContext) {
                return !serverContext.getCarrier().getRequestURI().startsWith("/actuator");
            }
            return true;
        };
    }
    // Add custom span attributes to all HTTP server observations
    @Bean
    public ObservationRegistryCustomizer<ObservationRegistry> httpServerObservations() {
        return registry -> registry.observationConfig()
            .observationHandler(new CustomHttpServerObservationHandler());
    }
}
// Custom observation handler: add the tenant ID to every HTTP server span
// (ServerRequestObservationContext is in org.springframework.http.server.observation)
public class CustomHttpServerObservationHandler
        implements ObservationHandler<ServerRequestObservationContext> {
    @Override
    public void onStart(ServerRequestObservationContext context) {
        HttpServletRequest request = context.getCarrier();
        String tenantId = request.getHeader("X-Tenant-ID");
        if (tenantId != null) {
            // High-cardinality key values are attached to spans, not to metric tags
            context.addHighCardinalityKeyValue(KeyValue.of("tenant.id", tenantId));
        }
    }
    @Override
    public boolean supportsContext(Observation.Context ctx) {
        return ctx instanceof ServerRequestObservationContext;
    }
}
# application.yml: full observability setup
management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus, loggers
  endpoint:
    health:
      show-details: when_authorized
      probes:
        enabled: true # /actuator/health/liveness and /readiness
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true # Enable histogram for p99 calculations
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo: # SLO boundaries in histogram
        http.server.requests: 100ms, 500ms, 1s, 2s
    tags:
      application: ${spring.application.name}
      version: ${APP_VERSION:unknown}
      environment: ${APP_ENV:local}
  tracing:
    sampling:
      probability: 0.1 # 10% head-based
    baggage:
      correlation:
        fields: [tenantId, userId] # Copy these baggage fields into the MDC for log correlation
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics # Needs micrometer-registry-otlp on the classpath
logging:
  pattern:
    level: "%5p [${spring.application.name},%X{traceId:-},%X{spanId:-}]"
Interview Questions: Senior Level
Q: Explain multi-window burn rate alerting. Why is it better than simple threshold alerts?
A: Simple threshold alerts fire based on instantaneous metric values and suffer from either high false positives (tight thresholds) or slow detection (loose thresholds). Multi-window burn rate alerting measures how fast the error budget is being consumed across two time windows simultaneously: a short window detects sudden spikes, and a long window confirms sustained problems. Requiring both windows to exceed the burn rate threshold filters out noisy alerts from brief transient spikes while ensuring real sustained degradation still pages. For a 99.9% SLO, a burn rate of 14.4x means the full monthly budget will be consumed in 50 hours, which is worth an immediate page.
Q: What is label cardinality in Prometheus and why does it matter?
A: Each unique combination of label values creates a separate time series in Prometheus. High-cardinality labels like user_id, request_id, or trace_id create millions of time series, causing Prometheus to consume massive amounts of RAM and eventually OOM-crash. Best practice is to restrict metric labels to low-cardinality dimensions (service, env, endpoint pattern, status bucket) and push high-cardinality data to logs and traces instead.
Q: How do you propagate trace context across Kafka messages?
A: Kafka messages use record headers rather than HTTP headers. The producer must inject the current span context into Kafka headers using the OTel text map propagator's inject API. The consumer then extracts the trace context from record headers before starting its span, setting the extracted context as the parent. This creates a continuous trace across the asynchronous boundary, linking the producer span to the consumer span in Jaeger/Tempo.
Q: What is tail-based sampling and when should you use it over head-based?
A: Head-based sampling makes the keep/drop decision at the start of the trace before any data is collected. It is simple and low-overhead but will sample out rare errors statistically. Tail-based sampling buffers all spans until the entire trace completes, then applies rules: always keep error traces, always keep slow traces, and probabilistically keep the rest. It can keep every error trace, provided all spans of a trace reach the same collector instance within the decision window. Use tail-based for payment, checkout, and financial flows where missing a single error trace has business cost. Use head-based for high-volume low-criticality paths like health checks.
Q: How do you correlate logs, metrics, and traces for a single user request?
A: The key is a shared trace ID propagated through all three pillars. The tracing library (Micrometer Tracing or the OTel logging instrumentation) puts traceId and spanId into the MDC (mapped diagnostic context), which structured logging frameworks include in every log line. The same traceId tags metric exemplars in Prometheus. In Grafana, you can click a metric spike, jump to the trace via exemplar, and from the trace link to the correlated logs by querying Loki with the trace ID. This requires: structured logging with traceId field, Prometheus exemplars enabled, and Grafana Tempo/Jaeger configured as the trace data source.
Q: How would you design observability for a multi-tenant SaaS system?
A: Add tenant ID as a metric label only if it stays low-cardinality (tenants number in hundreds, not millions). Propagate tenant ID as W3C baggage (the baggage header, which travels alongside traceparent) so every span and log line can carry it. Define separate SLOs per customer tier: free tier at 99%, business tier at 99.9%, enterprise at 99.99%. Use tenant-aware dashboards that show per-tenant error rates and latency. Alert on SLO burn rate per tenant-tier combination, not just aggregate. For security, ensure tenant data in logs is masked/hashed.