Chapter 10: From Monitoring to Observability

Part II — Implementation

With dozens of services, traditional monitoring dashboards aren't enough. This chapter introduces the shift to observability — the ability to ask new questions about your system without deploying new instrumentation.


Monitoring vs. Observability

Monitoring = watching predefined metrics and alerting on thresholds.

  • "CPU usage > 80% → alert"
  • "Error rate > 1% → alert"

Observability = the ability to understand the internal state of your system from its external outputs — even for scenarios you didn't anticipate.

An observable system lets you ask any question about what's happening, not just the ones you predicted when you wrote your dashboards.

In a microservice architecture, monitoring breaks down because:

  • A user request may span 10 services
  • An error in service E may be caused by service B, discovered in service D
  • "Something is slow" requires tracing through multiple hops to find the root cause

Observability solves this.


The Three Pillars of Observability

1. Logs

Structured, timestamped records of discrete events. The classic debugging tool.

Best Practices:

  • Use structured logging (JSON) — machine-parseable, filterable
  • Include correlation IDs in every log line — trace a request across services
  • Log at appropriate levels: DEBUG (dev), INFO (normal ops), WARN (unexpected but handled), ERROR (requires investigation)

Spring Boot + Logback (structured JSON):

<!-- pom.xml: JSON encoder for Logback -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
</dependency>

<!-- logback-spring.xml: route console output through the JSON encoder -->
<appender name="json" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
@Slf4j
@Service
public class OrderService {

    public Order createOrder(CreateOrderRequest request) {
        log.info("Creating order for customer={} items={}",
                request.getCustomerId(), request.getItems().size());
        // ... create and persist the order ...
        log.info("Order created orderId={} status={}", order.getId(), order.getStatus());
        return order;
    }
}

Output: {"timestamp":"2024-01-15T10:30:00Z","level":"INFO","message":"Order created","orderId":"ord-123","status":"CONFIRMED","traceId":"abc123"}

2. Metrics

Numerical measurements aggregated over time. CPU, memory, request rates, error rates, latency percentiles.

Spring Boot Actuator + Micrometer: Micrometer, the metrics facade behind Spring Boot Actuator, auto-instruments HTTP requests, JVM internals, and datasources out of the box. Add your own metrics with:

@Service
public class OrderService {

    private final Counter orderCreatedCounter;
    private final Timer orderCreationTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCreatedCounter = Counter.builder("orders.created")
                .description("Number of orders created")
                .tag("region", "eu-west")
                .register(registry);
        this.orderCreationTimer = Timer.builder("orders.creation.duration")
                .register(registry);
    }

    public Order createOrder(CreateOrderRequest request) {
        return orderCreationTimer.record(() -> {
            Order order = doCreateOrder(request);
            orderCreatedCounter.increment();
            return order;
        });
    }
}

Metrics are then scraped by Prometheus and visualized in Grafana.
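On the Prometheus side, scraping can be sketched as a minimal job definition. The job name, host, and port below are hypothetical; the `/actuator/prometheus` endpoint assumes the `micrometer-registry-prometheus` dependency is on the classpath and the endpoint is exposed via `management.endpoints.web.exposure.include=prometheus`.

```yaml
# prometheus.yml -- scrape a Spring Boot service's Actuator endpoint
scrape_configs:
  - job_name: 'order-service'                # hypothetical service name
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['order-service:8080']      # hypothetical host:port
```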

3. Distributed Tracing

A trace tracks a single request as it flows through multiple services. Each service adds a span — a unit of work with start time, end time, and metadata.

Trace ID: abc-123

Span: API Gateway (0ms - 250ms)
├── Span: Order Service (5ms - 200ms)
│   ├── Span: DB Query - findCustomer (10ms - 25ms)
│   ├── Span: Inventory Service call (30ms - 120ms) ← Slow!
│   └── Span: DB Insert - saveOrder (130ms - 180ms)
└── Span: Notification Service (205ms - 245ms)

This immediately shows that the Inventory Service call (90ms) is the bottleneck.
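The "find the slowest span" reasoning can be sketched in plain Java. This is a toy model for illustration, not the Micrometer or Zipkin API: real tracers also carry trace IDs, parent-span IDs, and tags.

```java
import java.util.Comparator;
import java.util.List;

// Toy span model: just a name plus start/end timestamps in milliseconds.
public class SpanAnalysis {

    public record Span(String name, long startMs, long endMs) {
        long durationMs() { return endMs - startMs; }
    }

    // Return the child span that contributed the most latency.
    public static Span bottleneck(List<Span> spans) {
        return spans.stream()
                .max(Comparator.comparingLong(Span::durationMs))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // The child spans of the Order Service span from the trace above
        List<Span> children = List.of(
                new Span("DB Query - findCustomer", 10, 25),
                new Span("Inventory Service call", 30, 120),
                new Span("DB Insert - saveOrder", 130, 180));
        System.out.println(bottleneck(children).name()); // prints "Inventory Service call"
    }
}
```

Tracing UIs like Jaeger and Zipkin do exactly this comparison visually, rendering each span as a bar whose width is its duration.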

Spring Boot + Micrometer Tracing (replaces Spring Cloud Sleuth):

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev; use 0.1 (10%) in prod

Traces are exported to Zipkin, Jaeger, or OpenTelemetry Collector.
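The sampling probability above amounts to a per-trace coin flip made once at the root of the trace. A sketch of that decision (not Micrometer's internal sampler, which also honors sampling decisions propagated in incoming trace headers):

```java
import java.util.Random;

// Illustrative head-based sampler: decide once per trace whether to record it.
public class ProbabilitySampler {

    private final double probability; // 1.0 = keep every trace, 0.1 = keep ~10%
    private final Random random;

    public ProbabilitySampler(double probability, Random random) {
        this.probability = probability;
        this.random = random;
    }

    // Called when a new root trace starts; downstream services inherit the
    // decision via the propagated trace headers rather than re-sampling.
    public boolean shouldSample() {
        return random.nextDouble() < probability;
    }
}
```

Sampling at the head keeps overhead predictable: a dropped trace costs nothing in any downstream service, because every hop sees the same decision in the headers.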

// Trace context propagates automatically via HTTP headers (B3 or W3C TraceContext)
// Just call other services normally — Spring adds the trace headers
OrderDto order = orderClient.getOrder(orderId); // trace header injected automatically

Correlation IDs

Every request entering your system gets a unique ID. Every service logs this ID. Every downstream call passes it along.

// Spring Cloud Sleuth / Micrometer Tracing does this automatically
// But you can also set it manually in a filter:
@Component
public class CorrelationIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String correlationId = Optional.ofNullable(request.getHeader("X-Correlation-Id"))
                .orElse(UUID.randomUUID().toString());

        MDC.put("correlationId", correlationId);
        ((HttpServletResponse) res).setHeader("X-Correlation-Id", correlationId);

        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");
        }
    }
}

Alerting: Semantic vs. Syntactic

Syntactic alerting = alert on raw metric values: "CPU > 80%".

Semantic alerting = alert on business impact: "Order creation error rate > 0.1%".

Semantic alerts are more meaningful — they directly tell you about user impact. Syntactic alerts cause alert fatigue (CPU can be 80% for perfectly legitimate reasons).
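A semantic alert condition like "order creation error rate > 0.1%" reduces to a ratio over a measurement window. A minimal sketch, with hypothetical counts (in practice this comparison lives in a Prometheus alerting rule, not application code):

```java
// Evaluate a semantic alert over a window of request counts.
public class ErrorRateAlert {

    // true when the error ratio exceeds the threshold (e.g. 0.001 = 0.1%)
    public static boolean shouldAlert(long errors, long total, double threshold) {
        if (total == 0) {
            return false; // no traffic in the window: nothing to alert on
        }
        return (double) errors / total > threshold;
    }
}
```

Note the zero-traffic guard: a window with no requests is not the same as a healthy window, and some teams pair this alert with a separate "no traffic at all" alert.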

Alert Design Principles

  • Alert on symptoms (high error rate) not causes (high CPU)
  • Every alert should require human action — if you can't do anything about it, don't alert
  • Target page-worthy events only (something waking you up at 3am)
  • Use runbooks linked from alerts — "This alert means X, investigate Y, fix with Z"

Common Observability Stacks

Grafana Stack (open-source favorite):

  • Prometheus — metrics collection and storage
  • Grafana — dashboards and visualization
  • Loki — log aggregation (logs-as-metrics approach)
  • Tempo — distributed tracing backend

ELK Stack:

  • Elasticsearch — log storage and search
  • Logstash — log ingestion and transformation
  • Kibana — log visualization and dashboards

Cloud-native alternatives:

  • AWS CloudWatch, X-Ray
  • Google Cloud Operations Suite
  • Datadog, New Relic, Dynatrace (commercial SaaS)

Key Metrics to Track per Microservice

The RED Method (for services):

  • Rate — requests per second
  • Errors — error rate (4xx, 5xx)
  • Duration — latency percentiles (p50, p95, p99)
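The Duration percentiles (p50, p95, p99) can be computed naively from a sorted sample of latencies with the nearest-rank method. A sketch for intuition only: Micrometer and Prometheus derive these from histograms, not raw samples, precisely because storing every latency does not scale.

```java
import java.util.Arrays;

public class Percentiles {

    // Nearest-rank percentile: sort the sample, then index into it.
    // p is the percentile in [0, 100], e.g. 95 for p95.
    public static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }
}
```

For latencies 1..100 ms this yields 95 ms at p95: "95% of requests completed in 95 ms or less." Averages hide tail latency, which is why the RED method asks for percentiles.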

The USE Method (for infrastructure):

  • Utilization — CPU, memory, disk
  • Saturation — queue depth, connection pool usage
  • Errors — hardware and OS errors
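Saturation differs from utilization in that it is measured against a fixed capacity, and it often degrades sharply near the limit. A sketch for a connection pool, with a hypothetical 90% threshold:

```java
public class SaturationCheck {

    // Saturation = fraction of capacity currently in use.
    public static double saturation(int inUse, int capacity) {
        return (double) inUse / capacity;
    }

    // Hypothetical threshold: warn before the pool is exhausted, because
    // at 100% new requests block waiting for a connection.
    public static boolean nearExhaustion(int inUse, int capacity) {
        return saturation(inUse, capacity) >= 0.9;
    }
}
```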

Summary

Pillar         | Tool (Spring ecosystem)             | Purpose
Logs           | SLF4J + Logback JSON → Loki/ELK     | Discrete events with context
Metrics        | Micrometer → Prometheus → Grafana   | Aggregated numbers over time
Traces         | Micrometer Tracing → Zipkin/Jaeger  | Request flow across services
Alerting       | Prometheus AlertManager             | Page on symptoms, not causes
Correlation ID | Micrometer Tracing (auto)           | Tie logs/traces across services