Chapter 10: From Monitoring to Observability
Part II — Implementation
With dozens of services, traditional monitoring dashboards aren't enough. This chapter introduces the shift to observability — the ability to ask new questions about your system without deploying new instrumentation.
Monitoring vs. Observability
Monitoring = watching predefined metrics and alerting on thresholds.
- "CPU usage > 80% → alert"
- "Error rate > 1% → alert"
Observability = the ability to understand the internal state of your system from its external outputs — even for scenarios you didn't anticipate.
An observable system lets you ask any question about what's happening, not just the ones you predicted when you wrote your dashboards.
In a microservice architecture, monitoring breaks down because:
- A user request may span 10 services
- An error in service E may be caused by service B, discovered in service D
- "Something is slow" requires tracing through multiple hops to find the root cause
Observability solves this.
The Three Pillars of Observability
1. Logs
Structured, timestamped records of discrete events. The classic debugging tool.
Best Practices:
- Use structured logging (JSON) — machine-parseable, filterable
- Include correlation IDs in every log line — trace a request across services
- Log at appropriate levels:
  - DEBUG: detailed diagnostics for development
  - INFO: normal operations
  - WARN: unexpected but handled
  - ERROR: failures requiring investigation
Spring Boot + Logback (structured JSON):
<!-- pom.xml: the JSON encoder is a dependency, not Logback configuration -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
</dependency>
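With the encoder on the classpath, a minimal logback-spring.xml along these lines switches console output to JSON (the appender name here is illustrative):

```xml
<!-- logback-spring.xml -->
<configuration>
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <!-- LogstashEncoder serializes each event (timestamp, level, message, MDC) as one JSON object -->
        <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
    </appender>
    <root level="INFO">
        <appender-ref ref="JSON"/>
    </root>
</configuration>
```

Because MDC values are included automatically, any correlation or trace IDs placed in the MDC appear as fields on every log line.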
@Slf4j
@Service
public class OrderService {

    public Order createOrder(CreateOrderRequest request) {
        log.info("Creating order for customer={} items={}",
                request.getCustomerId(), request.getItems().size());
        Order order = processOrder(request); // ... validation, persistence, etc.
        log.info("Order created orderId={} status={}", order.getId(), order.getStatus());
        return order;
    }
}
Output: {"timestamp":"2024-01-15T10:30:00Z","level":"INFO","message":"Order created","orderId":"ord-123","status":"CONFIRMED","traceId":"abc123"}
(Note: for orderId and status to appear as top-level JSON fields rather than inside the message string, pass them as structured arguments, e.g. logstash-logback-encoder's StructuredArguments.kv().)
2. Metrics
Numerical measurements aggregated over time. CPU, memory, request rates, error rates, latency percentiles.
Spring Boot Actuator + Micrometer: Spring Boot ships with Micrometer, a vendor-neutral metrics facade; with Actuator on the classpath it instruments HTTP requests, JVM memory, GC, and data sources automatically. Add your own metrics like this:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@Service
public class OrderService {

    private final Counter orderCreatedCounter;
    private final Timer orderCreationTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCreatedCounter = Counter.builder("orders.created")
                .description("Number of orders created")
                .tag("region", "eu-west")
                .register(registry);
        this.orderCreationTimer = Timer.builder("orders.creation.duration")
                .register(registry);
    }

    public Order createOrder(CreateOrderRequest request) {
        // Timer.record() measures the elapsed time of the supplier and records it
        return orderCreationTimer.record(() -> {
            Order order = doCreateOrder(request);
            orderCreatedCounter.increment();
            return order;
        });
    }
}
Metrics are then scraped by Prometheus and visualized in Grafana.
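A minimal Prometheus scrape job for such a service might look like this (the job name and target are placeholders; the /actuator/prometheus endpoint requires the micrometer-registry-prometheus dependency and the endpoint to be exposed via management.endpoints.web.exposure.include):

```yaml
# prometheus.yml -- scrape the Spring Boot Actuator Prometheus endpoint
scrape_configs:
  - job_name: order-service              # illustrative job name
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['order-service:8080']  # placeholder host:port
```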
3. Distributed Tracing
A trace tracks a single request as it flows through multiple services. Each service adds a span — a unit of work with start time, end time, and metadata.
Trace ID: abc-123
│
├── Span: API Gateway (0ms - 250ms)
│ ├── Span: Order Service (5ms - 200ms)
│ │ ├── Span: DB Query - findCustomer (10ms - 25ms)
│ │ ├── Span: Inventory Service call (30ms - 120ms) ← Slow!
│ │ └── Span: DB Insert - saveOrder (130ms - 180ms)
│ └── Span: Notification Service (205ms - 245ms)
This immediately shows that the Inventory Service call (90ms) is the bottleneck.
Spring Boot + Micrometer Tracing (replaces Spring Cloud Sleuth):
# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev; use 0.1 (10%) in prod
Traces are exported to Zipkin, Jaeger, or OpenTelemetry Collector.
// Trace context propagates automatically via HTTP headers (B3 or W3C TraceContext)
// Just call other services normally — Spring adds the trace headers
OrderDto order = orderClient.getOrder(orderId); // trace header injected automatically
Correlation IDs
Every request entering your system gets a unique ID. Every service logs this ID. Every downstream call passes it along.
// Spring Cloud Sleuth / Micrometer Tracing does this automatically
// But you can also set it manually in a filter:
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

import org.slf4j.MDC;
import org.springframework.stereotype.Component;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

@Component
public class CorrelationIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        // Reuse an incoming correlation ID if present; otherwise generate one
        String correlationId = Optional.ofNullable(request.getHeader("X-Correlation-Id"))
                .orElse(UUID.randomUUID().toString());
        MDC.put("correlationId", correlationId); // visible to every log line on this thread
        ((HttpServletResponse) res).setHeader("X-Correlation-Id", correlationId);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId"); // avoid leaking IDs across pooled threads
        }
    }
}
Alerting: Semantic vs. Syntactic
Syntactic alerting = alert on raw metric values: "CPU > 80%"
Semantic alerting = alert on business impact: "Order creation error rate > 0.1%"
Semantic alerts are more meaningful — they directly tell you about user impact. Syntactic alerts cause alert fatigue (CPU can be 80% for perfectly legitimate reasons).
Alert Design Principles
- Alert on symptoms (high error rate) not causes (high CPU)
- Every alert should require human action — if you can't do anything about it, don't alert
- Target page-worthy events only (something waking you up at 3am)
- Use runbooks linked from alerts — "This alert means X, investigate Y, fix with Z"
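As a sketch, a semantic Prometheus alerting rule for the order-creation error rate could look like this. The metric and label names assume Spring Boot's default http.server.requests instrumentation as rendered by the Prometheus registry; the uri value, threshold, and runbook URL are illustrative:

```yaml
# alert-rules.yml -- semantic alert: order creation is failing for users
groups:
  - name: order-service
    rules:
      - alert: OrderCreationErrorRateHigh
        # 5xx responses on the order endpoint as a fraction of all its requests
        expr: |
          sum(rate(http_server_requests_seconds_count{uri="/orders", status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{uri="/orders"}[5m])) > 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Order creation error rate above 0.1%"
          runbook_url: https://example.com/runbooks/order-errors  # placeholder
```

The `for: 5m` clause keeps a brief blip from paging anyone, in line with the principle that every alert should require human action.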
Common Observability Stacks
Grafana Stack (open-source favorite):
- Prometheus — metrics collection and storage
- Grafana — dashboards and visualization
- Loki — log aggregation (logs-as-metrics approach)
- Tempo — distributed tracing backend
ELK Stack:
- Elasticsearch — log storage and search
- Logstash — log ingestion and transformation
- Kibana — log visualization and dashboards
Cloud-native alternatives:
- AWS CloudWatch, X-Ray
- Google Cloud Operations Suite
- Datadog, New Relic, Dynatrace (commercial SaaS)
Key Metrics to Track per Microservice
The RED Method (for services):
- Rate — requests per second
- Errors — error rate (4xx, 5xx)
- Duration — latency percentiles (p50, p95, p99)
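The RED signals map to PromQL queries roughly like the following, again assuming Spring Boot's default http.server.requests metric (the application label is a placeholder; the p95 query requires percentile histograms to be enabled for the metric):

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Errors: fraction of responses that are 5xx
sum(rate(http_server_requests_seconds_count{application="order-service", status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Duration: p95 latency from the Micrometer histogram buckets
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket{application="order-service"}[5m])))
```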
The USE Method (for infrastructure):
- Utilization — CPU, memory, disk
- Saturation — queue depth, connection pool usage
- Errors — hardware and OS errors
Summary
| Pillar | Tool (Spring ecosystem) | Purpose |
|---|---|---|
| Logs | SLF4J + Logback JSON → Loki/ELK | Discrete events with context |
| Metrics | Micrometer → Prometheus → Grafana | Aggregated numbers over time |
| Traces | Micrometer Tracing → Zipkin/Jaeger | Request flow across services |
| Alerting | Prometheus AlertManager | Page on symptoms, not causes |
| Correlation ID | Spring Micrometer Tracing (auto) | Tie logs/traces across services |