Distributed Tracing

In a monolithic application, tracing a request is straightforward because every call runs in one process on a single call stack. In a microservices environment, a single user request can propagate through dozens of services across network boundaries. Distributed tracing provides visibility into the complete journey of a request as it crosses those process boundaries.


How It Works: Spans and Traces

Distributed tracing builds on three key concepts:

Trace (Global Request Journey - Trace ID: abc123xyz)
├─ Span A: Gateway (Client Request Received) [Duration: 100ms]
│  ├─ Span B: Order Service (Create Order DB) [Duration: 40ms]
│  └─ Span C: Payment Service (Charge Call) [Duration: 50ms]
│     └─ Span D: Stripe API (External Network Hop) [Duration: 30ms]
  • Trace: The complete end-to-end journey of a request. It is identified by a unique Trace ID (traceId) generated by the first service that receives the request (typically the API Gateway).
  • Span: A single logical unit of work (e.g., an HTTP request, a database query, or a message publish). Each span has a Span ID, a parent Span ID (absent on the root span), a start timestamp, and a duration.
  • Trace Context Propagation: To pass the traceId and the active spanId across network boundaries, HTTP and Kafka clients inject metadata headers into each outgoing call, most commonly the traceparent header defined by the W3C Trace Context standard.
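
To make the propagation step concrete, here is a minimal sketch of how a traceparent value can be parsed on the way in and re-issued on the way out. TraceParent is a hypothetical helper written for this illustration, not a Micrometer or OpenTelemetry class; it assumes version 00 of the header, which has the layout 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of a W3C traceparent header value (version 00 only). */
final class TraceParent {
    // version "00" - 32 hex trace ID - 16 hex span ID - 2 hex trace flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId; // shared by every span in the trace
    final String spanId;  // ID of the span that sent the request
    final String flags;   // "01" means the trace is sampled

    TraceParent(String traceId, String spanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    /** Parses an incoming header value and rejects anything that is not version 00. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3));
    }

    /** Context for an outgoing call: same trace ID and flags, fresh non-zero span ID. */
    TraceParent child() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0L); // an all-zero span ID is invalid
        return new TraceParent(traceId, String.format("%016x", id), flags);
    }

    /** Serializes back to the header value sent to the next service. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }
}
```

A service that receives a request would call TraceParent.parse on the incoming header, then send child().toHeader() on each downstream call, so that every hop shares the traceId while each span gets its own spanId. In practice the tracing library does this for you.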

Setup & Implementation (Spring Boot 3 + Micrometer Tracing)

In Spring Boot 3, Micrometer Tracing (which integrates with OpenTelemetry) handles context propagation automatically.

1. Add Dependencies

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

2. Configure Sampling and Exporting

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0 # Sample 100% of requests for debugging (use ~0.05-0.1 in high-traffic production)
  otlp:
    tracing:
      endpoint: http://jaeger-collector.monitoring:4318/v1/traces # Send spans to Jaeger
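
The sampling probability can be pictured as a threshold test on the trace ID. The sketch below is a simplified, hypothetical ProbabilitySampler written for illustration; it is not the sampler that Micrometer or OpenTelemetry ships, and it assumes hex trace IDs of at least 16 characters whose low 64 bits are uniformly random.

```java
/** Hypothetical illustration of head-based probability sampling keyed on the trace ID. */
final class ProbabilitySampler {
    private final double probability;
    private final long threshold;

    ProbabilitySampler(double probability) {
        if (!(probability >= 0.0 && probability <= 1.0)) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
        // Scale the probability onto the non-negative long range.
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    /** Same trace ID in, same decision out, so the choice is repeatable. */
    boolean shouldSample(String traceId) {
        if (probability >= 1.0) {
            return true; // 1.0 keeps every trace
        }
        // Use the low 64 bits (last 16 hex characters) as the random input.
        String low64 = traceId.substring(traceId.length() - 16);
        // Clear the sign bit so the value falls in 0..Long.MAX_VALUE.
        long value = Long.parseUnsignedLong(low64, 16) & Long.MAX_VALUE;
        return value < threshold;
    }
}
```

With a probability of 0.1, roughly one trace in ten is kept under these assumptions. The decision is made once, when the trace starts, and downstream services normally honor the sampled flag carried in the propagated context rather than deciding again.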

3. Java Code Example: Manual Span Instrumentation

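Spring Boot automatically traces incoming HTTP requests, as well as outgoing calls made through clients built from the auto-configured RestTemplateBuilder or WebClient.Builder. For critical business logic, you can define custom spans: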

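import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class OrderProcessingService {

    private final Tracer tracer;

    public OrderProcessingService(Tracer tracer) {
        this.tracer = tracer;
    }

    // Order is the application's own domain class
    public void processComplexOrder(Order order) {
        // Create and start a custom span; it becomes a child of the current span, if there is one
        Span customSpan = tracer.nextSpan().name("complex-validation").start();

        // Put the span in scope so that nested spans and log lines pick up its context
        try (Tracer.SpanInScope ws = tracer.withSpan(customSpan)) {
            // Tag the span with useful context
            customSpan.tag("order.id", String.valueOf(order.getId()));
            customSpan.tag("customer.tier", order.getCustomerTier());

            log.info("Validating order rules within custom span context");

            // Run business logic
            validateOrderConstraints(order);
        } catch (Exception e) {
            customSpan.error(e); // Record the failure on the span
            throw e;
        } finally {
            customSpan.end(); // Always end the span, otherwise it is never reported
        }
    }

    private void validateOrderConstraints(Order order) {
        // Placeholder: domain-specific validation goes here
    }
}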

Log Correlation

To correlate traces with log output, configure your logging layout to print the active Trace ID and Span ID on every line:

<!-- logback-spring.xml -->
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [traceId=%X{traceId}, spanId=%X{spanId}] %logger{36} - %msg%n</pattern>

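This produces output such as the following (with the OpenTelemetry bridge, trace IDs are 32 hex characters and span IDs are 16):

2026-07-03 10:15:30.123 [http-nio-8080-exec-1] INFO [traceId=4bf92f3577b34da6a3ce929d0e0e4736, spanId=00f067aa0ba902b7] c.e.o.OrderService - Saving order database entry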
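
Once every line carries these IDs, collecting all log output for one request becomes a simple filter. The following is a minimal sketch that assumes the exact bracket layout of the pattern above; TraceLogFilter and its methods are hypothetical names for this illustration, and in practice a log aggregation tool would do this search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for filtering log lines that use the [traceId=..., spanId=...] layout. */
final class TraceLogFilter {
    private static final Pattern IDS =
            Pattern.compile("\\[traceId=([0-9a-f]*), spanId=([0-9a-f]*)\\]");

    /** Returns the trace ID in a line, or null if the line has none (logged outside any span). */
    static String traceIdOf(String line) {
        Matcher m = IDS.matcher(line);
        if (m.find() && !m.group(1).isEmpty()) {
            return m.group(1);
        }
        return null;
    }

    /** Keeps only the lines that belong to the given trace, in their original order. */
    static List<String> linesForTrace(List<String> lines, String traceId) {
        List<String> matching = new ArrayList<>();
        for (String line : lines) {
            if (traceId.equals(traceIdOf(line))) {
                matching.add(line);
            }
        }
        return matching;
    }
}
```

Lines written outside any span have empty IDs in the MDC, which is why an empty traceId is treated as "no trace" here.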

Pros vs. Cons

| Pros | Cons |
| --- | --- |
| Rapid Diagnostics: Pinpoints precisely which service in a call graph is throwing errors or adding latency. | Storage Overhead: Tracing generates massive volumes of data; storing every trace is expensive. |
| Dependency Graphing: Automatically builds service-to-service dependency topologies for architectural auditing. | Context Propagation Fragility: If any service fails to forward the traceparent headers, the trace is broken. |
| Performance Bottleneck Detection: Highlights slow database queries or blocking client HTTP calls within a request flow. | Code Intrusion: Instrumenting legacy systems or custom protocols requires complex manual setup. |

Common Gotchas & Anti-Patterns

  1. Broken Trace Chains: Failing to carry the trace context over when you hand work to threads you manage yourself in Java (e.g., a raw Runnable or a custom ExecutorService). The tracer keeps the active span in thread-local state, so the task runs under a brand new traceId or with no context at all.
    • Solution: Wrap thread pools with ContextExecutorService from Micrometer's context-propagation library.
  2. Sampling Errors: Sampling 100% of traffic in a high-scale production system generates massive network traffic and storage bills. Lower the head-based sampling probability, or use adaptive or tail-based sampling so that the traces you keep are mostly the interesting ones (errors, slow requests).
  3. Mismatched Propagation Headers: The API Gateway sends W3C traceparent headers (00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01), but downstream services are configured to look for Zipkin B3 headers (X-B3-TraceId). The downstream services then start new traces, so a single request is split across several. Standardize on one format, preferably W3C, in every service.
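
The broken trace chain in item 1 can be reproduced with nothing but a ThreadLocal and an executor. The sketch below is a simplified model: TraceContextHolder and BrokenChainDemo are hypothetical classes written for this illustration, not Micrometer APIs. It shows why a task submitted to a pool loses the context, and how capturing the context at submit time and restoring it on the worker thread repairs the chain. This is, in spirit, the role a context-propagating executor wrapper plays.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a tracer's thread-local context. */
final class TraceContextHolder {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /** Captures the caller's trace ID now and restores it around the task on the worker thread. */
    static <T> Callable<T> propagating(Callable<T> task) {
        String captured = TRACE_ID.get(); // runs on the submitting thread
        return () -> {
            String previous = TRACE_ID.get(); // runs on the worker thread
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker thread back the way it was so context does not leak between tasks.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }
}

final class BrokenChainDemo {
    /** Returns the trace ID seen by a plain task and by a wrapped task, in that order. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TraceContextHolder.set("abc123xyz");
        try {
            Callable<String> readTraceId = TraceContextHolder::get;
            String plain = pool.submit(readTraceId).get();
            String wrapped = pool.submit(TraceContextHolder.propagating(readTraceId)).get();
            return new String[] {plain, wrapped};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TraceContextHolder.clear();
            pool.shutdown();
        }
    }
}
```

In this model the plain task sees no trace ID because the worker thread has its own empty thread-local state, while the wrapped task sees the caller's ID. Wrapping the whole pool once, rather than each task, avoids forgetting the wrapper at individual call sites.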