The Observability Stack
In a monolithic architecture, a single error log provides a complete stack trace of a failure. In a microservices architecture running on Kubernetes, an HTTP request might traverse an Ingress, an API Gateway, an Auth microservice, a Billing microservice, and a PostgreSQL database.
If this request fails or is slow, answering "Why?" requires robust Observability. Observability rests on three pillars: Metrics, Logs, and Traces.
1. Metrics (Prometheus & Grafana)
Metrics are numerical representations of data measured over intervals of time. They are lightweight, highly compressible, and perfect for triggering alerts or viewing long-term trends.
- Examples: http_requests_total, cpu_usage_seconds, database_connection_pool_active.
- Standard Stack: Prometheus (storage and scraping) + Grafana (visualization).
Prometheus Architecture
Prometheus uses a pull-based model. Instead of applications sending metrics to a database, Prometheus reaches out to application endpoints (e.g. /metrics) and scrapes the data periodically.
                   ┌───────────────┐
                   │  Prometheus   │
                   │ (Time-Series) │
                   └───────▲───────┘
                           │ HTTP GET /metrics every 15s
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼──────┐   ┌───────▼──────┐   ┌───────▼──────┐
│ Spring Boot  │   │ Node Exporter│   │  Kube-State  │
│  /actuator/  │   │  (CPU/RAM)   │   │   Metrics    │
│  prometheus  │   │              │   │ (Pod Status) │
└──────────────┘   └──────────────┘   └──────────────┘
Key Advantages:
- Applications don't need to know where Prometheus is; they just expose an HTTP port. If Prometheus goes down, the application isn't impacted.
- The network isn't flooded with push requests; scaling Prometheus scraping is highly predictable.
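To make the pull model concrete, here is a minimal sketch of the application side, using only the Python standard library. A real service would normally rely on an official Prometheus client library instead. The in-memory `REQUEST_COUNTS` dictionary, the label names, and port 8000 are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrape requests; the app never needs to know who is scraping."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (8000 is arbitrary)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be configured with this host and port as a scrape target. Nothing in the application refers to Prometheus itself, which is the decoupling described above.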
Metric Types
- Counter: A value that only goes up (e.g., total HTTP requests). It resets to zero when the process restarts.
- Gauge: A value that goes up and down (e.g., active memory, current thread count).
- Histogram: Groups observations into buckets to calculate percentiles (e.g., p95 response time).
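The three types can be sketched in plain Python to show how they behave. This is an illustrative model, not the code of any client library: the class names and the `quantile` helper are assumptions. The percentile estimate interpolates linearly inside the matching bucket, similar in spirit to how percentiles are derived from bucketed histograms.

```python
import math


class Counter:
    """Monotonically increasing value; starts again from zero after a restart."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in cumulative buckets keyed by upper bound."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)

    def observe(self, value):
        # Cumulative: an observation increments every bucket it fits under.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1

    def quantile(self, q):
        """Estimate a quantile by linear interpolation inside one bucket."""
        total = self.counts[-1]
        if total == 0:
            return math.nan
        rank = q * total
        prev_bound, prev_count = 0.0, 0
        for bound, count in zip(self.bounds, self.counts):
            if count >= rank and count > prev_count:
                if math.isinf(bound):
                    return prev_bound  # cannot interpolate into +Inf
                fraction = (rank - prev_count) / (count - prev_count)
                return prev_bound + (bound - prev_bound) * fraction
            prev_bound, prev_count = bound, count
        return prev_bound
```

Because the histogram keeps only bucket counts, the result is an estimate whose accuracy depends on how well the bucket boundaries match the real latency distribution.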
2. Logs (EFK / PLG Stack)
Logs are immutable, timestamped records of discrete events. They are crucial for pinpointing the exact error line that caused an exception.
K8s containers output logs to stdout and stderr. The kubelet stores these locally on the node in /var/log/containers/*.log. Since pods are ephemeral, logs must be shipped off the node to centralized storage immediately.
The Stack
Collection (The shippers):
- Fluentd / Fluent Bit: The industry standard. Deployed as a DaemonSet, which ensures exactly one replica runs on every node. It mounts the node's /var/log directory, enriches logs with Kubernetes metadata (namespace, pod name, labels), and forwards them.
- Promtail: Similar to Fluent Bit, heavily optimized for Loki.
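The enrichment step can be illustrated with a small sketch. It assumes the common file naming layout under /var/log/containers, `<pod>_<namespace>_<container>-<container id>.log` with a 64-character hexadecimal container ID, and it derives metadata from the path alone. Real shippers also query the Kubernetes API for labels and annotations, which this sketch does not attempt. The `enrich` function and the output field names are hypothetical.

```python
import re

# Assumed layout: <pod>_<namespace>_<container>-<64-hex container id>.log
LOG_PATH_RE = re.compile(
    r"/var/log/containers/"
    r"(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def enrich(path, raw_line):
    """Attach metadata parsed from the log file path to one log line."""
    match = LOG_PATH_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized container log path: {path}")
    return {
        "log": raw_line.rstrip("\n"),
        "kubernetes": {
            "pod_name": match.group("pod"),
            "namespace_name": match.group("namespace"),
            "container_name": match.group("container"),
            "container_id": match.group("container_id"),
        },
    }
```

With this metadata attached, a query in the central store can filter by namespace or pod even after the pod itself no longer exists.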
Storage & Visualization:
- EFK: Elasticsearch (Storage/Search) + Fluentd (Collector) + Kibana (UI). Powerful, but heavy.
- PLG: Promtail (Collector) + Loki (Storage) + Grafana (UI). Loki is typically cheaper to run and easier to scale because it indexes only labels and writes the log payload itself to object storage such as S3 or GCS.
3. Distributed Tracing (OpenTelemetry)
While logs tell you what broke, and metrics tell you how often it breaks, Distributed Traces tell you where in the microservice chain the slowdown occurred.
A Trace represents the end-to-end journey of a single request. A Trace is made up of multiple Spans (individual units of work, like a DB query or an HTTP call downstream).
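As a rough sketch of how spans answer the "where" question, the example below models a trace as a flat list of spans linked by parent IDs and computes each span's self time (its duration minus the time spent in its direct children). The `Span` fields and helper names are assumptions for illustration, and the calculation assumes that sibling spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span_id to time spent in the span itself, excluding direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_span(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])
```

For a request where the gateway span lasts 500 ms but nearly all of that is spent inside a database query span several levels down, `slowest_span` points at the query rather than at the gateway.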
OpenTelemetry (OTel)
OpenTelemetry is a CNCF project that has become the de facto standard for generating and collecting telemetry data. It is increasingly replacing vendor-specific instrumentation agents.
- Instrumentation: Java apps include the OTel Java Agent (-javaagent:opentelemetry-javaagent.jar). It intercepts HTTP clients and JDBC drivers automatically, generating Spans without developer code changes.
- Context Propagation: The agent injects a traceparent HTTP header into outbound requests.

    # Header automatically added to downstream REST calls
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
                    └────────── TRACE ID ──────────┘ └── SPAN ID ───┘

- The Receiver: The downstream service's OTel agent reads that header and attaches its own Spans to the same Trace ID.
- The Collector: Both apps export their spans via OTLP (OpenTelemetry Protocol) over gRPC to an OTel Collector running in the cluster.
- The Backend: The Collector batches, filters, and exports the traces to a backend like Jaeger, Zipkin, Tempo, or Datadog for visualization.
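The propagation step in the list above can be sketched as follows. The header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) follows the W3C Trace Context format shown in the example header. The `SpanContext` class and both function names are hypothetical, and the OTel agent does all of this automatically; the sketch only shows what happens on the wire.

```python
import re
import secrets
from dataclasses import dataclass

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header):
    """Parse an incoming traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None or match.group("version") == "ff":
        return None
    trace_id, parent_id = match.group("trace_id"), match.group("parent_id")
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are not valid
    sampled = bool(int(match.group("flags"), 16) & 0x01)
    return SpanContext(trace_id, parent_id, sampled)


def child_traceparent(parent):
    """Build the header for an outbound call: same trace ID, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

A receiving service would call `parse_traceparent` on the inbound header and `child_traceparent` before each downstream call, so every span in the chain shares one trace ID.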
The Golden Signals
When building observability dashboards for a cluster, always measure the Four Golden Signals (coined by Google SRE):
| Signal | Description | Example Metric |
|---|---|---|
| Latency | The time it takes to service a request. | HTTP 95th percentile response time. |
| Traffic | The demand placed on your system. | HTTP requests per second (RPS). |
| Errors | Rate of requests that fail. | 5xx HTTP response codes relative to total. |
| Saturation | How "full" your service is. | CPU usage relative to limits, connection pool utilization, disk I/O. |
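As a worked example of the table, the sketch below derives all four signals from a window of raw request records. In a real cluster these values would come from queries against the metrics backend; the `RequestRecord` type, the use of connection pool usage as the saturation measure, and the nearest-rank percentile method are assumptions chosen to keep the example small.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float
    status: int


def golden_signals(records, window_s, pool_in_use, pool_size, percentile=95):
    """Compute latency, traffic, errors and saturation for one time window."""
    n = len(records)
    if n:
        durations = sorted(r.duration_s for r in records)
        rank = math.ceil(n * percentile / 100)  # nearest-rank percentile
        latency = durations[max(rank, 1) - 1]
        errors = sum(1 for r in records if 500 <= r.status < 600)
        error_ratio = errors / n
    else:
        latency, error_ratio = math.nan, 0.0
    return {
        "latency_s": latency,                   # e.g. p95 response time
        "traffic_rps": n / window_s,            # requests per second
        "error_ratio": error_ratio,             # 5xx relative to total
        "saturation": pool_in_use / pool_size,  # fraction of pool in use
    }
```

For 100 requests in a 10-second window, of which 5 returned a 5xx status, this reports 10 requests per second and an error ratio of 0.05.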
Interview Questions
Q: How do metrics, logs, and traces complement each other during incidents?
A: Metrics detect and scope, traces localize path-level latency, and logs provide root-cause evidence.
Q: What is a common anti-pattern in observability programs?
A: Collecting high-volume telemetry without clear SLO-aligned questions and alert strategy.
Q: How do you reduce alert fatigue while preserving reliability?
A: Alert on symptom-based SLO burn and critical dependency failures, not every infrastructure fluctuation.
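One way to make "SLO burn" concrete is the burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window burn faster than a threshold, which filters out brief blips. The function names are hypothetical, and the default threshold of 14.4 is an assumed value that a team would tune to its own SLO period and windows.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold."""
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and would page, while a short spike that has already subsided in the short window would not.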
Q: Why is trace context propagation mandatory in microservices?
A: Without it, cross-service causality is lost and latency debugging becomes guesswork.
Q: What sampling strategy would you use in production tracing?
A: Low baseline sampling with dynamic upsampling on errors and high-latency paths.
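A minimal sketch of such a policy, assuming the decision is made after the trace has completed (tail sampling) so that its error status and duration are known. The function name, the 1% baseline, and the 1-second latency threshold are illustrative assumptions. The baseline decision is derived from the trace ID rather than a random number, so every component that evaluates the same trace reaches the same answer.

```python
def keep_trace(trace_id, had_error, duration_s,
               baseline_rate=0.01, slow_threshold_s=1.0):
    """Decide whether to retain a completed trace."""
    # Always keep the traces most useful for debugging.
    if had_error or duration_s >= slow_threshold_s:
        return True
    # Map the low 64 bits of the trace ID to [0, 1] and compare against
    # the baseline rate, giving a deterministic per-trace decision.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < baseline_rate
```

Raising `baseline_rate` temporarily during an incident is one simple way to get the "dynamic upsampling" effect without redeploying services.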
Q: Which golden signals should be dashboarded per service by default?
A: Request rate, error rate, latency percentiles, and saturation indicators tied to service capacity.