Service Mesh & Microservices Networking
Microservices Communication Challenges
Distributing a monolith into microservices introduces networking problems:
Service A → Service B → Service C → Service D
↑
Any hop can fail:
- Service down
- Network timeout
- Slow responses (cascading failure)
- Need auth between services
- Need observability (traces, metrics)
- Load balancing between instances
- Service discovery (where is Service B?)
Service Discovery
How does Service A find Service B's IP address?
Client-Side Discovery
Service A → [Service Registry] → gets list of B's IPs → A load balances
Service Registry: Consul, Eureka, etcd, Zookeeper
Service B registers itself on start, deregisters on shutdown
Service A queries registry → gets healthy instances → round-robin
Pros: no extra hop, client controls LB algorithm
Cons: discovery logic in every service (every language)
// Spring Cloud Eureka
@SpringBootApplication
@EnableEurekaClient
public class OrderServiceApp { ... }
// application.yml
eureka:
client:
service-url:
defaultZone: http://eureka:8761/eureka
instance:
prefer-ip-address: true
health-check-url-path: /actuator/health
// Feign client — auto-discovers via Eureka
@FeignClient("inventory-service")
public interface InventoryClient {
@GetMapping("/api/inventory/{productId}")
InventoryDto checkInventory(@PathVariable Long productId);
}
Server-Side Discovery (Kubernetes)
Service A → [kube-proxy/DNS] → routes to healthy pod
Kubernetes Service:
ClusterIP: stable virtual IP for a set of pods
DNS: inventory-service.default.svc.cluster.local → ClusterIP
kube-proxy: iptables/IPVS rules route ClusterIP → actual pod IPs
Service A doesn't need a registry — DNS + kube-proxy handle it
# Kubernetes Service (server-side discovery)
apiVersion: v1
kind: Service
metadata:
name: inventory-service
spec:
selector:
app: inventory
ports:
- port: 8080
targetPort: 8080
type: ClusterIP # internal cluster IP
# Service A accesses: http://inventory-service:8080/api/inventory/42
Circuit Breaker Pattern
Prevents cascading failures when a downstream service is slow or down.
States:
CLOSED → requests flow through normally
OPEN → requests immediately fail fast (no call to downstream)
HALF_OPEN → test requests allowed; if successful → CLOSED; if fail → OPEN
failures > threshold
CLOSED ─────────────────────────────────────────► OPEN
▲ │
│ test succeeds │ timeout expires
│◄────────────────────────────── HALF_OPEN ◄──────┘
test fails → OPEN again
// Resilience4j Circuit Breaker with Spring Boot
@Bean
public CircuitBreakerConfig circuitBreakerConfig() {
return CircuitBreakerConfig.custom()
.failureRateThreshold(50) // open when 50% of calls fail
.waitDurationInOpenState(Duration.ofSeconds(30)) // stay open 30s
.slidingWindowSize(10) // evaluate last 10 calls
.permittedNumberOfCallsInHalfOpenState(3)
.slowCallRateThreshold(50) // also trigger on slow calls
.slowCallDurationThreshold(Duration.ofSeconds(2))
.build();
}
@Service
public class InventoryService {
@CircuitBreaker(name = "inventory", fallbackMethod = "inventoryFallback")
@Retry(name = "inventory")
@TimeLimiter(name = "inventory")
public CompletableFuture<InventoryDto> checkInventory(Long productId) {
return CompletableFuture.supplyAsync(() ->
inventoryClient.check(productId));
}
public CompletableFuture<InventoryDto> inventoryFallback(Long productId, Exception e) {
log.warn("Inventory service unavailable, using fallback", e);
return CompletableFuture.completedFuture(InventoryDto.unknown(productId));
}
}
Retry Pattern
// Resilience4j Retry
@Bean
public RetryConfig retryConfig() {
return RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(ConnectTimeoutException.class, IOException.class)
.ignoreExceptions(BadRequestException.class) // don't retry 4xx
.build();
}
// Exponential backoff with jitter
RetryConfig.custom()
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(200), // initial
2.0, // multiplier
Duration.ofSeconds(10))) // max
.build();
Bulkhead Pattern
Isolate resources per dependency — one failing service can't exhaust all threads.
// Resilience4j Bulkhead (semaphore)
@Bulkhead(name = "inventory", type = Bulkhead.Type.SEMAPHORE)
public InventoryDto checkInventory(Long productId) { ... }
// Thread pool bulkhead
@Bulkhead(name = "inventory", type = Bulkhead.Type.THREADPOOL)
public CompletableFuture<InventoryDto> checkInventory(Long productId) { ... }
@Bean
public BulkheadConfig bulkheadConfig() {
return BulkheadConfig.custom()
.maxConcurrentCalls(10) // max concurrent calls allowed
.maxWaitDuration(Duration.ofMillis(50)) // wait if at limit
.build();
}
Service Mesh
A service mesh moves cross-cutting network concerns out of application code into infrastructure.
Without mesh: every service implements auth, observability, retry, circuit breaking
With mesh: sidecar proxy handles it — app just sends plain HTTP
┌─────────────────────────────────────┐
Pod A │ [App Container] ←→ [Envoy Sidecar] │──── mTLS ────
└─────────────────────────────────────┘
│
┌─────────────────────────────────────┐ │
Pod B │ [Envoy Sidecar] ←→ [App Container] │ ───────────┘
└─────────────────────────────────────┘
Control Plane (Istio): pushes config to all Envoy proxies
Data Plane (Envoy): actual traffic handling (mTLS, retries, tracing)
Istio Features
| Feature | Description |
|---|---|
| mTLS | Auto-provisioned certs, service-to-service encryption |
| Traffic management | Load balancing, retries, circuit breaking, timeouts |
| Observability | Auto distributed tracing (Jaeger/Zipkin), metrics, access logs |
| Authorization | Service-level RBAC policies (which services can talk to whom) |
| Traffic splitting | Canary deployments (send 5% to v2) |
| Fault injection | Test resilience by injecting delays/errors |
# Istio VirtualService: canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: order-service
spec:
hosts:
- order-service
http:
- route:
- destination:
host: order-service
subset: v1
weight: 95
- destination:
host: order-service
subset: v2
weight: 5 # 5% canary traffic
---
# Istio DestinationRule: circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: inventory-service
spec:
host: inventory-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s # eject unhealthy pod for 30s
maxEjectionPercent: 50 # never eject >50% of pods
---
# Istio AuthorizationPolicy: zero trust
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: inventory-policy
spec:
selector:
matchLabels:
app: inventory
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/order-service"] # only order-service
to:
- operation:
methods: ["GET"]
paths: ["/api/inventory/*"]
Envoy Proxy
Envoy is the data plane proxy used by Istio, AWS App Mesh, and many others.
Envoy capabilities:
L3/L4: TCP proxy, TLS termination/origination
L7: HTTP/1.1, HTTP/2, gRPC, WebSocket
Observability: distributed tracing (Zipkin, Jaeger, X-Ray), stats
Service discovery: via xDS API from control plane
Load balancing: round-robin, least-request, ring hash, Maglev
Fault injection: inject delays and errors for testing
Kubernetes Networking Concepts
Pod networking:
Every pod gets its own IP (flat network)
Pods can reach each other directly across nodes
CNI plugin handles this (Calico, Flannel, Cilium, Weave)
Service types:
ClusterIP: internal-only VIP, reachable within cluster
NodePort: exposes on every node's IP:port (30000-32767)
LoadBalancer: provisions cloud LB (AWS ELB, GCP NLB)
ExternalName: maps to external DNS name
Ingress:
L7 HTTP routing → backend Services
nginx Ingress, Traefik, AWS ALB Ingress Controller
# Kubernetes Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
annotations:
nginx.ingress.kubernetes.io/rate-limit: "100"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
tls:
- hosts: [api.example.com]
secretName: api-tls
rules:
- host: api.example.com
http:
paths:
- path: /api/orders
pathType: Prefix
backend:
service:
name: order-service
port:
number: 8080
- path: /api/inventory
pathType: Prefix
backend:
service:
name: inventory-service
port:
number: 8080
🎯 Interview Questions
Q1. What is a circuit breaker and why is it needed in microservices?
A circuit breaker prevents cascading failures: if Service A calls Service B and B is slow/down, without a circuit breaker, A's threads fill up waiting for B's timeouts — eventually A becomes unavailable too. The circuit breaker opens after N failures, immediately failing calls with a fallback (rather than waiting for timeout). After a recovery period, it lets test calls through — if successful, closes and resumes normal operation.
Q2. What is a service mesh and what problems does it solve?
A service mesh adds a sidecar proxy (Envoy) to every pod, intercepting all network traffic. It moves cross-cutting concerns out of application code: automatic mTLS between services (zero-trust), observability (distributed tracing, metrics without code changes), traffic management (retries, circuit breaking, timeouts), canary deployments, and authorization policies. The app just speaks plain HTTP — the sidecar handles everything.
Q3. What is the difference between client-side and server-side service discovery?
Client-side: the service queries a registry (Eureka, Consul) to get a list of healthy instances and load-balances among them. Client needs registry client library. Server-side: the client sends to a stable address (Kubernetes Service ClusterIP), and the infrastructure (kube-proxy, load balancer) routes to a healthy instance. Client has no discovery logic. Kubernetes uses server-side discovery — DNS resolves to ClusterIP, kube-proxy routes to pods.
Q4. What is the bulkhead pattern?
Named after ship compartments, bulkheads isolate resources per dependency. Each downstream service gets its own thread pool (or semaphore limit). If one service is slow and exhausts its thread pool, other services' thread pools are unaffected — the failure is contained. Without bulkheads, one slow service can consume all application threads, bringing down all other endpoints.
Q5. How does Istio implement mTLS without changing application code?
Istio's control plane (Istiod) automatically provisions X.509 certificates for every service account. The Envoy sidecar intercepts all inbound/outbound traffic — it terminates incoming mTLS and initiates outgoing mTLS, transparently to the application. The app speaks plain HTTP to the sidecar on localhost. The sidecar upgrades to mTLS for inter-service calls. Certificate rotation is also automatic.
Q6. What is a Kubernetes ClusterIP service and how does kube-proxy route traffic to it?
A ClusterIP is a virtual IP (VIP) — it doesn't correspond to any actual network interface. kube-proxy watches the Kubernetes API and programs iptables (or IPVS) rules on every node: packets destined for the ClusterIP:port are DNAT'd to a randomly selected healthy pod IP:port. This happens in the Linux kernel before the packet reaches any application, with no extra network hops.
Q7. What is a canary deployment in the context of a service mesh?
A canary deployment gradually routes a small percentage of traffic to a new version of a service while the rest continues to the stable version. Istio VirtualService weight routing allows this:
v1: 95%, v2: 5%. Monitor v2's error rate and latency. If healthy, increase to 20%, 50%, 100%. If problems appear, instantly route 100% back to v1. The service mesh makes this seamless — no DNS changes, no infrastructure changes, just a YAML update.
Q8. What is the difference between retry and circuit breaker patterns?
Retries handle transient failures — try again immediately or with backoff, hoping the next attempt succeeds (network blip, momentary unavailability). Circuit breakers handle sustained failures — stop trying when a service is clearly down, instead failing fast and returning a fallback immediately. They work together: retry handles flickers; circuit breaker trips when retries consistently fail, preventing retry storms from overwhelming a struggling service.