Deployment Strategies

A Deployment Strategy defines how a new version of a service is introduced into production. The goal is always the same: ship new code with zero downtime, minimize the blast radius of unexpected failures, and maintain a rapid, rehearsed rollback path.

Choosing the wrong strategy, or implementing the right one incorrectly, is one of the most common causes of production incidents during deployments. This guide covers every major strategy: its operational mechanics, configuration, failure modes, and when to use it.


1. Strategy Comparison Matrix

| Strategy | Rollback Speed | Blast Radius | Infra Cost | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Recreate | Medium | 100% | 1x | Low | Dev/test environments; batch jobs with no active users |
| Rolling | Slow (minutes) | Medium (~25%) | 1x | Low | Standard stateless microservice releases |
| Blue-Green | Instant (<30s) | Low | 2x | Medium | Critical services; major DB schema changes |
| Canary | Fast (minutes) | Very Low (<5%) | ~1.1x | High | User-facing features; behavioral validation at scale |
| Shadow | N/A | None | ~2x | High | Load testing; ML model validation; algorithm changes |
| Feature Flags | Instant | Configurable | 1x | Medium | Long-lived feature development; kill switches |
| A/B Testing | Instant | Configurable | ~1.1x | High | UX experiments; conversion rate optimization |

2. Recreate Deployment

The simplest strategy: terminate all instances of the old version, then start the new version. There is a downtime window between teardown and startup.

State 1 (running):  [ v1 ] [ v1 ] [ v1 ]
State 2 (gap):      [    ] [    ] [    ]   ← downtime window
State 3 (complete): [ v2 ] [ v2 ] [ v2 ]

spec:
  strategy:
    type: Recreate   # Kubernetes terminates all v1 pods before creating v2

When to use: Development environments, batch processing jobs, or any non-production workload where a brief outage is acceptable in exchange for deployment simplicity. Never use in production for user-facing services.

Why it exists: Recreate is the simplest safe choice when v1 and v2 cannot coexist, for example when a DB migration is not backward-compatible and you cannot run mixed versions even momentarily.


3. Rolling Deployment

Replaces instances of the old version one at a time (or in small batches), so the deployment set always has a mix of old and new versions until the rollout completes.

Start: [ v1 ] [ v1 ] [ v1 ] [ v1 ] (100% v1)
Step 1: [ v2 ] [ v1 ] [ v1 ] [ v1 ] (25% v2, 75% v1)
Step 2: [ v2 ] [ v2 ] [ v1 ] [ v1 ] (50% / 50%)
Step 3: [ v2 ] [ v2 ] [ v2 ] [ v1 ] (75% v2, 25% v1)
Complete: [ v2 ] [ v2 ] [ v2 ] [ v2 ] (100% v2)

Kubernetes Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod above the desired count during update
                         # e.g., 4 desired → max 5 pods temporarily exist
      maxUnavailable: 0  # Never reduce below desired count during update
                         # Guarantees full capacity is always available
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:v2.1.0
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3   # Pod is marked unready after 3 consecutive failed checks
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
      terminationGracePeriodSeconds: 60   # Pod-level field: give in-flight requests time to complete
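
The maxSurge and maxUnavailable arithmetic in the manifest above can be checked with a small calculation. The following is a minimal Python sketch, not part of any Kubernetes client: the helper names `resolve` and `rollout_bounds` are illustrative. It assumes the documented Deployment rounding rules, where a percentage rounds up for maxSurge and down for maxUnavailable.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string like "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * float(value[:-1]) / 100.0
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (maximum total pods, minimum available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas + surge, replicas - unavailable


print(rollout_bounds(4, 1, 0))            # the manifest above: (5, 4)
print(rollout_bounds(10, "25%", "25%"))   # percentage form: (13, 8)
```

With maxUnavailable set to 0, the minimum available count equals the desired replica count, which is why the manifest never drops below full capacity.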

Readiness vs Liveness Probes

Readiness probe controls when a pod receives traffic: a pod that fails readiness is removed from the Service's endpoints (its EndpointSlice) but is not restarted. Use this to protect users from hitting an unready pod.

Liveness probe controls when a container is restarted: a container that fails liveness is killed and restarted. Use this to recover from deadlocks or internal corruption.

Both must be configured correctly for rolling deployments to be safe. Missing readiness probes are among the most common causes of brief errors during rolling updates.

Rolling Deployment: The Mixed-Version Window Problem

During a rolling update, v1 and v2 are both serving production traffic simultaneously. This is the critical constraint that rolling deployments impose:

User A's request: ──► v1 pod (sees old schema, old business logic)
User B's request: ──► v2 pod (sees new schema, new business logic)
Same request type, different behavior: this must be acceptable

This means:

  • API contracts must be backward-compatible between v1 and v2.
  • Database schema changes cannot be breaking: v1 must still function against the post-migration schema. See Section 8.
  • Message formats (Kafka events, gRPC proto changes) must be forward- and backward-compatible.

Rollback: Kubernetes rolls back a deployment by triggering a reverse rolling update: it replaces v2 pods with v1 pods following the same maxSurge/maxUnavailable rules. Rollback is not instant; expect it to take roughly as long as the original rollout.

# Rollback to the previous revision
kubectl rollout undo deployment/order-service

# Rollback to a specific revision
kubectl rollout undo deployment/order-service --to-revision=3

# Check rollout history
kubectl rollout history deployment/order-service

4. Blue-Green Deployment

Maintains two complete, identical production environments, Blue and Green. At any time, only one environment serves live traffic; the other is idle and available for testing or immediate rollback.

Before deployment:
  Users ──► [ Load Balancer ] ──► Blue  (v1.0.0)  ← ACTIVE
                                  Green (idle)

Preparation (Green deployment and testing):
  Users ──► [ Load Balancer ] ──► Blue  (v1.0.0)  ← ACTIVE
                                  Green (v2.0.0)  ← being tested, no traffic

Cutover (atomic switch):
  Users ──► [ Load Balancer ]     Blue  (v1.0.0)  ← idle, kept for rollback
                              ──► Green (v2.0.0)  ← ACTIVE

Rollback (if needed, instant):
  Users ──► [ Load Balancer ] ──► Blue  (v1.0.0)  ← ACTIVE again
                                  Green (v2.0.0)  ← decommissioned

Kubernetes Implementation via Service Selectors

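The cutover mechanism is a label selector change on the Kubernetes Service, a near-instant operation that redirects all new connections to the other environment. Connections already established to the old environment's pods persist until they close.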

# Blue deployment: always running
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: order-service
      version: blue
  template:
    metadata:
      labels:
        app: order-service
        version: blue
    spec:
      containers:
        - name: order-service
          image: order-service:v1.0.0
---
# Green deployment: staged, not yet receiving traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: order-service
      version: green
  template:
    metadata:
      labels:
        app: order-service
        version: green
    spec:
      containers:
        - name: order-service
          image: order-service:v2.0.0
---
# The Service controls which environment receives traffic; change `version` to flip
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
    version: blue   # ← Change to "green" to perform cutover
  ports:
    - port: 80
      targetPort: 8080

# Cutover: patch the selector to switch to green
kubectl patch service order-service \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback: patch back to blue; typically takes effect within seconds
kubectl patch service order-service \
  -p '{"spec":{"selector":{"version":"blue"}}}'

DNS-Based Cutover (AWS Route 53 / Multi-Region)

For cross-region or infrastructure-level blue-green, traffic switching can be done at the DNS layer:

Before:
  order-service.company.com → Route53 → Blue ALB  (Weight: 100%)
                                        Green ALB (Weight: 0%)

Cutover:
  order-service.company.com → Route53 → Blue ALB  (Weight: 0%)
                                        Green ALB (Weight: 100%)

DNS TTL and Stale Clients

DNS changes are not instant. Clients cache DNS responses for the duration of the TTL. Before a blue-green cutover via DNS, lower the record's TTL to 60 seconds well in advance: at least 2x the current TTL before the deployment window, so that every cached copy of the old, longer-TTL record has expired by then. After confirming the green environment is stable, raise the TTL back to normal. Without this, some clients will continue hitting the blue environment for minutes (or longer) after the intended cutover.
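
The timing rule above can be worked out mechanically. The sketch below is illustrative only: `ttl_plan` and its parameters are hypothetical names, and it assumes resolvers honor the published TTL, which not every client does.

```python
from datetime import datetime, timedelta


def ttl_plan(cutover_at, current_ttl_s, lowered_ttl_s=60, safety_factor=2):
    """Return (latest time to lower the TTL, time by which stale answers should expire)."""
    # Lower the TTL at least safety_factor * current TTL before the cutover,
    # so cached copies of the long-TTL record have expired by then.
    lower_by = cutover_at - timedelta(seconds=safety_factor * current_ttl_s)
    # After the weight flip, a well-behaved client holds the old answer
    # for at most one lowered TTL.
    stale_until = cutover_at + timedelta(seconds=lowered_ttl_s)
    return lower_by, stale_until


lower_by, stale_until = ttl_plan(datetime(2024, 1, 1, 12, 0), current_ttl_s=3600)
print("Lower TTL no later than:", lower_by)
print("Expect blue traffic to stop by:", stale_until)
```

For a record with a one-hour TTL and a noon cutover, this gives 10:00 as the deadline for lowering the TTL and 12:01 as the point after which well-behaved clients should all resolve to green.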

Blue-Green Costs and Trade-offs

| Benefit | Cost |
| --- | --- |
| Instant rollback (flip the selector back) | 2x infrastructure cost at all times |
| Zero traffic impact during testing | Database and stateful session migrations are complex |
| Clean separation of old and new environments | Warming up a fresh green environment takes time |
| Can run full integration tests against green before cutover | Long-lived connections (WebSockets) must be drained manually |

The database problem: Blue-Green is straightforward for stateless services. For services backed by a shared database, both environments share the same database during the transition, so the same backward-compatibility constraints as rolling deployments apply. The database schema must support both v1 (blue) and v2 (green) simultaneously. See Section 8.


5. Canary Deployment

Exposes the new version to a small, controlled subset of real production traffic (the canary) while the stable version serves the majority. The canary is monitored closely; if metrics remain healthy, traffic is gradually increased. If not, the canary is killed and traffic returns entirely to stable.

                   ┌───────────────────────┐
All Traffic ────►  │     Load Balancer     │
                   └──┬─────────────────┬──┘
                      │                 │
              95% ────┘                 └──── 5%
                ▼                             ▼
        ┌──────────────┐               ┌─────────────┐
        │  v1 Stable   │               │  v2 Canary  │
        │  (4 pods)    │               │  (1 pod)    │
        └──────────────┘               └─────────────┘
                │                             │
        Healthy metrics                Monitored closely

Istio VirtualService Traffic Splitting

# DestinationRule defines the two subsets (stable and canary)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
---
# VirtualService controls the traffic split weights
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 95
        - destination:
            host: order-service
            subset: canary
          weight: 5

Argo Rollouts: Automated Progressive Delivery

Manually managing canary weight increases is error-prone. Argo Rollouts automates progressive traffic shifting and integrates with Prometheus to automatically roll back if error rate thresholds are breached.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      # Traffic steps: increase weight over time, pausing to validate at each step
      steps:
        - setWeight: 5     # Step 1: send 5% to canary
        - pause:
            duration: 5m   # Wait 5 minutes, evaluate metrics
        - setWeight: 20    # Step 2: increase to 20%
        - pause:
            duration: 10m
        - setWeight: 50    # Step 3: 50/50 split
        - pause:
            duration: 10m
        - setWeight: 100   # Step 4: full canary rollout complete

      # Automated rollback: background analysis runs alongside the steps;
      # if the Prometheus checks fail, the rollout is aborted and traffic returns to stable
      analysis:
        templates:
          - templateName: success-rate-check
        startingStep: 1
        args:
          - name: service-name
            value: order-service
---
# AnalysisTemplate defines the automated health check logic
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      # A measurement is taken every minute and passes when the success rate
      # (computed over a 2-minute rate window) is at least 99%
      successCondition: result[0] >= 0.99
      failureLimit: 3   # Tolerate up to 3 failed measurements; one more fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
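
To make the interaction between successCondition and failureLimit concrete, here is a simplified Python model of how a series of measurements turns into a verdict. It is a sketch, not Argo Rollouts code: `analysis_verdict` is a hypothetical function, and it assumes failureLimit means the number of failed measurements tolerated before the metric is considered failed.

```python
def analysis_verdict(success_rates, threshold=0.99, failure_limit=3):
    """Return "Failed" once more than failure_limit measurements miss the threshold."""
    failures = 0
    for rate in success_rates:
        if rate < threshold:          # mirrors successCondition: result[0] >= 0.99
            failures += 1
            if failures > failure_limit:
                return "Failed"       # at this point the rollout would be aborted
    return "Successful"


# Three bad minutes are tolerated with failureLimit: 3 ...
print(analysis_verdict([0.999, 0.98, 0.97, 0.995, 0.96]))
# ... but a fourth one fails the analysis.
print(analysis_verdict([0.999, 0.98, 0.97, 0.995, 0.96, 0.90]))
```

Under this model, a lower failureLimit rolls back sooner but is more sensitive to a single noisy measurement; the right value depends on how stable the metric is at low canary traffic.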

Canary Targeting: Header and User-Based Routing

Beyond pure percentage-based routing, production canary systems often route to the canary based on user identity or request attributes. This ensures that the same user always hits the same version, and that internal users or beta testers hit the canary first.

# Istio VirtualService: route internal employees to canary, everyone else to stable
http:
  - match:
      - headers:
          x-user-group:
            exact: "internal-beta"   # Route requests from internal users to canary
    route:
      - destination:
          host: order-service
          subset: canary
        weight: 100
  - route:                           # Default: all other traffic to stable
      - destination:
          host: order-service
          subset: stable
        weight: 100

What to Monitor During a Canary

A canary is only as safe as the metrics you watch. Monitor these in parallel for stable vs. canary:

| Metric Category | Specific Signals | Rollback Trigger |
| --- | --- | --- |
| Error Rate | HTTP 5xx rate, gRPC error rate | Canary 5xx rate exceeds stable 5xx rate by more than 0.5 percentage points |
| Latency | p50, p95, p99 response time | Canary p99 exceeds stable p99 by more than 20% |
| Saturation | CPU%, memory%, thread pool depth | Canary resource usage trending upward |
| Business Metrics | Order success rate, payment completion | Canary conversion rate significantly below stable |
| Downstream Impact | Error rates in services called by the canary | Cascading errors into dependencies |
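
The two numeric triggers in the table (error rate and p99 latency) can be expressed as a small comparison function. This is a minimal sketch under stated assumptions: `VersionStats` and `rollback_reasons` are hypothetical names, the 0.5 figure is read as percentage points, and the inputs are assumed to be already aggregated per version over the same time window.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    error_rate: float  # fraction of requests returning 5xx, e.g. 0.004 means 0.4%
    p99_ms: float      # 99th percentile latency in milliseconds


def rollback_reasons(stable, canary, max_error_delta=0.005, max_p99_ratio=1.20):
    """Return the list of breached triggers; an empty list means the canary looks healthy."""
    reasons = []
    # Error rate: canary may exceed stable by at most 0.5 percentage points.
    if canary.error_rate - stable.error_rate > max_error_delta:
        reasons.append("error rate")
    # Latency: canary p99 may exceed stable p99 by at most 20%.
    if stable.p99_ms > 0 and canary.p99_ms / stable.p99_ms > max_p99_ratio:
        reasons.append("p99 latency")
    return reasons


stable = VersionStats(error_rate=0.002, p99_ms=200.0)
print(rollback_reasons(stable, VersionStats(error_rate=0.004, p99_ms=230.0)))  # healthy
print(rollback_reasons(stable, VersionStats(error_rate=0.010, p99_ms=250.0)))  # both breached
```

Comparing the canary against stable, rather than against a fixed threshold, keeps the check meaningful when the whole system is having a bad day for unrelated reasons.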

6. Shadow Deployment (Dark Launch)

Mirrors a copy of all real production requests to the new version, running in parallel. The shadow version processes each request fully but its responses are discarded; users only receive responses from the stable version. The shadow must not write to shared production databases or call external APIs.

┌─────────────────────────────────────────────────────┐
│               Ingress / Service Mesh                │
└───────────────┬──────────────────┬──────────────────┘
                │                  │ (mirrored copy)
           100% │                  │ 100% (shadowed)
                ▼                  ▼
        ┌──────────────┐  ┌──────────────────┐
        │  v1 Stable   │  │    v2 Shadow     │
        │              │  │                  │
        │  Reads DB    │  │  Reads DB (safe) │
        │  Writes DB   │  │  Discards writes │
        │  Calls APIs  │  │  Stubs ext APIs  │
        └──────┬───────┘  └──────────────────┘
               │                   ↑
        Response returned   Response discarded
        to user             (metrics captured)

Istio Traffic Mirroring Configuration

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 100
      mirror:
        host: order-service
        subset: shadow    # Traffic is mirrored here; requires a "shadow" subset in the DestinationRule
      mirrorPercentage:
        value: 100.0      # Mirror 100% of requests (or set lower for cost savings)

What Shadow Deployment Validates

Shadow deployments are the highest-confidence validation tool available before a production release, but also the highest in cost and complexity. Use them specifically for:

| Use Case | What You're Validating |
| --- | --- |
| Algorithm replacement | Does the new recommendation engine produce different results? Are the differences acceptable? |
| ML model promotion | Does the new model perform better or worse on real production inputs than the current model? |
| Database migration | Does the new code work correctly against the migrated schema? |
| Rewrite of a critical service | Does the rewritten service produce identical output for all production request shapes? |
| Load testing | Can the new service sustain real production traffic volume without degrading? |
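
For the "identical output" use cases in the table, the comparison itself is usually a structural diff of the stable and shadow responses. The following is a minimal sketch, assuming both responses are JSON-like dictionaries; `diff_responses` and the ignored field names (`timestamp`, `request_id`) are hypothetical and would need to match the real payloads.

```python
def diff_responses(stable, shadow, ignore=("timestamp", "request_id"), path=""):
    """Return the dotted paths at which two JSON-like responses differ."""
    diffs = []
    if isinstance(stable, dict) and isinstance(shadow, dict):
        for key in sorted(set(stable) | set(shadow)):
            if key in ignore:
                continue  # volatile fields are expected to differ on every request
            child = f"{path}.{key}" if path else key
            if key not in stable or key not in shadow:
                diffs.append(child)  # field present on one side only
            else:
                diffs.extend(diff_responses(stable[key], shadow[key], ignore, child))
    elif stable != shadow:
        diffs.append(path or "<root>")
    return diffs


stable_resp = {"total": 42.5, "items": [1, 2], "timestamp": "t1", "meta": {"currency": "USD"}}
shadow_resp = {"total": 42.5, "items": [1, 2], "timestamp": "t2", "meta": {"currency": "EUR"}}
print(diff_responses(stable_resp, shadow_resp))  # ['meta.currency']
```

Recording only the differing paths, rather than full payloads, keeps the diff log small and avoids copying user data into a second system.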

Shadow Deployment Constraints

Shadow versions must be carefully isolated from shared state:

Shadow Version MUST:                 Shadow Version MUST NOT:
─────────────────────────────────────────────────────────────────────────
✅ Read from production DB           ❌ Write to production DB
✅ Generate metrics/logs             ❌ Send emails/SMS to real users
✅ Call internal read-only APIs      ❌ Call external payment APIs
✅ Record shadow response diffs      ❌ Enqueue messages to production queues

Violating these rules can cause real user harm. For example, a shadow version that writes to the production DB will corrupt data, and a shadow version that calls Stripe will process real payments twice.


7. Feature Flags

Feature flags (also called feature toggles or feature gates) decouple code deployment from feature release. Code for a new feature is deployed to production but stays dormant; the feature is activated by flipping a flag in a configuration store, without any new deployment.

Deploy:   New code ships to 100% of instances (flag defaults to OFF)
             │
Activate: Flag flipped to ON for 5% of users (internal beta)
             │
Expand:   Flag expanded to 25%, then 50%, then 100%
             │
Cleanup:  Flag removed from code; feature is always-on
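
The Activate and Expand steps above rely on percentage rollouts being sticky: a user who sees the feature at 5% should still see it at 25%. One common way to get that property is deterministic hash bucketing, sketched below. This is an illustration, not how any particular flag vendor implements it; `bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def bucket(flag_key, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_key, user_id, rollout_percent):
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket(flag_key, user_id) < rollout_percent


users = [f"user-{i}" for i in range(10000)]
at_5 = {u for u in users if is_enabled("new-checkout-flow", u, 5)}
at_25 = {u for u in users if is_enabled("new-checkout-flow", u, 25)}
print(len(at_5), len(at_25))   # roughly 5% and 25% of 10,000 users
print(at_5 <= at_25)           # everyone enabled at 5% stays enabled at 25%
```

Including the flag key in the hash means each flag gets an independent split, so the same users are not always the first to receive every new feature.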

Flag Types

| Type | Description | Example |
| --- | --- | --- |
| Release toggle | Hide a partially-built feature until it's complete | new-checkout-flow: false |
| Ops toggle / kill switch | Disable a feature under load without deployment | enable-recommendation-service: false |
| Experiment toggle | Route a subset of users for A/B testing | new-pricing-algorithm: 10% |
| Permission toggle | Enable a feature for a specific user group | beta-dashboard: internal-users |

Implementation with a Flag SDK

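// Using the LaunchDarkly Java server SDK 5.x (LDUser API); SDK 6+ replaces LDUser with LDContext.
// An OpenFeature-compatible client follows the same evaluate-with-default pattern.
import com.launchdarkly.sdk.LDUser;
import com.launchdarkly.sdk.server.LDClient;
import org.springframework.stereotype.Service;

@Service
public class CheckoutService {
    private final LDClient ldClient;
    // NewCheckoutFlow and LegacyCheckoutFlow are application classes assumed to exist.
    private final NewCheckoutFlow newCheckoutFlow;
    private final LegacyCheckoutFlow legacyCheckoutFlow;

    public CheckoutService(LDClient ldClient,
                           NewCheckoutFlow newCheckoutFlow,
                           LegacyCheckoutFlow legacyCheckoutFlow) {
        this.ldClient = ldClient;
        this.newCheckoutFlow = newCheckoutFlow;
        this.legacyCheckoutFlow = legacyCheckoutFlow;
    }

    public CheckoutResponse processCheckout(User user, Cart cart) {
        // Attributes on the user let the flag be percentage-based, user-attribute-based, etc.
        LDUser ldUser = new LDUser.Builder(user.getId())
                .country(user.getCountry())
                .custom("plan", user.getPlan())
                .build();

        boolean useNewCheckoutFlow = ldClient.boolVariation(
                "new-checkout-flow",
                ldUser,
                false   // default value if the flag service is unreachable
        );

        if (useNewCheckoutFlow) {
            return newCheckoutFlow.process(cart);
        } else {
            return legacyCheckoutFlow.process(cart);
        }
    }
}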

Kill Switches: The Most Important Flag Type

A kill switch is an ops toggle that immediately disables a feature that is causing production issues, without a deployment. This is the fastest possible mitigation for a production incident caused by a specific feature.

Production incident at 02:00:
  New recommendation engine consuming 4x expected CPU → latency spiking

02:01: Engineer flips kill switch "enable-recommendation-service" → false
02:01: All instances read flag; recommendation service calls bypassed
02:01: Latency returns to normal. No deployment required. No rollback required.

08:00: Team investigates, fixes the issue, re-enables the flag in staging

This is why kill switches should be provisioned for every major new feature before it goes to production: not as an afterthought, but as a required part of the feature's design.

Feature Flags vs. Canary Deployment

| Dimension | Feature Flag | Canary Deployment |
| --- | --- | --- |
| Requires new deployment to activate | ❌ No | ✅ Yes |
| Requires new deployment to roll back | ❌ No (flip the flag) | ✅ Yes |
| Targets specific users/groups | ✅ Fine-grained (user attributes) | Limited (header-based) |
| Requires code change to remove | ✅ Yes (flag debt if not cleaned up) | ❌ No |
| Infrastructure cost | 1x | ~1.1x |
| Best for | Long-lived feature development; kill switches | Infrastructure changes; version migrations |

8. Database Migrations: The Backward Compatibility Trap

No zero-downtime deployment strategy survives a breaking database schema change. If v2 renames a column while v1 is still running (rolling update) or held ready for rollback (blue-green), v1's queries against the old column name start failing the moment the migration is applied.

This is the most commonly underestimated constraint in production deployments.

The Expand and Contract Pattern

Expand-Contract (also called Parallel Change) is the standard pattern for backward-compatible database migrations:

Phase 1: EXPAND (add, never remove)
─────────────────────────────────────
Goal: make the schema support BOTH old and new versions simultaneously.

  ALTER TABLE orders ADD COLUMN new_status VARCHAR(50);

  Application code (v2):
    - Writes to BOTH old_status and new_status
    - Reads from old_status (source of truth still = old column)
    - v1 instances still work: they ignore new_status entirely
    - v2 instances are safe: they write both columns

Phase 2: MIGRATE (backfill data)
───────────────────────────────────
Goal: populate new_status for all existing rows.

  UPDATE orders SET new_status = old_status WHERE new_status IS NULL;
  -- Run as a background migration; batch updates to avoid long-held locks
  -- Start once no v1 instances remain: v1 writes old_status only
  -- Verify: SELECT COUNT(*) FROM orders WHERE new_status IS NULL;

Phase 3: CONTRACT (remove the old)
──────────────────────────────────────
Goal: switch reads to new column, then drop old column.

  Deploy v3: reads and writes to new_status only; no longer references old_status
  Confirm all v2 instances are gone (no mixed-version traffic)

  ALTER TABLE orders DROP COLUMN old_status;
  -- Safe: no live code references this column anymore
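
The Phase 2 backfill is the step most likely to hurt a live database if run as a single statement. Below is a minimal, runnable sketch of a batched backfill using Python's built-in sqlite3 module against an in-memory table. The function name `backfill_new_status` and the batch size are illustrative; on PostgreSQL or MySQL the same loop would apply with that database's driver and locking behavior.

```python
import sqlite3


def backfill_new_status(conn, batch_size=1000):
    """Copy old_status into new_status in small batches; return the number of rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE orders SET new_status = old_status
            WHERE id IN (
                SELECT id FROM orders
                WHERE new_status IS NULL AND old_status IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps lock hold times small
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, old_status TEXT)")
conn.executemany("INSERT INTO orders (old_status) VALUES (?)", [("PAID",)] * 2500)
conn.execute("ALTER TABLE orders ADD COLUMN new_status TEXT")   # Phase 1: expand
updated = backfill_new_status(conn, batch_size=1000)             # Phase 2: migrate
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE new_status IS NULL"
).fetchone()[0]
print(updated, remaining)
```

The `old_status IS NOT NULL` filter matters: without it, rows whose old value is NULL would be selected again on every pass and the loop would never terminate.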

Common Database Anti-Patterns That Break Zero-Downtime Deployments

| Anti-Pattern | Why It Breaks | Safe Alternative |
| --- | --- | --- |
| ALTER TABLE ... DROP COLUMN in the same deploy as code that stops using it | v1 instances crash immediately | Drop only after all v1 instances are gone (Phase 3) |
| ALTER TABLE ... RENAME COLUMN | Both old and new names are live simultaneously | Add new column, migrate data, drop old (Expand-Contract) |
| Adding a NOT NULL column without a default | Existing rows fail constraint; v1 inserts without the column fail | Add with a default value first; make it NOT NULL only after migration |
| Changing a column's data type in-place | v1 code may send incompatible types | Add new column with new type, migrate, drop old |
| Removing an index that v1 queries depend on | Query plan degrades or fails | Only remove after confirming no running version uses that index |

Migration Tooling

Production database migrations should be:

  • Version-controlled (Flyway, Liquibase, Alembic): every schema change tracked as a numbered migration file.
  • Idempotent: safe to run twice without error.
  • Reversible: every upgrade migration should have a corresponding downgrade migration, especially during the Expand phase.
  • Non-blocking: use CREATE INDEX CONCURRENTLY (PostgreSQL) or ALGORITHM=INPLACE (MySQL) to avoid blocking writes during index creation on large tables.

-- PostgreSQL: concurrent index creation (does not block writes to the table;
-- cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_orders_new_status ON orders(new_status);

-- MySQL: online DDL that minimizes locking
ALTER TABLE orders ADD COLUMN new_status VARCHAR(50), ALGORITHM=INPLACE, LOCK=NONE;

9. Observability During Deployments

A deployment without proper monitoring is flying blind. The following observability setup is required for any production deployment strategy to be operated safely.

Golden Signals: What to Watch Per Version

During any deployment with mixed traffic (rolling, canary), instrument your monitoring to segment metrics by version label:

# Prometheus metric labels: always include version
http_requests_total{service="order-service", version="v2", status="500"} 42
http_request_duration_seconds{service="order-service", version="v2", quantile="0.99"} 0.45

# Error rate per version: compare canary (v2) vs stable (v1)
sum(rate(http_requests_total{service="order-service", version="v2", status=~"5.."}[2m]))
/
sum(rate(http_requests_total{service="order-service", version="v2"}[2m]))

# p99 latency per version
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="order-service"}[2m])) by (le, version)
)
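
It helps to know what histogram_quantile actually computes, because the result is an estimate bounded by the bucket layout. The Python sketch below reproduces the basic linear-interpolation idea for classic histograms. It is a simplified model for 0 < q <= 1 with at least one finite bucket, not the Prometheus implementation, and the bucket counts are made-up illustration values.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0   # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return buckets[-2][0]   # cap at the highest finite bucket bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Illustrative cumulative counts for le = 0.1s, 0.5s, 1.0s, +Inf
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.50, buckets))
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.999, buckets))
```

Note the last case: once the rank falls into the +Inf bucket, the estimate is capped at the highest finite bound, so a canary whose true p99 is far above the largest bucket can look better than it is. Bucket boundaries should bracket the latency thresholds used for rollback.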

Automated Rollback Triggers

Deployments should roll back automatically when health metrics breach thresholds, without requiring a human to be watching at 3 AM.

| Tool | Mechanism | Integration |
| --- | --- | --- |
| Argo Rollouts | AnalysisTemplate queries Prometheus on a schedule; fails rollout if thresholds breached | Kubernetes-native; integrates with Istio, Nginx |
| Flagger | Continuous metric checks during canary promotion; auto-promotes or rolls back | Works with several service meshes and ingress controllers (e.g., Istio, Linkerd, NGINX) |
| Spinnaker | Deployment pipeline stages with integrated metric gates | Multi-cloud; integrates with Datadog, New Relic, Prometheus |
| AWS CodeDeploy | CloudWatch Alarms can trigger automatic deployment rollback | AWS-native; ECS, EC2, Lambda |

Pre-Deployment Checklist

Before any production deployment, verify:

  • Readiness and liveness probes configured and tested
  • terminationGracePeriodSeconds set to exceed longest expected in-flight request duration
  • Database migrations reviewed for backward compatibility with the previous version
  • Feature flag kill switch provisioned for major new features
  • Prometheus dashboards segmented by version label
  • Automated rollback trigger thresholds configured (error rate, latency)
  • Rollback procedure documented and rehearsed (not just written)
  • Stakeholder notification plan ready (support, on-call, external status page)

10. Common Failure Modes and Anti-Patterns

1. Missing Readiness Probes → User-Visible Errors During Rolling Updates

Without a readiness probe, Kubernetes considers a pod ready as soon as its containers have started, not when the application can actually serve traffic. Newly started pods then receive requests while still initializing (loading caches, establishing DB connections, warming JIT), causing errors.

Fix: always configure a readiness probe that returns 200 only when the pod
is fully initialized and capable of handling requests.

2. Stateful Connections Not Drained → Broken WebSockets and Uploads

When a rolling update terminates a v1 pod, any long-lived connections (WebSockets, SSE, streaming uploads) on that pod are abruptly closed.

# Fix: configure a preStop hook to drain connections gracefully before shutdown
spec:
  terminationGracePeriodSeconds: 75   # Pod-level; must be > preStop sleep + longest request duration
  containers:
    - name: order-service
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]   # Allow load balancer to deregister pod
                                                      # and existing connections to close naturally

3. Missing Automated Rollback → Manual Intervention at 2 AM

Relying on a human to notice a deployment is degrading and manually trigger a rollback means:

  • Mean time to recovery (MTTR) is as long as it takes someone to notice and act.
  • Errors at night or on weekends go undetected longer.
  • Rollback procedures are executed inconsistently under stress.

Fix: use Argo Rollouts (AnalysisTemplates) or Flagger (MetricTemplates) with Prometheus-driven checks.
If p99 latency exceeds the threshold for 3 consecutive minutes, roll back automatically.

4. Session Affinity Masking Canary Failure Rates

If the load balancer uses sticky sessions (cookie-based or IP-based affinity), some users will be permanently pinned to the canary pod. This corrupts percentage-based error rate analysis: if the canary is broken for specific user flows, only those pinned users are affected, and aggregate metrics may look healthy.

Fix: disable sticky sessions during canary testing, or use user-ID-based
consistent routing (via Istio header matching) instead of infrastructure-level
session affinity, so you retain control over which users hit the canary.

5. Treating All Deployments as Identical

Not every deployment carries the same risk. A CSS change and a payment processing engine change should not use the same deployment strategy.

Risk Classification:
  LOW      → CSS/static assets, config-only changes, copy updates
             Strategy: Rolling with a high maxSurge

  MEDIUM   → New API endpoints, non-breaking logic changes
             Strategy: Canary at 5% → 25% → 100% with automated rollback

  HIGH     → Payment flows, auth systems, DB schema changes
             Strategy: Blue-Green with manual smoke tests before cutover

  CRITICAL → Core ledger changes, regulatory-mandated changes
             Strategy: Blue-Green + shadow validation + extended canary

11. Strategy Selection Guide

Is the service stateless?
├── NO (stateful: WebSockets, active uploads, session-pinned)
│   └──► Rolling with long terminationGracePeriodSeconds + preStop drain hook
│
└── YES
    │
    Can v1 and v2 run simultaneously? (API compat, DB compat)
    ├── NO
    │   └──► Blue-Green (complete switch) or Recreate (if downtime is acceptable)
    │
    └── YES
        │
        Is this a high-risk change? (payments, auth, major new feature)
        ├── YES
        │   ├── Need to validate under real load before full release?
        │   │   └──► Shadow (validate) → then Canary (progressive rollout)
        │   └── Need instant guaranteed rollback?
        │       └──► Blue-Green
        │
        └── NO (standard low-risk release)
            ├── Need user-targeted rollout? (internal beta, A/B test)
            │   └──► Canary with header-based routing or Feature Flags
            └── Standard release
                └──► Rolling (simple, cost-effective)
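
The decision tree above can also be encoded as a function, which is handy for a release checklist or a CI gate. The sketch below is one possible reading of the tree, with hypothetical parameter names. Two assumptions are made where the tree leaves a choice open: Recreate is picked over Blue-Green only when downtime is acceptable, and a high-risk change that does not need load validation defaults to Blue-Green.

```python
def choose_strategy(stateless, versions_can_coexist, high_risk=False,
                    validate_under_load=False, downtime_acceptable=False,
                    user_targeted=False):
    """Walk the strategy selection tree and return the recommended strategy."""
    if not stateless:
        return "Rolling with long terminationGracePeriodSeconds + preStop drain hook"
    if not versions_can_coexist:
        # Assumption: prefer Recreate only when a downtime window is acceptable.
        return "Recreate" if downtime_acceptable else "Blue-Green"
    if high_risk:
        # Assumption: Blue-Green is the default high-risk choice without load validation.
        return "Shadow, then Canary" if validate_under_load else "Blue-Green"
    if user_targeted:
        return "Canary with header-based routing or Feature Flags"
    return "Rolling"


print(choose_strategy(stateless=True, versions_can_coexist=True))
print(choose_strategy(stateless=True, versions_can_coexist=True,
                      high_risk=True, validate_under_load=True))
print(choose_strategy(stateless=True, versions_can_coexist=False))
```

Encoding the tree this way makes the ordering of the questions explicit: statefulness and version compatibility are checked before risk, because they constrain which strategies are even possible.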