Deployment Strategies
A Deployment Strategy defines how a new version of a service is introduced into production. The goal is always the same: ship new code with zero downtime, minimize the blast radius of unexpected failures, and maintain a rapid, rehearsed rollback path.
Choosing the wrong strategy, or implementing the right one incorrectly, is one of the most common causes of production incidents during deployments. This guide covers each major strategy: its operational mechanics, configuration, failure modes, and when to use it.
1. Strategy Comparison Matrix
| Strategy | Rollback Speed | Blast Radius | Infra Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Recreate | Medium | 100% | 1x | Low | Dev/test environments; batch jobs with no active users |
| Rolling | Slow (minutes) | Medium (~25%) | 1x | Low | Standard stateless microservice releases |
| Blue-Green | Instant (<30s) | Low | 2x | Medium | Critical services; major DB schema changes |
| Canary | Fast (minutes) | Very Low (<5%) | ~1.1x | High | User-facing features; behavioral validation at scale |
| Shadow | N/A | None | ~2x | High | Load testing; ML model validation; algorithm changes |
| Feature Flags | Instant | Configurable | 1x | Medium | Long-lived feature development; kill switches |
| A/B Testing | Instant | Configurable | ~1.1x | High | UX experiments; conversion rate optimization |
2. Recreate Deployment
The simplest strategy: terminate all instances of the old version, then start the new version. There is a downtime window between teardown and startup.
State 1 (running):  [ v1 ] [ v1 ] [ v1 ]
State 2 (gap):      [    ] [    ] [    ]   <- downtime window
State 3 (complete): [ v2 ] [ v2 ] [ v2 ]
spec:
  strategy:
    type: Recreate   # Kubernetes terminates all v1 pods before creating v2
When to use: Development environments, batch processing jobs, or any non-production workload where a brief outage is acceptable in exchange for deployment simplicity. Avoid it in production for user-facing services unless a maintenance window is acceptable.
Why it exists: Recreate is the simplest safe choice when v1 and v2 cannot co-exist, for example when a DB migration is not backward-compatible and you cannot run mixed versions even momentarily.
3. Rolling Deployment
Replaces instances of the old version one at a time (or in small batches), so the deployment set always has a mix of old and new versions until the rollout completes.
Start: [ v1 ] [ v1 ] [ v1 ] [ v1 ] (100% v1)
Step 1: [ v2 ] [ v1 ] [ v1 ] [ v1 ] (25% v2, 75% v1)
Step 2: [ v2 ] [ v2 ] [ v1 ] [ v1 ] (50% / 50%)
Step 3: [ v2 ] [ v2 ] [ v2 ] [ v1 ] (75% v2, 25% v1)
Complete: [ v2 ] [ v2 ] [ v2 ] [ v2 ] (100% v2)
Kubernetes Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod above the desired count during update
                           # e.g., 4 desired -> max 5 pods temporarily exist
      maxUnavailable: 0    # Never reduce below desired count during update
                           # Guarantees full capacity is always available
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:v2.1.0
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3   # Pod is marked unready (removed from endpoints) after 3 consecutive failed checks
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
      terminationGracePeriodSeconds: 60   # Give in-flight requests time to complete
Readiness probe controls when a pod receives traffic: a pod that fails readiness is removed from the Service's endpoints (EndpointSlice) but is not restarted. Use this to protect users from hitting an unready pod.
Liveness probe controls when a container is restarted: a container that fails liveness is killed and restarted by the kubelet. Use this to recover from deadlocks or internal corruption.
Both must be configured correctly for rolling deployments to be safe. Missing readiness probes are one of the most common causes of brief errors during rolling updates.
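To make the effect of maxSurge and maxUnavailable concrete, the following is a minimal simulation sketch, not the actual Deployment controller logic. It assumes absolute pod counts (Kubernetes also accepts percentages) and that every new pod passes its readiness probe before the next batch starts. The function name `rolling_update_steps` is hypothetical.

```python
def rolling_update_steps(replicas, max_surge, max_unavailable):
    """Return the (old_pods, new_pods) counts after each batch of a rolling update.

    Simplified model: new pods become ready immediately, and the controller
    alternates between scaling up the new version and scaling down the old one.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = replicas, 0
    states = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may never exceed replicas + max_surge.
        create = min(replicas - new, replicas + max_surge - (old + new))
        new += create
        # Scale down: available pods may never drop below replicas - max_unavailable.
        remove = min(old, max(0, old + new - (replicas - max_unavailable)))
        old -= remove
        states.append((old, new))
    return states


# With the configuration above (4 replicas, maxSurge=1, maxUnavailable=0),
# the rollout replaces one pod per batch and never drops below 4 serving pods.
for old, new in rolling_update_steps(4, 1, 0):
    print(f"v1={old} v2={new}")
```

Under these assumptions the output reproduces the five states in the diagram at the start of this section. Raising maxSurge shortens the rollout at the price of more temporary pods.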
Rolling Deployment: The Mixed-Version Window Problem
During a rolling update, v1 and v2 are both serving production traffic simultaneously. This is the critical constraint that rolling deployments impose:
User A's request: --> v1 pod (sees old schema, old business logic)
User B's request: --> v2 pod (sees new schema, new business logic)
Same request type, different behavior: must be acceptable
This means:
- API contracts must be backward-compatible between v1 and v2.
- Database schema changes cannot be breaking: v1 must still function against the post-migration schema. See Section 8.
- Message formats (Kafka events, gRPC proto changes) must be forward- and backward-compatible.
Rollback: Kubernetes rolls back a deployment by triggering a reverse rolling update: it replaces v2 pods with v1 pods following the same maxSurge/maxUnavailable rules. Rollback is not instant; it takes roughly as long as the original rollout.
# Rollback to the previous revision
kubectl rollout undo deployment/order-service
# Rollback to a specific revision
kubectl rollout undo deployment/order-service --to-revision=3
# Check rollout history
kubectl rollout history deployment/order-service
4. Blue-Green Deployment
Maintains two complete, identical production environments, Blue and Green. At any time, only one environment serves live traffic; the other is idle and available for testing or immediate rollback.
Before deployment:
Users --> [ Load Balancer ] --+-- Blue  (v1.0.0)   <- ACTIVE
                              +-- Green (idle)
Preparation (Green deployment and testing):
Users --> [ Load Balancer ] --+-- Blue  (v1.0.0)   <- ACTIVE
                              +-- Green (v2.0.0)   <- being tested, no traffic
Cutover (atomic switch):
Users --> [ Load Balancer ] --+-- Blue  (v1.0.0)   <- idle, kept for rollback
                              +-- Green (v2.0.0)   <- ACTIVE
Rollback (if needed, instant):
Users --> [ Load Balancer ] --+-- Blue  (v1.0.0)   <- ACTIVE again
                              +-- Green (v2.0.0)   <- decommissioned
Kubernetes Implementation via Service Selectors
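The cutover mechanism is a label selector change on the Kubernetes Service: a near-instant operation that redirects all new connections to the other environment. Connections already established to the previous environment persist until they close.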
# Blue deployment: always running
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: order-service
      version: blue
  template:
    metadata:
      labels:
        app: order-service
        version: blue
    spec:
      containers:
        - name: order-service
          image: order-service:v1.0.0
---
# Green deployment: staged, not yet receiving traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: order-service
      version: green
  template:
    metadata:
      labels:
        app: order-service
        version: green
    spec:
      containers:
        - name: order-service
          image: order-service:v2.0.0
---
# The Service controls which environment receives traffic; change `version` to flip
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
    version: blue   # <- Change to "green" to perform cutover
  ports:
    - port: 80
      targetPort: 8080
# Cutover: patch the selector to switch to green
kubectl patch service order-service \
-p '{"spec":{"selector":{"version":"green"}}}'
# Rollback: patch back to blue (takes effect in <5 seconds)
kubectl patch service order-service \
-p '{"spec":{"selector":{"version":"blue"}}}'
DNS-Based Cutover (AWS Route 53 / Multi-Region)
For cross-region or infrastructure-level blue-green, traffic switching can be done at the DNS layer:
Before:
order-service.company.com -> Route53 -> Blue ALB  (Weight: 100%)
                                        Green ALB (Weight: 0%)
Cutover:
order-service.company.com -> Route53 -> Blue ALB  (Weight: 0%)
                                        Green ALB (Weight: 100%)
DNS changes are not instant. Clients cache DNS responses for the duration of the TTL. Before a blue-green cutover via DNS, lower the TTL to 60 seconds well in advance (at least 2x the current TTL before the deployment window). After confirming the green environment is stable, raise the TTL back to normal. Without this, some clients will continue hitting the blue environment for minutes after the intended cutover.
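The TTL timing rule above is simple arithmetic, sketched here under the assumption that resolvers honor the TTL. The helper names `latest_ttl_lowering_time` and `worst_case_stale_window` are hypothetical, and the 2x factor is the one suggested in the paragraph above.

```python
from datetime import datetime, timedelta


def latest_ttl_lowering_time(cutover, current_ttl_s, safety_factor=2):
    """Latest moment to lower the TTL so that answers cached with the old,
    longer TTL have expired (with margin) before the cutover."""
    return cutover - timedelta(seconds=safety_factor * current_ttl_s)


def worst_case_stale_window(ttl_at_cutover_s):
    """Upper bound on how long a TTL-honoring resolver can keep returning
    the blue environment after the record is switched."""
    return timedelta(seconds=ttl_at_cutover_s)


cutover = datetime(2024, 1, 10, 12, 0)
# Current TTL is one hour: lower it no later than two hours before cutover.
print(latest_ttl_lowering_time(cutover, current_ttl_s=3600))
# With the TTL lowered to 60 seconds, stale answers last at most a minute.
print(worst_case_stale_window(60))
```

Some clients and intermediate resolvers cache longer than the TTL, so treat the stale window as a lower bound on how long blue must stay healthy after cutover.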
Blue-Green Costs and Trade-offs
| Benefit | Cost |
|---|---|
| Instant rollback (flip the selector back) | 2x infrastructure cost at all times |
| Zero traffic impact during testing | Database and stateful session migrations are complex |
| Clean separation of old and new environments | Warming up a fresh green environment takes time |
| Can run full integration tests against green before cutover | Long-lived connections (WebSockets) must be drained manually |
The database problem: Blue-Green is straightforward for stateless services. For services backed by a shared database, both environments share the same database during the transition, meaning the same backward-compatibility constraints as rolling deployments apply. The database schema must support both v1 (blue) and v2 (green) simultaneously. See Section 8.
5. Canary Deployment
Exposes the new version to a small, controlled subset of real production traffic (the canary) while the stable version serves the majority. The canary is monitored closely; if metrics remain healthy, traffic is gradually increased. If not, the canary is killed and traffic returns entirely to stable.
                  +---------------------+
All Traffic ----> |    Load Balancer    |
                  +---+-------------+---+
                      |             |
               95% ---+             +--- 5%
                      v             v
              +------------+   +-----------+
              | v1 Stable  |   | v2 Canary |
              | (4 pods)   |   | (1 pod)   |
              +------------+   +-----------+
                    |                |
             Healthy metrics   Monitored closely
Istio VirtualService Traffic Splitting
# DestinationRule defines the two subsets (stable and canary)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
---
# VirtualService controls the traffic split weights
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 95
        - destination:
            host: order-service
            subset: canary
          weight: 5
Argo Rollouts: Automated Progressive Delivery
Manually managing canary weight increases is error-prone. Argo Rollouts automates progressive traffic shifting and integrates with Prometheus to automatically roll back if error rate thresholds are breached.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      # Traffic steps: increase weight over time, pausing to validate at each step
      steps:
        - setWeight: 5        # Step 1: send 5% to canary
        - pause:
            duration: 5m      # Wait 5 minutes, evaluate metrics
        - setWeight: 20       # Step 2: increase to 20%
        - pause:
            duration: 10m
        - setWeight: 50       # Step 3: 50/50 split
        - pause:
            duration: 10m
        - setWeight: 100      # Step 4: full canary rollout complete
      # Automated rollback: if these Prometheus metrics breach thresholds, the rollout is aborted
      analysis:
        templates:
          - templateName: success-rate-check
        startingStep: 1
        args:
          - name: service-name
            value: order-service
---
# AnalysisTemplate defines the automated health check logic
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      # A measurement fails if the success rate over the trailing 2-minute window is below 99%
      successCondition: result[0] >= 0.99
      failureLimit: 3   # Allow up to 3 failed measurements before triggering rollback
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
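The interaction between successCondition and failureLimit can be illustrated with a small sketch. This is not Argo Rollouts code: `run_analysis` is a hypothetical function, and it assumes the analysis fails once the number of failed measurements exceeds failureLimit.

```python
def run_analysis(measurements, success_threshold=0.99, failure_limit=3):
    """Evaluate a series of success-rate measurements (one per interval).

    Returns "Failed" as soon as the number of failed measurements exceeds
    failure_limit, which is the point at which a rollback would be triggered.
    """
    failures = 0
    for value in measurements:
        if value < success_threshold:  # mirrors: result[0] >= 0.99
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


# Three bad minutes are tolerated with failure_limit=3 ...
print(run_analysis([0.995, 0.98, 0.97, 0.96, 0.999]))
# ... a fourth one aborts the rollout.
print(run_analysis([0.98, 0.97, 0.96, 0.95]))
```

A lower failureLimit reacts faster to a bad canary but is more sensitive to transient blips, so the value is a trade-off rather than a fixed rule.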
Canary Targeting: Header and User-Based Routing
Beyond pure percentage-based routing, production canary systems often route to the canary based on user identity or request attributes. This ensures that the same user always hits the same version, and that internal users or beta testers hit the canary first.
# Istio VirtualService: route internal employees to canary, everyone else to stable
http:
  - match:
      - headers:
          x-user-group:
            exact: "internal-beta"   # Route requests from internal users to canary
    route:
      - destination:
          host: order-service
          subset: canary
        weight: 100
  - route:                           # Default: all other traffic to stable
      - destination:
          host: order-service
          subset: stable
        weight: 100
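When routing by user identity rather than by header, a common way to keep each user on one version is deterministic hash bucketing. The sketch below illustrates the idea only; the function `in_canary` and the salt value are hypothetical and not part of Istio.

```python
import hashlib


def in_canary(user_id, percent, salt="order-service-canary"):
    """Deterministically assign a user to the canary.

    The user ID is hashed into one of 10,000 buckets; users whose bucket falls
    below the threshold see the canary. The same user always lands in the same
    bucket, and raising `percent` only ever adds users to the canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# The decision is stable across requests and across processes.
print(in_canary("user-42", 5), in_canary("user-42", 5))
```

Using a per-rollout salt avoids always exposing the same set of users to every canary.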
What to Monitor During a Canary
A canary is only as safe as the metrics you watch. Monitor these in parallel for stable vs. canary:
| Metric Category | Specific Signals | Rollback Trigger |
|---|---|---|
| Error Rate | HTTP 5xx rate, gRPC error rate | Canary 5xx rate exceeds stable by more than 0.5 percentage points |
| Latency | p50, p95, p99 response time | Canary p99 exceeds stable p99 by more than 20% |
| Saturation | CPU%, memory%, thread pool depth | Canary resource usage trending upward |
| Business Metrics | Order success rate, payment completion | Canary conversion rate significantly below stable |
| Downstream Impact | Error rates in services called by the canary | Cascading errors into dependencies |
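The error-rate and latency triggers from the table can be expressed as a simple comparison. This is a minimal sketch with hypothetical names (`VersionMetrics`, `rollback_reasons`); it reads the error-rate trigger as an absolute difference of 0.5 percentage points and the latency trigger as a 20% relative increase.

```python
from dataclasses import dataclass


@dataclass
class VersionMetrics:
    error_rate: float      # fraction of requests returning 5xx, e.g. 0.002 = 0.2%
    p99_latency_s: float   # p99 response time in seconds


def rollback_reasons(stable, canary, max_error_delta=0.005, max_p99_ratio=1.20):
    """Return the list of breached triggers; an empty list means the canary looks healthy."""
    reasons = []
    if canary.error_rate - stable.error_rate > max_error_delta:
        reasons.append("error rate")
    if stable.p99_latency_s > 0 and canary.p99_latency_s > stable.p99_latency_s * max_p99_ratio:
        reasons.append("p99 latency")
    return reasons


stable = VersionMetrics(error_rate=0.001, p99_latency_s=0.20)
print(rollback_reasons(stable, VersionMetrics(0.002, 0.21)))  # within both thresholds
print(rollback_reasons(stable, VersionMetrics(0.010, 0.30)))  # breaches both
```

Comparing the canary against stable over the same time window, rather than against a fixed number, keeps the check meaningful when overall traffic conditions change.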
6. Shadow Deployment (Dark Launch)
Mirrors a copy of all real production requests to the new version, running in parallel. The shadow version processes each request fully but its responses are discarded; users only receive responses from the stable version. The shadow must not write to shared production databases or call external APIs.
+-------------------------------------------------------+
|                Ingress / Service Mesh                 |
+---------------+------------------+--------------------+
                |                  | (mirrored copy)
           100% |                  | 100% (shadowed)
                v                  v
        +--------------+   +------------------+
        |  v1 Stable   |   |    v2 Shadow     |
        |              |   |                  |
        | Reads DB     |   | Reads DB (safe)  |
        | Writes DB    |   | Discards writes  |
        | Calls APIs   |   | Stubs ext APIs   |
        +------+-------+   +--------+---------+
               |                    |
        Response returned    Response discarded
        to user              (metrics captured)
Istio Traffic Mirroring Configuration
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 100
      mirror:
        host: order-service
        subset: shadow      # Traffic is mirrored here
      mirrorPercentage:
        value: 100.0        # Mirror 100% of requests (or set lower for cost savings)
What Shadow Deployment Validates
Shadow deployments are the highest-confidence validation tool available before a production release, but also among the most costly and complex. Use them specifically for:
| Use Case | What You're Validating |
|---|---|
| Algorithm replacement | Does the new recommendation engine produce different results? Are the differences acceptable? |
| ML model promotion | Does the new model perform better or worse on real production inputs than the current model? |
| Database migration | Does the new code work correctly against the migrated schema? |
| Rewrite of a critical service | Does the rewritten service produce identical output for all production request shapes? |
| Load testing | Can the new service sustain real production traffic volume without degrading? |
Shadow Deployment Constraints
Shadow versions must be carefully isolated from shared state:
Shadow Version MUST:               Shadow Version MUST NOT:
-------------------------------------------------------------------------
Read from production DB            Write to production DB
Generate metrics/logs              Send emails/SMS to real users
Call internal read-only APIs       Call external payment APIs
Record shadow response diffs       Enqueue messages to production queues
Violating these rules can cause real user harm: for example, a shadow version that writes to the production DB can corrupt or duplicate data, and a shadow version that calls Stripe can charge real customers twice.
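The "record shadow response diffs" rule can be sketched at the application level as follows. This is a simplified illustration with a hypothetical `handle_with_shadow` function: it calls the shadow inline, whereas a service mesh mirrors requests asynchronously, and it assumes the shadow handler has no side effects.

```python
def handle_with_shadow(request, stable_handler, shadow_handler, diffs):
    """Serve the request from stable; run the shadow on the same input and
    record any difference. The user only ever receives the stable response."""
    stable_response = stable_handler(request)
    try:
        shadow_response = shadow_handler(request)
        if shadow_response != stable_response:
            diffs.append({"request": request,
                          "stable": stable_response,
                          "shadow": shadow_response})
    except Exception as exc:
        # A shadow failure must never affect the user-facing response.
        diffs.append({"request": request,
                      "stable": stable_response,
                      "shadow_error": repr(exc)})
    return stable_response


# Example: the rewritten pricing function disagrees for one input.
def price_v1(qty):
    return qty * 10


def price_v2(qty):
    return qty * 10 if qty < 100 else qty * 9


recorded = []
print([handle_with_shadow(q, price_v1, price_v2, recorded) for q in (1, 50, 100)])
print(recorded)
```

Reviewing the recorded diffs is what turns mirrored traffic into a validation signal; without that comparison, a shadow deployment only tests load.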
7. Feature Flags
Feature flags (also called feature toggles or feature gates) decouple code deployment from feature release. Code for a new feature is deployed to production but dormant; the feature is activated by flipping a flag in a configuration store, without any new deployment.
Deploy:   New code ships to 100% of instances (flag defaults to OFF)
    |
Activate: Flag flipped to ON for 5% of users (internal beta)
    |
Expand:   Flag expanded to 25%, then 50%, then 100%
    |
Cleanup:  Flag removed from code; feature is always-on
Flag Types
| Type | Description | Example |
|---|---|---|
| Release toggle | Hide a partially-built feature until it's complete | new-checkout-flow: false |
| Ops toggle / kill switch | Disable a feature under load without deployment | enable-recommendation-service: false |
| Experiment toggle | Route a subset of users for A/B testing | new-pricing-algorithm: 10% |
| Permission toggle | Enable a feature for a specific user group | beta-dashboard: internal-users |
Implementation with a Flag SDK
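// Using the LaunchDarkly Java server SDK 5.x (LDUser API; newer SDK versions use LDContext instead).
// An OpenFeature-compatible client follows the same evaluate-with-default pattern.
import com.launchdarkly.sdk.LDUser;
import com.launchdarkly.sdk.server.LDClient;
import org.springframework.stereotype.Service;

@Service
public class CheckoutService {
    private final LDClient ldClient;
    // CheckoutFlow is an application-defined interface (assumed here) with two implementations
    private final CheckoutFlow newCheckoutFlow;
    private final CheckoutFlow legacyCheckoutFlow;

    public CheckoutService(LDClient ldClient, CheckoutFlow newCheckoutFlow, CheckoutFlow legacyCheckoutFlow) {
        this.ldClient = ldClient;
        this.newCheckoutFlow = newCheckoutFlow;
        this.legacyCheckoutFlow = legacyCheckoutFlow;
    }

    public CheckoutResponse processCheckout(User user, Cart cart) {
        // Build the evaluation context; targeting can be percentage-based, user-attribute-based, etc.
        LDUser ldUser = new LDUser.Builder(user.getId())
                .country(user.getCountry())
                .custom("plan", user.getPlan())
                .build();
        // The last argument is the default value used if the flag service is unreachable
        boolean useNewCheckoutFlow = ldClient.boolVariation("new-checkout-flow", ldUser, false);
        if (useNewCheckoutFlow) {
            return newCheckoutFlow.process(cart);
        } else {
            return legacyCheckoutFlow.process(cart);
        }
    }
}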
Kill Switches: The Most Important Flag Type
A kill switch is an ops toggle that immediately disables a feature that is causing production issues, without a deployment. This is the fastest possible mitigation for a production incident caused by a specific feature.
Production incident at 02:00:
  New recommendation engine consuming 4x expected CPU -> latency spiking
02:01: Engineer flips kill switch "enable-recommendation-service" -> false
02:01: All instances read flag; recommendation service calls bypassed
02:01: Latency returns to normal. No deployment required. No rollback required.
08:00: Team investigates, fixes the issue, re-enables the flag in staging
This is why kill switches should be provisioned for every major new feature before it goes to production: not as an afterthought, but as a required part of the feature's design.
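A kill switch reduces to a guarded call with a safe default. The sketch below uses a hypothetical in-memory `FlagStore` as a stand-in for a real flag service; the flag key matches the timeline above, and everything else is an assumption for illustration.

```python
class FlagStore:
    """In-memory stand-in for a flag service (hypothetical; real SDKs
    typically stream updates and cache the last known values)."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})
        self.reachable = True

    def set(self, key, value):
        self._flags[key] = value

    def get(self, key):
        if not self.reachable:
            raise ConnectionError("flag service unreachable")
        return self._flags[key]


def is_enabled(store, key, default=False):
    """Evaluate a boolean flag, falling back to `default` if it cannot be read."""
    try:
        return bool(store.get(key))
    except (ConnectionError, KeyError):
        return default


def get_recommendations(store, user_id, engine):
    # Kill switch: when the flag is off (or unreadable), skip the expensive call
    # and return a degraded but fast response.
    if not is_enabled(store, "enable-recommendation-service", default=False):
        return []
    return engine(user_id)


store = FlagStore({"enable-recommendation-service": True})
print(get_recommendations(store, "u1", lambda uid: ["item-a", "item-b"]))
store.set("enable-recommendation-service", False)   # the 02:01 flip
print(get_recommendations(store, "u1", lambda uid: ["item-a", "item-b"]))
```

The default deserves thought per flag: for a kill switch guarding an optional feature, defaulting to off when the flag service is unreachable is the conservative choice.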
Feature Flags vs. Canary Deployment
| Dimension | Feature Flag | Canary Deployment |
|---|---|---|
| Requires new deployment to activate | No | Yes |
| Requires new deployment to roll back | No (flip the flag) | Yes |
| Targets specific users/groups | Fine-grained (user attributes) | Limited (header-based) |
| Requires code change to remove | Yes (flag debt if not cleaned up) | No |
| Infrastructure cost | 1x | ~1.1x |
| Best for | Long-lived feature development; kill switches | Infrastructure changes; version migrations |
8. Database Migrations: The Backward Compatibility Trap
No zero-downtime deployment strategy survives a breaking database schema change. If v2 renames a column and v1 is still running (rolling update) or available for rollback (blue-green), v1 will crash the moment it queries the renamed column.
This is the most commonly underestimated constraint in production deployments.
The Expand and Contract Pattern
Expand-Contract (also called Parallel Change) is the standard pattern for backward-compatible database migrations:
Phase 1: EXPAND (add, never remove)
-----------------------------------
Goal: make the schema support BOTH old and new versions simultaneously.
  ALTER TABLE orders ADD COLUMN new_status VARCHAR(50);
Application code (v2):
  - Writes to BOTH old_status and new_status
  - Reads from old_status (source of truth still = old column)
  - v1 instances still work: they ignore new_status entirely
  - v2 instances are safe: they write both columns
Phase 2: MIGRATE (backfill data)
--------------------------------
Goal: populate new_status for all existing rows.
  UPDATE orders SET new_status = old_status WHERE new_status IS NULL;
  -- Run as a background migration; batch updates to avoid table locks
  -- Verify: SELECT COUNT(*) FROM orders WHERE new_status IS NULL;
Phase 3: CONTRACT (remove the old)
----------------------------------
Goal: switch reads to new column, then drop old column.
  Deploy v3: reads and writes to new_status only; no longer references old_status
  -- Note: while v2 and v3 overlap, v2 still reads old_status and will not see v3's writes.
  -- If that matters, ship an intermediate release that reads new_status but still writes both.
  Confirm all v2 instances are gone (no mixed-version traffic)
  ALTER TABLE orders DROP COLUMN old_status;
  -- Safe: no live code references this column anymore
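The Expand and Migrate phases can be sketched end to end with Python's built-in sqlite3 module. This is an illustration under stated assumptions: SQLite stands in for the production database, the function names are hypothetical, and the Contract step is omitted because it is a plain DROP COLUMN once no code references the old column.

```python
import sqlite3


def expand(conn):
    """Phase 1: add the new column; existing code is unaffected."""
    conn.execute("ALTER TABLE orders ADD COLUMN new_status VARCHAR(50)")
    conn.commit()


def write_status_v2(conn, order_id, status):
    """v2 behavior: dual-write to both columns."""
    conn.execute(
        "UPDATE orders SET old_status = ?, new_status = ? WHERE id = ?",
        (status, status, order_id),
    )
    conn.commit()


def backfill(conn, batch_size=1000):
    """Phase 2: copy old_status into new_status in small batches.

    Committing after each batch keeps transactions short. Rows whose
    old_status is NULL are skipped so the loop always terminates.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE orders SET new_status = old_status "
            "WHERE id IN (SELECT id FROM orders "
            "             WHERE new_status IS NULL AND old_status IS NOT NULL "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def remaining(conn):
    """Verification query: rows that still need backfilling."""
    return conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE new_status IS NULL AND old_status IS NOT NULL"
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, old_status TEXT)")
conn.executemany("INSERT INTO orders (old_status) VALUES (?)",
                 [("PAID",), ("SHIPPED",), ("PAID",), ("NEW",), ("NEW",)])
conn.commit()

expand(conn)
write_status_v2(conn, 1, "REFUNDED")
print(backfill(conn, batch_size=2), remaining(conn))
```

On a production database the batch size and any pause between batches would be tuned to the table size and replication lag.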
Common Database Anti-Patterns That Break Zero-Downtime Deployments
| Anti-Pattern | Why It Breaks | Safe Alternative |
|---|---|---|
| ALTER TABLE ... DROP COLUMN in the same deploy as code that stops using it | v1 instances crash immediately | Drop only after all v1 instances are gone (Phase 3) |
| ALTER TABLE ... RENAME COLUMN | v1 still queries the old name, which no longer exists | Add new column, migrate data, drop old (Expand-Contract) |
| Adding a NOT NULL column without a default | Existing rows fail constraint; v1 inserts without the column fail | Add with a default value first; make it NOT NULL only after migration |
| Changing a column's data type in-place | v1 code may send incompatible types | Add new column with new type, migrate, drop old |
| Removing an index that v1 queries depend on | Query plan degrades or fails | Only remove after confirming no running version uses that index |
Migration Tooling
Production database migrations should be:
- Version-controlled (Flyway, Liquibase, Alembic): every schema change tracked as a numbered migration file.
- Idempotent: safe to run twice without error.
- Reversible: every upgrade migration should have a corresponding downgrade migration, especially during the Expand phase.
- Non-blocking: use CREATE INDEX CONCURRENTLY (PostgreSQL) or ALGORITHM=INPLACE (MySQL) to avoid long table locks during index creation on large tables.
-- PostgreSQL: concurrent index creation (does not block writes; cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_orders_new_status ON orders(new_status);
-- MySQL: online DDL that minimizes locking
ALTER TABLE orders ADD COLUMN new_status VARCHAR(50), ALGORITHM=INPLACE, LOCK=NONE;
9. Observability During Deployments
A deployment without proper monitoring is flying blind. The following observability setup is required for any production deployment strategy to be operated safely.
Golden Signals: What to Watch Per Version
During any deployment with mixed traffic (rolling, canary), instrument your monitoring to segment metrics by version label:
# Prometheus metric labels: always include version
http_requests_total{service="order-service", version="v2", status="500"} 42
http_request_duration_seconds_bucket{service="order-service", version="v2", le="0.5"} 980
# Error rate per version: compare canary (v2) vs stable (v1)
sum(rate(http_requests_total{service="order-service", version="v2", status=~"5.."}[2m]))
/
sum(rate(http_requests_total{service="order-service", version="v2"}[2m]))
# p99 latency per version
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-service"}[2m])) by (le, version)
)
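To show what a bucket-based p99 actually computes, here is a minimal sketch of quantile estimation from cumulative histogram buckets, using linear interpolation within the selected bucket. It approximates the behavior of `histogram_quantile` for classic histograms and is not Prometheus source code; the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last bound equal to math.inf (the +Inf bucket).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests: 50 under 100 ms, 90 under 500 ms, all under 1 s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries; buckets should be dense around the latency thresholds used as rollback triggers.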
Automated Rollback Triggers
Deployments should roll back automatically when health metrics breach thresholds, without requiring a human to be watching at 3 AM.
| Tool | Mechanism | Integration |
|---|---|---|
| Argo Rollouts | AnalysisTemplate queries Prometheus on a schedule; fails rollout if thresholds breached | Kubernetes-native; integrates with Istio, Nginx |
| Flagger | Continuous metric checks during canary promotion; auto-promotes or rolls back | Supports several service meshes and ingress controllers (e.g., Istio, Linkerd, NGINX) |
| Spinnaker | Deployment pipeline stages with integrated metric gates | Multi-cloud; integrates with Datadog, New Relic, Prometheus |
| AWS CodeDeploy | CloudWatch Alarms can trigger automatic deployment rollback | AWS-native; ECS, EC2, Lambda |
Pre-Deployment Checklist
Before any production deployment, verify:
- Readiness and liveness probes configured and tested
- terminationGracePeriodSeconds set to exceed longest expected in-flight request duration
- Database migrations reviewed for backward compatibility with the previous version
- Feature flag kill switch provisioned for major new features
- Prometheus dashboards segmented by version label
- Automated rollback trigger thresholds configured (error rate, latency)
- Rollback procedure documented and rehearsed (not just written)
- Stakeholder notification plan ready (support, on-call, external status page)
10. Common Failure Modes and Anti-Patterns
1. Missing Readiness Probes -> User-Visible Errors During Rolling Updates
Without a readiness probe, Kubernetes treats a pod as ready as soon as its containers have started, not when the application can actually serve traffic. Newly started pods then receive requests while still initializing (loading caches, establishing DB connections, warming JIT), causing errors.
Fix: always configure a readiness probe that returns 200 only when the pod
is fully initialized and capable of handling requests.
2. Stateful Connections Not Drained -> Broken WebSockets and Uploads
When a rolling update terminates a v1 pod, any long-lived connections (WebSockets, SSE, streaming uploads) on that pod are abruptly closed.
# Fix: configure a preStop hook to drain connections gracefully before shutdown
lifecycle:                           # container-level field
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]   # Allow load balancer to deregister pod
                                               # and existing connections to close naturally
terminationGracePeriodSeconds: 75    # pod-level field; must be > preStop sleep + longest request duration
3. Missing Automated Rollback -> Manual Intervention at 2 AM
Relying on a human to notice a deployment is degrading and manually trigger a rollback means:
- Mean time to recovery (MTTR) is as long as it takes someone to notice and act.
- Errors at night or weekends go undetected longer.
- Inconsistent execution of rollback procedures under stress.
Fix: use Argo Rollouts (AnalysisTemplates) or Flagger (metric checks) driven by Prometheus.
If p99 latency exceeds threshold for 3 consecutive minutes, rollback automatically.
4. Session Affinity Masking Canary Failure Rates
If the load balancer uses sticky sessions (cookie-based or IP-based affinity), some users will be permanently pinned to the canary pod. This corrupts percentage-based error rate analysis: if the canary is broken for specific user flows, only those pinned users are affected, and aggregate metrics may look healthy.
Fix: disable sticky sessions during canary testing, or use user-ID-based
consistent routing (via Istio header matching) instead of infrastructure-level
session affinity, so you retain control over which users hit the canary.
5. Treating All Deployments as Identical
Not every deployment carries the same risk. A CSS change and a payment processing engine change should not use the same deployment strategy.
Risk Classification:
LOW      - CSS/static assets, config-only changes, copy updates
           Strategy: Rolling with a high maxSurge
MEDIUM   - New API endpoints, non-breaking logic changes
           Strategy: Canary at 5% -> 25% -> 100% with automated rollback
HIGH     - Payment flows, auth systems, DB schema changes
           Strategy: Blue-Green with manual smoke tests before cutover
CRITICAL - Core ledger changes, regulatory-mandated changes
           Strategy: Blue-Green + shadow validation + extended canary
11. Strategy Selection Guide
Is the service stateless?
+-- NO (stateful: WebSockets, active uploads, session-pinned)
|     +--> Rolling with long terminationGracePeriodSeconds + preStop drain hook
|
+-- YES
      |
      Can v1 and v2 run simultaneously? (API compat, DB compat)
      +-- NO
      |     +--> Blue-Green (complete switch) or Recreate (if downtime is acceptable)
      |
      +-- YES
            |
            Is this a high-risk change? (payments, auth, major new feature)
            +-- YES
            |     +-- Need to validate under real load before full release?
            |     |     +--> Shadow (validate), then Canary (progressive rollout)
            |     +-- Need instant guaranteed rollback?
            |           +--> Blue-Green
            |
            +-- NO (standard low-risk release)
                  +-- Need user-targeted rollout? (internal beta, A/B test)
                  |     +--> Canary with header-based routing or Feature Flags
                  +-- Standard release
                        +--> Rolling (simple, cost-effective)