Phase 7 — Maintenance
Overview
The Maintenance phase begins immediately after deployment and continues for the lifetime of the system. It encompasses monitoring, incident response, performance optimisation, security patching, and iterative enhancements.
In a modern continuous delivery model, maintenance and development phases overlap — teams continuously ship improvements while the current version is in production.
Goals
- Maintain system health, availability, and performance
- Respond to and resolve incidents quickly
- Apply security and dependency patches proactively
- Gather feedback for the next development iteration
Maintenance Activities
1. Monitoring and Alerting
Continuously observe key system signals:
# Prometheus alerting rules example
groups:
- name: sdlc-service-alerts
rules:
- alert: HighErrorRate
expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.01
for: 2m
labels:
severity: critical
annotations:
summary: "High HTTP error rate on {{ $labels.service }}"
- alert: HighLatency
expr: histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "p99 latency > 500ms on {{ $labels.service }}"
2. Incident Management
| Severity | Response Time | Escalation |
|---|---|---|
| P1 — Critical | 15 minutes | Immediate — all hands |
| P2 — High | 1 hour | On-call engineer + Tech Lead |
| P3 — Medium | Next business day | Regular sprint planning |
| P4 — Low | Next sprint | Backlog |
Incident lifecycle:
- Alert fires → On-call engineer notified
- Acknowledge and assess severity
- Create incident channel (Slack/PagerDuty)
- Mitigate (roll backward if necessary)
- Resolve root cause
- Post-incident review (PIR) within 48 hours
- Action items tracked in Jira
3. Post-Incident Review (PIR)
PIR is blameless and focuses on systemic improvements:
## PIR: Payment service timeout — 2024-03-15
**Duration:** 22:14 – 22:47 UTC (33 minutes)
**Impact:** 12% of payment requests timed out
### Timeline
- 22:14 Alert: p99 latency > 2s on payment-service
- 22:18 On-call acknowledged, began investigation
- 22:31 Root cause identified: DB connection pool exhausted
- 22:47 Fix deployed (connection pool size increased)
### Root Cause
HikariCP max-pool-size was set to 10 for a service handling
500 RPS. A spike in slow queries exhausted available connections.
### Action Items
- [ ] Increase pool size to 50 and add connection pool monitoring
- [ ] Add database slow query alert (> 100ms threshold)
- [ ] Add load test to CI for this service
4. Dependency Management
Proactively manage library and platform versions:
# Maven dependency check
mvn versions:display-dependency-updates
# OWASP security scan
mvn dependency-check:check
# Renovate / Dependabot — automated PR creation for updates
5. Technical Debt Management
Track and schedule reduction of technical debt:
- Dedicate 10–20% of each sprint to tech debt
- Tag Jira items with
tech-debtlabel - Refactor incrementally — avoid big-bang rewrites
- Use SonarQube tech debt dashboard for visibility
Operational Runbooks
Maintain runbooks for common operational tasks:
- How to restart a service safely (drain, restart, verify)
- How to run a database migration in production
- How to increase JVM heap size without restart (JVM flags)
- How to flush a Redis cache
- How to replay Kafka messages from a topic offset
Exit Criteria
Maintenance is ongoing — it formally "ends" when the system is decommissioned:
- System reaches end-of-life
- Data migration completed
- Service deregistered from service registry and DNS
- Infrastructure destroyed (cloud resources terminated)
- Documentation archived
Enable Spring Boot Actuator for runtime visibility:
/actuator/health— liveness and readiness/actuator/metrics— all Micrometer metrics/actuator/loggers— change log level at runtime without restart/actuator/env— inspect active configuration/actuator/threaddump— diagnose thread starvation