Phase 7: Maintenance
Overview
The Maintenance phase begins immediately after deployment and continues for the lifetime of the system. It encompasses monitoring, incident response, performance optimisation, security patching, and iterative enhancements.
In a modern continuous delivery model, the maintenance and development phases overlap: teams continuously ship improvements while the current version is in production.
Goals
- Maintain system health, availability, and performance
- Respond to and resolve incidents quickly
- Apply security and dependency patches proactively
- Gather feedback for the next development iteration
Maintenance Activities
1. Monitoring and Alerting
Continuously observe key system signals:
# Prometheus alerting rules example
groups:
  - name: sdlc-service-alerts
    rules:
      - alert: HighErrorRate
        # 5xx responses per second over the last 5 minutes (an absolute rate, not a share of traffic)
        expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate on {{ $labels.service }}"
      - alert: HighLatency
        # p99 latency in seconds, estimated from the histogram buckets
        expr: histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 500ms on {{ $labels.service }}"
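The `for:` field in rules like these means an alert fires only after its expression has stayed true for that long, so a single bad sample does not page anyone. As a rough illustration of that behaviour (a simplified model, not Prometheus's actual implementation), the Java sketch below remembers when a threshold was first breached and reports "firing" only once the hold duration has elapsed. The class name `PendingAlert` and its methods are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Simplified model of a Prometheus-style "for:" clause; all names are hypothetical. */
class PendingAlert {
    private final double threshold;
    private final Duration holdFor;
    private Instant breachedSince; // null while the condition is false

    PendingAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Feed one evaluation; returns true once the breach has lasted at least holdFor. */
    boolean evaluate(double value, Instant now) {
        if (value <= threshold) {
            breachedSince = null; // condition cleared: reset the pending timer
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now; // first breaching sample: the alert becomes pending
        }
        return Duration.between(breachedSince, now).compareTo(holdFor) >= 0;
    }

    public static void main(String[] args) {
        // Mirrors HighErrorRate: threshold 0.01, for: 2m
        PendingAlert highErrorRate = new PendingAlert(0.01, Duration.ofMinutes(2));
        Instant t0 = Instant.parse("2024-03-15T22:14:00Z");
        System.out.println(highErrorRate.evaluate(0.05, t0));                  // false (pending)
        System.out.println(highErrorRate.evaluate(0.05, t0.plusSeconds(60)));  // false (still pending)
        System.out.println(highErrorRate.evaluate(0.05, t0.plusSeconds(120))); // true (firing)
    }
}
```

A longer `for:` value trades detection speed for fewer false pages, which may be why the warning-level latency rule waits 5 minutes while the critical error rule waits only 2.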
2. Incident Management
| Severity | Response Time | Escalation |
|---|---|---|
| P1 - Critical | 15 minutes | Immediate, all hands |
| P2 - High | 1 hour | On-call engineer + Tech Lead |
| P3 - Medium | Next business day | Regular sprint planning |
| P4 - Low | Next sprint | Backlog |
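Keeping the severity matrix as data can help alert routing and dashboards stay consistent with the table. The sketch below is one possible encoding in Java; the type and method names are hypothetical, and `pagesOnCall()` rests on the assumption that only severities with a clock-based response time (P1 and P2) page someone.

```java
import java.time.Duration;
import java.util.Optional;

class SeverityDemo {
    public static void main(String[] args) {
        for (Severity s : Severity.values()) {
            System.out.println(s + " -> " + s.escalation() + ", pages on-call: " + s.pagesOnCall());
        }
    }
}

/** The severity matrix as data; names are hypothetical. */
enum Severity {
    P1("Critical", Duration.ofMinutes(15), "Immediate, all hands"),
    P2("High", Duration.ofHours(1), "On-call engineer + Tech Lead"),
    P3("Medium", null, "Regular sprint planning"), // next business day
    P4("Low", null, "Backlog");                    // next sprint

    private final String label;
    private final Duration responseTime; // null when the target is calendar-based
    private final String escalation;

    Severity(String label, Duration responseTime, String escalation) {
        this.label = label;
        this.responseTime = responseTime;
        this.escalation = escalation;
    }

    String label() { return label; }

    /** Clock-based response target, if the table defines one. */
    Optional<Duration> responseTime() { return Optional.ofNullable(responseTime); }

    String escalation() { return escalation; }

    /** Assumption: only severities with a clock-based target page the on-call engineer. */
    boolean pagesOnCall() { return responseTime != null; }
}
```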
Incident lifecycle:
- Alert fires → on-call engineer notified
- Acknowledge and assess severity
- Create incident channel (Slack/PagerDuty)
- Mitigate (roll back if necessary)
- Resolve root cause
- Post-incident review (PIR) within 48 hours
- Action items tracked in Jira
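The lifecycle above is a fixed sequence with one deadline attached, so it can be modelled as an ordered set of stages plus a due-date calculation. The following Java sketch is illustrative only: the stage names are hypothetical, and it assumes the 48-hour PIR window starts when the incident is resolved, which the list does not state explicitly.

```java
import java.time.Duration;
import java.time.Instant;

class IncidentLifecycle {
    /** Stages in the order listed above; the names are hypothetical. */
    enum Stage {
        ALERTED, ACKNOWLEDGED, CHANNEL_CREATED, MITIGATED, RESOLVED, REVIEWED, ACTIONS_TRACKED;

        /** Returns the stage that follows this one. */
        Stage next() {
            if (this == ACTIONS_TRACKED) {
                throw new IllegalStateException("lifecycle already complete");
            }
            return values()[ordinal() + 1];
        }
    }

    static final Duration PIR_WINDOW = Duration.ofHours(48);

    /** Assumption: the 48-hour window is measured from the time of resolution. */
    static Instant pirDueBy(Instant resolvedAt) {
        return resolvedAt.plus(PIR_WINDOW);
    }

    public static void main(String[] args) {
        Instant resolvedAt = Instant.parse("2024-03-15T22:47:00Z");
        System.out.println(Stage.MITIGATED.next()); // RESOLVED
        System.out.println(pirDueBy(resolvedAt));   // 2024-03-17T22:47:00Z
    }
}
```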
3. Post-Incident Review (PIR)
A PIR is blameless and focuses on systemic improvements:
## PIR: Payment service timeout, 2024-03-15
**Duration:** 22:14 to 22:47 UTC (33 minutes)
**Impact:** 12% of payment requests timed out
### Timeline
- 22:14 Alert: p99 latency > 2s on payment-service
- 22:18 On-call acknowledged, began investigation
- 22:31 Root cause identified: DB connection pool exhausted
- 22:47 Fix deployed (connection pool size increased)
### Root Cause
HikariCP maximum-pool-size was set to 10 for a service handling
500 RPS. A spike in slow queries exhausted available connections.
### Action Items
- [ ] Increase pool size to 50 and add connection pool monitoring
- [ ] Add database slow query alert (> 100ms threshold)
- [ ] Add load test to CI for this service
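The root cause in this example can be checked with Little's law: the average number of connections in use equals the request rate multiplied by the average time each request holds a connection. The Java sketch below works through the numbers from the PIR under simplifying assumptions (steady load, one connection per request, no queueing); the class and method names are hypothetical.

```java
class PoolSizing {
    /** Little's law: connections in use = arrival rate x average hold time (rounded up). */
    static long requiredConnections(long requestsPerSecond, long avgHoldMillis) {
        return (requestsPerSecond * avgHoldMillis + 999) / 1000; // integer ceiling division
    }

    /** Longest average hold time (in ms) a pool can sustain at the given load. */
    static long maxSustainableHoldMillis(long poolSize, long requestsPerSecond) {
        return poolSize * 1000 / requestsPerSecond;
    }

    public static void main(String[] args) {
        // Pool of 10 at 500 RPS: connections must be released within 20 ms on average.
        System.out.println(maxSustainableHoldMillis(10, 500)); // 20
        // Pool of 50 at 500 RPS: up to 100 ms on average.
        System.out.println(maxSustainableHoldMillis(50, 500)); // 100
        // If slow queries push the average hold time to 60 ms, 30 connections are needed.
        System.out.println(requiredConnections(500, 60));      // 30
    }
}
```

Under these assumptions a pool of 10 tolerates only about 20 ms per query at 500 RPS, while a pool of 50 tolerates about 100 ms, which lines up with the 100 ms slow-query alert threshold in the action items.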
4. Dependency Management
Proactively manage library and platform versions:
# Maven dependency check
mvn versions:display-dependency-updates
# OWASP security scan
mvn dependency-check:check
# Renovate / Dependabot: automated PR creation for updates
5. Technical Debt Management
Track and schedule reduction of technical debt:
- Dedicate 10-20% of each sprint to tech debt
- Tag Jira items with the tech-debt label
- Refactor incrementally; avoid big-bang rewrites
- Use SonarQube tech debt dashboard for visibility
Operational Runbooks
Maintain runbooks for common operational tasks:
- How to restart a service safely (drain, restart, verify)
- How to run a database migration in production
- How to increase JVM heap size (JVM flags such as -Xmx; the new value takes effect on restart)
- How to flush a Redis cache
- How to replay Kafka messages from a topic offset
Exit Criteria
Maintenance is ongoing; it formally "ends" only when the system is decommissioned:
- System reaches end-of-life
- Data migration completed
- Service deregistered from service registry and DNS
- Infrastructure destroyed (cloud resources terminated)
- Documentation archived
Enable Spring Boot Actuator for runtime visibility:
- /actuator/health: liveness and readiness
- /actuator/metrics: all Micrometer metrics
- /actuator/loggers: change log level at runtime without restart
- /actuator/env: inspect active configuration
- /actuator/threaddump: diagnose thread starvation
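As an example of using these endpoints during an incident, the sketch below raises a logger to DEBUG through /actuator/loggers using the JDK's built-in HTTP client (Java 11+). It assumes the loggers endpoint has been exposed over HTTP (for instance via management.endpoints.web.exposure.include) and that the caller is authorised; the base URL, the logger name com.example.payments, and the class name are placeholders rather than values from this system.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class LogLevelChanger {
    /** Builds the POST that asks Actuator to set a logger's configured level. */
    static HttpRequest buildRequest(String baseUrl, String loggerName, String level) {
        String body = "{\"configuredLevel\":\"" + level + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Pass the service base URL as the first argument to actually send the request.
        boolean dryRun = args.length == 0;
        String baseUrl = dryRun ? "http://localhost:8080" : args[0]; // placeholder default
        HttpRequest request = buildRequest(baseUrl, "com.example.payments", "DEBUG");

        if (dryRun) {
            System.out.println("Dry run: " + request.method() + " " + request.uri());
            return;
        }
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("HTTP " + response.statusCode()); // a 2xx status means the change was accepted
    }
}
```

Because the change is applied in memory only, the level reverts to the configured value on the next restart, so remember to lower it again once the investigation is over if the service stays up.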