Incident Response & Security Operations
The question is not whether you'll have a security incident; it's whether you'll be prepared when it happens.
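Incident Response Lifecycle (NIST)
NIST SP 800-61 Rev. 2 defines four phases (Preparation; Detection and Analysis; Containment, Eradication, and Recovery; Post-Incident Activity). The six steps listed here split the combined phases out so each can be discussed on its own.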
1. PREPARATION: Build IR capability before incidents happen
2. DETECTION: Identify that an incident has occurred
3. CONTAINMENT: Stop the damage from spreading
4. ERADICATION: Remove the threat from the environment
5. RECOVERY: Restore systems to normal operation
6. POST-INCIDENT: Learn and improve
Phase 1: Preparation
Severity Levels
| Severity | Definition | Response SLA | Escalation |
|---|---|---|---|
| P1 Critical | Production breach, complete outage | 15 min | CTO, Legal, DPO |
| P2 High | Significant data exposure, major degradation | 1 hour | Engineering Lead |
| P3 Medium | Limited exposure, degraded service | 4 hours | On-call Engineer |
| P4 Low | Minor concern, no exposure | 24 hours | Next business day |
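The severity table can be encoded so that tooling computes response deadlines instead of people looking them up mid-incident. The following is a minimal sketch, not a prescribed implementation: the names `SeveritySla`, `Severity`, `responseDeadline`, and `isSlaBreached` are hypothetical, and the timestamps in `main` are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;

public class SeveritySla {

    /** Severity levels and response SLAs, copied from the table above. */
    enum Severity {
        P1_CRITICAL(Duration.ofMinutes(15), "CTO, Legal, DPO"),
        P2_HIGH(Duration.ofHours(1), "Engineering Lead"),
        P3_MEDIUM(Duration.ofHours(4), "On-call Engineer"),
        P4_LOW(Duration.ofHours(24), "Next business day");

        private final Duration responseSla;
        private final String escalation;

        Severity(Duration responseSla, String escalation) {
            this.responseSla = responseSla;
            this.escalation = escalation;
        }

        Duration responseSla() { return responseSla; }

        String escalation() { return escalation; }

        /** Latest instant by which a responder must have acknowledged. */
        Instant responseDeadline(Instant detectedAt) {
            return detectedAt.plus(responseSla);
        }

        /** True only if the acknowledgement came strictly after the deadline. */
        boolean isSlaBreached(Instant detectedAt, Instant acknowledgedAt) {
            return acknowledgedAt.isAfter(responseDeadline(detectedAt));
        }
    }

    public static void main(String[] args) {
        // Illustrative timestamps: alert raised at 14:00, acknowledged at 14:15
        Instant detected = Instant.parse("2024-01-15T14:00:00Z");
        Instant acknowledged = Instant.parse("2024-01-15T14:15:00Z");

        Severity sev = Severity.P1_CRITICAL;
        System.out.println("Escalate to: " + sev.escalation());
        System.out.println("Deadline:    " + sev.responseDeadline(detected));
        System.out.println("Breached:    " + sev.isSlaBreached(detected, acknowledged));
    }
}
```

An acknowledgement that lands exactly on the deadline is treated as within SLA here; whether the boundary counts as a breach is a policy choice to settle before an incident, not during one.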
Phase 2: Detection
Application-Level Detection
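import java.time.Instant;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
// Json and MetricsService are assumed to be the application's own helpers (not shown).
@Service
public class SecurityEventPublisher {
    private static final Logger log = LoggerFactory.getLogger(SecurityEventPublisher.class);
    private final MetricsService metricsService;

    public SecurityEventPublisher(MetricsService metricsService) {
        this.metricsService = metricsService;
    }

    public void publishLoginFailure(String username, String ip, String reason) {
        // One structured JSON line per event so the SIEM can index fields directly.
        // Map.of rejects null values, so callers must pass non-null arguments.
        log.warn("{}", Json.encode(Map.of(
            "eventType", "AUTH_FAILURE",
            "severity", "MEDIUM",
            "username", username,
            "sourceIp", ip,
            "reason", reason,
            "timestamp", Instant.now().toString()
        )));
        // Per-IP tags are high-cardinality; confirm the metrics backend can handle them.
        metricsService.increment("security.login.failure", "ip", ip);
    }
}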
// Detection rules in SIEM:
// RULE: 10+ login failures followed by a success from the same IP -> credential stuffing with a likely account compromise
// RULE: data_export.records > 10000 outside business hours -> anomalous bulk access
// RULE: new admin user created -> possible privilege escalation
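The first rule above (a burst of failures followed by a success from the same IP) is normally expressed in the SIEM's own query language. To make the logic concrete, here is a minimal in-memory sketch in Java. The class name `FailureThenSuccessRule`, the 5-minute window, and the documentation-range IP addresses are assumptions for illustration; a real rule would also need eviction for IPs that never succeed, and thread safety.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Flags a successful login that follows a burst of failures from the same IP. */
public class FailureThenSuccessRule {
    private final int threshold;
    private final Duration window;
    // Failure timestamps per source IP (unbounded in this sketch; not thread-safe)
    private final Map<String, Deque<Instant>> failuresByIp = new HashMap<>();

    public FailureThenSuccessRule(int threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    public void onLoginFailure(String ip, Instant at) {
        failuresByIp.computeIfAbsent(ip, k -> new ArrayDeque<>()).addLast(at);
    }

    /** Returns true if this success should raise an alert. Clears the IP's history. */
    public boolean onLoginSuccess(String ip, Instant at) {
        Deque<Instant> failures = failuresByIp.remove(ip);
        if (failures == null) {
            return false;
        }
        // Only failures inside the look-back window count toward the threshold
        Instant cutoff = at.minus(window);
        long recent = failures.stream().filter(t -> !t.isBefore(cutoff)).count();
        return recent >= threshold;
    }

    public static void main(String[] args) {
        FailureThenSuccessRule rule = new FailureThenSuccessRule(10, Duration.ofMinutes(5));
        Instant start = Instant.parse("2024-01-15T02:00:00Z");

        for (int i = 0; i < 10; i++) {
            rule.onLoginFailure("203.0.113.7", start.plusSeconds(i));
        }
        // Ten failures, then a success from the same IP: alert
        System.out.println(rule.onLoginSuccess("203.0.113.7", start.plusSeconds(30)));
        // A success from an IP with no failure history: no alert
        System.out.println(rule.onLoginSuccess("198.51.100.4", start.plusSeconds(30)));
    }
}
```

Keying on IP alone misses distributed credential stuffing, where each address tries only a few passwords; keying the same counter on username as well is a common complement.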
Phase 3: Containment
@PostMapping("/admin/security/lockout/{userId}")
@PreAuthorize("hasRole('SECURITY_ADMIN')")
public ResponseEntity<Void> emergencyLockout(@PathVariable Long userId,
                                             @RequestParam String reason) {
    // 1. Disable the account so no new logins or token issuance can succeed
    userRepository.findById(userId).ifPresent(user -> {
        user.setLocked(true);
        userRepository.save(user);
    });

    // 2. Invalidate all server-side sessions
    sessionRepository.deleteAllByUserId(userId);

    // 3. Reject every JWT issued before now. The key's TTL (7 days here) must be
    //    at least the longest token lifetime; otherwise old tokens are accepted
    //    again once the key expires.
    redis.opsForValue().set("user:tokens:blacklist:" + userId,
            Instant.now().toString(), Duration.ofDays(7));

    // 4. Record who locked the account and why
    auditService.record(AuditEvent.securityAction("EMERGENCY_LOCKOUT", userId, reason));
    return ResponseEntity.noContent().build();
}

// Called from the JWT filter on every authenticated request
public boolean isUserBlacklisted(Long userId, Instant tokenIssuedAt) {
    String blacklistedAt = redis.opsForValue().get("user:tokens:blacklist:" + userId);
    if (blacklistedAt == null) {
        return false;
    }
    // Tokens issued before the lockout timestamp are rejected
    return tokenIssuedAt.isBefore(Instant.parse(blacklistedAt));
}
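The cutoff check above depends on two things: comparing the token's issued-at time against the lockout timestamp, and keeping the cutoff around for at least as long as any token can live. The sketch below reproduces that behavior with an in-memory map standing in for Redis so it can be run in isolation. `TokenCutoffRegistry` and its method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the Redis-backed "reject tokens issued before X" check. */
public class TokenCutoffRegistry {

    private static final class Entry {
        final Instant cutoff;     // tokens issued before this are rejected
        final Instant expiresAt;  // when the entry itself disappears (the Redis TTL)

        Entry(Instant cutoff, Instant expiresAt) {
            this.cutoff = cutoff;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<Long, Entry> cutoffs = new ConcurrentHashMap<>();
    private final Duration ttl;

    public TokenCutoffRegistry(Duration ttl) {
        this.ttl = ttl;
    }

    /** Revokes every token for the user that was issued before {@code now}. */
    public void revokeAllBefore(long userId, Instant now) {
        cutoffs.put(userId, new Entry(now, now.plus(ttl)));
    }

    public boolean isRevoked(long userId, Instant tokenIssuedAt, Instant now) {
        Entry entry = cutoffs.get(userId);
        if (entry == null) {
            return false;
        }
        if (!now.isBefore(entry.expiresAt)) {
            // Entry has aged out, mirroring Redis key expiry
            cutoffs.remove(userId);
            return false;
        }
        return tokenIssuedAt.isBefore(entry.cutoff);
    }

    public static void main(String[] args) {
        TokenCutoffRegistry registry = new TokenCutoffRegistry(Duration.ofDays(7));
        Instant lockout = Instant.parse("2024-01-15T14:30:00Z");
        registry.revokeAllBefore(42L, lockout);

        Instant oldToken = Instant.parse("2024-01-15T14:00:00Z");
        Instant newToken = Instant.parse("2024-01-15T14:31:00Z");
        Instant checkTime = Instant.parse("2024-01-15T15:00:00Z");

        System.out.println(registry.isRevoked(42L, oldToken, checkTime)); // issued before lockout
        System.out.println(registry.isRevoked(42L, newToken, checkTime)); // issued after lockout
        // After the TTL the entry is gone, so the old token is no longer rejected here;
        // this is only safe if the token itself has expired by then.
        System.out.println(registry.isRevoked(42L, oldToken, lockout.plus(Duration.ofDays(8))));
    }
}
```

The last call shows why the TTL matters: once the entry expires, the registry no longer rejects the old token, so the token's own expiry has to have taken over by that point.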
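# Network containment: isolate the compromised instance (AWS).
# --groups replaces ALL security groups on the instance; pass the ID of an SG with no inbound/outbound rules.
aws ec2 modify-instance-attribute \
    --instance-id i-1234567890abcdef0 \
    --groups sg-isolation-group
# Block an IP at the WAF. update-ip-set also requires --scope, --id, and --lock-token
# (both values are returned by `aws wafv2 list-ip-sets`). --addresses REPLACES the whole
# list, so include existing entries. $IP_SET_ID and $LOCK_TOKEN are placeholders; REGIONAL is assumed.
aws wafv2 update-ip-set --name BlockedIPs --scope REGIONAL \
    --id "$IP_SET_ID" --lock-token "$LOCK_TOKEN" \
    --addresses "1.2.3.4/32"
# Revoke the compromised credential immediately, then issue a replacement key.
# --user-name is needed when the key belongs to another user; $COMPROMISED_USER is a placeholder.
aws iam delete-access-key --access-key-id AKIAIOSFODNN7EXAMPLE \
    --user-name "$COMPROMISED_USER"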
Phase 6: Post-Incident Review Template
## Post-Incident Review: INC-2024-001
**Severity:** P1 Critical
**Duration:** 4h 23min
### Timeline
- 14:00 - Alert: anomalous S3 access pattern
- 14:15 - On-call acknowledged
- 14:45 - Scope: 50,000 user emails accessed
- 15:30 - Compromised credential rotated
- 18:23 - All-clear declared
### Root Cause
Long-lived AWS access key exposed in public GitHub repo (committed 6 months ago).
### Contributing Factors
- No secrets scanning in CI pipeline
- Access key had overly broad S3 permissions (s3:* on all buckets)
- No CloudTrail alert on unusual S3 GetObject patterns
### Action Items
| Action | Owner | Due |
|---|---|---|
| Add Gitleaks to all repos | DevOps | +7 days |
| Audit all IAM access keys | Security | +7 days |
| Add S3 CloudTrail alerting | Security | +14 days |
| Implement least-privilege IAM | IAM team | +30 days |
Vulnerability Management
CVSS Scoring & Remediation SLAs
| CVSS | Severity | SLA |
|---|---|---|
| 9.0-10.0 | Critical | 24 hours |
| 7.0-8.9 | High | 7 days |
| 4.0-6.9 | Medium | 30 days |
| 0.1-3.9 | Low | 90 days |
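Mapping a CVSS base score to a remediation deadline is simple enough to automate in a ticketing hook. The sketch below encodes the bands and SLAs from the table above; the class and method names (`RemediationSla`, `bandFor`, `dueDate`) are hypothetical. A score of 0.0 (CVSS "None") has no SLA in the table, so the sketch rejects it rather than guessing.

```java
import java.time.Duration;
import java.time.Instant;

public class RemediationSla {

    /** Severity bands and remediation SLAs, copied from the table above. */
    enum Band {
        CRITICAL(Duration.ofHours(24)),
        HIGH(Duration.ofDays(7)),
        MEDIUM(Duration.ofDays(30)),
        LOW(Duration.ofDays(90));

        final Duration sla;

        Band(Duration sla) {
            this.sla = sla;
        }
    }

    /** Maps a CVSS base score (0.1 to 10.0) to its severity band. */
    static Band bandFor(double cvss) {
        // Written as a negated range check so NaN is rejected as well
        if (!(cvss >= 0.1 && cvss <= 10.0)) {
            throw new IllegalArgumentException("CVSS score outside 0.1-10.0: " + cvss);
        }
        if (cvss >= 9.0) return Band.CRITICAL;
        if (cvss >= 7.0) return Band.HIGH;
        if (cvss >= 4.0) return Band.MEDIUM;
        return Band.LOW;
    }

    /** Deadline by which the vulnerability must be remediated. */
    static Instant dueDate(double cvss, Instant discoveredAt) {
        return discoveredAt.plus(bandFor(cvss).sla);
    }

    public static void main(String[] args) {
        Instant discovered = Instant.parse("2024-01-15T09:00:00Z"); // illustrative
        System.out.println(bandFor(9.8) + " due " + dueDate(9.8, discovered));
        System.out.println(bandFor(7.5) + " due " + dueDate(7.5, discovered));
        System.out.println(bandFor(3.1) + " due " + dueDate(3.1, discovered));
    }
}
```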
Security Metrics
| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean Time to Detect: average time from compromise to detection | < 1h for P1 |
| MTTR | Mean Time to Respond: average time from detection to containment and remediation | < 4h for P1 |
| Dwell Time | How long an attacker was in the environment undetected | < 24h |
| False Positive Rate | % of alerts that are false positives | < 10% |
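These metrics are straightforward to compute once each incident record carries three timestamps. The sketch below is one way to do it, under stated assumptions: MTTD is taken as the mean of per-incident dwell time (compromise to detection), MTTR as the mean of detection to resolution, and the `Incident` type and the sample data are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Function;

public class SecurityMetrics {

    /** Minimal incident record; field names are illustrative. */
    static final class Incident {
        final Instant compromisedAt;
        final Instant detectedAt;
        final Instant resolvedAt;

        Incident(String compromisedAt, String detectedAt, String resolvedAt) {
            this.compromisedAt = Instant.parse(compromisedAt);
            this.detectedAt = Instant.parse(detectedAt);
            this.resolvedAt = Instant.parse(resolvedAt);
        }
    }

    /** Mean of (detection - compromise) across incidents. */
    static Duration meanTimeToDetect(List<Incident> incidents) {
        return mean(incidents, i -> Duration.between(i.compromisedAt, i.detectedAt));
    }

    /** Mean of (resolution - detection) across incidents. */
    static Duration meanTimeToRespond(List<Incident> incidents) {
        return mean(incidents, i -> Duration.between(i.detectedAt, i.resolvedAt));
    }

    /** Share of alerts that turned out to be false positives, from 0.0 to 1.0. */
    static double falsePositiveRate(long falsePositives, long totalAlerts) {
        if (totalAlerts <= 0 || falsePositives < 0 || falsePositives > totalAlerts) {
            throw new IllegalArgumentException("invalid alert counts");
        }
        return (double) falsePositives / totalAlerts;
    }

    private static Duration mean(List<Incident> incidents, Function<Incident, Duration> span) {
        double avgMillis = incidents.stream()
                .mapToLong(i -> span.apply(i).toMillis())
                .average()
                .orElseThrow(() -> new IllegalArgumentException("no incidents"));
        return Duration.ofMillis(Math.round(avgMillis));
    }

    public static void main(String[] args) {
        // Sample data, invented for illustration
        List<Incident> incidents = List.of(
                new Incident("2024-01-15T13:30:00Z", "2024-01-15T14:00:00Z", "2024-01-15T18:23:00Z"),
                new Incident("2024-02-02T10:00:00Z", "2024-02-02T11:30:00Z", "2024-02-02T13:07:00Z"));

        System.out.println("MTTD: " + meanTimeToDetect(incidents));   // PT1H
        System.out.println("MTTR: " + meanTimeToRespond(incidents));  // PT3H
        System.out.println("FP rate: " + falsePositiveRate(8, 100));  // 0.08
    }
}
```

Means hide outliers: a single incident with a months-long dwell time can sit behind an acceptable MTTD, so tracking the maximum or a high percentile alongside the mean is worth considering.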
Interview Questions
- Describe the six phases of the incident response lifecycle and how they map to the phases in NIST SP 800-61.
- How do you contain a compromised user account in a microservices system?
- What is dwell time and why does it matter?
- What should a post-incident review cover?
- What is CVSS and how does it drive remediation SLAs?
- What is the difference between vulnerability assessment and penetration testing?
- What security events should trigger an alert in your system?
- How do you handle a situation where an AWS access key is committed to GitHub?
- What metrics would you track to measure the effectiveness of a security program?
- What is threat hunting and how does it differ from reactive incident response?