Incident Response & Security Operations

The question is not whether you'll have a security incident; it's whether you'll be prepared when it happens.


Incident Response Lifecycle (NIST)

1. PREPARATION: Build IR capability before incidents happen
2. DETECTION: Identify that an incident has occurred
3. CONTAINMENT: Stop the damage from spreading
4. ERADICATION: Remove the threat from the environment
5. RECOVERY: Restore systems to normal operation
6. POST-INCIDENT: Learn and improve

Note: NIST SP 800-61 Rev. 2 itself groups these activities into four phases (Preparation; Detection and Analysis; Containment, Eradication, and Recovery; Post-Incident Activity). The six steps listed here split its third phase into separate steps.

Phase 1: Preparation

Severity Levels

| Severity | Definition | Response SLA | Escalation |
|---|---|---|---|
| P1 Critical | Production breach, complete outage | 15 min | CTO, Legal, DPO |
| P2 High | Significant data exposure, major degradation | 1 hour | Engineering Lead |
| P3 Medium | Limited exposure, degraded service | 4 hours | On-call Engineer |
| P4 Low | Minor concern, no exposure | 24 hours | Next business day |
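
The severity levels can be encoded so that tooling can compute response deadlines and flag missed SLAs. The following is a minimal sketch, not a reference implementation: the `Severity` enum and its method names are hypothetical, and only the durations and escalation targets come from the table.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical enum mirroring the severity table; all names are illustrative.
enum Severity {
    P1_CRITICAL(Duration.ofMinutes(15), "CTO, Legal, DPO"),
    P2_HIGH(Duration.ofHours(1), "Engineering Lead"),
    P3_MEDIUM(Duration.ofHours(4), "On-call Engineer"),
    P4_LOW(Duration.ofHours(24), "Next business day");

    private final Duration responseSla;
    private final String escalation;

    Severity(Duration responseSla, String escalation) {
        this.responseSla = responseSla;
        this.escalation = escalation;
    }

    Duration responseSla() { return responseSla; }

    String escalation() { return escalation; }

    /** Latest time by which the incident must be acknowledged. */
    Instant responseDeadline(Instant detectedAt) {
        return detectedAt.plus(responseSla);
    }

    /** True if the acknowledgement came after the SLA deadline. */
    boolean isSlaBreached(Instant detectedAt, Instant acknowledgedAt) {
        return acknowledgedAt.isAfter(responseDeadline(detectedAt));
    }
}
```

For example, a P1 alert raised at 14:00 and acknowledged at 14:15 is exactly on the deadline, so `isSlaBreached` returns false; an acknowledgement at 14:16 would return true.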

Phase 2: Detection

Application-Level Detection

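import java.time.Instant;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class SecurityEventPublisher {
    private static final Logger log = LoggerFactory.getLogger(SecurityEventPublisher.class);
    // MetricsService and Json are assumed to be application helpers defined elsewhere.
    private final MetricsService metricsService;

    public SecurityEventPublisher(MetricsService metricsService) {
        this.metricsService = metricsService;
    }

    public void publishLoginFailure(String username, String ip, String reason) {
        // Emit one structured JSON line per event so the SIEM can parse it.
        log.warn("{}", Json.encode(Map.of(
            "eventType", "AUTH_FAILURE",
            "severity", "MEDIUM",
            "username", username,
            "sourceIp", ip,
            "reason", reason,
            "timestamp", Instant.now().toString()
        )));
        metricsService.increment("security.login.failure", "ip", ip);
    }
}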

// Detection rules in SIEM:
// RULE: 10+ login failures then success from same IP → credential stuffing + breach
// RULE: data_export.records > 10000 outside business hours → anomalous bulk access
// RULE: NEW admin user created → privilege escalation event
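
To make the first rule concrete, here is a minimal in-memory sketch of "N failures followed by a success from the same IP". The `CredentialStuffingDetector` class is hypothetical and deliberately omits the time window that a real SIEM rule would apply; in practice this logic would live in the SIEM's own query language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory version of the first SIEM rule. It has no time
// window or eviction, so it is an illustration rather than production code.
class CredentialStuffingDetector {
    private final int failureThreshold;
    private final Map<String, Integer> failuresByIp = new HashMap<>();

    CredentialStuffingDetector(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Record a failed login from the given IP. */
    void onLoginFailure(String ip) {
        failuresByIp.merge(ip, 1, Integer::sum);
    }

    /**
     * Record a successful login. Returns true when the success follows at
     * least failureThreshold failures from the same IP, which the rule
     * treats as a likely account compromise. The counter for that IP is
     * reset either way.
     */
    boolean onLoginSuccess(String ip) {
        Integer failures = failuresByIp.remove(ip);
        return failures != null && failures >= failureThreshold;
    }
}
```

With a threshold of 10, ten calls to `onLoginFailure("1.2.3.4")` followed by `onLoginSuccess("1.2.3.4")` return true, while a success from an IP with only one prior failure returns false.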

Phase 3: Containment

@PostMapping("/admin/security/lockout/{userId}")
@PreAuthorize("hasRole('SECURITY_ADMIN')")
public ResponseEntity<Void> emergencyLockout(@PathVariable Long userId,
                                             @RequestParam String reason) {
    // 1. Disable the account so no new logins succeed
    userRepository.findById(userId).ifPresent(user -> {
        user.setLocked(true);
        userRepository.save(user);
    });

    // 2. Invalidate all server-side sessions
    sessionRepository.deleteAllByUserId(userId);

    // 3. Blacklist all JWTs issued before now. The TTL (7 days here) must be
    //    at least the longest token lifetime, or old tokens outlive the entry.
    redis.opsForValue().set("user:tokens:blacklist:" + userId,
            Instant.now().toString(), Duration.ofDays(7));

    auditService.record(AuditEvent.securityAction("EMERGENCY_LOCKOUT", userId, reason));
    return ResponseEntity.noContent().build();
}

// Check in the JWT filter: reject any token issued before the lockout instant
public boolean isUserBlacklisted(Long userId, Instant tokenIssuedAt) {
    String blacklistedAt = redis.opsForValue().get("user:tokens:blacklist:" + userId);
    if (blacklistedAt == null) return false;
    return tokenIssuedAt.isBefore(Instant.parse(blacklistedAt));
}
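
# Network containment: isolate the compromised instance (AWS)
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --groups sg-isolation-group   # SG with NO inbound/outbound rules

# Block IP at WAF. update-ip-set also requires --scope, --id, and --lock-token
# (read the ID and lock token from `aws wafv2 get-ip-set`). --addresses
# REPLACES the whole set, so include the existing entries as well.
aws wafv2 update-ip-set --name BlockedIPs --scope REGIONAL \
  --id "$IP_SET_ID" --lock-token "$LOCK_TOKEN" \
  --addresses "1.2.3.4/32"

# Rotate compromised credentials immediately: deactivate the key first, then
# delete it once a replacement is in place (add --user-name for another user's key)
aws iam update-access-key --access-key-id AKIAIOSFODNN7EXAMPLE --status Inactive
aws iam delete-access-key --access-key-id AKIAIOSFODNN7EXAMPLE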

Phase 6: Post-Incident Review Template

## Post-Incident Review: INC-2024-001

**Severity:** P1 Critical
**Duration:** 4h 23min

### Timeline
- 14:00 - Alert: anomalous S3 access pattern
- 14:15 - On-call acknowledged
- 14:45 - Scope: 50,000 user emails accessed
- 15:30 - Compromised credential rotated
- 18:23 - All-clear declared

### Root Cause
Long-lived AWS access key exposed in public GitHub repo (committed 6 months ago).

### Contributing Factors
- No secrets scanning in CI pipeline
- Access key had overly broad S3 permissions (s3:* on all buckets)
- No CloudTrail alert on unusual S3 GetObject patterns

### Action Items
| Action | Owner | Due |
|---|---|---|
| Add Gitleaks to all repos | DevOps | +7 days |
| Audit all IAM access keys | Security | +7 days |
| Add S3 CloudTrail alerting | Security | +14 days |
| Implement least-privilege IAM | IAM team | +30 days |
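
The first action item depends on a dedicated scanner such as Gitleaks. Purely to illustrate what such a scan looks for, here is a minimal sketch that flags strings shaped like AWS access key IDs. The `AccessKeyScanner` class is hypothetical, and the pattern (`AKIA` followed by 16 uppercase letters or digits) is a commonly used heuristic, not a complete rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; real scanners cover many more credential formats.
class AccessKeyScanner {
    // Heuristic: "AKIA" followed by 16 uppercase letters or digits,
    // bounded so that longer alphanumeric runs are not matched.
    private static final Pattern AWS_ACCESS_KEY_ID =
            Pattern.compile("\\bAKIA[0-9A-Z]{16}\\b");

    /** Returns every substring of text that looks like an AWS access key ID. */
    static List<String> findAccessKeyIds(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = AWS_ACCESS_KEY_ID.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }
}
```

Running `findAccessKeyIds` over each changed file in a CI step and failing the build on any hit gives a crude version of the missing control; a maintained scanner remains the better choice.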

Vulnerability Management

CVSS Scoring & Remediation SLAs

| CVSS | Severity | SLA |
|---|---|---|
| 9.0–10.0 | Critical | 24 hours |
| 7.0–8.9 | High | 7 days |
| 4.0–6.9 | Medium | 30 days |
| 0.1–3.9 | Low | 90 days |
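
The mapping from score to remediation deadline is simple enough to automate in a ticketing or scanning pipeline. The sketch below assumes a hypothetical `CvssSla` helper; the thresholds and SLAs are taken from the table, and a score of 0.0 is treated as "None" with no SLA.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper mapping a CVSS base score to the table's severity and SLA.
class CvssSla {

    /** Severity label for a CVSS base score in the range 0.0 to 10.0. */
    static String severity(double score) {
        if (score < 0.0 || score > 10.0) {
            throw new IllegalArgumentException("CVSS score out of range: " + score);
        }
        if (score >= 9.0) return "Critical";
        if (score >= 7.0) return "High";
        if (score >= 4.0) return "Medium";
        if (score >= 0.1) return "Low";
        return "None";
    }

    /** Remediation SLA for the score, or empty when the score is 0.0. */
    static Optional<Duration> remediationSla(double score) {
        switch (severity(score)) {
            case "Critical": return Optional.of(Duration.ofHours(24));
            case "High":     return Optional.of(Duration.ofDays(7));
            case "Medium":   return Optional.of(Duration.ofDays(30));
            case "Low":      return Optional.of(Duration.ofDays(90));
            default:         return Optional.empty();
        }
    }
}
```

For instance, a finding scored 9.8 maps to "Critical" with a 24-hour SLA, and one scored 5.3 maps to "Medium" with 30 days.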

Security Metrics

| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean Time to Detect: how long before a breach is detected | < 1h for P1 |
| MTTR | Mean Time to Respond: time to contain and remediate | < 4h for P1 |
| Dwell Time | How long an attacker was in the environment undetected | < 24h |
| False Positive Rate | % of alerts that are false positives | < 10% |
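
MTTD and MTTR are plain averages over incident timestamps. The sketch below shows one way to compute them; the `IncidentMetrics` class and its field names are hypothetical, and measuring MTTR from detection to resolution is an assumption, since teams define the start point differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical metrics helper; names and the MTTR start point are assumptions.
class IncidentMetrics {

    /** Minimal incident record with the three timestamps the metrics need. */
    static final class Incident {
        final Instant startedAt;   // when the malicious activity began
        final Instant detectedAt;  // when the first alert fired
        final Instant resolvedAt;  // when containment and remediation finished

        Incident(Instant startedAt, Instant detectedAt, Instant resolvedAt) {
            this.startedAt = startedAt;
            this.detectedAt = detectedAt;
            this.resolvedAt = resolvedAt;
        }
    }

    /** Mean of (detectedAt - startedAt) across incidents. */
    static Duration meanTimeToDetect(List<Incident> incidents) {
        requireNonEmpty(incidents);
        long totalSeconds = 0;
        for (Incident i : incidents) {
            totalSeconds += Duration.between(i.startedAt, i.detectedAt).getSeconds();
        }
        return Duration.ofSeconds(totalSeconds / incidents.size());
    }

    /** Mean of (resolvedAt - detectedAt) across incidents. */
    static Duration meanTimeToRespond(List<Incident> incidents) {
        requireNonEmpty(incidents);
        long totalSeconds = 0;
        for (Incident i : incidents) {
            totalSeconds += Duration.between(i.detectedAt, i.resolvedAt).getSeconds();
        }
        return Duration.ofSeconds(totalSeconds / incidents.size());
    }

    private static void requireNonEmpty(List<Incident> incidents) {
        if (incidents.isEmpty()) {
            throw new IllegalArgumentException("at least one incident is required");
        }
    }
}
```

Per-incident dwell time is the same `startedAt` to `detectedAt` interval that feeds MTTD, so it can be reported from the same records without extra data.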

Interview Questions

  1. Describe the 6 phases of the NIST incident response lifecycle.
  2. How do you contain a compromised user account in a microservices system?
  3. What is dwell time and why does it matter?
  4. What should a post-incident review cover?
  5. What is CVSS and how does it drive remediation SLAs?
  6. What is the difference between vulnerability assessment and penetration testing?
  7. What security events should trigger an alert in your system?
  8. How do you handle a situation where an AWS access key is committed to GitHub?
  9. What metrics would you track to measure the effectiveness of a security program?
  10. What is threat hunting and how does it differ from reactive incident response?