Managing Long-Running Tasks
Operations taking more than ~2 seconds should be made asynchronous. Never block an HTTP connection for a long operation.
The Core Pattern: Async Job API
1. Client POSTs job request → API returns 202 Accepted + job_id
2. Job runs asynchronously in background
3. Client polls GET /jobs/{job_id}/status → {status, progress, result_url}
4. On completion → Client fetches result
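The same flow, seen from the client: a minimal Java sketch using java.net.http (the /reports paths and the 2-second poll interval are assumptions mirroring the server examples in this document). nextAction maps the status codes from the table below to the client's next move.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobPollingClient {
    private static final HttpClient client = HttpClient.newHttpClient(); // does not auto-follow redirects

    // Maps a poll response's status code to the client's next move
    static String nextAction(int statusCode) {
        return switch (statusCode) {
            case 200 -> "done";   // job complete, result in body
            case 202 -> "wait";   // still running, poll again
            case 303 -> "fetch";  // follow Location to the result
            default  -> "error";
        };
    }

    static String runReportJob(String baseUrl, String requestJson) throws Exception {
        HttpRequest submit = HttpRequest.newBuilder(URI.create(baseUrl + "/reports"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestJson))
                .build();
        HttpResponse<String> accepted = client.send(submit, HttpResponse.BodyHandlers.ofString());
        String statusPath = accepted.headers().firstValue("Location").orElseThrow(); // from the 202

        while (true) {
            HttpRequest poll = HttpRequest.newBuilder(URI.create(baseUrl + statusPath)).GET().build();
            HttpResponse<String> status = client.send(poll, HttpResponse.BodyHandlers.ofString());
            switch (nextAction(status.statusCode())) {
                case "done":  return status.body();
                case "fetch": return status.headers().firstValue("Location").orElseThrow();
                case "error": throw new IllegalStateException("HTTP " + status.statusCode());
                default:      Thread.sleep(2_000); // fixed interval; real clients often back off
            }
        }
    }
}
```

A fixed 2-second interval is the simplest choice; production clients usually add exponential backoff or honor a Retry-After header.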
HTTP Status Codes
| Status | Meaning |
|---|---|
| 202 Accepted | Job submitted, not yet complete |
| 200 OK | Job complete, result in body |
| 303 See Other | Redirect to result location |
REST API Design for Async Jobs
// Submit job
@PostMapping("/reports")
public ResponseEntity<JobResponse> generateReport(@RequestBody ReportRequest req) {
    String jobId = jobService.submit(req);
    return ResponseEntity
            .accepted()
            .header("Location", "/reports/" + jobId)
            .body(new JobResponse(jobId, JobStatus.PENDING));
}
// Poll status
@GetMapping("/reports/{jobId}")
public ResponseEntity<JobStatusResponse> getStatus(@PathVariable String jobId) {
    Job job = jobService.findById(jobId);
    // Switch expression must cover every JobStatus constant
    return switch (job.getStatus()) {
        case PENDING, QUEUED, RUNNING -> ResponseEntity.ok(
            new JobStatusResponse(jobId, job.getStatus(), job.getProgress()));
        case COMPLETED -> ResponseEntity
            .status(HttpStatus.SEE_OTHER)
            .header("Location", "/reports/" + jobId + "/result")
            .build();
        case FAILED, DEAD -> ResponseEntity.ok(
            new JobStatusResponse(jobId, job.getStatus(), job.getErrorMessage()));
    };
}
// Get result
@GetMapping("/reports/{jobId}/result")
public ResponseEntity<ReportResult> getResult(@PathVariable String jobId) {
    return ResponseEntity.ok(jobService.getResult(jobId));
}
Job Queue Architecture
API Server → Job Queue (Kafka / RabbitMQ / SQS / Redis)
                 ↓
           Worker Pool (auto-scalable)
                 ↓
           Result Store (DB / S3)
                 ↓
           Notification (webhook / SSE / email)
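The enqueue side of this diagram can be sketched with Spring Kafka's KafkaTemplate (JobService, Job, JobRepository and the "report-jobs" topic are assumptions kept consistent with the controller above, not a fixed API):

```java
@Service
public class JobService {
    private final KafkaTemplate<String, ReportJob> kafka;
    private final JobRepository jobRepository;

    public JobService(KafkaTemplate<String, ReportJob> kafka, JobRepository jobRepository) {
        this.kafka = kafka;
        this.jobRepository = jobRepository;
    }

    public String submit(ReportRequest req) {
        String jobId = UUID.randomUUID().toString();
        // Persist as PENDING before enqueueing, so a status poll never sees an unknown id
        jobRepository.save(new Job(jobId, JobStatus.PENDING));
        kafka.send("report-jobs", jobId, new ReportJob(jobId, req)); // key = jobId for per-job ordering
        return jobId;
    }
}
```

Saving before sending also means a lost enqueue shows up as a permanently PENDING job, which is detectable, rather than a dangling 202.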
Spring Boot + Kafka Workers
@KafkaListener(topics = "report-jobs", concurrency = "5")
public void processJob(ReportJob job) {
    jobRepository.updateStatus(job.getId(), JobStatus.RUNNING);
    try {
        ReportResult result = reportGenerator.generate(job);
        String resultKey = s3Service.store(result);
        jobRepository.complete(job.getId(), resultKey);
        notificationService.notifyComplete(job.getUserId(), job.getId());
    } catch (Exception e) {
        jobRepository.fail(job.getId(), e.getMessage());
        log.error("Job {} failed", job.getId(), e);
    }
}
Progress Tracking
Store Progress in Redis
@Service
public class ProgressTracker {
    @Autowired private RedisTemplate<String, String> redis;

    public void updateProgress(String jobId, int percent, String message) {
        String key = "job:progress:" + jobId;
        Map<String, String> progress = Map.of(
            "percent", String.valueOf(percent),
            "message", message,
            "updatedAt", Instant.now().toString()
        );
        redis.opsForHash().putAll(key, progress);
        redis.expire(key, Duration.ofHours(24)); // expire stale entries
    }

    public JobProgress getProgress(String jobId) {
        Map<Object, Object> data = redis.opsForHash()
            .entries("job:progress:" + jobId);
        return JobProgress.fromMap(data);
    }
}
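A worker can feed this tracker as it goes. A self-contained sketch of chunked progress reporting (ProgressReporter stands in for ProgressTracker.updateProgress; the chunking itself is hypothetical, not part of the API above):

```java
public class ChunkedWorker {
    // Minimal stand-in for ProgressTracker.updateProgress(jobId, percent, message)
    interface ProgressReporter { void update(int percent, String message); }

    // Report the cumulative percentage after each chunk of work completes
    static void generateWithProgress(int totalChunks, ProgressReporter reporter) {
        for (int i = 1; i <= totalChunks; i++) {
            // ... generate chunk i here ...
            reporter.update(i * 100 / totalChunks, "chunk " + i + "/" + totalChunks);
        }
    }
}
```

Reporting per chunk (rather than per row) keeps the Redis write rate bounded regardless of job size.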
Real-Time Progress via SSE
@GetMapping(value = "/jobs/{jobId}/progress", produces = TEXT_EVENT_STREAM_VALUE)
public SseEmitter streamProgress(@PathVariable String jobId) {
    SseEmitter emitter = new SseEmitter(300_000L); // 5 min timeout
    ScheduledFuture<?> task = scheduledExecutor.scheduleAtFixedRate(() -> {
        try {
            JobProgress progress = progressTracker.getProgress(jobId);
            emitter.send(progress);
            if (progress.isTerminal()) emitter.complete();
        } catch (IOException e) {
            emitter.completeWithError(e); // client likely disconnected
        }
    }, 0, 1, TimeUnit.SECONDS);
    // Stop polling Redis once the stream ends, however it ends
    emitter.onCompletion(() -> task.cancel(false));
    emitter.onTimeout(() -> task.cancel(false));
    return emitter;
}
Webhooks (Push Callbacks)
Instead of polling, the server POSTs to the client when done.
1. Client registers: POST /webhooks {url: "https://client.com/callback", events: ["job.complete"]}
2. Job completes → Server POSTs to callback URL
3. Client processes result
@Service
public class WebhookService {
    @Async
    public void deliver(String callbackUrl, Object payload) {
        int maxRetries = 3;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                restTemplate.postForEntity(callbackUrl, payload, Void.class);
                return; // Success
            } catch (Exception e) {
                log.warn("Webhook delivery attempt {} failed for {}", attempt, callbackUrl);
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(exponentialBackoff(attempt)); // 2s, 4s, 8s
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // restore flag, stop retrying
                        return;
                    }
                }
            }
        }
        log.error("Webhook delivery failed after {} attempts", maxRetries);
    }

    private long exponentialBackoff(int attempt) {
        return (long) Math.pow(2, attempt) * 1000;
    }
}
Webhook Security
- HMAC signature on payload: X-Signature: sha256=<hmac>
- Expiring timestamps to prevent replay attacks
- Retry with idempotency keys
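A minimal sketch of the HMAC scheme using only the JDK (the header name and hex encoding follow common convention, e.g. GitHub-style webhooks, rather than any fixed standard):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;

public class WebhookSigner {
    // Computes the X-Signature header value:
    // "sha256=" + hex-encoded HMAC-SHA256 of the body, keyed with the shared secret
    public static String sign(String secret, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] raw = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : raw) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The receiver recomputes the signature over the raw body and compares with a constant-time check (e.g. MessageDigest.isEqual) before trusting the payload.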
Job Scheduling (Cron / Delayed Jobs)
Spring Scheduler (Single Node)
@Scheduled(cron = "0 0 2 * * ?") // 2am daily
public void generateDailyReport() {
    jobService.submit(new DailyReportJob());
}
Distributed Scheduling (Multi-Node)
Use a distributed lock so that each scheduled job runs on exactly one node:
// ShedLock — prevents concurrent execution across nodes
@Scheduled(fixedDelay = 60_000)
@SchedulerLock(name = "generateDailyReport", lockAtMostFor = "10m", lockAtLeastFor = "5m")
public void generateDailyReport() {
    // Only one node executes at a time
}
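ShedLock also needs a LockProvider backed by storage all nodes share; a sketch assuming the JDBC provider (artifact names and the shedlock table layout come from ShedLock's documentation, so verify against your version):

```java
// Assumed dependencies: shedlock-spring and shedlock-provider-jdbc-template
@Configuration
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
public class SchedulerLockConfig {
    @Bean
    public LockProvider lockProvider(DataSource dataSource) {
        // Locks live as rows in a shared "shedlock" table visible to every node
        return new JdbcTemplateLockProvider(new JdbcTemplate(dataSource));
    }
}
```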
Quartz Scheduler (Enterprise)
@Bean
public JobDetail reportJobDetail() {
    return JobBuilder.newJob(ReportJob.class)
        .withIdentity("reportJob")
        .storeDurably()
        .build();
}

@Bean
public Trigger reportTrigger(JobDetail reportJobDetail) {
    return TriggerBuilder.newTrigger()
        .forJob(reportJobDetail)
        .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
        .build();
}
Dead Letter Queue (DLQ) — Failed Job Handling
Normal Queue → Worker (fails 3x) → Dead Letter Queue (DLQ)
                                        ↓
                        Alert / Manual inspection / Replay
# Spring Kafka DLQ config
spring:
  kafka:
    consumer:
      group-id: report-workers
    listener:
      ack-mode: manual_immediate
# Custom error handler → send to DLQ after N retries
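The "custom error handler" in that comment can be sketched with Spring Kafka's built-ins (DefaultErrorHandler is the spring-kafka 2.8+ API; the dead-letter topic name defaults to the source topic plus ".DLT"):

```java
// After 3 retries spaced 1s apart, the failed record is published to "report-jobs.DLT"
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
    var recoverer = new DeadLetterPublishingRecoverer(template);
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1_000L, 3));
}
```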
Task State Machine
PENDING → QUEUED → RUNNING → COMPLETED
                      ↘ FAILED → (retry) → RUNNING
                               → (max retries) → DEAD
public enum JobStatus {
    PENDING, QUEUED, RUNNING, COMPLETED, FAILED, DEAD;

    public boolean isTerminal() {
        return this == COMPLETED || this == DEAD;
    }
}
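The transitions in the diagram can also be enforced explicitly, so a buggy worker cannot, say, move a COMPLETED job back to RUNNING. A self-contained sketch (the enum is repeated here only so the snippet compiles on its own):

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class JobStateMachine {
    enum JobStatus { PENDING, QUEUED, RUNNING, COMPLETED, FAILED, DEAD }

    // Each state maps to the set of states it may legally move to
    private static final Map<JobStatus, Set<JobStatus>> ALLOWED = new EnumMap<>(Map.of(
        JobStatus.PENDING,   EnumSet.of(JobStatus.QUEUED),
        JobStatus.QUEUED,    EnumSet.of(JobStatus.RUNNING),
        JobStatus.RUNNING,   EnumSet.of(JobStatus.COMPLETED, JobStatus.FAILED),
        JobStatus.FAILED,    EnumSet.of(JobStatus.RUNNING, JobStatus.DEAD), // retry or give up
        JobStatus.COMPLETED, EnumSet.noneOf(JobStatus.class),               // terminal
        JobStatus.DEAD,      EnumSet.noneOf(JobStatus.class)                // terminal
    ));

    public static boolean canTransition(JobStatus from, JobStatus to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

In practice the check belongs in the persistence layer (e.g. a conditional UPDATE on the current status), so concurrent workers cannot race past it.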
Interview Questions
- Why should long-running operations be made asynchronous? What's the risk of blocking?
- How do you design a REST API for an async operation? What HTTP status codes apply?
- What is a dead letter queue and when is it used?
- How do you track progress of a long-running background job?
- What are webhooks? What are their reliability challenges?
- How do you prevent duplicate job execution in a distributed scheduler?
- How would you design a system that processes video uploads asynchronously?
- What happens if a worker crashes mid-job? How do you ensure the job is retried?