Skip to main content

Managing Long-Running Tasks

Who this guide is for

Why Long-Running Tasks Need Special Handlingโ€‹

HTTP is designed for short request-response cycles. A typical Nginx timeout is 60 seconds, an AWS ALB times out at 60 seconds, and a client's browser may abort after 30โ€“60 seconds.

Operations that take more than 2โ€“5 seconds โ€” video transcoding, PDF generation, bulk data exports, ML inference, email campaigns โ€” cannot safely run inside an HTTP request handler because:

  1. Thread exhaustion โ€” each blocked HTTP thread cannot serve other requests, causing your server to saturate under concurrent long operations.
  2. Client-side timeouts โ€” the client will give up and retry, potentially triggering duplicate processing.
  3. Load balancer timeouts โ€” the connection is terminated by infrastructure even if the server hasn't finished.
  4. No progress visibility โ€” the client sees nothing until the operation completes or fails.
  5. No retry safety โ€” if the server crashes mid-operation, there's no record of what state things are in.
The rule of thumb

Any operation expected to take more than 2 seconds should be made asynchronous.


The Core Async Job Patternโ€‹

The solution is a three-step pattern:

Step 1: Client submits job
Client โ†’ POST /api/reports โ†’ 202 Accepted { "job_id": "abc-123", "status_url": "/api/reports/abc-123" }

Step 2: Job runs asynchronously
API Server โ†’ Job Queue โ†’ Worker Pool โ†’ Result Store

Step 3: Client polls for result
Client โ†’ GET /api/reports/abc-123 โ†’ { "status": "RUNNING", "progress": 45 }
Client โ†’ GET /api/reports/abc-123 โ†’ { "status": "COMPLETED", "result_url": "/api/reports/abc-123/result" }
Client โ†’ GET /api/reports/abc-123/result โ†’ <report data>

HTTP Status Codesโ€‹

StatusWhen to Use
202 AcceptedJob submitted successfully, processing not yet complete
200 OKJob status or completed result returned in body
303 See OtherRedirect to the result resource (on completion)
404 Not FoundJob ID does not exist
410 GoneJob result has expired and been cleaned up

REST API Design for Async Jobsโ€‹

@RestController
@RequestMapping("/api/reports")
@RequiredArgsConstructor
@Slf4j
public class ReportController {

private final JobService jobService;

// Step 1: Submit job โ€” returns immediately with 202
@PostMapping
public ResponseEntity<JobResponse> submitReport(@RequestBody @Valid ReportRequest req,
Authentication auth) {
String jobId = jobService.submit(req, auth.getName());
String statusUrl = "/api/reports/" + jobId;

log.info("Report job {} submitted for user {}", jobId, auth.getName());

return ResponseEntity
.accepted()
.header("Location", statusUrl) // RFC-compliant: Location points to status
.header("Retry-After", "5") // Hint to client: poll after 5s
.body(new JobResponse(jobId, JobStatus.PENDING, statusUrl));
}

// Step 2: Poll status
@GetMapping("/{jobId}")
public ResponseEntity<JobStatusResponse> getStatus(@PathVariable String jobId,
Authentication auth) {
Job job = jobService.findByIdAndUser(jobId, auth.getName())
.orElseThrow(() -> new JobNotFoundException(jobId));

return switch (job.getStatus()) {
case PENDING, QUEUED -> ResponseEntity.ok()
.header("Retry-After", "3") // poll again in 3s
.body(JobStatusResponse.pending(job));

case RUNNING -> ResponseEntity.ok()
.header("Retry-After", "1") // job is active โ€” poll more frequently
.body(JobStatusResponse.running(job));

case COMPLETED -> ResponseEntity.status(HttpStatus.SEE_OTHER)
.header("Location", "/api/reports/" + jobId + "/result")
.body(JobStatusResponse.completed(job));

case FAILED -> ResponseEntity.ok()
.body(JobStatusResponse.failed(job));

case DEAD -> ResponseEntity.ok()
.body(JobStatusResponse.dead(job, "Job exceeded retry limit"));
};
}

// Step 3: Fetch result
@GetMapping("/{jobId}/result")
public ResponseEntity<ReportResult> getResult(@PathVariable String jobId,
Authentication auth) {
Job job = jobService.findByIdAndUser(jobId, auth.getName())
.orElseThrow(() -> new JobNotFoundException(jobId));

if (job.getStatus() != JobStatus.COMPLETED) {
return ResponseEntity.status(HttpStatus.CONFLICT)
.build(); // 409 โ€” job not yet complete
}

ReportResult result = jobService.getResult(jobId);
return ResponseEntity.ok()
.header("Cache-Control", "no-store") // results may be sensitive
.body(result);
}
}

Job Queue Architectureโ€‹

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ JOB QUEUE ARCHITECTURE โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Client Request
โ”‚
โ–ผ
API Server โ”€โ”€โ”€โ”€โ”€โ”€โ–บ Job Metadata DB โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Admin Dashboard
(202 + job_id) (PostgreSQL) (progress, status)
โ”‚
โ–ผ
Message Queue โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
(Kafka / SQS / RabbitMQ / Redis Streams) โ”‚
โ”‚ โ”‚
โ–ผ โ”‚
Worker Pool โ”€โ”€โ”€โ”€โ–บ Progress Store (Redis) โ”‚
(auto-scalable) โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ–บ SSE / WebSocket โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ–ผ (real-time client) โ”‚
Result Store โ”‚
(DB / S3 / GCS) โ”‚
โ”‚ โ”‚
โ–ผ โ”‚
Notification โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
(Webhook / Email / Push)

Task State Machineโ€‹

A robust job system models the job lifecycle as a state machine to prevent invalid state transitions and enable clear recovery logic.

PENDING โ†’ QUEUED โ†’ RUNNING โ†’ COMPLETED โœ…
โ”‚
โ””โ”€โ”€โ–บ FAILED โ”€โ”€โ–บ (retry counter < max) โ”€โ”€โ–บ QUEUED
โ””โ”€โ”€โ–บ (retry counter >= max) โ”€โ”€โ–บ DEAD โ˜ ๏ธ
public enum JobStatus {
PENDING, // Created but not yet in queue
QUEUED, // In the message queue, awaiting a worker
RUNNING, // Worker is actively processing
COMPLETED, // Finished successfully
FAILED, // Failed this attempt, eligible for retry
DEAD; // Exceeded max retries โ€” manual intervention needed

public boolean isTerminal() {
return this == COMPLETED || this == DEAD;
}

public boolean isRetryable() {
return this == FAILED;
}
}
@Entity
@Table(name = "jobs")
@Data
@Builder
public class Job {
@Id private String id;

@Enumerated(EnumType.STRING)
private JobStatus status;

private String type; // e.g., "REPORT_GENERATION"
private String userId;
private String payload; // serialized job parameters (JSONB)
private String resultKey; // S3 key or DB reference for result

private int progress; // 0โ€“100
private String progressMessage;
private String errorMessage;

private int retryCount;
private int maxRetries; // configurable per job type

private Instant createdAt;
private Instant startedAt;
private Instant completedAt;
private Instant expiresAt; // result TTL โ€” clean up after X days

public void transition(JobStatus newStatus) {
// Prevent illegal transitions
if (this.status.isTerminal()) {
throw new IllegalStateException(
"Cannot transition from terminal status " + this.status);
}
this.status = newStatus;
}

public boolean canRetry() {
return this.retryCount < this.maxRetries;
}
}

Worker Implementation (Spring Boot + Kafka)โ€‹

Basic Workerโ€‹

@Component
@RequiredArgsConstructor
@Slf4j
public class ReportWorker {

private final JobRepository jobRepository;
private final ReportGenerator reportGenerator;
private final S3Service s3Service;
private final ProgressTracker progressTracker;
private final NotificationService notificationService;

@KafkaListener(topics = "report-jobs", groupId = "report-workers", concurrency = "5")
public void processJob(ReportJobMessage message, Acknowledgment ack) {
String jobId = message.getJobId();
log.info("Worker picked up job {}", jobId);

Job job = jobRepository.findById(jobId).orElse(null);
if (job == null) {
log.warn("Job {} not found โ€” may have been deleted. Skipping.", jobId);
ack.acknowledge(); // don't retry โ€” message is stale
return;
}

// Check for idempotency: don't re-process a completed job
if (job.getStatus() == JobStatus.COMPLETED) {
log.warn("Job {} already completed โ€” duplicate message, skipping.", jobId);
ack.acknowledge();
return;
}

// Mark as RUNNING
job.transition(JobStatus.RUNNING);
job.setStartedAt(Instant.now());
jobRepository.save(job);

try {
// Execute the job with progress updates
ReportResult result = reportGenerator.generate(
message.getReportParams(),
(percent, msg) -> progressTracker.update(jobId, percent, msg)
);

// Store result
String resultKey = s3Service.store(jobId, result);

// Mark COMPLETED
job.transition(JobStatus.COMPLETED);
job.setResultKey(resultKey);
job.setCompletedAt(Instant.now());
job.setProgress(100);
jobRepository.save(job);

// Notify user
notificationService.notifyComplete(job.getUserId(), jobId);
log.info("Job {} completed successfully in {}ms", jobId,
Duration.between(job.getStartedAt(), job.getCompletedAt()).toMillis());

ack.acknowledge(); // commit Kafka offset after successful processing

} catch (Exception e) {
handleFailure(job, e);
ack.acknowledge(); // always ack โ€” retry is handled via re-queueing
}
}

private void handleFailure(Job job, Exception e) {
log.error("Job {} failed (attempt {}/{}): {}",
job.getId(), job.getRetryCount() + 1, job.getMaxRetries(), e.getMessage(), e);

job.setRetryCount(job.getRetryCount() + 1);
job.setErrorMessage(e.getMessage());

if (job.canRetry()) {
job.transition(JobStatus.FAILED);
jobRepository.save(job);
// Re-queue for retry (with delay via separate scheduler or DLQ re-drive)
} else {
job.transition(JobStatus.DEAD);
jobRepository.save(job);
notificationService.notifyFailed(job.getUserId(), job.getId(), e.getMessage());
}
}
}

Worker Reliability & Exactly-Once Processingโ€‹

The "Double-Processing" Problemโ€‹

When a worker processes a job and then crashes before acknowledging the Kafka message, the message becomes visible again (after the visibility timeout) and another worker picks it up. The job runs twice.

To handle this safely:

@KafkaListener(topics = "report-jobs")
@Transactional // DB operations are atomic
public void processJob(ReportJobMessage message, Acknowledgment ack) {
// Use optimistic locking or CAS to claim the job atomically
int updated = jobRepository.claimJob(message.getJobId(), JobStatus.QUEUED, JobStatus.RUNNING);

if (updated == 0) {
// Another worker already claimed this job โ€” skip
log.info("Job {} already claimed by another worker", message.getJobId());
ack.acknowledge();
return;
}

// ... process the job
}
public interface JobRepository extends JpaRepository<Job, String> {
// Atomic compare-and-set: only transitions to RUNNING if currently QUEUED
@Modifying
@Query("UPDATE Job j SET j.status = :newStatus, j.startedAt = :now " +
"WHERE j.id = :jobId AND j.status = :currentStatus")
int claimJob(@Param("jobId") String jobId,
@Param("currentStatus") JobStatus currentStatus,
@Param("newStatus") JobStatus newStatus,
@Param("now") Instant now);
}

Checkpoint Pattern for Resumable Jobsโ€‹

For very long jobs (multi-hour data exports), use checkpoints to avoid restarting from zero on failure:

@Service
public class ResumableExportJob {

public void export(String jobId, ExportParams params) {
Job job = jobRepository.findById(jobId).orElseThrow();

// Load last checkpoint (if resuming after crash)
int startPage = job.getCheckpoint() != null
? Integer.parseInt(job.getCheckpoint())
: 0;

int totalPages = dataRepository.countPages(params);

for (int page = startPage; page < totalPages; page++) {
List<Record> batch = dataRepository.fetchPage(page, params);
exportService.writeBatch(jobId, batch);

// Save checkpoint after each page
jobRepository.saveCheckpoint(jobId, String.valueOf(page + 1));
jobRepository.updateProgress(jobId, (page + 1) * 100 / totalPages);

// Allow graceful shutdown check
if (Thread.currentThread().isInterrupted()) {
throw new JobInterruptedException("Export interrupted at page " + page);
}
}
}
}

Progress Trackingโ€‹

Store Progress in Redisโ€‹

@Service
@RequiredArgsConstructor
public class ProgressTracker {

private final RedisTemplate<String, String> redis;

public void update(String jobId, int percent, String message) {
String key = "job:progress:" + jobId;
Map<String, String> progress = Map.of(
"percent", String.valueOf(percent),
"message", message,
"updatedAt", Instant.now().toString()
);
redis.opsForHash().putAll(key, progress);
redis.expire(key, Duration.ofHours(24)); // TTL matches job result retention
}

public JobProgress get(String jobId) {
Map<Object, Object> data = redis.opsForHash().entries("job:progress:" + jobId);
if (data.isEmpty()) return JobProgress.unknown();
return JobProgress.fromMap(data);
}
}

Real-Time Progress via Server-Sent Events (SSE)โ€‹

SSE is a lightweight protocol โ€” a persistent HTTP connection where the server pushes events. Ideal for progress bars.

@GetMapping(value = "/api/jobs/{jobId}/progress", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter streamProgress(@PathVariable String jobId, Authentication auth) {
// 5-minute max duration โ€” client must reconnect for very long jobs
SseEmitter emitter = new SseEmitter(300_000L);

ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(() -> {
try {
JobProgress progress = progressTracker.get(jobId);

emitter.send(SseEmitter.event()
.name("progress")
.data(progress)
.id(String.valueOf(System.currentTimeMillis())));

if (progress.isTerminal()) {
emitter.complete();
}
} catch (IOException e) {
emitter.completeWithError(e); // client disconnected
}
}, 0, 1, TimeUnit.SECONDS);

// Clean up the scheduled task when SSE connection closes
emitter.onCompletion(() -> task.cancel(true));
emitter.onTimeout(() -> task.cancel(true));
emitter.onError(ex -> task.cancel(true));

return emitter;
}

Real-Time Progress via WebSocketโ€‹

For bidirectional communication (user can pause/cancel the job):

@Controller
@RequiredArgsConstructor
public class JobWebSocketController {

private final SimpMessagingTemplate messagingTemplate;

// Worker calls this to push updates to subscribed clients
public void pushProgress(String jobId, JobProgress progress) {
messagingTemplate.convertAndSend(
"/topic/jobs/" + jobId, // client subscribes to this
progress
);
}

// Client can send a cancel command
@MessageMapping("/jobs/{jobId}/cancel")
public void cancelJob(@DestinationVariable String jobId, Principal user) {
jobService.cancel(jobId, user.getName());
}
}

Webhooks (Push Callbacks)โ€‹

Instead of the client polling, the server pushes a notification to a registered callback URL when the job completes.

Reliable Webhook Deliveryโ€‹

@Service
@RequiredArgsConstructor
@Slf4j
public class WebhookDeliveryService {

private final WebhookRepository webhookRepository;
private final RestTemplate restTemplate;

@Async("webhookExecutor")
public void deliver(String webhookId, String callbackUrl, WebhookPayload payload) {
int maxRetries = 5;
long[] backoffMs = {1_000, 5_000, 30_000, 300_000, 1_800_000}; // 1s, 5s, 30s, 5m, 30m

for (int attempt = 0; attempt < maxRetries; attempt++) {
try {
String signature = signPayload(payload); // HMAC-SHA256

ResponseEntity<Void> response = restTemplate.exchange(
RequestEntity.post(URI.create(callbackUrl))
.header("X-Webhook-Id", webhookId)
.header("X-Webhook-Signature", signature)
.header("X-Webhook-Timestamp", String.valueOf(Instant.now().getEpochSecond()))
.contentType(MediaType.APPLICATION_JSON)
.body(payload),
Void.class
);

if (response.getStatusCode().is2xxSuccessful()) {
webhookRepository.markDelivered(webhookId, attempt + 1);
log.info("Webhook {} delivered on attempt {}", webhookId, attempt + 1);
return;
}

log.warn("Webhook {} got non-2xx response: {} (attempt {})",
webhookId, response.getStatusCode(), attempt + 1);

} catch (Exception e) {
log.warn("Webhook {} delivery attempt {} failed: {}", webhookId, attempt + 1, e.getMessage());
}

if (attempt < maxRetries - 1) {
sleep(backoffMs[attempt]);
}
}

webhookRepository.markFailed(webhookId, "Exhausted " + maxRetries + " delivery attempts");
log.error("Webhook {} permanently failed after {} attempts", webhookId, maxRetries);
}

private String signPayload(WebhookPayload payload) {
// HMAC-SHA256 of payload JSON with the tenant's webhook secret
byte[] secret = webhookSecretService.getSecret(payload.getTenantId());
return HmacUtils.hmacSha256Hex(secret, payload.toJson());
}
}

Webhook Securityโ€‹

// Consumer side: verify the webhook signature before processing
@PostMapping("/webhook")
public ResponseEntity<Void> handleWebhook(
@RequestBody String rawBody,
@RequestHeader("X-Webhook-Signature") String signature,
@RequestHeader("X-Webhook-Timestamp") long timestamp) {

// 1. Reject stale webhooks (replay attack prevention)
if (Math.abs(Instant.now().getEpochSecond() - timestamp) > 300) {
return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
}

// 2. Verify HMAC signature
String expectedSig = HmacUtils.hmacSha256Hex(webhookSecret, rawBody);
if (!MessageDigest.isEqual(expectedSig.getBytes(), signature.getBytes())) {
return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
}

// 3. Process idempotently using X-Webhook-Id header
// ...

return ResponseEntity.ok().build();
}

Job Schedulingโ€‹

Single-Node Scheduler (Spring @Scheduled)โ€‹

@Scheduled(cron = "0 0 2 * * ?") // 2am daily โ€” UTC
public void generateDailyReport() {
jobService.submit(new DailyReportJobParams());
}
Single-node only

@Scheduled runs on every instance in a multi-node deployment. If you have 3 replicas, the job runs 3 times simultaneously.

Distributed Scheduling (ShedLock)โ€‹

ShedLock uses a database (or Redis) lock to ensure only one node executes a scheduled job at a time:

@Scheduled(fixedDelay = 60_000)
@SchedulerLock(
name = "generateDailyReport",
lockAtMostFor = "PT10M", // release lock after 10m even if node crashes
lockAtLeastFor = "PT5M" // hold lock for at least 5m to prevent quick re-execution
)
public void generateDailyReport() {
// Only one node executes this at a time across the entire cluster
jobService.submit(new DailyReportJobParams());
}
-- ShedLock requires this table
CREATE TABLE shedlock (
name VARCHAR(64) NOT NULL,
lock_until TIMESTAMP(3) NOT NULL,
locked_at TIMESTAMP(3) NOT NULL,
locked_by VARCHAR(255) NOT NULL,
PRIMARY KEY (name)
);

Enterprise Scheduler (Quartz Clustered)โ€‹

Quartz persists job schedules and execution history in a database, enabling full clustering with failover:

@Configuration
public class QuartzConfig {

@Bean
public SchedulerFactoryBean schedulerFactory(DataSource dataSource) {
SchedulerFactoryBean factory = new SchedulerFactoryBean();
factory.setDataSource(dataSource);

Properties props = new Properties();
props.setProperty("org.quartz.scheduler.instanceId", "AUTO"); // unique per node
props.setProperty("org.quartz.jobStore.class",
"org.quartz.impl.jdbcjobstore.JobStoreTX");
props.setProperty("org.quartz.jobStore.isClustered", "true");
props.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
factory.setQuartzProperties(props);
return factory;
}

@Bean
public JobDetail reportJobDetail() {
return JobBuilder.newJob(DailyReportJob.class)
.withIdentity("dailyReport", "reporting")
.storeDurably()
.build();
}

@Bean
public Trigger reportTrigger(JobDetail reportJobDetail) {
return TriggerBuilder.newTrigger()
.forJob(reportJobDetail)
.withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?")
.withMisfireHandlingInstructionDoNothing()) // skip misfired runs
.build();
}
}

Distributed Schedulingโ€‹

Comparison of Scheduling Approachesโ€‹

ApproachMulti-Node SafePersistenceFailoverComplexityBest For
@ScheduledโŒ NoNoNoTrivialSingle-node dev/test
ShedLockโœ… YesLock onlyAutomatic (lock TTL)LowMost production use cases
Quartz Clusteredโœ… YesFull historyAutomaticMediumComplex scheduling, audit trail
Temporal.ioโœ… YesFull workflowAutomaticHighLong-running durable workflows
AWS EventBridgeโœ… YesManagedManagedLowAWS-native serverless jobs

Production Observabilityโ€‹

Job Metricsโ€‹

@Component
@RequiredArgsConstructor
public class JobMetrics {

private final MeterRegistry registry;

public void recordSubmitted(String jobType) {
registry.counter("jobs.submitted", "type", jobType).increment();
}

public void recordCompleted(String jobType, Duration duration) {
registry.timer("jobs.duration", "type", jobType, "result", "success")
.record(duration);
registry.counter("jobs.completed", "type", jobType).increment();
}

public void recordFailed(String jobType, int retryCount) {
registry.counter("jobs.failed", "type", jobType,
"retry_count", String.valueOf(retryCount)).increment();
}

public void recordDead(String jobType) {
registry.counter("jobs.dead", "type", jobType).increment();
}

public void recordQueueDepth(String jobType, long depth) {
registry.gauge("jobs.queue_depth", Tags.of("type", jobType), depth);
}

public void recordWorkerUtilization(int active, int total) {
registry.gauge("jobs.worker.active", active);
registry.gauge("jobs.worker.total", total);
registry.gauge("jobs.worker.utilization_pct",
total > 0 ? (double) active / total * 100 : 0);
}
}

Key alerts:

MetricAlert ThresholdMeaning
jobs.queue_depth> 10,000Workers can't keep up โ€” scale out or investigate
jobs.dead rateAny occurrencePermanent failures โ€” check DLQ and error logs
jobs.duration p99> expected timeoutDownstream slowness or deadlock in job
jobs.worker.utilization_pct> 90% consistentlyAuto-scale workers
Stale RUNNING jobs> 2ร— expected job durationWorker crashed without updating status

Finding Stale/Stuck Jobsโ€‹

@Scheduled(fixedDelay = 60_000) // every minute
public void detectStuckJobs() {
// Jobs that have been RUNNING for more than 2ร— their expected duration
Instant staleThreshold = Instant.now().minus(Duration.ofMinutes(30));
List<Job> stuckJobs = jobRepository.findStuckJobs(staleThreshold);

for (Job job : stuckJobs) {
log.warn("Detected stuck job {} (status=RUNNING since {})", job.getId(), job.getStartedAt());
alerting.alert(Alert.warn(
"Stuck job detected",
Map.of("jobId", job.getId(), "type", job.getType(), "startedAt", job.getStartedAt())
));

// Optionally: mark as FAILED and re-queue for retry
if (job.canRetry()) {
job.transition(JobStatus.FAILED);
jobRepository.save(job);
jobQueue.requeue(job);
}
}
}

Senior Interview Questionsโ€‹

Q: A worker processes a report job and writes the result to S3, but crashes before updating the job status in the database to COMPLETED. What happens when the message becomes visible again in SQS/Kafka?โ€‹

A: Another worker picks up the message and processes the job again โ€” writing the report to S3 a second time (overwriting or creating a duplicate key). To prevent this:

  1. Idempotency check โ€” before processing, check if job.status == RUNNING and job.startedAt is recent (within the visibility timeout window). If so, skip.
  2. Result key as idempotency key โ€” use the job ID as the S3 key. The second S3 write is identical and safe (overwrite).
  3. Atomic status update โ€” use a CAS (compare-and-set) DB update: UPDATE jobs SET status='RUNNING' WHERE id=? AND status='QUEUED'. Only one worker claims the job.

Q: How do you design the polling endpoint to be cache-friendly for completed jobs?โ€‹

A: For RUNNING/PENDING jobs: Cache-Control: no-store, must-revalidate โ€” status changes frequently. For COMPLETED/FAILED jobs (terminal states): Cache-Control: public, max-age=3600 โ€” terminal states never change, safe to cache at CDN/browser for 1 hour. This dramatically reduces DB load for popular completed jobs (e.g., a report viewed by many users).

Q: How would you prevent a misfired Quartz job (e.g., server was down at 2am) from running 24 historical executions on restart?โ€‹

A: Use withMisfireHandlingInstructionDoNothing() on the trigger. This discards misfired executions and waits for the next scheduled time. If the job must run (e.g., daily revenue report), use withMisfireHandlingInstructionFireAndProceed() โ€” it fires once immediately on recovery and then resumes the normal schedule.

Q: Design a system that processes 100,000 video transcoding jobs per day with progress reporting and failure retry.โ€‹

A: Architecture:

  • Submit: POST /videos โ†’ 202 + job_id. Job written to DB, message to SQS.
  • Workers: Auto-scaling ECS/K8s workers (scale based on SQS queue depth), each running FFmpeg.
  • Progress: Worker writes {percent, step} to Redis every 10s. Client polls via SSE or WebSocket.
  • Retry: SQS visibility timeout (1hr). Failed jobs automatically reappear after 1hr. DLQ after 3 attempts.
  • Result: Transcoded files stored in S3 with presigned URLs (24hr TTL) returned to client.
  • Scheduling: Workers lock the job via UPDATE SET status='RUNNING' WHERE status='QUEUED' CAS before processing.
  • Observability: Metrics on queue depth, worker utilization, p95 transcoding duration, DLQ depth. Alert if queue depth > 10,000 (scale event).

Q: What is the difference between lockAtMostFor and lockAtLeastFor in ShedLock?โ€‹

A: lockAtMostFor is the maximum time the lock is held โ€” if the lock holder crashes, the lock automatically expires after this duration so another node can run the job. It prevents "lock orphaning". lockAtLeastFor is the minimum time the lock is held โ€” even if the job completes in 1 second, the lock is kept for this duration to prevent the same job from immediately running again on another node before the lock release propagates.


See Alsoโ€‹