Managing Long-Running Tasks
- New learners โ start at Why Long-Running Tasks Need Special Handling and The Core Async Job Pattern to understand the fundamental approach.
- Senior engineers โ jump to Task State Machine, Worker Reliability & Exactly-Once Processing, Distributed Scheduling, or Production Observability.
Why Long-Running Tasks Need Special Handlingโ
HTTP is designed for short request-response cycles. A typical Nginx timeout is 60 seconds, an AWS ALB times out at 60 seconds, and a client's browser may abort after 30โ60 seconds.
Operations that take more than 2โ5 seconds โ video transcoding, PDF generation, bulk data exports, ML inference, email campaigns โ cannot safely run inside an HTTP request handler because:
- Thread exhaustion โ each blocked HTTP thread cannot serve other requests, causing your server to saturate under concurrent long operations.
- Client-side timeouts โ the client will give up and retry, potentially triggering duplicate processing.
- Load balancer timeouts โ the connection is terminated by infrastructure even if the server hasn't finished.
- No progress visibility โ the client sees nothing until the operation completes or fails.
- No retry safety โ if the server crashes mid-operation, there's no record of what state things are in.
Any operation expected to take more than 2 seconds should be made asynchronous.
The Core Async Job Patternโ
The solution is a three-step pattern:
Step 1: Client submits job
Client โ POST /api/reports โ 202 Accepted { "job_id": "abc-123", "status_url": "/api/reports/abc-123" }
Step 2: Job runs asynchronously
API Server โ Job Queue โ Worker Pool โ Result Store
Step 3: Client polls for result
Client โ GET /api/reports/abc-123 โ { "status": "RUNNING", "progress": 45 }
Client โ GET /api/reports/abc-123 โ { "status": "COMPLETED", "result_url": "/api/reports/abc-123/result" }
Client โ GET /api/reports/abc-123/result โ <report data>
HTTP Status Codesโ
| Status | When to Use |
|---|---|
202 Accepted | Job submitted successfully, processing not yet complete |
200 OK | Job status or completed result returned in body |
303 See Other | Redirect to the result resource (on completion) |
404 Not Found | Job ID does not exist |
410 Gone | Job result has expired and been cleaned up |
REST API Design for Async Jobsโ
@RestController
@RequestMapping("/api/reports")
@RequiredArgsConstructor
@Slf4j
public class ReportController {
private final JobService jobService;
// Step 1: Submit job โ returns immediately with 202
@PostMapping
public ResponseEntity<JobResponse> submitReport(@RequestBody @Valid ReportRequest req,
Authentication auth) {
String jobId = jobService.submit(req, auth.getName());
String statusUrl = "/api/reports/" + jobId;
log.info("Report job {} submitted for user {}", jobId, auth.getName());
return ResponseEntity
.accepted()
.header("Location", statusUrl) // RFC-compliant: Location points to status
.header("Retry-After", "5") // Hint to client: poll after 5s
.body(new JobResponse(jobId, JobStatus.PENDING, statusUrl));
}
// Step 2: Poll status
@GetMapping("/{jobId}")
public ResponseEntity<JobStatusResponse> getStatus(@PathVariable String jobId,
Authentication auth) {
Job job = jobService.findByIdAndUser(jobId, auth.getName())
.orElseThrow(() -> new JobNotFoundException(jobId));
return switch (job.getStatus()) {
case PENDING, QUEUED -> ResponseEntity.ok()
.header("Retry-After", "3") // poll again in 3s
.body(JobStatusResponse.pending(job));
case RUNNING -> ResponseEntity.ok()
.header("Retry-After", "1") // job is active โ poll more frequently
.body(JobStatusResponse.running(job));
case COMPLETED -> ResponseEntity.status(HttpStatus.SEE_OTHER)
.header("Location", "/api/reports/" + jobId + "/result")
.body(JobStatusResponse.completed(job));
case FAILED -> ResponseEntity.ok()
.body(JobStatusResponse.failed(job));
case DEAD -> ResponseEntity.ok()
.body(JobStatusResponse.dead(job, "Job exceeded retry limit"));
};
}
// Step 3: Fetch result
@GetMapping("/{jobId}/result")
public ResponseEntity<ReportResult> getResult(@PathVariable String jobId,
Authentication auth) {
Job job = jobService.findByIdAndUser(jobId, auth.getName())
.orElseThrow(() -> new JobNotFoundException(jobId));
if (job.getStatus() != JobStatus.COMPLETED) {
return ResponseEntity.status(HttpStatus.CONFLICT)
.build(); // 409 โ job not yet complete
}
ReportResult result = jobService.getResult(jobId);
return ResponseEntity.ok()
.header("Cache-Control", "no-store") // results may be sensitive
.body(result);
}
}
Job Queue Architectureโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ JOB QUEUE ARCHITECTURE โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Client Request
โ
โผ
API Server โโโโโโโบ Job Metadata DB โโโโโโโโโโโโโ Admin Dashboard
(202 + job_id) (PostgreSQL) (progress, status)
โ
โผ
Message Queue โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
(Kafka / SQS / RabbitMQ / Redis Streams) โ
โ โ
โผ โ
Worker Pool โโโโโบ Progress Store (Redis) โ
(auto-scalable) โ โ
โ โโโโโโโโบ SSE / WebSocket โโโโโโโโโโโโค
โผ (real-time client) โ
Result Store โ
(DB / S3 / GCS) โ
โ โ
โผ โ
Notification โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
(Webhook / Email / Push)
Task State Machineโ
A robust job system models the job lifecycle as a state machine to prevent invalid state transitions and enable clear recovery logic.
PENDING โ QUEUED โ RUNNING โ COMPLETED โ
โ
โโโโบ FAILED โโโบ (retry counter < max) โโโบ QUEUED
โโโโบ (retry counter >= max) โโโบ DEAD โ ๏ธ
public enum JobStatus {
PENDING, // Created but not yet in queue
QUEUED, // In the message queue, awaiting a worker
RUNNING, // Worker is actively processing
COMPLETED, // Finished successfully
FAILED, // Failed this attempt, eligible for retry
DEAD; // Exceeded max retries โ manual intervention needed
public boolean isTerminal() {
return this == COMPLETED || this == DEAD;
}
public boolean isRetryable() {
return this == FAILED;
}
}
@Entity
@Table(name = "jobs")
@Data
@Builder
public class Job {
@Id private String id;
@Enumerated(EnumType.STRING)
private JobStatus status;
private String type; // e.g., "REPORT_GENERATION"
private String userId;
private String payload; // serialized job parameters (JSONB)
private String resultKey; // S3 key or DB reference for result
private int progress; // 0โ100
private String progressMessage;
private String errorMessage;
private int retryCount;
private int maxRetries; // configurable per job type
private Instant createdAt;
private Instant startedAt;
private Instant completedAt;
private Instant expiresAt; // result TTL โ clean up after X days
public void transition(JobStatus newStatus) {
// Prevent illegal transitions
if (this.status.isTerminal()) {
throw new IllegalStateException(
"Cannot transition from terminal status " + this.status);
}
this.status = newStatus;
}
public boolean canRetry() {
return this.retryCount < this.maxRetries;
}
}
Worker Implementation (Spring Boot + Kafka)โ
Basic Workerโ
@Component
@RequiredArgsConstructor
@Slf4j
public class ReportWorker {
private final JobRepository jobRepository;
private final ReportGenerator reportGenerator;
private final S3Service s3Service;
private final ProgressTracker progressTracker;
private final NotificationService notificationService;
@KafkaListener(topics = "report-jobs", groupId = "report-workers", concurrency = "5")
public void processJob(ReportJobMessage message, Acknowledgment ack) {
String jobId = message.getJobId();
log.info("Worker picked up job {}", jobId);
Job job = jobRepository.findById(jobId).orElse(null);
if (job == null) {
log.warn("Job {} not found โ may have been deleted. Skipping.", jobId);
ack.acknowledge(); // don't retry โ message is stale
return;
}
// Check for idempotency: don't re-process a completed job
if (job.getStatus() == JobStatus.COMPLETED) {
log.warn("Job {} already completed โ duplicate message, skipping.", jobId);
ack.acknowledge();
return;
}
// Mark as RUNNING
job.transition(JobStatus.RUNNING);
job.setStartedAt(Instant.now());
jobRepository.save(job);
try {
// Execute the job with progress updates
ReportResult result = reportGenerator.generate(
message.getReportParams(),
(percent, msg) -> progressTracker.update(jobId, percent, msg)
);
// Store result
String resultKey = s3Service.store(jobId, result);
// Mark COMPLETED
job.transition(JobStatus.COMPLETED);
job.setResultKey(resultKey);
job.setCompletedAt(Instant.now());
job.setProgress(100);
jobRepository.save(job);
// Notify user
notificationService.notifyComplete(job.getUserId(), jobId);
log.info("Job {} completed successfully in {}ms", jobId,
Duration.between(job.getStartedAt(), job.getCompletedAt()).toMillis());
ack.acknowledge(); // commit Kafka offset after successful processing
} catch (Exception e) {
handleFailure(job, e);
ack.acknowledge(); // always ack โ retry is handled via re-queueing
}
}
private void handleFailure(Job job, Exception e) {
log.error("Job {} failed (attempt {}/{}): {}",
job.getId(), job.getRetryCount() + 1, job.getMaxRetries(), e.getMessage(), e);
job.setRetryCount(job.getRetryCount() + 1);
job.setErrorMessage(e.getMessage());
if (job.canRetry()) {
job.transition(JobStatus.FAILED);
jobRepository.save(job);
// Re-queue for retry (with delay via separate scheduler or DLQ re-drive)
} else {
job.transition(JobStatus.DEAD);
jobRepository.save(job);
notificationService.notifyFailed(job.getUserId(), job.getId(), e.getMessage());
}
}
}
Worker Reliability & Exactly-Once Processingโ
The "Double-Processing" Problemโ
When a worker processes a job and then crashes before acknowledging the Kafka message, the message becomes visible again (after the visibility timeout) and another worker picks it up. The job runs twice.
To handle this safely:
@KafkaListener(topics = "report-jobs")
@Transactional // DB operations are atomic
public void processJob(ReportJobMessage message, Acknowledgment ack) {
// Use optimistic locking or CAS to claim the job atomically
int updated = jobRepository.claimJob(message.getJobId(), JobStatus.QUEUED, JobStatus.RUNNING);
if (updated == 0) {
// Another worker already claimed this job โ skip
log.info("Job {} already claimed by another worker", message.getJobId());
ack.acknowledge();
return;
}
// ... process the job
}
public interface JobRepository extends JpaRepository<Job, String> {
// Atomic compare-and-set: only transitions to RUNNING if currently QUEUED
@Modifying
@Query("UPDATE Job j SET j.status = :newStatus, j.startedAt = :now " +
"WHERE j.id = :jobId AND j.status = :currentStatus")
int claimJob(@Param("jobId") String jobId,
@Param("currentStatus") JobStatus currentStatus,
@Param("newStatus") JobStatus newStatus,
@Param("now") Instant now);
}
Checkpoint Pattern for Resumable Jobsโ
For very long jobs (multi-hour data exports), use checkpoints to avoid restarting from zero on failure:
@Service
public class ResumableExportJob {
public void export(String jobId, ExportParams params) {
Job job = jobRepository.findById(jobId).orElseThrow();
// Load last checkpoint (if resuming after crash)
int startPage = job.getCheckpoint() != null
? Integer.parseInt(job.getCheckpoint())
: 0;
int totalPages = dataRepository.countPages(params);
for (int page = startPage; page < totalPages; page++) {
List<Record> batch = dataRepository.fetchPage(page, params);
exportService.writeBatch(jobId, batch);
// Save checkpoint after each page
jobRepository.saveCheckpoint(jobId, String.valueOf(page + 1));
jobRepository.updateProgress(jobId, (page + 1) * 100 / totalPages);
// Allow graceful shutdown check
if (Thread.currentThread().isInterrupted()) {
throw new JobInterruptedException("Export interrupted at page " + page);
}
}
}
}
Progress Trackingโ
Store Progress in Redisโ
@Service
@RequiredArgsConstructor
public class ProgressTracker {
private final RedisTemplate<String, String> redis;
public void update(String jobId, int percent, String message) {
String key = "job:progress:" + jobId;
Map<String, String> progress = Map.of(
"percent", String.valueOf(percent),
"message", message,
"updatedAt", Instant.now().toString()
);
redis.opsForHash().putAll(key, progress);
redis.expire(key, Duration.ofHours(24)); // TTL matches job result retention
}
public JobProgress get(String jobId) {
Map<Object, Object> data = redis.opsForHash().entries("job:progress:" + jobId);
if (data.isEmpty()) return JobProgress.unknown();
return JobProgress.fromMap(data);
}
}
Real-Time Progress via Server-Sent Events (SSE)โ
SSE is a lightweight protocol โ a persistent HTTP connection where the server pushes events. Ideal for progress bars.
@GetMapping(value = "/api/jobs/{jobId}/progress", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter streamProgress(@PathVariable String jobId, Authentication auth) {
// 5-minute max duration โ client must reconnect for very long jobs
SseEmitter emitter = new SseEmitter(300_000L);
ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(() -> {
try {
JobProgress progress = progressTracker.get(jobId);
emitter.send(SseEmitter.event()
.name("progress")
.data(progress)
.id(String.valueOf(System.currentTimeMillis())));
if (progress.isTerminal()) {
emitter.complete();
}
} catch (IOException e) {
emitter.completeWithError(e); // client disconnected
}
}, 0, 1, TimeUnit.SECONDS);
// Clean up the scheduled task when SSE connection closes
emitter.onCompletion(() -> task.cancel(true));
emitter.onTimeout(() -> task.cancel(true));
emitter.onError(ex -> task.cancel(true));
return emitter;
}
Real-Time Progress via WebSocketโ
For bidirectional communication (user can pause/cancel the job):
@Controller
@RequiredArgsConstructor
public class JobWebSocketController {
private final SimpMessagingTemplate messagingTemplate;
// Worker calls this to push updates to subscribed clients
public void pushProgress(String jobId, JobProgress progress) {
messagingTemplate.convertAndSend(
"/topic/jobs/" + jobId, // client subscribes to this
progress
);
}
// Client can send a cancel command
@MessageMapping("/jobs/{jobId}/cancel")
public void cancelJob(@DestinationVariable String jobId, Principal user) {
jobService.cancel(jobId, user.getName());
}
}
Webhooks (Push Callbacks)โ
Instead of the client polling, the server pushes a notification to a registered callback URL when the job completes.
Reliable Webhook Deliveryโ
@Service
@RequiredArgsConstructor
@Slf4j
public class WebhookDeliveryService {
private final WebhookRepository webhookRepository;
private final RestTemplate restTemplate;
@Async("webhookExecutor")
public void deliver(String webhookId, String callbackUrl, WebhookPayload payload) {
int maxRetries = 5;
long[] backoffMs = {1_000, 5_000, 30_000, 300_000, 1_800_000}; // 1s, 5s, 30s, 5m, 30m
for (int attempt = 0; attempt < maxRetries; attempt++) {
try {
String signature = signPayload(payload); // HMAC-SHA256
ResponseEntity<Void> response = restTemplate.exchange(
RequestEntity.post(URI.create(callbackUrl))
.header("X-Webhook-Id", webhookId)
.header("X-Webhook-Signature", signature)
.header("X-Webhook-Timestamp", String.valueOf(Instant.now().getEpochSecond()))
.contentType(MediaType.APPLICATION_JSON)
.body(payload),
Void.class
);
if (response.getStatusCode().is2xxSuccessful()) {
webhookRepository.markDelivered(webhookId, attempt + 1);
log.info("Webhook {} delivered on attempt {}", webhookId, attempt + 1);
return;
}
log.warn("Webhook {} got non-2xx response: {} (attempt {})",
webhookId, response.getStatusCode(), attempt + 1);
} catch (Exception e) {
log.warn("Webhook {} delivery attempt {} failed: {}", webhookId, attempt + 1, e.getMessage());
}
if (attempt < maxRetries - 1) {
sleep(backoffMs[attempt]);
}
}
webhookRepository.markFailed(webhookId, "Exhausted " + maxRetries + " delivery attempts");
log.error("Webhook {} permanently failed after {} attempts", webhookId, maxRetries);
}
private String signPayload(WebhookPayload payload) {
// HMAC-SHA256 of payload JSON with the tenant's webhook secret
byte[] secret = webhookSecretService.getSecret(payload.getTenantId());
return HmacUtils.hmacSha256Hex(secret, payload.toJson());
}
}
Webhook Securityโ
// Consumer side: verify the webhook signature before processing
@PostMapping("/webhook")
public ResponseEntity<Void> handleWebhook(
@RequestBody String rawBody,
@RequestHeader("X-Webhook-Signature") String signature,
@RequestHeader("X-Webhook-Timestamp") long timestamp) {
// 1. Reject stale webhooks (replay attack prevention)
if (Math.abs(Instant.now().getEpochSecond() - timestamp) > 300) {
return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
}
// 2. Verify HMAC signature
String expectedSig = HmacUtils.hmacSha256Hex(webhookSecret, rawBody);
if (!MessageDigest.isEqual(expectedSig.getBytes(), signature.getBytes())) {
return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
}
// 3. Process idempotently using X-Webhook-Id header
// ...
return ResponseEntity.ok().build();
}
Job Schedulingโ
Single-Node Scheduler (Spring @Scheduled)โ
@Scheduled(cron = "0 0 2 * * ?") // 2am daily โ UTC
public void generateDailyReport() {
jobService.submit(new DailyReportJobParams());
}
@Scheduled runs on every instance in a multi-node deployment. If you have 3 replicas, the job runs 3 times simultaneously.
Distributed Scheduling (ShedLock)โ
ShedLock uses a database (or Redis) lock to ensure only one node executes a scheduled job at a time:
@Scheduled(fixedDelay = 60_000)
@SchedulerLock(
name = "generateDailyReport",
lockAtMostFor = "PT10M", // release lock after 10m even if node crashes
lockAtLeastFor = "PT5M" // hold lock for at least 5m to prevent quick re-execution
)
public void generateDailyReport() {
// Only one node executes this at a time across the entire cluster
jobService.submit(new DailyReportJobParams());
}
-- ShedLock requires this table
CREATE TABLE shedlock (
name VARCHAR(64) NOT NULL,
lock_until TIMESTAMP(3) NOT NULL,
locked_at TIMESTAMP(3) NOT NULL,
locked_by VARCHAR(255) NOT NULL,
PRIMARY KEY (name)
);
Enterprise Scheduler (Quartz Clustered)โ
Quartz persists job schedules and execution history in a database, enabling full clustering with failover:
@Configuration
public class QuartzConfig {
@Bean
public SchedulerFactoryBean schedulerFactory(DataSource dataSource) {
SchedulerFactoryBean factory = new SchedulerFactoryBean();
factory.setDataSource(dataSource);
Properties props = new Properties();
props.setProperty("org.quartz.scheduler.instanceId", "AUTO"); // unique per node
props.setProperty("org.quartz.jobStore.class",
"org.quartz.impl.jdbcjobstore.JobStoreTX");
props.setProperty("org.quartz.jobStore.isClustered", "true");
props.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
factory.setQuartzProperties(props);
return factory;
}
@Bean
public JobDetail reportJobDetail() {
return JobBuilder.newJob(DailyReportJob.class)
.withIdentity("dailyReport", "reporting")
.storeDurably()
.build();
}
@Bean
public Trigger reportTrigger(JobDetail reportJobDetail) {
return TriggerBuilder.newTrigger()
.forJob(reportJobDetail)
.withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?")
.withMisfireHandlingInstructionDoNothing()) // skip misfired runs
.build();
}
}
Distributed Schedulingโ
Comparison of Scheduling Approachesโ
| Approach | Multi-Node Safe | Persistence | Failover | Complexity | Best For |
|---|---|---|---|---|---|
@Scheduled | โ No | No | No | Trivial | Single-node dev/test |
| ShedLock | โ Yes | Lock only | Automatic (lock TTL) | Low | Most production use cases |
| Quartz Clustered | โ Yes | Full history | Automatic | Medium | Complex scheduling, audit trail |
| Temporal.io | โ Yes | Full workflow | Automatic | High | Long-running durable workflows |
| AWS EventBridge | โ Yes | Managed | Managed | Low | AWS-native serverless jobs |
Production Observabilityโ
Job Metricsโ
@Component
@RequiredArgsConstructor
public class JobMetrics {
private final MeterRegistry registry;
public void recordSubmitted(String jobType) {
registry.counter("jobs.submitted", "type", jobType).increment();
}
public void recordCompleted(String jobType, Duration duration) {
registry.timer("jobs.duration", "type", jobType, "result", "success")
.record(duration);
registry.counter("jobs.completed", "type", jobType).increment();
}
public void recordFailed(String jobType, int retryCount) {
registry.counter("jobs.failed", "type", jobType,
"retry_count", String.valueOf(retryCount)).increment();
}
public void recordDead(String jobType) {
registry.counter("jobs.dead", "type", jobType).increment();
}
public void recordQueueDepth(String jobType, long depth) {
registry.gauge("jobs.queue_depth", Tags.of("type", jobType), depth);
}
public void recordWorkerUtilization(int active, int total) {
registry.gauge("jobs.worker.active", active);
registry.gauge("jobs.worker.total", total);
registry.gauge("jobs.worker.utilization_pct",
total > 0 ? (double) active / total * 100 : 0);
}
}
Key alerts:
| Metric | Alert Threshold | Meaning |
|---|---|---|
jobs.queue_depth | > 10,000 | Workers can't keep up โ scale out or investigate |
jobs.dead rate | Any occurrence | Permanent failures โ check DLQ and error logs |
jobs.duration p99 | > expected timeout | Downstream slowness or deadlock in job |
jobs.worker.utilization_pct | > 90% consistently | Auto-scale workers |
Stale RUNNING jobs | > 2ร expected job duration | Worker crashed without updating status |
Finding Stale/Stuck Jobsโ
@Scheduled(fixedDelay = 60_000) // every minute
public void detectStuckJobs() {
// Jobs that have been RUNNING for more than 2ร their expected duration
Instant staleThreshold = Instant.now().minus(Duration.ofMinutes(30));
List<Job> stuckJobs = jobRepository.findStuckJobs(staleThreshold);
for (Job job : stuckJobs) {
log.warn("Detected stuck job {} (status=RUNNING since {})", job.getId(), job.getStartedAt());
alerting.alert(Alert.warn(
"Stuck job detected",
Map.of("jobId", job.getId(), "type", job.getType(), "startedAt", job.getStartedAt())
));
// Optionally: mark as FAILED and re-queue for retry
if (job.canRetry()) {
job.transition(JobStatus.FAILED);
jobRepository.save(job);
jobQueue.requeue(job);
}
}
}
Senior Interview Questionsโ
Q: A worker processes a report job and writes the result to S3, but crashes before updating the job status in the database to COMPLETED. What happens when the message becomes visible again in SQS/Kafka?โ
A: Another worker picks up the message and processes the job again โ writing the report to S3 a second time (overwriting or creating a duplicate key). To prevent this:
- Idempotency check โ before processing, check if
job.status == RUNNINGandjob.startedAtis recent (within the visibility timeout window). If so, skip. - Result key as idempotency key โ use the job ID as the S3 key. The second S3 write is identical and safe (overwrite).
- Atomic status update โ use a CAS (compare-and-set) DB update:
UPDATE jobs SET status='RUNNING' WHERE id=? AND status='QUEUED'. Only one worker claims the job.
Q: How do you design the polling endpoint to be cache-friendly for completed jobs?โ
A: For RUNNING/PENDING jobs: Cache-Control: no-store, must-revalidate โ status changes frequently.
For COMPLETED/FAILED jobs (terminal states): Cache-Control: public, max-age=3600 โ terminal states never change, safe to cache at CDN/browser for 1 hour. This dramatically reduces DB load for popular completed jobs (e.g., a report viewed by many users).
Q: How would you prevent a misfired Quartz job (e.g., server was down at 2am) from running 24 historical executions on restart?โ
A: Use withMisfireHandlingInstructionDoNothing() on the trigger. This discards misfired executions and waits for the next scheduled time. If the job must run (e.g., daily revenue report), use withMisfireHandlingInstructionFireAndProceed() โ it fires once immediately on recovery and then resumes the normal schedule.
Q: Design a system that processes 100,000 video transcoding jobs per day with progress reporting and failure retry.โ
A: Architecture:
- Submit:
POST /videos โ 202 + job_id. Job written to DB, message to SQS. - Workers: Auto-scaling ECS/K8s workers (scale based on SQS queue depth), each running FFmpeg.
- Progress: Worker writes
{percent, step}to Redis every 10s. Client polls via SSE or WebSocket. - Retry: SQS visibility timeout (1hr). Failed jobs automatically reappear after 1hr. DLQ after 3 attempts.
- Result: Transcoded files stored in S3 with presigned URLs (24hr TTL) returned to client.
- Scheduling: Workers lock the job via
UPDATE SET status='RUNNING' WHERE status='QUEUED'CAS before processing. - Observability: Metrics on queue depth, worker utilization, p95 transcoding duration, DLQ depth. Alert if queue depth > 10,000 (scale event).
Q: What is the difference between lockAtMostFor and lockAtLeastFor in ShedLock?โ
A: lockAtMostFor is the maximum time the lock is held โ if the lock holder crashes, the lock automatically expires after this duration so another node can run the job. It prevents "lock orphaning". lockAtLeastFor is the minimum time the lock is held โ even if the job completes in 1 second, the lock is kept for this duration to prevent the same job from immediately running again on another node before the lock release propagates.
See Alsoโ
- Dead Letter Queue (DLQ) โ Handling permanently failed job messages
- Message Queues โ Kafka, SQS, RabbitMQ comparisons for job queues
- Transactional Outbox Pattern โ Reliable job submission as an event
- Saga Pattern โ When a "job" is a multi-step distributed workflow