Stalled Jobs

A stalled job is a job that a worker started processing but whose lock expired before the worker finished it or renewed the lock. BullMQ automatically detects stalled jobs and recovers them by moving them back to the waiting state for reprocessing.

What Causes Stalled Jobs?

Jobs become stalled when:

CPU-intensive operations block the event loop: The worker can’t renew the job’s lock
Worker crashes: The process terminates unexpectedly
Network issues: Worker loses connection to Redis
Long-running jobs: Processing outlasts lockDuration and the lock is not renewed in time (for example, because renewal is blocked or disabled)
High system load: Worker can’t renew the lock in time

How Stalled Detection Works

BullMQ uses a lock-based mechanism:

Worker acquires lock

When a worker starts processing a job, it acquires a lock with a TTL (time-to-live).

Lock renewal

The worker automatically renews the lock periodically while processing.

Stalled detection

If the lock expires (not renewed in time), the job is marked as stalled.

Recovery

The stalled checker moves the job back to waiting for another worker to process.

import { Worker } from 'bullmq';

const worker = new Worker('queueName', processorFunction, {
  // Lock duration: How long before a job is considered stalled
  lockDuration: 30000, // 30 seconds (default)
  
  // Stalled check interval: How often to check for stalled jobs
  stalledInterval: 30000, // 30 seconds (default)
  
  // Lock renewal interval (defaults to lockDuration / 2; usually best left unset)
  lockRenewTime: 15000, // 15 seconds
});
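To make the timing concrete, the following is a simplified model of the lock lifecycle, not BullMQ's actual implementation. It assumes renewals fire exactly every lockRenewTime and that a blocked event loop delays the next renewal until the block ends. The name `lockExpiresDuringBlock` and its parameters are illustrative assumptions.

```typescript
interface LockTiming {
  lockDuration: number; // lock TTL in ms
  lockRenewTime: number; // renewal interval in ms
}

// Returns true if blocking the event loop from `blockStart` for `blockLength` ms
// would let the lock expire before the next renewal can run.
// Time 0 is the moment the worker acquired the lock.
function lockExpiresDuringBlock(
  timing: LockTiming,
  blockStart: number,
  blockLength: number,
): boolean {
  // Last renewal (or the initial acquisition) that ran before the block began.
  const lastRenewal =
    Math.floor(blockStart / timing.lockRenewTime) * timing.lockRenewTime;
  const nextScheduled = lastRenewal + timing.lockRenewTime;
  const blockEnd = blockStart + blockLength;
  // A renewal due during the block can only run once the block ends.
  const nextRenewal = Math.max(nextScheduled, blockEnd);
  return nextRenewal - lastRenewal > timing.lockDuration;
}

const defaults: LockTiming = { lockDuration: 30000, lockRenewTime: 15000 };
console.log(lockExpiresDuringBlock(defaults, 10000, 5000)); // false
console.log(lockExpiresDuringBlock(defaults, 10000, 25000)); // true
```

Under this model, a short pause is harmless, while a single synchronous stretch that pushes the gap between renewals past lockDuration is enough to stall the job.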

Stalled Job Configuration

lockDuration

How long a job’s lock stays valid, in milliseconds. If the lock is not renewed within this time, the job is considered stalled:

const worker = new Worker('queueName', processorFunction, {
  lockDuration: 60000, // 60 seconds
});

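Because the worker renews the lock automatically, a healthy job can run longer than lockDuration without stalling. The margin matters when renewal may be delayed or is disabled, so a conservative choice is to set lockDuration longer than your longest expected job duration. If jobs regularly take 45 seconds, use at least 60 seconds. Keep in mind that a longer lockDuration also means a crashed worker’s jobs take longer to be detected.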

stalledInterval

How often the worker checks for stalled jobs:

const worker = new Worker('queueName', processorFunction, {
  stalledInterval: 10000, // Check every 10 seconds
});

Lower stalledInterval means faster recovery but slightly higher Redis load.

maxStalledCount

Maximum number of times a job can be recovered from the stalled state before it is moved to failed (default: 1):

const worker = new Worker('queueName', processorFunction, {
  maxStalledCount: 2, // Allow 2 stalled recoveries
});

After exceeding maxStalledCount, the job moves to the failed state.
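The rule above can be sketched as a small decision function. This is an illustrative model rather than BullMQ's code: it assumes the stall counter is incremented first and then compared against maxStalledCount. The name `nextStateAfterStall` is hypothetical.

```typescript
type StateAfterStall = 'waiting' | 'failed';

// Decides where a job goes when a stall is detected.
// `previousStalls` is how many times the job had already stalled before this one.
function nextStateAfterStall(
  previousStalls: number,
  maxStalledCount: number,
): StateAfterStall {
  const stalledCount = previousStalls + 1; // count the stall just detected
  return stalledCount > maxStalledCount ? 'failed' : 'waiting';
}

// With maxStalledCount: 2, the first two stalls are recovered and the third fails.
console.log(nextStateAfterStall(0, 2)); // "waiting"
console.log(nextStateAfterStall(1, 2)); // "waiting"
console.log(nextStateAfterStall(2, 2)); // "failed"
```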

skipStalledCheck

Disable stalled checking for a specific worker:

const worker = new Worker('queueName', processorFunction, {
  skipStalledCheck: true, // This worker won't check for stalled jobs
});

Even if one worker skips stalled checks, other workers on the same queue can still detect and recover stalled jobs.

skipLockRenewal

Disable automatic lock renewal:

const worker = new Worker('queueName', processorFunction, {
  skipLockRenewal: true, // Don't renew locks automatically
});

Only use skipLockRenewal if you have a custom lock management strategy. Without renewal, jobs will become stalled after lockDuration.
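One possible shape for a custom strategy is a wrapper that extends the lock on a timer while the work runs. The sketch below is an assumption-laden illustration: `LockableJob` is a local interface meant to mirror the `extendLock(token, duration)` method on BullMQ's Job, and `withManualLockRenewal` is a hypothetical helper, not part of the library.

```typescript
// Minimal local interface, assumed to match Job.extendLock in BullMQ.
interface LockableJob {
  extendLock(token: string, duration: number): Promise<number>;
}

// Runs `work` while extending the job's lock every lockDuration / 2 ms.
async function withManualLockRenewal<T>(
  job: LockableJob,
  token: string,
  lockDuration: number,
  work: () => Promise<T>,
): Promise<T> {
  const timer = setInterval(() => {
    job.extendLock(token, lockDuration).catch((err) => {
      // A failed renewal means the job may be picked up as stalled.
      console.error('Manual lock renewal failed:', err);
    });
  }, lockDuration / 2);

  try {
    return await work();
  } finally {
    clearInterval(timer); // always stop renewing once the work settles
  }
}
```

Inside a processor this would be called with the job and the lock token the processor receives as its second argument, for example `withManualLockRenewal(job, token, 30000, () => doWork(job))`, where `doWork` is your own function.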

Preventing Stalled Jobs

1. Keep Processors Async

Avoid blocking the event loop:

// ❌ Bad: Blocks the event loop
const badWorker = new Worker('queue', async (job) => {
  // CPU-intensive synchronous operation
  let result = 0;
  for (let i = 0; i < 10000000000; i++) {
    result += Math.sqrt(i);
  }
  return result;
});

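// ✅ Good: Async I/O operations
const goodWorker = new Worker('queue', async (job) => {
  const response = await fetch(job.data.url);
  const data = await response.json(); // Parse the body before saving it
  await database.save(data); // `database` stands in for your own storage client
  return data;
});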
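When moderate CPU work has to stay in the main process, one common technique is to split it into chunks and yield to the event loop between them, so timers such as lock renewal get a chance to run. The sketch below is a minimal illustration; `sumSquareRoots`, `yieldToEventLoop`, and the chunk size are assumptions chosen for the example.

```typescript
// Resolves on a later event-loop turn, letting pending timers and I/O run.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Sums sqrt(i) for 0 <= i < limit, pausing after every `chunkSize` iterations.
async function sumSquareRoots(
  limit: number,
  chunkSize = 1_000_000,
): Promise<number> {
  let result = 0;
  for (let start = 0; start < limit; start += chunkSize) {
    const end = Math.min(start + chunkSize, limit);
    for (let i = start; i < end; i++) {
      result += Math.sqrt(i);
    }
    await yieldToEventLoop(); // give lock renewal a chance to fire
  }
  return result;
}
```

This only helps if each chunk is short compared with lockDuration. For heavier workloads, a sandboxed processor is usually the more robust choice.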

2. Use Sandboxed Processors for CPU Work

Isolate CPU-intensive operations:

import { Worker } from 'bullmq';
import path from 'path';

// Run CPU-intensive work off the main event loop.
// useWorkerThreads: true uses worker threads; omit it to use child processes.
const worker = new Worker(
  'cpuQueue',
  path.join(__dirname, 'cpu-processor.js'),
  {
    useWorkerThreads: true,
    concurrency: 4,
  },
);

See Sandboxed Processors for details.

3. Set Appropriate Lock Duration

Match lockDuration to job processing time:

// Jobs take ~20 seconds on average
const worker = new Worker('queue', processorFunction, {
  lockDuration: 30000, // 20s + 10s buffer = 30s
});

// Jobs take ~5 minutes on average
const longWorker = new Worker('longQueue', longProcessor, {
  lockDuration: 360000, // 6 minutes
});

4. Monitor and Adjust Concurrency

High concurrency combined with CPU-bound work can cause stalls, because the jobs compete for the same event loop:

import os from 'os';

// For I/O work: High concurrency is fine
const ioWorker = new Worker('ioQueue', ioProcessor, {
  concurrency: 50,
});

// For CPU work: Match CPU cores
const cpuWorker = new Worker('cpuQueue', './cpu-processor.js', {
  concurrency: os.cpus().length,
});

Monitoring Stalled Jobs

Listen for stalled events:

import { Worker } from 'bullmq';

const worker = new Worker('queueName', processorFunction, {
  lockDuration: 30000,
  maxStalledCount: 2,
});

// Job was moved back to waiting due to stall
worker.on('stalled', (jobId: string, prev: string) => {
  console.log(`Job ${jobId} stalled and moved from ${prev} back to waiting`);
  
  // Alert or log for monitoring (`alerting` stands in for your own alerting client)
  alerting.notify(`Job ${jobId} stalled`);
});

// Track lock renewal
worker.on('locksRenewed', ({ count, jobIds }) => {
  console.log(`Renewed locks for ${count} jobs:`, jobIds);
});

// Lock renewal failed
worker.on('lockRenewalFailed', (jobIds: string[]) => {
  console.error('Lock renewal failed for jobs:', jobIds);
  
  // Optionally cancel these jobs
  jobIds.forEach(id => worker.cancelJob(id));
});
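A single stall is often harmless, so alerting on every event can be noisy. A minimal sketch of a sliding-window counter that flags only bursts of stalls is shown below. The class `StallRateTracker` and its thresholds are hypothetical and not part of BullMQ.

```typescript
// Tracks stall timestamps and reports when too many occur within a time window.
class StallRateTracker {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one stall; returns true when the window holds at least `threshold` stalls.
  record(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Drop stalls that fell out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length >= this.threshold;
  }
}

// Example: flag three or more stalls within one minute.
const tracker = new StallRateTracker(60_000, 3);
console.log(tracker.record(0)); // false
console.log(tracker.record(1_000)); // false
console.log(tracker.record(2_000)); // true
```

In a stalled event handler, calling `tracker.record()` and alerting only when it returns true keeps notifications focused on recurring problems.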

Stalled Job Recovery Process

When a job becomes stalled:

Detection: Worker’s stalled checker identifies the expired lock
Event emission: stalled event is emitted with the job ID
Move to waiting: Job is moved back to the waiting state
Increment counter: Job’s stalled count is incremented
Re-processing: Another worker (or the same one) picks up the job
Failure on limit: If maxStalledCount is exceeded, job moves to failed

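import { Worker, Queue } from 'bullmq';

// Create the queue once and reuse it; each Queue instance opens its own Redis connection
const queue = new Queue('queue');

const worker = new Worker('queue', processorFunction, {
  maxStalledCount: 2,
});

worker.on('stalled', async (jobId) => {
  console.log(`Job ${jobId} stalled`);
  
  // Check the job's stall count
  const job = await queue.getJob(jobId);
  
  if (job) {
    // stalledCounter (recent BullMQ releases) tracks stalls;
    // attemptsMade counts processing attempts, not stalls
    console.log(`Job has stalled ${job.stalledCounter} times`);
  }
});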

worker.on('failed', async (job, error) => {
  if (error.message.includes('job stalled more than allowable limit')) {
    console.error(`Job ${job?.id} failed due to excessive stalls`);
  }
});

Troubleshooting Stalled Jobs

Frequent Stalling

Symptoms: Jobs stall repeatedly.

Causes & Solutions:

CPU blocking:

// Solution: Use sandboxed processors
const worker = new Worker('queue', './processor.js');

Lock duration too short:

// Solution: Increase lockDuration
const worker = new Worker('queue', processor, {
  lockDuration: 60000, // Increase from 30s to 60s
});

High concurrency:

// Solution: Reduce concurrency
const worker = new Worker('queue', processor, {
  concurrency: 5, // Reduce from 50 to 5
});

Jobs Fail After Multiple Stalls

Cause: Exceeding maxStalledCount.

Solution: Increase maxStalledCount or fix the root cause:

const worker = new Worker('queue', processor, {
  maxStalledCount: 3, // Increase from 1 to 3
});

No Stalled Detection

Cause: All workers have skipStalledCheck: true.

Solution: Ensure at least one worker performs stalled checks:

// At least one worker should NOT skip stalled checks
const checkerWorker = new Worker('queue', processor, {
  skipStalledCheck: false, // Default
});

Best Practices

Set lockDuration appropriately

Make it 1.5-2x your longest expected job duration.

Use sandboxing for CPU work

Prevent event loop blocking with sandboxed processors.

Monitor stalled events

Alert on frequent stalls to identify problematic jobs:

worker.on('stalled', (jobId) => {
  metrics.increment('jobs.stalled');
});

Adjust maxStalledCount

Set based on acceptable retry attempts for your use case.

Keep workers healthy

Ensure workers have sufficient resources (CPU, memory, network).

Test under load

Simulate high load to identify stalling issues before production.

Related Topics

Sandboxed Processors

Prevent stalls with process isolation

Concurrency

Configure parallel processing

Graceful Shutdown

Minimize stalls during shutdown

Cancelling Jobs

Handle lock renewal failures
