
March 19, 2026 · 7 min read

Hung cron jobs: what they are, why they happen, and how to detect them

A hung cron job starts, runs forever, and never reports failure. Standard monitoring can't catch it. Here's what causes hung jobs, the damage they do, and how to detect and alert on them automatically.


A crashed cron job is easy to catch. It exits with a non-zero code. Your monitoring fires. You get an alert.

A hung cron job is different. It starts. It keeps running. It never crashes and never finishes. It holds a database connection, locks a file, consumes memory, and blocks every subsequent execution. Days later you notice your data is three days stale and your server is running out of memory.

No alert fired. No exit code was ever logged. The job was "running" the entire time.

This is a hung job, and it's one of the most damaging failure modes in scheduled task infrastructure because it's entirely invisible to standard monitoring.


What causes cron jobs to hang

Hung jobs almost always trace back to one of five causes:

Database deadlocks or long-running queries. A query acquires a lock and waits for another lock that's held by a different process. Neither transaction can proceed. The job sits in a waiting state indefinitely, holding its own lock and blocking other operations.

Network timeouts without timeout configuration. An HTTP request to an external API, a message queue connection, a remote file transfer — if no timeout is explicitly set, the default in most runtimes is to wait indefinitely. A server that stops responding mid-response will leave your job waiting forever.

Infinite loops from unexpected input. A loop that processes records one by one, where a malformed record causes the loop to re-process the same item, or where a queue never empties because new items are added as fast as they're consumed.

Memory pressure causing extreme slowdown. A job that processes large datasets and runs out of heap memory doesn't always crash — it can enter a state of constant garbage collection where it's technically running but making no forward progress.

File system issues. A job that writes to a full disk, tries to acquire a file lock held by a crashed previous instance, or waits for input from a pipe that nothing is writing to.
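The infinite-loop cause is worth a sketch. A bounded retry count turns a would-be hang into a finite run — all names below (queue shape, `handle`, `MAX_ATTEMPTS`) are illustrative, not part of any particular framework:

```typescript
// Hypothetical queue worker: without a retry cap, a permanently
// malformed record is re-queued forever and the job never finishes.
type QueueJob = { id: number; payload: string; attempts: number };

const MAX_ATTEMPTS = 3;
const deadLetter: QueueJob[] = [];

function handle(job: QueueJob): boolean {
  return job.payload.length > 0; // fails for malformed payloads
}

const queue: QueueJob[] = [
  { id: 1, payload: 'ok', attempts: 0 },
  { id: 2, payload: '', attempts: 0 }, // malformed: never succeeds
];

let processed = 0;
while (queue.length > 0) {
  const job = queue.shift()!;
  if (handle(job)) {
    processed++;
  } else if (++job.attempts < MAX_ATTEMPTS) {
    queue.push(job); // bounded retry
  } else {
    deadLetter.push(job); // give up instead of looping forever
  }
}
```

The dead-letter list also gives you something to inspect afterwards, instead of a job that silently spins on one bad record.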


Why standard cron job monitoring doesn't catch hung jobs

Standard heartbeat monitoring — where your job pings a URL when it completes — cannot detect hung jobs by design.

The ping only fires when the job finishes. A hung job never finishes, so the ping never fires. From the monitor's perspective, the job simply hasn't completed yet. It has no way to know whether the job is still actively working or has been frozen for six hours.

To detect hung jobs, you need two things:

  1. A start ping — so the monitor knows when the job began
  2. A maximum duration threshold — so the monitor knows when the job has been running too long

Only when both are present can an external service detect the difference between "job is still working" and "job has been stuck for four hours".

This is why Crontify uses a start/success/fail ping model rather than a single completion heartbeat.


How Crontify detects hung jobs

When your job calls start(), Crontify creates a run record with the current timestamp. The run is in a running state.

Every minute, Crontify's scheduler checks all runs in running state. For each one, it calculates:

seconds_running = now - startedAt
hung_threshold = monitor.gracePeriod * hung_job_timeout_multiplier

If seconds_running exceeds the threshold, the run is marked as hung and an alert fires. The default multiplier is 2 — so a monitor with a 30-minute grace period triggers a hung alert after 60 minutes of continuous running.

This detection is entirely external to your process. It fires even if your job is completely frozen, even if your process is consuming 100% CPU in a tight loop, even if the event loop is blocked.
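The check itself is simple enough to sketch. The field names and types here are illustrative, not Crontify's actual schema:

```typescript
interface Run {
  id: string;
  startedAt: Date;
  state: 'running' | 'succeeded' | 'failed' | 'hung';
}

interface Monitor {
  gracePeriodSeconds: number;
}

const HUNG_JOB_TIMEOUT_MULTIPLIER = 2;

// Called by the scheduler once a minute for every run in `running` state.
// Returns true if the run was just marked hung (i.e. an alert should fire).
function markIfHung(run: Run, monitor: Monitor, now: Date): boolean {
  const secondsRunning = (now.getTime() - run.startedAt.getTime()) / 1000;
  const hungThreshold =
    monitor.gracePeriodSeconds * HUNG_JOB_TIMEOUT_MULTIPLIER;
  if (run.state === 'running' && secondsRunning > hungThreshold) {
    run.state = 'hung'; // an alert would fire here
    return true;
  }
  return false;
}
```

Because this loop runs in the scheduler, not in your job's process, it keeps working no matter what state your job is in.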


Adding hung job detection to your cron jobs

Install the SDK:

npm install @crontify/sdk

The minimal instrumentation to enable hung job detection:

import { CrontifyMonitor } from '@crontify/sdk';

const monitor = new CrontifyMonitor({
  apiKey: process.env.CRONTIFY_API_KEY!,
  monitorId: 'your-monitor-id',
});

// wrap() calls start() at the beginning automatically
await monitor.wrap(async () => {
  await processLargeDataset();
});

That's it. The start() ping is sent when wrap() is called. If processLargeDataset() never resolves, Crontify detects the hung state after the threshold expires and sends an alert.

For manual control:

await monitor.start();

try {
  const result = await processLargeDataset();
  await monitor.success({ meta: { records_processed: result.count } });
} catch (err) {
  const error = err as Error;
  await monitor.fail({ message: error.message, log: error.stack });
}
// If this code never reaches success() or fail(), 
// Crontify detects the hung state externally

Setting an appropriate maximum duration threshold

The hung job threshold is derived from the grace period you set for each monitor. Set it based on the longest reasonable runtime for your job — not the average.

A job that normally takes 5 minutes but can legitimately take 20 minutes under heavy load should have a grace period of at least 25–30 minutes. If it's still running after 50–60 minutes (2× the grace period), something is wrong.

Some rules of thumb:

  • Database backup jobs: 2–3× the average backup duration
  • API sync jobs: Set an explicit timeout on every HTTP request (e.g. 30 seconds), then set the monitor grace period to (number of records × 30 seconds) + buffer
  • Data processing jobs: Profile average runtime over 10 executions, set grace period to 2–3× the p95 duration
  • Email dispatch jobs: Usually fast (under 5 minutes); a 15-minute grace period is generous
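For the data-processing rule of thumb, deriving the grace period from observed runtimes can be done mechanically. A small sketch (the durations, the 2.5× factor, and the function names are all illustrative):

```typescript
// Take the p95 of recently observed run durations and multiply by a
// safety factor to get a grace period in seconds.
function p95(durationsSeconds: number[]): number {
  const sorted = [...durationsSeconds].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[idx];
}

function gracePeriodSeconds(durations: number[], factor = 2.5): number {
  return Math.ceil(p95(durations) * factor);
}

// Ten observed runs, in seconds — one slow outlier under heavy load
const runs = [280, 300, 310, 295, 320, 305, 290, 300, 315, 600];
const grace = gracePeriodSeconds(runs);
```

Sizing from the p95 rather than the mean means a legitimately slow run under load doesn't trip the alert, while a genuinely stuck run still does.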

Preventing hung jobs in the first place

Hung job detection tells you when it happens. These patterns reduce how often it happens:

Set explicit timeouts on all network calls:

// Never do this in a cron job
const response = await fetch(url);

// Always do this
const response = await fetch(url, {
  signal: AbortSignal.timeout(30_000), // 30 second timeout
});

Use database query timeouts:

// PostgreSQL statement timeout (applies to the whole connection)
await prisma.$executeRaw`SET statement_timeout = '60s'`;

// Or scoped to a single transaction with SET LOCAL
// (SET LOCAL has no effect outside a transaction, and Prisma's
// $queryRaw only accepts one statement at a time)
await prisma.$transaction([
  prisma.$executeRaw`SET LOCAL statement_timeout = '30s'`,
  prisma.$queryRaw`SELECT * FROM large_table WHERE condition = true`,
]);

Add a process-level timeout as a last resort:

// Kill the entire process if the job takes more than 10 minutes
// Only appropriate for jobs running in isolated processes
const TIMEOUT_MS = 10 * 60 * 1000;
const timeout = setTimeout(() => {
  console.error('Job exceeded maximum duration, exiting');
  process.exit(1);
}, TIMEOUT_MS);
timeout.unref(); // Don't prevent normal exit

try {
  await runJob();
} finally {
  clearTimeout(timeout);
}

Note that exiting the process mid-run means no success() ping is ever sent, so Crontify still flags the run: as failed if the SDK reports the exit, or as hung once the threshold expires. Either outcome is preferable to a silently hung run.


Frequently asked questions

What is the difference between a hung job and a missed run?

A missed run never started — no start ping arrived within the grace period after the scheduled time. A hung job started but never finished — a start ping arrived, but no success or fail ping followed within the maximum duration threshold. Both require external monitoring to detect, but they represent different root causes.

Can a job be both hung and missed?

Yes, in sequence. If a job hangs indefinitely, it may still be technically "running" when the next scheduled execution is due. If the next instance detects the previous run is still active, it may refuse to start (depending on your configuration), resulting in what appears as a missed run. Crontify detects the hung state and fires an overlap alert if a new instance starts while the previous one hasn't finished.
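The "refuse to start" behaviour described above is often implemented with a simple lock on the job side. A minimal sketch using an atomic directory creation as the lock (the lock path, `runJob`, and all names are illustrative):

```typescript
import { mkdirSync, rmdirSync } from 'node:fs';
import { tmpdir } from 'node:os';

// Illustrative lock path; in practice use a stable, job-specific location
const LOCK_DIR = `${tmpdir()}/my-job-${process.pid}.lock`;

function acquireLock(): boolean {
  try {
    mkdirSync(LOCK_DIR); // atomic: throws if the directory already exists
    return true;
  } catch {
    return false;
  }
}

function releaseLock(): void {
  rmdirSync(LOCK_DIR);
}

async function runJob(): Promise<void> {
  // hypothetical job body
}

async function main(): Promise<void> {
  if (!acquireLock()) {
    console.error('Previous run still active, refusing to start');
    return; // skip this run rather than overlap
  }
  try {
    await runJob();
  } finally {
    releaseLock(); // release even if the job throws
  }
}
```

One known weakness of this pattern: if the process is killed before the finally block runs, the lock goes stale and every subsequent run is skipped — which is exactly the cascade external monitoring is there to catch.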

How do I test that hung job detection is working?

Create a test monitor, instrument a test job that calls start() and then sleeps indefinitely (or just never calls success() or fail()). Within one detection cycle after your threshold expires, you should receive an alert.

// Test hung job detection — never call this in production
await monitor.start();
await new Promise(() => {}); // never resolves

Start monitoring for free

Crontify is free for up to 5 monitors — no credit card required.

crontify.com — SDK on npm as @crontify/sdk.

