TraceStax Docs

Queue depth

Queue depth is the number of jobs waiting in a queue to be picked up by a worker. It is one of the most direct signals of whether your background job infrastructure is keeping up with demand.

A queue with zero depth means jobs are being processed as fast as they arrive — workers are keeping up. A queue with rising depth means jobs are arriving faster than workers can process them, or workers have stopped processing altogether.

Left undetected, a growing backlog can cause:

  • User-visible delays — if the queue backs up transactional jobs (emails, webhooks, order processing), users experience latency or missing notifications
  • Memory pressure — some brokers hold job payloads in memory; a deep queue can exhaust broker resources
  • Cascading failures — downstream systems that depend on job output stall, potentially triggering their own timeouts

TraceStax monitors queue depth continuously and alerts you before a backlog becomes an incident.

TraceStax collects queue depth through snapshot events. The SDK queries the broker’s queue statistics every 60 seconds and sends a snapshot containing the current depth, active count, failed count, and throughput for each queue.
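The collection loop can be pictured as a small sketch. The field names match the snapshot description above; the `broker` client and its accessor methods are hypothetical stand-ins for whatever framework API the real SDK wraps:

```python
import time

SNAPSHOT_INTERVAL_S = 60  # the SDK's snapshot cadence

def collect_snapshot(broker, queue_name):
    """Build one snapshot event from broker-level queue statistics.

    `broker` and its methods (waiting_count, active_count, failed_count,
    completed_delta) are illustrative assumptions, not real SDK calls.
    """
    return {
        "queue": queue_name,
        "depth": broker.waiting_count(queue_name),          # jobs waiting
        "active": broker.active_count(queue_name),          # jobs in flight
        "failed": broker.failed_count(queue_name),          # dead-lettered jobs
        "throughput_per_min": broker.completed_delta(queue_name),
        "ts": time.time(),
    }
```

In practice the SDK runs this on a 60-second timer per queue and ships each snapshot alongside the event-driven task_event stream.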

Unlike task_event data — which is event-driven and only arrives when a job changes state — snapshot data gives TraceStax a regular, broker-level view of the queue regardless of whether any individual job events have arrived recently.

Every queue in TraceStax has a depth chart accessible from the Queues tab of your project. The chart shows three series over time:

  • Waiting — jobs in the queue waiting to be picked up (the depth field from snapshots)
  • Active — jobs currently being processed by a worker (active field)
  • Failed — jobs in the dead-letter or failed state (failed field)

The chart defaults to a 24-hour window and can be zoomed to 1 hour, 7 days, or 30 days. Hovering over a point shows the exact values at that timestamp.

The failed series is plotted on a separate y-axis because failed job counts tend to grow slowly and would otherwise be invisible against a large waiting count.

TraceStax’s backlog detection uses two separate conditions. Either one can fire an alert independently.

TraceStax maintains a rolling baseline of queue depth for each (queue, time-of-week) bucket. Many queues are naturally deeper during business hours than overnight; bucketing by time-of-week prevents those normal fluctuations from generating alerts.

An alert fires when the current depth exceeds the rolling baseline by more than 2σ. Like all TraceStax anomaly detection, the threshold adapts automatically — if your queue is consistently deeper during a product launch, the baseline adjusts.
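The bucketing and threshold logic above can be sketched as follows. The one-hour bucket width is an assumption for illustration; the actual granularity TraceStax uses is not specified here:

```python
import statistics
from datetime import datetime

def time_of_week_bucket(ts: datetime, bucket_minutes: int = 60) -> int:
    """Map a timestamp to its (day-of-week, time-of-day) bucket.

    Bucket width is an illustrative assumption.
    """
    buckets_per_day = 1440 // bucket_minutes
    return ts.weekday() * buckets_per_day + (ts.hour * 60 + ts.minute) // bucket_minutes

def is_anomalous(current_depth: float, history: list[float], sigmas: float = 2.0) -> bool:
    """Fire when depth exceeds the bucket's rolling baseline by more than 2 sigma."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return current_depth > mean + sigmas * stdev
```

Because the baseline is recomputed from recent history per bucket, a sustained shift (such as a product launch) raises the mean and standard deviation, and the threshold adapts with it.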

Sustained non-zero depth with no throughput


The statistical baseline works well when a queue is normally shallow but occasionally spikes. A second condition catches a different failure mode: a queue that has had jobs waiting for an extended period but has not processed any of them.

This condition fires when:

  • Queue depth has been greater than zero continuously for more than N minutes (configurable per queue, default: 10 minutes), and
  • Throughput for that queue has been zero or near-zero during the same window

This catches situations where workers are running (so the fleet alert does not fire) but are silently refusing to pick up jobs — for example, a deserialization error that causes workers to crash immediately after dequeuing.
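Checking both conditions over a window of snapshots might look like this sketch. The near-zero throughput cutoff (`throughput_eps`) is an illustrative assumption:

```python
def sustained_backlog(snapshots, window_minutes=10, snapshot_interval_s=60,
                      throughput_eps=0.5):
    """Check the 'sustained non-zero depth with no throughput' condition.

    `snapshots` is a list of dicts with `depth` and `throughput_per_min`
    fields (oldest first), matching the snapshot format described above.
    """
    needed = (window_minutes * 60) // snapshot_interval_s
    recent = snapshots[-needed:]
    if len(recent) < needed:
        return False  # not enough history to cover the window yet
    depth_always_positive = all(s["depth"] > 0 for s in recent)
    no_throughput = all(s["throughput_per_min"] <= throughput_eps for s in recent)
    return depth_always_positive and no_throughput
```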

A queue stall is a specific and severe condition: the queue depth is growing (or non-zero) but worker throughput has dropped to zero. This indicates that workers have stopped processing the queue entirely.

Common causes:

  • Worker processes have crashed and not been restarted
  • The broker connection has been lost
  • A poison-pill job is being repeatedly dequeued, crashing the worker, and being re-enqueued

Queue stall alerts are always filed at critical severity. TraceStax detects a stall when:

  • throughput_per_min is 0 across two consecutive snapshots (≥ 60 seconds of no processing), and
  • depth is > 0 or active dropped from > 0 to 0 between snapshots
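The two stall conditions translate directly into a pairwise check over consecutive snapshots. A minimal sketch, using the snapshot field names from this page:

```python
def is_stalled(prev: dict, curr: dict) -> bool:
    """Detect a queue stall from two consecutive snapshots.

    Fires when throughput is zero in both snapshots AND either jobs
    are still waiting or the active count dropped from > 0 to 0.
    """
    no_throughput = (prev["throughput_per_min"] == 0
                     and curr["throughput_per_min"] == 0)
    depth_waiting = curr["depth"] > 0
    active_dropped = prev["active"] > 0 and curr["active"] == 0
    return no_throughput and (depth_waiting or active_dropped)
```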

In addition to queue depth, TraceStax tracks how long jobs wait in the queue before a worker picks them up. This metric — queued_ms in the task_event payload — is the wall-clock time from when the job was enqueued to when the worker called the job’s execute method.

Latency percentiles (p50, p95) are shown on each queue’s detail page alongside the depth chart. A rising p95 latency often predicts a backlog before the depth chart makes it obvious — if workers are getting slower at picking up jobs, depth will follow.

queued_ms is populated by the SDK when the framework exposes the original enqueue timestamp in the job payload. Not all frameworks do this by default; see your SDK’s documentation for framework-specific notes.
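The metric itself is a simple subtraction, and the percentiles can be aggregated over a window of samples. The nearest-rank percentile below is a simple stand-in; the actual aggregation method the backend uses is not specified here:

```python
import math

def queued_ms(enqueued_at_ms: int, started_at_ms: int) -> int:
    """Wall-clock wait: enqueue time to the worker's execute call."""
    return started_at_ms - enqueued_at_ms

def percentile(samples: list[int], p: float) -> float:
    """Nearest-rank percentile (e.g. p=50 or p=95) over latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```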

Different frameworks expose queue statistics in different ways. The TraceStax SDK abstracts these into the common snapshot format.

| Framework | Depth source | Active count source | Throughput source |
| --- | --- | --- | --- |
| BullMQ | `queue.getWaitingCount()` | `queue.getActiveCount()` | `queue.getCompletedCount()` delta over interval |
| Celery | `celery inspect reserved` + broker queue length | `celery inspect active` | `celery inspect stats` task counts delta |
| Sidekiq | `Sidekiq::Queue.new(name).size` | `Sidekiq::Workers.new` count | `Sidekiq::Stats` processed delta |
| RQ | `Queue.count` | `StartedJobRegistry.count` | `FinishedJobRegistry.count` delta |
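Several throughput sources in the table are monotonically increasing counters (completed, processed, finished), so throughput is derived as a delta between consecutive snapshots. A minimal sketch of that derivation:

```python
def throughput_per_min(prev_completed: int, curr_completed: int,
                       interval_s: int = 60) -> float:
    """Derive per-minute throughput from a monotonic completed-count.

    max(0, ...) guards against the counter resetting (e.g. after a
    broker restart), which would otherwise yield a negative rate.
    """
    delta = max(0, curr_completed - prev_completed)
    return delta * 60 / interval_s
```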