# Queue depth
Queue depth is the number of jobs waiting in a queue to be picked up by a worker. It is one of the most direct signals of whether your background job infrastructure is keeping up with demand.
## Why queue depth matters

A queue with zero depth means jobs are being processed as fast as they arrive — workers are keeping up. A queue with rising depth means jobs are arriving faster than workers can process them, or workers have stopped processing altogether.
Left undetected, a growing backlog can cause:
- User-visible delays — if the queue backs up transactional jobs (emails, webhooks, order processing), users experience latency or missing notifications
- Memory pressure — some brokers hold job payloads in memory; a deep queue can exhaust broker resources
- Cascading failures — downstream systems that depend on job output stall, potentially triggering their own timeouts
TraceStax monitors queue depth continuously and alerts you before a backlog becomes an incident.
## How depth data is collected

TraceStax collects queue depth through snapshot events. The SDK queries the broker’s queue statistics every 60 seconds and sends a snapshot containing the current depth, active count, failed count, and throughput for each queue.
Unlike `task_event` data — which is event-driven and only arrives when a job changes state — snapshot data gives TraceStax a regular, broker-level view of the queue regardless of whether any individual job events have arrived recently.
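The exact snapshot payload is internal to the TraceStax SDK, but as a rough illustration, a per-queue snapshot carrying the fields described above might be modeled like this (the class and field names here are hypothetical, not the SDK's actual wire format):

```python
from dataclasses import dataclass, asdict

# Hypothetical shape of a 60-second queue snapshot. Field names mirror
# the depth / active / failed / throughput stats described above; the
# real SDK payload may differ.
@dataclass
class QueueSnapshot:
    queue: str
    timestamp: float           # epoch seconds when the broker was queried
    depth: int                 # jobs waiting to be picked up
    active: int                # jobs currently being processed
    failed: int                # jobs in the dead-letter / failed state
    throughput_per_min: float  # jobs completed per minute since last snapshot

snap = QueueSnapshot(queue="emails", timestamp=1700000000.0,
                     depth=42, active=3, failed=1, throughput_per_min=95.0)
payload = asdict(snap)  # the dict that would be reported every 60 seconds
print(payload["depth"])  # → 42
```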
## Queue depth chart

Every queue in TraceStax has a depth chart accessible from the Queues tab of your project. The chart shows three series over time:

- Waiting — jobs in the queue waiting to be picked up (the `depth` field from snapshots)
- Active — jobs currently being processed by a worker (`active` field)
- Failed — jobs in the dead-letter or failed state (`failed` field)
The chart defaults to a 24-hour window and can be zoomed to 1 hour, 7 days, or 30 days. Hovering over a point shows the exact values at that timestamp.
The failed series is plotted on a separate y-axis because failed job counts tend to grow slowly and would otherwise be invisible against a large waiting count.
## Backlog detection

TraceStax’s backlog detection uses two separate conditions. Either one can fire an alert independently.
### Statistical backlog

TraceStax maintains a rolling baseline of queue depth for each (queue, time-of-week) bucket. Many queues are naturally deeper during business hours than overnight; bucketing by time-of-week prevents those normal fluctuations from generating alerts.
An alert fires when the current depth exceeds the rolling baseline by more than 2σ. Like all TraceStax anomaly detection, the threshold adapts automatically — if your queue is consistently deeper during a product launch, the baseline adjusts.
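TraceStax’s exact baseline algorithm is not public, but the time-of-week bucketing and 2σ check described above can be sketched roughly as follows (the bucket size, function names, and sample data are all illustrative assumptions):

```python
import statistics
from datetime import datetime, timezone

def time_of_week_bucket(ts: datetime, bucket_minutes: int = 60) -> int:
    """Map a timestamp to one slot per hour of the week (0..167)."""
    minutes = ts.weekday() * 24 * 60 + ts.hour * 60 + ts.minute
    return minutes // bucket_minutes

def is_backlog(current_depth: float, history: list[float], sigmas: float = 2.0) -> bool:
    """Fire when depth exceeds the bucket's rolling mean by more than 2 sigma."""
    if len(history) < 2:
        return False  # not enough samples to form a baseline
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    return current_depth > mean + sigmas * sd

# A Tuesday-14:30 sample compares only against past Tuesday-14:xx depths,
# so a deep Monday-morning queue never skews this bucket's baseline.
ts = datetime(2024, 1, 2, 14, 30, tzinfo=timezone.utc)
history = [100, 110, 95, 105, 102]  # depths previously seen in this bucket
print(is_backlog(500, history))     # True: far above the baseline
print(is_backlog(108, history))     # False: within normal variation
```

Because the baseline is recomputed from the rolling history, a sustained shift (such as a product launch) raises the mean and the threshold adapts, as described above.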
### Sustained non-zero depth with no throughput

The statistical baseline works well when a queue is normally shallow but occasionally spikes. A second condition catches a different failure mode: a queue that has had jobs waiting for an extended period but has not processed any of them.
This condition fires when:
- Queue depth has been greater than zero continuously for more than N minutes (configurable per queue, default: 10 minutes), and
- Throughput for that queue has been zero or near-zero during the same window
This catches situations where workers are running (so the fleet alert does not fire) but are silently refusing to pick up jobs — for example, a deserialization error that causes workers to crash immediately after dequeuing.
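Under the stated defaults (10-minute window, 60-second snapshots), this condition could be evaluated over the last ten snapshots, roughly like the sketch below (function name, tuple layout, and the near-zero tolerance `eps` are assumptions, not the SDK's implementation):

```python
def sustained_stuck(snapshots, window_minutes=10, interval_s=60, eps=0.1):
    """
    snapshots: list of (depth, throughput_per_min) tuples, oldest first,
    one per snapshot interval. Fires when depth stayed > 0 for the whole
    window while throughput stayed at or near zero.
    """
    needed = window_minutes * 60 // interval_s
    if len(snapshots) < needed:
        return False  # not enough history to cover the window
    recent = snapshots[-needed:]
    return all(depth > 0 and tput <= eps for depth, tput in recent)

# Ten minutes of jobs waiting with no processing -> alert
stuck = [(25, 0.0)] * 10
print(sustained_stuck(stuck))    # True
# Same depth but healthy throughput -> no alert
healthy = [(25, 120.0)] * 10
print(sustained_stuck(healthy))  # False
```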
## Queue stall detection

A queue stall is a specific and severe condition: the queue depth is growing (or non-zero) but worker throughput has dropped to zero. This indicates that workers have stopped processing the queue entirely.
Common causes:
- Worker processes have crashed and not been restarted
- The broker connection has been lost
- A poison-pill job is being repeatedly dequeued, crashing the worker, and being re-enqueued
Queue stall alerts are always filed at critical severity. TraceStax detects a stall when:
- `throughput_per_min` is 0 across two consecutive snapshots (≥ 60 seconds of no processing), and
- `depth` is > 0, or `active` dropped from > 0 to 0 between snapshots
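The two stall conditions above can be expressed as a predicate over consecutive snapshots, sketched here with plain dicts (key names follow the snapshot fields mentioned above; the structure is illustrative):

```python
def is_stalled(prev: dict, curr: dict) -> bool:
    """
    prev, curr: two consecutive snapshots with keys 'depth', 'active',
    and 'throughput_per_min'. Mirrors the conditions above: zero
    throughput across both snapshots, combined with either a non-empty
    queue or active workers dropping to zero.
    """
    no_throughput = (prev["throughput_per_min"] == 0
                     and curr["throughput_per_min"] == 0)
    backlog_or_drop = (curr["depth"] > 0
                       or (prev["active"] > 0 and curr["active"] == 0))
    return no_throughput and backlog_or_drop

prev = {"depth": 10, "active": 4, "throughput_per_min": 0}
curr = {"depth": 14, "active": 0, "throughput_per_min": 0}
print(is_stalled(prev, curr))  # True: no processing, jobs still waiting
```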
## Enqueue-to-start latency

In addition to queue depth, TraceStax tracks how long jobs wait in the queue before a worker picks them up. This metric — `queued_ms` in the `task_event` payload — is the wall-clock time from when the job was enqueued to when the worker called the job’s execute method.
Latency percentiles (p50, p95) are shown on each queue’s detail page alongside the depth chart. A rising p95 latency often predicts a backlog before the depth chart makes it obvious — if workers are getting slower at picking up jobs, depth will follow.
`queued_ms` is populated by the SDK when the framework exposes the original enqueue timestamp in the job payload. Not all frameworks do this by default; see your SDK’s documentation for framework-specific notes.
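The latency itself is a simple wall-clock difference; a minimal sketch of computing it and summarizing a sample with nearest-rank percentiles (TraceStax’s actual percentile method is not specified, so the formula here is just one common choice):

```python
import math

def queued_ms(enqueued_at_ms: int, started_at_ms: int) -> int:
    """Wall-clock wait from enqueue to the worker starting the job."""
    return started_at_ms - enqueued_at_ms

def percentile(values, pct):
    """Nearest-rank percentile; assumes a non-empty sample."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Ten sample waits in ms; one slow outlier dominates the tail.
waits = [120, 150, 90, 4000, 130, 140, 110, 160, 100, 95]
print(percentile(waits, 50))  # → 120 (typical wait)
print(percentile(waits, 95))  # → 4000 (p95 exposes the outlier)
```

This is why a rising p95 leads the depth chart: a single class of slow pickups moves the tail long before the mean or the waiting count reacts.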
## Framework support

Different frameworks expose queue statistics in different ways. The TraceStax SDK abstracts these into the common snapshot format.
| Framework | Depth source | Active count source | Throughput source |
|---|---|---|---|
| BullMQ | `queue.getWaitingCount()` | `queue.getActiveCount()` | `queue.getCompletedCount()` delta over interval |
| Celery | `celery inspect reserved` + broker queue length | `celery inspect active` | `celery inspect stats` task counts delta |
| Sidekiq | `Sidekiq::Queue.new(name).size` | `Sidekiq::Workers.new` count | `Sidekiq::Stats` processed delta |
| RQ | `Queue.count` | `StartedJobRegistry.count` | `FinishedJobRegistry.count` delta |
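The throughput columns all use the same "delta over interval" pattern: the broker exposes a cumulative completed count, and the rate is derived from two consecutive readings. A broker-agnostic sketch (the counter-reset handling is an assumption about reasonable behavior, not documented SDK logic):

```python
def throughput_per_min(prev_completed: int, curr_completed: int,
                       interval_s: float) -> float:
    """
    Derive jobs/min from a cumulative completed-count delta between two
    snapshots. A counter reset (e.g. broker restart) would make the delta
    negative, so it is clamped to zero rather than reported as negative.
    """
    delta = max(0, curr_completed - prev_completed)
    return delta * 60.0 / interval_s

print(throughput_per_min(1000, 1090, 60))  # → 90.0 jobs/min
print(throughput_per_min(1090, 5, 60))     # → 0.0 (counter reset)
```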