
Anomaly detection

Traditional monitoring asks you to define rules: “alert if failure rate > 5%” or “alert if duration > 30s”. This approach has two failure modes:

  • Too sensitive — thresholds set too tight generate constant noise, leading to alert fatigue and on-call burnout
  • Not sensitive enough — thresholds set too loose miss real incidents, especially gradual degradation

TraceStax uses statistical anomaly detection instead. You never configure a threshold.

For each (job_name, queue) pair in your project, TraceStax maintains a rolling statistical baseline:

  • Median duration — p50 over the lookback window
  • p95 duration — used to set the “normal ceiling” for slow jobs
  • Failure rate — rolling failure / (failure + success) ratio
  • Throughput — jobs per minute, tracked to detect queue stalls
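As a rough illustration, these four statistics could be computed from a window of recent events like this (a minimal sketch in Python; the event field names and function shape are assumptions, not TraceStax internals):

```python
import statistics

def compute_baseline(events, window_minutes):
    """Rolling baseline stats for one (job_name, queue) pair.

    `events` is a list of dicts with assumed keys:
    "duration_s" (float) and "success" (bool).
    """
    durations = sorted(e["duration_s"] for e in events)
    n = len(durations)
    failures = sum(1 for e in events if not e["success"])
    return {
        # p50 over the lookback window
        "median_duration": statistics.median(durations),
        # p95: the "normal ceiling" for slow jobs
        "p95_duration": durations[min(n - 1, int(0.95 * n))],
        # failure / (failure + success) ratio
        "failure_rate": failures / n,
        # jobs per minute, used to detect queue stalls
        "throughput": n / window_minutes,
    }
```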

When a new event arrives, TraceStax computes how many standard deviations the value is from the baseline. If it exceeds the configured sensitivity level, an alert fires.

  • low — Only fires on extreme deviations; good for noisy jobs with high natural variance
  • medium (default) — Balanced; fires on deviations >2σ from baseline
  • high — Fires on subtle shifts; good for SLA-critical jobs

Sensitivity is configured per-project in Settings → Alerts.
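A minimal sketch of the deviation check, assuming a z-score comparison against the baseline. Only the medium threshold (2σ) appears in the table above; the low and high values here are illustrative placeholders:

```python
# Sensitivity level → σ threshold. Only the medium value (2σ) is stated
# in the docs; the low/high values are illustrative assumptions.
SIGMA_THRESHOLDS = {"low": 3.0, "medium": 2.0, "high": 1.0}

def is_anomalous(value, baseline_mean, baseline_stddev, sensitivity="medium"):
    """True if `value` deviates from the baseline by more than the
    σ threshold for the configured sensitivity level."""
    if baseline_stddev == 0:
        return False  # no variance in the baseline: nothing to compare against
    z = abs(value - baseline_mean) / baseline_stddev
    return z > SIGMA_THRESHOLDS[sensitivity]
```

The same event can therefore be an anomaly at high sensitivity and unremarkable at low sensitivity; only the threshold changes, not the statistic.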

Anomaly detection requires a baseline. A new job name needs at least 50 events and 24 hours of history before alerts will fire. Until then, events are still collected but are flagged as “learning” in the dashboard.
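The learning gate reduces to a simple predicate. The constants below match the 50-event / 24-hour rule; the function itself is a hypothetical sketch:

```python
from datetime import datetime, timedelta

MIN_EVENTS = 50
MIN_HISTORY = timedelta(hours=24)

def baseline_state(event_count, first_seen, now):
    """Return "learning" until a job has both enough events and enough
    history; "active" once alerts are allowed to fire."""
    if event_count < MIN_EVENTS or now - first_seen < MIN_HISTORY:
        return "learning"
    return "active"
```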

The baseline adapts as your workload changes. If your jobs consistently get slower after a deploy (but stay within the alerting threshold), the baseline updates and the new duration becomes the new normal. If they then spike beyond the updated baseline, an alert fires.

The adaptation rate uses exponential smoothing — recent events are weighted more heavily than older ones.
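Exponential smoothing itself is standard: each new observation is blended into the running baseline with a smoothing factor. A sketch, where the `alpha` value is an assumption rather than a documented TraceStax setting:

```python
def update_baseline(current, observation, alpha=0.1):
    """Exponentially smoothed baseline update: recent events are
    weighted more heavily than older ones. `alpha` here is an
    illustrative assumption, not a documented TraceStax value."""
    return alpha * observation + (1 - alpha) * current
```

With `alpha = 0.1`, a job that settles at a new, slower duration pulls the baseline most of the way there within a few dozen events, which is how a gradual slowdown becomes the new normal.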

TraceStax currently generates alerts for:

  • Duration spike — p95 duration exceeds the baseline by the configured σ threshold
  • Failure rate increase — rolling failure rate exceeds the baseline plus the threshold
  • Queue stall — no successful events for a job within its expected period
  • Worker disappearance — a previously active worker stops sending heartbeats
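The last two conditions reduce to time-since-last-event checks. A sketch under assumed parameter names (`expected_period` and the heartbeat grace multiplier are illustrative, not documented values):

```python
from datetime import datetime, timedelta

def is_queue_stalled(last_success, expected_period, now):
    """Queue stall: no successful events within the job's expected period."""
    return now - last_success > expected_period

def worker_disappeared(last_heartbeat, heartbeat_interval, now, grace=3):
    """Worker disappearance: a previously active worker has missed
    several consecutive heartbeats (`grace` is an assumed multiplier)."""
    return now - last_heartbeat > grace * heartbeat_interval
```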

You can silence alerts for a job or queue for a defined period — useful during deployments or known maintenance windows. Silence windows are configured from the job detail page or via the API.