# Anomaly detection
## The problem with static thresholds

Traditional monitoring asks you to define rules: “alert if failure rate > 5%” or “alert if duration > 30s”. This approach has two failure modes:
- Too sensitive — thresholds set too tight generate constant noise, leading to alert fatigue and on-call burnout
- Not sensitive enough — thresholds set too loose miss real incidents, especially gradual degradation
TraceStax uses statistical anomaly detection instead. You never configure a threshold.
## How it works

For each (job_name, queue) pair in your project, TraceStax maintains a rolling statistical baseline:
- Median duration — p50 over the lookback window
- p95 duration — used to set the “normal ceiling” for slow jobs
- Failure rate — rolling failure / (failure + success) ratio
- Throughput — jobs per minute, tracked to detect queue stalls
When a new event arrives, TraceStax computes how many standard deviations the value is from the baseline. If it exceeds the configured sensitivity level, an alert fires.
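The core check can be sketched as a z-score test. This is a minimal illustration, not TraceStax's actual implementation (which tracks medians and percentiles per metric rather than a single mean):

```python
def is_anomalous(value: float, baseline_mean: float, baseline_std: float,
                 sigma_threshold: float = 2.0) -> bool:
    """Return True if value deviates from the baseline mean by more
    than sigma_threshold standard deviations."""
    if baseline_std == 0:
        return False  # no variance observed yet; nothing to compare against
    z = abs(value - baseline_mean) / baseline_std
    return z > sigma_threshold
```

For example, a job whose durations average 10 s with σ = 2 s would trip the default 2σ check at anything over 14 s.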
## Sensitivity levels

| Level | Description |
|---|---|
| low | Only fires on extreme deviations — good for noisy jobs with high natural variance |
| medium (default) | Balanced — fires on deviations >2σ from baseline |
| high | Fires on subtle shifts — good for SLA-critical jobs |
Sensitivity is configured per-project in Settings → Alerts.
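Conceptually, each level maps to a σ threshold. Only the medium level's value (2σ) is documented; the low and high values below are illustrative assumptions, not TraceStax's actual numbers:

```python
# Hypothetical sensitivity-to-threshold mapping. Only medium = 2.0
# is documented; low and high are assumed for illustration.
SIGMA_THRESHOLDS = {
    "low": 3.0,     # assumed: extreme deviations only
    "medium": 2.0,  # documented default
    "high": 1.0,    # assumed: subtle shifts
}

def sigma_threshold(sensitivity: str) -> float:
    """Fall back to the medium default for unknown levels."""
    return SIGMA_THRESHOLDS.get(sensitivity, SIGMA_THRESHOLDS["medium"])
```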
## Baseline establishment

Anomaly detection requires a baseline. A new job name needs at least 50 events and 24 hours of history before alerts will fire. During the establishment period, events are collected and flagged as “learning” in the dashboard.
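The gating rule is a conjunction of both conditions. As a sketch (function name and signature are illustrative, not a TraceStax API):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_EVENTS = 50                    # documented minimum event count
MIN_HISTORY = timedelta(hours=24)  # documented minimum history

def baseline_established(event_count: int, first_seen: datetime,
                         now: Optional[datetime] = None) -> bool:
    """Both conditions must hold before alerts fire; until then the
    job stays in the 'learning' state."""
    now = now or datetime.now(timezone.utc)
    return event_count >= MIN_EVENTS and (now - first_seen) >= MIN_HISTORY
```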
## Automatic adaptation

The baseline adapts as your workload changes. If your jobs consistently get slower after a deploy (without ever crossing the alert threshold), the baseline updates and the new duration becomes the new normal. If they then spike beyond the new baseline, an alert fires.
The adaptation rate uses exponential smoothing — recent events are weighted more heavily than older ones.
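An exponentially weighted update can be sketched in one line. The smoothing factor alpha = 0.1 here is an assumed value; TraceStax does not document its actual rate:

```python
def update_baseline(baseline: float, observation: float,
                    alpha: float = 0.1) -> float:
    """Exponential smoothing: each new observation pulls the baseline
    toward it by a factor of alpha, so recent events dominate."""
    return alpha * observation + (1 - alpha) * baseline
```

Feeding repeated 20 s observations into a 10 s baseline walks it upward (11.0 s, 11.9 s, 12.71 s, ...) until 20 s becomes the new normal.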
## Alert conditions

TraceStax currently generates alerts for:
| Condition | Trigger |
|---|---|
| Duration spike | p95 duration exceeds baseline by configured σ threshold |
| Failure rate increase | Rolling failure rate exceeds baseline + threshold |
| Queue stall | No successful events for a job within its expected period |
| Worker disappearance | A worker that was active stops sending heartbeats |
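The queue-stall condition, for instance, reduces to a time comparison. A sketch, with the expected period taken as an input rather than derived from the throughput baseline as TraceStax does:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def queue_stalled(last_success: datetime, expected_period: timedelta,
                  now: Optional[datetime] = None) -> bool:
    """A stall means no successful event has arrived within the job's
    expected period."""
    now = now or datetime.now(timezone.utc)
    return now - last_success > expected_period
```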
## Silencing and overrides

You can silence alerts for a job or queue for a defined period — useful during deployments or known maintenance windows. Silence windows are configured from the job detail page or via the API.
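A silence window amounts to suppressing alerts for a (job_name, queue) pair until an expiry time. The sketch below uses a hypothetical in-memory store and made-up job names; the shape of TraceStax's silencing API is not shown here:

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

# Hypothetical store of active silence windows, keyed by (job_name, queue);
# the real windows are created from the job detail page or the API.
silences: Dict[Tuple[str, str], datetime] = {
    ("SyncInvoices", "billing"): datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
}

def is_silenced(job_name: str, queue: str, now: datetime) -> bool:
    """Suppress alerts for the pair until its silence window expires."""
    until = silences.get((job_name, queue))
    return until is not None and now < until
```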