
Anomaly detection

Traditional monitoring asks you to define rules: “alert if failure rate > 5%” or “alert if duration > 30s”. This approach has two failure modes:

  • Too sensitive — thresholds set too tight generate constant noise, leading to alert fatigue and on-call burnout
  • Not sensitive enough — thresholds set too loose miss real incidents, especially gradual degradation

TraceStax uses statistical anomaly detection instead. You never configure a threshold.

For each (job_name, queue) pair in your project, TraceStax maintains a rolling statistical baseline:

  • Median duration — p50 over the lookback window
  • p95 duration — used to set the “normal ceiling” for slow jobs
  • Failure rate — rolling failure / (failure + success) ratio
  • Throughput — jobs per minute, tracked to detect queue stalls
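As a rough illustration, these four statistics could be computed from a window of recent events like this (a minimal sketch in Python; the event field names and function shape are assumptions, not TraceStax internals):

```python
import statistics

def compute_baseline(events, window_minutes):
    """Rolling baseline stats for one (job_name, queue) pair.

    `events` is a list of dicts with assumed keys:
    "duration_s" (float) and "success" (bool).
    """
    durations = sorted(e["duration_s"] for e in events)
    n = len(durations)
    failures = sum(1 for e in events if not e["success"])
    return {
        # p50 over the lookback window
        "median_duration": statistics.median(durations),
        # p95: the "normal ceiling" for slow jobs
        "p95_duration": durations[min(n - 1, int(0.95 * n))],
        # failure / (failure + success) ratio
        "failure_rate": failures / n,
        # jobs per minute, used to detect queue stalls
        "throughput": n / window_minutes,
    }
```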

When a new event arrives, TraceStax computes how many standard deviations the value is from the baseline. If it exceeds the configured sensitivity level, an alert fires.

  • low — Only fires on extreme deviations; good for noisy jobs with high natural variance
  • medium (default) — Balanced; fires on deviations >2σ from baseline
  • high — Fires on subtle shifts; good for SLA-critical jobs

Sensitivity is configured per-project in Settings → Alerts.
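A minimal sketch of the deviation check, assuming a z-score comparison against the baseline. Only the medium threshold (2σ) appears in the table above; the low and high values here are illustrative placeholders:

```python
# Sensitivity level → σ threshold. Only the medium value (2σ) is stated
# in the docs; the low/high values are illustrative assumptions.
SIGMA_THRESHOLDS = {"low": 3.0, "medium": 2.0, "high": 1.0}

def is_anomalous(value, baseline_mean, baseline_stddev, sensitivity="medium"):
    """True if `value` deviates from the baseline by more than the
    σ threshold for the configured sensitivity level."""
    if baseline_stddev == 0:
        return False  # no variance in the baseline: nothing to compare against
    z = abs(value - baseline_mean) / baseline_stddev
    return z > SIGMA_THRESHOLDS[sensitivity]
```

The same event can therefore be an anomaly at high sensitivity and unremarkable at low sensitivity; only the threshold changes, not the statistic.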

Anomaly detection requires a baseline. A new job name needs at least 50 events and 24 hours of history before alerts will fire. Until then, events are still collected but are flagged as “learning” in the dashboard.
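The learning gate reduces to a simple predicate. The constants below match the 50-event / 24-hour rule; the function itself is a hypothetical sketch:

```python
from datetime import datetime, timedelta

MIN_EVENTS = 50
MIN_HISTORY = timedelta(hours=24)

def baseline_state(event_count, first_seen, now):
    """Return "learning" until a job has both enough events and enough
    history; "active" once alerts are allowed to fire."""
    if event_count < MIN_EVENTS or now - first_seen < MIN_HISTORY:
        return "learning"
    return "active"
```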

The baseline adapts as your workload changes. If your jobs consistently get slower after a deploy (but stay within the alerting threshold), the baseline updates and the new duration becomes the new normal. If they then spike beyond the updated baseline, an alert fires.

The adaptation rate uses exponential smoothing — recent events are weighted more heavily than older ones.
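Exponential smoothing itself is standard: each new observation is blended into the running baseline with a smoothing factor. A sketch, where the `alpha` value is an assumption rather than a documented TraceStax setting:

```python
def update_baseline(current, observation, alpha=0.1):
    """Exponentially smoothed baseline update: recent events are
    weighted more heavily than older ones. `alpha` here is an
    illustrative assumption, not a documented TraceStax value."""
    return alpha * observation + (1 - alpha) * current
```

With `alpha = 0.1`, a job that settles at a new, slower duration pulls the baseline most of the way there within a few dozen events, which is how a gradual slowdown becomes the new normal.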

TraceStax currently generates alerts for:

  • Duration spike — p95 duration exceeds the baseline by the configured σ threshold
  • Failure rate increase — rolling failure rate exceeds the baseline plus the threshold
  • Queue stall — no successful events for a job within its expected period
  • Worker disappearance — a previously active worker stops sending heartbeats
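The last two conditions reduce to time-since-last-event checks. A sketch under assumed parameter names (`expected_period` and the heartbeat grace multiplier are illustrative, not documented values):

```python
from datetime import datetime, timedelta

def is_queue_stalled(last_success, expected_period, now):
    """Queue stall: no successful events within the job's expected period."""
    return now - last_success > expected_period

def worker_disappeared(last_heartbeat, heartbeat_interval, now, grace=3):
    """Worker disappearance: a previously active worker has missed
    several consecutive heartbeats (`grace` is an assumed multiplier)."""
    return now - last_heartbeat > grace * heartbeat_interval
```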

You can silence alerts for a job or queue for a defined period — useful during deployments or known maintenance windows. Silence windows are configured from the job detail page or via the API.