TraceStax Docs

Event model

Everything TraceStax knows about your background jobs arrives as events sent by the SDK (or directly via the Ingest API). There are three event types, each serving a distinct purpose: `task_event`, `heartbeat`, and `snapshot`.

A `task_event` is emitted each time something notable happens to a job: it starts, succeeds, fails, is retried, stalls, or is revoked. This is the primary signal TraceStax uses for anomaly detection and failure rate tracking.

Fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | `"task_event"` | Yes | Discriminator field identifying the event type |
| `framework` | string | Yes | The job framework (e.g. `celery`, `bullmq`, `sidekiq`, `rq`, `oban`) |
| `language` | string | Yes | The runtime language (e.g. `python`, `node`, `ruby`, `elixir`) |
| `sdk_version` | string | Yes | The version of the TraceStax SDK in use |
| `worker.key` | string | Yes | Stable identifier for the worker process. Defaults to `hostname:pid`; overridable via SDK config |
| `worker.hostname` | string | Yes | Hostname of the machine running the worker |
| `worker.pid` | integer | Yes | Process ID of the worker |
| `worker.concurrency` | integer | Yes | Maximum number of concurrent jobs this worker processes |
| `worker.queues` | string[] | Yes | List of queue names this worker is consuming |
| `task.name` | string | Yes | The fully qualified job class or function name |
| `task.id` | string | Yes | Unique identifier for this specific job execution |
| `task.queue` | string | Yes | The queue this job was dispatched to |
| `task.attempt` | integer | Yes | Which attempt this is (1 for the first attempt, 2 for the first retry, and so on) |
| `task.parent_id` | string | No | ID of the parent job, if this job was spawned by another |
| `task.chain_id` | string | No | Identifier grouping all jobs in a workflow chain |
| `status` | string | Yes | One of: `started`, `succeeded`, `failed`, `retried`, `stalled`, `revoked` |
| `metrics.duration_ms` | integer | No | Wall-clock time from start to finish, in milliseconds. Present on `succeeded`, `failed`, and `retried` |
| `metrics.queued_ms` | integer | No | Time from job enqueue to job start, in milliseconds. Present when the framework exposes the enqueue timestamp |
| `error.type` | string | No | Exception class name. Present on `failed` and `retried` |
| `error.message` | string | No | Exception message. Present on `failed` and `retried` |
| `error.stack_trace` | string | No | Full stack trace as a single string. Present on `failed` and `retried` |

Status values:

| Status | Meaning |
| --- | --- |
| `started` | The job has been picked up by a worker and has begun execution |
| `succeeded` | The job completed without error |
| `failed` | The job raised an unhandled exception and will not be retried (retries exhausted or a non-retryable error) |
| `retried` | The job raised an error and has been re-enqueued for another attempt |
| `stalled` | The job was in progress but the worker stopped reporting; the framework has returned it to the queue |
| `revoked` | The job was cancelled before or during execution |

Example payload:

```json
{
  "type": "task_event",
  "framework": "celery",
  "language": "python",
  "sdk_version": "0.4.1",
  "worker": {
    "key": "worker-prod-1:14523",
    "hostname": "worker-prod-1.internal",
    "pid": 14523,
    "concurrency": 8,
    "queues": ["default", "email"]
  },
  "task": {
    "name": "app.tasks.email.send_welcome_email",
    "id": "3c8e4f12-7a1b-4d2e-9f3a-0b5c6d7e8f90",
    "queue": "email",
    "attempt": 2,
    "parent_id": null,
    "chain_id": "a1b2c3d4-onboarding-flow"
  },
  "status": "retried",
  "metrics": {
    "duration_ms": 1842,
    "queued_ms": 312
  },
  "error": {
    "type": "SMTPConnectError",
    "message": "Connection refused to smtp.example.com:587",
    "stack_trace": "Traceback (most recent call last):\n File \"...\"\nSMTPConnectError: Connection refused"
  }
}
```
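To make the lifecycle concrete, here is a minimal Python sketch (not the TraceStax SDK) that wraps a job function and emits `started`, then `succeeded`, `failed`, or `retried`, with `metrics.duration_ms` and the `error` fields populated as the tables above describe. `emit`, `run_with_events`, and `max_attempts` are hypothetical names, and the envelope fields (`framework`, `language`, `sdk_version`, `worker`) are omitted for brevity.

```python
import time
import traceback
import uuid


def run_with_events(task_fn, *, name, queue, attempt, emit, max_attempts=3):
    """Run task_fn and emit TraceStax-style task_events around it.

    `emit` stands in for whatever transport delivers events (an SDK
    buffer, an HTTP POST to the Ingest API, etc.).
    """
    base = {
        "type": "task_event",
        "task": {"name": name, "id": str(uuid.uuid4()),
                 "queue": queue, "attempt": attempt},
    }
    emit({**base, "status": "started"})
    started = time.monotonic()
    try:
        result = task_fn()
    except Exception as exc:
        duration_ms = int((time.monotonic() - started) * 1000)
        # An error with attempts remaining is "retried"; once attempts
        # are exhausted the job is terminally "failed".
        status = "retried" if attempt < max_attempts else "failed"
        emit({**base, "status": status,
              "metrics": {"duration_ms": duration_ms},
              "error": {"type": type(exc).__name__,
                        "message": str(exc),
                        "stack_trace": traceback.format_exc()}})
        raise  # let the framework's retry machinery take over
    duration_ms = int((time.monotonic() - started) * 1000)
    emit({**base, "status": "succeeded",
          "metrics": {"duration_ms": duration_ms}})
    return result
```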

A `heartbeat` event is sent periodically by a running worker to confirm it is still alive. The SDK sends a heartbeat on startup and then at a regular interval (default: every 30 seconds). TraceStax uses heartbeats to populate the Worker Fleet view and to fire alerts when expected workers go offline.

Fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | `"heartbeat"` | Yes | Discriminator field |
| `framework` | string | Yes | The job framework |
| `worker` | object | Yes | The same worker object as in `task_event` (`key`, `hostname`, `pid`, `concurrency`, `queues`) |
| `timestamp` | string | Yes | ISO 8601 timestamp of when the heartbeat was generated |

Example payload:

```json
{
  "type": "heartbeat",
  "framework": "bullmq",
  "worker": {
    "key": "api-worker-7:9801",
    "hostname": "api-worker-7.internal",
    "pid": 9801,
    "concurrency": 4,
    "queues": ["notifications", "webhooks"]
  },
  "timestamp": "2026-03-24T14:22:00.000Z"
}
```
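The documented behaviour (one heartbeat on startup, then one per interval) can be sketched as a small background loop. This is an illustration, not the SDK's implementation; `start_heartbeat` and its parameters are hypothetical names.

```python
import datetime
import threading


def start_heartbeat(framework, worker, emit, interval=30.0):
    """Emit a heartbeat immediately, then one per `interval` seconds,
    from a daemon thread. `emit` stands in for the real delivery
    transport; `worker` is the key/hostname/pid/concurrency/queues
    object shown above."""
    stop = threading.Event()

    def beat():
        while not stop.is_set():
            emit({
                "type": "heartbeat",
                "framework": framework,
                "worker": worker,
                # ISO 8601 with a trailing Z, matching the example payload
                "timestamp": datetime.datetime.now(datetime.timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            })
            stop.wait(interval)  # returns early once stop is set

    threading.Thread(target=beat, daemon=True).start()
    return stop.set  # call the returned function to stop the loop
```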

A `snapshot` event is sent by the SDK every 60 seconds and reports the current state of all queues the worker is aware of. Unlike `task_event`, which reflects individual job executions, a `snapshot` gives TraceStax a point-in-time view of queue depth (how many jobs are waiting, active, and failed) along with a throughput measurement.

Fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | `"snapshot"` | Yes | Discriminator field |
| `framework` | string | Yes | The job framework |
| `worker_key` | string | Yes | The stable worker key, used to attribute the snapshot to a specific worker process |
| `queues` | array | Yes | One entry per queue the worker monitors (see below) |
| `timestamp` | string | Yes | ISO 8601 timestamp of when the snapshot was taken |

Queue entry fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | string | Yes | Queue name |
| `depth` | integer | Yes | Number of jobs waiting to be processed |
| `active` | integer | Yes | Number of jobs currently being processed across all workers |
| `failed` | integer | Yes | Number of jobs in the failed/dead-letter state |
| `throughput_per_min` | number | Yes | Jobs completed per minute over the last measurement window, as reported by the framework |

Example payload:

```json
{
  "type": "snapshot",
  "framework": "sidekiq",
  "worker_key": "sidekiq-prod-3:22041",
  "queues": [
    {
      "name": "default",
      "depth": 142,
      "active": 10,
      "failed": 3,
      "throughput_per_min": 47.2
    },
    {
      "name": "critical",
      "depth": 0,
      "active": 2,
      "failed": 0,
      "throughput_per_min": 8.1
    }
  ],
  "timestamp": "2026-03-24T14:22:00.000Z"
}
```
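One thing snapshots make possible is simple backlog arithmetic: `depth` divided by `throughput_per_min` estimates how long a queue will take to drain at its current rate. The helper below is a rough consumer-side heuristic, not an official TraceStax metric.

```python
def minutes_to_drain(queue_entry):
    """Estimated minutes to clear one queue's backlog, from a single
    snapshot queue entry, assuming throughput stays constant.

    Returns 0.0 for an empty queue and infinity for a non-empty queue
    with zero measured throughput (a likely stall)."""
    throughput = queue_entry["throughput_per_min"]
    depth = queue_entry["depth"]
    if throughput <= 0:
        return float("inf") if depth else 0.0
    return depth / throughput
```

Applied to the example above, the `default` queue (depth 142 at 47.2 jobs/min) would take roughly three minutes to drain.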
| Event type | Used for |
| --- | --- |
| `task_event` | Anomaly detection (duration, failure rate, throughput), job history, error tracking |
| `heartbeat` | Worker Fleet view, online/offline status, fleet size alerts |
| `snapshot` | Queue depth charts, backlog detection, queue stall detection |

These three signals are complementary. `task_event` data tells you what happened to individual jobs; `snapshot` data tells you the state of the queue at a point in time; `heartbeat` data tells you which workers are alive to process that queue.
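Because every event carries the `type` discriminator, a consumer can route the three event kinds with a simple lookup. A minimal sketch, with `route_event` and `handlers` as hypothetical names:

```python
def route_event(event, handlers):
    """Dispatch an ingested event on its `type` discriminator.

    `handlers` maps "task_event" / "heartbeat" / "snapshot" to
    callables; unknown types are rejected rather than silently dropped.
    """
    try:
        handler = handlers[event["type"]]
    except KeyError:
        raise ValueError(f"unknown event type: {event.get('type')!r}")
    return handler(event)
```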

| Limit | Value |
| --- | --- |
| Maximum event size | 64 KB per event |
| Maximum batch size | 100 events per `POST` to `/v1/ingest` |

Events larger than 64 KB are rejected with a 413 response. The SDK truncates stack traces and error messages if necessary to stay within this limit. Batches exceeding 100 events are also rejected; the SDK automatically splits large batches into multiple requests.
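If you send events directly to the Ingest API rather than through the SDK, you must respect both limits yourself. The sketch below illustrates them; note that where the real SDK truncates oversized stack traces, this simplified version just filters oversized events out. `prepare_batches` is a hypothetical helper, not part of any TraceStax SDK.

```python
import json

MAX_EVENT_BYTES = 64 * 1024   # events above this get a 413
MAX_BATCH_EVENTS = 100        # per POST to /v1/ingest


def prepare_batches(events):
    """Split events into ingest-sized batches.

    Oversized events are dropped here for simplicity; the SDK instead
    truncates stack traces and error messages to fit under 64 KB."""
    sendable = [
        e for e in events
        if len(json.dumps(e).encode("utf-8")) <= MAX_EVENT_BYTES
    ]
    return [
        sendable[i:i + MAX_BATCH_EVENTS]
        for i in range(0, len(sendable), MAX_BATCH_EVENTS)
    ]
```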