Deploy & observability
Healthcheck, scheduler states, probes, alerts, and the incident playbook.
The daemon (pnpm start) opens an HTTP server alongside the runtime for external systems to probe. Two endpoints, two purposes:
- GET /ready returns 200 whenever the process is alive. It's the liveness probe.
- GET /health returns 200 if all plugins are healthy and 503 if any is unhealthy. It's the readiness probe.
Set the port via HEALTH_PORT (default 3030; 0 disables it).
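
As a rough sketch of the two contracts, the status codes can be modeled as pure functions of the plugin states. The type and function names are hypothetical, and treating a running plugin as "not unhealthy" is an assumption; the daemon's actual handler may differ.

```typescript
// Hypothetical model of the two endpoints; names are illustrative only.
type PluginState = "healthy" | "running" | "unhealthy";

// GET /ready: 200 whenever the process can answer at all.
function readyStatus(): number {
  return 200;
}

// GET /health: 503 if any plugin is unhealthy, otherwise 200.
// Assumption: a plugin that is mid-tick ("running") does not trip the 503.
function healthStatus(states: PluginState[]): number {
  return states.some((s) => s === "unhealthy") ? 503 : 200;
}
```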
Scheduler states
Every plugin is in one of three states:
| State | What triggers it | What it means |
|---|---|---|
| healthy | Last tick completed without throwing | Ticking normally |
| running | Tick executing right now | Anti-overlap: a cron firing during execution is skipped |
| unhealthy | Last tick threw | Skips the next 5 ticks, then auto-retries. Success → healthy; throw → another skip |
What counts as "the tick threw": decide() threw, the executor threw (a client error such as RPC down, signer rejected, preflight, or auto-fund pricing), or an on-chain abort (effectsStatus === "failure"). What does not count: a noop outcome, hook exceptions (caught in the runtime), and dry-run failures (which bypass the scheduler).
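
The transitions in the table can be sketched as a small state machine. This is a minimal model under stated assumptions, not the scheduler's actual code: Slot, onCronFire, onTickDone, and SKIP_TICKS are hypothetical names, and the outcome "ok" stands for any tick that did not throw (including a noop).

```typescript
// Hypothetical model of the scheduler states described above.
type State = "healthy" | "running" | "unhealthy";
type Outcome = "ok" | "threw";

interface Slot {
  state: State;
  skipsLeft: number; // ticks still to skip while unhealthy
}

const SKIP_TICKS = 5; // from the table: unhealthy skips the next 5 ticks

// Called when cron fires. Returns true if the tick should actually execute.
function onCronFire(slot: Slot): boolean {
  if (slot.state === "running") return false; // anti-overlap: skip this firing
  if (slot.state === "unhealthy" && slot.skipsLeft > 0) {
    slot.skipsLeft -= 1; // burn one of the skipped ticks
    return false;
  }
  slot.state = "running";
  return true;
}

// Called when the tick finishes. A throw or an on-chain abort maps to "threw".
function onTickDone(slot: Slot, outcome: Outcome): void {
  if (outcome === "ok") {
    slot.state = "healthy";
    slot.skipsLeft = 0;
  } else {
    slot.state = "unhealthy";
    slot.skipsLeft = SKIP_TICKS;
  }
}
```

In this model a plugin that throws sits out five cron firings, runs again on the sixth, and either returns to healthy or starts another five-tick skip.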
Configuring probes (Kubernetes)
livenessProbe:
  httpGet: { path: /ready, port: 3030 }
  periodSeconds: 30
  failureThreshold: 3 # restart after 90s with no response
readinessProbe:
  httpGet: { path: /health, port: 3030 }
  periodSeconds: 15
  failureThreshold: 2 # pull from the service after 30s unhealthy

/ready asks "does the Node process respond?", so it should almost never fire. /health asks "is every plugin working?", and a 503 there doesn't mean the process is broken, only that it shouldn't receive traffic. On a PaaS that lets you pick just one, use /health; it's an informative superset of /ready.
On a single box without K8s, a 60s cron running curl -fsS http://localhost:3030/health || <alert> is enough; -f makes curl exit non-zero on a 503.
The /health body
{
  "ok": true,
  "plugins": [
    {
      "name": "my-strategy",
      "state": "healthy",
      "metrics": {
        "ticks": 240, "noop": 235, "submitted": 4, "failed": 1,
        "consecutiveNoops": 120, "lastTickMs": 1700000000000
      }
    }
  ],
  "pid": 12345, "uptimeSec": 600
}

ok is the AND of every plugin's state. The body has the same shape for 200 and 503, and only the status code changes, so you can poll with curl -s and parse the JSON every time. metrics is null until the plugin has ticked once, so give it two intervals of grace before alerting on "metrics missing".
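
A minimal sketch of consuming that body follows. The interfaces are inferred from the sample above rather than taken from a published schema, and unhealthyPlugins, missingMetrics, and graceSec are hypothetical names. Using uptimeSec as a stand-in for "time since the plugin started" is also an assumption.

```typescript
// Shape inferred from the sample body above; treat it as an assumption.
interface PluginMetrics {
  ticks: number; noop: number; submitted: number; failed: number;
  consecutiveNoops: number; lastTickMs: number;
}
interface PluginHealth { name: string; state: string; metrics: PluginMetrics | null; }
interface HealthBody { ok: boolean; plugins: PluginHealth[]; pid: number; uptimeSec: number; }

// Names of plugins currently reported unhealthy.
function unhealthyPlugins(body: HealthBody): string[] {
  return body.plugins.filter((p) => p.state === "unhealthy").map((p) => p.name);
}

// Plugins whose metrics are still null once the grace period has passed.
// graceSec should be roughly two tick intervals of the slowest plugin.
function missingMetrics(body: HealthBody, graceSec: number): string[] {
  if (body.uptimeSec < graceSec) return []; // too early to judge
  return body.plugins.filter((p) => p.metrics === null).map((p) => p.name);
}
```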
Alerts
- Stuck plugin (noop streak): the runtime emits a single warn log when it hits noopWarnThreshold (default 100) consecutive noops: {"msg": "plugin stuck in noop loop", "consecutiveNoops": 100}. Alert on that line; the runner already did the math. It fires once per streak.
- Failing plugin: failed climbing while submitted stays flat means client or on-chain errors. Drill into the "scheduler: handler failed" logs. An "auto-fund:" prefix points to pricing, "not permitted by vault" to permissions, and a Move abort code to a digest you can check on the explorer.
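
The "failed climbing while submitted stays flat" rule can be expressed as a comparison of two consecutive polls. This is only a sketch; Counters and failingWithoutProgress are hypothetical names, and a real rule would likely look at a longer window than two samples.

```typescript
// Counter subset taken from the /health metrics object.
interface Counters { submitted: number; failed: number; }

// True when failed rose between two polls while submitted did not move.
// A process restart resets the counters, in which case this returns false.
function failingWithoutProgress(prev: Counters, curr: Counters): boolean {
  return curr.failed > prev.failed && curr.submitted === prev.submitted;
}
```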
The metrics are raw counters by design, so you build the rules in your own alerting system. The runner doesn't expose Prometheus directly; most teams write a sidecar that polls /health, parses it, and exposes /metrics.
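
The formatting half of such a sidecar might look like the sketch below, which turns the plugins array into Prometheus text exposition lines. The runner_* metric names and the Plugin and Metrics interfaces are made up for illustration; polling /health and serving /metrics over HTTP are left out.

```typescript
interface Metrics {
  ticks: number; noop: number; submitted: number; failed: number;
  consecutiveNoops: number; lastTickMs: number;
}
interface Plugin { name: string; state: string; metrics: Metrics | null; }

// Escape a label value per the Prometheus text format (backslash, quote, newline).
function escapeLabel(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one sample line per counter, labeled by plugin name.
function toPrometheus(plugins: Plugin[]): string {
  const lines: string[] = [];
  for (const p of plugins) {
    const label = `{plugin="${escapeLabel(p.name)}"}`;
    lines.push(`runner_plugin_unhealthy${label} ${p.state === "unhealthy" ? 1 : 0}`);
    if (p.metrics === null) continue; // plugin has not ticked yet
    const m = p.metrics;
    lines.push(`runner_ticks_total${label} ${m.ticks}`);
    lines.push(`runner_noop_total${label} ${m.noop}`);
    lines.push(`runner_submitted_total${label} ${m.submitted}`);
    lines.push(`runner_failed_total${label} ${m.failed}`);
    lines.push(`runner_consecutive_noops${label} ${m.consecutiveNoops}`);
  }
  return lines.join("\n") + "\n";
}
```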
Incident playbook
When /health returns 503:
1. Hit /health and parse it. Find which plugin is unhealthy. Several at once is probably systemic (RPC down); just one is plugin-specific.
2. Tail the logs filtered by plugin name. The binding label is on every line. Look about 10 minutes back.
3. Check whether it's transient. If it oscillates unhealthy → healthy → unhealthy, the cause is intermittent (RPC flapping, a signal source 503ing now and then) and recovers on its own on most ticks.
4. Run a dry-run. pnpm dry-run my-strategy reproduces the tick without burning gas. If it fails the same way, the problem is in decide() or pricing; if it succeeds, it's on the real submission path (signer, network).
5. Inspect the state. pnpm inspect my-strategy shows the wallet balance, vault state, permissions, and the persisted keys from ctx.state.
6. Disable the plugin temporarily. Remove its line from src/plugins/index.ts and redeploy. Faster than fixing it in production.
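
Steps 1 and 3 of the playbook lend themselves to a small helper. The sketch below assumes you keep a short history of polled states per plugin; triage, isFlapping, and the minFlips threshold are hypothetical and would need tuning.

```typescript
type State = "healthy" | "running" | "unhealthy";

// Step 1: several unhealthy plugins at once suggests a shared cause.
function triage(states: State[]): "ok" | "plugin-specific" | "systemic" {
  const bad = states.filter((s) => s === "unhealthy").length;
  if (bad === 0) return "ok";
  return bad === 1 ? "plugin-specific" : "systemic";
}

// Step 3: count healthy/unhealthy flips in one plugin's recent samples.
// "running" samples are ignored because they say nothing about the outcome.
function isFlapping(history: State[], minFlips = 2): boolean {
  const settled = history.filter((s) => s !== "running");
  let flips = 0;
  for (let i = 1; i < settled.length; i++) {
    if (settled[i] !== settled[i - 1]) flips++;
  }
  return flips >= minFlips;
}
```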
Log shipping
ctx.logger emits structured JSON, one line per event, via process.stdout.write, with no rotation or buffering. It ships cleanly through any tail-based collector (Vector, Fluent Bit, Datadog, Loki).
{"ts":"2026-05-26T15:30:00.123Z","level":"info","msg":"...","plugin":"my-strategy","vaultId":"0x..."}

- Bigint gotcha: the logger converts bigints to strings (native JSON.stringify throws on bigint). So gasUsed: 12345n shows up as "gasUsed": "12345". Your downstream parser sees strings, not numbers, for fields that came from a bigint.
- Level via env: LOG_LEVEL (default info). debug shows auto-fund deliberations, vault cache hits, and scheduler skips, so don't run it in production (it's expensive).
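
To illustrate the bigint gotcha, here is one common way to make JSON.stringify tolerate bigints and to recover them downstream. It is a sketch of the general technique, not the logger's actual implementation; bigintSafe and reviveBigint are hypothetical names.

```typescript
// JSON.stringify throws a TypeError on bigint values, so swap them for strings.
function bigintSafe(_key: string, value: unknown): unknown {
  return typeof value === "bigint" ? value.toString() : value;
}

// Downstream side: turn a stringified integer field back into a bigint.
// The log line does not mark which fields were bigints, so the consumer
// has to know which ones to revive.
function reviveBigint(field: unknown): bigint | undefined {
  return typeof field === "string" && /^-?\d+$/.test(field) ? BigInt(field) : undefined;
}

const line = JSON.stringify({ msg: "tick", gasUsed: BigInt(12345) }, bigintSafe);
// line is '{"msg":"tick","gasUsed":"12345"}'
```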
When /health lies
The state field reflects only what the *scheduler* knows. It does not cover external liveness (if the signal source goes down, the plugin still ticks and returns noop, which looks healthy), slow ticks (a 25s tick blocks the next cron firing and reports running, not unhealthy), or stale state (the plugin reads a cooldown left over from an old deploy, the tick passes, health is green, and the behavior is still wrong). For the first two, watch consecutiveNoops and lastTickMs. The third is a data-integrity problem; add validation in decide() itself.
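
A check on those two fields could be sketched as follows. It assumes lastTickMs is an epoch timestamp in milliseconds (the sample value suggests so), and hiddenProblems, maxSilenceMs, and noopLimit are hypothetical names with thresholds you would tune per plugin.

```typescript
interface TickMetrics { consecutiveNoops: number; lastTickMs: number; }

// Flag conditions that the scheduler state alone does not reveal.
function hiddenProblems(
  m: TickMetrics,
  nowMs: number,
  maxSilenceMs: number, // longest acceptable gap since the last tick
  noopLimit: number,    // noop streak length that deserves a look
): string[] {
  const problems: string[] = [];
  if (nowMs - m.lastTickMs > maxSilenceMs) problems.push("stale-tick");
  if (m.consecutiveNoops >= noopLimit) problems.push("noop-streak");
  return problems;
}
```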