Deploy & observability

Healthcheck, scheduler states, probes, alerts, and the incident playbook.

Derived from runner/notes/deploy-and-observability.md

The daemon (pnpm start) opens an HTTP server alongside the runtime for external systems to probe. Two endpoints, two purposes:

GET /ready returns 200 whenever the process is alive. It's the liveness probe.
GET /health returns 200 if all plugins are healthy and 503 if any is unhealthy. It's the readiness probe.

Set the port via HEALTH_PORT (default 3030; 0 disables it).

Scheduler states

Every plugin is in one of three states:

State	What triggers it	What it means
`healthy`	Last tick completed without throwing	Ticking normally
`running`	Tick executing right now	Anti-overlap: a cron firing during execution is skipped
`unhealthy`	Last tick threw	Skips the next 5 ticks, then auto-retries. Success → healthy; throw → another skip

What counts as "the tick threw": decide() threw, the executor threw (a client error such as RPC down, signer rejected, preflight, or auto-fund pricing), or an on-chain abort (effectsStatus === "failure"). What does not count: a noop outcome, hook exceptions (caught in the runtime), and dry-run failures (which bypass the scheduler).
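The state transitions above can be sketched as a small state machine. This is a hypothetical model for reasoning about the table, not the runner's actual scheduler: the class name `PluginSchedule`, the `fire` method, and the synchronous `tick` callback are all assumptions made for illustration.

```typescript
// Hypothetical model of the scheduler states described above; not the runner's source.
type PluginState = "healthy" | "running" | "unhealthy";

const SKIP_TICKS = 5; // an unhealthy plugin skips the next 5 ticks

class PluginSchedule {
  state: PluginState = "healthy";
  private skipsLeft = 0;

  // Call once per cron firing. `tick` is synchronous here for brevity.
  fire(tick: () => void): "ran" | "skipped" {
    if (this.state === "running") return "skipped"; // anti-overlap
    if (this.state === "unhealthy" && this.skipsLeft > 0) {
      this.skipsLeft -= 1; // still cooling down
      return "skipped";
    }
    this.state = "running";
    try {
      tick();
      this.state = "healthy"; // completed without throwing (a noop lands here too)
    } catch {
      this.state = "unhealthy"; // the tick threw
      this.skipsLeft = SKIP_TICKS;
    }
    return "ran";
  }
}
```

In this model a retry that throws again resets the skip counter, which matches the "throw → another skip" row in the table.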

Configuring probes (Kubernetes)

livenessProbe:
  httpGet: { path: /ready, port: 3030 }
  periodSeconds: 30
  failureThreshold: 3   # restart after 90s with no response

readinessProbe:
  httpGet: { path: /health, port: 3030 }
  periodSeconds: 15
  failureThreshold: 2   # pull from the service after 30s unhealthy

/ready asks "does the Node process respond?", so that probe should almost never fail. /health asks "are all plugins working?", and a 503 there doesn't mean the process is broken, only that it shouldn't receive traffic. On a PaaS that lets you pick just one, use /health; it's an informative superset of /ready.

On a single box without K8s, a 60s cron running curl -fsS http://localhost:3030/health || <alert> is enough; -f makes curl exit non-zero on a 503.

The `/health` body

{
  "ok": true,
  "plugins": [
    {
      "name": "my-strategy",
      "state": "healthy",
      "metrics": {
        "ticks": 240, "noop": 235, "submitted": 4, "failed": 1,
        "consecutiveNoops": 120, "lastTickMs": 1700000000000
      }
    }
  ],
  "pid": 12345, "uptimeSec": 600
}

ok is the logical AND across plugins: one unhealthy plugin makes it false. The body has the same shape for 200 and 503, so you can poll with curl -s and parse the JSON every time. metrics is null until the plugin has ticked once, so give it two intervals of grace before alerting on "metrics missing".
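A poller might type the body and apply the grace period roughly like this. The interfaces are inferred from the sample above, and `unhealthyPlugins`, `missingMetrics`, and the `tickIntervalSec` parameter are hypothetical names, so treat this as a sketch rather than a published schema.

```typescript
// Types inferred from the sample body above; anything not shown there is an assumption.
interface PluginMetrics {
  ticks: number;
  noop: number;
  submitted: number;
  failed: number;
  consecutiveNoops: number;
  lastTickMs: number;
}

interface PluginHealth {
  name: string;
  state: "healthy" | "running" | "unhealthy";
  metrics: PluginMetrics | null; // null until the plugin has ticked once
}

interface HealthBody {
  ok: boolean;
  plugins: PluginHealth[];
  pid: number;
  uptimeSec: number;
}

// Names of plugins the scheduler currently reports as unhealthy.
function unhealthyPlugins(body: HealthBody): string[] {
  return body.plugins.filter((p) => p.state === "unhealthy").map((p) => p.name);
}

// Plugins still without metrics after a grace period of two tick intervals.
// `tickIntervalSec` is a hypothetical input: the poller has to know the cron interval.
function missingMetrics(body: HealthBody, tickIntervalSec: number): string[] {
  if (body.uptimeSec < 2 * tickIntervalSec) return []; // still within grace
  return body.plugins.filter((p) => p.metrics === null).map((p) => p.name);
}
```

Because the status code does not change the shape, the same `JSON.parse` result can be passed to both helpers whether the response was a 200 or a 503.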

Alerts

Stuck plugin (noop streak): the runtime emits a single warn log when it hits noopWarnThreshold (default 100) consecutive noops: {"msg": "plugin stuck in noop loop", "consecutiveNoops": 100}. Alert on that line; the runner already did the math. It fires once per streak.
Failing plugin: failed climbing while submitted stays flat means client or on-chain errors. Drill into the `scheduler: handler failed` logs. An `auto-fund:` prefix points to pricing, `not permitted by vault` points to permissions, and a Move abort code points to a digest you can check on the explorer.
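Both alerts can be expressed as small predicates in whatever system consumes the logs and counters. The sketch below assumes the log line is the JSON object shown above and that you keep the previous `/health` snapshot around; the function names and the exact shape of the error message are hypothetical.

```typescript
interface Counters {
  submitted: number;
  failed: number;
}

// True for the single warn line the runtime emits per noop streak.
function isStuckWarning(logLine: string): boolean {
  try {
    const entry = JSON.parse(logLine);
    return entry !== null && typeof entry === "object" && entry.msg === "plugin stuck in noop loop";
  } catch {
    return false; // not a JSON line
  }
}

// "failed climbing while submitted stays flat", comparing two snapshots.
function isFailingWithoutSubmitting(prev: Counters, curr: Counters, minNewFailures = 1): boolean {
  const newFailures = curr.failed - prev.failed;
  const newSubmissions = curr.submitted - prev.submitted;
  // Negative deltas mean the counters were reset (for example a restart): do not alert.
  if (newFailures < 0 || newSubmissions < 0) return false;
  return newFailures >= minNewFailures && newSubmissions === 0;
}

// First-pass routing of an error message. Assumes the markers appear verbatim in the text.
function failureArea(message: string): "pricing" | "permissions" | "unknown" {
  if (message.startsWith("auto-fund:")) return "pricing";
  if (message.includes("not permitted by vault")) return "permissions";
  return "unknown";
}
```

Move aborts are left out of `failureArea` because the source does not say how the abort code appears in the message.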

The metrics are raw counters by design, so you build the rules in your own alerting system. The runner doesn't expose Prometheus directly; most teams write a sidecar that polls /health, parses it, and exposes /metrics.
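The translation step of such a sidecar could look like the following sketch, which renders a parsed `/health` body in the Prometheus text exposition format. The metric names (`runner_ok`, `runner_plugin_ticks_total`, and so on) are invented for illustration, and the HTTP polling and serving parts are omitted.

```typescript
interface PluginMetrics {
  ticks: number;
  noop: number;
  submitted: number;
  failed: number;
  consecutiveNoops: number;
  lastTickMs: number;
}
interface PluginHealth {
  name: string;
  state: string;
  metrics: PluginMetrics | null;
}
interface HealthBody {
  ok: boolean;
  plugins: PluginHealth[];
}

// Escape backslash, double quote, and newline, as label values require.
const escapeLabel = (v: string): string =>
  v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

// Hypothetical metric names; [field, metric name, Prometheus type].
const SERIES: Array<[keyof PluginMetrics, string, "counter" | "gauge"]> = [
  ["ticks", "runner_plugin_ticks_total", "counter"],
  ["noop", "runner_plugin_noop_total", "counter"],
  ["submitted", "runner_plugin_submitted_total", "counter"],
  ["failed", "runner_plugin_failed_total", "counter"],
  ["consecutiveNoops", "runner_plugin_consecutive_noops", "gauge"],
  ["lastTickMs", "runner_plugin_last_tick_ms", "gauge"],
];

function toPrometheus(body: HealthBody): string {
  const lines: string[] = ["# TYPE runner_ok gauge", `runner_ok ${body.ok ? 1 : 0}`];

  lines.push("# TYPE runner_plugin_unhealthy gauge");
  for (const p of body.plugins) {
    const flag = p.state === "unhealthy" ? 1 : 0;
    lines.push(`runner_plugin_unhealthy{plugin="${escapeLabel(p.name)}"} ${flag}`);
  }

  // Group samples by metric so each TYPE line appears exactly once.
  for (const [field, metric, kind] of SERIES) {
    lines.push(`# TYPE ${metric} ${kind}`);
    for (const p of body.plugins) {
      if (p.metrics === null) continue; // plugin has not ticked yet
      lines.push(`${metric}{plugin="${escapeLabel(p.name)}"} ${p.metrics[field]}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

Exposing the counters as-is keeps the rule logic (rates, deltas, thresholds) in the alerting system, in line with the raw-counter design described above.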

Incident playbook

When /health returns 503:

1. Hit /health and parse it. Find which plugin is unhealthy. Several at once is probably systemic (RPC down); just one is plugin-specific.
2. Tail the logs filtered by plugin name. The binding label is on every line. Look about 10 minutes back.
3. Check whether it's transient. If it oscillates unhealthy → healthy → unhealthy, the cause is intermittent (RPC flapping, a signal source 503ing now and then) and recovers on its own on most ticks.
4. Run a dry-run. pnpm dry-run my-strategy reproduces the tick without burning gas. If it fails the same way, the problem is in decide() or pricing; if it succeeds, it's on the real submission path (signer, network).
5. Inspect the state. pnpm inspect my-strategy shows the wallet balance, vault state, permissions, and the persisted keys from ctx.state.
6. Disable the plugin temporarily. Remove its line from src/plugins/index.ts and redeploy. Faster than fixing it in production.
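Steps 1 and 3 of the playbook are mechanical enough to script. The sketch below is one possible heuristic, with hypothetical function names; `history` is assumed to be a list of states you sampled from `/health` over time, oldest first.

```typescript
type PluginState = "healthy" | "running" | "unhealthy";
type Scope = "none" | "plugin-specific" | "probably-systemic";

// Step 1: several unhealthy plugins at once suggests a shared cause such as RPC.
function incidentScope(
  plugins: Array<{ name: string; state: PluginState }>
): { scope: Scope; unhealthy: string[] } {
  const unhealthy = plugins.filter((p) => p.state === "unhealthy").map((p) => p.name);
  const scope: Scope =
    unhealthy.length === 0 ? "none" : unhealthy.length === 1 ? "plugin-specific" : "probably-systemic";
  return { scope, unhealthy };
}

// Step 3: a plugin that changes between healthy and unhealthy at least twice
// in the sampled window is treated as intermittent.
function isFlapping(history: PluginState[]): boolean {
  const settled = history.filter((s) => s !== "running"); // "running" says nothing about health
  let transitions = 0;
  for (let i = 1; i < settled.length; i++) {
    if (settled[i] !== settled[i - 1]) transitions += 1;
  }
  return transitions >= 2;
}
```

With a single-plugin deployment `incidentScope` can never report a systemic cause, so the log check in step 2 still applies.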

Log shipping

ctx.logger emits structured JSON, one line per event, via process.stdout.write, with no rotation or buffering. It ships cleanly through any tail-based collector (Vector, Fluent Bit, Datadog, Loki).

{"ts":"2026-05-26T15:30:00.123Z","level":"info","msg":"...","plugin":"my-strategy","vaultId":"0x..."}

Bigint gotcha: the logger converts bigints to strings (native JSON.stringify throws on bigint). So gasUsed: 12345n shows up as "gasUsed": "12345". Your downstream parser sees strings, not numbers, for fields that came from a bigint.
Level via env: LOG_LEVEL (default info). debug shows auto-fund deliberations, vault cache hits, and scheduler skips, so don't run it in production (it's expensive).
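The bigint behavior can be reproduced with a `JSON.stringify` replacer, and undone downstream for the fields you know were bigints. This is a sketch of the general technique, not the logger's actual implementation; `stringifyWithBigints` and `reviveBigints` are hypothetical helper names.

```typescript
// Serialize like the logger is described to: bigints become decimal strings.
function stringifyWithBigints(value: unknown): string {
  return JSON.stringify(value, (_key, v) => (typeof v === "bigint" ? v.toString() : v));
}

// Downstream: turn named string fields back into bigints.
// The caller must list the fields, because JSON carries no type hint.
function reviveBigints(
  entry: Record<string, unknown>,
  bigintFields: string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...entry };
  for (const field of bigintFields) {
    const raw = out[field];
    // Only convert plain integer strings; leave anything else untouched.
    if (typeof raw === "string" && /^-?\d+$/.test(raw)) out[field] = BigInt(raw);
  }
  return out;
}
```

Converting with `BigInt` rather than `Number` avoids silently losing precision on values above `Number.MAX_SAFE_INTEGER`.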

When `/health` lies

The state field reflects only what the *scheduler* knows. It does not cover three cases. External liveness: if the signal source goes down, the plugin still ticks and returns noop, which looks healthy. Slow ticks: a 25s tick blocks the next cron and reports running, not unhealthy. Stale state: the plugin reads a cooldown left over from an old deploy, the tick passes, health is green, and the behavior is still wrong. For the first two, watch consecutiveNoops and lastTickMs. The third is a data-integrity problem; add validation in decide() itself.
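A check on those two fields might look like the sketch below. It assumes `lastTickMs` is an epoch timestamp in milliseconds (the sample value suggests so, but the source does not say), and the thresholds and the function name `blindSpotWarnings` are hypothetical.

```typescript
interface TickMetrics {
  consecutiveNoops: number;
  lastTickMs: number; // assumed: epoch milliseconds of the last tick
}

interface Thresholds {
  maxTickAgeMs: number; // for example a few cron intervals
  maxConsecutiveNoops: number; // how long a noop streak you tolerate
}

// Warnings for conditions that `state` alone does not surface.
function blindSpotWarnings(m: TickMetrics, nowMs: number, t: Thresholds): string[] {
  const warnings: string[] = [];
  const ageMs = nowMs - m.lastTickMs;
  if (ageMs > t.maxTickAgeMs) {
    warnings.push(`no tick recorded for ${Math.round(ageMs / 1000)}s`); // slow or blocked ticks
  }
  if (m.consecutiveNoops >= t.maxConsecutiveNoops) {
    warnings.push(`${m.consecutiveNoops} consecutive noops`); // possibly a dead signal source
  }
  return warnings;
}
```

A plugin that legitimately idles for long stretches needs a higher `maxConsecutiveNoops`, so the thresholds are best set per plugin.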