Production runbook

The go-live checklist tying probes, alerts, and config into one procedure.

Derived from runner/notes/deploy-and-observability.md

This is the procedure, not the explanation. Run the list top to bottom before flipping a vault live. Each item links to the page that explains the why; here you just check the box.

Rule: if you can't see the daemon and you can't roll it back, it isn't in production yet.

Before the daemon starts

  • RUNNER_PLUGINS set: the CSV of plugins the daemon loads. Confirm every name you expect is in it. trigger and dry-run only see plugins on this list.
  • OPERATOR_PRIVATE_KEY set, never logged: run pnpm show-config and check it reports ✅ set. The command prints the resolved config without secrets, so it's safe to paste into a ticket.
  • SUI_NETWORK correct: testnet | mainnet | devnet, default testnet. A mainnet vault behind a testnet daemon ticks forever and does nothing.
  • LOG_LEVEL is info (or quieter): debug prints auto-fund deliberations and scheduler skips every tick. It's expensive. Don't ship it.
  • Wallet funded: start logs the operator address and SUI balance before the first tick, but it does not assert a threshold. Check the balance yourself. An empty wallet fails every submission as a client error.
  • Dry-run each plugin: pnpm dry-run <plugin> simulates a real tick with no signature and no gas. A green dry-run is your last cheap signal before the daemon owns the keys.

See Configuration reference for the full list and CLI commands for what each verb does.
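
The environment items on this checklist can also be checked mechanically before the daemon starts. The sketch below is not part of the runner: `preflight`, its parameters, and the set of log levels treated as "info or quieter" are assumptions made for illustration. It reads only the variables named above and reports whether the key is present, never its value.

```typescript
// Hypothetical preflight check; not part of the runner.
type Env = Record<string, string | undefined>;

const NETWORKS = ["testnet", "mainnet", "devnet"];
// Assumption: these level names are the ones at "info or quieter".
const QUIET_LEVELS = ["info", "warn", "error"];

function preflight(env: Env, expectedPlugins: string[], expectedNetwork: string): string[] {
  const problems: string[] = [];

  // RUNNER_PLUGINS is a CSV; every expected name must be on it.
  const loaded = (env.RUNNER_PLUGINS ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  for (const name of expectedPlugins) {
    if (loaded.indexOf(name) < 0) problems.push(`RUNNER_PLUGINS is missing "${name}"`);
  }

  // Report presence only, never the value.
  if (!env.OPERATOR_PRIVATE_KEY) problems.push("OPERATOR_PRIVATE_KEY is not set");

  // SUI_NETWORK defaults to testnet when unset.
  const network = env.SUI_NETWORK ?? "testnet";
  if (NETWORKS.indexOf(network) < 0) {
    problems.push(`SUI_NETWORK "${network}" is not testnet, mainnet, or devnet`);
  } else if (network !== expectedNetwork) {
    problems.push(`SUI_NETWORK is ${network}, expected ${expectedNetwork}`);
  }

  // Assumption: an unset LOG_LEVEL is acceptable; only noisy values are flagged.
  const level = env.LOG_LEVEL;
  if (level !== undefined && QUIET_LEVELS.indexOf(level) < 0) {
    problems.push(`LOG_LEVEL is ${level}; use info or quieter`);
  }

  return problems;
}
```

An empty result means the environment matches the checklist. It does not replace the wallet balance check or the dry-run, which need the real CLI.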

Wire the probes

The daemon opens an HTTP server on HEALTH_PORT (default 3030; 0 disables it). Point your orchestrator at both endpoints:

  • GET /ready → liveness: 200 while the process is alive. Restart the pod when it stops answering.
  • GET /health → readiness: 200 when every plugin is healthy, 503 when any plugin is unhealthy. Pull from traffic on 503; don't restart.

On a PaaS that gives you one probe, use /health — it's an informative superset of /ready. The full probe config and intervals live in Deploy & observability.
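
The two probes map to different orchestrator actions, which is easy to get backwards. As a minimal sketch of that decision table (the `probeAction` function and its action names are hypothetical, not a runner or orchestrator API):

```typescript
// Hypothetical orchestrator-side decision table; not part of the runner.
// A status of null means the endpoint did not answer before the probe timed out.
type ProbeAction = "serve" | "remove-from-traffic" | "restart";

function probeAction(readyStatus: number | null, healthStatus: number | null): ProbeAction {
  // /ready is liveness: anything other than 200 means the process is gone or wedged.
  if (readyStatus !== 200) return "restart";
  // /health is readiness: 503 means at least one plugin is unhealthy. Do not restart.
  // Assumption: a /health timeout while /ready still answers is treated the same way.
  if (healthStatus !== 200) return "remove-from-traffic";
  return "serve";
}
```

The point of the ordering is that an unhealthy plugin never triggers a restart on its own; only a silent /ready does.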

Wire the alerts

  • Alert on /health 503: the readiness probe handles routing; the alert is so a human looks. The JSON body is identical at 200 and 503, so poll it and parse the plugins[] array to find which one flipped.
  • Alert on the noop-loop line: the runtime emits one warn-level line (plugin stuck in noop loop) once a plugin reaches noopWarnThreshold consecutive noops. It fires once per streak. A healthy plugin stuck in noop looks fine to /health; this is how you catch it.
  • Alert on failed climbing while submitted stays flat: both are raw counters from /health. Build the rule in your own system; the runner does not expose a Prometheus endpoint.
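
The three rules above can be sketched as small pure functions that your own alerting system calls on each poll. Only the plugins[] array, the failed and submitted counters, and the noop-loop message come from this page. The field names `name` and `healthy`, the placement of the counters on each plugin entry, and all function names are assumptions; check them against a real /health body before relying on this.

```typescript
// Assumed shape of the /health body; field names are hypothetical.
interface PluginHealth {
  name: string;
  healthy: boolean;
  failed: number;
  submitted: number;
}
interface HealthBody {
  plugins: PluginHealth[];
}

// Rule 1: which plugins flipped /health to 503.
function unhealthyPlugins(body: HealthBody): string[] {
  return body.plugins.filter((p) => !p.healthy).map((p) => p.name);
}

// Rule 2: match the noop-loop warning in a log line.
function isNoopLoopLine(line: string): boolean {
  return line.indexOf("plugin stuck in noop loop") >= 0;
}

// Rule 3: failed went up between two polls while submitted did not move.
function failingWithoutProgress(previous: HealthBody, current: HealthBody): string[] {
  const flagged: string[] = [];
  for (const now of current.plugins) {
    const before = previous.plugins.filter((p) => p.name === now.name)[0];
    if (before === undefined) continue; // plugin appeared between polls
    if (now.failed > before.failed && now.submitted === before.submitted) {
      flagged.push(now.name);
    }
  }
  return flagged;
}
```

Because the counters are raw totals, rule 3 compares two consecutive polls rather than looking at a single body.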

Know your rollback before you need it

Have the path ready before go-live, not during the incident:

  • Disable one plugin: remove its line from src/plugins/index.ts and redeploy. The other plugins keep ticking. This is faster than fixing it in production.
  • Stop the daemon cleanly: SIGINT or SIGTERM triggers a graceful shutdown. The runtime stops the scheduler and the health server, then exits. In-flight ticks are not interrupted mid-submit.
  • Roll back the deploy: redeploy the previous image. The runner holds no migration state of its own; persisted plugin state is keyed per plugin name and survives the rollback.

When 503 fires

Follow the incident playbook: parse /health, tail logs by plugin name, check whether it oscillates (transient) or sticks, then pnpm dry-run <plugin> and pnpm inspect <plugin> to localize the fault before you touch the deploy.
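
The "oscillates or sticks" step can be made concrete by looking at the last few /health status codes. This is a minimal sketch, assuming you already record one status code per poll. The `classify503` function and its `stickyAfter` threshold are hypothetical and not part of the runner.

```typescript
type Verdict = "healthy" | "transient" | "stuck";

// `statuses` holds recent /health status codes, oldest first.
// `stickyAfter` is a hypothetical threshold: that many trailing 503s in a row counts as stuck.
function classify503(statuses: number[], stickyAfter: number): Verdict {
  // Count the unbroken run of 503s at the end of the window.
  let trailing = 0;
  for (let i = statuses.length - 1; i >= 0 && statuses[i] === 503; i--) {
    trailing++;
  }
  if (trailing >= stickyAfter) return "stuck";

  // Any 503 in the window that did not persist is treated as oscillation.
  return statuses.indexOf(503) >= 0 ? "transient" : "healthy";
}
```

A "stuck" verdict is the cue to move on to pnpm dry-run <plugin> and pnpm inspect <plugin>; a "transient" one is worth watching before you touch the deploy.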