Production runbook
The go-live checklist tying probes, alerts, and config into one procedure.
This is the procedure, not the explanation. Run the list top to bottom before flipping a vault live. Each item links to the page that explains the why; here you just check the box.
Rule: if you can't see the daemon and you can't roll it back, it isn't in production yet.
Before the daemon starts
- RUNNER_PLUGINS set: the CSV of plugins the daemon loads. Confirm every name you expect is in it. trigger and dry-run only see plugins on this list.
- OPERATOR_PRIVATE_KEY set, never logged: run pnpm show-config and check it reports ✅ set. The command prints the resolved config without secrets, so it's safe to paste into a ticket.
- SUI_NETWORK correct: testnet|mainnet|devnet, default testnet. A mainnet vault behind a testnet daemon ticks forever and does nothing.
- LOG_LEVEL is info (or quieter): debug prints auto-fund deliberations and scheduler skips every tick. It's expensive. Don't ship it.
- Wallet funded: start logs the operator address and SUI balance before the first tick, but it does not assert a threshold. Check the balance yourself. An empty wallet fails every submission as a client error.
- Dry-run each plugin: pnpm dry-run <plugin> simulates a real tick with no signature and no gas. A green dry-run is your last cheap signal before the daemon owns the keys.
See Configuration reference for the full list and CLI commands for what each verb does.
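The environment checks above can be scripted. This is a minimal sketch, not part of the runner: the preflight function, the RunnerEnv type, and the expectedNetwork parameter are hypothetical names. It covers only the four environment variables; wallet balance and dry-runs still need the manual steps.

```typescript
// Hypothetical preflight helper; the runner does not ship this.
interface RunnerEnv {
  RUNNER_PLUGINS?: string;
  OPERATOR_PRIVATE_KEY?: string;
  SUI_NETWORK?: string;
  LOG_LEVEL?: string;
}

const NETWORKS = ["testnet", "mainnet", "devnet"];

// Returns a list of human-readable problems; an empty list means the
// environment passes these checks.
function preflight(
  env: RunnerEnv,
  expectedPlugins: string[],
  expectedNetwork: string,
): string[] {
  const problems: string[] = [];

  // RUNNER_PLUGINS is a CSV: trim each entry and drop empties.
  const loaded = (env.RUNNER_PLUGINS ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  for (const name of expectedPlugins) {
    if (!loaded.includes(name)) {
      problems.push(`RUNNER_PLUGINS is missing "${name}"`);
    }
  }

  // Only presence is checked; the key itself is never echoed.
  if (!env.OPERATOR_PRIVATE_KEY) {
    problems.push("OPERATOR_PRIVATE_KEY is not set");
  }

  // An unset SUI_NETWORK falls back to testnet, per the documented default.
  const network = env.SUI_NETWORK ?? "testnet";
  if (!NETWORKS.includes(network)) {
    problems.push(`SUI_NETWORK "${network}" is not one of ${NETWORKS.join("|")}`);
  } else if (network !== expectedNetwork) {
    problems.push(`SUI_NETWORK is ${network}, but the vault is on ${expectedNetwork}`);
  }

  if (env.LOG_LEVEL === "debug") {
    problems.push("LOG_LEVEL is debug; use info or quieter");
  }

  return problems;
}
```

In a Node script you could call preflight(process.env, [...], "mainnet") and exit non-zero when the returned list is not empty.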
Wire the probes
The daemon opens an HTTP server on HEALTH_PORT (default 3030; 0 disables it). Point your orchestrator at both endpoints:
- GET /ready → liveness: 200 while the process is alive. Restart the pod when it stops answering.
- GET /health → readiness: 200 when every plugin is healthy, 503 when any plugin is unhealthy. Pull from traffic on 503; don't restart.
On a PaaS that gives you one probe, use /health — it's an informative superset of /ready. The full probe config and intervals live in Deploy & observability.
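The two-probe rule reduces to a small decision table. The sketch below only illustrates that mapping; probeAction and the ProbeAction type are hypothetical names, and your orchestrator applies the same logic through its own probe config.

```typescript
type ProbeAction = "serve" | "pull-from-traffic" | "restart";

// Each argument is the HTTP status code returned by the endpoint,
// or null when the probe got no answer at all.
function probeAction(ready: number | null, health: number | null): ProbeAction {
  // /ready is the liveness signal: anything but 200 means restart.
  if (ready !== 200) return "restart";
  // /health is the readiness signal: a 503 pulls the pod from traffic
  // but does not restart it.
  if (health !== 200) return "pull-from-traffic";
  return "serve";
}
```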
Wire the alerts
- Alert on /health 503: the readiness probe handles routing; the alert is so a human looks. The JSON body has the same shape at 200 and 503, so poll it and parse the plugins[] array to find which one flipped.
- Alert on the noop-loop line: the runtime emits one warn at noopWarnThreshold consecutive noops (plugin stuck in noop loop). It fires once per streak. A healthy plugin stuck in noop looks fine to /health; this is how you catch it.
- Alert on failed climbing while submitted stays flat: raw counters from /health. Build the rule in your own system; the runner exposes no Prometheus metrics directly.
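The three alert rules above can be expressed as pure functions over polled /health bodies and log lines. This is a sketch under assumptions: only plugins[], the counter names failed and submitted, and the message text come from this page. The name and healthy fields, per-plugin placement of the counters, and all function names are hypothetical; check the real body before relying on them.

```typescript
// Assumed shape of the /health JSON body (hypothetical field layout).
interface PluginHealth {
  name: string;
  healthy: boolean;
  failed: number;
  submitted: number;
}

interface HealthBody {
  plugins: PluginHealth[];
}

// Alert 1: list the plugins that are currently unhealthy.
function unhealthyPlugins(body: HealthBody): string[] {
  return body.plugins.filter((p) => !p.healthy).map((p) => p.name);
}

// Alert 2: match the noop-loop warn line by its message text.
// Assumes the message appears verbatim somewhere in the log line.
function isNoopLoopLine(logLine: string): boolean {
  return logLine.includes("plugin stuck in noop loop");
}

// Alert 3: between two polls, `failed` rose while `submitted` did not move.
function failingWithoutSubmitting(prev: HealthBody, curr: HealthBody): string[] {
  const before = new Map(prev.plugins.map((p) => [p.name, p] as const));
  return curr.plugins
    .filter((p) => {
      const old = before.get(p.name);
      return (
        old !== undefined &&
        p.failed > old.failed &&
        p.submitted === old.submitted
      );
    })
    .map((p) => p.name);
}
```

Feed the results into whatever alerting system you already run; the functions hold no state, so the caller keeps the previous poll.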
Know your rollback before you need it
Have the path ready before go-live, not during the incident:
- Disable one plugin: remove its line from src/plugins/index.ts and redeploy. The other plugins keep ticking. This is faster than fixing it in production.
- Stop the daemon cleanly: SIGINT/SIGTERM triggers graceful shutdown. The runtime stops the scheduler and the health server, then exits. In-flight ticks are not interrupted mid-submit.
- Roll back the deploy: redeploy the previous image. The runner holds no migration state of its own; persisted plugin state is keyed per plugin name and survives the rollback.
When 503 fires
Follow the incident playbook: parse /health, tail logs by plugin name, check whether it oscillates (transient) or sticks, then pnpm dry-run <plugin> and pnpm inspect <plugin> to localize the fault before you touch the deploy.
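The "oscillates or sticks" check can be made mechanical once you have a series of /health status codes from repeated polls. A minimal sketch: classifyHealth, the Verdict type, and the default threshold of three consecutive 503s are assumptions, not values from the playbook.

```typescript
type Verdict = "healthy" | "transient" | "stuck";

// samples: /health status codes from repeated polls, oldest first.
// stuckAfter: how many trailing 503s count as "sticks" (assumed default).
function classifyHealth(samples: number[], stuckAfter = 3): Verdict {
  if (!samples.includes(503)) return "healthy";

  // Count the unbroken run of 503s at the end of the series.
  let trailing = 0;
  for (let i = samples.length - 1; i >= 0 && samples[i] === 503; i--) {
    trailing++;
  }

  // A long trailing run means the fault sticks; anything shorter,
  // or a 503 that already recovered, is treated as transient.
  return trailing >= stuckAfter ? "stuck" : "transient";
}
```

A "stuck" verdict is the cue to move on to the dry-run and inspect steps; "transient" is worth watching before you touch the deploy.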