Why we ship WRNexus as a single-VM Docker Compose stack

The “reference deployment” of WRNexus is one VM, one docker-compose.yml, and one wrnexus.service systemd unit. No Kubernetes, no Nomad, no service mesh, no operator. That is on purpose, and we run our own production exactly this way.

The shape of it

A single Linux VM (we like Ubuntu 22.04 LTS, but it does not matter) runs the following containers:

api — the Rust + actix-web backend for SSO, account, billing and admin.
marketing, sso, account, admin, docs — five Astro apps.
postgres — primary store, daily backup to S3.
redis — session store and rate-limit windows.
traefik — TLS termination via Let’s Encrypt.

A wrnexus.service systemd unit calls docker compose up -d on boot, health-checks each container every 30 seconds, and restarts anything that falls over. A nightly cron job archives the pg_dump output and uploads it to S3, with a retention policy of 30 dailies and 12 monthlies.
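
The retention policy is easy to get subtly wrong, so here is a minimal sketch of the pruning decision in Rust. It is an illustration, not our actual cron script: the names Date, backups_to_keep and backups_to_delete are hypothetical, and it assumes that a “monthly” means the earliest backup taken in a calendar month, which the policy above does not spell out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Calendar date of a backup as (year, month, day).
/// Tuples compare field by field, so they sort chronologically.
pub type Date = (u16, u8, u8);

/// Backups to keep: the `dailies` newest ones, plus the earliest backup in
/// each of the `monthlies` most recent calendar months that have a backup.
/// ASSUMPTION: "monthly" = earliest backup of the month (not from the source).
pub fn backups_to_keep(all: &[Date], dailies: usize, monthlies: usize) -> BTreeSet<Date> {
    // Deduplicate and sort ascending.
    let sorted: BTreeSet<Date> = all.iter().copied().collect();

    // The newest `dailies` backups.
    let mut keep: BTreeSet<Date> = sorted.iter().rev().take(dailies).copied().collect();

    // Walking in ascending order, the first date seen for a (year, month)
    // key is that month's earliest backup.
    let mut first_of_month: BTreeMap<(u16, u8), Date> = BTreeMap::new();
    for &d in &sorted {
        first_of_month.entry((d.0, d.1)).or_insert(d);
    }

    // Keep the monthly representative for the most recent `monthlies` months.
    keep.extend(first_of_month.values().rev().take(monthlies).copied());
    keep
}

/// Everything not selected by `backups_to_keep`, oldest first.
pub fn backups_to_delete(all: &[Date], dailies: usize, monthlies: usize) -> Vec<Date> {
    let keep = backups_to_keep(all, dailies, monthlies);
    let unique: BTreeSet<Date> = all.iter().copied().collect();
    unique.difference(&keep).copied().collect()
}
```

Called with 30 and 12, backups_to_delete returns the objects a nightly job could remove. Keeping the decision a pure function over a list of dates means it can be unit-tested without touching S3.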

That is the whole production manifest.

Why this works at our scale

Two reasons. First, every WRNexus service is stateless except Postgres and Redis, so horizontal scaling would only buy us more stateless processes, and a 4-vCPU VM happily serves the per-region load we have today. Second, even the friendliest orchestrator carries a non-trivial operational burden; for an early-stage product, time spent debugging a misconfigured PersistentVolumeClaim (PVC) is time not spent shipping features.

When a single VM genuinely becomes the bottleneck, somewhere north of 10k QPS on the API, we will move to Fly.io regions (the same images, no rebuild) and adopt a managed Postgres. The compose file will remain the source of truth for local development.

What we do not skimp on

Backups — we test restore from backup every quarter. A backup you have not restored from is not a backup; it’s a hope.
Monitoring — node-exporter, postgres-exporter, redis-exporter and a Grafana dashboard live in the same compose file. Alerts page on disk usage above 80%, on API p95 above 300ms, on backup age above 28 hours.
Migrations — embedded sqlx migrations, run on container startup with a lock so parallel container starts don’t race. Never run by hand.
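
The three alert thresholds above can be modelled as a small pure function. This is only a sketch of the logic in Rust: in a stack like ours the rules themselves would normally live in the monitoring tooling's alert configuration, and the names Snapshot, Alert and firing_alerts are hypothetical.

```rust
use std::time::Duration;

// Thresholds taken from the monitoring description above.
pub const DISK_USED_LIMIT_PERCENT: f64 = 80.0;
pub const API_P95_LIMIT: Duration = Duration::from_millis(300);
pub const BACKUP_AGE_LIMIT: Duration = Duration::from_secs(28 * 60 * 60);

/// A point-in-time reading of the three signals that can page.
/// (Hypothetical type; real values would come from the exporters.)
pub struct Snapshot {
    pub disk_used_percent: f64,
    pub api_p95: Duration,
    pub backup_age: Duration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alert {
    DiskUsage,
    ApiLatency,
    BackupAge,
}

/// Returns the alerts that should fire for this snapshot.
/// Each limit is exclusive: a value exactly at the limit does not page.
pub fn firing_alerts(s: &Snapshot) -> Vec<Alert> {
    let mut alerts = Vec::new();
    if s.disk_used_percent > DISK_USED_LIMIT_PERCENT {
        alerts.push(Alert::DiskUsage);
    }
    if s.api_p95 > API_P95_LIMIT {
        alerts.push(Alert::ApiLatency);
    }
    if s.backup_age > BACKUP_AGE_LIMIT {
        alerts.push(Alert::BackupAge);
    }
    alerts
}
```

The 28-hour backup limit is a 24-hour schedule plus four hours of slack, so a backup that merely runs late does not page, but a skipped night does.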

When you would not pick this

If you need multi-region active-active, sub-50ms global latency, or zero-downtime database failover, you have outgrown the single-VM stack, and that’s fine. We publish the same images to the WorkRoot registry, and they run on Fly.io machines, ECS, GKE, anywhere. The architecture does not change; the orchestrator does.

The point of the single-VM default isn’t that orchestration is bad. The point is that operational simplicity is a feature: choosing it on “day 1” and graduating to complexity only when you measurably need it is one of the best engineering decisions a small team can make.