A status page is one of those features that is easy to ship badly and hard to ship well. The trap is that the people most motivated to look at it — your customers, during an incident — are the people you have the least slack to communicate with in the moment.
Here is what we built for status.wrnexus.com, why each piece is there, and the things we deliberately left out.
What the page actually shows
Three sections, no more:
- Component status — one row per user-visible surface (SSO, Account, Billing, Admin, API). Each row is green, yellow, or red.
- Active incidents — a banner the moment we acknowledge something, updated by the on-call engineer at least every 30 minutes until resolved.
- Historical uptime — 90 days of daily uptime per component, rendered as a small heatmap so a glance tells you “is this thing reliable?”
That’s it. No vanity graphs of request rate. No “global region health” matrix. No third-party SaaS health embedded as a tile because then the page conflates “our problem” with “GitHub’s problem.”
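To make the heatmap concrete, here is a minimal Python sketch of how individual probe results could be rolled up into one uptime figure per day and then into a colour bucket. The function names and the bucket thresholds are assumptions made for illustration, not the values behind the real page.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_uptime(probes):
    """Roll up (timestamp, ok) probe results into {date: fraction_ok}.

    Timestamps are expected to be timezone-aware; days are cut in UTC.
    """
    totals = defaultdict(lambda: [0, 0])  # date -> [ok_count, total_count]
    for ts, ok in probes:
        day = ts.astimezone(timezone.utc).date()
        totals[day][0] += int(ok)
        totals[day][1] += 1
    return {day: ok / total for day, (ok, total) in totals.items()}


def heat_bucket(uptime):
    """Map a daily uptime fraction to a heatmap cell class.

    The thresholds are hypothetical; pick ones that match your own SLOs.
    """
    if uptime >= 0.999:
        return "good"
    if uptime >= 0.99:
        return "degraded"
    return "poor"
```

Keeping the roll-up per day, rather than one number per year, is what lets a reader see whether the bad days cluster or are spread out.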
How the signals get there
Each component row has two inputs:
- Synthetic probes from three regions hitting the public endpoint every 30 seconds. Two consecutive failures turn the row yellow.
- An on-call override. If our internal alerting fires before the probe does, which is most of the time, the on-call engineer can flip the row red from a Slack slash command:
/status set sso red "investigating elevated 5xx rate"
Probes alone are too slow. Slack overrides alone are too quiet at 3am. Together they catch both fast and slow incidents.
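As a rough sketch of how the two inputs might be combined into a single row colour, consider the Python below. It makes two assumptions the description above does not spell out: probe results arrive as one ordered sequence per component, and when the probe and the override disagree, the worse of the two wins. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

GREEN, YELLOW, RED = "green", "yellow", "red"
SEVERITY = {GREEN: 0, YELLOW: 1, RED: 2}


@dataclass
class Override:
    """A manual status set by the on-call engineer (e.g. via the slash command)."""
    status: str
    note: str


def probe_status(results: Sequence[bool], threshold: int = 2) -> str:
    """Return yellow if the most recent `threshold` probes all failed.

    `results` is ordered oldest first; True means the probe succeeded.
    """
    recent = results[-threshold:]
    if len(recent) == threshold and not any(recent):
        return YELLOW
    return GREEN


def row_status(results: Sequence[bool], override: Optional[Override] = None) -> str:
    """Combine probe history and an optional override; the worse signal wins (assumed)."""
    from_probes = probe_status(results)
    if override is None:
        return from_probes
    return max(from_probes, override.status, key=SEVERITY.__getitem__)
```

With this rule, `row_status([True, True], Override("red", "investigating elevated 5xx rate"))` is red even though the probes are still passing, which is the case the override exists for.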
Why we host it off the main stack
The status page lives on a different cloud provider, a different DNS zone, and a separate billing account. If our primary region is on fire, the status page is not on fire with it. This sounds obvious; it is the single most common operational mistake we see in other products’ status pages.
Concretely: the page is a static site, regenerated every 60 seconds from a small Postgres database that holds the component states. The generation script and the database both run on a tiny dedicated VM that has zero shared dependencies with our production system.
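A regeneration step along these lines could look like the following Python sketch. It assumes the component states have already been read from the database as `(name, status)` pairs, so the query itself is left out, and it writes the page via a temporary file plus rename so a reader never fetches a half-written file. The function names and the HTML shape are illustrative only.

```python
import html
import os
import tempfile


def render_page(states):
    """Render (component_name, status) pairs as a minimal static HTML page."""
    rows = "\n".join(
        f'<li class="{html.escape(status)}">'
        f"{html.escape(name)}: {html.escape(status)}</li>"
        for name, status in states
    )
    return f"<!doctype html>\n<title>Status</title>\n<ul>\n{rows}\n</ul>\n"


def publish(page, dest):
    """Write the page next to `dest`, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(page)
        os.replace(tmp_path, dest)  # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A scheduler (cron, a systemd timer, or a loop with a sleep) would call these two functions once a minute; that part is omitted here.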
Incident updates that aren’t useless
The hardest part of running a status page is writing updates that help. The rules we follow:
- First update within 5 minutes of acknowledging. Even if it just says “we’re investigating, no impact yet known,” it tells the customer we know.
- No “we’re investigating” loop. Every subsequent update has to add new information: what we know, what we don’t, when we’ll update next.
- A specific resolution sentence. Not “service has been restored” but “the deploy that introduced the regression was rolled back at 14:32 UTC; we’re keeping the incident open for 30 minutes to monitor.”
- A linked post-mortem within 5 business days for anything that affected paying customers. We’ve never regretted writing one; we have regretted not.
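The two timing rules (a first update within 5 minutes of acknowledging, then at least one every 30 minutes) are simple enough to check mechanically. Here is a minimal Python sketch of such a check, which could feed a reminder to the on-call engineer; the function names are hypothetical and nothing here describes our actual tooling.

```python
from datetime import datetime, timedelta

FIRST_UPDATE_DEADLINE = timedelta(minutes=5)   # first update after acknowledging
UPDATE_INTERVAL = timedelta(minutes=30)        # maximum gap between later updates


def next_update_due(acknowledged_at: datetime, update_times: list) -> datetime:
    """Return the time by which the next public update should be posted."""
    if not update_times:
        return acknowledged_at + FIRST_UPDATE_DEADLINE
    return max(update_times) + UPDATE_INTERVAL


def is_overdue(acknowledged_at: datetime, update_times: list, now: datetime) -> bool:
    """True if an open incident has gone too long without an update."""
    return now > next_update_due(acknowledged_at, update_times)
```

A check like this only enforces cadence. Whether an update adds new information still needs a human to judge.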
What we deliberately left out
- A subscribe-via-email widget. Email is brittle exactly when you need it. We publish an RSS feed and an iCal feed of incidents, which any modern team can consume in their tool of choice. We also expose a webhook for status changes.
- A subscribe-via-SMS widget. Nobody enjoys an SMS at 2am. If you’re on-call for an integration with WRNexus, set up the webhook.
- “Component health = 99.97% over 365 days” hero numbers. They obscure more than they reveal. The 90-day heatmap shows the same data in a way you can actually reason about.
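For teams that prefer the RSS feed to the webhook, consuming it takes very little code. The Python sketch below parses a standard RSS 2.0 document into `(title, published)` pairs using only the standard library. It assumes the feed follows the usual `channel`/`item`/`title`/`pubDate` layout; fetching the feed and the exact fields our feed carries are not shown.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_incident_feed(rss_xml: str):
    """Return a list of (title, published_datetime_or_None) from an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    incidents = []
    for item in root.iterfind("./channel/item"):
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate")
        published = parsedate_to_datetime(pub_date) if pub_date else None
        incidents.append((title, published))
    return incidents
```

From there it is a short step to posting new items into whatever chat or paging tool your team already watches.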
The bigger lesson
A status page is a contract: when something is wrong, this page will tell you within a few minutes, and tell you the truth. The way you keep the contract is by making the page expensive to lie on — both technically (so it works when the rest doesn’t) and culturally (so your on-call engineer feels empowered to flip a component red without asking permission).
The day someone says “the status page is the first thing I check,” you know it’s working.