WRNexus
Back to the blog
engineering 7 min read

Zero-downtime Postgres migrations: our checklist

The eight rules WRNexus engineers follow to ship schema changes during business hours without locking a table or paging the on-call.

Rohan Verma

#engineering · #database · #operations

We ship Postgres migrations every day, often in the middle of the business day, against a database that handles real customer traffic. This used to be terrifying. It is now boring, and the eight rules below are most of why.

The rules are all variations on one principle: the migration and the code change are two deploys, not one.

Rule 1: Never combine a schema change with a code change

If you add a column and ship code that reads it in the same deploy, you’ve signed up for a race. Old app servers are still running while the migration is in flight; new app servers expect the column to exist. The boring path:

  1. Deploy A: ship the migration. App code is untouched.
  2. Deploy B: ship the code that reads the new column.

Two deploys, no races. It feels slower; it’s how you ship at 2pm on a Tuesday without holding your breath.

Rule 2: Adding a NOT NULL column is two migrations

A NOT NULL DEFAULT 'x' on a large table will lock it long enough that production notices. The boring path is three steps:

-- Migration A
ALTER TABLE workspace ADD COLUMN tier text;

-- Application: backfill in batches of 1000
UPDATE workspace SET tier = 'free' WHERE tier IS NULL AND id IN (...);

-- Migration B (after backfill is done)
ALTER TABLE workspace
  ALTER COLUMN tier SET DEFAULT 'free',
  ALTER COLUMN tier SET NOT NULL;

The SET NOT NULL in Postgres 12+ uses a no-rewrite path if the column has no existing NULL values, which is exactly the state your backfill leaves you in.

Rule 3: Creating an index? Use CONCURRENTLY

A plain CREATE INDEX takes a write lock for the duration. On a busy table, “the duration” is “long enough to time out every endpoint that writes to that table.” Always:

CREATE INDEX CONCURRENTLY idx_audit_log_workspace_created
  ON audit_log (workspace_id, created_at DESC);

Yes, it’s slower. Yes, it doesn’t run inside a transaction. The trade-off is that nobody pages you.

Rule 4: Dropping a column is three deploys

Same logic, opposite direction:

  1. Deploy A: stop writing to the column in application code.
  2. Deploy B: stop reading from it. (Yes, this is a separate deploy — if you do both in one go, an unrelated bug could need a rollback while old app servers are still around.)
  3. Deploy C: the migration that drops the column.

Three deploys for one column. The first time you do this you’ll think it’s overkill. The second time you do this is because the first time was actually overkill, and yet here you are.

Rule 5: Long-running migrations belong in a job, not the schema step

Anything that touches more than a few thousand rows shouldn’t be in your schema migration. The migration should add the column / index / constraint structure, and a separate background job should backfill the rows in batches. Reasons:

  • The migration is short and easy to roll back.
  • The backfill can checkpoint and resume.
  • A backfill that hits an unexpected case can be paused without blocking the next migration.

In WRNexus we run these as one-shot Celery tasks; in your stack they might be a manage.py command, a Lambda, or a Sidekiq worker. The tool matters less than the separation.

Rule 6: Always have a rollback in mind

Before you push the migration, answer two questions out loud:

  1. If the migration runs but the app deploy fails, what happens? (Answer should be: nothing. The schema change is forward- compatible with the old code.)
  2. If the app deploys but the migration is reverted, what happens? (Answer should be: an exception on the first request that touches the new column, and a clean rollback.)

If either answer is “I don’t know” or “the database is corrupt,” you haven’t followed Rule 1 yet.

Rule 7: Lock timeouts save you from yourself

Every migration in WRNexus runs with:

SET lock_timeout = '5s';
SET statement_timeout = '60s';

If a migration can’t acquire a lock in 5 seconds, it fails fast instead of queueing behind a long-running transaction and blocking every other writer. The failure is annoying. The alternative — your entire app stuck waiting for a 4-minute ANALYZE to release a lock — is a paging event.

Rule 8: Practice in a copy of production

We snapshot prod once a week and replay the migration against the snapshot. Anything that takes more than 30 seconds gets a “this is going to be noisy” tag on the PR and a discussion about whether to split it. Anything that takes more than 5 minutes gets re-thought from scratch.

The cost of the snapshot environment is trivial. The cost of finding out a migration takes 14 minutes by running it in prod is substantial.

A worked example

Here’s a real change we shipped last quarter: adding a workspace.sso_required boolean. The full sequence:

StepWhatWhere
1ADD COLUMN sso_required boolean (no default, no NOT NULL)Schema migration
2App reads as Optional[bool] and treats NULL as FalseCode change, deploy 1
3Backfill job sets every existing row to false, 1000 at a timeBackground job
4ALTER COLUMN sso_required SET DEFAULT false, SET NOT NULLSchema migration
5App reads as bool (no NULL handling)Code change, deploy 2

Five steps, three deploys, zero downtime, zero pages. The change shipped to production on a Wednesday afternoon while the team was mid-standup.

That’s the bar. None of these rules are clever, and that is the point: boring migrations are the kind that don’t wake you up at 3am.