Disaster Recovery

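Tested runbooks for every failure class. RTO is under 15 minutes for service crashes, bad deploys, and database corruption; full server loss and data center outages take longer (see DR targets).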

v1.1.0

DR targets

Scenario	RTO	RPO
Service crash (auto-restart)	< 30 sec	0
Bad deploy (rollback)	< 3 min	0
Database corruption	< 15 min	Last backup
Full server loss	< 60 min	Last remote backup
Data center outage	< 2 hr	Last remote backup
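To check a drill or a real incident against these targets, the table can be encoded as a small lookup. This is a minimal sketch: the scenario keys (service-crash, bad-deploy, and so on) are hypothetical names chosen for illustration, not nself identifiers.

```shell
#!/bin/sh
# RTO targets in seconds, copied from the DR targets table.
# The scenario keys are hypothetical labels, not part of the nself CLI.
rto_target_seconds() {
  case "$1" in
    service-crash)     echo 30 ;;
    bad-deploy)        echo 180 ;;
    db-corruption)     echo 900 ;;
    server-loss)       echo 3600 ;;
    datacenter-outage) echo 7200 ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Succeeds when a measured recovery time (in seconds) is under the target.
within_rto() {
  target=$(rto_target_seconds "$1") || return 2
  [ "$2" -lt "$target" ]
}
```

For example, `within_rto bad-deploy 120 && echo "met target"` prints the message, because 120 seconds is under the 3-minute rollback target.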

Scenario 1: Service Crash

All services are configured with restart: unless-stopped. Auto-restart happens within 30 seconds. If a service keeps crashing:

# Check which service is down
nself health

# See why it crashed
nself logs <service> --tail 100

# Restart it manually
nself service restart <service>

# If it keeps crashing — run doctor
nself doctor

# Common fixes:
#   Postgres: disk full → nself db vacuum
#   Hasura: metadata inconsistency → nself db hasura reload
#   Auth: JWT config changed → nself service restart auth
#   Redis: OOM → increase REDIS_MEMORY_LIMIT in .env, redeploy
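The manual steps above can be wrapped in a bounded retry loop. This sketch uses only the commands shown in this section and assumes that nself health exits non-zero while any service is unhealthy; that exit-code behavior is an assumption, so confirm it before relying on the loop.

```shell
#!/bin/sh
# Restart a service a limited number of times, then collect diagnostics.
# ASSUMPTION: `nself health` exits non-zero while a service is unhealthy.
restart_with_retries() {
  service=$1
  max_tries=${2:-3}
  wait_seconds=${3:-30}   # matches the 30-second auto-restart window
  try=1
  while [ "$try" -le "$max_tries" ]; do
    nself service restart "$service"
    sleep "$wait_seconds"
    if nself health; then
      echo "$service healthy after $try restart(s)"
      return 0
    fi
    try=$((try + 1))
  done
  echo "$service still failing after $max_tries restarts" >&2
  nself logs "$service" --tail 100 >&2
  nself doctor >&2
  return 1
}
```

Usage: `restart_with_retries hasura` tries three restarts, 30 seconds apart, before falling back to the logs and nself doctor.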

Scenario 2: Bad Deploy (Rollback)

Roll back within minutes if a deploy introduces a regression:

# List recent deploys
nself deploy history

# Output:
# ID                       Time                 Status
# deploy_20260507_143022   2026-05-07 14:30     active
# deploy_20260506_090011   2026-05-06 09:00     ok
# deploy_20260505_181533   2026-05-05 18:15     ok

# Roll back to the previous deploy
nself deploy rollback

# Roll back to a specific deploy
nself deploy rollback deploy_20260506_090011

# Rollback is zero-downtime — old containers start before new ones stop

Database migrations cannot be rolled back automatically. If the bad deploy included a migration, restore from a pre-deploy backup instead (see Scenario 3). For that reason, run nself backup create before every production deploy.
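One way to make the pre-deploy backup hard to forget is a wrapper that refuses to deploy when the backup fails. This is a sketch: the deploy command is passed in as arguments because this page does not show it, and the pre-deploy label format is an illustrative choice.

```shell
#!/bin/sh
# Take a labelled backup, then run the deploy command given as arguments.
# The deploy command is skipped if the backup fails.
backup_then_deploy() {
  label="pre-deploy-$(date +%Y%m%d-%H%M%S)"
  if ! nself backup create --label "$label"; then
    echo "backup failed; refusing to deploy" >&2
    return 1
  fi
  echo "backup label: $label"
  "$@"
}
```

Usage: `backup_then_deploy ./deploy.sh`, where ./deploy.sh stands in for whatever command performs your production deploy.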

Scenario 3: Database Corruption / Data Loss

# 1. Stop all write traffic (enable maintenance mode)
nself maintenance on

# 2. Create a forensic snapshot of current state (even corrupted)
nself backup create --label forensic-$(date +%Y%m%d)

# 3. List available backups
nself backup list

# 4. Pick the most recent valid backup
nself backup restore bak_20260506_020000

# Output:
#   Stopping services...
#   Restoring database from bak_20260506_020000...
#   Running post-restore migrations...
#   Starting services...
#   Restore complete. Data loss: ~12 hours.

# 5. Verify data integrity
nself db check

# 6. Re-enable traffic
nself maintenance off
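Before choosing a backup in step 4, it helps to know how much data each candidate would lose. The sketch below derives that from the backup ID, assuming the ID encodes the creation time in UTC as bak_YYYYMMDD_HHMMSS (inferred from the examples above, not confirmed) and that GNU date is available.

```shell
#!/bin/sh
# Convert a backup ID like bak_20260506_020000 to "YYYY-MM-DD HH:MM:SS".
# ASSUMPTION: the ID encodes the backup's creation time in UTC.
backup_id_to_timestamp() {
  printf '%s\n' "$1" | sed -n -E \
    's/^bak_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/p'
}

# Whole hours between a backup and the incident time (both treated as UTC).
# Requires GNU date for the -d option.
data_loss_hours() {
  backup_ts=$(backup_id_to_timestamp "$1")
  [ -n "$backup_ts" ] || { echo "unrecognised backup ID: $1" >&2; return 1; }
  backup_epoch=$(date -u -d "$backup_ts" +%s) || return 1
  incident_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (incident_epoch - backup_epoch) / 3600 ))
}
```

For example, `data_loss_hours bak_20260506_020000 "2026-05-06 14:00:00"` prints 12.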

Point-in-time recovery (PITR)

If WAL archiving is enabled, you can restore to any point in time, not just backup snapshots:

# Enable WAL archiving (requires remote storage configured)
BACKUP_WAL_ENABLED=true
BACKUP_WAL_S3_BUCKET=your-wal-archive-bucket

# Restore to a specific timestamp
nself backup restore --pitr "2026-05-07 13:45:00"

# Verify WAL archiving is running
nself backup status --wal
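The restore target is usually a few minutes before the incident was first observed. A small helper can compute it; this sketch assumes timestamps are in UTC and that GNU date is available. How nself interprets the time zone of --pitr is not stated here, so verify that first.

```shell
#!/bin/sh
# Print a timestamp N minutes before the given incident time.
# ASSUMPTION: timestamps are UTC. Requires GNU date for the -d option.
pitr_target() {
  incident_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((incident_epoch - $2 * 60))" '+%Y-%m-%d %H:%M:%S'
}
```

Usage: `nself backup restore --pitr "$(pitr_target "2026-05-07 13:50:00" 5)"` restores to 13:45:00, five minutes before the incident.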

Scenario 4: Full Server Loss

Complete VPS failure: disk gone, server terminated, or data center down. Recovery time: under 60 minutes with remote backups (under 2 hours for a data center outage, per the DR targets table).

Prerequisites (set up before disaster)

# Remote backup configured in .env:
BACKUP_REMOTE=s3                  # s3 | r2 | gcs | azure | b2
BACKUP_S3_BUCKET=your-backup-bucket
BACKUP_S3_ENDPOINT=https://s3.amazonaws.com
BACKUP_S3_ACCESS_KEY=...
BACKUP_S3_SECRET_KEY=...

# Verify remote backups are uploading
nself backup status --remote
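A missing key in .env is easy to overlook until the day it matters. The sketch below checks that the keys listed above are present and non-empty. It covers the S3 configuration shown here; other BACKUP_REMOTE values may need different keys.

```shell
#!/bin/sh
# Report any remote-backup keys that are missing or empty in an env file.
check_backup_env() {
  env_file=${1:-.env}
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 1; }
  missing=0
  for key in BACKUP_REMOTE BACKUP_S3_BUCKET BACKUP_S3_ENDPOINT \
             BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY; do
    # A key counts as set when "KEY=" is followed by a non-blank,
    # non-comment character.
    if ! grep -Eq "^${key}=[^[:space:]#]" "$env_file"; then
      echo "missing or empty: $key" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Usage: `check_backup_env .env && nself backup status --remote`.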

Recovery procedure

# 1. Provision a new server (same OS, same region if possible)
#    Hetzner: hcloud server create --name nself-prod --type cx23 --image ubuntu-22.04

# 2. Install nself on the new server
curl -fsSL https://install.nself.org | sh

# 3. Restore configuration from remote backup
nself restore --from-remote s3 --bucket your-backup-bucket --latest

# 4. Apply hardening and request new SSL cert
nself prod harden
nself ssl request --domain yourdomain.com

# 5. Update DNS A record to new server IP

# 6. Verify
nself health
nself smoke-test --domain yourdomain.com
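To see whether a recovery stays inside the 60-minute target, each step can be timed as it runs. The helper below is a sketch that wraps any command; the step labels are arbitrary.

```shell
#!/bin/sh
# Run one recovery step, print how long it took, and pass through its exit code.
RECOVERY_TOTAL=0
timed_step() {
  label=$1
  shift
  start=$(date +%s)
  "$@"
  rc=$?
  elapsed=$(( $(date +%s) - start ))
  RECOVERY_TOTAL=$((RECOVERY_TOTAL + elapsed))
  echo "[$label] ${elapsed}s (exit $rc, running total ${RECOVERY_TOTAL}s)"
  return "$rc"
}

# Example, chaining steps so the run stops at the first failure:
#   timed_step restore nself restore --from-remote s3 --bucket your-backup-bucket --latest &&
#   timed_step harden  nself prod harden &&
#   timed_step verify  nself health
```

The running total does not include manual work such as provisioning the server or waiting for DNS, so add those separately when comparing against the target.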

Scenario 5: Compromised Server

If you suspect the server has been compromised:

# 1. Immediately isolate the server (Hetzner firewall / provider console)
#    Block all inbound traffic except your IP.

# 2. Create a forensic backup before touching anything
nself backup create --label forensic-incident-$(date +%Y%m%d)

# 3. Rotate all secrets from a CLEAN machine
nself secrets rotate --all --remote  # rotates in the DB but doesn't restart old server

# 4. Provision a new server and restore from last-known-good backup
#    (follow Scenario 4 procedure above)

# 5. Revoke all active sessions
nself auth revoke-all-sessions

# 6. Notify affected users if data was likely accessed

# 7. File incident report and review audit logs
nself audit export --since 30d > incident-audit.json
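An exported audit log is only useful as evidence if you can show it was not altered afterwards. A common approach, sketched here with sha256sum from GNU coreutils (not an nself feature), is to record a checksum immediately after export.

```shell
#!/bin/sh
# Write a SHA-256 checksum next to an evidence file.
seal_evidence() {
  sha256sum "$1" > "$1.sha256"
}

# Succeeds only if the file still matches its recorded checksum.
verify_evidence() {
  sha256sum -c "$1.sha256" >/dev/null 2>&1
}
```

Usage: run `seal_evidence incident-audit.json` right after the export, store the .sha256 file somewhere the compromised server cannot reach, and run `verify_evidence incident-audit.json` from the same directory before handing the file over.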

Restore Drills

Practice before you need it. Run a restore drill every 90 days in staging:

# Full DR drill script
nself dr drill --env staging --scenario full-restore

# What it does:
#   1. Creates a staging instance from scratch
#   2. Restores latest remote backup
#   3. Runs health checks
#   4. Runs smoke tests
#   5. Reports time taken and any failures
#   6. Tears down the test instance

# Always record the results of each drill
nself dr drill --report --output dr-drill-$(date +%Y%m%d).json
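If reports are saved with the dr-drill-YYYYMMDD.json naming used above, the 90-day cadence can be checked from the filenames alone. This sketch assumes all reports live in one directory and that GNU date is available.

```shell
#!/bin/sh
# Print the age in days of the newest dr-drill-YYYYMMDD.json report.
# The optional second argument overrides "today" (YYYY-MM-DD) for testing.
last_drill_age_days() {
  dir=$1
  today=${2:-$(date -u +%Y-%m-%d)}
  newest=$(ls "$dir" 2>/dev/null |
    sed -n -E 's/^dr-drill-([0-9]{4})([0-9]{2})([0-9]{2})\.json$/\1-\2-\3/p' |
    sort | tail -n 1)
  [ -n "$newest" ] || { echo "no drill reports found in $dir" >&2; return 1; }
  echo $(( ( $(date -u -d "$today" +%s) - $(date -u -d "$newest" +%s) ) / 86400 ))
}

# Succeeds when the last drill is more than 90 days old or none is recorded.
drill_overdue() {
  age=$(last_drill_age_days "$1" "$2") || return 0
  [ "$age" -gt 90 ]
}
```

Usage: `drill_overdue ./dr-reports && echo "schedule a restore drill"`, where ./dr-reports is a hypothetical directory holding the saved reports.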

Backup Verification

# Verify the latest backup is restorable
nself backup verify

# Output:
#   Backup ID:      bak_20260507_020000
#   Checksum:       PASS
#   Restore test:   PASS (restored to ephemeral container in 4.2s)
#   Integrity:      PASS (row count matches: 142,381)
#   Backup is valid and restorable.

# Schedule weekly automatic verification
BACKUP_VERIFY_SCHEDULE=0 6 * * 0   # every Sunday at 6am
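For alerting, the verify output can be reduced to a single pass/fail exit code. The sketch below assumes the output keeps the three check lines shown above (Checksum, Restore test, Integrity); the real format may differ between versions, so treat the patterns as placeholders.

```shell
#!/bin/sh
# Read `nself backup verify` output on stdin and fail unless all three
# check lines report PASS.
# ASSUMPTION: the output has the "Name:   PASS ..." shape shown above.
all_checks_pass() {
  awk '
    /^[ \t]*(Checksum|Restore test|Integrity):/ {
      checks++
      if ($0 !~ /:[ \t]+PASS/) {
        failed++
        print "FAILED: " $0 > "/dev/stderr"
      }
    }
    END { exit !(checks == 3 && failed == 0) }
  '
}
```

Usage: `nself backup verify | all_checks_pass || echo "backup verification failed" >&2`. Missing check lines count as a failure, so a change in output format causes a failure, not a silent pass.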

RTO/RPO Improvement

To reduce RTO and RPO further:

Enable WAL archiving for PITR (reduces RPO to near-zero).
Use Postgres streaming replication to a hot standby (reduces RTO for DB failure to <60 sec).
Pre-provision a warm standby server in a different region for critical deployments.
Use Cloudflare as a CDN — it can serve stale cached content during origin outages.

