Tested runbooks for every failure class. RTO under 15 minutes for most scenarios.
| Scenario | RTO | RPO |
|---|---|---|
| Service crash (auto-restart) | < 30 sec | 0 |
| Bad deploy (rollback) | < 3 min | 0 |
| Database corruption | < 15 min | Last backup |
| Full server loss | < 60 min | Last remote backup |
| Data center outage | < 2 hr | Last remote backup |
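An RPO of "Last backup" means the data-loss window equals the age of the newest valid backup. The sketch below estimates that window from a backup ID. It assumes IDs follow the `bak_YYYYMMDD_HHMMSS` pattern seen in this guide's examples, that the embedded timestamp is UTC, and that GNU `date` is available; `backup_age_hours` is a hypothetical helper, not an nself command.

```shell
# Hypothetical helper, not part of the nself CLI.
# Prints the age in whole hours of a backup ID such as bak_20260506_020000,
# relative to a reference epoch time (default: now). Requires GNU date (-d).
# The body is a subshell ( ... ), so variables stay local and `exit`
# only leaves the function.
backup_age_hours() (
  id=$1
  ref=${2:-$(date -u +%s)}
  case $id in
    bak_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "unrecognized backup ID: $id" >&2; exit 1 ;;
  esac
  d=${id#bak_}; d=${d%_*}   # YYYYMMDD
  t=${id##*_}               # HHMMSS
  year=${d%????}; rest=${d#????}; month=${rest%??}; day=${rest#??}
  hour=${t%????}; rest=${t#??}; min=${rest%??}; sec=${t#????}
  # Assumption: the timestamp in the ID is UTC.
  ts=$(date -u -d "$year-$month-$day $hour:$min:$sec" +%s) || exit 1
  echo $(( (ref - ts) / 3600 ))
)
```

Calling `backup_age_hours bak_20260506_020000` prints the hours elapsed since that backup, which can be compared against the RPO you are willing to accept.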
All services are configured with restart: unless-stopped. Auto-restart happens within 30 seconds. If a service keeps crashing:
# Check which service is down
nself health
# See why it crashed
nself logs <service> --tail 100
# Restart it manually
nself service restart <service>
# If it keeps crashing — run doctor
nself doctor
# Common fixes:
# Postgres: disk full → nself db vacuum
# Hasura: metadata inconsistency → nself db hasura reload
# Auth: JWT config changed → nself service restart auth
# Redis: OOM → increase REDIS_MEMORY_LIMIT in .env, redeploy

Roll back within minutes if a deploy introduces a regression:
# List recent deploys
nself deploy history
# Output:
# ID Time Status
# deploy_20260507_143022 2026-05-07 14:30 active
# deploy_20260506_090011 2026-05-06 09:00 ok
# deploy_20260505_181533 2026-05-05 18:15 ok
# Roll back to the previous deploy
nself deploy rollback
# Roll back to a specific deploy
nself deploy rollback deploy_20260506_090011
# Rollback is zero-downtime: old containers start before new ones stop

Database migrations cannot be rolled back automatically. If the bad deploy included a migration, restore from a pre-deploy backup instead (see Scenario 3, database corruption). This is why running nself backup create before every production deploy is a best practice.
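One way to make the pre-deploy backup hard to forget is to wrap whatever deploy command you use. This is a minimal sketch: `deploy_with_backup` is a hypothetical wrapper, the `predeploy-` label is an illustrative choice, and it assumes `nself backup create` exits non-zero on failure. The deploy command itself is passed in as arguments because this guide does not show it.

```shell
# Hypothetical wrapper, not part of the nself CLI.
# Usage: deploy_with_backup <deploy command and its arguments>
deploy_with_backup() {
  if [ "$#" -eq 0 ]; then
    echo "usage: deploy_with_backup <deploy command...>" >&2
    return 2
  fi
  # Take a labelled backup first; abort if it fails.
  # Assumption: nself exits non-zero when the backup cannot be created.
  if ! nself backup create --label "predeploy-$(date +%Y%m%d-%H%M%S)"; then
    echo "backup failed; deploy aborted" >&2
    return 1
  fi
  # Run the caller-supplied deploy command unchanged.
  "$@"
}
```

The wrapper returns the deploy command's own exit status, so it can sit in a CI step in place of the bare deploy command.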
# 1. Stop all write traffic (enable maintenance mode)
nself maintenance on
# 2. Create a forensic snapshot of current state (even corrupted)
nself backup create --label forensic-$(date +%Y%m%d)
# 3. List available backups
nself backup list
# 4. Pick the most recent valid backup
nself backup restore bak_20260506_020000
# Output:
# Stopping services...
# Restoring database from bak_20260506_020000...
# Running post-restore migrations...
# Starting services...
# Restore complete. Data loss: ~12 hours.
# 5. Verify data integrity
nself db check
# 6. Re-enable traffic
nself maintenance off

If WAL archiving is enabled, you can restore to any point in time, not just backup snapshots:
# Enable WAL archiving (requires remote storage configured)
BACKUP_WAL_ENABLED=true
BACKUP_WAL_S3_BUCKET=your-wal-archive-bucket
# Restore to a specific timestamp
nself backup restore --pitr "2026-05-07 13:45:00"
# Verify WAL archiving is running
nself backup status --wal

Complete VPS failure: disk gone, server terminated, or data center down. Recovery time: under 60 minutes with remote backups.
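This recovery path only works if remote backups were configured before the server was lost. The sketch below checks a .env file for the remote-backup keys this guide lists. `missing_backup_keys` is a hypothetical helper, and the key list assumes the S3 settings shown in this guide; other providers may need different keys.

```shell
# Hypothetical helper, not part of the nself CLI.
# Prints "ok" if every remote-backup key has a non-empty value in the given
# .env file (default: ./.env); otherwise lists the missing keys and fails.
# The body is a subshell ( ... ), so variables stay local.
missing_backup_keys() (
  env_file=${1:-.env}
  missing=""
  # Assumption: these are the keys required for an S3 remote.
  for key in BACKUP_REMOTE BACKUP_S3_BUCKET BACKUP_S3_ENDPOINT \
             BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY; do
    # A key counts as set if a line starts with KEY= and a non-space character.
    if ! grep -Eq "^${key}=[^[:space:]]" "$env_file" 2>/dev/null; then
      missing="$missing $key"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    exit 1
  fi
  echo "ok"
)
```

A check like this only confirms the keys are present; nself backup status --remote remains the way to confirm uploads are actually happening.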
# Remote backup configured in .env:
BACKUP_REMOTE=s3 # s3 | r2 | gcs | azure | b2
BACKUP_S3_BUCKET=your-backup-bucket
BACKUP_S3_ENDPOINT=https://s3.amazonaws.com
BACKUP_S3_ACCESS_KEY=...
BACKUP_S3_SECRET_KEY=...
# Verify remote backups are uploading
nself backup status --remote

# 1. Provision a new server (same OS, same region if possible)
# Hetzner: hcloud server create --name nself-prod --type cx23 --image ubuntu-22.04
# 2. Install nself on the new server
curl -fsSL https://install.nself.org | sh
# 3. Restore configuration from remote backup
nself restore --from-remote s3 --bucket your-backup-bucket --latest
# 4. Apply hardening and request new SSL cert
nself prod harden
nself ssl request --domain yourdomain.com
# 5. Update DNS A record to new server IP
# 6. Verify
nself health
nself smoke-test --domain yourdomain.com

If you suspect the server has been compromised:
# 1. Immediately isolate the server (Hetzner firewall / provider console)
# Block all inbound traffic except your IP.
# 2. Create a forensic backup before touching anything
nself backup create --label forensic-incident-$(date +%Y%m%d)
# 3. Rotate all secrets from a CLEAN machine
nself secrets rotate --all --remote # rotates in the DB but doesn't restart old server
# 4. Provision a new server and restore from last-known-good backup
# (follow the Scenario 4 full server loss procedure above)
# 5. Revoke all active sessions
nself auth revoke-all-sessions
# 6. Notify affected users if data was likely accessed
# 7. File incident report and review audit logs
nself audit export --since 30d > incident-audit.json

Practice before you need it. Run a restore drill every 90 days in staging:
# Full DR drill script
nself dr drill --env staging --scenario full-restore
# What it does:
# 1. Creates a staging instance from scratch
# 2. Restores latest remote backup
# 3. Runs health checks
# 4. Runs smoke tests
# 5. Reports time taken and any failures
# 6. Tears down the test instance
# Always record the results of the last drill
nself dr drill --report --output dr-drill-$(date +%Y%m%d).json

# Verify the latest backup is restorable
nself backup verify
# Output:
# Backup ID: bak_20260507_020000
# Checksum: PASS
# Restore test: PASS (restored to ephemeral container in 4.2s)
# Integrity: PASS (row count matches: 142,381)
# Backup is valid and restorable.
# Schedule weekly automatic verification
BACKUP_VERIFY_SCHEDULE=0 6 * * 0 # every Sunday at 6am

To reduce RTO further: