Troubleshooting playbook

Troubleshooting decision paths for Nextbase platform incidents

Structured runbook guiding on-call engineers through common outage scenarios with explicit escalation gates.

Owner: Platform Reliability Guild | Last reviewed: July 28, 2025 | #incidents-auth

Step 1
Verify SLO breach
Confirm that authentication request failures exceed the error budget threshold.
If the error rate stays above 10% for 3 minutes, initiate incident response.
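The threshold rule above can be sketched as a small shell check. This is a minimal illustration, assuming one `errors total` pair per line on stdin, sampled once per minute with the oldest sample first. The function name `slo_breached` and the input format are hypothetical and not part of any Nextbase tooling.

```shell
# Hypothetical helper: succeeds (exit 0) only if the most recent WINDOW
# samples all show an error rate strictly above THRESHOLD percent.
# Input: one "errors total" pair per line, one line per minute (assumed format).
slo_breached() {
  threshold="${1:-10}"   # percent, from the rule above
  window="${2:-3}"       # number of one-minute samples
  awk -v t="$threshold" -v w="$window" '
    {
      # Treat a minute with no traffic as a 0% error rate.
      rate = ($2 > 0) ? 100 * $1 / $2 : 0
      # Count how many samples in a row are above the threshold.
      streak = (rate > t + 0) ? streak + 1 : 0
    }
    END { if (streak >= w + 0) exit 0; exit 1 }
  '
}
```

For example, `printf '15 100\n20 100\n12 100\n' | slo_breached 10 3` succeeds, while a run interrupted by a healthy minute does not.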

Query failed auth attempts

kubectl logs deploy/auth-proxy --tail=200 | grep -i "status=500"

Check for rate limiter failures or upstream timeouts.
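To see which failure class dominates, the same log stream can be tallied by status code rather than filtered for 500 alone. This is a sketch, assuming log lines carry a `status=<code>` token as the grep above implies. Which codes correspond to rate limiter failures or upstream timeouts depends on how auth-proxy reports them and is not stated here. The function name `tally_statuses` is hypothetical.

```shell
# Count occurrences of each status=NNN token on stdin, most frequent first.
# Assumes the "status=<code>" log token implied by the grep above.
tally_statuses() {
  grep -oE 'status=[0-9]{3}' | sort | uniq -c | sort -rn
}

# Usage (requires cluster access):
# kubectl logs deploy/auth-proxy --tail=200 | tally_statuses
```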

Page the authentication on-call.

Step 2
Invalidate stale cache
Flush stale tokens if an edge cache desync caused auth drift.

Purge CDN auth cache

Warning: destructive
curl -X POST https://api.nextbase.com/admin/cache/purge -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{"surface":"auth"}'

Expect a 2-3 minute warmup after the purge.
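Because the purge is destructive, one option is to put a typed confirmation in front of it. The guard below is a minimal sketch, not an existing Nextbase safeguard. The function name `confirm_then_run` and the confirmation word are hypothetical.

```shell
# Hypothetical guard: run the given command only if the operator types
# the expected confirmation word on stdin; otherwise abort with status 1.
confirm_then_run() {
  expected="$1"; shift
  printf 'Type "%s" to continue: ' "$expected" >&2
  IFS= read -r answer || return 1
  if [ "$answer" != "$expected" ]; then
    echo "Aborted." >&2
    return 1
  fi
  "$@"
}

# Usage: purge_auth_cache is a placeholder for a function or script that
# wraps the purge request shown above.
# confirm_then_run purge-auth purge_auth_cache
```

The prompt goes to stderr so that the wrapped command's stdout stays clean for piping or logging.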

Prerequisites
  • Incident bridge created in Slack
  • PagerDuty engagement confirmed
  • Latest deploy status reviewed
Required tools
  • kubectl access to production cluster
  • Nextbase admin CLI
  • SQS read permissions
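A quick preflight can confirm that the command-line tools are installed before an incident starts. This is a sketch only: it checks that binaries are on PATH, not that cluster access or SQS permissions are actually granted. The function name `missing_tools` is hypothetical, and `nextbase` in the usage line is a placeholder because the admin CLI's binary name is not given here.

```shell
# Print each named tool that is not found on PATH.
# Returns 0 if every tool is present, 1 if any is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage ("nextbase" is a placeholder name for the admin CLI):
# missing_tools kubectl curl nextbase
```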
Escalation contacts
Route incidents to the correct owner before breach.
Primary on-call: pagerduty.com/oncall
Staff engineer: @alex.warden
Customer success: cs-oncall@nextbase.com