incident-runbook-templates

Structured incident response runbooks with detection, triage, mitigation, and communication procedures.

  • Provides a severity-level framework (SEV1–SEV4) with response time targets and impact classifications
  • Includes ready-to-use templates for service outages and database incidents, with bash/SQL commands, health checks, and rollback procedures
  • Covers escalation matrices, communication templates for notifications and status updates, and verification steps to confirm resolution
  • Emphasizes runbook maintenance through regular testing, postmortem integration, and assumption documentation

INSTALLATION
npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
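
The severity levels can also be encoded so that paging or reporting scripts look up the response target instead of relying on memory. The following is a minimal sketch; the helper name `response_target` is hypothetical, and the values are copied from the severity table.

```bash
# Hypothetical helper: map a severity label to its response-time target.
# Values mirror the severity table; the function name is illustrative.
response_target() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target SEV2
```

Returning a non-zero status for an unknown label lets a calling script stop early rather than page with a wrong target.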

2. Runbook Structure

1. Overview & Impact

2. Detection & Alerts

3. Initial Triage

4. Mitigation Steps

5. Root Cause Investigation

6. Resolution Procedures

7. Verification & Rollback

8. Communication Templates

9. Escalation Matrix
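
One way to keep new runbooks consistent is to generate the nine sections from a script. This is a sketch only: the function name `runbook_skeleton` and the choice of `##` headings are assumptions, not part of the skill.

```bash
# Hypothetical generator: print a Markdown skeleton containing the nine
# sections listed above. Function name and heading levels are assumptions.
runbook_skeleton() {
  printf '# %s Runbook\n' "$1"
  n=0
  while IFS= read -r section; do
    n=$((n + 1))
    printf '\n## %d. %s\n' "$n" "$section"
  done <<'EOF'
Overview & Impact
Detection & Alerts
Initial Triage
Mitigation Steps
Root Cause Investigation
Resolution Procedures
Verification & Rollback
Communication Templates
Escalation Matrix
EOF
}

runbook_skeleton "Payment Service"
```

Redirect the output to a file to start a new runbook, then fill in each section for the service.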

Runbook Templates

Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service

**Owner**: Platform Team

**Slack**: #payments-incidents

**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?

- [ ] What percentage of traffic is impacted?

- [ ] Are there financial implications?

- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)

- `payment_latency_p99 > 2s` (Slack)

- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)

- [Error Tracking](https://sentry.io/payments)

- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# (-G with --data-urlencode stops curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"

2. Quick Health Checks

  • Can you reach the service? curl -I https://api.company.com/payments/health
  • Database connectivity? Check connection pool metrics
  • External dependencies? Check Stripe, bank API status
  • Recent changes? Check deploy history

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

Mitigation Procedures

4.1 Service Completely Down

# Step 1: Check pod status

kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs

kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments

kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect

kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained

kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery

kubectl rollout status deployment/payment-service -n payments

4.2 High Latency

# Step 1: Check database connections

kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A SELECT alias cannot be referenced in WHERE, so the duration expression is repeated there.
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2; terminating a backend rolls back its open transaction.
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# (confirm the health URL against the provider's documentation before relying on it)
curl -w "total: %{time_total}s\n" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern

kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes

psql -h $DB_HOST -c "

  SELECT * FROM audit_log

  WHERE table_name = 'payment_methods'

  AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

# Step 1: Check current load (kubectl top reports CPU and memory per pod, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs

kubectl apply -f - <<EOF

apiVersion: networking.k8s.io/v1

kind: NetworkPolicy

metadata:

  name: block-suspicious

  namespace: payments

spec:

  podSelector:

    matchLabels:

      app: payment-service

  ingress:

  - from:

    - ipBlock:

        cidr: 0.0.0.0/0

        except:

        - 192.168.1.0/24  # Suspicious range

EOF

Verification Steps

# Verify service is healthy

curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
# (-G with --data-urlencode stops curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq .

# Smoke test critical flows

./scripts/smoke-test-payments.sh

Rollback Procedures

# Rollback Kubernetes deployment

kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)

./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag

curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2

Status: Investigating

Impact: ~20% of payment requests failing

Start Time: [TIME]

Incident Commander: [NAME]

Current Actions:

- Investigating root cause

- Scaling up service

- Monitoring dashboards

Updates in #payments-incidents

Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating

Impact: Reduced to ~5% failure rate

Duration: 25 minutes

Actions Taken:

- Rolled back deployment v2.3.4 → v2.3.3

- Scaled service from 5 → 10 replicas

Next Steps:

- Continuing to monitor

- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes

Impact: ~5,000 affected transactions

Root Cause: Memory leak in v2.3.4

Resolution:

- Rolled back to v2.3.3

- Transactions auto-retried successfully

Follow-up:

- Postmortem scheduled for [DATE]

- Bug fix in progress

### Template 2: Database Incident Runbook

Database Incident Runbook

Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

Connection Pool Exhaustion


-- Check current connections

SELECT datname, usename, state, count(*)

FROM pg_stat_activity

GROUP BY datname, usename, state

ORDER BY count(*) DESC;

-- Identify long-running connections

SELECT pid, usename, datname, state, query_start, query

FROM pg_stat_activity

WHERE state != 'idle'

ORDER BY query_start;

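-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the session went idle; query_start only
-- records when its last query began)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';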

Replication Lag

-- Check lag on replica

SELECT

  CASE

    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0

    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())

  END AS lag_seconds;

-- If lag > 60s, consider:

-- 1. Check network between primary/replica

-- 2. Check replica disk I/O

-- 3. Consider failover if unrecoverable

Disk Space Critical

# Check disk usage

df -h /var/lib/postgresql/data

# Find large tables

psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))

FROM pg_catalog.pg_statio_user_tables

ORDER BY pg_total_relation_size(relid) DESC

LIMIT 10;"

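# VACUUM to reclaim space
# WARNING: VACUUM FULL rewrites the table. It holds an ACCESS EXCLUSIVE lock
# (blocking reads and writes) and needs free disk space for the new copy,
# so on a nearly full disk free up or add space first.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk

## Best Practices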

### Do's

- **Keep runbooks updated** - Review after every incident

- **Test runbooks regularly** - Game days, chaos engineering

- **Include rollback steps** - Always have an escape hatch

- **Document assumptions** - What must be true for steps to work

- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain

- **Don't skip verification** - Confirm each step worked

- **Don't forget communication** - Keep stakeholders informed

- **Don't work alone** - Escalate early

- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

Step: Check pod status

kubectl get pods -n payments

Prerequisites: kubectl configured, kubeconfig points to correct cluster

If this fails: run aws eks update-kubeconfig --name prod-cluster --region us-east-1

Expected output: pods in Running state
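
Part of the prerequisite check can be automated. The sketch below covers only the "is the tool installed" precondition; the helper name `require_tools` is hypothetical, and it does not verify that the kubeconfig points at the correct cluster (`kubectl config current-context` can be used for that).

```bash
# Hypothetical preflight helper: verify required CLI tools are on PATH
# before the first runbook step. Each missing tool is reported on stderr.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING: $tool (install it or fix PATH before continuing)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# The payment runbook relies on these tools.
require_tools kubectl curl jq psql || echo "Fix prerequisites before starting triage"
```

Running this as step zero turns a confusing mid-incident "command not found" into an explicit, early failure.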


### On-call engineer panics and skips steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

Quick Checklist

  • [ ] 1. Declare incident severity and open war room
  • [ ] 2. Check service health (Section 4.1)
  • [ ] 3. Check recent deployments (Section 4.1)
  • [ ] 4. Roll back if deploy is suspect (Section 4.1)
  • [ ] 5. Post initial notification to #payments-incidents
  • [ ] 6. Escalate if > 15 min unresolved

### Runbook is outdated: commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

Runbook Metadata

| Field | Value |
|-------|-------|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
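
The "Last verified" date lends itself to a mechanical check in CI. The following is a sketch under stated assumptions: it requires GNU `date` (for the `-d` option), the function names are hypothetical, and the 90-day limit is an illustrative choice rather than a value from this skill.

```bash
# Hypothetical staleness check for the "Last verified" field.
# Assumes GNU date (-d); BSD/macOS date needs different flags.
days_since() {
  # $1 = last-verified date (YYYY-MM-DD)
  # $2 = reference date (optional, defaults to today in UTC)
  ref="${2:-$(date -u +%F)}"
  start_s=$(date -u -d "$1" +%s) || return 2
  ref_s=$(date -u -d "$ref" +%s) || return 2
  echo $(( (ref_s - start_s) / 86400 ))
}

is_stale() {
  # $1 = last-verified date, $2 = max age in days, $3 = optional reference date
  age=$(days_since "$1" "${3:-}") || return 2
  [ "$age" -gt "$2" ]
}

# The 90-day limit is illustrative.
if is_stale 2024-11-15 90; then
  echo "Runbook not verified in the last 90 days; schedule a review"
fi
```

Passing the reference date explicitly keeps the check deterministic in tests; in CI it can be omitted so the current date is used.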

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

Update every 15 minutes (even if no new information):

  • Current status (Investigating / Mitigating / Monitoring)
  • Impact (what is broken, who is affected, % of traffic)
  • What we are doing right now
  • Next update in: 15 minutes
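
A small formatter can keep every update in the same shape so the communicator only fills in the fields. This is a sketch; the function name `status_update` and the exact field labels are assumptions based on the agenda above.

```bash
# Hypothetical formatter for the recurring status update.
# Field order follows the agenda above; the function name is illustrative.
status_update() {
  # $1 = status, $2 = impact, $3 = current action,
  # $4 = minutes until next update (optional, defaults to 15)
  printf 'Status: %s\n' "$1"
  printf 'Impact: %s\n' "$2"
  printf 'Current action: %s\n' "$3"
  printf 'Next update in: %s minutes\n' "${4:-15}"
}

status_update "Mitigating" "~5% of payment requests failing" "Rolling back v2.3.4"
```

The output can be pasted into the incident channel or piped to whatever chat integration the team already uses.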

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

-- WARNING: This terminates idle connections (those clients lose their sessions). Verify count first.

-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '10 minutes';
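
The "verify count is reasonable" step can be enforced by a guard in a wrapper script rather than left to judgment at 3 AM. The sketch below is hypothetical: the function name `safe_to_terminate` is illustrative, and the default limit of 50 mirrors the comment in the SQL above.

```bash
# Hypothetical guard: refuse to proceed when the dry-run count is not a
# number or is at or above the limit (default 50, as in the SQL comment).
safe_to_terminate() {
  # $1 = count returned by the dry-run query, $2 = optional limit
  count="$1"
  limit="${2:-50}"
  case "$count" in
    ''|*[!0-9]*)
      echo "Refusing: count '$count' is not a number" >&2
      return 2
      ;;
  esac
  if [ "$count" -ge "$limit" ]; then
    echo "Refusing: $count connections >= limit $limit; escalate instead" >&2
    return 1
  fi
}

# Usage sketch (DRY_RUN_SQL and TERMINATE_SQL hold the two queries above):
# count=$(psql -At -c "$DRY_RUN_SQL")
# safe_to_terminate "$count" && psql -c "$TERMINATE_SQL"
```

Treating a non-numeric count as a failure matters here: if the dry-run query itself errors out, the destructive step is skipped instead of running blind.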
