SKILL.md

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service

**Owner**: Platform Team

**Slack**: #payments-incidents

**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?

- [ ] What percentage of traffic is impacted?

- [ ] Are there financial implications?

- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)

- `payment_latency_p99 > 2s` (Slack)

- `payment_success_rate < 95%` (PagerDuty)
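The three alert conditions above can be checked by hand when the alerting pipeline itself is in doubt. The following is a minimal sketch, assuming you have already read the current error rate (percent), p99 latency (seconds), and success rate (percent) off a dashboard; the function name `check_payment_alerts` is hypothetical and not part of any tool named in this runbook.

```shell
#!/usr/bin/env bash
# check_payment_alerts ERROR_RATE_PCT LATENCY_P99_SECONDS SUCCESS_RATE_PCT
# Prints one line per alert condition that is currently breached.
# Returns 0 when nothing is breached and 1 when at least one alert would fire.
check_payment_alerts() {
  # awk is used because bash arithmetic cannot compare decimal numbers.
  awk -v err="$1" -v lat="$2" -v ok="$3" 'BEGIN {
    fired = 0
    if (err > 5) { print "payment_error_rate > 5% (PagerDuty)";   fired = 1 }
    if (lat > 2) { print "payment_latency_p99 > 2s (Slack)";       fired = 1 }
    if (ok < 95) { print "payment_success_rate < 95% (PagerDuty)"; fired = 1 }
    exit fired
  }'
}
```

For example, `check_payment_alerts 7.5 1.2 96` reports only the error-rate alert, because the latency and success-rate values are inside their thresholds.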

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)

- [Error Tracking](https://sentry.io/payments)

- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# -G with --data-urlencode stops curl from treating the {} and [] in the query as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
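If responders start from a chat command or a script rather than from the document, the classification table can be encoded as a simple lookup. This is only a sketch; the function name `triage_section` is hypothetical, and the symptom strings must match the table exactly.

```shell
#!/usr/bin/env bash
# triage_section SYMPTOM
# Prints the mitigation section number for a symptom from the classification table.
# Returns 1 (with a message on stderr) when the symptom is not in the table.
triage_section() {
  case "$1" in
    "All requests failing") echo "4.1" ;;
    "High latency")         echo "4.2" ;;
    "Partial failures")     echo "4.3" ;;
    "Spike in errors")      echo "4.4" ;;
    *) echo "no match for symptom: $1" >&2; return 1 ;;
  esac
}
```

An unknown symptom returns a non-zero status instead of guessing a section, which nudges the responder back to the table.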

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# The WHERE clause repeats the expression because PostgreSQL does not allow
# a SELECT alias such as "duration" to be referenced there.
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Replace <pid> with a pid returned by Step 2
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
# (the Content-Type header is set explicitly because curl -d defaults to form encoding)
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current pod CPU/memory usage (a rough proxy for load;
# kubectl top does not report request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# (-G with --data-urlencode stops curl from globbing the {} and [] in the query)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
# (the Content-Type header is set explicitly because curl -d defaults to form encoding)
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
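The two numeric rows of the matrix (time unresolved and financial impact) are easy to forget under pressure, so they are good candidates for a reminder script. The following is a minimal sketch under stated assumptions: the function name `escalation_targets` is hypothetical, minutes and dollar impact are whole numbers supplied by the responder, and the judgment-based rows (suspected breach, customer communication) are deliberately left to humans.

```shell
#!/usr/bin/env bash
# escalation_targets SEVERITY MINUTES_UNRESOLVED FINANCIAL_IMPACT_USD
# Prints one "Escalate To: Contact" line per numeric rule that applies.
# Prints nothing when no numeric rule applies.
escalation_targets() {
  local sev="$1" minutes="$2" impact_usd="$3"
  # Row 1: SEV1 still unresolved after 15 minutes
  if [[ "$sev" == "SEV1" ]] && (( minutes >= 15 )); then
    echo "Engineering Manager: @manager (Slack)"
  fi
  # Row 3: financial impact above $10k
  if (( impact_usd > 10000 )); then
    echo "Finance + Legal: @finance-oncall"
  fi
}
```

Calling `escalation_targets SEV1 20 500` prints only the Engineering Manager line, since the impact is below the $10k threshold.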

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` (replace `<pid>` with a pid from `pg_stat_activity`) |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# WARNING: VACUUM FULL takes an ACCESS EXCLUSIVE lock and rewrites the table,
# so it blocks reads and writes and temporarily needs extra disk space.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident

- **Test runbooks regularly** - Game days, chaos engineering

- **Include rollback steps** - Always have an escape hatch

- **Document assumptions** - What must be true for steps to work

- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain

- **Don't skip verification** - Confirm each step worked

- **Don't forget communication** - Keep stakeholders informed

- **Don't work alone** - Escalate early

- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

- **Step**: Check pod status with `kubectl get pods -n payments`
- **Prerequisites**: kubectl configured, kubeconfig points to correct cluster
- **If this fails**: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
- **Expected output**: pods in Running state
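The same prerequisite-then-fallback pattern can be wrapped in a small helper so that every scripted step prints its recovery hint when the precondition is not met. This is a sketch only: `run_step` is a hypothetical helper, and the prerequisite string is evaluated with `eval`, so it should only ever contain commands written by the runbook author.

```shell
#!/usr/bin/env bash
# run_step DESCRIPTION PREREQ_COMMAND FALLBACK_HINT COMMAND [ARGS...]
# Runs PREREQ_COMMAND first; if it fails, prints the fallback hint and
# returns 1 without running COMMAND. Otherwise runs COMMAND with its arguments.
run_step() {
  local desc="$1" prereq="$2" hint="$3"
  shift 3
  echo "Step: $desc"
  if ! eval "$prereq" >/dev/null 2>&1; then
    echo "Prerequisite failed: $prereq" >&2
    echo "If this fails: $hint" >&2
    return 1
  fi
  "$@"
}

# Example wiring for the step above (not executed here):
#   run_step "Check pod status" \
#     "kubectl config current-context" \
#     "run aws eks update-kubeconfig --name prod-cluster --region us-east-1" \
#     kubectl get pods -n payments
```

Using `kubectl config current-context` as the prerequisite is one reasonable choice; it only confirms that a context is selected, not that it is the correct cluster, so a stricter check may be worth adding.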


### On-call engineer panics and skips steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

**Quick Checklist**

- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved


### Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

**Runbook Metadata**

| Field | Value |
|-------|-------|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
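One way to make the "Last verified" field enforceable is a CI step that fails when the date is too old. The following is a minimal sketch, assuming GNU `date` (the `-d` option is not available in the BSD/macOS `date`); the function name `runbook_is_stale` and the 90-day limit in the usage note are hypothetical, since the metadata above only specifies a review after every SEV1/SEV2.

```shell
#!/usr/bin/env bash
# runbook_is_stale LAST_VERIFIED MAX_AGE_DAYS [TODAY]
# Dates are YYYY-MM-DD. TODAY defaults to the current UTC date.
# Returns 0 when the runbook is older than MAX_AGE_DAYS, 1 when it is fresh,
# and 2 when a date cannot be parsed.
runbook_is_stale() {
  local last_verified="$1" max_age_days="$2" today="${3:-$(date -u +%F)}"
  local last_s today_s
  last_s="$(date -u -d "$last_verified" +%s)" || return 2
  today_s="$(date -u -d "$today" +%s)" || return 2
  # Whole days between the two dates (both are midnight UTC)
  (( (today_s - last_s) / 86400 > max_age_days ))
}
```

In CI this could be used as `if runbook_is_stale 2024-11-15 90; then echo "runbook needs re-verification"; exit 1; fi`, with the date extracted from the metadata table by whatever parsing the repository already uses.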


### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

Update every 15 minutes (even if no new information):

- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
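To keep updates uniform regardless of who is posting, the agenda can be turned into a small formatter that refuses any status outside the three listed values. This is a sketch; `format_status_update` is a hypothetical name, and posting the result to Slack or a status page is left out.

```shell
#!/usr/bin/env bash
# format_status_update STATUS IMPACT CURRENT_ACTION
# STATUS must be Investigating, Mitigating, or Monitoring.
# Prints the four agenda lines; returns 1 on an unknown status.
format_status_update() {
  local status="$1" impact="$2" action="$3"
  case "$status" in
    Investigating|Mitigating|Monitoring) ;;
    *)
      echo "status must be Investigating, Mitigating, or Monitoring" >&2
      return 1
      ;;
  esac
  printf 'Current status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'What we are doing right now: %s\n' "$action"
  printf 'Next update in: 15 minutes\n'
}
```

A call such as `format_status_update Mitigating "~5% of payment requests failing" "Rolled back to v2.3.3"` produces a block that can be pasted directly into the incident channel.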


### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.

-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```
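The dry-run-then-execute rule can also be enforced by a wrapper instead of relying on the responder to compare numbers by eye. The following is a minimal sketch, assuming `psql` picks up its connection settings from the standard `PGHOST`/`PGUSER`/`PGDATABASE` environment variables; the function name `terminate_idle_guarded` and the `MAX_IDLE_TERMINATIONS` variable are hypothetical, with the default of 50 mirroring the "< 50" guidance above.

```shell
#!/usr/bin/env bash
# Same filter as the runbook query: connections idle for more than 10 minutes.
IDLE_FILTER="state = 'idle' AND query_start < now() - interval '10 minutes'"

# terminate_idle_guarded
# Counts matching connections first and terminates them only when the count
# is below the limit. Returns 1 when it refuses, 2 when the count is unusable.
terminate_idle_guarded() {
  local limit="${MAX_IDLE_TERMINATIONS:-50}" count
  # Dry run: -A (unaligned) and -t (tuples only) make psql print just the number.
  count="$(psql -At -c "SELECT count(*) FROM pg_stat_activity WHERE ${IDLE_FILTER};")" || return 2
  if ! [[ "$count" =~ ^[0-9]+$ ]]; then
    echo "unexpected count output: '$count'" >&2
    return 2
  fi
  if (( count >= limit )); then
    echo "refusing: $count idle connections (limit $limit); investigate first" >&2
    return 1
  fi
  echo "terminating $count idle connections"
  psql -At -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ${IDLE_FILTER};" >/dev/null
}
```

The count can change between the dry run and the terminate statement, so the limit is a sanity check rather than a hard guarantee.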
