sre-engineer

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring…

INSTALLATION
npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

SRE Engineer

Core Workflow

  • Assess reliability - Review architecture, SLOs, incidents, toil levels
  • Define SLOs - Identify meaningful SLIs and set appropriate targets
  • Verify alignment - Confirm SLO targets reflect user expectations before proceeding
  • Implement monitoring - Build golden signal dashboards and alerting
  • Automate toil - Identify repetitive tasks and build automation
  • Test resilience - Design and execute chaos experiments; confirm end-to-end that recovery meets RTO/RPO targets before marking an experiment complete

Reference Guide

Load detailed guidance based on context:

Topic          Reference                          Load When
-------------  ---------------------------------  ----------------------------------------
SLO/SLI        references/slo-sli-management.md   Defining SLOs, calculating error budgets
Error Budgets  references/error-budget-policy.md  Managing budgets, burn rates, policies
Monitoring     references/monitoring-alerting.md  Golden signals, alert design, dashboards
Automation     references/automation-toil.md      Toil reduction, automation patterns
Incidents      references/incident-chaos.md       Incident response, chaos engineering

Constraints

MUST DO

  • Define quantitative SLOs (e.g., 99.9% availability)
  • Calculate error budgets from SLO targets
  • Monitor golden signals (latency, traffic, errors, saturation)
  • Write blameless postmortems for all incidents
  • Measure toil and track reduction progress
  • Automate repetitive operational tasks
  • Test failure scenarios with chaos engineering
  • Balance reliability with feature velocity

MUST NOT DO

  • Set SLOs without user impact justification
  • Alert on symptoms without actionable runbooks
  • Tolerate >50% toil without an automation plan
  • Skip postmortems or assign blame
  • Implement manual processes for recurring tasks
  • Deploy without capacity planning
  • Ignore error budget exhaustion
  • Build systems that can't degrade gracefully
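The 50% toil ceiling in the list above can be tracked with a small calculation. The following is a minimal sketch, not part of the skill itself: the `WorkLog` type, the hour categories, and the sample figures are hypothetical and only illustrate how a toil fraction might be measured week over week.

```python
from dataclasses import dataclass

TOIL_LIMIT = 0.50  # the 50% ceiling named in the constraints


@dataclass
class WorkLog:
    """Hypothetical weekly time log for one team, in hours."""
    toil_hours: float         # manual, repetitive operational work
    engineering_hours: float  # project work, including automation


def toil_fraction(log: WorkLog) -> float:
    """Share of logged time spent on toil (0.0 when nothing was logged)."""
    total = log.toil_hours + log.engineering_hours
    return log.toil_hours / total if total else 0.0


def needs_automation_plan(log: WorkLog, limit: float = TOIL_LIMIT) -> bool:
    """True when toil exceeds the limit, i.e. an automation plan is required."""
    return toil_fraction(log) > limit


# Illustrative sample data showing toil falling over three weeks.
weeks = [WorkLog(24, 16), WorkLog(20, 20), WorkLog(14, 26)]
for number, week in enumerate(weeks, start=1):
    print(f"week {number}: toil {toil_fraction(week):.0%}, "
          f"plan needed: {needs_automation_plan(week)}")
```

Only the first sample week (24 of 40 hours on toil) crosses the limit. What counts as toil is a team decision, so the categories here would need to match however time is actually logged.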

Output Templates

When implementing SRE practices, provide:

  • SLO definitions with SLI measurements and targets
  • Monitoring/alerting configuration (Prometheus, etc.)
  • Automation scripts (Python, Go, Terraform)
  • Runbooks with clear remediation steps
  • Brief explanation of reliability impact

Concrete Examples

SLO Definition & Error Budget Calculation

# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests
# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors consumed in week 1 → 50% budget burned in about 23% of the window (7 of 30 days)
# → Trigger error budget policy: freeze non-critical releases
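The arithmetic above can be sketched in Python to make each step explicit. This is an illustrative helper under stated assumptions: the `ErrorBudget` class and its method names are hypothetical, and the inputs are the example figures from the comments above.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float    # e.g. 0.999 for 99.9%
    window_days: int     # length of the SLO window
    total_requests: int  # requests expected in the window

    @property
    def allowed_downtime_minutes(self) -> float:
        """Time-based budget: minutes of full outage the SLO tolerates."""
        return (1 - self.slo_target) * self.window_days * 24 * 60

    @property
    def budget_requests(self) -> float:
        """Request-based budget: failed requests the SLO tolerates."""
        return (1 - self.slo_target) * self.total_requests

    def consumed_fraction(self, errors: int) -> float:
        """Share of the request budget already spent."""
        return errors / self.budget_requests

    def burn_rate(self, errors: int, elapsed_days: float) -> float:
        """Budget fraction consumed divided by window fraction elapsed."""
        return self.consumed_fraction(errors) / (elapsed_days / self.window_days)


budget = ErrorBudget(slo_target=0.999, window_days=30, total_requests=10_000_000)
print(f"allowed downtime: {budget.allowed_downtime_minutes:.1f} min")
print(f"budget: {budget.budget_requests:.0f} requests")
print(f"consumed after 5,000 errors: {budget.consumed_fraction(5_000):.0%}")
print(f"burn rate after 7 days: {budget.burn_rate(5_000, elapsed_days=7):.2f}x")
```

A burn rate above 1 means the budget would run out before the window ends; in this example it is a little over 2, which is why a release freeze is a plausible policy response.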

Prometheus SLO Alerting Rule (Multiwindow Burn Rate)

groups:
  - name: slo_availability
    rules:
      # Fast burn: 2% budget in 1h (14.4x burn rate)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.014400
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.014400
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate detected"
          runbook: "https://wiki.internal/runbooks/high-error-burn"

      # Slow burn: 5% budget in 6h (6x burn rate)
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.006
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
          runbook: "https://wiki.internal/runbooks/slow-error-burn"
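The thresholds in a burn-rate rule follow from the SLO target, the SLO window, and the share of budget each alert is allowed to consume. The sketch below derives them for a 99.9% SLO over 30 days; the function names are hypothetical, and the (budget fraction, alert window) pairs are the ones named in the rule comments.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo_target: float, rate: float) -> float:
    """Error ratio at which the budget burns `rate` times faster than sustainable."""
    return rate * (1 - slo_target)


SLO_TARGET = 0.999
SLO_WINDOW_HOURS = 30 * 24  # 30-day window

for fraction, hours in [(0.02, 1), (0.05, 6)]:
    rate = burn_rate(fraction, hours, SLO_WINDOW_HOURS)
    threshold = error_ratio_threshold(SLO_TARGET, rate)
    print(f"{fraction:.0%} of budget in {hours}h: "
          f"burn rate {rate:.1f}x, error ratio threshold {threshold:.4f}")
```

Spending 2% of the budget in 1 hour corresponds to a 14.4x burn rate and an error ratio of 0.0144, while 5% in 6 hours corresponds to 6x and 0.006. A burn rate of exactly 1x would exhaust the budget only at the end of the window.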

PromQL Golden Signal Queries

# Latency: 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Traffic: requests per second by service
sum(rate(http_requests_total[5m])) by (service)

# Errors: error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
sum(rate(http_requests_total[5m])) by (service)

# Saturation: CPU throttling ratio (throttled CFS periods / total CFS periods)
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
  /
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
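To read the latency query above correctly, it helps to see how a quantile is estimated from cumulative histogram buckets. The following is a simplified sketch of the linear interpolation that Prometheus documents for classic histograms, not its actual implementation; it ignores native histograms and several edge cases, and the bucket data is made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    like the `le` series of a histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts per upper bound, in seconds.
example = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(f"p50 estimate: {histogram_quantile(0.5, example):.3f}s")
print(f"p99 estimate: {histogram_quantile(0.99, example):.3f}s")
```

Because the estimate is interpolated inside a bucket, its accuracy depends on the bucket boundaries: here the p99 lands somewhere between 0.5s and 1.0s, and the reported value is only as precise as that range. Choosing bucket bounds close to the latency SLO threshold keeps the estimate meaningful.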

Toil Automation Script (Python)

#!/usr/bin/env python3
"""Auto-remediation: restart a deployment whose error rate exceeds a threshold."""
import json
import subprocess
import sys
import urllib.parse
import urllib.request

ERROR_THRESHOLD = 0.05  # 5% error rate triggers restart
PROMETHEUS_URL = "http://prometheus:9090"


def get_error_rate(service: str) -> float:
    """Query Prometheus for the service's 5xx ratio over the last 5 minutes."""
    query = (
        f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    )
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    # An empty result means no matching series; treat it as no errors.
    return float(results[0]["value"][1]) if results else 0.0


def restart_deployment(namespace: str, deployment: str) -> None:
    """Trigger a rolling restart; raises CalledProcessError if kubectl fails."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    print(f"Restarted {namespace}/{deployment}")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        sys.exit(f"usage: {sys.argv[0]} SERVICE NAMESPACE DEPLOYMENT")
    service, namespace, deployment = sys.argv[1:4]
    rate = get_error_rate(service)
    print(f"Error rate for {service}: {rate:.2%}")
    if rate > ERROR_THRESHOLD:
        restart_deployment(namespace, deployment)
    else:
        print("Within error threshold; no action required")
