slo-implementation

Define and implement Service Level Indicators, Objectives, and error budgets for reliability targets.

  • Provides the SLI/SLO/SLA hierarchy with common indicator types (availability, latency, durability) and Prometheus recording rules for automated calculation
  • Includes error budget formulas, burn rate calculations, and multi-window alerting strategies to balance reliability with development velocity
  • Offers SLO compliance dashboards, review processes (weekly/monthly/quarterly), and decision frameworks for setting achievable targets
  • Covers error budget policies that trigger actions based on remaining budget percentage, from normal development to feature freeze

INSTALLATION
npx skills add https://github.com/wshobson/agents --skill slo-implementation
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

#### 1. Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

#### 2. Latency SLI

# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

#### 3. Durability SLI

# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

| SLO %  | Downtime/Month | Downtime/Year |
|--------|----------------|---------------|
| 99%    | 7.2 hours      | 3.65 days     |
| 99.9%  | 43.2 minutes   | 8.76 hours    |
| 99.95% | 21.6 minutes   | 4.38 hours    |
| 99.99% | 4.32 minutes   | 52.56 minutes |
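
The downtime figures follow from multiplying the error budget (1 minus the SLO) by the length of the period. A minimal Python sketch of that arithmetic, assuming a 30-day month and a 365-day year (which is what the numbers above imply); the function name `allowed_downtime_minutes` is hypothetical and exists only for illustration:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime an availability SLO permits over a period.

    Assumption: downtime means the service is completely unavailable,
    and the period is expressed in whole 24-hour days.
    """
    error_budget = 1 - slo_percent / 100      # e.g. 99.9% -> 0.001
    return error_budget * period_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime_minutes(slo, 30)    # assumed 30-day month
        per_year = allowed_downtime_minutes(slo, 365)    # assumed 365-day year
        print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

With a 28-day window, as used in the queries elsewhere in this document, the monthly allowance shrinks proportionally.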

Choose Appropriate SLOs

Consider:

  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))
  # Target of 99 means 99% of requests complete within 0.5 s.
  # Despite the "p95" in the name, this bounds the 99th percentile.
  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability
  • Error Budget: 100% - 99.9% = 0.1%, or 43.2 minutes in a 30-day month
  • Current Error: 0.05%, or 21.6 minutes of that allowance
  • Remaining Budget: (0.1% - 0.05%) / 0.1% = 50%
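
The example above can be reproduced in a few lines. This is a sketch rather than a reference implementation; `error_budget_remaining` is a hypothetical helper name, and both inputs are assumed to be ratios between 0 and 1 with an SLO strictly below 1:

```python
def error_budget_remaining(sli: float, slo: float) -> float:
    """Return the fraction of error budget left (negative once overspent).

    Assumption: sli and slo are ratios such as 0.9995 and 0.999, and slo < 1
    (a 100% target leaves no budget and would divide by zero here).
    """
    budget = 1 - slo       # allowed failure ratio, e.g. 0.001
    consumed = 1 - sli     # observed failure ratio, e.g. 0.0005
    return 1 - consumed / budget


if __name__ == "__main__":
    remaining = error_budget_remaining(sli=0.9995, slo=0.999)
    print(f"Remaining budget: {remaining:.0%}")
```

Algebraically this equals `(sli - slo) / (1 - slo)`, the same form the recording rule for `slo:http_availability:error_budget_remaining` uses before scaling to a percentage.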

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md
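
One way to apply the error budget policy programmatically is a simple threshold lookup. The policy lists point values (100%, 50%, 10%, 0%) without saying how values in between are treated, so the banding below is an assumption: each action applies once the remaining budget falls to or below its threshold. The function name `policy_action` is hypothetical.

```python
def policy_action(remaining_percent: float) -> str:
    """Map remaining error budget (in percent) to a policy action.

    Assumption: thresholds are treated as upper bounds of bands, so
    e.g. anything in (10, 50] maps to the 50% action.
    """
    if remaining_percent <= 0:
        return "Feature freeze, focus on reliability"
    if remaining_percent <= 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:
        return "Consider postponing risky changes"
    return "Normal development velocity"


if __name__ == "__main__":
    for pct in (65, 50, 8, -3):
        print(f"{pct}% remaining -> {policy_action(pct)}")
```

Teams may reasonably choose different band edges; what matters is that the mapping is written down and agreed before the budget runs low.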

SLO Implementation

Prometheus Recording Rules

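# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
      # Latency SLI (requests completing within 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999
      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
      # Error budget burn rate (observed error ratio / error budget)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)
      # Longer windows referenced by the multi-window burn-rate alerts
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)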
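
The burn-rate expression divides the observed error ratio by the error budget, so a value of 1 means the budget would be spent exactly at the end of the SLO window. A small Python sketch of the same calculation, with the hypothetical helper name `burn_rate` and inputs assumed to be ratios:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    Assumption: error_ratio is failed/total over some window (e.g. 5m),
    and slo is a ratio below 1 such as 0.999.
    """
    error_budget = 1 - slo
    return error_ratio / error_budget


if __name__ == "__main__":
    # 1.44% of requests failing against a 99.9% SLO
    print(f"{burn_rate(0.0144, 0.999):.1f}x")
```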

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% of a 30-day error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% of a 30-day error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
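
The 14.4 and 6 thresholds are not arbitrary: each is the burn rate that would consume a chosen share of the budget within the alert's long window. The sketch below derives them, assuming a 30-day SLO period (the percentages in the rule comments match 30 days; with the 28-day window used in the SLI queries the thresholds come out slightly lower). `burn_rate_threshold` is a hypothetical helper name.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_period_days: float = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window.

    Assumption: a burn rate of 1 spends the whole budget in exactly
    slo_period_days, so spending a fraction f in w hours needs
    f * (period in hours) / w.
    """
    return budget_fraction * slo_period_days * 24 / alert_window_hours


if __name__ == "__main__":
    print(f"fast burn: {burn_rate_threshold(0.02, 1):.1f}x")   # 2% in 1 hour
    print(f"slow burn: {burn_rate_threshold(0.05, 6):.1f}x")   # 5% in 6 hours
    print(f"fast burn, 28d: {burn_rate_threshold(0.02, 1, 28):.2f}x")
```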

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLI as a percentage (compare against the 99.9% target)
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at the average burn rate of the last 28 days)
(slo:http_availability:error_budget_remaining / 100) * 28
*
((1 - 0.999) / (1 - sli:http_availability:ratio))
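
The last query projects how long the remaining budget would last if errors kept arriving at the 28-day average rate. A Python sketch of the same arithmetic, useful for sanity-checking dashboard values; `days_until_exhausted` is a hypothetical name, and the clamping of overspent budgets to zero is an assumption not present in the query:

```python
import math


def days_until_exhausted(sli: float, slo: float, window_days: float = 28) -> float:
    """Project days until the error budget runs out at the window-average rate.

    Assumptions: sli and slo are ratios, slo < 1, and an already
    overspent budget is reported as 0 days rather than a negative number.
    """
    budget = 1 - slo
    consumed_fraction = (1 - sli) / budget    # share of budget used in the window
    if consumed_fraction <= 0:
        return math.inf                       # no errors observed, nothing is burning
    remaining_fraction = max(1 - consumed_fraction, 0.0)
    # Budget is spent at consumed_fraction / window_days per day.
    return remaining_fraction * window_days / consumed_fraction


if __name__ == "__main__":
    print(f"{days_until_exhausted(0.9995, 0.999):.1f} days")
```

Because the SLI is measured over a rolling window, this is a rough projection rather than a deadline: old errors age out of the window as new ones arrive.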

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
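
To see why pairing windows reduces noise, the alert condition can be evaluated against sample burn rates. This sketch mirrors the boolean logic of the rule above; the function name `should_alert` and the dictionary keys are hypothetical, and the sample values are made up for illustration:

```python
from typing import Mapping


def should_alert(burn: Mapping[str, float]) -> bool:
    """Evaluate the multi-window condition on burn rates keyed by window.

    Assumption: `burn` holds values for the "5m", "30m", "1h" and "6h" windows.
    """
    fast = burn["1h"] > 14.4 and burn["5m"] > 14.4    # long and short window agree
    slow = burn["6h"] > 6 and burn["30m"] > 6
    return fast or slow


if __name__ == "__main__":
    # A brief spike: the 5m window is hot, but the 1h window has not caught up.
    spike = {"5m": 20.0, "30m": 4.0, "1h": 3.0, "6h": 1.0}
    # A sustained problem: both fast-burn windows exceed the threshold.
    sustained = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0}
    print(should_alert(spike), should_alert(sustained))
```

The long window confirms that a meaningful share of budget has been spent, while the short window confirms the problem is still happening, so the alert also clears quickly once the errors stop.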

SLO Review Process

Weekly Review

  • Current SLO compliance
  • Error budget status
  • Trend analysis
  • Incident impact

Monthly Review

  • SLO achievement
  • Error budget usage
  • Incident postmortems
  • SLO adjustments

Quarterly Review

  • SLO relevance
  • Target adjustments
  • Process improvements
  • Tooling enhancements

Best Practices

  • Start with user-facing services
  • Use multiple SLIs (availability, latency, etc.)
  • Set achievable SLOs (don't aim for 100%)
  • Implement multi-window alerts to reduce noise
  • Track error budget consistently
  • Review SLOs regularly
  • Document SLO decisions
  • Align with business goals
  • Automate SLO reporting
  • Use SLOs for prioritization

Related Skills

  • prometheus-configuration - For metric collection
  • grafana-dashboards - For SLO visualization