devops-automation

DevOps and IT Ops automation - CI/CD, monitoring, incident management, and infrastructure workflows

INSTALLATION
npx skills add https://github.com/claude-office-skills/skills --skill devops-automation
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

DevOps Automation

Automate DevOps workflows including CI/CD pipelines, monitoring, incident management, and infrastructure operations. Based on n8n's IT Ops workflow templates.

Overview

This skill covers:

  • CI/CD pipeline automation
  • Monitoring and alerting
  • Incident management
  • Infrastructure automation
  • Deployment workflows

CI/CD Automation

GitHub Actions Integration

workflow: "GitHub CI/CD Notifications"

triggers:

  - github_push

  - github_pull_request

  - github_workflow_run

on_push:

  action:

    - trigger_ci: if_main_branch

    - notify_slack:

        channel: "#deployments"

        message: |

          πŸ“¦ *New Push to {branch}*

          Commit: `{commit_sha_short}`

          Author: {author}

          Message: {commit_message}

          [View Diff]({compare_url})

on_pr_opened:

  action:

    - notify_slack:

        channel: "#code-review"

        message: |

          πŸ”€ *New Pull Request*

          Title: {pr_title}

          Author: {author}

          Branch: {head} β†’ {base}

          [Review PR]({pr_url})

    - assign_reviewers: based_on_codeowners

    - run_ci_checks

on_workflow_complete:

  action:

    - notify_slack:

        message: |

          {status_emoji} *Build {status}*

          Workflow: {workflow_name}

          Branch: {branch}

          Duration: {duration}

          {if_failed: [View Logs]({logs_url})}

Deployment Pipeline

deployment_pipeline:

  stages:

    build:

      trigger: push_to_main

      steps:

        - checkout_code

        - install_dependencies

        - run_tests

        - build_artifact

        - push_to_registry

    staging:

      trigger: build_success

      steps:

        - deploy_to_staging

        - run_integration_tests

        - notify_qa

    production:

      trigger: manual_approval

      steps:

        - create_backup

        - deploy_to_production

        - run_smoke_tests

        - notify_team

  rollback:

    trigger: deployment_failed OR manual

    steps:

      - revert_to_previous

      - notify_team

      - create_incident

Monitoring & Alerting

Alert Routing

alert_routing:

  sources:

    - prometheus

    - datadog

    - cloudwatch

    - new_relic

  severity_levels:

    critical:

      response_time: 5_minutes

      channels: [pagerduty, slack_urgent, sms]

      escalation: immediate

    high:

      response_time: 15_minutes

      channels: [slack_alerts, email]

      escalation: after_15_minutes

    medium:

      response_time: 1_hour

      channels: [slack_alerts]

    low:

      response_time: 24_hours

      channels: [slack_logging]

  routing_rules:

    - if: service == "payments"

      team: payments_oncall

      severity_boost: +1

    - if: service == "auth"

      team: security_oncall

    - default:

      team: platform_oncall

Alert Templates

alert_templates:

  infrastructure:

    cpu_high:

      title: "πŸ”₯ High CPU Usage"

      body: |

        Server: {host}

        CPU: {cpu_percent}%

        Duration: {duration}

        Threshold: {threshold}%

        [View Dashboard]({grafana_url})

    memory_critical:

      title: "πŸ’Ύ Critical Memory"

      body: |

        Server: {host}

        Memory: {memory_percent}%

        Available: {available_mb}MB

        [SSH to Server]({ssh_link})

    disk_full:

      title: "πŸ’Ώ Disk Space Critical"

      body: |

        Server: {host}

        Disk: {disk_percent}%

        Available: {available_gb}GB

        Suggestion: Clean logs or expand volume

  application:

    error_spike:

      title: "πŸ“ˆ Error Rate Spike"

      body: |

        Service: {service}

        Error Rate: {error_rate}%

        Normal: {baseline}%

        Top Errors:

        {top_errors}

    latency_high:

      title: "🐒 High Latency"

      body: |

        Service: {service}

        P99 Latency: {p99_ms}ms

        Threshold: {threshold_ms}ms

Incident Management

Incident Workflow

incident_workflow:

  detection:

    sources: [monitoring, user_report, automated_check]

  triage:

    auto_severity:

      - if: affects_payments

        severity: critical

      - if: affects_auth

        severity: critical

      - if: affects_api AND error_rate > 10%

        severity: high

  response:

    critical:

      - create_incident_channel: "#inc-{timestamp}"

      - page_oncall: immediately

      - notify_stakeholders: [engineering_lead, product]

      - start_war_room: zoom_link

      - create_status_page: incident

    high:

      - create_incident_channel

      - notify_oncall: slack

      - create_ticket: jira

  communication:

    internal:

      frequency: every_30_minutes

      channel: incident_channel

      template: |

        πŸ“Š *Incident Update*

        Status: {status}

        Impact: {impact}

        Next update: {next_update_time}

        Current actions:

        {action_items}

    external:

      channel: status_page

      template: customer_facing_update

  resolution:

    steps:

      - confirm_resolution

      - update_status_page: resolved

      - notify_stakeholders

      - schedule_postmortem

      - close_incident_channel: after_24h

Postmortem Template

postmortem_template:

  sections:

    summary:

      - incident_title

      - duration

      - severity

      - impact

    timeline:

      format: |

        | Time | Event |

        |------|-------|

        | {time} | {event} |

    root_cause:

      - what_happened

      - why_it_happened

      - contributing_factors

    impact:

      - users_affected

      - revenue_impact

      - sla_breach

    resolution:

      - how_it_was_fixed

      - time_to_detect

      - time_to_resolve

    action_items:

      format: |

        | Action | Owner | Due Date | Status |

        |--------|-------|----------|--------|

    lessons_learned:

      - what_went_well

      - what_went_poorly

      - lucky_breaks

Infrastructure Automation

Server Provisioning

provisioning_workflow:

  trigger: jira_ticket OR slack_request

  steps:

    1. validate_request:

        check: [budget_approval, security_review]

    2. create_infrastructure:

        terraform:

          - vpc

          - security_groups

          - ec2_instances

          - load_balancer

    3. configure_server:

        ansible:

          - base_configuration

          - security_hardening

          - monitoring_agent

          - application_setup

    4. validate:

        - health_check

        - security_scan

        - performance_baseline

    5. notify:

        slack: "βœ… Server {hostname} is ready"

        include: [ssh_access, dashboard_link]

Scheduled Maintenance

maintenance_automation:

  tasks:

    certificate_renewal:

      schedule: "30 days before expiry"

      action:

        - request_new_cert: letsencrypt

        - deploy_cert

        - verify_ssl

        - notify: if_failure

    security_patching:

      schedule: "weekly"

      action:

        - check_updates

        - if_critical: immediate_patch

        - else: schedule_maintenance_window

    log_rotation:

      schedule: "daily"

      action:

        - rotate_logs

        - compress_old

        - upload_to_s3

        - delete_local: older_than_7_days

    backup_verification:

      schedule: "weekly"

      action:

        - restore_to_test_env

        - run_integrity_checks

        - report_status

Kubernetes Automation

K8s Workflows

kubernetes_automation:

  deployment:

    trigger: docker_image_pushed

    steps:

      - update_manifest: with_new_image_tag

      - apply_to_staging

      - run_tests

      - if_success: apply_to_production

  scaling:

    trigger: metric_threshold

    rules:

      - if: cpu > 80%

        action: scale_up

        max_replicas: 10

      - if: cpu < 20%

        action: scale_down

        min_replicas: 2

  rollback:

    trigger: health_check_failed

    action:

      - kubectl_rollout_undo

      - notify_team

      - create_incident

Output Example

Request: "Set up deployment notifications for GitHub Actions"

Output:

# GitHub Actions Deployment Notifications

## n8n Workflow

trigger: GitHub Webhook

events: [workflow_run]


## Notification Templates

**Build Started:**

πŸš€ Deployment Started

Branch: main

Commit: abc1234

Author: @developer

Triggered by: Push

View Workflow


**Build Success:**

βœ… Deployment Successful

Environment: Production

Duration: 3m 42s

Version: v1.2.3

Changes:

β€’ Feature X

β€’ Bug fix Y

View Deployment


**Build Failed:**

❌ Deployment Failed

Stage: Test

Error: npm test failed

View Logs

Retry


## Slack Integration

channel: "#deployments"

mention_on_failure: "@oncall"

thread_replies: true

---

DevOps Automation Skill - Part of Claude Office Skills

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills β†’

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free Β· no credit card