SKILL.md


What This Skill Produces

Pipeline configuration: Stage definitions, job dependencies, parallelism, and caching strategy

Deployment strategy: Chosen rollout pattern with annotated configuration (canary weights, blue-green switchover, rolling parameters)

Health check setup: Shallow vs deep readiness probes, post-deployment smoke test scripts

Gate definitions: Automated metric thresholds and manual approval workflows

Rollback plan: Automated rollback triggers and manual runbook steps

When to Use

Design CI/CD architecture for a new service or platform migration

Implement deployment gates between environments

Configure multi-environment pipelines with mandatory security scanning

Establish progressive delivery with canary or blue-green strategies

Debug pipelines where stages succeed but production behavior is wrong

Reduce mean time to recovery by automating rollback on metric degradation

Pipeline Stages

Standard Pipeline Flow

┌─────────┐   ┌──────┐   ┌─────────┐   ┌─────────┐   ┌────────────┐
│  Build  │ → │ Test │ → │ Staging │ → │ Approve │ → │ Production │
└─────────┘   └──────┘   └─────────┘   └─────────┘   └────────────┘

Detailed Stage Breakdown

Source - Code checkout, dependency graph resolution

Build - Compile, package, containerize, sign artifacts

Test - Unit, integration, SAST/SCA security scans

Staging Deploy - Deploy to staging environment with smoke tests

Integration Tests - E2E, contract tests, performance baselines

Approval Gate - Manual or automated metric-based gate

Production Deploy - Canary, blue-green, or rolling strategy

Verification - Deep health checks, synthetic monitoring

Rollback - Automated rollback on failure signals
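
The stage breakdown above is effectively a dependency graph. As a minimal sketch, assuming each stage simply waits for the one before it (the `STAGE_DEPENDENCIES` map and `execution_order` helper are hypothetical names, not part of any CI system), the run order can be derived with Python's standard `graphlib`. Rollback is left out because it runs conditionally rather than as a fixed step.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each stage lists the stages it waits for.
STAGE_DEPENDENCIES = {
    "source": [],
    "build": ["source"],
    "test": ["build"],
    "staging-deploy": ["test"],
    "integration-tests": ["staging-deploy"],
    "approval-gate": ["integration-tests"],
    "production-deploy": ["approval-gate"],
    "verification": ["production-deploy"],
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on circular deps."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STAGE_DEPENDENCIES)
print(" -> ".join(order))
```

Stages with no dependency between them could run in parallel; a circular `needs` chain shows up here as a `CycleError` instead of a pipeline that never starts.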

Approval Gate Patterns

Pattern 1: Manual Approval (GitHub Actions)

production-deploy:

  needs: staging-deploy

  environment:

    name: production

    url: https://app.example.com

  runs-on: ubuntu-latest

  steps:

    - name: Deploy to production

      run: kubectl apply -f k8s/production/

Environment protection rules in GitHub enforce required reviewers before this job starts. Configure reviewers at Settings → Environments → production → Required reviewers.

Pattern 2: Time-Based Approval (GitLab CI)

deploy:production:

  stage: deploy

  script:

    - deploy.sh production

  environment:

    name: production

  when: delayed

  start_in: 30 minutes

  only:

    - main

Pattern 3: Multi-Approver (Azure Pipelines)

stages:

  - stage: Production

    dependsOn: Staging

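    jobs:

      - job: WaitForApproval

        pool: server   # ManualValidation only runs in agentless (server) jobs

        steps:

          - task: ManualValidation@0

            inputs:

              notifyUsers: "team-leads@example.com"

              instructions: "Review staging metrics before approving"

      - deployment: Deploy

        dependsOn: WaitForApproval

        environment:

          name: production

          resourceType: Kubernetes

        strategy:

          runOnce:

            deploy:

              steps:

                - script: echo "deploy steps go here"   # placeholder for the real deploy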

Pattern 4: Automated Metric Gate

Use an AnalysisTemplate (Argo Rollouts) or a custom gate script to block promotion when the request success rate falls below a threshold:

# Argo Rollouts AnalysisTemplate — blocks canary promotion automatically

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: success-rate

spec:

  metrics:

  - name: success-rate

    interval: 60s

    successCondition: "result[0] >= 0.95"

    failureCondition: "result[0] < 0.90"

    inconclusiveLimit: 3

    provider:

      prometheus:

        address: http://prometheus:9090

        query: |

          sum(rate(http_requests_total{status!~"5..",job="my-app"}[2m]))

          / sum(rate(http_requests_total{job="my-app"}[2m]))
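
The two conditions in the template above leave a band between 0.90 and 0.95 where neither holds, and Argo Rollouts treats a measurement in that band as inconclusive. A small sketch of that three-way outcome follows; `classify_measurement` is a hypothetical helper for reasoning about thresholds, not an Argo Rollouts API.

```python
def classify_measurement(success_rate, success_at=0.95, failure_below=0.90):
    """Mirror the template's successCondition / failureCondition pair."""
    if success_rate >= success_at:
        return "successful"
    if success_rate < failure_below:
        return "failed"
    return "inconclusive"  # neither condition matched


for rate in (0.99, 0.93, 0.85):
    print(rate, classify_measurement(rate))
```

Narrowing the gap between the two thresholds reduces how often a rollout lands in the inconclusive band.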

Deployment Strategies

Decision Table

| Strategy | Downtime | Rollback Speed | Cost Impact | Best For |
|---|---|---|---|---|
| Rolling | None | ~minutes | None | Most stateless services |
| Blue-Green | None | Instant | 2x infra (temp) | High-risk or database migrations |
| Canary | None | Instant | Minimal | High-traffic, metric-driven |
| Recreate | Yes | Fast | None | Dev/test, batch jobs |
| Feature Flag | None | Instant | None | Gradual feature exposure |

1. Rolling Deployment

apiVersion: apps/v1

kind: Deployment

metadata:

  name: my-app

spec:

  replicas: 10

  strategy:

    type: RollingUpdate

    rollingUpdate:

      maxSurge: 2         # at most 12 pods during rollout

      maxUnavailable: 1   # at least 9 pods always serving

Characteristics: gradual rollout, zero downtime, easy rollback, best for most applications.
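
The pod-count comments in the manifest above follow from simple arithmetic. The sketch below reproduces it, including percentage values, where Kubernetes rounds maxSurge up and maxUnavailable down. `rollout_bounds` is a hypothetical helper written for illustration.

```python
def _resolve(value, replicas, round_up):
    """Turn an absolute count or an 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (min_available, max_total) pods during a RollingUpdate."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10, 2, 1))          # (9, 12)
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```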

2. Blue-Green Deployment

# Switch traffic from blue to green

kubectl apply -f k8s/green-deployment.yaml

kubectl rollout status deployment/my-app-green

# Flip the service selector

kubectl patch service my-app -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback instantly if needed

kubectl patch service my-app -p '{"spec":{"selector":{"version":"blue"}}}'

Characteristics: instant switchover, easy rollback, doubles infrastructure cost temporarily, good for high-risk deployments with long warm-up times.

3. Canary Deployment (Argo Rollouts)

apiVersion: argoproj.io/v1alpha1

kind: Rollout

metadata:

  name: my-app

spec:

  replicas: 10

  strategy:

    canary:

      analysis:

        templates:

          - templateName: success-rate

        startingStep: 2

      steps:

        - setWeight: 10

        - pause: { duration: 5m }

        - setWeight: 25

        - pause: { duration: 5m }

        - setWeight: 50

        - pause: { duration: 10m }

        - setWeight: 100

Characteristics: gradual traffic shift, real-user metric validation, automated promotion or rollback, requires Argo Rollouts or a service mesh.
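
To see how long the canary above takes at minimum, the steps can be walked as data. This is a sketch only: the `STEPS` list is a hand-copied simplification of the Rollout (pause durations in minutes), and real rollouts take longer because of pod start-up and analysis runs.

```python
# Steps copied from the Rollout above; pause durations are in minutes.
STEPS = [
    {"setWeight": 10}, {"pause": 5},
    {"setWeight": 25}, {"pause": 5},
    {"setWeight": 50}, {"pause": 10},
    {"setWeight": 100},
]


def canary_timeline(steps):
    """Return (elapsed_minutes, weight) for every setWeight step."""
    elapsed, timeline = 0, []
    for step in steps:
        if "setWeight" in step:
            timeline.append((elapsed, step["setWeight"]))
        else:
            elapsed += step["pause"]
    return timeline


for minute, weight in canary_timeline(STEPS):
    print(f"t+{minute:>2}m  canary weight {weight}%")
```

Full traffic is reached no earlier than 20 minutes after the rollout starts, which is a useful lower bound when setting pipeline timeouts.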

4. Feature Flags

from flagsmith import Flagsmith

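flagsmith = Flagsmith(environment_key="API_KEY")

# Flagsmith Python SDK v3: fetch the environment's flags, then query one by name

flags = flagsmith.get_environment_flags()

if flags.is_feature_enabled("new_checkout_flow"):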

    process_checkout_v2()

else:

    process_checkout_v1()

Characteristics: deploy without releasing, A/B testing, instant rollback per user segment, granular control independent of deployment.
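
Gradual exposure usually relies on assigning each user to a stable bucket so the same user keeps the same experience as the percentage grows. The sketch below shows one common way to do that with a hash; it is an illustrative assumption, not how Flagsmith or any specific flag service implements rollouts, and `in_rollout` is a hypothetical helper.

```python
import hashlib


def in_rollout(flag_name, user_id, percentage):
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percentage


users = [f"user-{n}" for n in range(1000)]
exposed = sum(in_rollout("new_checkout_flow", u, 10) for u in users)
print(f"{exposed} of {len(users)} users see the new flow")
```

Because the bucket depends only on the flag name and user id, raising the percentage from 10 to 50 keeps every already-exposed user exposed.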

Pipeline Orchestration

Multi-Stage Pipeline Example (GitHub Actions)

name: Production Pipeline

on:

  push:

    branches: [main]

jobs:

  build:

    runs-on: ubuntu-latest

    outputs:

      image: ${{ steps.build.outputs.image }}

    steps:

      - uses: actions/checkout@v4

      - name: Build and push Docker image

        id: build

        run: |

          IMAGE=myapp:${{ github.sha }}

          docker build -t $IMAGE .

          docker push $IMAGE

          echo "image=$IMAGE" >> $GITHUB_OUTPUT

  test:

    needs: build

    runs-on: ubuntu-latest

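    steps:

      - uses: actions/checkout@v4   # each job starts on a fresh runner without the source

      - name: Unit tests

        run: make test

      - name: Security scan

        # --exit-code 1 makes the job fail when findings match the severity filter

        run: trivy image --exit-code 1 --severity HIGH,CRITICAL ${{ needs.build.outputs.image }}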

  deploy-staging:

    needs: test

    environment:

      name: staging

    runs-on: ubuntu-latest

    steps:

      - uses: actions/checkout@v4   # manifests in k8s/ live in the repository

      - name: Deploy to staging

        run: kubectl apply -f k8s/staging/

  integration-test:

    needs: deploy-staging

    runs-on: ubuntu-latest

    steps:

      - uses: actions/checkout@v4   # E2E test sources live in the repository

      - name: Run E2E tests

        run: npm run test:e2e

  deploy-production:

    needs: integration-test

    environment:

      name: production        # blocks here until required reviewers approve

    runs-on: ubuntu-latest

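    steps:

      - uses: actions/checkout@v4

      - name: Canary deployment

        run: |

          kubectl apply -f k8s/production/

          # Wait for the canary steps and analysis to finish instead of promoting past them

          kubectl argo rollouts status my-app --timeout 30m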

  verify:

    needs: deploy-production

    runs-on: ubuntu-latest

    steps:

      - name: Deep health check

        run: |

          for i in {1..12}; do

            STATUS=$(curl -sf https://app.example.com/health/ready | jq -r '.status')

            [ "$STATUS" = "ok" ] && exit 0

            sleep 10

          done

          exit 1

      - name: Notify on success

        run: |

          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \

            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

Health Checks

Shallow vs Deep Health Endpoints

A shallow /ping returns 200 even when downstream dependencies are broken. Use a deep readiness endpoint that verifies actual dependencies before promoting traffic.

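# /health/ready: checks real dependencies, used by the pipeline gate

# check_db_connection() and the other check_* helpers are application-specific and return True/False

from fastapi import FastAPI

from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/ready")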

async def readiness():

    checks = {

        "database": await check_db_connection(),

        "cache":    await check_redis_connection(),

        "queue":    await check_queue_connection(),

    }

    status = "ok" if all(checks.values()) else "degraded"

    code = 200 if status == "ok" else 503

    return JSONResponse({"status": status, "checks": checks}, status_code=code)
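
The endpoint above awaits each check in turn, and a check that raises would surface as a 500 rather than a clean "degraded" response. A framework-free sketch of the same aggregation, run concurrently and tolerant of exceptions, is shown below. The `readiness` signature and the `db_ok` / `cache_down` probes are hypothetical stand-ins.

```python
import asyncio


async def readiness(checks):
    """Run dependency checks concurrently; return (http_status, body)."""
    names = list(checks)
    results = await asyncio.gather(
        *(checks[name]() for name in names), return_exceptions=True
    )
    # A check that raises counts as failed rather than crashing the endpoint.
    outcome = {name: result is True for name, result in zip(names, results)}
    status = "ok" if all(outcome.values()) else "degraded"
    return (200 if status == "ok" else 503), {"status": status, "checks": outcome}


# Hypothetical stand-ins for real connection probes.
async def db_ok():
    return True


async def cache_down():
    raise ConnectionError("redis unreachable")


print(asyncio.run(readiness({"database": db_ok, "cache": cache_down})))
```

Running the checks concurrently keeps the endpoint's latency close to the slowest dependency rather than the sum of all of them.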

Post-Deployment Verification Script

#!/usr/bin/env bash

# verify-deployment.sh — run after every production deploy

set -euo pipefail

ENDPOINT="${1:?usage: verify-deployment.sh <base-url>}"

MAX_ATTEMPTS=12

SLEEP_SECONDS=10

for i in $(seq 1 $MAX_ATTEMPTS); do

  STATUS=$(curl -sf "$ENDPOINT/health/ready" | jq -r '.status' 2>/dev/null || echo "unreachable")

  if [ "$STATUS" = "ok" ]; then

    echo "Health check passed after $((i * SLEEP_SECONDS))s"

    exit 0

  fi

  echo "Attempt $i/$MAX_ATTEMPTS: status=$STATUS — retrying in ${SLEEP_SECONDS}s"

  sleep "$SLEEP_SECONDS"

done

echo "Health check failed after $((MAX_ATTEMPTS * SLEEP_SECONDS))s"

exit 1
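
The retry loop in the script above is easier to unit-test when the probe and the sleep are injectable. The following is a sketch of that idea; `wait_until_ready` and its parameters are hypothetical, and unlike the shell script it skips the final sleep after the last failed attempt.

```python
import time


def wait_until_ready(probe, max_attempts=12, sleep_seconds=10, sleep=time.sleep):
    """Call probe() until it returns "ok"; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = probe()
        except Exception:  # treat any probe error as "unreachable", like the script
            status = "unreachable"
        if status == "ok":
            return attempt
        if attempt < max_attempts:
            sleep(sleep_seconds)
    return None


# Hypothetical probe that becomes healthy on the third call.
responses = iter(["unreachable", "degraded", "ok"])
print(wait_until_ready(lambda: next(responses), sleep=lambda s: None))  # 3
```

In a real pipeline the probe would wrap an HTTP request to the readiness endpoint; in tests it can be any callable.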

Rollback Strategies

Automated Rollback in Pipeline

deploy-and-verify:

  steps:

    - name: Deploy new version

      run: kubectl apply -f k8s/

    - name: Wait for rollout

      run: kubectl rollout status deployment/my-app --timeout=5m

    - name: Post-deployment health check

      id: health

      run: ./scripts/verify-deployment.sh https://app.example.com

    - name: Rollback on failure

      if: failure()

      run: |

        kubectl rollout undo deployment/my-app

        echo "Rolled back to previous revision"

Manual Rollback Commands

# List revision history with change-cause annotations

kubectl rollout history deployment/my-app

# Rollback to previous version

kubectl rollout undo deployment/my-app

# Rollback to a specific revision

kubectl rollout undo deployment/my-app --to-revision=3

# Verify rollback completed

kubectl rollout status deployment/my-app

For advanced rollback strategies including database migration rollbacks and Argo Rollouts abort flows, see references/advanced-strategies.md.

Monitoring and Metrics

Key DORA Metrics to Track

| Metric | Target (Elite) | How to Measure |
|---|---|---|
| Deployment Frequency | Multiple/day | Pipeline run count per day |
| Lead Time for Changes | < 1 hour | Commit timestamp → production deploy |
| Change Failure Rate | < 5% | Failed deploys / total deploys |
| Mean Time to Recovery | < 1 hour | Incident open → service restored |
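
As a rough sketch of how the four metrics can be computed from pipeline records, the example below uses made-up data. The record shapes (`DEPLOYS`, `INCIDENTS`) and the `dora_metrics` helper are assumptions for illustration; real data would come from the CI system and the incident tracker.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deploy records: commit time, deploy time, and whether it failed.
DEPLOYS = [
    {"commit": datetime(2024, 3, 1, 9, 0), "deployed": datetime(2024, 3, 1, 9, 40), "failed": False},
    {"commit": datetime(2024, 3, 1, 13, 0), "deployed": datetime(2024, 3, 1, 13, 30), "failed": True},
    {"commit": datetime(2024, 3, 2, 10, 0), "deployed": datetime(2024, 3, 2, 10, 50), "failed": False},
    {"commit": datetime(2024, 3, 2, 15, 0), "deployed": datetime(2024, 3, 2, 15, 20), "failed": False},
]
# Hypothetical incidents as (opened, restored) pairs.
INCIDENTS = [(datetime(2024, 3, 1, 13, 35), datetime(2024, 3, 1, 14, 5))]


def dora_metrics(deploys, incidents, days):
    """Summarize deploy and incident records over a window of `days` days."""
    lead_times = [d["deployed"] - d["commit"] for d in deploys]
    recoveries = [restored - opened for opened, restored in incidents]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": sum(d["failed"] for d in deploys) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


for name, value in dora_metrics(DEPLOYS, INCIDENTS, days=2).items():
    print(f"{name}: {value}")
```

The median is used for lead time here so that one slow change does not dominate; a mean works too if that is what the team already reports.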

Post-Deployment Metric Verification

- name: Verify error rate post-deployment

  run: |

    sleep 60  # allow metrics to accumulate

    ERROR_RATE=$(curl -sf "$PROMETHEUS_URL/api/v1/query" \

      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \

      | jq -r '.data.result[0].value[1]')   # -r strips the JSON quotes so bc receives a bare number

    echo "Current error rate: $ERROR_RATE"

    if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then

      echo "Error rate $ERROR_RATE exceeds 1% threshold — triggering rollback"

      exit 1

    fi
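
The shell step above assumes the query always returns one numeric series. A sketch of the same check that also handles an empty result or a NaN value (for example, 0/0 when there is no traffic) is shown below. Treating missing data as a reason to roll back is a policy assumption made here, and both function names are hypothetical.

```python
import json
import math


def error_rate_from_response(body):
    """Extract the scalar from a Prometheus instant-query response, or None."""
    payload = json.loads(body)
    result = payload.get("data", {}).get("result", [])
    if payload.get("status") != "success" or not result:
        return None  # no series matched: treat as "no data", not as 0% errors
    rate = float(result[0]["value"][1])  # value is [timestamp, "string"]
    return None if math.isnan(rate) else rate


def should_roll_back(body, threshold=0.01):
    """Assumed policy: roll back on missing data or when the rate exceeds the threshold."""
    rate = error_rate_from_response(body)
    return rate is None or rate > threshold


sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1710000000,"0.004"]}]}}'
)
print(should_roll_back(sample))  # False
```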

Pipeline Best Practices

Fail fast — Run quick checks (lint, unit tests) before slow ones (E2E, security scans)

Parallel execution — Run independent jobs concurrently to minimize total pipeline time

Caching — Cache dependency layers and build artifacts between runs

Artifact promotion — Build once, promote the same artifact through all environments

Environment parity — Keep staging infrastructure as close to production as possible

Secrets management — Use secret stores (Vault, AWS Secrets Manager, GitHub encrypted secrets) — never hardcode

Deployment windows — Prefer low-traffic windows; enforce change freeze periods via gate policies

Idempotent deploys — Ensure re-running a deploy produces the same result

Rollback automation — Trigger rollback automatically on health check or metric threshold failure

Annotate deployments — Send deployment markers to monitoring tools (Datadog, Grafana) for correlation

Troubleshooting

Health check passes in pipeline but service is unhealthy in production

The pipeline health check is hitting a shallow /ping endpoint that returns 200 even when the database is unreachable. Use a deep readiness check that verifies actual dependencies (see Health Checks section above).

Canary deployment never promotes to 100%

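When a canary references an AnalysisTemplate, promotion depends on the analysis run succeeding. If the Prometheus query returns no usable data (e.g., the metric name changed), measurements can come back inconclusive or errored, and the rollout pauses instead of advancing. Set inconclusiveLimit so the analysis run ends after a bounded number of inconclusive measurements rather than leaving the rollout hanging: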

spec:

  metrics:

  - name: error-rate

    failureCondition: "result[0] > 0.05"

    inconclusiveLimit: 2   # fail after 2 inconclusive results, not hang indefinitely

    provider:

      prometheus:

        query: |

          sum(rate(http_requests_total{status=~"5.."}[2m]))

          / sum(rate(http_requests_total[2m]))

Staging deploy succeeds but production job never starts

Check that production environment protection rules are configured — a missing reviewer assignment means the approval gate waits indefinitely with no notification. In GitHub Actions, ensure Required reviewers is set to an existing user or team in Settings → Environments → production.

Docker layer cache busted on every run causing slow builds

If COPY . . appears before dependency installation, any source file change invalidates the dependency layer. Reorder to copy dependency manifests first:

# Good: dependencies cached separately from source code

COPY package*.json ./

RUN npm ci

COPY . .

RUN npm run build

Rollback leaves database migrations applied to old code

A service rollback without a migration rollback causes schema/code mismatch errors. Always make migrations backward-compatible (additive only) for at least one release cycle, and keep undo scripts versioned alongside the migration:

# migrations/V20240315__add_nullable_column.sql       (forward)

# migrations/V20240315__add_nullable_column.undo.sql  (backward)

Never run destructive migrations (DROP COLUMN, adding a NOT NULL constraint) until the old code version is fully retired from all environments.
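
A pipeline gate can flag destructive statements before they reach production. The sketch below is a deliberately naive lint: the pattern list is an assumption, it splits on semicolons without parsing SQL, and `destructive_statements` is a hypothetical helper rather than part of any migration tool.

```python
import re

# Assumed patterns for statements that break code still on the old schema.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+(TABLE|COLUMN)\b",
    r"\bRENAME\s+(TO|COLUMN)\b",
    r"\bSET\s+NOT\s+NULL\b",
]


def destructive_statements(sql):
    """Return the statements in a migration that match a destructive pattern."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [
        s for s in statements
        if any(re.search(p, s, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)
    ]


migration = """
ALTER TABLE orders ADD COLUMN coupon_code TEXT NULL;
ALTER TABLE orders DROP COLUMN legacy_discount;
"""
print(destructive_statements(migration))
```

A gate could fail the build when this list is non-empty unless the change is explicitly marked as a cleanup for an already retired release.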

Advanced Topics

For platform-specific pipeline configurations, multi-region promotion workflows, and advanced Argo Rollouts patterns, see:

references/advanced-strategies.md — Extended YAML examples, platform-specific configs (GitHub Actions, GitLab CI, Azure Pipelines), multi-region canary patterns, and database migration rollback strategies

Related Skills

github-actions-templates - For GitHub Actions implementation patterns and reusable workflows

gitlab-ci-patterns - For GitLab CI/CD pipeline implementation

secrets-management - For secrets handling in CI/CD pipelines
