deployment-pipeline-design

Multi-stage CI/CD pipelines with approval gates and deployment orchestration.

  • Covers four deployment strategies (rolling updates, blue-green, canary, and feature flags), each with trade-offs for downtime, rollback speed, and infrastructure cost
  • Includes approval gate patterns for manual review, time-based delays, and multi-approver workflows across GitHub Actions, GitLab CI, and Azure Pipelines
  • Provides automated rollback mechanisms triggered by health checks and failure detection, plus manual rollback commands for Kubernetes deployments
  • Outlines a nine-stage pipeline flow from source checkout through build, test, staging, approval, production deployment, and verification with monitoring integration

INSTALLATION
npx skills add https://github.com/wshobson/agents --skill deployment-pipeline-design
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

What This Skill Produces

  • Pipeline configuration: Stage definitions, job dependencies, parallelism, and caching strategy
  • Deployment strategy: Chosen rollout pattern with annotated configuration (canary weights, blue-green switchover, rolling parameters)
  • Health check setup: Shallow vs deep readiness probes, post-deployment smoke test scripts
  • Gate definitions: Automated metric thresholds and manual approval workflows
  • Rollback plan: Automated rollback triggers and manual runbook steps

When to Use

  • Design CI/CD architecture for a new service or platform migration
  • Implement deployment gates between environments
  • Configure multi-environment pipelines with mandatory security scanning
  • Establish progressive delivery with canary or blue-green strategies
  • Debug pipelines where stages succeed but production behavior is wrong
  • Reduce mean time to recovery by automating rollback on metric degradation

Pipeline Stages

Standard Pipeline Flow

┌─────────┐   ┌──────┐   ┌─────────┐   ┌────────┐   ┌──────────┐
│  Build  │ → │ Test │ → │ Staging │ → │ Approve│ → │Production│
└─────────┘   └──────┘   └─────────┘   └────────┘   └──────────┘

Detailed Stage Breakdown

  • Source - Code checkout, dependency graph resolution
  • Build - Compile, package, containerize, sign artifacts
  • Test - Unit, integration, SAST/SCA security scans
  • Staging Deploy - Deploy to staging environment with smoke tests
  • Integration Tests - E2E, contract tests, performance baselines
  • Approval Gate - Manual or automated metric-based gate
  • Production Deploy - Canary, blue-green, or rolling strategy
  • Verification - Deep health checks, synthetic monitoring
  • Rollback - Automated rollback on failure signals
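
The stage order above can be sketched as a simple fail-fast runner. This is only an illustration: the stage names mirror the list, while `run_pipeline`, the callables, and the `rollback_from` parameter are hypothetical and not part of any CI system's API.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(
    stages: List[Stage],
    rollback: Callable[[], None],
    rollback_from: str = "production-deploy",
) -> List[str]:
    """Run stages in order and stop at the first failure.

    Rollback is invoked only if the failure happens at or after the
    stage named by rollback_from (nothing to undo before that point).
    """
    log: List[str] = []
    deployed = False
    for name, action in stages:
        if name == rollback_from:
            deployed = True
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            if deployed:
                rollback()
                log.append("rollback: triggered")
            break
    return log


if __name__ == "__main__":
    demo: List[Stage] = [
        ("source", lambda: True),
        ("build", lambda: True),
        ("test", lambda: True),
        ("staging-deploy", lambda: True),
        ("integration-tests", lambda: True),
        ("approval-gate", lambda: True),
        ("production-deploy", lambda: True),
        ("verification", lambda: False),  # simulate a failed health check
    ]
    for line in run_pipeline(demo, rollback=lambda: None):
        print(line)
```

A failure before the production deploy simply stops the run. A failure in verification also triggers the rollback callable.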

Approval Gate Patterns

Pattern 1: Manual Approval (GitHub Actions)

production-deploy:
  needs: staging-deploy
  environment:
    name: production
    url: https://app.example.com
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to production
      run: kubectl apply -f k8s/production/

Environment protection rules in GitHub enforce required reviewers before this job starts. Configure reviewers at Settings → Environments → production → Required reviewers.

Pattern 2: Time-Based Approval (GitLab CI)

deploy:production:
  stage: deploy
  script:
    - deploy.sh production
  environment:
    name: production
  when: delayed
  start_in: 30 minutes
  only:
    - main

Pattern 3: Multi-Approver (Azure Pipelines)

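stages:
  - stage: Production
    dependsOn: Staging
    jobs:
      # ManualValidation runs only in an agentless (server) job,
      # so the approval lives in its own job ahead of the deployment.
      - job: WaitForApproval
        pool: server
        timeoutInMinutes: 1440   # give approvers up to 24 hours
        steps:
          - task: ManualValidation@0
            inputs:
              notifyUsers: "team-leads@example.com"
              instructions: "Review staging metrics before approving"
      - deployment: Deploy
        dependsOn: WaitForApproval
        environment:
          name: production
          resourceType: Kubernetes
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploy to production here"   # placeholder deploy step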

Pattern 4: Automated Metric Gate

Use an AnalysisTemplate (Argo Rollouts) or a custom gate script to block promotion when error rates exceed a threshold:

# Argo Rollouts AnalysisTemplate: blocks canary promotion automatically
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 60s
    successCondition: "result[0] >= 0.95"
    failureCondition: "result[0] < 0.90"
    inconclusiveLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status!~"5..",job="my-app"}[2m]))
          / sum(rate(http_requests_total{job="my-app"}[2m]))
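
The two conditions in the template leave a band between 0.90 and 0.95 that is neither success nor failure. A minimal Python sketch of that classification follows. It uses the thresholds from the template above, but the function name is hypothetical and the handling of missing data as "error" is a simplifying assumption, not a description of Argo Rollouts internals.

```python
from typing import Optional


def evaluate_measurement(
    success_rate: Optional[float],
    success_at: float = 0.95,
    failure_below: float = 0.90,
) -> str:
    """Classify one measurement against the gate thresholds."""
    if success_rate is None:
        # Assumption: a query that returns no data is reported as an error.
        return "error"
    if success_rate >= success_at:
        return "successful"
    if success_rate < failure_below:
        return "failed"
    # Between the two thresholds neither condition holds.
    return "inconclusive"


if __name__ == "__main__":
    for sample in (0.97, 0.92, 0.85, None):
        print(sample, "->", evaluate_measurement(sample))
```

Writing the gate this way makes the middle band explicit, which is the case `inconclusiveLimit` exists to bound.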

Deployment Strategies

Decision Table

Strategy      Downtime  Rollback Speed  Cost Impact      Best For
------------  --------  --------------  ---------------  --------------------------------
Rolling       None      ~minutes        None             Most stateless services
Blue-Green    None      Instant         2x infra (temp)  High-risk or database migrations
Canary        None      Instant         Minimal          High-traffic, metric-driven
Recreate      Yes       Fast            None             Dev/test, batch jobs
Feature Flag  None      Instant         None             Gradual feature exposure

1. Rolling Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # at most 12 pods during rollout
      maxUnavailable: 1   # at least 9 pods always serving

Characteristics: gradual rollout, zero downtime, easy rollback, best for most applications.
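
The pod-count comments in the manifest follow from simple arithmetic. The sketch below reproduces it, including percentage values, for which Kubernetes rounds maxSurge up and maxUnavailable down. The function names are illustrative only.

```python
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or an 'N%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(
    replicas: int, max_surge: IntOrPercent, max_unavailable: IntOrPercent
) -> Tuple[int, int]:
    """Return (maximum total pods, minimum available pods) during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    print(rollout_bounds(10, 2, 1))
    print(rollout_bounds(10, "25%", "25%"))
```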

2. Blue-Green Deployment

# Switch traffic from blue to green
kubectl apply -f k8s/green-deployment.yaml
kubectl rollout status deployment/my-app-green

# Flip the service selector
kubectl patch service my-app -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback instantly if needed
kubectl patch service my-app -p '{"spec":{"selector":{"version":"blue"}}}'

Characteristics: instant switchover, easy rollback, doubles infrastructure cost temporarily, good for high-risk deployments with long warm-up times.

3. Canary Deployment (Argo Rollouts)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

Characteristics: gradual traffic shift, real-user metric validation, automated promotion or rollback, requires Argo Rollouts or a service mesh.
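
To reason about how long the schedule above takes and how many pods each weight implies, a small calculation helps. The sketch below mirrors the `steps` list as Python dictionaries. It assumes weights are approximated by replica counts rounded up, which may not match a setup that uses a traffic router, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '5m' or '30s' into seconds."""
    return int(text[:-1]) * _UNIT_SECONDS[text[-1]]


def summarize_canary(
    steps: List[Dict], replicas: int
) -> Tuple[List[Tuple[int, int]], int]:
    """Return ([(weight, canary pods)], total pause time in seconds)."""
    plan: List[Tuple[int, int]] = []
    total_pause = 0
    for step in steps:
        if "setWeight" in step:
            weight = step["setWeight"]
            # Assumption: canary pod count is the weight share, rounded up.
            plan.append((weight, -(-replicas * weight // 100)))
        elif "pause" in step:
            total_pause += parse_duration(step["pause"]["duration"])
    return plan, total_pause


STEPS = [
    {"setWeight": 10},
    {"pause": {"duration": "5m"}},
    {"setWeight": 25},
    {"pause": {"duration": "5m"}},
    {"setWeight": 50},
    {"pause": {"duration": "10m"}},
    {"setWeight": 100},
]

if __name__ == "__main__":
    plan, pause_seconds = summarize_canary(STEPS, replicas=10)
    print(plan, pause_seconds)
```

With ten replicas the pauses add up to twenty minutes, which is the minimum wall-clock time for a healthy rollout under this schedule.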

4. Feature Flags

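from flagsmith import Flagsmith

flagsmith = Flagsmith(environment_key="API_KEY")

# Flagsmith Python SDK v3+: fetch the environment flags, then check the flag state
flags = flagsmith.get_environment_flags()
if flags.is_feature_enabled("new_checkout_flow"):
    process_checkout_v2()
else:
    process_checkout_v1()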

Characteristics: deploy without releasing, A/B testing, instant rollback per user segment, granular control independent of deployment.
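
Gradual exposure per user segment is commonly done by hashing a stable user identifier into a bucket. The following is a generic sketch of that idea, not how Flagsmith or any particular vendor implements it. `bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of users."""
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, is_enabled("new_checkout_flow", user, 25))
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 10% keeps seeing it as the percentage grows, and setting the percentage to 0 turns it off for everyone without a deploy.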

Pipeline Orchestration

Multi-Stage Pipeline Example (GitHub Actions)

name: Production Pipeline
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.build.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Docker image
        id: build
        run: |
          IMAGE=myapp:${{ github.sha }}
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
          echo "image=$IMAGE" >> "$GITHUB_OUTPUT"
  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # each job starts on a fresh runner
      - name: Unit tests
        run: make test
      - name: Security scan
        run: trivy image ${{ needs.build.outputs.image }}
  deploy-staging:
    needs: test
    environment:
      name: staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/
  integration-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E tests
        run: npm run test:e2e
  deploy-production:
    needs: integration-test
    environment:
      name: production        # blocks here until required reviewers approve
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Canary deployment
        run: |
          kubectl apply -f k8s/production/
          # Wait for the canary steps and analysis to finish instead of
          # force-promoting, which would skip the pause steps
          kubectl argo rollouts status my-app --timeout 30m
  verify:
    needs: deploy-production
    runs-on: ubuntu-latest
    steps:
      - name: Deep health check
        run: |
          for i in {1..12}; do
            STATUS=$(curl -sf https://app.example.com/health/ready | jq -r '.status')
            [ "$STATUS" = "ok" ] && exit 0
            sleep 10
          done
          exit 1
      - name: Notify on success
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

Health Checks

Shallow vs Deep Health Endpoints

A shallow /ping returns 200 even when downstream dependencies are broken. Use a deep readiness endpoint that verifies actual dependencies before promoting traffic.

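# /health/ready: checks real dependencies, used by the pipeline gate
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# check_db_connection, check_redis_connection and check_queue_connection are
# assumed to be async helpers defined elsewhere that return True when healthy.
@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_db_connection(),
        "cache":    await check_redis_connection(),
        "queue":    await check_queue_connection(),
    }
    status = "ok" if all(checks.values()) else "degraded"
    code = 200 if status == "ok" else 503
    return JSONResponse({"status": status, "checks": checks}, status_code=code)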

Post-Deployment Verification Script

#!/usr/bin/env bash
# verify-deployment.sh: run after every production deploy
set -euo pipefail

ENDPOINT="${1:?usage: verify-deployment.sh <base-url>}"
MAX_ATTEMPTS=12
SLEEP_SECONDS=10

for i in $(seq 1 $MAX_ATTEMPTS); do
  STATUS=$(curl -sf "$ENDPOINT/health/ready" | jq -r '.status' 2>/dev/null || echo "unreachable")
  if [ "$STATUS" = "ok" ]; then
    # i - 1 sleeps have elapsed before attempt i
    echo "Health check passed after $(((i - 1) * SLEEP_SECONDS))s"
    exit 0
  fi
  echo "Attempt $i/$MAX_ATTEMPTS: status=$STATUS, retrying in ${SLEEP_SECONDS}s"
  sleep "$SLEEP_SECONDS"
done

echo "Health check failed after $((MAX_ATTEMPTS * SLEEP_SECONDS))s"
exit 1

Rollback Strategies

Automated Rollback in Pipeline

deploy-and-verify:
  steps:
    - name: Deploy new version
      run: kubectl apply -f k8s/
    - name: Wait for rollout
      run: kubectl rollout status deployment/my-app --timeout=5m
    - name: Post-deployment health check
      id: health
      run: ./scripts/verify-deployment.sh https://app.example.com
    - name: Rollback on failure
      if: failure()
      run: |
        kubectl rollout undo deployment/my-app
        echo "Rolled back to previous revision"

Manual Rollback Commands

# List revision history with change-cause annotations
kubectl rollout history deployment/my-app

# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3

# Verify rollback completed
kubectl rollout status deployment/my-app

For advanced rollback strategies including database migration rollbacks and Argo Rollouts abort flows, see references/advanced-strategies.md.

Monitoring and Metrics

Key DORA Metrics to Track

Metric                 Target (Elite)  How to Measure
---------------------  --------------  ------------------------------------
Deployment Frequency   Multiple/day    Pipeline run count per day
Lead Time for Changes  < 1 hour        Commit timestamp → production deploy
Change Failure Rate    < 5%            Failed deploys / total deploys
Mean Time to Recovery  < 1 hour        Incident open → service restored
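
The "How to Measure" column can be turned into a small calculation over deployment records. The sketch below assumes a hypothetical `Deploy` record with the timestamps the table refers to. Real data would come from the CI system and the incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deploy:
    commit_at: datetime
    deployed_at: datetime
    failed: bool = False
    incident_opened_at: Optional[datetime] = None
    restored_at: Optional[datetime] = None


def dora_summary(deploys: List[Deploy], period_days: int) -> Dict[str, Optional[float]]:
    """Compute the four metrics for a reporting period (times in seconds)."""
    lead_times = [(d.deployed_at - d.commit_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [
        (d.restored_at - d.incident_opened_at).total_seconds()
        for d in failures
        if d.incident_opened_at and d.restored_at
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_s": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "mean_time_to_recovery_s": sum(recoveries) / len(recoveries) if recoveries else None,
    }


if __name__ == "__main__":
    start = datetime(2024, 3, 15, 9, 0)
    sample = [
        Deploy(start, start + timedelta(minutes=30)),
        Deploy(start, start + timedelta(minutes=60), failed=True,
               incident_opened_at=start + timedelta(minutes=65),
               restored_at=start + timedelta(minutes=95)),
    ]
    print(dora_summary(sample, period_days=1))
```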

Post-Deployment Metric Verification

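- name: Verify error rate post-deployment
  run: |
    sleep 60  # allow metrics to accumulate
    # jq -r strips the JSON quotes so bc can parse the value
    ERROR_RATE=$(curl -sf "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
      | jq -r '.data.result[0].value[1] // empty')
    # Assumption: missing or NaN data (for example, no traffic yet) fails the check
    if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "NaN" ]; then
      echo "No error-rate data returned, failing the check"
      exit 1
    fi
    echo "Current error rate: $ERROR_RATE"
    if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
      echo "Error rate $ERROR_RATE exceeds 1% threshold, triggering rollback"
      exit 1
    fi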
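
The same check is easier to test when the response parsing is separated from the decision. The sketch below parses the JSON shape returned by the Prometheus instant-query endpoint. The 1% threshold matches the step above, while the function names and the choice to treat missing data as a rollback signal are assumptions.

```python
import math
from typing import Optional


def error_rate_from_response(payload: dict) -> Optional[float]:
    """Extract the first sample value from an instant-query response."""
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None  # the query matched no series
    value = float(results[0]["value"][1])  # Prometheus encodes values as strings
    return None if math.isnan(value) else value


def should_rollback(rate: Optional[float], threshold: float = 0.01) -> bool:
    """Assumption: no data is treated as unsafe, the same as a high error rate."""
    return rate is None or rate > threshold


if __name__ == "__main__":
    response = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [1710500000, "0.002"]}],
        },
    }
    rate = error_rate_from_response(response)
    print(rate, should_rollback(rate))
```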

Pipeline Best Practices

  • Fail fast — Run quick checks (lint, unit tests) before slow ones (E2E, security scans)
  • Parallel execution — Run independent jobs concurrently to minimize total pipeline time
  • Caching — Cache dependency layers and build artifacts between runs
  • Artifact promotion — Build once, promote the same artifact through all environments
  • Environment parity — Keep staging infrastructure as close to production as possible
  • Secrets management — Use secret stores (Vault, AWS Secrets Manager, GitHub encrypted secrets) — never hardcode
  • Deployment windows — Prefer low-traffic windows; enforce change freeze periods via gate policies
  • Idempotent deploys — Ensure re-running a deploy produces the same result
  • Rollback automation — Trigger rollback automatically on health check or metric threshold failure
  • Annotate deployments — Send deployment markers to monitoring tools (Datadog, Grafana) for correlation

Troubleshooting

Health check passes in pipeline but service is unhealthy in production

The pipeline health check is hitting a shallow /ping endpoint that returns 200 even when the database is unreachable. Use a deep readiness check that verifies actual dependencies (see Health Checks section above).

Canary deployment never promotes to 100%

Argo Rollouts requires a valid AnalysisTemplate to auto-promote. If the Prometheus query returns no data (e.g., metric name changed), the analysis stays inconclusive and promotion stalls. Add inconclusiveLimit so the rollout fails fast rather than hanging:

spec:
  metrics:
  - name: error-rate
    failureCondition: "result[0] > 0.05"
    inconclusiveLimit: 2   # fail after 2 inconclusive results, not hang indefinitely
    provider:
      prometheus:
        query: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
          / sum(rate(http_requests_total[2m]))

Staging deploy succeeds but production job never starts

Check that production environment protection rules are configured — a missing reviewer assignment means the approval gate waits indefinitely with no notification. In GitHub Actions, ensure Required reviewers is set to an existing user or team in Settings → Environments → production.

Docker layer cache busted on every run causing slow builds

If COPY . . appears before dependency installation, any source file change invalidates the dependency layer. Reorder to copy dependency manifests first:

# Good: dependencies cached separately from source code
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

Rollback leaves database migrations applied to old code

A service rollback without a migration rollback causes schema/code mismatch errors. Always make migrations backward-compatible (additive only) for at least one release cycle, and keep undo scripts versioned alongside the migration:

# migrations/V20240315__add_nullable_column.sql       (forward)
# migrations/V20240315__add_nullable_column.undo.sql  (backward)

Never run destructive migrations (DROP COLUMN, ALTER NOT NULL) until the old code version is fully retired from all environments.
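
A pipeline can flag destructive statements before they reach a shared environment. The sketch below is a deliberately naive check: it splits on semicolons and matches a handful of patterns, so it will miss dialect-specific syntax and can be confused by semicolons inside string literals. The pattern list and function name are assumptions.

```python
import re
from typing import List

# Patterns for statements that break code still running against the old schema.
_DESTRUCTIVE = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bDROP\s+(TABLE|COLUMN)\b",
        r"\bSET\s+NOT\s+NULL\b",
        r"\bRENAME\s+(COLUMN|TO)\b",
        r"\bTRUNCATE\b",
    )
]


def destructive_statements(sql: str) -> List[str]:
    """Return the statements in sql that match a destructive pattern."""
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    return [s for s in statements if any(p.search(s) for p in _DESTRUCTIVE)]


if __name__ == "__main__":
    migration = """
    ALTER TABLE users ADD COLUMN nickname TEXT NULL;
    ALTER TABLE users DROP COLUMN legacy_name;
    """
    for statement in destructive_statements(migration):
        print("destructive:", statement)
```

Failing the build on any match, with an explicit override for releases where the old code is fully retired, keeps the additive-only rule from depending on reviewer attention.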

Advanced Topics

For platform-specific pipeline configurations, multi-region promotion workflows, and advanced Argo Rollouts patterns, see:

  • references/advanced-strategies.md — Extended YAML examples, platform-specific configs (GitHub Actions, GitLab CI, Azure Pipelines), multi-region canary patterns, and database migration rollback strategies

Related Skills

  • github-actions-templates - For GitHub Actions implementation patterns and reusable workflows
  • gitlab-ci-patterns - For GitLab CI/CD pipeline implementation
  • secrets-management - For secrets handling in CI/CD pipelines