SKILL.md

Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

When to Use This Skill

Setting up distributed tracing across services

Implementing service mesh metrics and dashboards

Debugging latency and error issues

Defining SLOs for service communication

Visualizing service dependencies

Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐

│                  Observability                       │

├─────────────────┬─────────────────┬─────────────────┤

│     Metrics     │     Traces      │      Logs       │

│                 │                 │                 │

│ • Request rate  │ • Span context  │ • Access logs   │

│ • Error rate    │ • Latency       │ • Error details │

│ • Latency P50   │ • Dependencies  │ • Debug info    │

│ • Saturation    │ • Bottlenecks   │ • Audit trail   │

└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

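| Signal     | Description               | Alert Threshold   |
|------------|---------------------------|-------------------|
| Latency    | Request duration P50, P99 | P99 > 500ms       |
| Traffic    | Requests per second       | Anomaly detection |
| Errors     | 5xx error rate            | > 1%              |
| Saturation | Resource utilization      | > 80%             |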

Templates

Template 1: Istio with Prometheus & Grafana

# Prometheus scrape configuration for Istio

apiVersion: v1

kind: ConfigMap

metadata:

  name: prometheus

  namespace: istio-system

data:

  prometheus.yml: |

    global:

      scrape_interval: 15s

    scrape_configs:

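      # Control plane (istiod) metrics; the istio-telemetry (Mixer) service no longer exists in current Istio

      - job_name: 'istiod'

        kubernetes_sd_configs:

          - role: endpoints

            namespaces:

              names:

                - istio-system

        relabel_configs:

          - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]

            action: keep

            regex: istiod;http-monitoring

      # Data plane (Envoy sidecar) metrics such as istio_requests_total

      - job_name: 'envoy-stats'

        metrics_path: /stats/prometheus

        kubernetes_sd_configs:

          - role: pod

        relabel_configs:

          - source_labels: [__meta_kubernetes_pod_container_port_name]

            action: keep

            regex: '.*-envoy-prom'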

---

# ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1

kind: ServiceMonitor

metadata:

  name: istio-mesh

  namespace: istio-system

spec:

  selector:

    matchLabels:

      app: istiod

  endpoints:

    - port: http-monitoring

      interval: 15s

Template 2: Key Istio Metrics Queries

# Request rate by service

sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)

sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))

  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency

histogram_quantile(0.99,

  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))

  by (le, destination_service_name))

# TCP connections opened per second

sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

# P99 request size (bytes)

histogram_quantile(0.99,

  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))

  by (le, destination_service_name))
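The queries above are recomputed on every dashboard refresh and alert evaluation. One option is to precompute them as recording rules. This is a minimal sketch, assuming the Prometheus Operator is installed (as the ServiceMonitor in Template 1 implies). The resource name and the recorded series names are illustrative choices, not Istio defaults.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-recording-rules   # hypothetical name
  namespace: istio-system
spec:
  groups:
    - name: mesh.recording
      interval: 30s
      rules:
        # Request rate per destination service
        - record: destination_service:istio_requests:rate5m
          expr: |
            sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
        # 5xx request rate per destination service
        - record: destination_service:istio_requests_5xx:rate5m
          expr: |
            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
        # P99 latency per destination service, in milliseconds
        - record: destination_service:istio_request_duration_milliseconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
              by (le, destination_service_name))
```

Dashboards and alerts can then read the recorded series instead of re-aggregating the raw histograms.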

Template 3: Jaeger Distributed Tracing

# Istio mesh tracing config: report spans to Jaeger's Zipkin-compatible endpoint

apiVersion: install.istio.io/v1alpha1

kind: IstioOperator

spec:

  meshConfig:

    enableTracing: true

    defaultConfig:

      tracing:

        sampling: 100.0 # 100% in dev, lower in prod

        zipkin:

          address: jaeger-collector.istio-system:9411

---

# Jaeger deployment

apiVersion: apps/v1

kind: Deployment

metadata:

  name: jaeger

  namespace: istio-system

spec:

  selector:

    matchLabels:

      app: jaeger

  template:

    metadata:

      labels:

        app: jaeger

    spec:

      containers:

        - name: jaeger

          image: jaegertracing/all-in-one:1.50

          ports:

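            - containerPort: 5775 # Zipkin Thrift compact (UDP, legacy clients)

              protocol: UDP

            - containerPort: 6831 # Jaeger Thrift compact (UDP)

              protocol: UDP

            - containerPort: 6832 # Jaeger Thrift binary (UDP)

              protocol: UDP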

            - containerPort: 5778 # Config

            - containerPort: 16686 # UI

            - containerPort: 14268 # HTTP

            - containerPort: 14250 # gRPC

            - containerPort: 9411 # Zipkin

          env:

            - name: COLLECTOR_ZIPKIN_HOST_PORT

              value: ":9411"
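The mesh config above sends spans to `jaeger-collector.istio-system:9411`, and Template 6 points Kiali at `jaeger-query.istio-system:16686`, but only a Deployment is defined here. The sketch below adds the two Services those addresses need. It assumes the single all-in-one pod backs both Services, and the port names are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: istio-system
spec:
  selector:
    app: jaeger              # matches the Deployment's pod labels
  ports:
    - name: http-zipkin      # Zipkin-compatible endpoint used by the Istio proxies
      port: 9411
      targetPort: 9411
    - name: grpc-collector   # Jaeger gRPC collector endpoint
      port: 14250
      targetPort: 14250
    - name: grpc-otlp        # OTLP gRPC, only useful if Jaeger's OTLP receiver is enabled
      port: 4317
      targetPort: 4317
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-query
  namespace: istio-system
spec:
  selector:
    app: jaeger
  ports:
    - name: http-query       # Jaeger UI and query API
      port: 16686
      targetPort: 16686
```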

Template 4: Linkerd Viz Dashboard

# Install Linkerd viz extension

linkerd viz install | kubectl apply -f -

# Access dashboard

linkerd viz dashboard

# CLI commands for observability

# Top requests

linkerd viz top deploy/my-app

# Per-route metrics

linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection

linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)

linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

{

  "dashboard": {

    "title": "Service Mesh Overview",

    "panels": [

      {

        "title": "Request Rate",

        "type": "timeseries",

        "targets": [

          {

            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",

            "legendFormat": "{{destination_service_name}}"

          }

        ]

      },

      {

        "title": "Error Rate",

        "type": "gauge",

        "targets": [

          {

            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"

          }

        ],

        "fieldConfig": {

          "defaults": {

            "thresholds": {

              "steps": [

                { "value": 0, "color": "green" },

                { "value": 1, "color": "yellow" },

                { "value": 5, "color": "red" }

              ]

            }

          }

        }

      },

      {

        "title": "P99 Latency",

        "type": "timeseries",

        "targets": [

          {

            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",

            "legendFormat": "{{destination_service_name}}"

          }

        ]

      },

      {

        "title": "Service Topology",

        "type": "nodeGraph",

        "targets": [

          {

            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"

          }

        ]

      }

    ]

  }

}

Template 6: Kiali Service Mesh Visualization

# Kiali installation

apiVersion: kiali.io/v1alpha1

kind: Kiali

metadata:

  name: kiali

  namespace: istio-system

spec:

  auth:

    strategy: anonymous # no login, dev only; use openid or token on shared clusters

  deployment:

    accessible_namespaces:

      - "**"

  external_services:

    prometheus:

      url: http://prometheus.istio-system:9090

    tracing:

      url: http://jaeger-query.istio-system:16686

    grafana:

      url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

# OpenTelemetry Collector for mesh

apiVersion: v1

kind: ConfigMap

metadata:

  name: otel-collector-config

data:

  config.yaml: |

    receivers:

      otlp:

        protocols:

          grpc:

            endpoint: 0.0.0.0:4317

          http:

            endpoint: 0.0.0.0:4318

      zipkin:

        endpoint: 0.0.0.0:9411

    processors:

      batch:

        timeout: 10s

    exporters:

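      # The dedicated `jaeger` exporter was removed from recent Collector

      # releases; Jaeger ingests OTLP directly, so export over OTLP/gRPC.

      # Assumes Jaeger's OTLP receiver is listening on port 4317.

      otlp/jaeger:

        endpoint: jaeger-collector:4317

        tls:

          insecure: true

      prometheus:

        endpoint: 0.0.0.0:8889

    service:

      pipelines:

        traces:

          receivers: [otlp, zipkin]

          processors: [batch]

          exporters: [otlp/jaeger]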

        metrics:

          receivers: [otlp]

          processors: [batch]

          exporters: [prometheus]

---

# Istio Telemetry API: send traces to the "otel" provider, which must be defined under meshConfig.extensionProviders

apiVersion: telemetry.istio.io/v1alpha1

kind: Telemetry

metadata:

  name: mesh-default

  namespace: istio-system

spec:

  tracing:

    - providers:

        - name: otel

      randomSamplingPercentage: 10
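The Telemetry resource above refers to a provider named `otel`, which Istio only recognizes if it is declared in the mesh config. The sketch below shows one way to declare it. The collector's Service name and namespace are assumptions, since the ConfigMap above does not define a Service; adjust them to wherever the collector actually runs.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    extensionProviders:
      - name: otel   # must match the provider name used in the Telemetry resource
        opentelemetry:
          # Assumed Service name and namespace for the collector
          service: otel-collector.istio-system.svc.cluster.local
          port: 4317 # OTLP gRPC receiver port from the collector config
```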

Alerting Rules

apiVersion: monitoring.coreos.com/v1

kind: PrometheusRule

metadata:

  name: mesh-alerts

  namespace: istio-system

spec:

  groups:

    - name: mesh.rules

      rules:

        - alert: HighErrorRate

          expr: |

            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)

            / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) > 0.05

          for: 5m

          labels:

            severity: critical

          annotations:

            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency

          expr: |

            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))

            by (le, destination_service_name)) > 1000

          for: 5m

          labels:

            severity: warning

          annotations:

            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring

          # Assumes cert-manager issues the mesh certificates and exports this metric

          expr: |

            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7

          labels:

            severity: warning

          annotations:

            summary: "Mesh certificate expiring in less than 7 days"

Best Practices

Do's

Sample appropriately - 100% in dev, 1-10% in prod

Use trace context - Propagate headers consistently

Set up alerts - For golden signals

Correlate metrics/traces - Use exemplars

Retain strategically - Hot/cold storage tiers
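The sampling guidance above can be expressed with the Telemetry API by setting a low mesh-wide default and overriding it per namespace. This is a sketch, not a recommended production setting. It assumes the `otel` tracing provider from Template 7 is configured, and `dev` is a hypothetical namespace name. The first resource would take the place of the `mesh-default` resource shown in Template 7.

```yaml
# Mesh-wide default (root namespace): sample 1% of requests
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 1
---
# Namespace-level override: sample every request in development
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: dev-tracing   # hypothetical name
  namespace: dev      # hypothetical namespace
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 100
```

A Telemetry resource in a workload namespace takes precedence over the one in the root namespace, so only `dev` is traced at 100%.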

Don'ts

Don't over-sample - Storage costs add up

Don't ignore cardinality - Limit label values

Don't skip dashboards - Visualize dependencies

Don't forget costs - Monitor observability costs
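For the cardinality point above, one way to limit label values in Istio is to drop labels that dashboards and alerts never use. This is a minimal sketch, assuming Istio's built-in `prometheus` metrics provider. The resource name and the two labels removed are illustrative; check which labels your queries depend on before removing any.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-unused-tags   # hypothetical name
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT      # istio_requests_total
            mode: CLIENT_AND_SERVER
          tagOverrides:
            # Illustrative choices; every label kept multiplies the series count
            source_principal:
              operation: REMOVE
            destination_principal:
              operation: REMOVE
```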
