service-mesh-observability

Comprehensive observability for Istio and Linkerd service meshes with distributed tracing, metrics, and visualization.

  • Covers three observability pillars: metrics (request rate, error rate, latency), traces (span context, dependencies, bottlenecks), and logs (access logs, error details)
  • Includes ready-to-use templates for Prometheus, Grafana, Jaeger, Kiali, and OpenTelemetry integration with Istio and Linkerd
  • Provides a golden signals framework (latency, traffic, errors, saturation) with PromQL queries for P50/P99 latency, error rates, and service topology visualization
  • Features alerting rules for high error rates, latency spikes, and certificate expiration, plus sampling guidance and cardinality management best practices

INSTALLATION
npx skills add https://github.com/wshobson/agents --skill service-mesh-observability
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

When to Use This Skill

  • Setting up distributed tracing across services
  • Implementing service mesh metrics and dashboards
  • Debugging latency and error issues
  • Defining SLOs for service communication
  • Visualizing service dependencies
  • Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

Signal       Description                 Alert Threshold
----------   -------------------------   -----------------
Latency      Request duration P50, P99   P99 > 500ms
Traffic      Requests per second         Anomaly detection
Errors       5xx error rate              > 1%
Saturation   Resource utilization        > 80%
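
The fixed thresholds in the table can be expressed as a small check. This is a minimal Python sketch: the function name `breached_signals` and its parameters are hypothetical, the limits are the table's values, and traffic is left out because the table calls for anomaly detection rather than a fixed limit.

```python
from typing import List

# Limits taken from the golden-signals table.
P99_LATENCY_MS_LIMIT = 500.0
ERROR_RATE_PCT_LIMIT = 1.0
SATURATION_PCT_LIMIT = 80.0


def breached_signals(p99_latency_ms: float,
                     error_rate_pct: float,
                     saturation_pct: float) -> List[str]:
    """Return the names of the golden signals that exceed their limits."""
    breached = []
    if p99_latency_ms > P99_LATENCY_MS_LIMIT:
        breached.append("latency")
    if error_rate_pct > ERROR_RATE_PCT_LIMIT:
        breached.append("errors")
    if saturation_pct > SATURATION_PCT_LIMIT:
        breached.append("saturation")
    return breached
```

In practice these comparisons would live in alerting rules; the sketch only makes the thresholds explicit.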

Templates

Template 1: Istio with Prometheus & Grafana

# Prometheus scrape configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # The istio-telemetry (Mixer) service no longer exists in current
      # Istio releases; scrape the Envoy sidecars for istio_* metrics instead.
      - job_name: 'istio-mesh'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: '.*-envoy-prom'
---
# ServiceMonitor for Prometheus Operator (scrapes the istiod control plane)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s

Template 2: Key Istio Metrics Queries

# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage of all requests
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency (milliseconds)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections opened per second (the metric is a counter, so take its rate)
sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

# P99 request size (bytes)
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
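
The latency and size queries rely on `histogram_quantile`, which estimates a quantile from cumulative buckets rather than from raw samples. The Python sketch below is a simplified illustration of the linear interpolation that Prometheus documents for classic histograms. The function and the bucket values are made up for the example, and it is not a substitute for the real implementation.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on the *_bucket series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies in the overflow bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - prev_count) / in_bucket
```

Because the result is interpolated within a bucket, its accuracy depends on the bucket boundaries: a reported P99 can be no more precise than the width of the bucket it falls into.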

Template 3: Jaeger Distributed Tracing

# Enable tracing in Istio and send spans to Jaeger's Zipkin-compatible endpoint
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector exposes port 9411
          address: jaeger-collector.istio-system:9411
---
# Jaeger all-in-one deployment (in-memory storage, intended for dev/test)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # Legacy Zipkin Thrift (UDP)
              protocol: UDP
            - containerPort: 6831 # Jaeger Thrift compact (UDP)
              protocol: UDP
            - containerPort: 6832 # Jaeger Thrift binary (UDP)
              protocol: UDP
            - containerPort: 5778 # Config / sampling strategies (HTTP)
            - containerPort: 16686 # Query UI
            - containerPort: 14268 # Collector HTTP
            - containerPort: 14250 # Collector gRPC
            - containerPort: 9411 # Zipkin-compatible collector
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
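
The sidecars emit spans on their own, but spans from different hops only join into one trace if each application copies the trace headers from its inbound request onto its outbound calls. The Python sketch below shows one way to do that. The helper name `forward_trace_headers` is hypothetical, and the header list (B3 plus W3C Trace Context) should be checked against the documentation for your Istio version.

```python
from typing import Dict, Mapping

# Headers to copy from inbound to outbound requests (B3 and W3C Trace Context).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def forward_trace_headers(inbound: Mapping[str, str]) -> Dict[str, str]:
    """Return only the trace headers present on the inbound request."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in inbound.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

The returned dictionary can be merged into the headers of any outbound HTTP call the service makes while handling that request.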

Template 4: Linkerd Viz Dashboard

# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability

# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}

Template 6: Kiali Service Mesh Visualization

# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      # The dedicated jaeger exporter has been removed from current Collector
      # releases; Jaeger ingests OTLP directly, so export over OTLP/gRPC.
      # Assumes Jaeger's OTLP gRPC receiver listens on port 4317.
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry API: trace through the OTel provider.
# Assumes an extension provider named "otel" is defined in meshConfig.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
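
Before deploying a rule, it can help to run its expression (without the final comparison) as an instant query and see which services would cross the threshold. The Python sketch below builds a URL for the Prometheus `/api/v1/query` endpoint and filters a response body. The base URL is the address used in the Kiali template and is an assumption about your cluster, and both function names are hypothetical.

```python
import json
from typing import Dict
from urllib.parse import urlencode

PROMETHEUS_URL = "http://prometheus.istio-system:9090"  # assumed address


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def services_over_threshold(response_body: str, threshold: float) -> Dict[str, float]:
    """Map service name to value for every returned series above the threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    over = {}
    for series in payload["data"]["result"]:
        # Instant-vector samples arrive as [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value > threshold:
            name = series["metric"].get("destination_service_name", "unknown")
            over[name] = value
    return over
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `services_over_threshold` with `0.05` previews what `HighErrorRate` would select, ignoring its `for: 5m` hold period.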

Best Practices

Do's

  • Sample appropriately - 100% in dev, 1-10% in prod
  • Use trace context - Propagate headers consistently
  • Set up alerts - For golden signals
  • Correlate metrics/traces - Use exemplars
  • Retain strategically - Hot/cold storage tiers
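
To see why the sampling rate matters, it helps to estimate how many spans a given rate produces. The arithmetic is sketched below in Python. The function name and the traffic figures in the usage note are illustrative assumptions, not measurements.

```python
SECONDS_PER_DAY = 86_400


def sampled_spans_per_day(requests_per_second: float,
                          spans_per_request: float,
                          sampling_pct: float) -> int:
    """Estimate how many spans reach the tracing backend per day."""
    if not 0.0 <= sampling_pct <= 100.0:
        raise ValueError("sampling_pct must be between 0 and 100")
    total = requests_per_second * spans_per_request * SECONDS_PER_DAY
    return round(total * sampling_pct / 100.0)
```

For a hypothetical mesh handling 1,000 requests per second with 5 spans per request, 100% sampling yields 432,000,000 spans per day, while 1% yields 4,320,000.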

Don'ts

  • Don't over-sample - Storage costs add up
  • Don't ignore cardinality - Limit label values
  • Don't skip dashboards - Visualize dependencies
  • Don't forget costs - Monitor observability costs
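
Cardinality grows multiplicatively: the worst-case number of time series for one metric is the product of the distinct values of each label. The Python sketch below makes that estimate. The function name and the label counts are hypothetical.

```python
from math import prod
from typing import Mapping


def estimated_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric, given distinct values per label."""
    return prod(label_cardinalities.values())


base = {"source_workload": 20, "destination_service_name": 20, "response_code": 5}
with_path = {**base, "request_path": 500}
```

Here `estimated_series(base)` is 2,000, while adding a 500-value `request_path` label raises the bound to 1,000,000, which is why unbounded label values such as paths or user IDs are best kept out of metric labels.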