SKILL.md

Prometheus Configuration

Guide to Prometheus setup, metric collection, scrape configuration, recording rules, and alert rules.

Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

When to Use

Set up Prometheus monitoring

Configure metric scraping

Create recording rules

Design alert rules

Implement service discovery

Prometheus Architecture

┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)

Installation

Kubernetes with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
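The same intent (30 days of retention, 50Gi of persistent storage) can also be kept in a values file instead of `--set` flags, which is easier to review and version. The following is a minimal sketch, assuming a hypothetical file named `values.yaml` and the `storageSpec.volumeClaimTemplate` layout used by the kube-prometheus-stack chart; confirm the keys against `helm show values prometheus-community/kube-prometheus-stack` for your chart version.

```yaml
# values.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    # How long samples are kept before the TSDB deletes them
    retention: 30d
    # Persistent volume claim requested for the Prometheus data directory
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```

It would be passed to the install command with `-f values.yaml` in place of the two `--set` lines.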

Docker Compose

version: "3.8"

services:

  prometheus:

    image: prom/prometheus:v3.2

    ports:

      - "9090:9090"

    volumes:

      - ./prometheus.yml:/etc/prometheus/prometheus.yml

      # Assumes rule files live in ./rules next to this compose file
      - ./rules:/etc/prometheus/rules:ro

      - prometheus-data:/prometheus

    command:

      - "--config.file=/etc/prometheus/prometheus.yml"

      - "--storage.tsdb.path=/prometheus"

      - "--storage.tsdb.retention.time=30d"

volumes:

  prometheus-data:

Configuration File

prometheus.yml:

global:

  scrape_interval: 15s

  evaluation_interval: 15s

  external_labels:

    cluster: "production"

    region: "us-west-2"

# Alertmanager configuration

alerting:

  alertmanagers:

    - static_configs:

        - targets:

            - alertmanager:9093

# Load rules files

rule_files:

  - /etc/prometheus/rules/*.yml

# Scrape configurations

scrape_configs:

  # Prometheus itself

  - job_name: "prometheus"

    static_configs:

      - targets: ["localhost:9090"]

  # Node exporters

  - job_name: "node-exporter"

    static_configs:

      - targets:

          - "node1:9100"

          - "node2:9100"

          - "node3:9100"

    relabel_configs:

      - source_labels: [__address__]

        target_label: instance

        regex: "([^:]+)(:[0-9]+)?"

        replacement: "${1}"

  # Kubernetes pods with annotations

  - job_name: "kubernetes-pods"

    kubernetes_sd_configs:

      - role: pod

    relabel_configs:

      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]

        action: keep

        regex: true

      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]

        action: replace

        target_label: __metrics_path__

        regex: (.+)

      - source_labels:

          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]

        action: replace

        regex: ([^:]+)(?::\d+)?;(\d+)

        replacement: $1:$2

        target_label: __address__

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

  # Application metrics

  - job_name: "my-app"

    static_configs:

      - targets:

          - "app1.example.com:9090"

          - "app2.example.com:9090"

    metrics_path: "/metrics"

    scheme: "https"

    tls_config:

      ca_file: /etc/prometheus/ca.crt

      cert_file: /etc/prometheus/client.crt

      key_file: /etc/prometheus/client.key

Reference: See assets/prometheus.yml.template
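The `kubernetes-pods` job above keeps only pods whose `prometheus.io/scrape` annotation is `true`, and reads the metrics path and port from the sibling annotations. Below is a minimal sketch of a pod that would match, assuming a hypothetical application named `demo-app` that serves metrics on port 8080. The name, image, and port are illustrative and do not come from the original configuration.

```yaml
# Hypothetical pod; only the three prometheus.io/* annotations matter to the
# relabel rules in the kubernetes-pods job.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                        # assumption: illustrative name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"        # matched by the `keep` rule
    prometheus.io/path: "/metrics"      # copied into __metrics_path__
    prometheus.io/port: "8080"          # joined with the pod IP into __address__
spec:
  containers:
    - name: demo-app
      image: example.com/demo-app:1.0   # assumption: placeholder image
      ports:
        - containerPort: 8080
```

Annotation values must be quoted strings. Prometheus exposes them as `__meta_kubernetes_pod_annotation_prometheus_io_scrape` and so on, with `.` and `/` in the annotation name replaced by `_`.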

Scrape Configurations

Static Targets

scrape_configs:

  - job_name: "static-targets"

    static_configs:

      - targets: ["host1:9100", "host2:9100"]

        labels:

          env: "production"

          region: "us-west-2"

File-based Service Discovery

scrape_configs:

  - job_name: "file-sd"

    file_sd_configs:

      - files:

          - /etc/prometheus/targets/*.json

          - /etc/prometheus/targets/*.yml

        refresh_interval: 5m

targets/production.json:

[

  {

    "targets": ["app1:9090", "app2:9090"],

    "labels": {

      "env": "production",

      "service": "api"

    }

  }

]
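The `file-sd` job also globs `*.yml`, so target groups can be written in YAML with the same structure as the JSON file. Here is a minimal sketch, assuming a hypothetical file `targets/staging.yml` and made-up hosts `app3` and `app4`:

```yaml
# targets/staging.yml (hypothetical file and hosts)
# A list of target groups; each group shares one set of labels.
- targets:
    - "app3:9090"
    - "app4:9090"
  labels:
    env: "staging"
    service: "api"
```

Prometheus watches the matched files and applies changes without a restart. `refresh_interval` only sets how often the files are re-read as a fallback.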

Kubernetes Service Discovery

scrape_configs:

  - job_name: "kubernetes-services"

    kubernetes_sd_configs:

      - role: service

    relabel_configs:

      - source_labels:

          [__meta_kubernetes_service_annotation_prometheus_io_scrape]

        action: keep

        regex: true

      - source_labels:

          [__meta_kubernetes_service_annotation_prometheus_io_scheme]

        action: replace

        target_label: __scheme__

        regex: (https?)

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]

        action: replace

        target_label: __metrics_path__

        regex: (.+)

Reference: See references/scrape-configs.md

Recording Rules

Create pre-computed metrics for frequently queried expressions:

# /etc/prometheus/rules/recording_rules.yml

groups:

  - name: api_metrics

    interval: 15s

    rules:

      # HTTP request rate per service

      - record: job:http_requests:rate5m

        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage

      - record: job:http_requests_errors:rate5m

        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage

        expr: |

          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency

      - record: job:http_request_duration:p95

        expr: |

          histogram_quantile(0.95,

            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))

          )

  - name: resource_metrics

    interval: 30s

    rules:

      # CPU utilization percentage

      - record: instance:node_cpu:utilization

        expr: |

          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage

      - record: instance:node_memory:utilization

        expr: |

          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage

      - record: instance:node_disk:utilization

        expr: |

          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

Reference: See references/recording-rules.md

Alert Rules

# /etc/prometheus/rules/alert_rules.yml

groups:

  - name: availability

    interval: 30s

    rules:

      - alert: ServiceDown

        expr: up{job="my-app"} == 0

        for: 1m

        labels:

          severity: critical

        annotations:

          summary: "Service {{ $labels.instance }} is down"

          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate

        expr: job:http_requests_error_rate:percentage > 5

        for: 5m

        labels:

          severity: warning

        annotations:

          summary: "High error rate for {{ $labels.job }}"

          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency

        expr: job:http_request_duration:p95 > 1

        for: 5m

        labels:

          severity: warning

        annotations:

          summary: "High latency for {{ $labels.job }}"

          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources

    interval: 1m

    rules:

      - alert: HighCPUUsage

        expr: instance:node_cpu:utilization > 80

        for: 5m

        labels:

          severity: warning

        annotations:

          summary: "High CPU usage on {{ $labels.instance }}"

          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage

        expr: instance:node_memory:utilization > 85

        for: 5m

        labels:

          severity: warning

        annotations:

          summary: "High memory usage on {{ $labels.instance }}"

          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow

        expr: instance:node_disk:utilization > 90

        for: 5m

        labels:

          severity: critical

        annotations:

          summary: "Low disk space on {{ $labels.instance }}"

          description: "Disk usage is {{ $value }}%"
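Alert rules can be exercised offline with promtool's rule unit tests before they reach a running server. The sketch below checks that `ServiceDown` fires once `up` has been 0 for longer than the `for: 1m` window. It assumes a hypothetical test file named `alert_rules_test.yml` stored next to `alert_rules.yml`; the input series and evaluation time are illustrative.

```yaml
# alert_rules_test.yml (hypothetical file name)
rule_files:
  - alert_rules.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # 11 samples of 0, one every 15s: the target is down from t=0 to t=150s
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: '0+0x10'
    alert_rule_test:
      # At 2m the alert has been pending for more than 1m, so it should fire
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: "app1.example.com:9090"
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

It would be run with `promtool test rules alert_rules_test.yml`. Expected labels and annotations must match the rendered alert exactly, so the test also catches template typos.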

Validation

# Validate configuration

promtool check config prometheus.yml

# Validate rules

promtool check rules /etc/prometheus/rules/*.yml

# Test query

promtool query instant http://localhost:9090 'up'

Reference: See scripts/validate-prometheus.sh

Best Practices

Use consistent naming for metrics (prefix_name_unit, for example http_request_duration_seconds)

Set appropriate scrape intervals (15-60s typical)

Use recording rules for expensive queries

Implement high availability (multiple Prometheus instances)

Configure retention based on storage capacity

Use relabeling for metric cleanup

Monitor Prometheus itself

Implement federation for large deployments

Use Thanos/Cortex for long-term storage

Document custom metrics
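Two of the practices above, relabeling for metric cleanup and federation, map directly onto scrape configuration. The following is a minimal sketch under stated assumptions: the dropped metric pattern, the dropped label name, and the host `prometheus-cluster-a` are hypothetical, and the `my-app` job is trimmed (TLS settings omitted) to keep the fragment short.

```yaml
scrape_configs:
  # Metric cleanup: metric_relabel_configs run after the scrape and before
  # samples are stored.
  - job_name: "my-app"
    static_configs:
      - targets: ["app1.example.com:9090"]
    metric_relabel_configs:
      # Drop a whole metric family by name (pattern is illustrative)
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
      # Remove one label from every series (label name is illustrative)
      - regex: "pod_template_hash"
        action: labeldrop

  # Federation: a global server pulls only pre-aggregated series, here the
  # job-level recording rules, from a per-cluster server.
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-cluster-a:9090"]   # assumption: hypothetical host
```

Federating only recording-rule output keeps the global server small. `honor_labels: true` preserves the `job` and `instance` labels set by the source server.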

Troubleshooting

Check scrape targets:

curl http://localhost:9090/api/v1/targets

Check configuration:

curl http://localhost:9090/api/v1/status/config

Test query:

curl 'http://localhost:9090/api/v1/query?query=up'

Related Skills

grafana-dashboards - For visualization

slo-implementation - For SLO monitoring

distributed-tracing - For request tracing
