prometheus-configuration

Complete Prometheus setup guide covering scrape configuration, recording rules, and alerting. Includes Kubernetes and Docker Compose installation methods, with example configurations for static targets, file-based discovery, and Kubernetes service discovery. Provides pre-built recording rules for HTTP metrics (request rates, error rates, latency percentiles) and resource metrics (CPU, memory, disk utilization). Covers alert rule examples for service availability, error rates, latency thresholds, and resource constraints, with severity labeling. Includes validation tools (promtool), troubleshooting queries, and best practices for metric naming, scrape intervals, and high-availability setups.

INSTALLATION
npx skills add https://github.com/wshobson/agents --skill prometheus-configuration
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

When to Use

  • Set up Prometheus monitoring
  • Configure metric scraping
  • Create recording rules
  • Design alert rules
  • Implement service discovery

Prometheus Architecture

┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ AlertManager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)

Installation

Kubernetes with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is sized through the storageSpec volume claim template
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
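
Long --set chains are easy to mistype, so the same settings can be kept in a values file instead. The following is a minimal sketch, assuming a hypothetical file named values.yaml passed to helm install with -f values.yaml; the access mode is an assumption, and the cluster's default storage class is used because none is named.

```yaml
# values.yaml (hypothetical filename)
prometheus:
  prometheusSpec:
    # How long samples are kept before deletion
    retention: 30d
    # Request a persistent volume for the TSDB
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption: single-node access
          resources:
            requests:
              storage: 50Gi
```

Keeping the file in version control makes later upgrades repeatable with the same settings.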

Docker Compose

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v3.2
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:

Configuration File

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key

Reference: See assets/prometheus.yml.template
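
The kubernetes-pods job above only keeps pods that opt in through annotations, and it reads the metrics path and port from the same place. A pod that the job would pick up might look like the sketch below; the pod name, image, and port 8080 are hypothetical, and only the three prometheus.io/* annotation keys come from the relabel rules.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # matched by the "keep" rule
    prometheus.io/path: "/metrics"  # copied into __metrics_path__
    prometheus.io/port: "8080"      # joined with the pod IP into __address__
spec:
  containers:
    - name: app
      image: example/app:1.0   # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be quoted strings; an unquoted true or 8080 is rejected by the Kubernetes API because annotations are string-valued.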

Scrape Configurations

Static Targets

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"

File-based Service Discovery

scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m

targets/production.json:

[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
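
The file_sd_configs globs above also match *.yml files, which use the same list-of-target-groups structure as the JSON form. A sketch of a YAML target file follows; the filename, the app3 host, and the staging label value are hypothetical.

```yaml
# targets/staging.yml (hypothetical file)
- targets:
    - "app3:9090"
  labels:
    env: "staging"
    service: "api"
```

Prometheus watches these files for changes, and refresh_interval acts as a periodic re-read on top of that, so targets can be added or removed without restarting the server.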

Kubernetes Service Discovery

scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Reference: See references/scrape-configs.md

Recording Rules

Create pre-computed metrics for frequently queried expressions:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

Reference: See references/recording-rules.md
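
The P95 rule above extends to other percentiles by changing only the quantile argument. As a sketch, a separate group for median and P99 latency could look like this; the group name and the two record names are hypothetical and simply follow the naming pattern already in use.

```yaml
groups:
  - name: api_latency_percentiles   # hypothetical group name
    interval: 15s
    rules:
      # Median latency
      - record: job:http_request_duration:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Tail latency
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

The le label has to survive the aggregation in every variant, because histogram_quantile needs the bucket boundaries to interpolate.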

Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
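
The severity labels on these alerts only have an effect once Alertmanager routes on them. The following is a minimal alertmanager.yml sketch, assuming two hypothetical receivers named default and pager; notification integrations (email, Slack, PagerDuty, and so on) are left out and would need to be added per receiver.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: default                 # fallback for anything not matched below
  group_by: ["alertname", "job"]
  routes:
    # Critical alerts (ServiceDown, DiskSpaceLow) go to the paging receiver
    - matchers:
        - severity="critical"
      receiver: pager

receivers:
  - name: default
  - name: pager
```

Warnings fall through to the default receiver, which keeps paging reserved for the alerts labeled critical.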

Validation

# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'

Reference: See scripts/validate-prometheus.sh
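
Beyond syntax checks, promtool can also unit-test rule behaviour against synthetic series with promtool test rules followed by a test file. Here is a sketch of such a file for the ServiceDown alert; the test filename and the instance value are hypothetical, and the rule file path assumes the alert rules live at the location shown earlier.

```yaml
# alert_rules_test.yml (hypothetical filename)
rule_files:
  - /etc/prometheus/rules/alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # One target that reports down for five consecutive samples
    input_series:
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # Well past the 1m "for" window, so the alert should be firing
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

A test like this catches template typos and wrong "for" durations before the rules reach a running server.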

Best Practices

  • Use consistent naming for metrics (prefix_name_unit)
  • Set appropriate scrape intervals (15-60s typical)
  • Use recording rules for expensive queries
  • Implement high availability (multiple Prometheus instances)
  • Configure retention based on storage capacity
  • Use relabeling for metric cleanup
  • Monitor Prometheus itself
  • Implement federation for large deployments
  • Use Thanos/Cortex for long-term storage
  • Document custom metrics
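
Two of the practices above, relabeling for metric cleanup and federation, map directly onto scrape configuration. The fragment below is a sketch to merge into an existing scrape_configs section rather than a complete file; the dropped metric pattern and the source-prometheus hostnames are hypothetical.

```yaml
scrape_configs:
  # Metric cleanup: drop series that are never queried, before they are stored
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node1:9100"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"           # hypothetical pattern for unwanted metrics
        action: drop

  # Federation: a global server pulls selected series from other servers
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true              # keep the labels set by the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'    # aggregated recording rules only
    static_configs:
      - targets:
          - "source-prometheus-1:9090"   # hypothetical hosts
          - "source-prometheus-2:9090"
```

Federating only the job: level recording rules keeps the global server small, since raw per-instance series stay on the servers that scraped them.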

Troubleshooting

Check scrape targets:

curl http://localhost:9090/api/v1/targets

Check configuration:

curl http://localhost:9090/api/v1/status/config

Test query:

curl 'http://localhost:9090/api/v1/query?query=up'

Related Skills

  • grafana-dashboards - For visualization
  • slo-implementation - For SLO monitoring
  • distributed-tracing - For request tracing