SKILL.md
Metrics with Prometheus and Grafana
Value Proposition
Prometheus is an open-source monitoring and alerting toolkit for cloud-native environments. Combined with
Grafana Cloud Metrics (powered by Grafana Mimir), it provides a fully managed Prometheus-compatible service
with long-term storage, global query performance, and enterprise scalability.
Key Differentiators: Pull-based model, dimensional data model with labels, PromQL, automatic service
discovery, scales to billions of active series.
PromQL Quick Reference
Instant Vector Selectors
# By metric name
http_requests_total
Label filter
http_requests_total{job="api-server"}
Multiple labels (AND)
http_requests_total{job="api-server", method="GET"}
Regex
http_requests_total{job="api.*", status="5.."}
Negative
http_requests_total{status!="200"}
### Range Vectors & Rates
Per-second rate over 5 minutes
rate(http_requests_total[5m])
Increase over interval
increase(http_requests_total[1h])
Instant rate (last two samples)
irate(http_requests_total[5m])
Offset (5 minutes ago)
rate(http_requests_total[5m] offset 5m)
### Aggregations
Sum by label
sum by (job) (rate(http_requests_total[5m]))
Average
avg by (instance) (node_cpu_seconds_total)
Top-K
topk(5, rate(http_requests_total[5m]))
Histogram quantiles
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Count distinct
count(up{job="api"})
### Common Patterns
Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
Saturation (CPU usage %)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
Predict disk full (linear extrapolation)
predict_linear(node_filesystem_free_bytes[6h], 24*3600) < 0
## Metrics Drilldown
Queryless Prometheus metrics exploration (preinstalled in Grafana 12+):
- Browse metrics without writing PromQL
- Smart segmentation and anomaly detection
- Auto-visualization with optimal chart types
- Metric relationship discovery
- Telemetry pivoting (metrics to logs)
## Alerting
### Prometheus Alertmanager
Route, group, silence, and deduplicate alerts. Multi-destination routing (PagerDuty, Slack, Email, webhooks).
### Grafana Alerting
Unified alerting across all data sources. Supports multi-dimensional alerts, notification policies,
and contact points.
### Recording Rules
Pre-compute expensive PromQL queries for dashboard performance:
groups:
- name: api_rules
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))