SKILL.md
$28
A trace represents the lifecycle of a request as it passes through multiple services. It consists of:
- Spans: Individual units of work with start time, duration, attributes, and status
- Trace ID: Shared identifier across all spans in a request
- Parent-child relationships: Spans form a tree showing causality
Traces enable:
- Root cause analysis for service outages
- Understanding service dependencies
- Identifying latency bottlenecks
- Correlating events across microservices
Architecture Overview
Applications
|
| (OTLP 4317/4318, Jaeger 14250/14268, Zipkin 9411)
v
[Distributor] ---- hashes traceID, routes to N partitions
|
[Kafka]
|---> [Live Stores] (storage of recent data)
|
|---> [Block Builders] (Parquet block assembly, flush to object storage)
|
|---> [Metrics Generator] (optional: derives RED metrics -> Prometheus)
Query path:
Grafana --> [Query Frontend] (shards queries)
|
[Querier pool]
/ \
[Live Stores] [Object Storage]
(recent) (historical blocks)
Core Components
Component
Role
Default Ports
Distributor
Receives spans, routes by traceID hash
4317 (gRPC), 4318 (HTTP)
Live Store
Buffers recent data on local disk and serves queries
-
Query Frontend
Query orchestrator, shards across queriers
3200 (HTTP)
Querier
Executes search jobs against storage
-
Compactor
Merges blocks, enforces retention
-
Block Builder
Creates the final parquet blocks and flushes to object storage
-
Metrics Generator
Derives RED metrics from spans
-
TraceQL - The Query Language
TraceQL queries filter traces by span properties. Structure: { filters } | pipeline
Attribute Scopes
span.http.status_code # span-level attribute
resource.service.name # resource-level attribute (from SDK)
event.name # event-level attribute
name # intrinsic: span operation name
status # intrinsic: ok | error | unset
duration # intrinsic: span duration
kind # intrinsic: server | client | producer | consumer | internal
traceDuration # intrinsic: entire trace duration
rootServiceName # intrinsic: service of the root span
rootName # intrinsic: operation name of the root span
Operators
= != > < >= <= # comparison
=~ !~ # regex match (Go RE2)
&& || ! # logical
Essential Examples
# All errors
{ status = error }
# Slow requests from a service
{ resource.service.name = "frontend" && duration > 1s }
# HTTP 5xx errors
{ span.http.status_code >= 500 }
# Count errors per trace (more than 2)
{ status = error } | count() >= 2
# Select specific fields
{ status = error } | select(span.http.url, duration, resource.service.name)
# Structural: server span with downstream error
{ kind = server } >> { status = error }
# Both conditions present (any relationship)
{ span.db.system = "redis" } && { span.db.system = "postgresql" }
# Find most recent (deterministic)
{ resource.service.name = "api" } with (most_recent=true)
TraceQL Metrics
# Error rate per service
{ status = error } | rate() by (resource.service.name)
# P99 latency
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name)
Deployment
Quick Start (Docker Compose)
git clone https://github.com/grafana/tempo.git
cd tempo/example/docker-compose/local
mkdir tempo-data
docker compose up -d
# Grafana at http://localhost:3000, Tempo API at http://localhost:3200
Kubernetes (Helm)
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo-distributed \
--set storage.trace.backend=s3 \
--set storage.trace.s3.bucket=my-tempo-bucket \
--set storage.trace.s3.region=us-east-1
Sending Traces to Tempo
Via Grafana Alloy (Recommended)
// alloy.river
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo:4317"
tls { insecure = true }
}
}
Via OpenTelemetry Collector
exporters:
otlp:
endpoint: tempo:4317
tls:
insecure: true
# For multi-tenancy:
headers:
x-scope-orgid: my-tenant
service:
pipelines:
traces:
receivers: [otlp]
exporters: [otlp]
Direct HTTP (OTLP)
curl -X POST -H 'Content-Type: application/json' \
http://localhost:4318/v1/traces \
-d '{"resourceSpans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-service"}}]}, "scopeSpans": [{"spans": [{"traceId": "5B8EFFF798038103D269B633813FC700", "spanId": "EEE19B7EC3C1B100", "name": "my-op", "startTimeUnixNano": 1689969302000000000, "endTimeUnixNano": 1689969302500000000, "kind": 2}]}]}]}'
Metrics from Traces
Enable Metrics Generator
metrics_generator:
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus:9090/api/v1/write
send_exemplars: true
overrides:
defaults:
metrics_generator:
processors: [service-graphs, span-metrics]
Processor Types
Service Graphs: Visualizes service topology and latency
- Output:
traces_service_graph_request_total,traces_service_graph_request_failed_total, duration histograms
Span Metrics: RED metrics per span
- Output:
traces_spanmetrics_calls_total,traces_spanmetrics_duration_seconds_*
- Labels: service, span_name, span_kind, status_code + custom dimensions
Local Blocks: Enables TraceQL metrics queries on recent data
Multi-Tenancy
# Enable in Tempo config
multitenancy_enabled: true
All requests require X-Scope-OrgID header.
# OpenTelemetry Collector
exporters:
otlp:
headers:
x-scope-orgid: tenant-id
# Grafana datasource
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "tenant-id"
Grafana Integration
Data Source Configuration
datasources:
- name: Tempo
type: tempo
url: http://tempo:3200
jsonData:
# Link traces to logs
tracesToLogsV2:
datasourceUid: loki-uid
filterByTraceID: true
tags: [{key: "service.name", value: "app"}]
# Link traces to metrics
tracesToMetrics:
datasourceUid: prometheus-uid
tags: [{key: "service.name", value: "service"}]
queries:
- name: Error Rate
query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code="STATUS_CODE_ERROR"}[5m]))'
# Link traces to profiles (Pyroscope)
tracesToProfiles:
datasourceUid: pyroscope-uid
tags: [{key: "service.name", value: "service_name"}]
# Service map from span metrics
serviceMap:
datasourceUid: prometheus-uid
Key Grafana Features
- Explore > Tempo: Search by TraceQL, trace ID, or tag filters
- Service Graph tab: Visual service topology with RED metrics
- Traces Drilldown:
/a/grafana-exploretraces-app- no TraceQL required
- Exemplars: Click metric spike -> jump directly to responsible trace
- Derived fields in Loki: Click trace ID in log -> jump to trace in Tempo
API Quick Reference
# Search traces
GET /api/search?q={status=error}&limit=20&start=<unix>&end=<unix>
# Get trace by ID
GET /api/traces/<traceID>
GET /api/v2/traces/<traceID>
# List all tag names
GET /api/search/tags
# Get values for a tag
GET /api/search/tag/service.name/values
# TraceQL metrics (time series)
GET /api/metrics/query_range?q={status=error}|rate()&start=...&end=...&step=60
# Health check
GET /ready
Performance Tuning Summary
Problem
Solution
Slow searches
Scale queriers horizontally; scale compactors to reduce block count
High memory on queriers
Reduce max_concurrent_queries; lower target_bytes_per_job
High memory on ingesters
Reduce max_block_bytes; lower per-tenant trace limits
Slow attribute queries
Add dedicated Parquet columns for frequent attributes
Cache miss rate high
Increase cache size; tune cache_min_compaction_level
Rate limited (429)
Raise max_outstanding_per_tenant or increase per-tenant ingestion limits
Memcached connection errors
Increase memcached connection limit (-c 4096)
Best Practices
Instrumentation
- Follow OpenTelemetry semantic conventions for attribute names
- Use
span.prefix for span attributes,resource.for process context
- Keep attributes meaningful - avoid metrics/logs as span attributes
- Limit attributes to max ~128 per span (OTel default)
- Use span linking for batch processing (instead of huge fan-out traces)
- Create spans for: external calls, significant loops, operations with variable latency
- Avoid creating spans for every function call
Deployment
- Use replication factor 3 for production HA
- Object storage required for distributed deployments (not local)
- Enable dedicated attribute columns for your most-queried attributes
- Set appropriate block retention per tenant via overrides
- Monitor
tempo_ingester_live_tracesto detect memory pressure early
Querying
- Use time bounds (
start/end) to limit search scope
- Use structural operators for root cause analysis patterns
- Prefer
attribute != nilfor existence checks
- Use
with (most_recent=true)when you need deterministic recent results
- Scope tag discovery with a TraceQL query to reduce noise
Ports Reference
Port
Protocol
Purpose
3200
HTTP
Tempo API (queries, search, health)
9095
gRPC
Internal component communication
4317
gRPC
OTLP trace ingestion
4318
HTTP
OTLP trace ingestion
14268
HTTP
Jaeger Thrift HTTP ingestion
14250
gRPC
Jaeger gRPC ingestion
6831
UDP
Jaeger Thrift Compact
6832
UDP
Jaeger Thrift Binary
9411
HTTP
Zipkin ingestion
7946
TCP/UDP
Memberlist gossip