SKILL.md
3. Map Service Dependencies
- Goal: Understand service-to-service communication patterns and external API calls
- Trigger: "service dependencies", "what services does X call", "outgoing HTTP calls"
- Done: Dependency map showing call counts, latency, and error rates between services
Core Concepts
Understanding Traces and Spans
Spans represent logical units of work in distributed traces:
- HTTP requests, RPC calls, database operations
- Messaging system interactions
- Internal function invocations
- Custom instrumentation points
Span kinds:
- span.kind: server - Incoming call to a service
- span.kind: client - Outgoing call from a service
- span.kind: consumer - Incoming message consumption call to a service
- span.kind: producer - Outgoing message production call from a service
- span.kind: internal - Internal operation within a service
Root spans: A request root span (request.is_root_span == true) represents an incoming call to a service. Use this to analyze end-to-end request performance.
Key Trace Attributes
Essential attributes for trace analysis:
| Attribute | Description |
| --- | --- |
| trace.id | Unique trace identifier |
| span.id | Unique span identifier |
| span.parent_id | Parent span ID (null for root spans) |
| request.is_root_span | Boolean, true for request entry points |
| request.is_failed | Boolean, true if request failed |
| duration | Span duration in nanoseconds |
| span.timing.cpu | Overall CPU time of the span (stable) |
| span.timing.cpu_self | CPU time excluding child spans (stable) |
| dt.smartscape.service | Service Smartscape node ID |
| dt.service.name | Dynatrace service name derived from service detection rules. It is equal to the Smartscape service node name. |
| endpoint.name | Endpoint/route name |
Service Context
Spans reference services via the Smartscape node ID (dt.smartscape.service) and the detected service name (dt.service.name). Both fields are present on every span.
fetch spans
| summarize spans=count(), by: { dt.smartscape.service, dt.service.name }
Node functions:
- getNodeName(dt.smartscape.service) - Adds a dt.smartscape.service.name field with the human-readable service name
- getNodeField(dt.smartscape.service, "attribute_name") - Accesses a specific node attribute
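As a minimal sketch of the node functions above, the query below resolves the service name from the Smartscape node ID and groups request roots by it. It relies on the behavior described in this section (getNodeName adding a dt.smartscape.service.name field); the one-hour timeframe is an arbitrary choice for illustration.

```dql
fetch spans, from:now() - 1h
| filter request.is_root_span == true
// Adds dt.smartscape.service.name next to the node ID
| fieldsAdd getNodeName(dt.smartscape.service)
| summarize requests = count(), by: { dt.smartscape.service, dt.smartscape.service.name }
| sort requests desc
```

Since dt.service.name is already present on every span, this lookup is mainly useful as a cross-check or as a starting point for getNodeField lookups.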
Learn more: See Entity Lookups for advanced entity selectors, infrastructure correlation, and hardware analysis.
Sampling and Extrapolation
One span can represent multiple real operations due to:
- Aggregation: Multiple operations in one span (aggregation.count)
- ATM (Adaptive Traffic Management): Head-based sampling by agent
- ALR (Adaptive Load Reduction): Server-side sampling
- Read Sampling: Query-time sampling via the samplingRatio parameter
When to extrapolate: Always extrapolate when counting actual operations (not just spans). Use the multiplicity factor:
fetch spans
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
    * coalesce(aggregation.count, 1)
    * dt.system.sampling_ratio
| summarize operation_count = sum(multiplicity)
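The sketch below combines the same multiplicity formula with query-time read sampling, to show how the raw span count and the extrapolated operation count can be compared per service. The samplingRatio value of 100, the one-hour timeframe, and the grouping by dt.service.name are illustrative assumptions, not recommendations from the source.

```dql
// Read roughly 1% of the data; dt.system.sampling_ratio compensates for this below
fetch spans, from:now() - 1h, samplingRatio: 100
| filter request.is_root_span == true
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
    * coalesce(aggregation.count, 1)
    * dt.system.sampling_ratio
| summarize {
    span_count = count(),
    operation_count = sum(multiplicity)
  }, by: { dt.service.name }
| sort operation_count desc
```

A large gap between span_count and operation_count suggests that sampling or aggregation is significant for that service, and that raw counts would understate its real volume.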
Learn more: See Sampling and Extrapolation for detailed formulas and examples.
Common Query Patterns
Basic Span Access
Fetch a single span to inspect the available fields:
fetch spans | limit 1
Explore spans by function and type:
fetch spans
| summarize count(), by: { span.kind, code.namespace, code.function }
Request Root Filtering
List request root spans (incoming service calls):
fetch spans
| filter request.is_root_span == true
| fields trace.id, span.id, start_time, response_time = duration, endpoint.name
| limit 100
Service Performance Summary
Analyze service performance with error rates:
fetch spans
| filter request.is_root_span == true
| summarize
    total_requests = count(),
    failed_requests = countIf(request.is_failed == true),
    avg_duration = avg(duration),
    p95_duration = percentile(duration, 95),
    by: {dt.service.name}
| fieldsAdd error_rate = (failed_requests * 100.0) / total_requests
| sort error_rate desc
Trace ID Lookup
Find all spans in a specific trace:
fetch spans
| filter trace.id == toUid("abc123def456")
| fields span.name, duration, dt.service.name
Performance Analysis
Response Time Percentiles
Calculate percentiles by endpoint:
fetch spans
| filter request.is_root_span == true
| summarize {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
| sort p99 desc
💡 Best practice: Use percentiles (p95, p99) over averages for performance insights.
Slow Trace Detection
Find requests exceeding a threshold:
fetch spans, from:now() - 2h
| filter request.is_root_span == true
| filter duration > 5s
| fields trace.id, span.name, dt.service.name, duration
| sort duration desc
| limit 50
Duration Buckets with Exemplars
fetch spans, from:now() - 24h
| filter http.route == "/api/v1/storage/findByISBN"
| summarize {
    spans=count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { bin(duration, 10ms) }
| fields `bin(duration, 10ms)`, spans, trace.id=trace[trace.id], start_time=trace[start_time]
Performance Timeseries
Extract response time as timeseries:
fetch spans, from:now() - 24h
| filter request.is_root_span == true
| makeTimeseries {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
Learn more: See Performance Analysis for advanced patterns and timeseries techniques.
Failure Investigation
Failed Request Summary
Summarize failures by service:
fetch spans
| filter request.is_root_span == true
| summarize
    total = count(),
    failed = countIf(request.is_failed == true),
    by: { dt.service.name }
| fieldsAdd failure_rate = (failed * 100.0) / total
| sort failure_rate desc
Failure Reason Analysis
Breakdown by failure detection reason:
fetch spans
| filter request.is_failed == true and isNotNull(dt.failure_detection.results)
| expand dt.failure_detection.results
| summarize count(), by: { dt.failure_detection.results[reason] }
Failure reasons:
- http_code - HTTP response code triggered failure
- grpc_code - gRPC status code triggered failure
- exception - Exception caused failure
- span_status - Span status indicated failure
- custom_rule - Custom failure detection rule matched
HTTP Code Failures
Find failures by HTTP status code:
fetch spans
| filter request.is_failed == true
| filter iAny(dt.failure_detection.results[][reason] == "http_code")
| summarize count(), by: { http.response.status_code, endpoint.name }
| sort `count()` desc
Recent Failed Requests
List recent failures with details:
fetch spans
| filter request.is_root_span == true and request.is_failed == true
| fields
    start_time,
    trace.id,
    endpoint.name,
    http.response.status_code,
    duration
| sort start_time desc
| limit 100
Learn more: See Failure Detection for exception analysis and custom rule investigation.
Service Dependencies
Service-to-Service Analysis
Analyze service communication patterns:
fetch spans, from:now() - 1h
| filter isNotNull(server.address)
| fieldsAdd
    remote_side = server.address
| summarize
    call_count = count(),
    avg_duration = avg(duration),
    by: {dt.service.name, remote_side}
| sort call_count desc
Outgoing HTTP Calls
Identify external API dependencies:
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
    calls = count(),
    avg_latency = avg(duration),
    p99_latency = percentile(duration, 99),
    by: { dt.service.name, server.address, server.port }
| sort calls desc
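The dependency map described in the goal also calls for error rates, which the two queries above do not compute. The sketch below adds one for outgoing HTTP calls. Treating responses with status 500 or higher as failed calls is an assumption made here for illustration; it is not a failure definition taken from the source, and non-HTTP dependencies would need a different signal.

```dql
fetch spans, from:now() - 1h
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize {
    calls = count(),
    // Assumption: HTTP 5xx responses approximate failed outgoing calls
    failed_calls = countIf(http.response.status_code >= 500),
    avg_latency = avg(duration),
    p99_latency = percentile(duration, 99)
  }, by: { dt.service.name, server.address }
| fieldsAdd error_rate = (failed_calls * 100.0) / calls
| sort calls desc
```

Each result row is one edge of the map: the calling service, the remote address, and the call count, latency, and approximate error rate for that pair.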
Trace Aggregation
Complete Trace Analysis
Aggregate all spans in a trace to understand full request flow:
fetch spans, from:now() - 30m
| summarize {
    spans = count(),
    client_spans = countIf(span.kind == "client"),
    // Endpoints involved in the trace
    endpoints = toString(arrayRemoveNulls(collectDistinct(endpoint.name))),
    // Extract the first request root in the trace
    trace_root = takeMin(record(
      root_detection_helper = coalesce(
        if(request.is_root_span, 1),
        if(isNull(span.parent_id), 2),
        3),
      start_time, endpoint.name, duration
    ))
  }, by: { trace.id }
| fieldsFlatten trace_root
| fieldsRemove trace_root.root_detection_helper, trace_root
| fields
    start_time = trace_root.start_time,
    endpoint = trace_root.endpoint.name,
    response_time = trace_root.duration,
    spans,
    client_spans,
    endpoints,
    trace.id
| sort start_time
| limit 100
Root detection strategy: Use takeMin(record(...)) with a detection helper to reliably find the root request. The helper assigns a priority, and takeMin picks the span with the lowest value:
- Priority 1: Spans with request.is_root_span == true
- Priority 2: Spans without a parent (root spans)
- Priority 3: All other spans
Multi-Service Traces
Find traces spanning multiple services:
fetch spans, from:now() - 1h
| summarize {
    services = collectDistinct(dt.service.name),
    trace_root = takeMin(record(root_detection_helper = coalesce(if(request.is_root_span, 1), 2), endpoint.name))
  }, by: { trace.id }
| fieldsAdd service_count = arraySize(services)
| filter service_count > 1
| fields endpoint = trace_root[endpoint.name], service_count, services = toString(services), trace.id
| sort service_count desc
| limit 50
Request-Level Analysis
Request Attributes
Access custom request attributes captured by OneAgent on request root spans:
fetch spans
| filter request.is_root_span == true
| filter isNotNull(request_attribute.PaidAmount)
| makeTimeseries sum(request_attribute.PaidAmount)
Field patterns: request_attribute.<name>, captured_attribute.<name> (always arrays)
→ Request Attributes - full patterns for request attributes, captured attributes, and request ID aggregation
Span Types
| Span Type | Detection | Key Fields | Reference |
| --- | --- | --- | --- |
| HTTP server (incoming) | span.kind == "server" and isNotNull(http.request.method) | http.route, http.request.method, http.response.status_code | HTTP Span Analysis |
| HTTP client (outgoing) | span.kind == "client" and isNotNull(http.request.method) | server.address, server.port | HTTP Span Analysis |
| Database | span.kind == "client" and isNotNull(db.system) | db.system, db.namespace, db.statement | Database Span Analysis |
| Messaging | isNotNull(messaging.system) | messaging.system, messaging.destination.name, messaging.operation.type | Messaging Span Analysis |
| RPC / gRPC | isNotNull(rpc.system) | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code | RPC Span Analysis |
| Serverless / FaaS | isNotNull(faas.name) and span.kind == "server" | faas.name, faas.trigger.type, cloud.provider | Serverless Span Analysis |
⚠️ Database spans: Can be aggregated (one span = multiple calls). Always use aggregation.count extrapolation for accurate operation counts.
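A minimal sketch of that extrapolation for database spans, using the detection filter from the table above. It only accounts for aggregation; sampling multiplicity, as described under Sampling and Extrapolation, would still have to be multiplied in for fully extrapolated counts. The grouping fields and timeframe are illustrative choices.

```dql
fetch spans, from:now() - 1h
| filter span.kind == "client" and isNotNull(db.system)
| summarize {
    spans = count(),
    // One aggregated span can stand for several database calls
    operations = sum(coalesce(aggregation.count, 1))
  }, by: { dt.service.name, db.system, db.namespace }
| sort operations desc
```

Where operations is noticeably larger than spans, counting spans alone would under-report database activity.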
Detailed patterns per span type: See the reference files above.
Advanced Topics
Exception Analysis
Exceptions are stored as span.events within spans:
fetch spans
| filter iAny(span.events[][span_event.name] == "exception")
| expand span.events
| fieldsFlatten span.events, fields: { exception.type }
| summarize {
    count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { exception.type }
| fields exception.type, `count()`, trace.id=trace[trace.id], start_time=trace[start_time]
💡 Tip: Use iAny() to check conditions within span event arrays.
→ Logs Correlation - joining logs and traces, filtering traces by log content
→ Network Analysis - client IPs, DNS resolution, subnet analysis
Best Practices
| Area | Rule |
| --- | --- |
| Filtering | Apply request.is_root_span == true and endpoint filters first |
| Sampling | Use samplingRatio (e.g., 100 reads about 1% of the data) to speed up queries |
| Percentiles | Use p95/p99 over averages for performance analysis |
| Root spans | Use request.is_root_span == true for end-to-end analysis |
| Trace grouping | Group by trace.id for complete trace metrics |
| Request grouping | Group by request.id for OneAgent-only request metrics |
| Extrapolation | Always apply multiplicity for accurate operation counts |
| Exemplars | Use takeAny(record(start_time, trace.id)) to enable UI drilldown |
Troubleshooting
| Problem | Cause | Solution |
| --- | --- | --- |
| Duration values seem wrong (too large) | duration is in nanoseconds, not milliseconds | Divide by 1000000 to get milliseconds, or compare against a DQL duration literal such as 5s |
| Span counts don't match expected request volume | Sampling or aggregation not accounted for | Use multiplicity extrapolation; see the Sampling and Extrapolation reference |
| getNodeName(dt.smartscape.service) returns null | Service not yet resolved or OneAgent not monitoring | Verify OneAgent monitors the service; entity resolution may have a short delay |
| request.is_root_span filter returns nothing | Querying OpenTelemetry-only traces without OneAgent | Use isNull(span.parent_id) as fallback for root span detection |
| trace.id filter returns no results | Trace ID not converted to UID format | Use filter trace.id == toUid("abc123...") for string-based trace IDs |
| Database span counts are too low | Database spans are aggregated (one span = N calls) | Always use aggregation.count extrapolation for database operation counts |
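The root-span fallback from the troubleshooting entries above can be sketched as a single filter that accepts either signal. This is an illustrative combination rather than a pattern from the source: parentless spans are not guaranteed to be request entry points, so results on mixed OneAgent and OpenTelemetry data should be spot-checked.

```dql
fetch spans, from:now() - 1h
// Prefer the request root flag; fall back to parentless spans when the flag is missing
| filter request.is_root_span == true or isNull(span.parent_id)
| summarize {
    requests = count(),
    p95 = percentile(duration, 95)
  }, by: { dt.service.name }
| sort requests desc
```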
Related Skills
- dt-dql-essentials - Core DQL syntax for querying trace data
- dt-app-dashboards - Embed trace queries in dashboards
- dt-migration - Smartscape entity model and relationship navigation
References
Detailed documentation for specific topics:
- Performance Analysis - Advanced timeseries, duration buckets, endpoint ranking
- Failure Detection - Failure reasons, exception investigation, custom rules
- Sampling and Extrapolation - Multiplicity calculation, database extrapolation
- Request Attributes - Request attributes, captured attributes, request ID aggregation
- Entity Lookups - Advanced node lookups, infrastructure correlation, hardware analysis
- HTTP Span Analysis - Status codes, payload analysis, client IPs
- Database Span Analysis - Extrapolated counts, slow queries, statement analysis
- Messaging Span Analysis - Kafka, RabbitMQ, SQS throughput and latency
- RPC Span Analysis - gRPC, SOAP, service dependencies
- Serverless Span Analysis - Lambda, Azure Functions, cold start analysis
- Logs Correlation - Joining logs and traces, correlation patterns
- Network Analysis - IP addresses, DNS resolution, communication mapping