Name: dt-obs-tracing
Author: dynatrace

SKILL.md

$2c

3. Map Service Dependencies

Goal: Understand service-to-service communication patterns and external API calls

Trigger: "service dependencies", "what services does X call", "outgoing HTTP calls"

Done: Dependency map showing call counts, latency, and error rates between services

Core Concepts

Understanding Traces and Spans

Spans represent logical units of work in distributed traces:

HTTP requests, RPC calls, database operations

Messaging system interactions

Internal function invocations

Custom instrumentation points

Span kinds:

span.kind: server - Incoming call to a service

span.kind: client - Outgoing call from a service

span.kind: consumer - Incoming message consumption call to a service

span.kind: producer - Outgoing message production call from a service

span.kind: internal - Internal operation within a service

Root spans: A request root span (request.is_root_span == true) represents an incoming call to a service. Use this to analyze end-to-end request performance.

Key Trace Attributes

Essential attributes for trace analysis:

Attribute

Description

trace.id

Unique trace identifier

span.id

Unique span identifier

span.parent_id

Parent span ID (null for root spans)

request.is_root_span

Boolean, true for request entry points

request.is_failed

Boolean, true if request failed

duration

Span duration in nanoseconds

span.timing.cpu

Overall CPU time of the span (stable)

span.timing.cpu_self

CPU time excluding child spans (stable)

dt.smartscape.service

Service Smartscape node ID

dt.service.name

Dynatrace service name derived from service detection rules. It is equal to the Smartscape service node name.

endpoint.name

Endpoint/route name

Service Context

Spans reference services via Smartscape node IDs and the detected service name dt.service.name which is also present on every span.

fetch spans

| summarize spans=count(), by: { dt.smartscape.service, dt.service.name }

Node functions:

getNodeName(dt.smartscape.service) - Adds dt.smartscape.service.name field with the human-readable service name

getNodeField(dt.smartscape.service, "attribute_name") - Access specific node attributes

📖 Learn more: See Entity Lookups for advanced entity selectors, infrastructure correlation, and hardware analysis.

Sampling and Extrapolation

One span can represent multiple real operations due to:

Aggregation: Multiple operations in one span (aggregation.count)

ATM (Adaptive Traffic Management): Head-based sampling by agent

ALR (Adaptive Load Reduction): Server-side sampling

Read Sampling: Query-time sampling via samplingRatio parameter

When to extrapolate: Always extrapolate when counting actual operations (not just spans). Use the multiplicity factor:

fetch spans

| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)

| fieldsAdd sampling.multiplicity = 1 / sampling.probability

| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)

                         * coalesce(aggregation.count, 1)

                         * dt.system.sampling_ratio

| summarize operation_count = sum(multiplicity)

📖 Learn more: See Sampling and Extrapolation for detailed formulas and examples.

Common Query Patterns

Basic Span Access

Fetch spans and explore by type:

fetch spans | limit 1

Explore spans by function and type:

fetch spans

| summarize count(), by: { span.kind, code.namespace, code.function }

Request Root Filtering

List request root spans (incoming service calls):

fetch spans

| filter request.is_root_span == true

| fields trace.id, span.id, start_time, response_time = duration, endpoint.name

| limit 100

Service Performance Summary

Analyze service performance with error rates:

fetch spans

| filter request.is_root_span == true

| summarize

    total_requests = count(),

    failed_requests = countIf(request.is_failed == true),

    avg_duration = avg(duration),

    p95_duration = percentile(duration, 95),

    by: {dt.service.name}

| fieldsAdd error_rate = (failed_requests * 100.0) / total_requests

| sort error_rate desc

Trace ID Lookup

Find all spans in a specific trace:

fetch spans

| filter trace.id == toUid("abc123def456")

| fields span.name, duration, dt.service.name

Performance Analysis

Response Time Percentiles

Calculate percentiles by endpoint:

fetch spans

| filter request.is_root_span == true

| summarize {

    requests=count(),

    avg_duration=avg(duration),

    p95=percentile(duration, 95),

    p99=percentile(duration, 99)

  }, by: { endpoint.name }

| sort p99 desc

💡 Best practice: Use percentiles (p95, p99) over averages for performance insights.

Slow Trace Detection

Find requests exceeding a threshold:

fetch spans, from:now() - 2h

| filter request.is_root_span == true

| filter duration > 5s

| fields trace.id, span.name, dt.service.name, duration

| sort duration desc

| limit 50

Duration Buckets with Exemplars

fetch spans, from:now() - 24h

| filter http.route == "/api/v1/storage/findByISBN"

| summarize {

    spans=count(),

    trace=takeAny(record(start_time, trace.id))

  }, by: { bin(duration, 10ms) }

| fields `bin(duration, 10ms)`, spans, trace.id=trace[trace.id], start_time=trace[start_time]

Performance Timeseries

Extract response time as timeseries:

fetch spans, from:now() - 24h

| filter request.is_root_span == true

| makeTimeseries {

    requests=count(),

    avg_duration=avg(duration),

    p95=percentile(duration, 95),

    p99=percentile(duration, 99)

  }, by: { endpoint.name }

📖 Learn more: See Performance Analysis for advanced patterns and timeseries techniques.

Failure Investigation

Failed Request Summary

Summarize failures by service:

fetch spans

| filter request.is_root_span == true

| summarize

    total = count(),

    failed = countIf(request.is_failed == true),

  by: { dt.service.name }

| fieldsAdd failure_rate = (failed * 100.0) / total

| sort failure_rate desc

Failure Reason Analysis

Breakdown by failure detection reason:

fetch spans

| filter request.is_failed == true and isNotNull(dt.failure_detection.results)

| expand dt.failure_detection.results

| summarize count(), by: { dt.failure_detection.results[reason] }

Failure reasons:

http_code - HTTP response code triggered failure

grpc_code - gRPC status code triggered failure

exception - Exception caused failure

span_status - Span status indicated failure

custom_rule - Custom failure detection rule matched

HTTP Code Failures

Find failures by HTTP status code:

fetch spans

| filter request.is_failed == true

| filter iAny(dt.failure_detection.results[][reason] == "http_code")

| summarize count(), by: { http.response.status_code, endpoint.name }

| sort `count()` desc

Recent Failed Requests

List recent failures with details:

fetch spans

| filter request.is_root_span == true and request.is_failed == true

| fields

    start_time,

    trace.id,

    endpoint.name,

    http.response.status_code,

    duration

| sort start_time desc

| limit 100

📖 Learn more: See Failure Detection for exception analysis and custom rule investigation.

Service Dependencies

Service-to-Service Analysis

Analyze service communication patterns:

fetch spans, from:now() - 1h

| filter isNotNull(server.address)

| fieldsAdd

    remote_side = server.address

| summarize

    call_count = count(),

    avg_duration = avg(duration),

    by: {dt.service.name, remote_side}

| sort call_count desc

Outgoing HTTP Calls

Identify external API dependencies:

fetch spans

| filter span.kind == "client" and isNotNull(http.request.method)

| summarize

    calls = count(),

    avg_latency = avg(duration),

    p99_latency = percentile(duration, 99),

  by: { dt.service.name, server.address, server.port }

| sort calls desc

Trace Aggregation

Complete Trace Analysis

Aggregate all spans in a trace to understand full request flow:

fetch spans, from:now() - 30m

| summarize {

    spans = count(),

    client_spans = countIf(span.kind == "client"),

    // Endpoints involved in the trace

    endpoints = toString(arrayRemoveNulls(collectDistinct(endpoint.name))),

    // Extract the first request root in the trace

    trace_root = takeMin(record(

        root_detection_helper = coalesce(

            if(request.is_root_span, 1),

            if(isNull(span.parent_id), 2),

            3),

        start_time, endpoint.name, duration

      ))

}, by: { trace.id }

| fieldsFlatten trace_root

| fieldsRemove trace_root.root_detection_helper, trace_root

| fields

    start_time = trace_root.start_time,

    endpoint = trace_root.endpoint.name,

    response_time = trace_root.duration,

    spans,

    client_spans,

    endpoints,

    trace.id

| sort start_time

| limit 100

Root detection strategy: Use takeMin(record(...)) with a detection helper to reliably find the root request:

Priority 1: Spans with request.is_root_span == true

Priority 2: Spans without parent (root spans)

Priority 3: All other spans

Multi-Service Traces

Find traces spanning multiple services:

fetch spans, from:now() - 1h

| summarize {

    services = collectDistinct(dt.service.name),

    trace_root = takeMin(record(root_detection_helper = coalesce(if(request.is_root_span, 1), 2), endpoint.name))

}, by: { trace.id }

| fieldsAdd service_count = arraySize(services)

| filter service_count > 1

| fields endpoint = trace_root[endpoint.name], service_count, services = toString(services), trace.id

| sort service_count desc

| limit 50

Request-Level Analysis

Request Attributes

Access custom request attributes captured by OneAgent on request root spans:

fetch spans

| filter request.is_root_span == true

| filter isNotNull(request_attribute.PaidAmount)

| makeTimeseries sum(request_attribute.PaidAmount)

Field patterns: request_attribute.<name>, captured_attribute.<name> (always arrays)

→ Request Attributes — full patterns for request attributes, captured attributes, and request ID aggregation

Span Types

Span Type

Detection

Key Fields

Reference

HTTP server (incoming)

span.kind == "server" and isNotNull(http.request.method)

http.route, http.request.method, http.response.status_code

http-spans.md

HTTP client (outgoing)

span.kind == "client" and isNotNull(http.request.method)

server.address, server.port

http-spans.md

Database

span.kind == "client" and isNotNull(db.system)

db.system, db.namespace, db.statement

database-spans.md

Messaging

isNotNull(messaging.system)

messaging.system, messaging.destination.name, messaging.operation.type

messaging-spans.md

RPC / gRPC

isNotNull(rpc.system)

rpc.system, rpc.service, rpc.method, rpc.grpc.status_code

rpc-spans.md

Serverless / FaaS

isNotNull(faas.name) and span.kind == "server"

faas.name, faas.trigger.type, cloud.provider

serverless-spans.md

⚠️ Database spans: Can be aggregated (one span = multiple calls). Always use aggregation.count extrapolation for accurate operation counts.

📖 Detailed patterns per span type: See the reference files above.

Advanced Topics

Exception Analysis

Exceptions are stored as span.events within spans:

fetch spans

| filter iAny(span.events[][span_event.name] == "exception")

| expand span.events

| fieldsFlatten span.events, fields: { exception.type }

| summarize {

    count(),

    trace=takeAny(record(start_time, trace.id))

  }, by: { exception.type }

| fields exception.type, `count()`, trace.id=trace[trace.id], start_time=trace[start_time]

💡 Tip: Use iAny() to check conditions within span event arrays.

→ Logs Correlation — joining logs and traces, filtering traces by log content

→ Network Analysis — client IPs, DNS resolution, subnet analysis

Best Practices

Area

Rule

Filtering

Apply request.is_root_span == true and endpoint filters first

Sampling

Use samplingRatio (e.g., 100 = read 1%) for performance

Percentiles

Use p95/p99 over averages for performance analysis

Root spans

Use request.is_root_span == true for end-to-end analysis

Trace grouping

Group by trace.id for complete trace metrics

Request grouping

Group by request.id for OneAgent-only request metrics

Extrapolation

Always apply multiplicity for accurate operation counts

Exemplars

Use takeAny(record(start_time, trace.id)) to enable UI drilldown

Troubleshooting

Problem

Cause

Solution

Duration values seem wrong (too large)

duration is in nanoseconds, not milliseconds

Divide by 1000000 or compare with 5s (DQL duration literal)

Span counts don't match expected request volume

Sampling or aggregation not accounted for

Use multiplicity extrapolation — see Sampling and Extrapolation reference

getNodeName(dt.smartscape.service) returns null

Service not yet resolved or OneAgent not monitoring

Verify OneAgent monitors the service; entity resolution may have a short delay

request.is_root_span filter returns nothing

Querying OpenTelemetry-only traces without OneAgent

Use isNull(span.parent_id) as fallback for root span detection

trace.id filter returns no results

Trace ID not converted to UID format

Use filter trace.id == toUid("abc123...") for string-based trace IDs

Database span counts are too low

Database spans are aggregated (one span = N calls)

Always use aggregation.count extrapolation for database operation counts

Related Skills

dt-dql-essentials — Core DQL syntax for querying trace data

dt-app-dashboards — Embed trace queries in dashboards

dt-migration — Smartscape entity model and relationship navigation

References

Detailed documentation for specific topics:

Performance Analysis - Advanced timeseries, duration buckets, endpoint ranking

Failure Detection - Failure reasons, exception investigation, custom rules

Sampling and Extrapolation - Multiplicity calculation, database extrapolation

Request Attributes - Request attributes, captured attributes, request ID aggregation

Entity Lookups - Advanced node lookups, infrastructure correlation, hardware analysis

HTTP Span Analysis - Status codes, payload analysis, client IPs

Database Span Analysis - Extrapolated counts, slow queries, statement analysis

Messaging Span Analysis - Kafka, RabbitMQ, SQS throughput and latency

RPC Span Analysis - gRPC, SOAP, service dependencies

Serverless Span Analysis - Lambda, Azure Functions, cold start analysis

Logs Correlation - Joining logs and traces, correlation patterns

Network Analysis - IP addresses, DNS resolution, communication mapping

dt-obs-tracing