SKILL.md

Application Services Skill

Name: dt-obs-services
Author: dynatrace

Monitor application service performance, health, and runtime-specific metrics using DQL.

Core Capabilities

1. Service Performance (RED Metrics)

Monitor service Rate, Errors, Duration using metrics-based timeseries queries.

Key Metrics:

dt.service.request.response_time - Response time (microseconds)

dt.service.request.count - Request count

dt.service.request.failure_count - Failed request count

Common Use Cases:

Response time monitoring (avg, p50, p95, p99)

Error rate tracking and spike detection

Traffic analysis (throughput, peaks, growth)

Performance degradation detection

Multi-cluster comparison

Quick Example:

timeseries {

  p95 = percentile(dt.service.request.response_time, 95),

  total_requests = sum(dt.service.request.count),

  failures = sum(dt.service.request.failure_count)

}, by: {dt.service.name}

| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]

→ For detailed queries: See references/service-metrics.md

2. Advanced Service Analysis

Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.

Use Cases:

SLA compliance tracking with custom thresholds

Service health scoring (multi-dimensional)

Operation/endpoint-level performance analysis

Custom error classification

Failure pattern detection with error details

Quick Example:

fetch spans, from: now() - 1h | filter request.is_root_span == true

| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)

| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}

| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total

→ For detailed queries: See references/service-metrics.md

3. Service Messaging Metrics

Monitor message-based service communication (queues, topics).

Key Metrics:

dt.service.messaging.publish.count - Messages sent to queues or topics

dt.service.messaging.receive.count - Messages received from queues or topics

dt.service.messaging.process.count - Messages successfully processed

dt.service.messaging.process.failure_count - Messages that failed processing

Use Cases:

Message throughput monitoring (publish/receive rates)

Message processing failure tracking

Queue/topic health analysis

Consumer lag detection (publish vs receive rate comparison)

Quick Example:

timeseries {

  published = sum(dt.service.messaging.publish.count),

  received = sum(dt.service.messaging.receive.count),

  processed = sum(dt.service.messaging.process.count),

  failed = sum(dt.service.messaging.process.failure_count)

}, by: {dt.service.name}

→ For detailed queries: See references/service-metrics.md

4. Service Mesh Monitoring

Monitor service mesh ingress performance and overhead.

Key Metrics:

dt.service.request.service_mesh.response_time - Mesh response time (microseconds)

dt.service.request.service_mesh.count - Mesh request count

dt.service.request.service_mesh.failure_count - Mesh failure count

Use Cases:

Mesh vs direct performance comparison

Mesh overhead calculation

Mesh failure analysis

gRPC traffic monitoring

Multi-cluster mesh performance

Quick Example:

timeseries {

  direct_p95 = percentile(dt.service.request.response_time, 95),

  mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)

}, by: {dt.service.name}

| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000

→ For detailed queries: See references/service-metrics.md

5. Runtime-Specific Monitoring

Technology-specific runtime performance and resource usage metrics.

Java/JVM - references/java.md

Memory: heap, pools, metaspace

GC: impact, suspension, frequency, pause time

Threads: count monitoring, leak detection

Classes: loading, unloading, growth

Node.js - references/nodejs.md

Event loop: utilization, active handles

V8 heap: memory used, total

GC: collection time, suspension

Process: RSS memory

.NET CLR - references/dotnet.md

Memory: consumption by generation

GC: collection count, suspension time

Thread pool: threads, queued work

JIT: compilation time

Python - references/python.md

Threads: active thread count

Heap: allocated blocks

GC: collection by generation, pause time

Objects: collected, uncollectable

PHP - references/php.md

OPcache: hit ratio, memory, restarts

GC: effectiveness, duration

JIT: buffer usage

Interned strings: usage, buffer

Go - references/go.md

Goroutines: count, leak detection

GC: suspension, collection time

Memory: heap by state, committed

Scheduler: worker threads, queue size

CGo: call frequency

When to Use This Skill

✅ Use for:

Monitoring service performance (response time, errors, traffic)

Calculating SLA compliance

Analyzing service mesh performance

Monitoring messaging throughput and processing failures

Troubleshooting runtime-specific issues (GC, memory, threads)

Multi-cluster service comparison

Operation/endpoint-level analysis

❌ Don't use for:

Infrastructure metrics (use infrastructure skills)

Log analysis (use logs skills)

Distributed tracing workflows (use traces/spans skills)

Database performance (use database skills)

Product documentation or how-to configuration questions → use ask-dynatrace-docs

Agent Instructions

Act First, Refine Later

When a user asks for analysis — threshold checks, anomaly detection, performance

comparisons — proceed immediately with sensible defaults. Do not ask the user

for parameter values you can reasonably assume.

Why this matters: analysis tools (e.g., static-threshold-analyzer) require specific

inputs like threshold values and service scope. The user expects results, not a

parameter interview. Pick reasonable defaults, state them clearly in the response,

and let the user refine.

Default values when not specified:

Parameter

Default

Rationale

Response time threshold

1000 ms (= 1,000,000 µs in the metric's base unit)

Common SLA boundary

Service scope

All services

Show the most relevant violations

Timeframe

From the request, or last 30 min for threshold checks, 2h for general analysis

Matches typical operational windows

Example: threshold violation request

Use create-dql to build a timeseries query for avg(dt.service.request.response_time) grouped by dt.smartscape.service

Pass the query to static-threshold-analyzer with threshold = 1000000 (µs), alertCondition = ABOVE

Resolve entity IDs to names using get-entity-name

Present violations with service names, timestamps, values, and duration

Reading user phrasing: Phrases like "the fixed threshold", "a threshold", or "the limit"

name the type of analysis — static threshold check — not a specific number the user expects

you to already know. "Fixed" distinguishes a static cutoff from a dynamic or seasonal baseline.

When you see these phrases, apply the 1000 ms default from the table above and present

results — the user can then refine if the default doesn't match their intent.

Scope Boundary

This skill covers service performance metrics and runtime monitoring only. If the

user asks a product documentation or configuration question (e.g., "How do I add custom

sensors?", "How do I configure service detection?"), use ask-dynatrace-docs instead —

this skill does not contain configuration how-tos.

Understanding User Intent

Map user questions to capabilities:

User Request

Use Capability

Key Files

"service performance", "response time", "error rate"

Service Performance (RED)

service-metrics.md

"SLA tracking", "health scoring"

Advanced Service Analysis

service-metrics.md

"service mesh", "Istio", "Linkerd", "mesh overhead"

Service Mesh Monitoring

service-metrics.md

"messaging", "queue", "topic", "publish", "consumer"

Service Messaging Metrics

service-metrics.md

"JVM GC", "Java memory", "heap"

Runtime-Specific (Java)

java.md

"Node.js event loop", "V8 heap"

Runtime-Specific (Node.js)

nodejs.md

".NET CLR", "GC generation"

Runtime-Specific (.NET)

dotnet.md

"Python GC", "thread count"

Runtime-Specific (Python)

python.md

"OPcache", "PHP GC"

Runtime-Specific (PHP)

php.md

"goroutines", "Go GC", "scheduler"

Runtime-Specific (Go)

go.md

Query Construction Patterns

1. Metrics-based (timeseries)

Use for: Standard monitoring, dashboards, alerting

Pattern: timeseries <metric> = <aggregation>(<metric_name>), by: {dimensions}

Files: service-metrics.md, all runtime-specific files

2. Span-based (fetch spans)

Use for: Complex filtering, custom logic, detailed analysis

Pattern: fetch spans | filter request.is_root_span == true | fieldsAdd ... | summarize ...

Files: service-metrics.md (Advanced Service Analysis section)

3. Comparison queries

Use append for baseline comparison

Use shift: -15m for time-shifted baselines

Example: Performance degradation detection

Response Construction Guidelines

Always include:

Metric name(s) - Clear metric identifiers

Aggregation - How data is aggregated (avg, sum, percentile)

Grouping - Dimensions used (dt.service.name, k8s.workload.name, etc.)

Unit conversion - Convert microseconds to milliseconds where appropriate

Filtering - Relevant thresholds or conditions

When referencing runtime-specific content:

Check user's technology stack first

Provide only relevant runtime queries (don't overwhelm with all 6 runtimes)

Explain runtime-specific metrics (e.g., "OPcache hit ratio" measures PHP opcode cache efficiency)

Common Workflows

Workflow: Service Health Check

1. Check response time (RED metrics)

2. Check error rate (RED metrics)

3. Check traffic patterns (RED metrics)

4. If runtime-specific issues suspected → Load runtime-specific reference

Workflow: SLA Monitoring

1. Define SLA criteria (e.g., < 3s response time AND < 1% error rate)

2. Use span-based query for custom SLA logic

3. Calculate compliance percentage

4. Filter non-compliant services

Workflow: Service Mesh Analysis

1. Check mesh response time

2. Compare mesh vs direct performance

3. Calculate mesh overhead

4. Analyze mesh failure rates

Workflow: Runtime Troubleshooting

Identify technology stack → Load runtime-specific reference

Check memory/GC metrics → threads/goroutines → runtime features

Troubleshooting

Problem

Cause

Solution

Response time values look too large

Metric is in microseconds

Divide by 1000 to convert to milliseconds

No data for service mesh metrics

Service mesh not configured

Verify mesh sidecar injection is enabled

Runtime metrics missing

Wrong technology or no OneAgent

Confirm the runtime is supported and OneAgent is active

dt.smartscape.service returns SmartscapeId, not name

Need entity name resolution

Use getNodeName(dt.smartscape.service)

Error rate always zero

Using wrong failure metric

Use dt.service.request.failure_count, not custom fields

References

Core Service Monitoring:

references/service-metrics.md - Complete RED metrics, SLA tracking, service mesh queries

Runtime-Specific Monitoring:

references/java.md - Java/JVM monitoring

references/nodejs.md - Node.js monitoring

references/dotnet.md - .NET CLR monitoring

references/python.md - Python monitoring

references/php.md - PHP monitoring

references/go.md - Go runtime monitoring

dt-obs-services

SKILL.md

Application Services Skill

Core Capabilities

1. Service Performance (RED Metrics)

2. Advanced Service Analysis

3. Service Messaging Metrics

4. Service Mesh Monitoring

5. Runtime-Specific Monitoring

When to Use This Skill

Agent Instructions

Act First, Refine Later

Scope Boundary

Understanding User Intent

Query Construction Patterns

Response Construction Guidelines

Common Workflows

Workflow: Service Health Check

Workflow: SLA Monitoring

Workflow: Service Mesh Analysis

Workflow: Runtime Troubleshooting

Troubleshooting

References

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers