SKILL.md
If ddtrace is found in the dependency file → remove that line entirely, then rebuild the image, reload it into the cluster, and restart the pod.
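The removal step can be sketched as a small shell helper. This is a minimal sketch, assuming a pip-style dependency file with one requirement per line (for example requirements.txt); `strip_ddtrace` is a hypothetical name, and other dependency formats would need a different pattern.

```shell
# Hypothetical helper: drop any ddtrace requirement line from a pip-style
# dependency file. The one-requirement-per-line layout is an assumption.
strip_ddtrace() {
  dep_file="$1"
  # Match "ddtrace" alone or followed by extras, a version specifier, a marker,
  # or a comment, so other packages that merely start with "ddtrace" are kept.
  pattern='^[[:space:]]*ddtrace[[:space:]]*([=<>~!;#[].*)?$'
  if grep -Eq "$pattern" "$dep_file"; then
    # grep -v exits non-zero when every line matched; that is fine here.
    grep -Ev "$pattern" "$dep_file" > "$dep_file.tmp" || true
    mv "$dep_file.tmp" "$dep_file"
    echo "removed ddtrace from $dep_file"
  else
    echo "ddtrace not found in $dep_file"
  fi
}
```

Usage: `strip_ddtrace requirements.txt`, then rebuild and reload the image as described above.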
Step 1: Install the Datadog Operator (NOT datadog/datadog)
helm repo add datadog https://helm.datadoghq.com && helm repo update
helm upgrade --install datadog-operator datadog/datadog-operator \
--namespace datadog --create-namespace
kubectl wait --for=condition=Ready pod \
-l app.kubernetes.io/name=datadog-operator -n datadog --timeout=120s
kubectl create secret generic datadog-secret \
  --from-literal=api-key="$DD_API_KEY" -n datadog
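If DD_API_KEY is unset, the secret is still created but holds an empty key. A guard like the following can catch that before the secret is created. It is a sketch only; `require_api_key` is a hypothetical helper name.

```shell
# Hypothetical guard: fail early when DD_API_KEY is missing or empty.
require_api_key() {
  if [ -z "${DD_API_KEY:-}" ]; then
    echo "DD_API_KEY is not set; export it before creating datadog-secret" >&2
    return 1
  fi
}
```

Usage: `require_api_key && kubectl create secret generic datadog-secret --from-literal=api-key="$DD_API_KEY" -n datadog`.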
Step 2: Create DatadogAgent CR with SSI enabled
Save as datadog-agent.yaml, then kubectl apply -f datadog-agent.yaml:
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    site: <DD_SITE>
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    kubelet:
      tlsVerify: false # required for kind/minikube; omit for cloud clusters
  features:
    apm:
      instrumentation:
        enabled: true
Step 3: Apply Unified Service Tags to the application Deployment
Add to both metadata.labels and spec.template.metadata.labels:
tags.datadoghq.com/env: "dev"
tags.datadoghq.com/service: "<service-name>"
tags.datadoghq.com/version: "1.0.0"
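One way to add the labels in both places without hand-editing the manifest is a merge patch. The sketch below only builds the patch JSON; `ust_patch` is a hypothetical helper, and it assumes the values are valid Kubernetes label values (alphanumerics, `-`, `_`, `.`), so no JSON escaping is done. Note that changing spec.template labels starts a rollout on its own, so get the user's confirmation (Step 4) before applying it.

```shell
# Hypothetical helper: print a merge patch that sets the three Unified Service
# Tags on both metadata.labels and spec.template.metadata.labels.
ust_patch() {
  ust_env="$1"; ust_service="$2"; ust_version="$3"
  labels=$(printf '"tags.datadoghq.com/env":"%s","tags.datadoghq.com/service":"%s","tags.datadoghq.com/version":"%s"' \
    "$ust_env" "$ust_service" "$ust_version")
  printf '{"metadata":{"labels":{%s}},"spec":{"template":{"metadata":{"labels":{%s}}}}}\n' \
    "$labels" "$labels"
}

# Example (not run here; <name>, <namespace>, <service-name> are placeholders):
# kubectl patch deployment <name> -n <namespace> --type merge \
#   -p "$(ust_patch dev <service-name> 1.0.0)"
```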
Step 4: Restart app pods and verify SSI init containers
Confirm with the user before restarting. Tell the user: "I need to restart <name> in <namespace> for SSI to inject into the pods. This will cause a brief outage. Ready to proceed?" Wait for confirmation.
kubectl rollout restart deployment/<name> -n <namespace>
kubectl get pods -A -o json | grep -o '"datadog-lib[^"]*"' | sort -u
Expected: "datadog-lib-python-init" (or the language-appropriate init container).
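The check above can be wrapped so that a missing init container produces a non-zero exit instead of empty output. This is a sketch; `ssi_init_containers` is a hypothetical helper that reads the `kubectl get pods ... -o json` output on stdin.

```shell
# Hypothetical helper: list the distinct "datadog-lib*" names found in pod JSON
# on stdin; return 1 when none are present (SSI has not injected yet).
ssi_init_containers() {
  found=$(grep -o '"datadog-lib[^"]*"' | sort -u)
  [ -n "$found" ] || return 1
  printf '%s\n' "$found"
}
```

Usage: `kubectl get pods -n <namespace> -o json | ssi_init_containers || echo "no SSI init containers found"`.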
Immediately read .claude/skills/dd-apm/k8s-ssi/agent-install/SKILL.md now, then .claude/skills/dd-apm/k8s-ssi/enable-ssi/SKILL.md — do not proceed from memory or the summary above.
For Linux APM install/instrument tasks: Immediately read .claude/skills/dd-apm/linux-ssi/agent-install/SKILL.md now, then enable-ssi then verify-ssi — do not proceed from memory or the summary above.
For trace search, service analysis, metrics: Continue below.
Requirements
Datadog Labs Pup should be installed. See Setup Pup if not.
Command Execution Order (Token-Efficient)
For scoped commands, use this order:
- Check context first (prior outputs, conversation, saved values).
- If a required value is missing, run a discovery command first.
- If still ambiguous, ask the user to confirm.
- Then run the target command.
- Avoid speculative commands likely to fail.
Quick Start
pup auth login
# Confirm env tag with the user first (do not assume production/prod/prd).
pup apm services list --env <env> --from 1h --to now
pup traces search --query "service:api-gateway" --from 1h
Services
List Services
pup apm services list --env <env> --from 1h --to now
pup apm services stats --env <env> --from 1h --to now
Service Stats
pup apm services stats --env <env> --from 1h --to now
Service Map
# View dependencies
pup apm flow-map --query "service:api-gateway&from=$(($(date +%s)-3600))000&to=$(date +%s)000" --env <env> --limit 10
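The `from`/`to` values in that query are epoch milliseconds, built by appending `000` to epoch seconds. A small helper can make the window explicit and take both bounds from a single clock reading. This is a sketch; `flow_map_query` is a hypothetical name, and the query-string shape is copied from the command above.

```shell
# Hypothetical helper: build the flow-map query string for a service and a
# lookback in seconds. The optional third argument pins "now" (epoch seconds).
flow_map_query() {
  service="$1"; lookback_s="$2"; now_s="${3:-$(date +%s)}"
  # Appending "000" converts whole seconds to milliseconds.
  printf 'service:%s&from=%s000&to=%s000\n' \
    "$service" "$((now_s - lookback_s))" "$now_s"
}
```

Usage: `pup apm flow-map --query "$(flow_map_query api-gateway 3600)" --env <env> --limit 10`.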
Traces
Search Traces
# By service
pup traces search --query "service:api-gateway" --from 1h
# Errors only
pup traces search --query "service:api-gateway status:error" --from 1h
# Slow traces (>1s)
pup traces search --query "service:api-gateway @duration:>1000ms" --from 1h
# With specific tag
pup traces search --query "service:api-gateway @http.url:/api/users" --from 1h
Trace Detail
# No direct get command for a single trace ID.
# Use traces search with a narrow query and time window.
pup traces search --query "trace_id:<trace_id>" --from 1h
Key Metrics
| Metric | What It Measures |
|--------|------------------|
| trace.http.request.hits | Request count |
| trace.http.request.duration | Latency |
| trace.http.request.errors | Error count |
| trace.http.request.apdex | User satisfaction |
Service Level Objectives
Link APM to SLOs:
pup slos create --file slo.json
Common Queries
| Goal | Query |
|------|-------|
| Slowest endpoints | avg:trace.http.request.duration{*} by {resource_name} |
| Error rate | sum:trace.http.request.errors{*} / sum:trace.http.request.hits{*} |
| Throughput | sum:trace.http.request.hits{*}.as_rate() |
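The error-rate query is a plain ratio of errors to hits. As a worked example of the arithmetic, the sketch below turns two counts into a percentage; `error_rate_pct` is a hypothetical helper and the counts are made-up inputs, not pup output.

```shell
# Hypothetical helper: error rate as a percentage from an error count and a
# hit count; prints "n/a" when there were no hits to avoid dividing by zero.
error_rate_pct() {
  awk -v e="$1" -v h="$2" 'BEGIN {
    if (h == 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * e / h
  }'
}
```

For example, `error_rate_pct 25 1000` gives 2.50, meaning 25 errors out of 1000 requests is a 2.5% error rate.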
Troubleshooting
| Problem | Fix |
|---------|-----|
| No traces | Check ddtrace installed, DD_TRACE_ENABLED=true |
| Missing service | Verify DD_SERVICE env var |
| Traces not linked | Check trace headers propagated |
| High cardinality | Don't tag with user_id/request_id |