SKILL.md
Infrastructure Kubernetes
Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query
cluster resources, monitor workload health, analyze pod placement, optimize
costs, and assess security posture.
When to Use This Skill
- Monitoring Kubernetes cluster health and capacity
- Analyzing pod and container resource utilization
- Investigating pod failures, OOMKills, evictions, or crash loops
- Debugging degraded deployments, stuck rollouts, or node pressure
- Optimizing Kubernetes resource costs
- Assessing security posture and compliance
- Troubleshooting workload scheduling and placement
- Auditing ingress routing and network policies
Knowledge Base Structure
Core Monitoring (Start Here)
- Cluster Inventory → references/cluster-inventory.md - Clusters, namespaces, resource distribution
- Node Monitoring - Node capacity, CPU/memory usage, pod density
- Pod Monitoring - Pod CPU, memory, lifecycle events
- Workload Monitoring - Deployment, StatefulSet, DaemonSet resources
Advanced Topics
- Configuration Analysis → references/labels-annotations.md - Parse k8s.object, labels, annotations
- Scheduling & Placement → references/pod-node-placement.md - Node selectors, affinity, taints, HA
- Cost Optimization - Right-sizing, waste detection, efficiency scoring
- Security & Compliance - Privileged containers, security contexts
Key Concepts
Entity Types
Workloads: K8S_DEPLOYMENT, K8S_STATEFULSET, K8S_DAEMONSET,
K8S_JOB, K8S_CRONJOB, K8S_HORIZONTALPODAUTOSCALER
Infrastructure: K8S_CLUSTER, K8S_NAMESPACE, K8S_NODE, K8S_POD
Configuration: K8S_SERVICE, K8S_CONFIGMAP, K8S_SECRET,
K8S_PERSISTENTVOLUMECLAIM, K8S_PERSISTENTVOLUME, K8S_INGRESS,
K8S_NETWORKPOLICY
Query Types
smartscapeNodes - Query K8s entities:
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name
timeseries - Monitor metrics over time:
timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
by: {k8s.pod.name, k8s.namespace.name}
| fieldsAdd avg_cpu = arrayAvg(cpu)
fetch logs - Analyze log events:
fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"
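As a rough sketch of how the log query above might be extended, the following groups error logs by pod so the noisiest pods surface first. The two-hour window and the limit of 20 are illustrative assumptions, not values from this skill.

```dql
// Sketch: count ERROR logs per pod in one namespace (time window and limit are assumptions)
fetch logs, from: now() - 2h
| filter k8s.namespace.name == "production" and loglevel == "ERROR"
| summarize error_count = count(), by: {k8s.pod.name}
| sort error_count desc
| limit 20
```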
Core Fields
k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
k8s.workload.name, k8s.workload.kind, k8s.container.name
k8s.object - Full JSON configuration for deep inspection
tags[label] - Access labels and annotations
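To illustrate the tags[label] accessor, here is a minimal sketch that surfaces two labels as columns. The label keys app and team are assumed examples (they match the labels suggested under Configuration Standards); substitute whatever keys your workloads actually carry.

```dql
// Sketch: expose labels as fields; "app" and "team" are hypothetical label keys
smartscapeNodes K8S_POD
| fieldsAdd app = tags[app], team = tags[team]
| filter isNotNull(app)
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, app, team
```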
Available Metrics
CPU (dt.kubernetes.container.*): cpu_usage, cpu_throttled, limits_cpu, requests_cpu
Memory (dt.kubernetes.container.*): memory_working_set, limits_memory, requests_memory
Operations (dt.kubernetes.container.*): restarts, oom_kills
Node: dt.kubernetes.node.pods_allocatable, dt.kubernetes.node.cpu_allocatable,
dt.kubernetes.node.memory_allocatable, dt.kubernetes.pods
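The memory metrics listed above can be combined to spot pods running close to their limits. This is a sketch only: the 90% threshold is an assumption, and pods without a memory limit produce no percentage and drop out of the result.

```dql
// Sketch: peak working set as a share of the memory limit, per pod (90% threshold is an assumption)
timeseries {
    mem_used = sum(dt.kubernetes.container.memory_working_set),
    mem_limit = sum(dt.kubernetes.container.limits_memory)
  }, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd mem_pct = (arrayMax(mem_used) / arrayAvg(mem_limit)) * 100
| filter mem_pct > 90
| sort mem_pct desc
```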
Entity Disambiguation
K8S_POD vs CONTAINER: these are different entity types in Dynatrace.
- **K8S_POD** - K8s-native entities with k8s.object JSON, scheduling state, conditions, and K8s metrics. Use this skill.
- **CONTAINER** - Host-level container inventory (image, lifetime, host assignment). Use the dt-obs-hosts skill instead.
The smartscape edge is CONTAINER --(is_part_of)--> K8S_POD. To reach containers from a pod, traverse backward:
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id
Service → K8S_POD Correlation
No direct smartscape edge exists between SERVICE and K8S_POD. The correlation key is the shared dimension k8s.workload.name. See Service → Pod Drill-Down in references/pod-debugging.md for the full two-step pattern.
Common Workflows
1. Cluster Health Check
List all clusters:
smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution
Check node capacity:
timeseries {
current_pods = avg(dt.kubernetes.pods),
max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80
Identify pods in non-Running state:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase
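For a quick overview before listing individual pods, the same phase field can be aggregated. A minimal sketch, reusing the parse pattern from the query above:

```dql
// Sketch: pod count per phase, per cluster and namespace
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| summarize pod_count = count(), by: {k8s.cluster.name, k8s.namespace.name, phase}
| sort pod_count desc
```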
2. Resource Optimization
Find over-provisioned pods (usage < 30%):
timeseries {
    cpu_usage = sum(dt.kubernetes.container.cpu_usage),
    // sum (not avg) so requests aggregate across containers the same way usage does
    cpu_requests = sum(dt.kubernetes.container.requests_cpu)
  }, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0
Identify containers without limits:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
container_name = container[name],
cpu_limit = container[resources][limits][cpu],
memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)
3. Troubleshooting Pod Issues
Pod troubleshooting benefits from combining metrics (timeseries) with
Kubernetes events (event stream) for a complete picture.
#### Metrics-Based Troubleshooting
Find pods with OOMKills:
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
    by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| filter total_oom_kills > 0
| sort total_oom_kills desc
Analyze pod restart patterns:
timeseries restarts = sum(dt.kubernetes.container.restarts),
by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5
#### Event-Based Troubleshooting
For operational events (pod restarts, OOM kills, evictions, scheduling failures),
Kubernetes events provide richer context than metrics alone — including event
reasons, messages, and timestamps.
When to use Kubernetes events over metrics:
- User asks about recent operational events ("show me pod restart events")
- User wants event details like reasons and messages
- User asks about events in a specific time window ("last 48 hours")
- User wants to correlate events with root causes
Kubernetes events are available through the get-events-for-kubernetes-cluster
tool. Prefer this tool when the user asks about OOM events, pod restarts,
evictions, or cluster-wide event history.
Important: distinguish event types when filtering results. Kubernetes events
cover many categories. When the user asks about a specific event type, filter
the results accordingly — do not report unrelated events:
| User Asks About | Relevant Event Reasons | NOT Related |
|---|---|---|
| Pod restarts | BackOff, CrashLoopBackOff, Killing | Readiness probe failures, CPU throttling |
| OOM events | OOMKilling, OOMKilled | Memory pressure warnings |
| Evictions | Evicted, Preempting | Node pressure |
| Scheduling failures | FailedScheduling, Unschedulable | Resource quotas |
For a complete answer, combine both approaches:
- Use the events tool to get the event details (what happened, when, why)
- Use timeseries metrics to show the quantitative impact (how many restarts,
OOM kill counts over time)
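The metrics half of that combination might look like the sketch below, which restricts the restart metric to the same window a user asked about. The 48-hour window and one-hour interval are assumptions chosen to match the "last 48 hours" example above.

```dql
// Sketch: restart counts over a fixed window, to pair with event details (window/interval are assumptions)
timeseries restarts = sum(dt.kubernetes.container.restarts),
    by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name},
    from: now() - 48h, interval: 1h
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 0
| sort total_restarts desc
```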
#### Fetch Kubernetes Events via DQL
Pod restart and operational events can also be queried via DQL from the events
table:
fetch events
| filter event.kind == "K8S_EVENT"
| filter event.type == "Warning"
| fields timestamp, k8s.cluster.name, k8s.namespace.name, k8s.pod.name,
event.reason, event.message
| sort timestamp desc
| limit 50
Filter for specific event reasons:
fetch events
| filter event.kind == "K8S_EVENT"
| filter in(event.reason, {"OOMKilling", "BackOff", "Evicted", "FailedScheduling"})
| fields timestamp, k8s.cluster.name, k8s.namespace.name, k8s.pod.name,
event.reason, event.message
| sort timestamp desc
**Field names in fetch events:** Use event.reason and event.message — not
dt.kubernetes.event.reason. The dt.kubernetes.* prefix is for timeseries metrics,
not the events table. Queries using the wrong prefix return zero results.
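When the raw event list is too long to scan, grouping by reason can give a first overview. The following is a sketch under the same field-name assumptions as the queries above; the 48-hour window is illustrative.

```dql
// Sketch: warning events grouped by reason, with the most recent occurrence
fetch events, from: now() - 48h
| filter event.kind == "K8S_EVENT" and event.type == "Warning"
| summarize event_count = count(), last_seen = max(timestamp),
    by: {k8s.cluster.name, k8s.namespace.name, event.reason}
| sort event_count desc
```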
4. Security Assessment
Identify privileged containers:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
container_name = container[name],
privileged = container[securityContext][privileged]
| filter privileged == true
Find containers running as root:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    run_as_user = container[securityContext][runAsUser],
    run_as_non_root = container[securityContext][runAsNonRoot]
// runAsNonRoot is often unset; test for null explicitly so those containers are not excluded
| filter (isNull(run_as_user) or run_as_user == 0)
    and (isNull(run_as_non_root) or run_as_non_root != true)
5. Scheduling Analysis
Verify pod distribution (HA compliance):
smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
node_count = countDistinct(k8s.node.name),
by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant
6. DAVIS Problems affecting K8s Entities
Find active DAVIS problems affecting K8s entities:
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids
Each entry in smartscape.affected_entity.ids (an array of Smartscape IDs) identifies one affected entity; use the ID to look that entity up.
Best Practices
Choosing the Right Data Source
| User Question | Best Approach | Why |
|---|---|---|
| "Show me OOM events" | Events tool + metrics | Events give reasons/messages; metrics show trends |
| "Show me pod restart events" | Events tool + timeseries metrics | Events reveal the reason (BackOff, Killing, CrashLoopBackOff); the dt.kubernetes.container.restarts metric gives the actual restart counts |
| "How many pod restarts?" | Timeseries metrics | Quantitative data over time |
| "What happened to my pods in the last 48h?" | Events tool | Operational event history with context |
| "Which pods are using the most CPU?" | Timeseries metrics | Resource utilization analysis |
| "List all clusters/namespaces" | smartscapeNodes | Entity discovery and inventory |
| "Are there scheduling failures?" | Events tool | Event reasons explain why |
Query Performance
- Filter early - Apply cluster/namespace filters immediately
- Use specific entity types - Avoid wildcards
- Limit result sets - Use limit for exploration
- Cache cluster lists - Store in variables
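A minimal sketch of the first three points applied together: one specific entity type, cluster and namespace filters placed directly after the source, and a limit for exploration. The placeholder values and the limit of 20 are assumptions.

```dql
// Sketch: filter early, select only needed fields, cap the result set
smartscapeNodes K8S_POD
| filter k8s.cluster.name == "<cluster>" and k8s.namespace.name == "<namespace>"
| fields k8s.pod.name, k8s.node.name, k8s.workload.name
| limit 20
```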
Monitoring Recommendations
- Set resource limits on all containers
- Monitor OOMKills and adjust memory limits
- Track CPU throttling and adjust CPU limits
- Review resource efficiency regularly (target 70-80% usage relative to requests)
- Implement security best practices (non-root, read-only filesystem)
- Use specific image tags (avoid :latest)
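For the CPU throttling recommendation, a query along these lines could rank the most throttled pods. It is a sketch that assumes the cpu_throttled metric carries the same pod-level dimensions as the other container metrics used in this skill.

```dql
// Sketch: pods ranked by total CPU throttling (dimension availability is an assumption)
timeseries throttled = sum(dt.kubernetes.container.cpu_throttled),
    by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_throttled = arraySum(throttled)
| filter total_throttled > 0
| sort total_throttled desc
| limit 20
```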
Configuration Standards
- Use labels for organization (app, environment, team)
- Set resource requests and limits
- Configure health checks (liveness/readiness probes)
- Use TLS for all ingress resources
- Document with annotations
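The health-check standard can be audited with the same k8s.object parsing pattern used in the security queries. A sketch, assuming probes are defined at container level under the standard Kubernetes field names livenessProbe and readinessProbe:

```dql
// Sketch: containers missing a liveness or readiness probe
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    liveness = container[livenessProbe],
    readiness = container[readinessProbe]
| filter isNull(liveness) or isNull(readiness)
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```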
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| No pod data returned | Wrong entity type or missing cluster filter | Use K8S_POD (not POD); add k8s.cluster.name filter |
| k8s.object parsing errors | Complex JSON structure | Use parse k8s.object, "JSON:config" then access nested fields |
| Pod network metrics unavailable | Not available in Grail | Use service mesh metrics or host-level network metrics |
| Large result sets | No time range or cluster filter | Add time range and filter by cluster/namespace early |
| Missing labels in output | Labels accessed incorrectly | Use tags[label_name] to access labels |
Limitations
Unavailable Metrics:
- Pod network metrics (rx_bytes, tx_bytes) are NOT available in Grail
- Workaround: Use service mesh metrics or host-level network metrics
Query Considerations:
- Minimize result set size: Do not include the k8s.object field unless necessary
- Keep result sets as simple as possible: Parsing k8s.object increases query complexity
- Large clusters may require pagination or time-range limits
- Some K8s status fields update asynchronously
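One way to follow the first two considerations is to extract only the values needed and then drop the raw and parsed JSON before returning results. A sketch, with the namespace placeholder and the limit of 100 as assumptions:

```dql
// Sketch: keep the extracted value, remove the heavy k8s.object and parsed config fields
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| fieldsRemove k8s.object, config
| limit 100
```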
When to Load References
Load cluster-inventory.md when:
- Performing cluster, namespace, or resource distribution analysis
- Auditing workload counts across clusters
→ references/cluster-inventory.md
Load labels-annotations.md when:
- Filtering by labels or annotations
- Parsing k8s.object for detailed configuration inspection
→ references/labels-annotations.md
Load pod-node-placement.md when:
- Analyzing scheduling constraints (affinity, taints, tolerations)
- Verifying HA compliance and pod distribution
→ references/pod-node-placement.md
Load pod-debugging.md when:
- Investigating pod exit codes, crash loops, or init container failures
- Diagnosing image pull errors or service-to-pod connectivity issues
- Drilling down from a service problem to pod-level details
→ references/pod-debugging.md
Load workload-health.md when:
- Investigating degraded deployments or stuck rollouts
- Checking node conditions, CPU throttling, or HPA scaling
- Analyzing StatefulSet ordering or DaemonSet coverage
→ references/workload-health.md
Load pv-pvc.md when:
- Working with persistent storage (PVC/PV lifecycle, orphaned volumes)
- Checking StorageClass configurations
Load ingress.md when:
- Analyzing ingress routing rules or TLS certificates
- Auditing ingress controller configurations
Load network-policies.md when:
- Listing or auditing network policies
- Checking namespace isolation configurations
→ references/network-policies.md
References
- cluster-inventory.md — Cluster, namespace, and resource distribution analysis
- labels-annotations.md — Label/annotation filtering and k8s.object parsing
- pod-node-placement.md — Scheduling, affinity, taints, and HA patterns
- pod-debugging.md — Exit codes, pod conditions, init containers, image pull errors, logs, service-to-pod drill-down
- workload-health.md — Degraded deployments, stuck rollouts, node conditions, CPU throttling, HPA, StatefulSet ordering
- pv-pvc.md — PVC/PV lifecycle, phase reference, orphaned volumes, StorageClass
- ingress.md — Routing rule parsing, TLS audit
- network-policies.md — Policy listing, namespace isolation audit
Related Skills
- dt-obs-problems - For problems associated with Kubernetes clusters (use dt.smartscape_source.id with K8S_ prefix filters)
- dt-dql-essentials — Core DQL syntax and query structure
- dt-obs-hosts — Host-level metrics for K8s nodes