SKILL.md
When to Use This Skill
Use this skill when the user wants to:
- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially-complete AI Runway setup from a specific step
MCP Tools
This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.
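Because every operation goes through `kubectl` and `make`, a quick preflight check can confirm that both are available before step 1. This is a minimal sketch: the helper name `missing_tools` is hypothetical and not part of AI Runway, and it only tests whether each named command resolves on `PATH`.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (prints nothing when both tools are installed):
# missing_tools kubectl make
```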
Rules
- Execute steps in sequence — load the reference for each step as you reach it
- Report cluster state at each step: ✓ healthy, ✗ missing/failed
- Ask for user confirmation before any install or deployment action
- If a step is already complete, report status and skip to the next step
- If the user provides `skip-to-step N`, start at step N; assume prior steps are complete
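The reporting and skip rules above could be sketched as two small shell helpers. Both names (`report`, `should_run`) are illustrative assumptions, not part of AI Runway: `report` runs any command and prints a ✓ or ✗ line for it, and `should_run` decides whether a step number falls at or after the requested starting step.

```shell
# Hypothetical helper: run a check command and print "✓ <label>" on success,
# "✗ <label>" on failure. Returns non-zero when the check fails.
report() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf '✓ %s\n' "$label"
  else
    printf '✗ %s\n' "$label"
    return 1
  fi
}

# Hypothetical helper: succeed when step $1 should run, given an optional
# starting step $2 (defaults to 1, meaning "run everything").
should_run() {
  step=$1
  start=${2:-1}
  [ "$step" -ge "$start" ]
}

# Example calls:
# report "kubeconfig context" kubectl config current-context
# should_run 2 4 || echo "step 2 assumed complete"
```

Keeping the check command separate from the label lets the same helper wrap any `kubectl` query without changing how status is printed.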
Steps
| # | Step | Reference |
|---|------|-----------|
| 1 | Cluster Verification: context check, node inventory, GPU detection | |
| 2 | Controller Installation: CRD + controller deployment | |
| 3 | GPU Assessment: detect GPU models, flag dtype/attention constraints | |
| 4 | Provider Setup: recommend and install inference provider | |
| 5 | First Deployment: pick a model, deploy, verify Ready | |
| 6 | Summary: recap, smoke test, next steps | |
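Steps 1 and 3 both depend on knowing which nodes expose GPUs. The following is a rough sketch, assuming the NVIDIA device plugin is running so that nodes advertise the `nvidia.com/gpu` resource. The helper names `list_gpu_nodes` and `count_gpus` are illustrative and not part of AI Runway.

```shell
# Hypothetical helper: list each node with its allocatable GPU count.
# Nodes without the nvidia.com/gpu resource show "<none>" in the second column.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Hypothetical helper: sum the GPU column read from stdin.
# Non-numeric values such as "<none>" are ignored.
count_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Example calls:
# kubectl config current-context
# list_gpu_nodes | count_gpus
```

A total of 0 suggests either a CPU-only cluster or a missing device plugin, which is worth reporting before moving on to controller installation.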
Error Handling
| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment <name> -n <namespace>` and check the events |
| bfloat16 errors at inference | T4 and V100 lack bfloat16 support | Add `--dtype float16` to serving args |
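The bfloat16 remediation can be decided ahead of deployment rather than after the first inference error. The sketch below is an assumption-laden example: `dtype_args_for_gpu` is a hypothetical helper, and it expects a GPU product string such as the value of the `nvidia.com/gpu.product` node label, which is set by NVIDIA GPU Feature Discovery and may not be present on every cluster. It covers only the two GPU families named in this document.

```shell
# Hypothetical helper: print extra serving args for a GPU product name.
# T4 (Turing) and V100 (Volta) lack bfloat16 support, so fall back to float16.
# Any other GPU prints nothing, leaving the serving default unchanged.
dtype_args_for_gpu() {
  case "$1" in
    *T4*|*V100*) printf '%s\n' '--dtype float16' ;;
    *) : ;;
  esac
}

# Example call:
# dtype_args_for_gpu "Tesla-T4"
```

Matching on a substring keeps the helper tolerant of vendor prefixes and memory-size suffixes in the product name, at the cost of being a heuristic rather than a capability query.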
For full error handling and rollback procedures, see troubleshooting.md.