3. Generate an avatar video
runcomfy run // \
--input '{"prompt": "...", "audio_url": "https://...", "image_url": "https://..."}' \
--output-dir ./out
CLI deep dive: [`runcomfy-cli`](https://www.skills.sh/agentspace-so/runcomfy-agent-skills/runcomfy-cli) skill.
## Install this skill
npx skills add agentspace-so/runcomfy-agent-skills --skill ai-avatar-video -g
## Pick the right model for the user's intent
Listed newest first. The agent classifies user intent — pre-recorded audio file or just a script? Photoreal portrait or stylized character? Single shot or cinematic composition? — and picks one route below.
**OmniHuman** — `bytedance/omnihuman/api` (default)
ByteDance audio-driven full-body avatar. Feed one portrait + one audio file, get back a video where the subject speaks / sings / gestures naturally. Listed on RunComfy's `/feature/lip-sync` as the curated default.
Pick for: UGC voiceover, virtual presenter, dubbed product demo, multi-language clips from same portrait.
Avoid for: requests with no audio file, where speech has to be generated from a script. Use **HappyHorse 1.0** instead.
**HappyHorse 1.0** — `happyhorse/happyhorse-1-0/text-to-video` (t2v) · `happyhorse/happyhorse-1-0/image-to-video` (i2v)
Arena #1 t2v / i2v with in-pass audio generated from prompt. No external audio file required — quote the spoken line inside the prompt.
Pick for: written script with no audio file, "write a script → get a video", concept clips, i2v talking-head from an existing portrait.
Avoid for: precise lip-sync to a specific MP3 — audio is regenerated each call, not locked.
**Seedance v2 Pro** — `bytedance/seedance-v2/pro`
ByteDance multi-modal flagship — up to 9 reference images, 3 reference videos, 3 reference audio tracks composed in one pass with cinematic motion / lens / lighting control.
Pick for: cinematic monologue with reference subject + reference audio + reference scene; ad creative.
Avoid for: simple "portrait + audio" jobs — overpowered, slower. Use **OmniHuman**.
**Wan 2-7 with `audio_url`** — `wan-ai/wan-2-7/text-to-video`
Open-weights with `audio_url` field — prompt describes the scene, audio file drives the mouth.
Pick for: full scene control (not just a portrait), specific voiceover MP3, open-weights pipeline.
Avoid for: simplest portrait-talks job — use **OmniHuman**.
**Wan 2-2 Animate** — `community/wan-2-2-animate/api`
Community-published variant on the Wan 2-2 base. Audio-driven full-body animation of stylized characters (illustration, anime, mascot).
Pick for: stylized / illustrated character + audio (not a photoreal portrait).
Avoid for: photoreal subjects — use **OmniHuman** or **Wan 2-7**.
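The routing described above can be sketched as a small shell function. This is a minimal sketch, not part of the RunComfy CLI: `pick_model` and its five arguments are hypothetical, and the tie-breaking order (no audio first, then cinematic, stylized, full scene, default) is an assumption rather than something the model list prescribes.

```shell
# Hypothetical helper: map answers about the request to a model id.
# Usage: pick_model <has_audio yes|no> <has_portrait yes|no> \
#                   <style photoreal|stylized> <cinematic yes|no> <full_scene yes|no>
# The body runs in a subshell "( ... )" so the variables stay local.
pick_model() (
  has_audio=$1; has_portrait=$2; style=$3; cinematic=$4; full_scene=$5
  if [ "$has_audio" = "no" ]; then
    # No audio file: HappyHorse generates speech in-pass from the prompt.
    if [ "$has_portrait" = "yes" ]; then
      echo "happyhorse/happyhorse-1-0/image-to-video"
    else
      echo "happyhorse/happyhorse-1-0/text-to-video"
    fi
  elif [ "$cinematic" = "yes" ]; then
    # Reference subject + reference audio composed as a cinematic shot.
    echo "bytedance/seedance-v2/pro"
  elif [ "$style" = "stylized" ]; then
    # Illustration, anime, or mascot driven by audio.
    echo "community/wan-2-2-animate/api"
  elif [ "$full_scene" = "yes" ]; then
    # Prompt-described scene with a specific voiceover file.
    echo "wan-ai/wan-2-7/text-to-video"
  else
    # Plain "portrait + audio" job: the default route.
    echo "bytedance/omnihuman/api"
  fi
)

pick_model yes yes photoreal no no   # prints bytedance/omnihuman/api
pick_model no  yes photoreal no no   # prints happyhorse/happyhorse-1-0/image-to-video
```

The returned id can be passed as the first argument to `runcomfy run`.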
## Route 1: OmniHuman — default audio-driven avatar
**Model**: `bytedance/omnihuman/api`
**Catalog**: [omnihuman](https://www.runcomfy.com/models/bytedance/omnihuman/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video) · [/feature/lip-sync](https://www.runcomfy.com/models/feature/lip-sync?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)
ByteDance OmniHuman is the strongest single-shot path: feed it **one portrait image + one audio file**, get back a video where the subject speaks / sings / gestures naturally to the audio. No prompt required beyond the inputs.
### Invoke
runcomfy run bytedance/omnihuman/api \
--input '{
"image_url": "https://your-cdn.example/presenter.jpg",
"audio_url": "https://your-cdn.example/voiceover.mp3"
}' \
--output-dir ./out
### Tips
- **Portrait framing works best.** Use head-and-shoulders or upper body. Full-body images also work, but they suit a more animated, presenter-style delivery.
- **Audio quality drives output quality** — clean voiceover (no music bed) → cleaner mouth sync. If your audio is a mix, isolate the voice stem first.
- **No prompt field** — the model derives everything from image + audio. Don't fight that.
- See the full input schema on the [model page](https://www.runcomfy.com/models/bytedance/omnihuman/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video).
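For the "multi-language clips from same portrait" use case, one way to reuse a single image across several voiceovers is to generate one invocation per audio URL. The sketch below only prints the commands (a dry run) so they can be reviewed before running. `omnihuman_cmds`, the `.mp3` naming, and the per-file output folders are assumptions for illustration; only `runcomfy run`, `--input`, and `--output-dir` come from the invocation shown above.

```shell
# Hypothetical helper: print one OmniHuman command per audio URL, reusing the same portrait.
# Usage: omnihuman_cmds <image_url> <audio_url>...
omnihuman_cmds() (
  image_url=$1
  shift
  for audio_url in "$@"; do
    # Name the output folder after the audio file, e.g. vo-en.mp3 -> ./out/vo-en
    name=$(basename "$audio_url" .mp3)
    printf "runcomfy run bytedance/omnihuman/api --input '{\"image_url\": \"%s\", \"audio_url\": \"%s\"}' --output-dir ./out/%s\n" \
      "$image_url" "$audio_url" "$name"
  done
)

omnihuman_cmds "https://your-cdn.example/presenter.jpg" \
  "https://your-cdn.example/vo-en.mp3" \
  "https://your-cdn.example/vo-es.mp3"
```

Piping the printed lines to `sh` would execute them. The sketch assumes the URLs contain no quotes or backslashes.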
## Route 2: Wan 2-7 with audio_url — open-weights lip-sync
**Model**: `wan-ai/wan-2-7/text-to-video`
**Catalog**: [wan-2-7](https://www.runcomfy.com/models/wan-ai/wan-2-7?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)
Pick this route when you want full control over the scene (not just a portrait) and have a specific audio track. Wan 2-7 accepts an `audio_url` field: the model generates the scene from the prompt and locks the subject's mouth to the audio.
### Invoke
runcomfy run wan-ai/wan-2-7/text-to-video \
--input '{
"prompt": "Studio portrait of a woman in her 30s, confident expression, soft window light, neutral gray background.",
"audio_url": "https://your-cdn.example/voiceover.mp3",
"duration": 8
}' \
--output-dir ./out
### Tips
- **The prompt describes the scene; the audio drives the mouth.** Don't put the spoken words in the prompt — the model isn't reading them, it's syncing to the waveform.
- **Match the audio's emotional tone** — "confident expression" / "warmly engaged" / "deadpan delivery" cues the face.
- **Camera language** — "static portrait", "slow push in" — works the same as a regular Wan 2-7 t2v call.
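The tips above suggest a prompt built from three parts: the scene, an expression cue matching the audio, and a camera cue, with the spoken words left out. A minimal sketch of that composition follows. `wan_prompt` is a hypothetical helper, and the JSON is assembled with `printf` on the assumption that none of the parts contain double quotes or backslashes.

```shell
# Hypothetical helper: join scene, expression cue, and camera cue into one prompt.
# The spoken words are deliberately not an argument, since the audio drives the mouth.
wan_prompt() (
  scene=$1; tone=$2; camera=$3
  printf '%s, %s, %s.' "$scene" "$tone" "$camera"
)

prompt=$(wan_prompt "Studio portrait of a woman in her 30s" "warmly engaged" "slow push in")

# Assumes the prompt has no double quotes or backslashes, so it can be embedded as-is.
input=$(printf '{"prompt": "%s", "audio_url": "%s", "duration": %s}' \
  "$prompt" "https://your-cdn.example/voiceover.mp3" 8)

echo "$input"
# To submit: runcomfy run wan-ai/wan-2-7/text-to-video --input "$input" --output-dir ./out
```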
## Route 3: Wan 2-2 Animate — full-body character animation
**Model**: `community/wan-2-2-animate/api`
**Catalog**: [wan-2-2-animate](https://www.runcomfy.com/models/community/wan-2-2-animate/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video) · [/feature/character-swap](https://www.runcomfy.com/models/feature/character-swap?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)
Pick this when the subject is a **stylized character** (illustration, anime, mascot) rather than a photoreal portrait, and you want full-body motion synchronized to audio. Community-published variant on the Wan 2-2 base.
### Invoke
runcomfy run community/wan-2-2-animate/api \
--input '{
"image_url": "https://your-cdn.example/character.png",
"audio_url": "https://your-cdn.example/voiceover.mp3"
}' \
--output-dir ./out
Schema details on the [model page](https://www.runcomfy.com/models/community/wan-2-2-animate/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video).
## Route 4: HappyHorse 1.0 — in-pass audio (no external file)
**Model**: `happyhorse/happyhorse-1-0/text-to-video` (t2v) or `happyhorse/happyhorse-1-0/image-to-video` (i2v)
**Catalog**: [happyhorse-1-0](https://www.runcomfy.com/models/happyhorse/happyhorse-1-0/text-to-video?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)
Pick HappyHorse when the user **doesn't have an audio file** — they want a talking-head video from a written script and HappyHorse generates speech in-pass. The mouth sync is derived from the generated audio, not from an input file.
### Invoke
**t2v with spoken script:**
runcomfy run happyhorse/happyhorse-1-0/text-to-video \
--input '{
"prompt": "A woman in her 30s, confident expression, looks at the camera and says clearly: \"Welcome to our product demo. Today we are going to show you three things.\" Soft daylight, neutral background.",
"duration": 6,
"aspect_ratio": "9:16",
"resolution": "1080p"
}' \
--output-dir ./out
**i2v from an existing portrait:**
runcomfy run happyhorse/happyhorse-1-0/image-to-video \
--input '{
"image_url": "https://your-cdn.example/portrait.jpg",
"prompt": "She looks at the camera and says clearly: \"Hi, I am Aria.\" Audio: friendly tone, neutral accent.",
"duration": 5
}' \
--output-dir ./out
### Tips
- **Quote the spoken line exactly** with `says clearly: "…"`. Without the literal quote the model paraphrases or skips speech.
- **Describe audio tone separately** — `"Audio: friendly tone, neutral accent."` — outside the spoken line.
- **Keep scripts short.** 1-2 sentences per clip; chain clips for longer narratives.
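Because the spoken line has to appear in literal double quotes inside a JSON string, escaping is easy to get wrong by hand. The sketch below builds the prompt in the `says clearly: "…"` plus `Audio: …` shape from the tips and escapes it before embedding. `json_escape` and `spoken_prompt` are hypothetical helpers, and the escaping covers only backslashes and double quotes in single-line strings.

```shell
# Escape backslashes and double quotes so a single-line string can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: action lead-in + literal spoken line + separate audio tone.
spoken_prompt() (
  lead=$1; line=$2; tone=$3
  printf '%s says clearly: "%s" Audio: %s.' "$lead" "$line" "$tone"
)

prompt=$(spoken_prompt "She looks at the camera and" "Hi, I am Aria." "friendly tone, neutral accent")

input=$(printf '{"image_url": "%s", "prompt": "%s", "duration": %s}' \
  "https://your-cdn.example/portrait.jpg" "$(json_escape "$prompt")" 5)

echo "$input"
# To submit: runcomfy run happyhorse/happyhorse-1-0/image-to-video --input "$input" --output-dir ./out
```

For a longer script, calling `spoken_prompt` once per sentence and running one clip per call follows the "chain clips" advice above.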
## Route 5: Seedance v2 Pro — multi-modal cinematic
**Model**: `bytedance/seedance-v2/pro`
**Catalog**: [seedance-v2 Pro](https://www.runcomfy.com/models/bytedance/seedance-v2/pro?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)
Pick Seedance v2 Pro when the avatar work is part of a **cinematic shot** — reference your subject from an image, your audio from a reference track, and have Seedance compose them with full motion + lens control.
### Invoke
runcomfy run bytedance/seedance-v2/pro \
--input '{
"prompt": "Anamorphic close-up — the subject delivers a confident monologue to camera, golden hour light through window, shallow DoF.",
"reference_images": ["https://your-cdn.example/subject.jpg"],
"reference_audio": ["https://your-cdn.example/voiceover.mp3"],
"duration": 10,
"aspect_ratio": "21:9"
}' \
--output-dir ./out
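The model list earlier in this skill quotes limits of up to 9 reference images, 3 reference videos, and 3 reference audio tracks for Seedance v2 Pro. A pre-flight check along these lines could catch an oversized request before it is submitted. `check_ref_counts` is a hypothetical helper, and whether the API itself rejects larger counts is not confirmed here.

```shell
# Hypothetical pre-flight check against the reference limits quoted for Seedance v2 Pro.
# Usage: check_ref_counts <n_images> <n_videos> <n_audio>
# Prints "ok" and returns 0 when within limits; prints a message to stderr and returns 1 otherwise.
check_ref_counts() (
  images=$1; videos=$2; audio=$3
  if [ "$images" -gt 9 ]; then
    echo "too many reference images: $images (max 9)" >&2
    exit 1   # exits only the function's subshell
  fi
  if [ "$videos" -gt 3 ]; then
    echo "too many reference videos: $videos (max 3)" >&2
    exit 1
  fi
  if [ "$audio" -gt 3 ]; then
    echo "too many reference audio tracks: $audio (max 3)" >&2
    exit 1
  fi
  echo "ok"
)

# One subject image, no reference video, one voiceover: matches the invocation above.
check_ref_counts 1 0 1   # prints ok
```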