SKILL.md

3. Generate an avatar video

Name: ai-avatar-video
Author: agentspace-so

    runcomfy run // \
      --input '{"prompt": "...", "audio_url": "https://...", "image_url": "https://..."}' \
      --output-dir ./out

CLI deep dive: [`runcomfy-cli`](https://www.skills.sh/agentspace-so/runcomfy-agent-skills/runcomfy-cli) skill.

## Install this skill

    npx skills add agentspace-so/runcomfy-agent-skills --skill ai-avatar-video -g


## Pick the right model for the user's intent

Listed newest first. The agent classifies user intent — pre-recorded audio file or just a script? Photoreal portrait or stylized character? Single shot or cinematic composition? — and picks one route below.

**OmniHuman** — `bytedance/omnihuman/api` (default)

ByteDance audio-driven full-body avatar. Feed one portrait + one audio file, get back a video where the subject speaks / sings / gestures naturally. Listed on RunComfy's `/feature/lip-sync` as the curated default.
Pick for: UGC voiceover, virtual presenter, dubbed product demo, multi-language clips from same portrait.
Avoid for: jobs with no audio file, where speech has to be generated from a script. Use **HappyHorse 1.0** instead.

**HappyHorse 1.0** — `happyhorse/happyhorse-1-0/text-to-video` (t2v) · `happyhorse/happyhorse-1-0/image-to-video` (i2v)

Arena #1 t2v / i2v with in-pass audio generated from prompt. No external audio file required — quote the spoken line inside the prompt.
Pick for: written script with no audio file, "write a script → get a video", concept clips, i2v talking-head from an existing portrait.
Avoid for: precise lip-sync to a specific MP3 — audio is regenerated each call, not locked.

**Seedance v2 Pro** — `bytedance/seedance-v2/pro`

ByteDance multi-modal flagship — up to 9 reference images, 3 reference videos, 3 reference audio tracks composed in one pass with cinematic motion / lens / lighting control.
Pick for: cinematic monologue with reference subject + reference audio + reference scene; ad creative.
Avoid for: simple "portrait + audio" jobs — overpowered, slower. Use **OmniHuman**.

**Wan 2-7 with `audio_url`** — `wan-ai/wan-2-7/text-to-video`

Open-weights model with an `audio_url` field: the prompt describes the scene and the audio file drives the mouth.
Pick for: full scene control (not just a portrait), specific voiceover MP3, open-weights pipeline.
Avoid for: simplest portrait-talks job — use **OmniHuman**.

**Wan 2-2 Animate** — `community/wan-2-2-animate/api`

Community-published variant on the Wan 2-2 base. Audio-driven full-body animation of stylized characters (illustration, anime, mascot).
Pick for: stylized / illustrated character + audio (not a photoreal portrait).
Avoid for: photoreal subjects — use **OmniHuman** or **Wan 2-7**.
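The routing rules above can be sketched as a small shell function. This is a minimal sketch, not part of the skill: `pick_model`, its four arguments, and the order in which the checks run are assumptions chosen to match the "Pick for" / "Avoid for" notes. Only the model ids come from this document.

```shell
#!/bin/sh
# pick_model HAS_AUDIO HAS_IMAGE STYLE SCOPE
#   HAS_AUDIO: yes|no            HAS_IMAGE: yes|no
#   STYLE: photoreal|stylized    SCOPE: portrait|scene|cinematic
# Hypothetical helper: argument names and precedence are assumptions.
pick_model() {
  has_audio=$1
  has_image=$2
  style=$3
  scope=$4
  if [ "$has_audio" = "no" ]; then
    # Script only: HappyHorse generates the speech in-pass.
    if [ "$has_image" = "yes" ]; then
      echo "happyhorse/happyhorse-1-0/image-to-video"
    else
      echo "happyhorse/happyhorse-1-0/text-to-video"
    fi
  elif [ "$scope" = "cinematic" ]; then
    echo "bytedance/seedance-v2/pro"
  elif [ "$style" = "stylized" ]; then
    echo "community/wan-2-2-animate/api"
  elif [ "$scope" = "scene" ]; then
    echo "wan-ai/wan-2-7/text-to-video"
  else
    # Photoreal portrait plus an audio file: the curated default.
    echo "bytedance/omnihuman/api"
  fi
}

pick_model yes yes photoreal portrait   # prints bytedance/omnihuman/api
pick_model no no photoreal portrait     # prints happyhorse/happyhorse-1-0/text-to-video
```

Checking "no audio file" first reflects that every other route here expects an audio input; an agent may reasonably weigh the signals differently.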

## Route 1: OmniHuman — default audio-driven avatar

**Model**: `bytedance/omnihuman/api`
**Catalog**: [omnihuman](https://www.runcomfy.com/models/bytedance/omnihuman/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video) · [/feature/lip-sync](https://www.runcomfy.com/models/feature/lip-sync?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)

ByteDance OmniHuman is the strongest single-shot path: feed it **one portrait image + one audio file** and get back a video where the subject speaks, sings, or gestures naturally to the audio. The request takes no text prompt, only the two inputs.

### Invoke

    runcomfy run bytedance/omnihuman/api \
      --input '{
        "image_url": "https://your-cdn.example/presenter.jpg",
        "audio_url": "https://your-cdn.example/voiceover.mp3"
      }' \
      --output-dir ./out


### Tips

- **Portrait framing works best** — head-and-shoulders or upper body. Full-body still works but expects more "presenter" energy.

- **Audio quality drives output quality** — clean voiceover (no music bed) → cleaner mouth sync. If your audio is a mix, isolate the voice stem first.

- **No prompt field** — the model derives everything from image + audio. Don't fight that.

- See the full input schema on the [model page](https://www.runcomfy.com/models/bytedance/omnihuman/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video).
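Because OmniHuman takes only an image and an audio URL, the `--input` JSON can be assembled by a small helper. The following is a minimal sketch: `omnihuman_input` is a hypothetical function, and the `https://` check is an assumption based on the example URLs in this document rather than a documented requirement.

```shell
#!/bin/sh
# omnihuman_input IMAGE_URL AUDIO_URL -> prints the JSON for --input
# Hypothetical helper; emits only the two fields shown in this route.
omnihuman_input() {
  for url in "$1" "$2"; do
    case "$url" in
      https://*) ;;
      *) echo "expected an https:// URL, got: $url" >&2; return 1 ;;
    esac
    case "$url" in
      # Refuse characters that would need JSON escaping.
      *\"*|*\\*) echo "URL needs JSON escaping: $url" >&2; return 1 ;;
    esac
  done
  printf '{"image_url": "%s", "audio_url": "%s"}\n' "$1" "$2"
}

omnihuman_input "https://your-cdn.example/presenter.jpg" \
                "https://your-cdn.example/voiceover.mp3"

# Possible use with the CLI call shown above (not executed here):
#   runcomfy run bytedance/omnihuman/api \
#     --input "$(omnihuman_input "$IMAGE_URL" "$AUDIO_URL")" \
#     --output-dir ./out
```

Leaving the prompt out of the helper mirrors the tip that this model has no prompt field.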

## Route 2: Wan 2-7 with audio_url — open-weights lip-sync

**Model**: `wan-ai/wan-2-7/text-to-video`
**Catalog**: [wan-2-7](https://www.runcomfy.com/models/wan-ai/wan-2-7?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)

Use this route when you want full control over the scene (not just a portrait) and have a specific audio track. Wan 2-7 accepts an `audio_url` field: the model generates the scene from the prompt and locks the subject's mouth to the audio.

### Invoke

    runcomfy run wan-ai/wan-2-7/text-to-video \
      --input '{
        "prompt": "Studio portrait of a woman in her 30s, confident expression, soft window light, neutral gray background.",
        "audio_url": "https://your-cdn.example/voiceover.mp3",
        "duration": 8
      }' \
      --output-dir ./out


### Tips

- **The prompt describes the scene; the audio drives the mouth.** Don't put the spoken words in the prompt — the model isn't reading them, it's syncing to the waveform.

- **Match the audio's emotional tone** — "confident expression" / "warmly engaged" / "deadpan delivery" cues the face.

- **Camera language** — "static portrait", "slow push in" — works the same as a regular Wan 2-7 t2v call.
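The first tip above can be enforced with a rough pre-flight check on the prompt. This is only a heuristic sketch: `lint_wan_prompt` is a hypothetical helper, and treating any double quote or "says" phrasing as dialogue is an assumption that will produce false positives on some scene descriptions.

```shell
#!/bin/sh
# lint_wan_prompt PROMPT -> returns 1 if the prompt seems to contain dialogue
# Hypothetical heuristic: flags double quotes and "says" phrasing.
lint_wan_prompt() {
  case "$1" in
    *\"*|*"says:"*|*"says clearly"*)
      echo "prompt looks like it contains spoken words; keep them in the audio file" >&2
      return 1 ;;
  esac
  return 0
}

# A scene-only prompt passes the check.
lint_wan_prompt "Studio portrait, confident expression, slow push in" && echo "ok"
```

A prompt that fails the check is not necessarily wrong; the check only flags it for a second look before the call is made.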

## Route 3: Wan 2-2 Animate — full-body character animation

**Model**: `community/wan-2-2-animate/api`
**Catalog**: [wan-2-2-animate](https://www.runcomfy.com/models/community/wan-2-2-animate/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video) · [/feature/character-swap](https://www.runcomfy.com/models/feature/character-swap?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)

Pick this when the subject is a **stylized character** (illustration, anime, mascot) rather than a photoreal portrait, and you want full-body motion synchronized to audio. Community-published variant on the Wan 2-2 base.

### Invoke

    runcomfy run community/wan-2-2-animate/api \
      --input '{
        "image_url": "https://your-cdn.example/character.png",
        "audio_url": "https://your-cdn.example/voiceover.mp3"
      }' \
      --output-dir ./out


Schema details on the [model page](https://www.runcomfy.com/models/community/wan-2-2-animate/api?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video).

## Route 4: HappyHorse 1.0 — in-pass audio (no external file)

**Model**: `happyhorse/happyhorse-1-0/text-to-video` (t2v) or `happyhorse/happyhorse-1-0/image-to-video` (i2v)
**Catalog**: [happyhorse-1-0](https://www.runcomfy.com/models/happyhorse/happyhorse-1-0/text-to-video?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)

Pick HappyHorse when the user **doesn't have an audio file** — they want a talking-head video from a written script and HappyHorse generates speech in-pass. The mouth sync is derived from the generated audio, not from an input file.

### Invoke

**t2v with spoken script:**

    runcomfy run happyhorse/happyhorse-1-0/text-to-video \
      --input '{
        "prompt": "A woman in her 30s, confident expression, looks at the camera and says clearly: \"Welcome to our product demo. Today we are going to show you three things.\" Soft daylight, neutral background.",
        "duration": 6,
        "aspect_ratio": "9:16",
        "resolution": "1080p"
      }' \
      --output-dir ./out


**i2v from an existing portrait:**

    runcomfy run happyhorse/happyhorse-1-0/image-to-video \
      --input '{
        "image_url": "https://your-cdn.example/portrait.jpg",
        "prompt": "She looks at the camera and says clearly: \"Hi, I am Aria.\" Audio: friendly tone, neutral accent.",
        "duration": 5
      }' \
      --output-dir ./out


### Tips

- **Quote the spoken line exactly** with `says clearly: "…"`. Without the literal quote the model paraphrases or skips speech.

- **Describe audio tone separately** — `"Audio: friendly tone, neutral accent."` — outside the spoken line.

- **Keep scripts short.** 1-2 sentences per clip; chain clips for longer narratives.
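The quoting pattern from the tips above is easy to get wrong once the prompt is embedded in JSON, so a sketch of one way to build it follows. `happyhorse_prompt` and `json_escape` are hypothetical helpers, and the escaping covers only backslashes and double quotes (an assumption that the script contains no control characters or newlines).

```shell
#!/bin/sh
# happyhorse_prompt SCENE LINE TONE -> prints the prompt text (not JSON-escaped)
# Hypothetical helper: quotes the spoken line and keeps the tone outside it.
happyhorse_prompt() {
  printf '%s says clearly: "%s" Audio: %s.\n' "$1" "$2" "$3"
}

# json_escape TEXT -> escapes backslashes and double quotes for a JSON string
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

prompt=$(happyhorse_prompt "She looks at the camera and" \
                           "Hi, I am Aria." \
                           "friendly tone, neutral accent")

# Prints a JSON object usable as the --input value; duration 5 matches the i2v example.
printf '{"prompt": "%s", "duration": %s}\n' "$(json_escape "$prompt")" 5
```

Keeping the spoken line as its own argument makes it harder to forget the literal quotes that the first tip calls for.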

## Route 5: Seedance v2 Pro — multi-modal cinematic

**Model**: `bytedance/seedance-v2/pro`
**Catalog**: [seedance-v2 Pro](https://www.runcomfy.com/models/bytedance/seedance-v2/pro?utm_source=skills.sh&utm_medium=skill&utm_campaign=ai-avatar-video)

Pick Seedance v2 Pro when the avatar work is part of a **cinematic shot** — reference your subject from an image, your audio from a reference track, and have Seedance compose them with full motion + lens control.

### Invoke

    runcomfy run bytedance/seedance-v2/pro \
      --input '{
        "prompt": "Anamorphic close-up, the subject delivers a confident monologue to camera, golden hour light through window, shallow DoF.",
        "reference_images": ["https://your-cdn.example/subject.jpg"],
        "reference_audio": ["https://your-cdn.example/voiceover.mp3"],
        "duration": 10,
        "aspect_ratio": "21:9"
      }' \
      --output-dir ./out
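Seedance v2 Pro is described earlier as accepting up to 9 reference images, 3 reference videos, and 3 reference audio tracks. A minimal sketch of a count check before submitting a job is shown below; `check_seedance_refs` is a hypothetical helper, and it validates only the counts, not the JSON field names or the files themselves.

```shell
#!/bin/sh
# check_seedance_refs N_IMAGES N_VIDEOS N_AUDIO
# Hypothetical helper: returns 1 if any count exceeds the limits quoted in this document.
check_seedance_refs() {
  [ "$1" -le 9 ] || { echo "too many reference images: $1 (max 9)" >&2; return 1; }
  [ "$2" -le 3 ] || { echo "too many reference videos: $2 (max 3)" >&2; return 1; }
  [ "$3" -le 3 ] || { echo "too many reference audio tracks: $3 (max 3)" >&2; return 1; }
}

# The invocation above uses one image, no videos, and one audio track.
check_seedance_refs 1 0 1 && echo "reference counts ok"
```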
