SKILL.md
Install the belt CLI skill: npx skills add belt-sh/cli
AI Avatar & Talking Head Videos
Create AI avatars and talking head videos via inference.sh CLI.
Quick Start
Requires inference.sh CLI (belt). Install instructions
belt login
Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS)
belt app run pruna/p-video-avatar --input '{
"image": "https://portrait.jpg",
"voice_script": "Hello, welcome to our product demo!",
"voice": "Zephyr (Female)"
}'
## Available Models
**Start with P-Video-Avatar** — it's 18x faster and 6x cheaper than alternatives, with built-in TTS, dynamic backgrounds, and 1080p support.
| Model | App ID | Best For | Built-in TTS |
|-------|--------|----------|-------------|
| **P-Video-Avatar** | `pruna/p-video-avatar` | **Best overall: speed, cost, quality, control** | **Yes (30 voices, 10 languages)** |
| OmniHuman 1.5 | `bytedance/omnihuman-1-5` | Multi-character, audio-driven | No |
| Fabric 1.0 | `falai/fabric-1-0` | Image talks with lipsync | Yes |
| PixVerse Lipsync | `falai/pixverse-lipsync` | Highly realistic lipsync | No |
### Cost & Speed Comparison
| Model | Speed (per sec of video) | Cost per second |
|-------|-------------------------|----------------|
| **P-Video-Avatar** | **~1.83s/s** | **$0.025** |
| OmniHuman 1.5 | ~28s/s (15x slower) | $0.16 (6.4x more) |
| Fabric 1.0 | ~34s/s (18x slower) | $0.14 (5.6x more) |
## Examples
### P-Video-Avatar (Recommended)
Generate avatar from portrait + text script with built-in TTS:
belt app run pruna/p-video-avatar --input '{
"image": "https://portrait.jpg",
"voice_script": "Welcome to our product walkthrough. Today I will show you three key features.",
"voice": "Puck (Male)",
"voice_language": "English (US)",
"resolution": "720p"
}'
With custom style control:
belt app run pruna/p-video-avatar --input '{
"image": "https://portrait.jpg",
"voice_script": "This is exciting news!",
"voice": "Aoede (Female)",
"voice_prompt": "Enthusiastic and energetic tone",
"video_prompt": "The person is presenting on stage with dramatic lighting",
"resolution": "1080p"
}'
With audio file instead of TTS:
belt app run pruna/p-video-avatar --input '{
"image": "https://portrait.jpg",
"audio": "https://speech.mp3"
}'
### Full Workflow: Generate Portrait + Avatar
Use Pruna P-Image to generate the portrait, then create the avatar:
1. Generate a portrait image
belt app run pruna/p-image --input '{
"prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
"aspect_ratio": "9:16"
}'
2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
"image": "<image-url-from-step-1>",
"voice_script": "Hi there! Let me walk you through our latest features.",
"voice": "Zephyr (Female)"
}'
### OmniHuman 1.5 (Multi-Character)
belt app run bytedance/omnihuman-1-5 --input '{
"image_url": "https://portrait.jpg",
"audio_url": "https://speech.mp3"
}'
Supports specifying which character to drive in multi-person images.
### Fabric 1.0 (Image Talks)
belt app run falai/fabric-1-0 --input '{
"image_url": "https://face.jpg",
"audio_url": "https://audio.mp3"
}'
### PixVerse Lipsync
belt app run falai/pixverse-lipsync --input '{
"image_url": "https://portrait.jpg",
"audio_url": "https://speech.mp3"
}'
## Full Workflow: TTS + Avatar (Non-TTS Models)
For models without built-in TTS (OmniHuman, PixVerse), generate speech first:
1. Generate speech — Inworld TTS-2 for expressive character voices
belt app run inworld/text-to-speech-2 --input '{
"text": "[friendly] Welcome to our product demo! [excited] Let me show you three features that will change how you work.",
"voice_id": "Sarah",
"delivery_mode": "CREATIVE"
}' > speech.json
2. Create avatar video with the speech
belt app run bytedance/omnihuman-1-5 --input '{
"image_url": "https://presenter-photo.jpg",
"audio_url": "<audio-url-from-step-1>"
}'
**Tip**: For most use cases, P-Video-Avatar with built-in TTS is simpler — no separate audio step needed. Use this workflow only when you specifically need OmniHuman (multi-character) or PixVerse (realistic lipsync).
## Full Workflow: Dub Video in Another Language
1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}' > transcript.json
2. Translate text (manually or with an LLM)
3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}' > new_speech.json
4. Lipsync the original video with new audio
belt app run infsh/latentsync-1-6 --input '{
"video_url": "https://original-video.mp4",
"audio_url": "<new-audio-url>"
}'
## Avatar UGC Generation
Create UGC-style content with P-Video-Avatar — built-in TTS, no separate audio step needed:
1. Generate a relatable UGC-style portrait
belt app run pruna/p-image --input '{
"prompt": "casual selfie-style photo of a young woman in a cozy room, natural lighting, looking at camera, warm smile, authentic feel",
"aspect_ratio": "9:16"
}'
2. Create UGC avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
"image": "<image-url-from-step-1>",
"voice_script": "Okay so I just tried this product and honestly? It is a game changer. I was not expecting to love it this much but here we are!",
"voice": "Zephyr (Female)",
"voice_prompt": "Excited, casual, authentic tone like talking to a friend",
"video_prompt": "The person is talking casually to camera in their room, natural gestures",
"resolution": "1080p"
}'
### Why P-Video-Avatar for UGC
- **All-in-one** — built-in TTS means no separate audio generation step
- **30 voices, 10 languages** — match your target audience
- **Voice + video prompts** — control tone, emotion, body language, and background independently
- **18x faster, 6x cheaper** — produce UGC at scale vs. Fabric/OmniHuman/HeyGen
- **1080p support** — platform-ready vertical video from a single portrait image
### Batch UGC: Same Product, Multiple Presenters
Generate 3 different presenters
for voice in "Zephyr (Female)" "Puck (Male)" "Aoede (Female)"; do
belt app run pruna/p-video-avatar --input "{
\"image\": \"https://portrait.jpg\",
\"voice_script\": \"This changed my morning routine completely. Five minutes and I am done.\",
\"voice\": \"$voice\",
\"voice_prompt\": \"Casual, authentic, like a real testimonial\",
\"video_prompt\": \"Person talking to camera in a bright kitchen\",
\"resolution\": \"1080p\"
}"
done
## Use Cases
- **UGC & Marketing**: Product demos, UGC-style ads with AI presenters
- **Education**: Course videos, explainers
- **Localization**: Dub content across 10 languages from one image
- **Social Media**: Consistent virtual influencer content
- **Corporate**: Training videos, announcements
- **Gaming**: Character avatars, NPC dialogue
## Tips
- Use high-quality portrait photos (front-facing, good lighting)
- Audio should be clear with minimal background noise
- P-Video-Avatar supports built-in TTS — no need for a separate speech generation step
- P-Video-Avatar output aspect ratio matches the input image
- Generate portraits with `pruna/p-image` using `9:16` aspect ratio for vertical videos
- OmniHuman 1.5 supports multiple people in one image
- LatentSync is best for syncing existing videos to new audio
## Related Skills
Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar
Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli
Text-to-speech (generate audio for non-TTS avatar models)
npx skills add inference-sh/skills@text-to-speech
Speech-to-text (transcribe for dubbing)
npx skills add inference-sh/skills@speech-to-text
Video generation
npx skills add inference-sh/skills@ai-video-generation
Image generation (create avatar images)
npx skills add inference-sh/skills@ai-image-generation