SKILL.md
Available Models
| Model | App ID | Best For |
|-------|--------|----------|
| Inworld TTS-2 | inworld/text-to-speech-2 | 100+ languages, emotion steering with [brackets], delivery modes |
| Inworld TTS 1.5 Max | inworld/text-to-speech-1-5-max | Low latency (<200ms), 15 languages |
| Inworld TTS 1.5 Mini | inworld/text-to-speech-1-5-mini | Ultra-low latency (~120ms), 15 languages |
| ElevenLabs TTS | elevenlabs/tts | Premium quality, 22+ voices, 32 languages |
| DIA TTS | infsh/dia-tts | Conversational, expressive |
| Kokoro TTS | infsh/kokoro-tts | Fast, natural |
| Chatterbox | infsh/chatterbox | General purpose |
| Higgs Audio | infsh/higgs-audio | Emotional control |
| VibeVoice | infsh/vibevoice | Podcasts, long-form |
Browse All Audio Apps
belt app store --category audio
Examples
Basic Text-to-Speech
belt app run infsh/kokoro-tts --input '{"text": "Welcome to our tutorial."}'
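Inline JSON inside single quotes breaks as soon as the text contains an apostrophe, and an unescaped double quote makes the JSON invalid. A minimal sketch of one way around that, assuming a POSIX shell with sed. `json_escape` and `tts_input` are hypothetical helper names, not part of belt, and the escaping covers only backslashes and double quotes (not newlines or control characters):

```shell
# Hypothetical helpers (not part of the belt CLI).
# Escape backslashes and double quotes so the text can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the {"text": ...} input object used in the basic example above.
tts_input() {
  printf '{"text": "%s"}' "$(json_escape "$1")"
}

# Usage (requires belt to be installed):
#   belt app run infsh/kokoro-tts --input "$(tts_input "She said \"hi\", didn't she?")"
```

For multi-line scripts or text with control characters, writing the input to a file is likely the safer route.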
Inworld TTS-2 — Emotion Steering
Inworld TTS-2 supports natural-language steering with [brackets]: tags placed inline with the text control emotion, volume, speed, and non-verbal sounds such as pauses and laughter:
belt app run inworld/text-to-speech-2 --input '{
"text": "I have some [exciting] news to share with you! [pause] We just hit one million users. [laugh]",
"voice_id": "Sarah",
"delivery_mode": "CREATIVE"
}'
Delivery modes: STABLE (consistent), BALANCED (natural, default), CREATIVE (expressive).
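When the mode comes from user input or a config file, a small client-side check can catch typos before the request is sent. This is a sketch only: `delivery_mode` is a hypothetical helper, and whether the app itself accepts lower-case values is not stated here, so the helper simply normalizes to the upper-case names listed above.

```shell
# Hypothetical helper: normalize and validate a delivery mode.
# Prints the upper-cased mode, or BALANCED (the documented default) when no argument is given.
delivery_mode() {
  mode=$(printf '%s' "${1:-BALANCED}" | tr '[:lower:]' '[:upper:]')
  case "$mode" in
    STABLE|BALANCED|CREATIVE) printf '%s\n' "$mode" ;;
    *) printf 'unknown delivery mode: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Usage:
#   mode=$(delivery_mode creative) || exit 1
```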
Built-in voices (271+ across 15 languages): Sarah, Alex, Ashley, Dennis, Hana, Blake, Luna, Clive, and many more. Browse all voices in the Inworld TTS Playground or list programmatically via GET https://api.inworld.ai/voices/v1/voices.
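A sketch of the programmatic listing with curl. The endpoint is the one named above; the `Authorization: Basic` header and the `INWORLD_API_KEY` environment variable are assumptions about how the API is authenticated, so check the Inworld documentation before relying on them. `list_inworld_voices` is a hypothetical helper name.

```shell
# Sketch: fetch the voice list over HTTP.
# ASSUMPTION: the API expects a Basic auth header carrying a key from INWORLD_API_KEY.
list_inworld_voices() {
  if [ -z "${INWORLD_API_KEY:-}" ]; then
    echo "INWORLD_API_KEY is not set" >&2
    return 1
  fi
  # --fail makes curl return non-zero on HTTP errors instead of printing the error body.
  curl -sS --fail \
    -H "Authorization: Basic ${INWORLD_API_KEY}" \
    "https://api.inworld.ai/voices/v1/voices"
}

# Usage:
#   list_inworld_voices > voices.json
```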
Inworld TTS 1.5 — Low Latency
For real-time applications where speed matters:
# Max quality at low latency (<200ms)
belt app run inworld/text-to-speech-1-5-max --input '{
"text": "Welcome back! How can I help you today?",
"voice_id": "Ashley"
}'
# Ultra-low latency (~120ms) for conversational AI
belt app run inworld/text-to-speech-1-5-mini --input '{
"text": "Sure, let me look that up for you.",
"voice_id": "Dennis"
}'
Conversational TTS with DIA
belt app sample infsh/dia-tts --save input.json
# Edit input.json:
# {
# "text": "Hey! How are you doing today? I'm really excited to share this with you.",
# "voice": "conversational"
# }
belt app run infsh/dia-tts --input input.json
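The manual edit step can also be scripted. A sketch, assuming the input schema is exactly the two fields shown in the comment above (the file produced by `belt app sample` is the authority on the real schema). The quoted heredoc delimiter keeps the shell from touching the apostrophe or any `$` in the text:

```shell
# Write input.json without opening an editor.
# 'EOF' is quoted so the shell does not expand anything inside the heredoc.
cat > input.json <<'EOF'
{
  "text": "Hey! How are you doing today? I'm really excited to share this with you.",
  "voice": "conversational"
}
EOF

# Run the app only if the belt CLI is available on this machine.
if command -v belt >/dev/null 2>&1; then
  belt app run infsh/dia-tts --input input.json
fi
```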
Long-form Audio (Podcasts)
belt app sample infsh/vibevoice --save input.json
# Edit input.json with your podcast script
belt app run infsh/vibevoice --input input.json
Expressive Speech with Higgs
belt app sample infsh/higgs-audio --save input.json
# Edit input.json:
# {
# "text": "This is absolutely incredible!",
# "emotion": "excited"
# }
belt app run infsh/higgs-audio --input input.json
Use Cases
- Voiceovers: Product demos, explainer videos
- Audiobooks: Convert text to spoken word
- Podcasts: Generate long-form episodes from a script (VibeVoice)
- Accessibility: Provide spoken versions of written content
- IVR: Phone system voice prompts
- Video Narration: Add narration to videos
- Gaming / NPCs: Character voices with emotion steering (Inworld TTS-2)
- Conversational AI: Ultra-low latency responses (Inworld TTS 1.5 Mini)
- Avatar / UGC Videos: Generate speech for talking head avatars
Combine with Video
The easiest way to create a talking-head video is P-Video-Avatar, which has built-in TTS, so no separate audio step is needed:
belt app run pruna/p-video-avatar --input '{
"image": "https://portrait.jpg",
"voice_script": "Your script here",
"voice": "Zephyr (Female)"
}'
For models without built-in TTS (OmniHuman, PixVerse), generate speech first:
# 1. Generate speech with Inworld TTS-2 (emotion steering)
belt app run inworld/text-to-speech-2 --input '{
"text": "[friendly] Your script here. [excited] This is the best part!",
"voice_id": "Sarah",
"delivery_mode": "CREATIVE"
}' > speech.json
# 2. Use the audio URL with OmniHuman for avatar video
belt app run bytedance/omnihuman-1-5 --input '{
"image_url": "https://portrait.jpg",
"audio_url": "<audio-url-from-step-1>"
}'
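Step 2 needs the audio URL from speech.json. The output schema is not shown here, so this sketch avoids assuming a field name and simply pulls the first http(s) URL found in the file. `first_url` is a hypothetical helper; if the output contains several URLs, inspect speech.json and select the right field explicitly (for example with jq).

```shell
# Hypothetical helper: print the first http(s) URL found in a file.
# Assumes the URL sits inside a double-quoted JSON string with unescaped slashes.
first_url() {
  grep -oE 'https?://[^"[:space:]]+' "$1" | head -n 1
}

# Usage after step 1 (requires belt to be installed):
#   audio_url=$(first_url speech.json)
#   belt app run bytedance/omnihuman-1-5 --input "{\"image_url\": \"https://portrait.jpg\", \"audio_url\": \"$audio_url\"}"
```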
Related Skills
# ElevenLabs TTS (premium, 22+ voices)
npx skills add inference-sh/skills@elevenlabs-tts
# ElevenLabs dialogue (multi-speaker)
npx skills add inference-sh/skills@elevenlabs-dialogue
# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli
# AI avatars (combine TTS with talking heads)
npx skills add inference-sh/skills@ai-avatar-video
# AI music generation
npx skills add inference-sh/skills@ai-music-generation
# Speech-to-text (transcription)
npx skills add inference-sh/skills@speech-to-text
# Video generation
npx skills add inference-sh/skills@ai-video-generation
Browse all apps: belt app store
Documentation
- Running Apps - How to run apps via CLI
- Audio Transcription Example - Audio processing workflows
- Apps Overview - Understanding the app ecosystem