# OmniVoice TTS Skill

Skill by ara.so — Daily 2026 Skills collection.

OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion-language-model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation, with a real-time factor (RTF) as low as 0.025.
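
To make the RTF figure concrete: RTF is conventionally synthesis time divided by the duration of the audio produced, so lower is faster. The sketch below is plain arithmetic, not part of the OmniVoice API; the function names are illustrative.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


def real_time_factor(synthesis_s: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_s / audio_seconds


# At the best-case RTF quoted above, 10 s of speech takes about a quarter of a second.
print(synthesis_seconds(10.0, 0.025))
```

The quoted 0.025 is a best case; the RTF you measure will depend on hardware, dtype, and `num_step`.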

## Installation

### Requirements

- Python 3.9+
- PyTorch 2.8+
- CUDA (recommended), Apple Silicon (MPS), or CPU

### pip (recommended)

    # Step 1: Install PyTorch for your platform
    # NVIDIA GPU (CUDA 12.8)
    pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

    # Apple Silicon
    pip install torch==2.8.0 torchaudio==2.8.0

    # Step 2: Install OmniVoice
    pip install omnivoice

    # Or from source (latest)
    pip install git+https://github.com/k2-fsa/OmniVoice.git

    # Or editable dev install
    git clone https://github.com/k2-fsa/OmniVoice.git
    cd OmniVoice
    pip install -e .

### uv

    git clone https://github.com/k2-fsa/OmniVoice.git
    cd OmniVoice
    uv sync

    # With mirror:
    uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"


### HuggingFace Mirror (if blocked)

    export HF_ENDPOINT="https://hf-mirror.com"


## Core Concepts

| Mode | What you provide | Use case |
|---|---|---|
| **Voice Cloning** | `ref_audio` + `ref_text` | Clone a speaker from a short audio clip |
| **Voice Design** | `instruct` string | Describe speaker attributes (no audio needed) |
| **Auto Voice** | nothing extra | Model picks a random voice |
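
One way to read the three modes is as a dispatch on which optional inputs are present. The helper below is a hypothetical illustration, not an OmniVoice function. This document does not describe what happens when `ref_audio` and `instruct` are combined, so the sketch rejects that case as a choice of its own rather than as library behavior.

```python
from typing import Optional


def infer_mode(ref_audio: Optional[str] = None, instruct: Optional[str] = None) -> str:
    """Name the generation mode implied by the inputs (illustrative helper)."""
    if ref_audio and instruct:
        # Assumption of this sketch: combining both is not documented here, so refuse.
        raise ValueError("pass either ref_audio or instruct, not both")
    if ref_audio:
        return "voice cloning"
    if instruct:
        return "voice design"
    return "auto voice"


print(infer_mode(ref_audio="ref.wav"))
print(infer_mode(instruct="female, low pitch"))
print(infer_mode())
```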

## Python API

### Load the Model

    from omnivoice import OmniVoice
    import torch
    import torchaudio

    # NVIDIA GPU
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16
    )

    # Apple Silicon
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="mps",
        dtype=torch.float16
    )

    # CPU (slower)
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cpu",
        dtype=torch.float32
    )
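
The three variants above differ only in device and dtype, so the choice can be made at runtime. The following is a minimal sketch: `pick_device` is a hypothetical helper that takes the availability flags as plain booleans so it can be exercised without a GPU.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> tuple[str, str]:
    """Return (device_map, dtype name) following the three setups listed above."""
    if cuda_available:
        return "cuda:0", "float16"
    if mps_available:
        return "mps", "float16"
    return "cpu", "float32"


# Typical wiring (needs torch and omnivoice installed, so left commented out):
#   import torch
#   from omnivoice import OmniVoice
#   device, dtype_name = pick_device(
#       torch.cuda.is_available(), torch.backends.mps.is_available()
#   )
#   model = OmniVoice.from_pretrained(
#       "k2-fsa/OmniVoice", device_map=device, dtype=getattr(torch, dtype_name)
#   )
print(pick_device(False, False))
```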


### Voice Cloning

    # With manual reference transcription (faster, more accurate)
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
        ref_text="Transcription of the reference audio.",
    )

    # Without ref_text: Whisper auto-transcribes ref_audio
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
    )

    # audio is a list of torch.Tensor, shape (1, T) at 24kHz
    torchaudio.save("out.wav", audio[0], 24000)


### Voice Design

    # Describe speaker via comma-separated attributes
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice design.",
        instruct="female, low pitch, british accent",
    )
    torchaudio.save("out.wav", audio[0], 24000)


**Supported attributes:**

- **Gender**: `male`, `female`

- **Age**: `child`, `young`, `middle-aged`, `elderly`

- **Pitch**: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`

- **Style**: `whisper`

- **English accents**: `american accent`, `british accent`, `australian accent`, etc.

- **Chinese dialects**: `四川话`, `陕西话`, etc.
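
The attribute list above can be turned into a small builder that catches typos before they reach the model. This is a hedged sketch: `build_instruct` is hypothetical, and the attribute order it emits simply mirrors the earlier example, since this document does not say whether order matters. Accents and dialects are passed through unchecked because the lists above end in "etc.".

```python
from typing import Optional

GENDERS = {"male", "female"}
AGES = {"child", "young", "middle-aged", "elderly"}
PITCHES = {"very low pitch", "low pitch", "high pitch", "very high pitch"}


def build_instruct(
    gender: Optional[str] = None,
    age: Optional[str] = None,
    pitch: Optional[str] = None,
    whisper: bool = False,
    accent: Optional[str] = None,
) -> str:
    """Join validated attributes into a comma-separated instruct string."""
    parts = []
    for label, value, allowed in (
        ("gender", gender, GENDERS),
        ("age", age, AGES),
        ("pitch", pitch, PITCHES),
    ):
        if value is None:
            continue
        if value not in allowed:
            raise ValueError(f"unsupported {label}: {value!r}")
        parts.append(value)
    if whisper:
        parts.append("whisper")
    if accent:
        parts.append(accent)  # open-ended list, so not validated here
    return ", ".join(parts)


print(build_instruct(gender="female", pitch="low pitch", accent="british accent"))
```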

### Auto Voice

    audio = model.generate(text="This is a sentence without any voice prompt.")
    torchaudio.save("out.wav", audio[0], 24000)


### Generation Parameters

    audio = model.generate(
        text="Hello world.",
        ref_audio="ref.wav",
        ref_text="Reference text.",
        num_step=32,   # diffusion steps; use 16 for faster (slightly lower quality)
        speed=1.2,     # speaking rate multiplier (>1 faster, <1 slower)
        duration=8.0,  # fix output duration in seconds (overrides speed)
    )
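
Because `duration` overrides `speed`, passing both is usually a sign of confusion. A hedged sketch of a pre-processing step that keeps only the parameter that will take effect; `timing_kwargs` is a hypothetical helper and the positivity checks are this sketch's own.

```python
from typing import Any, Optional


def timing_kwargs(
    speed: Optional[float] = None,
    duration: Optional[float] = None,
    num_step: int = 32,
) -> dict[str, Any]:
    """Build timing-related keyword arguments, dropping speed when duration is set."""
    if num_step <= 0:
        raise ValueError("num_step must be positive")
    kwargs: dict[str, Any] = {"num_step": num_step}
    if duration is not None:
        if duration <= 0:
            raise ValueError("duration must be positive")
        kwargs["duration"] = duration  # duration wins, so speed is not forwarded
    elif speed is not None:
        if speed <= 0:
            raise ValueError("speed must be positive")
        kwargs["speed"] = speed
    return kwargs


print(timing_kwargs(speed=1.2, duration=8.0))
# Intended use: model.generate(text="Hello world.", **timing_kwargs(speed=1.2))
```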


### Non-Verbal Symbols

    # Insert expressive non-verbal sounds inline
    audio = model.generate(
        text="[laughter] You really got me. I didn't see that coming at all."
    )


**Supported tags:**
`[laughter]`, `[sigh]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`,
`[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`,
`[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`
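
A misspelled tag is easy to miss, so it can help to check text against the list above before synthesis. This is a minimal sketch with a hypothetical `unknown_tags` helper. It only looks at lowercase bracketed tokens, so uppercase CMU pronunciation brackets are ignored; how the model itself treats an unknown tag is not covered here.

```python
import re

SUPPORTED_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en", "question-ah",
    "question-oh", "question-ei", "question-yi", "surprise-ah", "surprise-oh",
    "surprise-wa", "surprise-yo", "dissatisfaction-hnn",
}

# Lowercase words, optionally hyphenated, wrapped in square brackets.
_TAG = re.compile(r"\[([a-z]+(?:-[a-z]+)*)\]")


def unknown_tags(text: str) -> list[str]:
    """Return bracketed lowercase tags in `text` that are not in the supported list."""
    return [tag for tag in _TAG.findall(text) if tag not in SUPPORTED_TAGS]


print(unknown_tags("[laugh] You really got me. [sigh]"))
```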

### Pronunciation Control

    # Chinese: pinyin with tone numbers (inline, uppercase)
    audio = model.generate(
        text="这批货物打ZHE2出售后他严重SHE2本了，再也经不起ZHE1腾了。"
    )

    # English: CMU dict pronunciation in brackets (uppercase)
    audio = model.generate(
        text="You could probably still make [IH1 T] look good."
    )
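
The Chinese example can be produced mechanically from the plain sentence. The sketch below assumes the overridden character is the polyphone 折, which is not stated in this document, and assumes tone digits 1 to 5 are the valid range. `override_pinyin` is a hypothetical helper, not an OmniVoice function.

```python
import re

# Uppercase pinyin letters followed by one tone digit (1-5 is an assumption).
_PINYIN = re.compile(r"[A-Z]+[1-5]")


def override_pinyin(text: str, char: str, readings: list[str]) -> str:
    """Replace each occurrence of `char`, in order, with the matching pinyin token."""
    parts = text.split(char)
    if len(parts) - 1 != len(readings):
        raise ValueError("need exactly one reading per occurrence of char")
    result = parts[0]
    for reading, rest in zip(readings, parts[1:]):
        if not _PINYIN.fullmatch(reading):
            raise ValueError(f"not uppercase pinyin with a tone number: {reading!r}")
        result += reading + rest
    return result


plain = "这批货物打折出售后他严重折本了，再也经不起折腾了。"
print(override_pinyin(plain, "折", ["ZHE2", "SHE2", "ZHE1"]))
```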


## CLI Tools

### Web Demo

    omnivoice-demo --ip 0.0.0.0 --port 8001
    omnivoice-demo --help  # all options


### Single Inference

    # Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --ref_audio ref.wav \
        --ref_text "Transcription of the reference audio." \
        --output hello.wav

    # Voice Design
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --instruct "male, British accent" \
        --output hello.wav

    # Auto Voice
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --output hello.wav
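
When driving the CLI from a script, building the argument list in Python avoids shell-quoting problems with commas and spaces in `--text` or `--instruct`. The sketch below uses only the flags shown above; `infer_command` is a hypothetical wrapper, and actually running the command requires `omnivoice-infer` on your PATH.

```python
from typing import Optional


def infer_command(
    text: str,
    output: str,
    model: str = "k2-fsa/OmniVoice",
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    instruct: Optional[str] = None,
) -> list[str]:
    """Assemble an omnivoice-infer argument list from the flags shown above."""
    cmd = ["omnivoice-infer", "--model", model, "--text", text]
    if ref_audio:
        cmd += ["--ref_audio", ref_audio]
    if ref_text:
        cmd += ["--ref_text", ref_text]
    if instruct:
        cmd += ["--instruct", instruct]
    cmd += ["--output", output]
    return cmd


cmd = infer_command("This is a test for text to speech.", "hello.wav",
                    instruct="male, British accent")
print(cmd)
# To execute: import subprocess; subprocess.run(cmd, check=True)
```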


### Batch Inference (Multi-GPU)

    omnivoice-infer-batch \
        --model k2-fsa/OmniVoice \
        --test_list test.jsonl \
        --res_dir results/


**JSONL format** (`test.jsonl`):

    {"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
    {"id": "sample_003", "text": "Auto voice example"}
    {"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
    {"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
    {"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}


**JSONL field reference:**

| Field | Required | Description |
|---|---|---|
| `id` | ✅ | Unique identifier |
| `text` | ✅ | Text to synthesize |
| `ref_audio` | ❌ | Path to reference audio (voice cloning) |
| `ref_text` | ❌ | Transcript of ref audio |
| `instruct` | ❌ | Speaker attributes (voice design) |
| `language_id` | ❌ | Language code, e.g. `"en"` |
| `language_name` | ❌ | Language name, e.g. `"English"` |
| `duration` | ❌ | Fixed output duration in seconds |
| `speed` | ❌ | Speaking rate multiplier (ignored if duration set) |
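
The field reference above is enough to generate and sanity-check a `test.jsonl` programmatically. The following is a sketch under the assumption that fields outside the table should be treated as mistakes; `validate_entry` and `to_jsonl` are hypothetical helpers, and the batch tool's own validation may differ.

```python
import json

REQUIRED_FIELDS = ("id", "text")
OPTIONAL_FIELDS = (
    "ref_audio", "ref_text", "instruct",
    "language_id", "language_name", "duration", "speed",
)


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one batch entry (empty if it looks fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing required field: {field}")
    for field in entry:
        if field not in REQUIRED_FIELDS and field not in OPTIONAL_FIELDS:
            problems.append(f"unknown field: {field}")
    return problems


def to_jsonl(entries: list[dict]) -> str:
    """Serialize entries one JSON object per line, enforcing unique ids."""
    seen_ids = set()
    lines = []
    for entry in entries:
        problems = validate_entry(entry)
        if problems:
            raise ValueError("; ".join(problems))
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate id: {entry['id']}")
        seen_ids.add(entry["id"])
        # ensure_ascii=False keeps non-English text readable in the file
        lines.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(lines) + "\n"


print(to_jsonl([
    {"id": "sample_001", "text": "Hello world", "instruct": "female, british accent"},
    {"id": "sample_002", "text": "Speed controlled", "speed": 1.2},
]), end="")
```

Writing the returned string to `test.jsonl` gives a file in the format expected by `--test_list`.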

## Common Patterns

### Full Voice Cloning Pipeline

    from pathlib import Path

    import torch
    import torchaudio
    from omnivoice import OmniVoice


    def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str) -> None:
        # Load the model once and reuse it for every sentence
        model = OmniVoice.from_pretrained(
            "k2-fsa/OmniVoice",
            device_map="cuda:0",
            dtype=torch.float16,
        )
        out_dir = Path(output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)

        for i, text in enumerate(texts):
            audio = model.generate(
                text=text,
                ref_audio=ref_audio_path,
                # ref_text omitted: Whisper auto-transcribes
                num_step=32,
                speed=1.0,
            )
            out_path = out_dir / f"output_{i:04d}.wav"
            torchaudio.save(str(out_path), audio[0], 24000)
            print(f"Saved: {out_path}")


    clone_voice(
        ref_audio_path="speaker.wav",
        texts=["Hello world.", "Second sentence.", "Third sentence."],
        output_dir="outputs/",
    )


### Batch Processing from a List

    import torch
    import torchaudio
    from omnivoice import OmniVoice

    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

    items = [
        {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
        {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
        {"id": "s3", "text": "Auto voice."},
    ]

    for item in items:
        # Forward only the optional fields that are present on this item
        kwargs = {"text": item["text"]}
        for key in ("ref_audio", "ref_text", "instruct"):
            if key in item:
                kwargs[key] = item[key]

        audio = model.generate(**kwargs)
        torchaudio.save(f"{item['id']}.wav", audio[0], 24000)


### Voice Design Combinations

    # Assumes `model` and `torchaudio` are already set up as in "Load the Model"
    designs = [
        "male, elderly, low pitch",
        "female, child, high pitch",
        "male, whisper",
        "female, british accent, high pitch",
        "male, american accent, middle-aged",
    ]

    for design in designs:
        audio = model.generate(
            text="The quick brown fox jumps over the lazy dog.",
            instruct=design,
        )
        # "male, elderly, low pitch" -> "male_elderly_low-pitch"
        safe_name = design.replace(", ", "_").replace(" ", "-")
        torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)


### Fast Inference (Lower Diffusion Steps)

    # Default: num_step=32 (high quality)
    # Fast: num_step=16 (slightly lower quality, ~2x faster)
    audio = model.generate(
        text="Fast inference example.",
        ref_audio="ref.wav",
        num_step=16,
    )


## Output Format

- **Sample rate**: 24,000 Hz

- **Type**: `list[torch.Tensor]`, each tensor shape `(1, T)`

- **Save**: use `torchaudio.save(path, audio[0], 24000)`
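
Since each tensor has shape `(1, T)` at 24,000 Hz, the clip length follows directly from `T`. A small sketch of that arithmetic with illustrative helper names; with a real result you would pass `audio[0].shape[-1]` as the sample count.

```python
SAMPLE_RATE = 24_000  # Hz, as stated above


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length in seconds of a clip with `num_samples` samples per channel."""
    return num_samples / sample_rate


def samples_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples needed to hold `seconds` of audio."""
    return round(seconds * sample_rate)


print(duration_seconds(48_000))
print(samples_for(8.0))
```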

## Troubleshooting

### HuggingFace download fails

    export HF_ENDPOINT="https://hf-mirror.com"


### CUDA out of memory

    # Use float16 (not float32)
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

    # Or reduce batch size / text length in batch inference


### Whisper ASR not available for ref_text auto-transcription

    pip install openai-whisper


### Wrong pronunciation in Chinese

Use inline pinyin with tone numbers directly in the text string:

    # Format: uppercase pinyin immediately followed by its tone number, e.g. ZHE2
    text = "这批货物打ZHE2出售"


### Audio quality issues

- Increase `num_step` to 32 or 64

- Provide `ref_text` manually instead of relying on auto-transcription

- Use a clean, noise-free reference audio clip (3–15 seconds recommended)
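
The 3–15 second recommendation can be checked before cloning. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the helper names are illustrative. It says nothing about noise or clipping, which still need a listen.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(path: str, min_s: float = 3.0, max_s: float = 15.0) -> bool:
    """True if the clip falls inside the recommended 3-15 second window."""
    return min_s <= wav_duration_seconds(path) <= max_s


# Example (assumes ref.wav exists):
#   if not is_recommended_length("ref.wav"):
#       print("Reference clip is outside the recommended 3-15 s range")
```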

### Apple Silicon (MPS) issues

    # Use mps device explicitly
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)
