tts

Text-to-speech with dual backends, voice cloning, and timeline-accurate audio synthesis for dubbing and video narration. Supports two backends: Kokoro (local, offline) for simple speech synthesis, and Noiz (cloud) for voice cloning, emotion control, and precise segment timing Simple mode converts text, files, or URLs to audio with optional voice cloning from reference audio; timeline mode aligns speech to SRT subtitles with per-segment voice and emotion control Voice maps enable granular control over voice selection, language, speed, and emotion across individual segments or ranges Guest mode provides 15 built-in voices without API authentication; full features require a Noiz API key Supports multiple output formats (WAV, MP3, Opus, OGG) and integrates with Feishu, Telegram, and Discord

INSTALLATION
npx skills add https://github.com/noizai/skills --skill tts
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

tts

Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.

Triggers

  • text to speech / tts / speak / say
  • voice clone / dubbing
  • epub to audio / srt to audio / convert to audio
  • 语音 / 说 / 讲 / 说话

Simple Mode — text to audio

speak is the default — the subcommand can be omitted:

# Basic usage (speak is implicit)

python3 skills/tts/scripts/tts.py -t "Hello world"          # add -o path to save

python3 skills/tts/scripts/tts.py -f article.txt -o out.mp3

# Voice cloning — local file path or URL

python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio ./ref.wav

python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio https://example.com/my_voice.wav -o clone.wav

# Voice message format

python3 skills/tts/scripts/tts.py -t "Hello" --format opus -o voice.opus

python3 skills/tts/scripts/tts.py -t "Hello" --format ogg -o voice.ogg

Third-party integration (Feishu/Telegram/Discord) is documented in ref_3rd_party.md.

Timeline Mode — SRT to time-aligned audio

For precise per-segment timing (dubbing, subtitles, video narration).

Step 1: Get or create an SRT

If the user doesn't have one, generate from text:

python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt

python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt --cps 15 --gap 500

--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write SRT manually.

Step 2: Create a voice map

JSON file controlling default + per-segment voice settings. segments keys support single index "3" or range "5-8".

Kokoro voice map:

{

  "default": { "voice": "zf_xiaoni", "lang": "cmn" },

  "segments": {

    "1": { "voice": "zm_yunxi" },

    "5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }

  }

}

Noiz voice map (adds emo, reference_audio support). reference_audio can be a local path or a URL (user’s own audio; Noiz only):

{

  "default": { "voice_id": "voice_123", "target_lang": "zh" },

  "segments": {

    "1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },

    "2-4": { "reference_audio": "./refs/guest.wav" }

  }

}

Dynamic Reference Audio Slicing:

If you are translating or dubbing a video and want each sentence to automatically use the audio from the original video at the exact same timestamp as its reference audio, use the --ref-audio-track argument instead of setting reference_audio in the map:

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --ref-audio-track original_video.mp4 -o output.wav

See examples/ for full samples.

Step 3: Render

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json -o output.wav

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav

When to Choose Which

Need

Recommended

Just read text aloud, no fuss

Kokoro (default)

EPUB/PDF audiobook with chapters

Kokoro (native support)

Voice blending ("v1:60,v2:40")

Kokoro

Voice cloning from reference audio

Noiz

Emotion control (emo param)

Noiz

Exact server-side duration per segment

Noiz

When the user needs emotion control + voice cloning + precise duration together, Noiz is the only backend that supports all three.

Guest Mode (no API key)

When no API key is configured, tts.py automatically falls back to guest mode — a limited Noiz endpoint that requires no authentication. Guest mode only supports --voice-id, --speed, and --format; voice cloning, emotion, duration, and timeline rendering are not available.

# Guest mode (auto-detected when no API key is set)

python3 skills/tts/scripts/tts.py -t "Hello" --voice-id 883b6b7c -o hello.wav

# Explicit backend override to use kokoro instead

python3 skills/tts/scripts/tts.py -t "Hello" --backend kokoro

Available guest voices (15 built-in):

voice_id

name

lang

gender

tone

063a4491

販売員(なおみ)

ja

F

喜び

4252b9c8

落ち着いた女性

ja

F

穏やか

578b4be2

熱血漢(たける)

ja

M

怒り

a9249ce7

安らぎ(みなと)

ja

M

穏やか

f00e45a1

旅人(かいと)

ja

M

穏やか

b4775100

悦悦|社交分享

zh

F

Joyful

77e15f2c

婉青|情绪抚慰

zh

F

Calm

ac09aeb4

阿豪|磁性主持

zh

M

Calm

87cb2405

建国|知识科普

zh

M

Calm

3b9f1e27

小明|科技达人

zh

M

Joyful

95814add

Science Narration

en

M

Calm

883b6b7c

The Mentor (Alex)

en

M

Joyful

a845c7de

The Naturalist (Silas)

en

M

Calm

5a68d66b

The Healer (Serena)

en

F

Calm

0e4ab6ec

The Mentor (Maya)

en

F

Calm

Security & data disclosure

This skill performs the following file and network operations at runtime:

  • Credential storage: When you run config --set-api-key, the key is saved to ~/.config/noiz/api_key (permissions 0600). The NOIZ_API_KEY environment variable is also supported as an alternative.
  • Legacy key migration: If ~/.noiz_api_key exists and ~/.config/noiz/api_key does not, the key is copied (not deleted) to the new location. A message is printed; the old file is left untouched for you to remove manually.
  • Network calls (Noiz backend): Text and optional reference audio are uploaded to https://noiz.ai/v1/ for synthesis. No data is sent unless you invoke a Noiz command.
  • Reference audio download: When --ref-audio is a URL, the file is downloaded to a temp file, used for the API call, then deleted. If no voice-id or ref-audio is provided, a default reference audio is downloaded from storage.googleapis.com or noiz.ai.
  • Temp files: Temporary audio/text files may be created during synthesis and are cleaned up after use.
  • ffmpeg: Invoked only in timeline render mode to assemble the final audio.

No files outside the output path and ~/.config/noiz/ are modified. The Kokoro backend runs entirely offline with no network access.

Requirements

  • ffmpeg in PATH (timeline mode only)
  • requests package: uv pip install requests (required for Noiz backend)
  • Get your API key at Noiz Developer, then run python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY (guest mode works without a key but has limited features)
  • Kokoro: if already installed, pass --backend kokoro to use the local backend

Noiz API authentication

Use only the base64-encoded API key as Authorization—no prefix (e.g. no APIKEY or Bearer ). Any prefix causes 401.

For backend details and full argument reference, see reference.md.

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card