speak-tts

Real-time text-to-speech with voice cloning on Apple Silicon, entirely on-device.

  • Supports multiple input sources (text files, markdown, stdin, web articles, PDFs) and output modes (streaming, file save, playback, or both)
  • Voice cloning from 10–30 second WAV samples at 24000 Hz mono; includes emotion tags like [laugh], [sigh], and [gasp] for audible effects
  • Batch processing with auto-chunking for long documents, concatenation utilities, and resume capability for interrupted generations
  • Requires an Apple Silicon Mac, macOS 12.0+, and command-line tools (sox, ffmpeg, poppler); runs entirely locally via MLX with no API keys

INSTALLATION
npx skills add https://github.com/emzod/speak --skill speak-tts
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement         Check              Install
-----------         -----              -------
Apple Silicon Mac   uname -m → arm64   Intel not supported
macOS 12.0+         sw_vers            -
sox                 which sox          brew install sox
ffmpeg              which ffmpeg       brew install ffmpeg
poppler (PDF)       which pdftotext    brew install poppler
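
The checks above can be combined into one small script. This is a minimal sketch: have_tool and missing_tools are hypothetical helper names, not part of speak, and the sketch only tests whether each tool is on the PATH.

```shell
# have_tool NAME: succeed if NAME resolves to a command on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# missing_tools NAME...: print each listed tool that is not installed.
missing_tools() {
  for tool in "$@"; do
    have_tool "$tool" || printf '%s\n' "$tool"
  done
}

# Example (not run here):
#   [ "$(uname -m)" = "arm64" ] || echo "Apple Silicon required"
#   missing_tools sox ffmpeg pdftotext
```

Any name printed by missing_tools can be installed with the matching brew command from the table (pdftotext is provided by poppler).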

Input Sources

Source          Example
------          -------
Text file       speak article.txt
Markdown        speak doc.md
Direct string   speak "Hello"
Clipboard       pbpaste | speak
Stdin           cat file.txt | speak

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format   Convert Command
------   ---------------
PDF      pdftotext doc.pdf doc.txt
DOCX     textutil -convert txt doc.docx
HTML     pandoc -f html -t plain doc.html > doc.txt
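
Because speak reads from stdin, the conversions above can also be piped straight into it. The following is a sketch under assumptions: to_text is a hypothetical helper, and it writes the converted text to stdout instead of a .txt file.

```shell
# to_text FILE: print FILE as plain text on stdout, choosing a converter
# by extension. Converters are the ones listed in the table above.
to_text() {
  case "$1" in
    *.txt|*.md) cat "$1" ;;
    *.pdf)      pdftotext "$1" - ;;                      # "-" sends text to stdout
    *.docx)     textutil -convert txt -stdout "$1" ;;    # macOS only
    *.html)     pandoc -f html -t plain "$1" ;;
    *)          echo "unsupported format: $1" >&2; return 1 ;;
  esac
}

# Example (not run here):
#   to_text doc.pdf | speak --output doc.wav
```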

Output Modes

Goal                     Command
----                     -------
Save for later           speak text.txt --output file.wav
Listen now (streaming)   speak text.txt --stream
Listen now (complete)    speak text.txt --play
Both                     speak text.txt --stream --output file.wav

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)

speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav

Directory Auto-Creation

Directory            Auto-Created?
---------            -------------
~/Audio/speak/       ✓ Yes
~/.chatter/voices/   ✗ No
Custom directories   ✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/

mkdir -p ~/Audio/custom/
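
When the output path varies, the directory can be derived from the path itself. A minimal sketch, where ensure_parent_dir is a hypothetical helper name:

```shell
# ensure_parent_dir PATH: create the directory that will contain PATH.
ensure_parent_dir() {
  mkdir -p "$(dirname "$1")"
}

# Example (not run here):
#   out=~/Audio/custom/talk.wav
#   ensure_parent_dir "$out" && speak notes.txt --output "$out"
```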

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

  • Output captures general voice characteristics but is not a perfect replica
  • Quality depends heavily on sample quality
  • 15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

  • Open QuickTime Player → File → New Audio Recording
  • Record 20 seconds of clear speech
  • File → Export As → Audio Only (.m4a)
  • Convert to WAV (see below)

Using sox (command line):

# -d = use default microphone

# Recording starts immediately and stops after 25 seconds

sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3

ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)

ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds

ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties

ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"

# Should show: Duration ~15-25s, 24000 Hz, mono
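
The same check can be scripted instead of read by eye. This is a sketch, not how speak itself validates samples: sample_ok, probe, and check_sample are hypothetical names, and the sketch checks sample rate, channel count, and duration but not the container format.

```shell
# sample_ok RATE CHANNELS SECONDS: succeed only for 24000 Hz, mono, 10-30 s.
sample_ok() {
  rate=$1 channels=$2 seconds=$3
  [ "$rate" -eq 24000 ] && [ "$channels" -eq 1 ] &&
    awk -v s="$seconds" 'BEGIN { exit !(s >= 10 && s <= 30) }'
}

# probe ENTRY FILE: print a single ffprobe field with no key or wrapper.
probe() {
  ffprobe -v error -select_streams a:0 -show_entries "$1" \
    -of default=noprint_wrappers=1:nokey=1 "$2"
}

# check_sample FILE: read the properties with ffprobe and validate them.
check_sample() {
  sample_ok "$(probe stream=sample_rate "$1")" \
            "$(probe stream=channels "$1")" \
            "$(probe format=duration "$1")"
}

# Example (not run here):
#   check_sample voice.wav && echo "sample looks usable"
```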

Using Your Voice

# Create directory

mkdir -p ~/.chatter/voices/

# Move sample

mv voice.wav ~/.chatter/voices/my_voice.wav

# Test

speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content

speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

  • ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
  • ✓ Works: /Users/name/.chatter/voices/my_voice.wav
  • ✗ Fails: my_voice.wav (relative path)
  • ✗ Fails: ./voices/my_voice.wav (relative path)
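
If a script receives a relative path, it can be made absolute before being passed to --voice. A minimal sketch, where abs_path is a hypothetical helper that prefixes the current directory and does not resolve symlinks or ".." segments:

```shell
# abs_path PATH: print PATH unchanged if absolute, else anchored at $PWD.
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$PWD" "${1#./}" ;;   # drop a leading "./"
  esac
}

# Example (not run here):
#   speak "Testing my voice" --voice "$(abs_path my_voice.wav)" --stream
```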

Voice Sample Tips

Good Sample      Bad Sample
-----------      ----------
Quiet room       Background noise
Natural pace     Rushed or monotone
Clear diction    Mumbling
Varied content   Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream

# Output: (sigh sound) "Monday again."

Tag              Effect
---              ------
[laugh]          Laughter
[chuckle]        Light chuckle
[sigh]           Sighing
[gasp]           Gasping
[groan]          Groaning
[clear throat]   Throat clearing
[cough]          Coughing
[crying]         Crying
[singing]        Sung speech

NOT supported: [pause], [whisper] (ignored)

For pauses, use punctuation instead: "Wait... let me think."
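
Text that already contains the unsupported tags can be cleaned up before it reaches speak. This is a sketch under assumptions: replace_unsupported_tags is a hypothetical filter, and turning [pause] into an ellipsis is only one way to follow the punctuation advice above.

```shell
# replace_unsupported_tags: read text on stdin, turn [pause] into "..."
# and drop [whisper]; all other tags pass through unchanged.
replace_unsupported_tags() {
  sed -e 's/ *\[pause\] */... /g' -e 's/\[whisper\] *//g'
}

# Example (not run here):
#   replace_unsupported_tags < script.txt | speak --stream
```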

Batch Processing

mkdir -p ~/Audio/book/

speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/

# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)

speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files

speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

  • Each input file is chunked independently
  • Chunks are generated and automatically concatenated per file
  • Final output: one .wav per input file (e.g., ch01.wav)
  • Intermediate chunks deleted (unless --keep-chunks)

You don't need to manually concatenate chunks — only concatenate final chapter files.
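
To get a rough idea of how many chunks a file will produce, divide its character count by the chunk size and round up. This is only an approximation: chunk_count is a hypothetical helper, 6000 is the documented --chunk-size default, and the real chunker may split at different boundaries.

```shell
# chunk_count CHARS [SIZE]: ceiling of CHARS / SIZE (SIZE defaults to 6000).
chunk_count() {
  chars=$1
  size=${2:-6000}
  echo $(( (chars + size - 1) / size ))
}

# Example (not run here):
#   chunk_count "$(wc -m < ch01.txt)"
```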

Concatenating Audio

# Explicit order (recommended)

speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)

speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files   Correct              Wrong
-----   -------              -----
1-9     01, 02, ..., 09      1, 2, ..., 9
10-99   01, 02, ..., 99      1, 10, 2, ...
100+    001, 002, ..., 999   1, 100, 2, ...

Why: Shell glob expansion sorts names character by character, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
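
Zero-padded names are easy to generate with printf. A minimal sketch, where padded_name is a hypothetical helper; pass the number without leading zeros, since printf would otherwise read a value like 08 as invalid octal.

```shell
# padded_name PREFIX NUMBER WIDTH EXT: print PREFIX + zero-padded NUMBER + .EXT
padded_name() {
  printf "%s%0${3}d.%s\n" "$1" "$2" "$4"
}

# Example (not run here): rename ch1.txt ... ch9.txt to ch01.txt ... ch09.txt
#   for n in 1 2 3 4 5 6 7 8 9; do
#     mv "ch$n.txt" "$(padded_name ch "$n" 2 txt)"
#   done
```

For instance, padded_name ch 3 2 wav prints ch03.wav.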

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents

pdftotext -f 1 -l 5 textbook.pdf toc.txt

cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers

pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters

pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt

pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt

pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt

# ... continue for all chapters

Step 3: Estimate Time

speak --estimate ch*.txt

# Shows: total audio duration, generation time, storage needed

# Quick estimates:

# 1 page ≈ 2 min audio ≈ 1 min generation

# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
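
The quick estimates above are simple multiplication and can be computed for any page count. A sketch using only the stated rules of thumb (2 min of audio and 1 min of generation per page, about 2.5 MB per minute of audio); estimate_pages is a hypothetical helper, not a substitute for speak --estimate.

```shell
# estimate_pages PAGES: print rough audio length, generation time, and size.
estimate_pages() {
  pages=$1
  audio_min=$(( pages * 2 ))           # 1 page ≈ 2 min audio
  gen_min=$pages                       # 1 page ≈ 1 min generation
  storage_mb=$(( audio_min * 5 / 2 ))  # ≈ 2.5 MB per minute of audio
  echo "${audio_min} min audio, ${gen_min} min generation, ${storage_mb} MB"
}

estimate_pages 100   # → 200 min audio, 100 min generation, 500 MB
```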

Step 4: Generate Audio

mkdir -p audiobook/

speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk

# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav

# Or with glob (only if zero-padded):

speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue                Solution
-----                --------
Empty/garbled text   Scanned PDF; use OCR: brew install tesseract
Wrong encoding       Try: pdftotext -enc UTF-8 doc.pdf
Check word count     pdftotext doc.pdf - | wc -w (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt

echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav

speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
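
With more than a few segments, the voice and output path can be derived from the script filename. This sketch assumes the NN_speaker.txt naming used above and one voice sample per speaker in ~/.chatter/voices/; voice_for and segment_wav are hypothetical helpers.

```shell
# voice_for SCRIPT: map podcast/scripts/01_host.txt to the host voice sample.
voice_for() {
  name=$(basename "$1" .txt)    # e.g. 01_host
  printf '%s/.chatter/voices/%s.wav\n' "$HOME" "${name#*_}"
}

# segment_wav SCRIPT: map podcast/scripts/01_host.txt to podcast/wav/01.wav.
segment_wav() {
  name=$(basename "$1" .txt)
  printf 'podcast/wav/%s.wav\n' "${name%%_*}"
}

# Example (not run here):
#   for script in podcast/scripts/*.txt; do
#     speak "$script" --voice "$(voice_for "$script")" --output "$(segment_wav "$script")"
#   done
```

Zero-padded segment numbers keep the glob in the loop, and any later concat glob, in the intended order.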

Options Reference

Option               Description                Default
------               -----------                -------
--stream             Stream as it generates     false
--play               Play after complete        false
--output <path>      Output file                ~/Audio/speak/
--output-dir <dir>   Batch output directory     -
--voice <path>       Voice sample (full path)   default
--timeout <sec>      Timeout per file           300
--auto-chunk         Split long documents       false
--chunk-size <n>     Chars per chunk            6000
--resume <file>      Resume from manifest       -
--keep-chunks        Keep intermediate files    false
--skip-existing      Skip if output exists      false
--estimate           Show duration estimate     false
--dry-run            Preview only               false
--quiet              Suppress output            false

Commands

Command             Description
-------             -----------
speak setup         Set up environment
speak health        Check system status
speak models        List TTS models
speak concat        Concatenate audio
speak daemon kill   Stop TTS server
speak config        Show configuration

Performance

Metric       Value
------       -----
Cold start   ~4-8s
Warm start   ~3-8s
Speed        0.3-0.5x RTF (faster than real-time)
Storage      ~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume

speak long.txt --auto-chunk --output book.wav

# If interrupted, manifest saved at ~/Audio/speak/manifest.json

speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing

speak ch*.txt --output-dir audiobook/ --auto-chunk

# If interrupted, re-run same command:

speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
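
Before re-running, it can help to see which inputs still lack an output. This sketch assumes each input produces a .wav with the same base name in the output directory, as in the batch examples; pending_inputs is a hypothetical helper and only checks that a file exists, not that it is complete.

```shell
# pending_inputs OUTDIR FILE...: print inputs with no matching .wav in OUTDIR.
pending_inputs() {
  outdir=$1
  shift
  for input in "$@"; do
    base=$(basename "$input")
    [ -e "$outdir/${base%.*}.wav" ] || printf '%s\n' "$input"
  done
}

# Example (not run here):
#   pending_inputs audiobook ch*.txt
```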

Common Errors

Error                              Cause               Solution
-----                              -----               --------
"Voice file not found"             Relative path       Use full path: ~/.chatter/voices/x.wav
"Invalid WAV format"               Wrong specs         Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav
"Voice sample too short"           <10 seconds         Record 15-25 seconds
"Output directory doesn't exist"   Not created         mkdir -p dirname/
"sox not found"                    Not installed       brew install sox
Scrambled concat order             Non-zero-padded     Use 01, 02, not 1, 2
Timeout                            >5 min generation   Use --auto-chunk or --timeout 600
"Server not running"               Stale daemon        speak daemon kill && speak health

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)

speak setup      # Or manual setup

speak health     # Verify everything works

Server Management

The server starts automatically and shuts down after 1 hour of idle time.

speak health        # Check status

speak daemon kill   # Stop manually