SKILL.md


| Task | MCP Tool | Fallback (Direct API) |
| --- | --- | --- |
| List TTS voices | mcp__heygen__list_audio_voices | GET /v3/voices?engine=starfish |
| Generate speech audio | mcp__heygen__text_to_speech | POST /v3/voices/speech |

Default Workflow

1. List voices with mcp__heygen__list_audio_voices (or GET /v3/voices?engine=starfish)
2. Pick a voice matching the desired language, gender, and features
3. Call mcp__heygen__text_to_speech (or POST /v3/voices/speech) with text and voice_id
4. Use the returned audio_url to download or play the audio
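Step 2 of the workflow can be sketched as a small filter over the listed voices. This is an illustrative sketch, not part of the API: the helper name pick_voice and the sample voice IDs are hypothetical, the field names follow the voice listing response format documented in this file, and the case-insensitive substring match on language mirrors the TypeScript example under "Find a Voice and Generate Audio".

```python
from typing import Optional


def pick_voice(
    voices: list[dict],
    language: str,
    gender: Optional[str] = None,
    need_pause: bool = False,
    need_locale: bool = False,
) -> Optional[dict]:
    """Return the first voice that satisfies every requested filter, or None."""
    wanted_language = language.lower()
    for voice in voices:
        # Case-insensitive substring match, e.g. "english" matches "English".
        if wanted_language not in voice.get("language", "").lower():
            continue
        if gender and voice.get("gender") != gender:
            continue
        if need_pause and not voice.get("support_pause"):
            continue
        if need_locale and not voice.get("support_locale"):
            continue
        return voice
    return None


# Hypothetical sample entries shaped like items from the voice listing.
sample_voices = [
    {"voice_id": "voice-a", "name": "Example A", "language": "English",
     "gender": "female", "support_pause": False, "support_locale": False},
    {"voice_id": "voice-b", "name": "Example B", "language": "English",
     "gender": "male", "support_pause": True, "support_locale": False},
]

chosen = pick_voice(sample_voices, "english", gender="male", need_pause=True)
```

Returning None instead of raising leaves it to the caller to decide whether to relax a filter or fail.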

List TTS Voices

Retrieve voices compatible with the Starfish TTS model.

Note: This uses the unified GET /v3/voices endpoint with the engine=starfish filter to return only TTS-compatible voices. Not all video voices support Starfish TTS. The response is paginated; pass next_token as the token query parameter to fetch additional pages.

Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| engine | string | Filter by engine (use starfish for TTS voices) |
| type | string | public or private |
| language | string | Filter by language |
| gender | string | Filter by gender |
| limit | integer | Results per page, 1-100 |
| token | string | Pagination cursor from next_token |
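As a rough illustration of how these parameters combine, the sketch below builds a listing URL and rejects values outside the ranges in the table. The helper name build_voices_url is hypothetical, and the language value in the usage line is only an example; the exact filter values the API accepts are not specified here.

```python
from typing import Optional
from urllib.parse import urlencode

VOICES_URL = "https://api.heygen.com/v3/voices"


def build_voices_url(
    engine: str = "starfish",
    voice_type: Optional[str] = None,
    language: Optional[str] = None,
    gender: Optional[str] = None,
    limit: Optional[int] = None,
    token: Optional[str] = None,
) -> str:
    """Build the GET /v3/voices URL, omitting any filter that is not set."""
    if voice_type is not None and voice_type not in ("public", "private"):
        raise ValueError("type must be 'public' or 'private'")
    if limit is not None and not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    candidates = {
        "engine": engine,
        "type": voice_type,
        "language": language,
        "gender": gender,
        "limit": limit,
        "token": token,
    }
    # Drop unset filters so only explicit choices reach the query string.
    params = {key: value for key, value in candidates.items() if value is not None}
    return f"{VOICES_URL}?{urlencode(params)}"


url = build_voices_url(language="English", limit=50)
```

Validating limit locally surfaces an out-of-range value before a request is sent.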

curl

curl -X GET "https://api.heygen.com/v3/voices?engine=starfish" \
  -H "X-Api-Key: $HEYGEN_API_KEY"

TypeScript

interface AudioVoiceItem {

  voice_id: string;

  name: string;

  language: string;

  gender: "female" | "male" | "unknown";

  preview_audio_url: string | null;

  support_pause: boolean;

  support_locale: boolean;

  type: string;

}

interface TTSVoicesResponse {

  error: null | string;

  data: AudioVoiceItem[];

  has_more: boolean;

  next_token: string | null;

}

async function listTTSVoices(): Promise<AudioVoiceItem[]> {

  const allVoices: AudioVoiceItem[] = [];

  let token: string | null = null;

  do {

    const url = new URL("https://api.heygen.com/v3/voices");

    url.searchParams.set("engine", "starfish");

    if (token) url.searchParams.set("token", token);

    const response = await fetch(url.toString(), {

      headers: { "X-Api-Key": process.env.HEYGEN_API_KEY! },

    });

    const json: TTSVoicesResponse = await response.json();

    if (json.error) {

      throw new Error(json.error);

    }

    allVoices.push(...json.data);

    token = json.next_token;

  } while (token);

  return allVoices;

}

Python

import os

import requests


def list_tts_voices() -> list:
    """Fetch every Starfish-compatible voice, following pagination cursors."""
    all_voices = []
    token = None
    while True:
        params = {"engine": "starfish"}
        if token:
            params["token"] = token
        response = requests.get(
            "https://api.heygen.com/v3/voices",
            headers={"X-Api-Key": os.environ["HEYGEN_API_KEY"]},
            params=params,
            timeout=30,
        )
        data = response.json()
        if data.get("error"):
            raise RuntimeError(data["error"])
        all_voices.extend(data["data"])
        token = data.get("next_token")
        # Stop when the API reports no more pages or returns no cursor;
        # without a cursor the next request would repeat the first page.
        if not data.get("has_more") or not token:
            break
    return all_voices

Response Format

{

  "error": null,

  "data": [

    {

      "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",

      "name": "Chill Brian",

      "language": "English",

      "gender": "male",

      "preview_audio_url": "https://resource.heygen.ai/text_to_speech/WpSDQvmLGXEqXZVZQiVeg6.mp3",

      "support_pause": true,

      "support_locale": false,

      "type": "public"

    }

  ],

  "has_more": false,

  "next_token": null

}

Generate Speech Audio

Convert text to speech audio using a specified voice.

Endpoint

POST https://api.heygen.com/v3/voices/speech

Request Fields

| Field | Type | Req | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text content to convert (1-5000 characters) |
| voice_id | string | Yes | Voice ID from GET /v3/voices?engine=starfish |
| input_type | string | No | "text" (default) or "ssml" for full SSML markup |
| speed | number | No | Speech speed, 0.5-2.0 (default: 1.0) |
| language | string | No | Base language code (e.g., "en", "pt"). Auto-detected if omitted |
| locale | string | No | BCP-47 locale for multilingual voices (e.g., "en-US", "pt-BR") |
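The limits in the table can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: validate_tts_request is a hypothetical helper, text and voice_id are treated as required because the TypeScript TTSRequest interface in this document marks them non-optional, and len() counts Unicode code points, which may not match how the API counts characters.

```python
def validate_tts_request(payload: dict) -> list[str]:
    """Return a list of problems found in a request payload (empty if none)."""
    problems = []

    text = payload.get("text")
    if not isinstance(text, str) or not 1 <= len(text) <= 5000:
        problems.append("text must be a string of 1-5000 characters")

    voice_id = payload.get("voice_id")
    if not isinstance(voice_id, str) or not voice_id:
        problems.append("voice_id is required")

    if payload.get("input_type", "text") not in ("text", "ssml"):
        problems.append('input_type must be "text" or "ssml"')

    speed = payload.get("speed", 1.0)
    is_number = isinstance(speed, (int, float)) and not isinstance(speed, bool)
    if not is_number or not 0.5 <= speed <= 2.0:
        problems.append("speed must be a number between 0.5 and 2.0")

    return problems


ok = validate_tts_request({"text": "Hello!", "voice_id": "YOUR_VOICE_ID"})
bad = validate_tts_request({"text": "", "voice_id": "YOUR_VOICE_ID", "speed": 3})
```

Collecting every problem in one pass, rather than raising on the first, makes it easier to report all issues to a caller at once.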

curl

curl -X POST "https://api.heygen.com/v3/voices/speech" \
  -H "X-Api-Key: $HEYGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to our product demo.",
    "voice_id": "YOUR_VOICE_ID",
    "speed": 1.0
  }'

TypeScript

interface TTSRequest {

  text: string;

  voice_id: string;

  input_type?: "text" | "ssml";

  speed?: number;

  language?: string;

  locale?: string;

}

interface WordTimestamp {

  word: string;

  start: number;

  end: number;

}

interface TTSResponse {

  error: null | string;

  data: {

    audio_url: string;

    duration: number;

    request_id?: string;

    word_timestamps?: WordTimestamp[];

  };

}

async function textToSpeech(request: TTSRequest): Promise<TTSResponse["data"]> {

  const response = await fetch(

    "https://api.heygen.com/v3/voices/speech",

    {

      method: "POST",

      headers: {

        "X-Api-Key": process.env.HEYGEN_API_KEY!,

        "Content-Type": "application/json",

      },

      body: JSON.stringify(request),

    }

  );

  const json: TTSResponse = await response.json();

  if (json.error) {

    throw new Error(json.error);

  }

  return json.data;

}

Python

import requests

import os

def text_to_speech(

    text: str,

    voice_id: str,

    input_type: str = "text",

    speed: float = 1.0,

    language: str | None = None,

    locale: str | None = None,

) -> dict:

    payload = {

        "text": text,

        "voice_id": voice_id,

        "speed": speed,

    }

    if input_type != "text":

        payload["input_type"] = input_type

    if language:

        payload["language"] = language

    if locale:

        payload["locale"] = locale

    response = requests.post(

        "https://api.heygen.com/v3/voices/speech",

        headers={

            "X-Api-Key": os.environ["HEYGEN_API_KEY"],

            "Content-Type": "application/json",

        },

        json=payload,

    )

    data = response.json()

    if data.get("error"):

        raise Exception(data["error"])

    return data["data"]

Response Format

{

  "error": null,

  "data": {

    "audio_url": "https://resource2.heygen.ai/text_to_speech/.../id=365d46bb.wav",

    "duration": 5.526,

    "request_id": "p38QJ52hfgNlsYKZZmd9",

    "word_timestamps": [

      { "word": "<start>", "start": 0.0, "end": 0.0 },

      { "word": "Hey", "start": 0.079, "end": 0.219 },

      { "word": "there,", "start": 0.239, "end": 0.459 },

      { "word": "<end>", "start": 5.526, "end": 5.526 }

    ]

  }

}
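One possible use of word_timestamps is caption generation. The sketch below groups words into SubRip (SRT) cues. It assumes, based only on the sample response above, that start and end are in seconds and that the `<start>` and `<end>` entries are boundary markers rather than spoken words. The helper names and the words_per_cue grouping are hypothetical choices, not part of the API.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def timestamps_to_srt(word_timestamps: list[dict], words_per_cue: int = 5) -> str:
    """Group spoken words into numbered SRT cues of up to words_per_cue words."""
    # Skip the boundary markers seen in the sample response.
    words = [w for w in word_timestamps if w["word"] not in ("<start>", "<end>")]
    cues = []
    for index in range(0, len(words), words_per_cue):
        group = words[index:index + words_per_cue]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")


sample = [
    {"word": "<start>", "start": 0.0, "end": 0.0},
    {"word": "Hey", "start": 0.079, "end": 0.219},
    {"word": "there,", "start": 0.239, "end": 0.459},
    {"word": "<end>", "start": 5.526, "end": 5.526},
]
srt = timestamps_to_srt(sample)
```

Each cue spans from the first word's start to the last word's end, so silences between cues are left uncaptioned.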

Usage Examples

Basic TTS

const result = await textToSpeech({

  text: "Welcome to our quarterly earnings call.",

  voice_id: "YOUR_VOICE_ID",

});

console.log(`Audio URL: ${result.audio_url}`);

console.log(`Duration: ${result.duration}s`);

With Speed Adjustment

const result = await textToSpeech({

  text: "We're thrilled to announce our newest feature!",

  voice_id: "YOUR_VOICE_ID",

  speed: 1.1,

});

With Language and Locale for Multilingual Voices

const result = await textToSpeech({

  text: "Bem-vindo ao nosso produto.",

  voice_id: "MULTILINGUAL_VOICE_ID",

  language: "pt",

  locale: "pt-BR",

});

With SSML Input

const result = await textToSpeech({

  text: '<speak>Hello <break time="1s"/> and welcome!</speak>',

  voice_id: "YOUR_VOICE_ID",

  input_type: "ssml",

});

Find a Voice and Generate Audio

async function generateSpeech(text: string, language: string): Promise<string> {

  const voices = await listTTSVoices();

  const voice = voices.find(

    (v) => v.language.toLowerCase().includes(language.toLowerCase())

  );

  if (!voice) {

    throw new Error(`No TTS voice found for language: ${language}`);

  }

  const result = await textToSpeech({

    text,

    voice_id: voice.voice_id,

  });

  return result.audio_url;

}

const audioUrl = await generateSpeech("Hello and welcome!", "english");

Pauses with Break Tags

Use SSML-style break tags in your text for pauses:

word <break time="1s"/> word

Rules:

- Use seconds with an s suffix: <break time="1.5s"/>
- Leave a space before and after the tag
- Use the self-closing tag format
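The rules above can be applied mechanically when assembling text from separate segments. This is a small sketch with hypothetical helper names (break_tag, join_with_pauses). It formats the duration in seconds with an s suffix, emits a self-closing tag, and pads the tag with a single space on each side.

```python
def break_tag(seconds: float) -> str:
    """Return a self-closing break tag with the duration in seconds."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    # The :g format drops a trailing ".0", so 1.0 becomes "1" and 1.5 stays "1.5".
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments with a break tag, with a space on each side of the tag."""
    tag = break_tag(seconds)
    cleaned = [segment.strip() for segment in segments if segment.strip()]
    return f" {tag} ".join(cleaned)


text = join_with_pauses(["Hello", "and welcome!"], 1)
```

Stripping each segment first avoids doubled spaces around the tag when the inputs carry their own whitespace.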

With v3, you can also use input_type: "ssml" for full SSML support, allowing richer markup beyond just break tags:

{

  "text": "<speak>Welcome! <break time=\"1s\"/> Let's get started.</speak>",

  "voice_id": "YOUR_VOICE_ID",

  "input_type": "ssml"

}
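When the text comes from user input, characters such as & and < would otherwise be read as markup, since SSML is XML-based. The sketch below escapes each segment before wrapping it in a speak element like the one in the example above. The helper name build_ssml_payload is hypothetical, and how strictly the API parses SSML is an assumption that has not been verified here.

```python
from xml.sax.saxutils import escape


def build_ssml_payload(segments: list[str], voice_id: str, pause_seconds: float = 1.0) -> dict:
    """Escape each segment, join them with a break tag, and wrap in <speak>."""
    pause = f'<break time="{pause_seconds:g}s"/>'
    # escape() converts &, < and > so plain text cannot be read as markup.
    escaped = [escape(segment.strip()) for segment in segments if segment.strip()]
    ssml = "<speak>" + f" {pause} ".join(escaped) + "</speak>"
    return {"text": ssml, "voice_id": voice_id, "input_type": "ssml"}


payload = build_ssml_payload(["Q&A time", "Let's get started."], "YOUR_VOICE_ID")
```

The returned dictionary has the same shape as the JSON body shown above and can be passed to a request helper as is.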

Best Practices

1. **Use GET /v3/voices?engine=starfish** to find compatible voices. The unified /v3/voices endpoint serves all voice types, so the engine=starfish filter is essential for TTS.
2. **Check support_locale** before setting a locale. Only multilingual voices support locale selection.
3. Keep speed between 0.8 and 1.2 for natural-sounding output.
4. Preview voices using preview_audio_url before generating (it may be null for some voices).
5. **Use word_timestamps** in the response for caption syncing or timed text overlays.
6. Use SSML break tags in your text for pauses: word <break time="1s"/> word
7. **Use input_type: "ssml"** when you need full SSML markup control beyond simple break tags.
8. Paginate voice listing. The v3 endpoint returns paginated results; use has_more and next_token to fetch all voices.
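Two of these practices, the support_locale check and the conservative speed range, can be folded into request construction. The sketch below is one way to do that. The helper name prepare_request is hypothetical, and clamping to 0.8-1.2 is a policy choice taken from the guidance above, not an API limit (the API itself accepts 0.5-2.0).

```python
from typing import Optional


def prepare_request(
    voice: dict,
    text: str,
    speed: float = 1.0,
    locale: Optional[str] = None,
) -> dict:
    """Build a request body that follows the locale and speed guidance."""
    payload = {
        "text": text,
        "voice_id": voice["voice_id"],
        # Clamp into the range suggested for natural-sounding output.
        "speed": min(max(speed, 0.8), 1.2),
    }
    # Only send a locale when the chosen voice reports locale support.
    if locale and voice.get("support_locale"):
        payload["locale"] = locale
    return payload


# Hypothetical voice entries shaped like items from the voice listing.
plain_voice = {"voice_id": "voice-a", "support_locale": False}
multilingual_voice = {"voice_id": "voice-b", "support_locale": True}

without_locale = prepare_request(plain_voice, "Hello", speed=1.5, locale="en-US")
with_locale = prepare_request(multilingual_voice, "Olá", locale="pt-BR")
```

Silently dropping an unsupported locale keeps the request valid. A stricter variant could raise instead, so the caller knows to pick a different voice.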
