speech-to-text

Transcribe audio to text with ElevenLabs Scribe and Whisper models via inference.sh CLI. Models: ElevenLabs Scribe v2 (98%+ accuracy, diarization), Fast…

INSTALLATION
npx skills add https://github.com/inference-sh/skills --skill speech-to-text
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Available Models

| Model | App ID | Best For |
|---|---|---|
| ElevenLabs Scribe v2 | elevenlabs/stt | 98%+ accuracy, diarization, 90+ languages |
| Fast Whisper V3 | infsh/fast-whisper-large-v3 | Fast transcription |
| Whisper V3 Large | infsh/whisper-v3-large | Highest accuracy |
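
When a script has to choose a model, the table can be expressed as a small lookup. This is only a sketch: `pick_stt_model` and the priority keywords (`speed`, `accuracy`, `diarization`) are hypothetical names chosen for illustration, not part of the CLI.

```shell
# Map a priority to an App ID taken from the table above.
# pick_stt_model and its keywords are illustrative, not part of the belt CLI.
pick_stt_model() {
  case "$1" in
    speed)       echo "infsh/fast-whisper-large-v3" ;;
    accuracy)    echo "infsh/whisper-v3-large" ;;
    diarization) echo "elevenlabs/stt" ;;
    *)           echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

# Possible usage:
#   belt app run "$(pick_stt_model speed)" --input '{"audio_url": "https://meeting.mp3"}'
```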

Examples

Basic Transcription

belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://meeting.mp3"}'

With Timestamps

# Generate a sample input file, then edit input.json, e.g.:
belt app sample infsh/fast-whisper-large-v3 --save input.json
# {
#   "audio_url": "https://podcast.mp3",
#   "timestamps": true
# }
belt app run infsh/fast-whisper-large-v3 --input input.json

Translation (to English)

belt app run infsh/whisper-v3-large --input '{
  "audio_url": "https://french-audio.mp3",
  "task": "translate"
}'

From Video

# Extract audio from video first; the result is saved to audio.json
belt app run infsh/video-audio-extractor --input '{"video_url": "https://video.mp4"}' > audio.json

# Transcribe the extracted audio (replace <audio-url> with the URL reported in audio.json)
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "<audio-url>"}'
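
Because the JSON input in these commands is single-quoted, the shell does not expand variables inside it. One way to pass a URL held in a variable is to build the JSON with `printf`. The helper name `stt_input` is hypothetical, and the sketch assumes the URL contains no double quotes or backslashes.

```shell
# Build the JSON input for a transcription run from a URL.
# stt_input is an illustrative helper; it does no JSON escaping,
# so the URL must not contain double quotes or backslashes.
stt_input() {
  printf '{"audio_url": "%s"}\n' "$1"
}

# Possible usage (the URL still has to be read from audio.json by hand or by a JSON tool):
#   audio_url="<audio-url>"
#   belt app run infsh/fast-whisper-large-v3 --input "$(stt_input "$audio_url")"
```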

Workflow: Video Subtitles

# 1. Transcribe video audio
belt app run infsh/fast-whisper-large-v3 --input '{
  "audio_url": "https://video.mp4",
  "timestamps": true
}' > transcript.json

# 2. Use transcript for captions
belt app run infsh/caption-videos --input '{
  "video_url": "https://video.mp4",
  "captions": "<transcript-from-step-1>"
}'
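
If the timestamped segments are also needed as a standalone subtitle file, their times have to be written in SubRip (SRT) notation, `HH:MM:SS,mmm`. The helper below is a minimal sketch of that conversion. It assumes segment times are expressed in seconds, which this page does not confirm, and `srt_time` is a hypothetical name.

```shell
# Convert a time in seconds (possibly fractional) to an SRT timestamp.
# Assumption: segment start/end values are seconds; srt_time is an illustrative helper.
srt_time() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)                    # round to whole milliseconds
    h  = int(ms / 3600000); ms -= h * 3600000   # hours
    m  = int(ms / 60000);   ms -= m * 60000     # minutes
    s  = int(ms / 1000);    ms -= s * 1000      # seconds, remainder is milliseconds
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}

# Possible usage:
#   echo "$(srt_time 1.5) --> $(srt_time 4.25)"
```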

Supported Languages

Whisper supports 99+ languages including:

English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more.

Use Cases

  • Meetings: Transcribe recordings
  • Podcasts: Generate transcripts
  • Subtitles: Create captions for videos
  • Voice Notes: Convert to searchable text
  • Interviews: Transcribe interviews for research
  • Accessibility: Make audio content accessible

Output Format

Returns JSON with:

  • text: Full transcription
  • segments: Timestamped segments (if requested)
  • language: Detected language

Related Skills

# ElevenLabs STT (98%+ accuracy, diarization)
npx skills add inference-sh/skills@elevenlabs-stt

# ElevenLabs TTS (reverse direction)
npx skills add inference-sh/skills@elevenlabs-tts

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli

# Text-to-speech (reverse direction)
npx skills add inference-sh/skills@text-to-speech

# Video generation (add captions)
npx skills add inference-sh/skills@ai-video-generation

# AI avatars (lipsync with transcripts)
npx skills add inference-sh/skills@ai-avatar-video

Browse all audio apps: belt app store --category audio
