SKILL.md

# video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

## Prerequisites

- ffmpeg + ffprobe (required): `brew install ffmpeg`
- openai-whisper (optional, for transcription): `pip install openai-whisper`

## Commands

```bash
# Scene detection + transcribe (default)
python3 skills/video-understand/scripts/understand_video.py video.mp4

# Keyframe extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

# Regular interval extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

# Limit frames extracted
python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

# Use a larger Whisper model
python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

# Frames only, skip transcription
python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

# Quiet mode (JSON only, no progress)
python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

# Output to file
python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json
```
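If you want to call the script from another Python program rather than a shell, something like the following may work. It is a minimal sketch, not part of the skill: `build_command` and `understand` are hypothetical helper names, and it assumes that `-q` leaves nothing but the JSON document on stdout, as the quiet-mode description suggests.

```python
import json
import subprocess
import sys

# Path taken from the commands above; adjust to where the skill is installed.
SCRIPT = "skills/video-understand/scripts/understand_video.py"


def build_command(video, mode="scene", max_frames=None, transcribe=True):
    """Assemble the argument list using only flags documented in this file."""
    cmd = [sys.executable, SCRIPT, video, "-m", mode, "-q"]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if not transcribe:
        cmd.append("--no-transcribe")
    return cmd


def understand(video, **options):
    """Run the script and parse its stdout as JSON (assumes -q prints JSON only)."""
    completed = subprocess.run(
        build_command(video, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.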

## CLI Options

| Flag | Description |
|------|-------------|
| `video` | Input video file (positional, required) |
| `-m, --mode` | Extraction mode: `scene` (default), `keyframe`, `interval` |
| `--max-frames` | Maximum frames to keep (default: 20) |
| `--whisper-model` | Whisper model size: `tiny`, `base`, `small`, `medium`, `large` (default: `base`) |
| `--no-transcribe` | Skip audio transcription, extract frames only |
| `-o, --output` | Write result JSON to file instead of stdout |
| `-q, --quiet` | Suppress progress messages, output only JSON |
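For readers who want to see how these options fit together, the table can be mirrored with `argparse`. This is a sketch built only from the table above; the real `understand_video.py` may declare its arguments differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the CLI Options table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="understand_video.py")
    parser.add_argument("video", help="Input video file")
    parser.add_argument("-m", "--mode",
                        choices=["scene", "keyframe", "interval"],
                        default="scene", help="Extraction mode")
    parser.add_argument("--max-frames", type=int, default=20,
                        help="Maximum frames to keep")
    parser.add_argument("--whisper-model",
                        choices=["tiny", "base", "small", "medium", "large"],
                        default="base", help="Whisper model size")
    parser.add_argument("--no-transcribe", action="store_true",
                        help="Skip audio transcription")
    parser.add_argument("-o", "--output", help="Write result JSON to this file")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress messages")
    return parser


args = build_parser().parse_args(["video.mp4", "-m", "keyframe", "--max-frames", "10"])
print(args.mode, args.max_frames, args.whisper_model)  # keyframe 10 base
```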

## Extraction Modes

| Mode | How it works | Best for |
|------|-------------|----------|
| `scene` | Detects scene changes via ffmpeg `select='gt(scene,0.3)'` | Most videos, varied content |
| `keyframe` | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| `interval` | Evenly spaced frames based on duration and max-frames | Fixed sampling, predictable output |

If `scene` mode detects no scene changes, it automatically falls back to `interval` mode.
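The interval calculation and the fallback can be illustrated roughly as follows. This is a sketch under stated assumptions, not the script's actual code: it places each frame at the midpoint of an equal slice of the duration, and it thins an over-long scene list by even subsampling. Both choices, and the function names, are hypothetical.

```python
def interval_timestamps(duration: float, max_frames: int) -> list[float]:
    """Evenly spaced timestamps: the midpoint of each of max_frames equal slices."""
    if duration <= 0 or max_frames <= 0:
        return []
    step = duration / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def choose_timestamps(scene_timestamps: list[float], duration: float,
                      max_frames: int) -> tuple[str, list[float]]:
    """Return (mode_used, timestamps), falling back to interval sampling
    when scene detection found nothing."""
    if not scene_timestamps:
        return "interval", interval_timestamps(duration, max_frames)
    if len(scene_timestamps) <= max_frames:
        return "scene", list(scene_timestamps)
    # Assumed trimming strategy: keep max_frames evenly chosen scene changes.
    n = len(scene_timestamps)
    kept = [scene_timestamps[i * n // max_frames] for i in range(max_frames)]
    return "scene", kept


print(choose_timestamps([], duration=20.0, max_frames=4))
# ('interval', [2.5, 7.5, 12.5, 17.5])
```

Midpoints avoid sampling the very first and last instants of a video, which are often black or title frames; sampling from `0` in fixed steps would be an equally valid reading of "evenly spaced".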

## Output

The script writes JSON to stdout, or to a file when `-o` is given. See `references/output-format.md` for the full schema.

```json
{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}
```
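One way to use this output is to line up each extracted frame with whatever was being said at that moment. The sketch below relies only on the fields shown in the example above; `caption_frames` is a hypothetical helper, and it assumes `transcript` may be missing or empty when transcription is skipped.

```python
import json


def caption_frames(result: dict) -> list[tuple[str, str, str]]:
    """Pair each frame with the transcript segment spoken at its timestamp.

    Returns (formatted_time, frame_path, text); text is "" when no segment
    covers the frame or when there is no transcript at all.
    """
    segments = result.get("transcript") or []
    rows = []
    for frame in result.get("frames", []):
        t = frame["timestamp"]
        spoken = next(
            (seg["text"] for seg in segments if seg["start"] <= t <= seg["end"]),
            "",
        )
        rows.append((frame["timestamp_formatted"], frame["path"], spoken))
    return rows


# Abridged result, same shape as the example output.
sample = json.loads("""
{
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ]
}
""")

for when, path, spoken in caption_frames(sample):
    print(when, path, spoken)
```

In practice the `result` dictionary would come from `json.load` on the file written with `-o result.json`.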
