wonda-cli

Using the Wonda CLI to generate images, videos, music, and audio from the terminal — plus LinkedIn, Reddit, and X/Twitter research and automation

INSTALLATION
npx skills add https://github.com/degausai/wonda --skill wonda-cli
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Wonda CLI

Wonda CLI is a content creation toolkit for terminal-based agents. Use it to generate images, videos, music, and audio; edit and compose media; publish to social platforms; and research/automate across LinkedIn, Reddit, and X/Twitter.

Install

If wonda is not found on PATH, install it first:

# npm

npm i -g @degausai/wonda

# Homebrew

brew tap degausai/tap && brew install wonda

Setup

  • Auth: wonda auth login (opens browser, recommended) or set WONDA_API_KEY env var
  • Verify: wonda auth check

Organizations & spend context

Wondercat orgs are shared wallets with their own seats and billing.

Members can spend from the org wallet (instead of their personal credits)

by switching context:

  • wonda organizations list (aliases: wonda orgs list, wonda org list) — see every org you belong to with your role and seat plan in each.
  • wonda use --org <slug> — sticky org context for this machine. Sets

X-Wonda-Org on every request; holds, charges, and wonda balance

route through the org wallet.

  • wonda use --personal — back to personal.

wonda topup always tops up your personal wallet, regardless of

context. Topping up the org wallet (and configuring auto top-up) is

admin-only and happens on the web at /organizations/<slug>. If a

member runs out of org credits, the error tells them to ask an admin or

switch back to personal — they cannot top up the org wallet from CLI.

Roles inside an org are separate from the seat plan:

  • Owner: the original creator. Cannot be demoted or kicked. Can transfer ownership to another member from the org page (rare).
  • Admin: can invite (single or bulk via paste), kick, change roles, change seats, top up, configure auto top-up, change monthly limits.
  • User: can only spend within the org wallet (subject to a per-member monthly limit if the admin set one).

A paid org seat (WONDA / WONDA_PREMIUM) grants the same paid feature access (skills, etc.) as a personal paid plan, but only while in org context. wonda use --personal falls back to the user's personal account plan.

Access tiers

Not all commands are available to every account type:

Tier

Access

Anonymous (temporary account, no login)

Media upload/download, editing (video/edit, image/edit, audio/edit), transcription, social publishing, scraping, analytics

Free (logged in, Basic/Free plan)

Everything above + generation (image/generate, video/generate, etc.), styles, recipes, brand

Paid (Plus, Pro, or Absolute plan)

Everything above + video analysis (requires credits), skill commands (wonda skill install/list/get)

Flagged (per-account PostHog flags)

wonda transitions (transitionsEnabled), wonda clipping (clippingEnabled). Flip the flag in PostHog for the account.

Local (no API call, no credits)

wonda brand extract <url> (no --save) extracts brand tokens from a URL via the bundled Patchright + Chromium driver. No auth required. Requires a one-time wonda stealth-browser install first.

If a command returns a 403 error, check your plan at https://app.wondercat.ai/settings/billing.

Social signups (Instagram, TikTok, etc.)

Drive them with the wonda device primitives + a throwaway mailbox from wonda email. The screenshot → decide → tap/type/swipe loop is how these flows work — there's no shortcut command, and that's fine: social apps change their UI constantly and any canned flow would drift faster than you could maintain it.

Standard loop:

  • wonda email account create --random → save {email, password}. Persist the resulting platform login with wonda credentials create --website instagram.com --username <handle> --email <email> --password-stdin <<< "<pw>" (passwords are AES-256-GCM encrypted at rest; retrieve later with wonda credentials get <id>).
  • wonda device create → pick a ready device (poll wonda device get <id> --fields status).
  • wonda device launch <device-id> com.instagram.android (or com.zhiliaoapp.musically for TikTok). Fall back to wonda device open-url if you'd rather start in the web flow.
  • Loop: wonda device screenshot <device-id> > s.json → decode the base64 PNG → read → pick an action → tap | type | swipe | key → screenshot again. Use --text "SomeButtonLabel" on tap before guessing coordinates; fall back to --x --y read off the screenshot for elements without matching text (number pickers, date spinners, etc.).
  • When the app sends a verification email, wonda email inbox wait <email> --timeout 120 — returns {codes: ["483921"], links: [...]} with the 6-digit code already extracted. wonda device type <device-id> --text "<code>" to feed it back. Race-safety: capture a timestamp before triggering the signup (SINCE=$(date -u +%FT%TZ)) and pass --since "$SINCE" — otherwise a fast mail server can land the email before your wait call and the old snapshot filters it out.
  • For number/date spinners: tap on the highlighted cell, Android pops up a numeric or alphabetic keyboard, wonda device type --text "<value>" replaces the selected text. wonda device key --code 4 dismisses the keyboard when done.

Consent-like taps — anything that accepts Terms/Privacy/Cookies, grants permissions, or publishes something. Before starting an automation that may hit these, ask the user once in chat whether to auto-accept them. If they say yes, tap through without pausing; if they say no, stop at each one and confirm. This does not apply to CAPTCHAs or "prove you're human" puzzles — always hand those off via wonda device stream (see next section).

Rate-limit signals — if the app shows you a visual puzzle ("we want to make sure you're a real person"), stop and hand off to the user with wonda device stream <id> (see next section). Don't click through puzzles yourself.

Voice cloning

Clone a voice from a 10s+ audio clip and use it in TTS. Hard limit: 20 cloned voices per account. Cost: $1.50 per clone.

# Clone from a local file (auto-uploads to media library first)

wonda voice create "Andu" --file ./sample.mp3 --description "My voice"

# Clone from existing wonda media

wonda voice create "Brand" --media-id <uuid>

# Optional source-audio preprocessing

wonda voice create "Clean" --file ./raw.wav --noise-reduction --normalize-volume

# List cloned voices (each row reports isExpired and expiresInDays)

wonda voice list

# One voice

wonda voice get <voice-id>

# Rename / re-describe (local only, no provider call)

wonda voice update <voice-id> --name "New Name" --description "..."

# Delete

wonda voice delete <voice-id>

Use a cloned voice in TTS by passing the providerVoiceId from voice get as voiceId to /audio/speech:

wonda audio speech "Hello world" \

  --model minimax-speech-2-8-hd \

  --params '{"voiceId":"<providerVoiceId>"}'

7-day expiry: cloned voices that haven't been used in TTS within 7 days are automatically expired. Running TTS with a cloned voice automatically refreshes its expiry. Idle voices that lapse must be re-cloned ($1.50 again).

Credentials vault

Persist logins created on external platforms (Instagram, TikTok, Twitter, etc.) so they can be reused on the next run. Passwords are AES-256-GCM encrypted with a server-side key and only decrypted on get.

# Create

wonda credentials create --website instagram.com --username myhandle \

  --email me@example.com --password-stdin <<< "hunter2" \

  --metadata '{"signup_source":"wonda-email"}'

# List (passwords omitted)

wonda credentials list --website instagram.com

# Get full record including decrypted password

wonda credentials get <id>

# Update any field (use --password-stdin to rotate; --username "" to clear)

wonda credentials update <id> --username newhandle

# Delete

wonda credentials delete <id>

# Fetch + record why you're using it in one call — POST, not GET, because

# it writes a 'used' event with the reason. Prefer this over `get` whenever

# you can articulate the reason.

wonda credentials use <id> --reason "instagram signup flow"

# See recent events (created / used / rotated / updated) for audit

wonda credentials events <id>

Fields: website (required — typed input like insta is canonicalized to instagram.com), username, email, password (required), metadata (arbitrary JSON). At least one of username / email must be present. Multiple records per (website, username) are allowed — dedupe on your side if you need to.

Event log: every credentials get/use, create, password rotate, and other updates are recorded as events on the credential (actor: cli | web | system). Use credentials events <id> or the web UI's history icon to audit. The event log is append-only and cascades on credential delete.

Handing off to a human

If automation hits a screen that requires a human to take over (consent flow you shouldn't auto-accept, ambiguous UI, step where the user prefers to act themselves), use wonda device stream <device-id> — returns a playerUrl signed with a short-lived JWT (1h). Give that URL to the user, they act in their own browser, and automation can resume afterward.

wonda device stream <device-id>

# → { "streamUrl": "wss://…", "playerUrl": "https://…", "deviceType": "social" }

Global output flags

All commands support these output control flags:

  • --json — Force JSON output (auto-enabled when stdout is piped)
  • --quiet — Only output the primary identifier (job ID, media ID, etc.) — ideal for scripting
  • -o <path> — Download output to file (implies --wait)
  • --fields status,outputs — Select specific JSON fields
  • --jq '.outputs[0].media.url' — Filter JSON output with a jq expression

How to think about content creation

You are a marketing director with access to a full production toolkit. Before touching any tool, think:

  • What product category? (beauty, food, tech, fashion, fitness, etc.)
  • What format performs for this category? (UGC memes for everyday products, cinematic for luxury, before/after for transformations, testimonial for services)
  • What's the hook? (relatable scenario, surprising twist, aspirational lifestyle, social proof)
  • What specific scene? (not "product on table" but "person discovering the product in a funny situation")

Decision flow

When asked to create content, follow this order:

Step 1: Gather context

wonda brand                                                    # Active brand: identity, colors, fonts, logos, products

wonda brand list                                               # All brands owned by this account/org

wonda brand show <brand-id>                                    # Specific brand

wonda brand extract https://stripe.com                         # Local-only: writes ./output/stripe.com/{DESIGN.md, tokens.json, assets/}

wonda brand extract https://stripe.com --save --make-active    # Local + persist + activate (the common path)

wonda brand extract https://stripe.com --save --name "Stripe"  # Persist with a custom name

wonda brand extract https://stripe.com --no-output --save      # Don't write to disk, persist only

wonda brand save                                               # Persist the most recent ./output/<domain>/ dir to the server

wonda brand save --from ./output/stripe.com --make-active

wonda brand pull <brand-id>                                    # Download a saved brand back to ./output/<domain>/

wonda brand activate <brand-id>                                # Set as the active brand

wonda brand upload-logo <brand-id> https://acme.com/logo.svg   # Attach a logo by URL (--variant wordmark|icon|dark|light)

wonda brand upload-font <brand-id> https://acme.com/Geist.woff2 --weight 700

wonda brand delete <brand-id>

wonda analytics instagram                                      # What content performs well

wonda scrape social --handle @competitor --platform instagram --wait  # Competitive research (if relevant)

# Cross-platform research (if relevant)

wonda x search "topic OR keyword"                              # Find conversations on X/Twitter

wonda x user-tweets @competitor                                # Competitor's recent tweets

wonda reddit search "topic" --sort top --time week             # Reddit discussions

wonda reddit feed marketing --sort hot                         # Subreddit trends

wonda linkedin search "topic" --type COMPANIES                 # LinkedIn company/people research

wonda linkedin profile competitor-vanity-name                  # LinkedIn profile intel

Step 2: Check content skills

Content skills are step-by-step guides for common content types. Each skill tells you exactly which models, prompts, and editing operations to use — and in what order. ALWAYS check skills before building from scratch.

wonda skill list                                # Browse all content skills

wonda skill get <slug>                          # Full step-by-step guide for a skill

Full skill index:

Slug

Description

Input

product-video

Product/scene video — prompt library for all categories

optional product image

ugc-talking

Talking-head UGC — single clip, two-angle PIP, or 20s+ with B-roll

optional reference

ugc-reaction-batch

Batch TikTok-native UGC reactions with viral strategy

optional product image

tiktok-ugc-pipeline

Scrape viral reel → generate 5 UGC → post as drafts

reel or TikTok URL

ugc-dance-motion

Dance/motion transfer

image + video

marketing-brain

Marketing strategy brain — hooks, visuals, ads

user brief

reddit-subreddit-intel

Scrape top posts, analyze virality, generate ideas

subreddit + product

twitter-influencer-search

Find X influencers and amplifiers

competitor/niche keywords

tiktok-slideshow-carousel

3-slide TikTok carousel — hook, bridge, product reveal

app screenshot + audience

creative-static-ads

Single-frame static ad images — 6 conversion pillars, 8 archetypes, 8 psychological hooks

product + optional image

ffmpeg

All local ffmpeg recipes — trim, audio swap, captions, social formats, scene split, silence cut, frame extraction, analysis artifacts

local video path or mediaId

image-edit

All image edit paths — img2img, background removal, crop, text overlay, vectorize

image mediaId or local path

If a skill matcheswonda skill get <slug>, read it, adapt to context, execute each step.

If no skill matches → build from scratch (Step 3).

Step 2.5: Decide whether finishing should be local

Not every media task should go back through Wonda editing. Use this routing rule:

  • Use wonda for AI generation, AI transcription/alignment, scraping, publishing, hosted transitions, and workflows that need media IDs or remote jobs.
  • Use local ffmpeg for deterministic transforms on files you already have or can download: trim, crop/scale/pad, concat (merging multiple clips), replace audio, extract audio/frame, reverse, normalize for delivery, burn captions, split scenes, cut silence, and build analysis artifacts. Always merge clips locally — server-side merge can hang for 30+ minutes once any input exceeds ~7MB.

When a task starts from a Wonda media ID but the actual edit is deterministic, move it to local files first:

wonda media download <mediaId> -o ./input.mp4

Before any local ffmpeg work:

which ffmpeg

which ffprobe

ffmpeg -version

ffprobe -v error -show_format -show_streams -of json ./input.mp4

Font rule for local caption/text work:

  • Prefer an explicit font file path over a family name.
  • Never assume a font exists. Check first with fc-match, fc-list, /System/Library/Fonts, /Library/Fonts, ~/Library/Fonts, or /usr/share/fonts.
  • If the task is mainly local finishing/captions/formatting/splitting/artifact extraction, check the ffmpeg skill before inventing commands.
  • wonda edit video renders locally for every single-video op (trim, crop, speed, volume, textOverlay, animatedCaptions with supplied captions, editAudio). No flag needed: the server returns a manifest, the CLI renders it, uploads the output, and finalizes the editor_job. Multi-video ops (overlay, splitScreen, splitScenes, motionDesign, merge) are not handled by wonda edit video. Build them locally with the ffmpeg skill instead (concat, overlay, hstack/vstack, scene assembly). Never mix per-clip audio then concat — concat the video tracks first, then layer the full voiceover or music track once over the joined timeline. Per-clip audio bakes create cut-line collisions and silent gaps.

Default local export target unless the user asked otherwise:

-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -movflags +faststart -c:a aac -b:a 192k

Always pass -y as the first flag so the command auto-overwrites the output. ffmpeg prompts interactively when the output path exists and agent shells hang on that prompt until timeout.

Step 3: Build from scratch (chain endpoints)

When no skill matches, chain individual CLI commands. Each step produces an output that feeds into the next.

Single asset:

wonda generate image --model gpt-image-2 --prompt "..." --aspect-ratio 9:16 --wait -o out.png

# --params '{"quality":"high"}' — auto/low/medium/high (default auto)

# --negative-prompt "..."       — override what to exclude (model-dependent)

# --seed <number>               — pin the seed for reproducible results (model-dependent)

wonda generate video --model seedance-2 --prompt "..." --duration 5 --params '{"quality":"high"}' --wait -o out.mp4

wonda generate text --model <model> --prompt "..." --wait

wonda generate music --model suno-music --prompt "upbeat lo-fi" --wait -o music.mp3

Audio (speech, transcription, dialogue):

# List available voices (TTS + dialogue use the same set)

wonda audio voices

# Text-to-speech

wonda audio speech --model elevenlabs-tts --prompt "Your script here" \

  --params '{"voiceId":"hpp4J3VqNfWAUOO0d1Us"}' --wait -o speech.mp3

# elevenlabs-tts always requires a voiceId — pick one from `wonda audio voices`

# Transcribe audio/video to text

wonda audio transcribe --model elevenlabs-stt --attach $MEDIA --wait

# Multi-speaker dialogue (each speaker needs a voiceId from `wonda audio voices`)

wonda audio dialogue --model elevenlabs-dialogue \

  --prompt 'ALICE: Hi! BOB: Hello!' \

  --params '{"speakers":[{"label":"ALICE","voiceId":"hpp4J3VqNfWAUOO0d1Us"},{"label":"BOB","voiceId":"IKne3meq5aSn9XLyUdCD"}]}' \

  --wait -o dialogue.mp3

Audio AI operations (direct-inference, NOT editor ops):

# Denoise / dereverberate speech

wonda audio enhance --model replicate-resemble-enhance --attach $MEDIA \

  --params '{"denoise":true,"chunkSeconds":10}' --wait -o enhanced.wav

# Split a track into voice and instrumental stems

wonda audio extract-voice --model replicate-demucs --attach $MEDIA \

  --wait -o vocals.wav

DO NOT use wonda edit video --operation enhanceAudio or --operation voiceExtractor — those paths are deprecated. They still work but emit a warning, and they route through the heavier editor_job pipeline for no functional reason.

Add animated captions to a video:

The animatedCaptions operation handles everything in one step — it extracts audio, transcribes for word-level timing, and renders animated word-by-word captions onto the video.

# Generate a video with speech audio

VID_JOB=$(wonda generate video --model seedance-2 --prompt "..." --duration 5 --aspect-ratio 9:16 --params '{"quality":"high"}' --wait --quiet)

VID_MEDIA=$(wonda jobs get inference $VID_JOB --jq '.outputs[0].media.mediaId')

# Add animated captions (single step)

wonda edit video --operation animatedCaptions --media $VID_MEDIA \

  --params '{"fontFamily":"TikTok Sans SemiCondensed","position":"bottom-center","sizePercent":80,"strokeWidth":2.5,"fontSizeScale":0.8,"highlightColor":"rgb(252, 61, 61)"}' \

  --wait -o final.mp4

The video's original audio is preserved. Do NOT replace the audio with TTS — Sora already generated the speech.

**Alternative engine: --captions-engine ffmpeg.**

Use when the user wants the typewriter look or an opaque/rounded chyron behind the active word. Plain brew install ffmpeg is enough. This path is CLI-only today (it does not go through editor_job, so credits are not charged for the render).

# progressive (default for ffmpeg engine) — cumulative reveal,

# optional rounded pill behind the active word via highlightColor.

wonda edit video --operation animatedCaptions \

  --captions-engine ffmpeg --captions-preset progressive \

  --media $VID_MEDIA \

  --caption-segments "$(echo "$STT_OUT" | jq -c '.outputs[] | select(.outputKey=="wordTimestamps") | .outputValue | map({text: .word, startS: .start})')" \

  --params '{"fontFamily":"TikTok Sans","textColor":"#FFFFFF","strokeColor":"#000000","strokeWidth":3,"fontSizeScale":1.1,"paddingBottom":25,"highlightColor":"#FF3D3D","backgroundBorderRadius":18}' \

  -o final.mp4

# typewriter — letters appear one at a time at constant interval (60ms/char)

# with a square white caret. Pass plain white text (no background).

wonda edit video --operation animatedCaptions \

  --captions-engine ffmpeg --captions-preset typewriter \

  --media $VID_MEDIA \

  --caption-segments "$STT_WORD_TIMESTAMPS" \

  --params '{"fontFamily":"TikTok Sans","textColor":"#FFFFFF","fontSizeScale":1.1,"paddingBottom":12}' \

  -o final.mp4

Fonts are bundled into the binary, so the standard fontFamily values (TikTok Sans variants, Nohemi, Comic Cat, Gavency) work out of the box with no extra setup. --fonts-dir is an optional override for power users who want to bring their own font collection: when set, the renderer searches that directory first and only falls back to the bundled set if it doesn't find a match.

Vertical placement is controlled by paddingBottom (a percentage of canvas height, distance from canvas bottom to the caption's bottom edge). Sensible values: 12 for traditional bottom-of-frame subtitles, 25 for the TikTok 3/4-from-top sweet spot, 35 for visibly mid-bottom. paddingTop does the same when position starts with top-*. Without these, captions snap to the very edge of the canvas.

Transitions (effects pipelines on a single video):

wonda transitions presets                            # List built-in presets (JSON)

wonda transitions operations                         # Grouped by category (analysis/effect/...)

wonda transitions operations --json                  # Full per-param metadata

wonda transitions llms                               # Full reference (presets + ops + dependencies)

wonda transitions run --media $VID --preset flash_glow --wait -o out.mp4

# Or send an agent-generated timeline of clips (inline JSON):

wonda transitions run --media $VID \

  --clips '[{"layer_type":"video","start_frame":0,"end_frame":60}]' --wait -o out.mp4

# Or from a file (handy for long agent timelines):

wonda transitions run --media $VID --clips ./timeline.json --wait -o out.mp4

# To attach scene_transitions: pass an envelope (clips + scene_transitions)

# instead of a bare clip array — same file, both fields forwarded.

wonda transitions run --media $VID --clips ./timeline_with_transitions.json --wait -o out.mp4

# where timeline_with_transitions.json is:

#   { "clips": [...],

#     "scene_transitions": [{"name":"crossfade","params":{"duration":8},"boundaries":[60]}] }

wonda transitions job <jobId>                        # Poll a transition job

Use exactly one of --preset or --clips. Requires a full (logged-in) account. **Always read wonda transitions llms first when composing a clips timeline.** It documents the detect/segment/effect dependencies, which ops need masks, and the full clip-spec shape (layer types, tracks, effects, transforms).

**Preset variables (variables block).** Each preset declares the template variables it accepts under variables in wonda transitions presets. Each entry has name, description, and required. Required variables MUST be supplied or the job is rejected with a 400 — no more silent skipping. Pass them with --var name=value (repeatable) or, for the common prompt case, the --prompt shortcut:

# flash_glow_prompted requires { prompt }

wonda transitions run --media $VID --preset flash_glow_prompted \

  --prompt "woman in white dress" --wait -o out.mp4

# text_behind_person requires { prompt, text }

wonda transitions run --media $VID --preset text_behind_person \

  --var prompt="the person" --var text="HELLO WORLD" --wait -o out.mp4

# Numeric-typed vars: bare digits are decoded as numbers, "true"/"false" as

# bools, everything else stays a string. Presets that compare frame indices

# numerically (border_frame, marquee_text, quick_motion_text, bg_remove_scale)

# need this — quoting an int turns it back into a string.

wonda transitions run --media $VID --preset border_frame \

  --var exit_start_frame=200 --var exit_end_frame=251 --wait -o out.mp4

The prompt variable is a detection text query describing which subject to mask, fed to SAM3 to produce per-frame segmentation masks. Not a content-generation prompt.

Building a custom --clips timeline that needs detection masks? Add a clip with layer_type: "video" and a mask: {layer_type: "mask", analysis_steps: [{name: segment, params: {prompt: "..."}}]}. SAM3 handles both detection and segmentation in one step from the prompt, so no separate detect step is needed.

Pre-warming masks before render (recommended)

For presets with mask:<label> variables, run wonda transitions ensure-masks first so the render starts with masks already prepared. The first call for a (media, label) pair takes 1-3 minutes; subsequent calls are near-instant.

# 1. Ensure masks are prepared for the labels you'll use, blocking until ready.

wonda transitions ensure-masks --media $VID --labels person,phone --wait

# 2. Run the render. Masks are already prepared.

wonda transitions run --media $VID --preset slide_reflect_background \

  --var "masks=mask:person+phone" --wait -o out.mp4

ensure-masks flags:

  • --media MEDIA_ID — required, the video the masks are for
  • --label NAME — repeatable, one label per call (--label person --label phone)
  • --labels NAME,NAME — comma-separated alternative (--labels person,phone)
  • --wait — block until every label is prepared
  • --timeout DUR — cap wait time when --wait is set (default 10m)

Multi-prompt syntax: mask:woman+phone in --var is split into separate masks (woman, phone) and unioned per-frame. Pass each sub-label separately to ensure-masks so all of them are pre-warmed.

When to skip ensure-masks:

  • Non-mask presets (no mask:<label> variables) — nothing to prepare
  • A previous render already used these (media, labels) — already prepared

When ensure-masks matters most:

  • First render of a new media with mask-based presets
  • Iterating params on a render — pre-warm once, then run as many times as you want without re-preparing

**Multi-scene presets (requiresMultiScene: true).** Some presets use scene-aware logic and expect a video with multiple cuts/scenes. Check requiresMultiScene in wonda transitions presets. If true, feeding a single continuous shot will produce only one scene and the effect may look underwhelming. Combine clips first or use a video with natural cuts.

Tweaking preset params. Every preset is clip-shape. Pull a single preset with wonda transitions preset <name> --json, read its clips: (single-track) or tracks: (multi-track) field, edit any clip param, and submit as --clips. For multi-track presets, flatten by giving each clip a track index drawn from the track it came from. If the preset declares sceneTransitions:, pass that array through unchanged on the request.

# Single-track preset (e.g. flash_glow_montage): copy clips: directly

wonda transitions preset flash_glow_montage --json | jq '.preset.clips' > clips.json

# edit clips.json

wonda transitions run --media $VID --clips "$(cat clips.json)" --wait -o out.mp4

**Auto-repair safety net (--auto-repair, --face-bbox).** For --clips renders the worker runs a deterministic repair pass on the submitted JSON before rendering, default on. Repairs: width-fit font clamp, descender clamp against canvas bottom, stack-spacing snap (ROW1_py from cap-height formula), keyframe-bound clamp to [0, source_duration], same-y-row caption overlap trim, mask full-duration extension, stroke-width zeroing, letter-spacing target snap per font, mask-cutout duration extension, negative-start clamp, and (with --face-bbox) face-overlap caption shift. Pass --auto-repair=false for strict validation; out-of-spec values then surface as render errors.

# Push body captions off the speaker's face. bbox is x1,y1,x2,y2 in canvas pixels (top-left origin).

wonda transitions run --media $VID --clips ./timeline.json \

  --face-bbox 200,160,520,520 --wait -o out.mp4

# Strict mode — disable auto-repair to see exactly which clips fail validation.

wonda transitions run --media $VID --clips ./timeline.json \

  --auto-repair=false --wait -o out.mp4

--face-bbox only shifts body captions. Decorative text you want behind the speaker still routes through an explicit mask_cutout {prompt: "person"} clip.

Output URL paths differ by job type:

  • Inference jobs (generate, audio): .outputs[0].media.url and .outputs[0].media.mediaId
  • Editor jobs (edit): .outputs[0].url and .outputs[0].mediaId

Model waterfall

Image

Default: gpt-image-2. OpenAI's flagship — strongest prompt adherence, best text-in-image, high-fidelity edits via reference images. Handles 1-4 reference images. Quality tiers: auto (default), low, medium, high — pass via --params '{"quality":"high"}'. Caps at 1536px output.

For img2img editing specifically (change, add/remove, restyle, bg-remove, crop, text overlay, vectorize), use wonda skill get image-edit — it has the full edit-specific decision tree.

Pick something else only when one of these applies:

  • User explicitly requests another model
  • More than 4 reference imagesnano-banana-2 (gpt-image-2 caps at 4 refs; nano-banana-2 accepts up to 14). For 1-4 refs, stay on gpt-image-2.
  • Need vector output → runware-vectorize
  • Need background removal → birefnet-bg-removal
  • Cheapest possible / fastest drafts → z-image
  • Need >1536px / true 4K output → nano-banana-pro (1K/2K/4K) or nano-banana-2 (1K/2K/4K). gpt-image-2 caps at 1536px.
  • gpt-image-2 unavailable / OpenAI down → nano-banana-2 or seedream-4-5 or grok-imagine-pro

Video

Default: seedance-2 (duration 5/10/15s, default 5s, quality: high). Escalation:

  • Quality complaint or different style → sora2 or sora2pro
  • Max single-clip duration is 15s for Seedance 2, 20s for Sora → for longer content, stitch multiple clips via merge
  • Veo (veo3_1, veo3_1-fast) is available but NOT in the default waterfall. Only pick Veo when the user explicitly asks for Veo by name.

Image-to-video routing (MANDATORY when attaching a reference image):

  • Person/face visible in the reference image → MUST use kling_3_pro (preserves identity better for faces)
  • No person in reference image → use seedance-2
  • Text-to-video (no reference image): Seedance 2 generates people fine. This rule ONLY applies when you --attach an image.

Kling model family:

  • kling_3_pro — Text-to-video and image-to-video, supports start/end images, custom elements (@Element1, @Element2), 3-15s duration, 16:9/9:16/1:1
  • kling_2_6_pro — General purpose, 5-10s, 16:9/9:16/1:1, text-to-video and image-to-video
  • kling_2_6_motion_control — Motion transfer: requires both a reference image AND a reference video, recreates the video's motion with the image's appearance
  • kling2_5-pro — Budget Kling option, 5-10s, supports first/last frame images

Kling prompt rules (important): Kling's prompt field caps at 2,500 characters and Kling responds poorly to Sora-style structured briefs (SCENE: / SUBJECT: / MOTION: / BANNED LOOK: section headers). In that format Kling latches onto atmosphere nouns and silently drops the central subject (verified empirically: the same 2,842-char Sora-style prompt that rendered correctly on Sora 2 Pro and Seedance 2 produced no phone at all on Kling — even when trimmed to 2,250 chars). When escalating Seedance → Kling, or targeting Kling directly, rewrite the prompt as short natural-language prose (~1,000–1,500 chars) and lead with the hero subject in the opening sentence rather than burying it inside a SUBJECT: block. Do NOT pass a Sora-formatted prompt through to Kling unchanged.

Other video models:

  • grok-imagine-video — xAI video generation, 5-15s, supports 7 aspect ratios including 4:3 and 3:2
  • topaz-video-upscale — Upscale video resolution (1-4x factor, supports fps conversion)
  • sync-lipsync-v2-pro — Legacy lipsync for user-supplied video + audio pairs. Inferior to native-audio generation and almost never the right choice for new content. See the "Lip sync" section for rules.

Seedance family (DEFAULT video model, watermarks automatically removed):

  • seedance-2 — Base Seedance 2.0 (T2V/I2V, 5-15s, high=standard/basic=fast)
  • seedance-2-omni — Multi-reference generation (images, audio refs)
  • seedance-2-video-edit — Edit existing video via text prompt

Video durations: Accepted --duration values vary by model. Check with wonda capabilities or wonda models info <slug>.

Audio

  • Music: suno-music (set --params '{"instrumental":true}' for no vocals)
  • Text-to-speech: elevenlabs-tts — only for explicit narrator/voice-over asks over silent footage. Do NOT use to "make a UGC character talk" — Sora / Sora 2 Pro / Veo 3.1 / Kling 3 / Seedance 2 generate native synced speech in any language, which looks and sounds far better. Always set voiceId in params. Default female voice: --params '{"voiceId":"21m00Tcm4TlvDq8ikWAM"}' (Rachel).
  • Transcription: elevenlabs-stt
  • Multi-speaker dialogue: elevenlabs-dialogue
  • Enhance audio (clean up noisy speech): replicate-resemble-enhance via wonda audio enhance — denoise + dereverberate. Use when a voice recording sounds muffled, echoey, or has background noise. NOT a general "sounds better" button; if the source is already clean this can soften it.
  • Extract voice (isolate vocals / split stems): replicate-demucs via wonda audio extract-voice — splits into voice and instrumental tracks. Use to pull a speaker or singer off a track, or to isolate the music behind a vocal.

Native synced speech (preferred over TTS + lipsync): Sora, Sora 2 Pro, Veo 3.1, Kling 3, and Seedance 2 all generate dialogue in any language directly inside the video, with mouth movements baked in. Put the line (and language) in the video model's --prompt. Never chain elevenlabs-ttssync-lipsync-v2-pro to fake speech over a silent generation.

Characters

Characters are reusable saved combos (image + optional voice audio) you can mention in prompts with @name. The server auto-injects the image, optional face video, and audio into the right slots for the selected model. Works on Kling 3 Pro (start_image + element_1 + voice_audio) and Seedance 2 Omni (ref_image_1 + ref_video_1 + ref_audio_1). Name rules: must start with a letter, 1–31 chars, alphanumeric + _/-.

Provider gotchas (Seedance 2 Omni): when a character is mentioned, the API routes Seedance to MuAPI automatically. Replicate enforces a 15s ref_audio_1 cap and rejects famous-celebrity refs with E005 — input flagged as sensitive. MuAPI is the reliable path for character-driven jobs. Even on MuAPI, top-tier celebrity refs (think Sydney Sweeney, Leonardo DiCaprio) are blocked with "Face detected in uploaded image. Please use an image without real people." Non-celebrity faces and lesser-known public figures pass cleanly. If you see that error on a real-person ref, use Kling 3 Pro instead (its character pipeline runs voice cloning server-side, so the raw face audio never touches a moderation classifier).

From a Kling clip — extract a frame + voice from a generation you like:

VID=$(wonda generate video --model kling_3_pro --prompt "young man, grey tshirt, talking to camera" --wait --quiet)

VID_MEDIA=$(wonda jobs get inference $VID --jq '.outputs[0].media.mediaId')

wonda character from-media alex --source $VID_MEDIA --frame-ms 2500

wonda generate video --model kling_3_pro --prompt "@alex welcomes viewers to the channel" --wait -o alex-welcome.mp4

From scratch — generate a portrait and a TTS sample, then bind them:

IMG=$(wonda generate image --model nano-banana-2 --prompt "young woman, studio portrait" --wait --quiet)

IMG_MEDIA=$(wonda jobs get inference $IMG --jq '.outputs[0].media.mediaId')

AUD=$(wonda audio speech --model elevenlabs-tts --prompt "Hi, this is me" --params '{"voiceId":"21m00Tcm4TlvDq8ikWAM"}' --wait --quiet)

AUD_MEDIA=$(wonda jobs get inference $AUD --jq '.outputs[0].media.mediaId')

wonda character create maya --image $IMG_MEDIA --audio $AUD_MEDIA

List / inspect / update / delete: wonda character list, wonda character get <name>, wonda character update <name> --audio $NEW, wonda character delete <name>. Only one character with audio can be referenced per generation.

Prompt writing rules

Follow this waterfall top-to-bottom. Use the FIRST matching rule and stop.

-

PASSTHROUGH — If the user says "use my exact prompt" / "verbatim" / "no enhancements" → copy their words exactly. Zero modifications.

-

IMAGE-TO-VIDEO — When a source image feeds into a video model, describe MOTION ONLY. The model can see the image. Do NOT describe the image content.

  • Good: "gentle breathing motion, camera slowly pushes in, atmospheric lighting shifts"
  • Bad: "Two cats on a lavender background breathing softly" (describes the image)

-

EMPTY PROMPT (from scratch) — Use the user's exact request as the prompt. Do NOT add style descriptors, lighting, composition, or mood.

  • User says "create an image of a cat with sunglasses" → prompt: "create an image of a cat with sunglasses"
  • Do NOT enhance to "A playful orange tabby wearing oversized reflective sunglasses, studio lighting, shallow depth of field"

-

NON-EMPTY PROMPT (adapting a template) — Keep the structure and style, only swap content to match the user's request. Keep prompts literal and constraint-heavy.

Aspect ratio rules

Three cases, no exceptions:

  • User specifies a ratio → use it: --aspect-ratio 16:9
  • User doesn't mention ratio → explicitly set --aspect-ratio 9:16 for social content (UGC, TikTok, Reels, Stories). Portrait is the default for any social/marketing video.
  • Editing existing media → use --aspect-ratio auto to preserve source dimensions

UGC and social content is ALWAYS portrait (9:16). If someone asks for a TikTok, Reel, Story, or UGC video, always use --aspect-ratio 9:16. Landscape is only for YouTube, presentations, or when explicitly requested.

Square (1:1) is supported by all Kling models and some image models — use for Instagram feed posts when requested.

Common chaining patterns

These patterns show how to compose multi-step pipelines by chaining CLI commands. Each step's output feeds into the next.

No need to download and re-upload between steps. Every generation and edit

produces a media ID in its output. Pass that ID directly to the next command

via --media or --audio-media. Use --jq '.outputs[0].media.mediaId'

for inference jobs and --jq '.outputs[0].mediaId' for editor jobs.

Only use -o <file> on the FINAL step to download the finished output.

Animate an image to video

MEDIA=$(wonda media upload ./product.jpg --quiet)

# No person in image → Seedance 2

wonda generate video --model seedance-2 --prompt "camera slowly pushes in, product rotates" \

  --attach $MEDIA --duration 5 --params '{"quality":"high"}' --wait -o animated.mp4

# Person in image → Kling (ONLY when attaching a reference image with a person)

wonda generate video --model kling_3_pro --prompt "the person turns and smiles" \

  --attach $MEDIA --duration 5 --wait -o person.mp4

Replace audio on a video (TTS voiceover or music)

# Generate TTS

TTS_JOB=$(wonda audio speech --model elevenlabs-tts --prompt "The script" \

  --params '{"voiceId":"21m00Tcm4TlvDq8ikWAM"}' --wait --quiet)

TTS_MEDIA=$(wonda jobs get inference $TTS_JOB --jq '.outputs[0].media.mediaId')

# Mix onto video (mute original, full voiceover)

wonda edit video --operation editAudio --media $VID_MEDIA --audio-media $TTS_MEDIA \

  --params '{"videoVolume":0,"audioVolume":100}' --wait -o with-voice.mp4

Only use this when you need to REPLACE the video's audio. Sora, Sora 2 Pro, Veo 3.1, Kling 3, and Seedance 2 all generate native synced speech in any language — don't replace it with TTS unless the user explicitly asks for a different voiceover. Never reach for this step to "add speech" to a UGC/talking-head clip; put the dialogue in the video model's prompt instead.

Add static text overlay

Static overlays (meme text, "chat did i cook", etc.) use smaller font sizes than captions. They're ambient, not meant to dominate the frame.

wonda edit video --operation textOverlay --media $VID_MEDIA \

  --prompt-text "chat, did i cook" \

  --params '{"fontFamily":"TikTok Sans SemiCondensed","position":"top-center","sizePercent":66,"fontSizeScale":0.5,"strokeWidth":4.5,"paddingTop":10}' \

  --wait -o with-text.mp4

Featured textOverlay + animatedCaptions presets. wonda edit {video,image,audio} accepts --preset <name> (scoped to --operation). --params fields override preset values on key collisions.

textOverlay (static, top-centered):

  • TikTok White Highlight — black text on a slightly rounded white box.
  • TikTok Black Highlight — white text on a slightly rounded black box.
  • TikTok Red Highlight — white text on a slightly rounded red (#E14135) box.

animatedCaptions (STT-driven, bottom-centered):

  • TikTok White Captions — black text, white highlight on the active word.
  • TikTok Black Captions — white text, black highlight on the active word.
  • TikTok Red Captions — white text, red (#E14135) highlight on the active word.
wonda edit video --operation textOverlay \

  --preset "TikTok Red Highlight" --media <id> \

  --params '{"text":"YOUR HEADLINE"}' --wait -o ./out.mp4

Image textOverlay renders server-side; video textOverlay renders locally. No flag needed in either case.

Font sizing guide:

  • Static overlays: sizePercent: 66, fontSizeScale: 0.5, strokeWidth: 4.5
  • Animated captions: sizePercent: 80, fontSizeScale: 0.8, strokeWidth: 2.5, highlightColor: rgb(252, 61, 61)
  • Font: TikTok Sans SemiCondensed for both

Add animated captions (word-by-word with timing)

The animatedCaptions operation extracts audio, transcribes, and renders animated word-by-word captions — all in one step.

wonda edit video --operation animatedCaptions --media $VIDEO_MEDIA \

  --params '{"fontFamily":"TikTok Sans SemiCondensed","position":"bottom-center","sizePercent":80,"strokeWidth":2.5,"fontSizeScale":0.8,"highlightColor":"rgb(252, 61, 61)"}' \

  --wait -o with-captions.mp4

For quick static captions (no timing, just text on screen), use textOverlay with --prompt-text:

wonda edit video --operation textOverlay --media $VIDEO_MEDIA \

  --prompt-text "Summer Sale - 50% Off" \

  --params '{"fontFamily":"TikTok Sans SemiCondensed","position":"bottom-center","sizePercent":80}' \

  --wait -o captioned.mp4

Add background music

MUSIC_JOB=$(wonda generate music --model suno-music \

  --prompt "upbeat lo-fi hip hop, warm vinyl crackle" --wait --quiet)

MUSIC_MEDIA=$(wonda jobs get inference $MUSIC_JOB --jq '.outputs[0].media.mediaId')

wonda edit video --operation editAudio --media $VID_MEDIA --audio-media $MUSIC_MEDIA \

  --params '{"videoVolume":100,"audioVolume":30}' --wait -o with-music.mp4

Editor output chaining

When chaining multiple editor operations (e.g., editAudio → animatedCaptions → textOverlay), extract the media ID from each editor job output and pass it to the next step. Note the jq path differs from inference jobs:

# Inference jobs: .outputs[0].media.mediaId

# Editor jobs:    .outputs[0].mediaId

EDIT_JOB=$(wonda edit video --operation editAudio --media $VID --audio-media $AUDIO \

  --params '{"videoVolume":0,"audioVolume":100}' --wait --quiet)

STEP1_MEDIA=$(wonda jobs get editor $EDIT_JOB --jq '.outputs[0].mediaId')

CAP_JOB=$(wonda edit video --operation animatedCaptions --media $STEP1_MEDIA \

  --params '{"fontFamily":"TikTok Sans SemiCondensed","position":"bottom-center","sizePercent":80,"strokeWidth":2.5,"fontSizeScale":0.8,"highlightColor":"rgb(252, 61, 61)"}' --wait --quiet)

STEP2_MEDIA=$(wonda jobs get editor $CAP_JOB --jq '.outputs[0].mediaId')

wonda edit video --operation textOverlay --media $STEP2_MEDIA \

  --prompt-text "Hook text" --params '{"position":"top-center","fontFamily":"TikTok Sans SemiCondensed","sizePercent":66,"fontSizeScale":0.5,"strokeWidth":4.5}' --wait -o final.mp4

Merge multiple clips

Always merge locally with ffmpeg. Server-side merge (wonda edit video --operation merge) can hang for 30+ minutes once any input exceeds ~7MB.

Download every Wonda media ID, then concat. Stream-copy is fast but requires matching codec/profile/resolution; fall back to re-encode if it errors:

wonda media download $CLIP1 -o /tmp/clip-1.mp4

wonda media download $CLIP2 -o /tmp/clip-2.mp4

wonda media download $CLIP3 -o /tmp/clip-3.mp4

cat > /tmp/concat.txt <<EOF

file '/tmp/clip-1.mp4'

file '/tmp/clip-2.mp4'

file '/tmp/clip-3.mp4'

EOF

ffmpeg -y -f concat -safe 0 -i /tmp/concat.txt -c copy /tmp/merged.mp4

# If stream-copy fails, re-encode:

# ffmpeg -y -f concat -safe 0 -i /tmp/concat.txt \

#   -c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -movflags +faststart \

#   -c:a aac -b:a 192k /tmp/merged.mp4

# Re-upload only if a downstream wonda step needs the mediaId.

MERGED_MEDIA=$(wonda media upload /tmp/merged.mp4 --quiet)

File order in concat.txt = playback order. See the ffmpeg skill for the full concat reference.

Split scenes / keep a specific scene

Two modes — pick by intent:

# Keep a specific scene (split mode) — splits into scenes, auto-selects one

wonda edit video --operation splitScenes --media $VID_MEDIA \

  --params '{"mode":"split","threshold":0.5,"minClipDuration":2,"outputSelection":"last"}' \

  --wait -o last-scene.mp4

# outputSelection: "first", "last", or 1-indexed number (e.g. 2 for second scene)

# Remove a scene (omit mode) — removes one scene, merges the rest

wonda edit video --operation splitScenes --media $VID_MEDIA \

  --params '{"mode":"omit","threshold":0.5,"minClipDuration":2,"outputSelection":"first"}' \

  --wait -o without-first.mp4

# outputSelection: which scene to REMOVE

Use omit mode for "remove frozen first frame" (common with Sora videos). Use split mode for "keep just scene X".

Image editing

Any image edit — img2img, background removal, crop, text overlay, vectorize — has its own skill with the full decision tree, aspect-ratio rules, and model waterfall for edits:

wonda skill get image-edit

One gotcha worth keeping here: image and video background removal use different models (birefnet-bg-removal vs bria-video-background-removal). Never swap them.

Lip sync (last-resort fallback — prefer native-audio video models)

Sora, Sora 2 Pro, Veo 3.1, Kling 3, and Seedance 2 all generate speech in any language with correctly synced mouth movements as part of the video itself. That path produces dramatically better results than sync-lipsync-v2-pro: better lip physics, better lighting, better costs, and no second inference round-trip. For any talking UGC, ad, or spokesperson video, put the dialogue directly in the video model's prompt — do not chain TTS + lipsync.

Only reach for sync-lipsync-v2-pro when the user EXPLICITLY supplies both a pre-existing video and a pre-existing audio clip and asks you to align the mouth to that audio. If a user asks for lipsync as the default method of making a character speak, push back: the native-audio video models are the better tool and work in any language.

wonda generate video --model sync-lipsync-v2-pro --attach $VIDEO_MEDIA,$AUDIO_MEDIA --wait -o synced.mp4

Video upscale

wonda generate video --model topaz-video-upscale --attach $VIDEO_MEDIA \

  --params '{"upscaleFactor":2}' --wait -o upscaled.mp4

Clipping (longform → vertical shorts)

wonda clipping takes a long video (podcast, interview, talking-head)

and produces short vertical clips. Selection is LLM-driven and supports

a natural-language --brief so you can ask for specific moments instead

of generic virality.

V1 renders 9:16 with face-tracked reframe (LR-ASD active-speaker

detection + One-Euro stabilizer, default) and the existing

animatedCaptions op + a top-third hook overlay per clip. Pass

--reframe blur-fill to keep the full landscape source inside a

vertical canvas with a blurred background instead.

Async: POST /api/v1/clipping returns a clippingJobId; the CLI polls

GET /api/v1/clipping/jobs/{id} under --wait. Pass --output <dir>

and the CLI downloads each rendered clip + a plan.json.

Auth: requires the clippingEnabled PostHog feature flag in prod; local

dev bypasses automatically.

**Source — never pass YouTube URLs to --url.** The flag exists on the

CLI but the underlying --url flow shells out to yt-dlp on the

video-worker container (Cloud Run / GCP datacenter IP). YouTube

blocks datacenter IPs with the "Sign in to confirm you're not a bot"

challenge and the worker has no cookie store, so YouTube ingest fails

at progress 0.05 with that error and the LLM hold has to be released.

For YouTube, always download locally and upload first:

yt-dlp -o /tmp/source.mp4 \

  -f "bv*[ext=mp4][height<=720]+ba[ext=m4a]/b[ext=mp4][height<=720]" \

  --merge-output-format mp4 "<youtube-url>"

MEDIA=$(wonda media upload /tmp/source.mp4 --quiet)

--url is fine for direct mp4 URLs (no JS, no anti-bot cookies).

# Plan only — fast, no render

wonda clipping --media $MEDIA --brief "the most controversial moments" --dry-run --wait

# Full pipeline: select + render + download

wonda clipping --media $MEDIA \

  --brief "the most controversial moments" \

  --caption-preset "TikTok Red Captions" \

  --hook auto \

  --wait --output ./clips/

# Filter by speaker (uses ElevenLabs diarization labels)

wonda clipping --media $MEDIA --speaker SPEAKER_00 --wait --output ./clips/

# Speaker rename for readable rationales

wonda clipping --media $MEDIA --speaker Joe \

  --speaker-map '{"SPEAKER_00":"Joe","SPEAKER_01":"Guest"}' --wait --output ./clips/

# Tune count and durations — pick a target length with a tolerance

wonda clipping --media $MEDIA --brief "punchy one-liners" \

  --count 5 --duration 20 --tolerance 5 --wait --output ./clips/

# Or specify an explicit min/max range instead (mutually exclusive

# with --duration/--tolerance)

wonda clipping --media $MEDIA --brief "punchy one-liners" \

  --count 5 --min-duration 8 --max-duration 30 --wait --output ./clips/

# Auto-pick FX preset per clip from a catalog

wonda clipping --media $MEDIA --auto-preset \

  --preset-catalog '[{"slug":"flash_glow","description":"glow + scene flash"},{"slug":"text_glow","description":"per-word text glow"}]' \

  --wait --output ./clips/

Job-status shape (returned by GET /api/v1/clipping/jobs/{id}):

{

  "clippingJobId": "...",

  "status": "succeeded",

  "stage": "succeeded",

  "progress": 1,

  "plan": {

    "sourceDurationSec": 1800.5,

    "speakers": ["SPEAKER_00", "SPEAKER_01"],

    "clips": [

      {

        "start": 12.4,

        "end": 38.7,

        "title": "Why he quit the agency",

        "hookText": "He admits…",

        "rationale": "Concedes \"the agency model is dead\" then explains why...",

        "score": 87,

        "dominantSpeaker": "SPEAKER_00",

        "reframeMode": "blur-fill",

        "preset": null,

        "mediaId": "uuid-of-rendered-clip",

        "url": "https://storage.googleapis.com/.../clip.mp4"

      }

    ]

  },

  "error": null

}

Editor operations reference

Operation

Inputs

Key Params

animatedCaptions

video_0

fontFamily, position, sizePercent, fontSizeScale, strokeWidth, highlightColor

textOverlay

video_0 + prompt

fontFamily, position, sizePercent, fontSizeScale, strokeWidth

editAudio

video_0 + audio_0

videoVolume (0-100), audioVolume (0-100)

merge

video_0..video_4

Handle order = playback order

overlay

video_0 (bg) + video_1 (fg)

position, resizePercent

splitScreen

video_0 + video_1

targetAspectRatio (16:9 or 9:16)

trim

video_0

trimStartMs, trimEndMs (milliseconds)

splitScenes

video_0

mode (split/omit), threshold, outputSelection

speed

video_0

speed (multiplier: 2 = 2x faster)

extractAudio

video_0

Extracts audio track

reverseVideo

video_0

Plays backwards

skipSilence

video_0

maxSilenceDuration (default 0.03)

imageCrop

video_0

aspectRatio

textOverlay

video_0 (image)

Same as video textOverlay — works on images, outputs image (png/jpg)

Valid textOverlay fonts: Inter, Montserrat, Bebas Neue, Oswald, TikTok Sans, TikTok Sans Condensed, TikTok Sans SemiCondensed, TikTok Sans SemiExpanded, TikTok Sans Expanded, TikTok Sans ExtraExpanded, Nohemi, Poppins, Raleway, Anton, Comic Cat, Gavency

Valid positions: top-left, top-center, top-right, center-left, center, center-right, bottom-left, bottom-center, bottom-right

Marketing &#x26; distribution

# Connected social accounts

wonda accounts instagram

wonda accounts tiktok

# Analytics

wonda analytics instagram

wonda analytics tiktok

wonda analytics meta-ads

# Scrape competitors

wonda scrape social --handle @nike --platform instagram --wait

wonda scrape social-status <taskId>                   # Get results of a social scrape

wonda scrape ads --query "sneakers" --country US --wait

wonda scrape ads --query "sneakers" --country US --search-type keyword \

  --active-status active --sort-by impressions_desc --period last30d \

  --media-type video --max-results 50 --wait

wonda scrape ads-status <taskId>                      # Get results of an ads search

# Download a single reel or TikTok video

SCRAPE=$(wonda scrape video --url "https://www.instagram.com/reel/ABC123/" --wait --quiet)

# → returns scrape result with mediaId in the media array

# Publish

wonda publish instagram --media <id> --account <accountId> --caption "New drop"

wonda publish instagram --media <id> --account <accountId> --caption "..." --alt-text "..." --product IMAGE --share-to-feed

wonda publish instagram-carousel --media <id1>,<id2>,<id3> --account <accountId> --caption "..."

wonda tiktok creator-info --account <accountId>      # Live privacy options + comment/duet/stitch defaults

wonda publish tiktok --media <id> --account <accountId> --caption "New drop" --privacy PUBLIC_TO_EVERYONE

wonda publish tiktok --media <id> --account <accountId> --caption "..." --privacy PUBLIC_TO_EVERYONE \

  --disable-comment --commercial-disclose --brand-organic

wonda publish tiktok-carousel --media <id1>,<id2> --account <accountId> --caption "..." \

  --privacy PUBLIC_TO_EVERYONE --cover-index 0

# Schedule a post (Instagram and TikTok single posts)

wonda publish instagram --media <id> --account <accountId> --caption "..." --scheduled-at 2026-05-01T14:00:00Z

wonda publish tiktok --media <id> --account <accountId> --caption "..." --scheduled-at 2026-05-01T14:00:00-07:00

# --scheduled-at takes an RFC3339 timestamp with timezone; 5 min – 29 days out.

# Manage scheduled jobs

wonda publish scheduled list                  # List pending scheduled posts

wonda publish scheduled cancel <outputJobId>  # Cancel before it fires

# History

wonda publish history instagram --limit 10

wonda publish history tiktok --limit 10

# Browse media library

wonda media list --kind image --limit 20

wonda media info <mediaId>

Stealth browser ( wonda stealth-browser )

Stealth-browser primitives backed by Patchright (^1.59.4), the undetected Playwright fork. Same Chromium binary that wonda x tweet --browser and wonda linkedin connect --browser use, but exposed for arbitrary URLs in an ephemeral context: no persistent profile, no session state, no cookies between calls.

One-time setup on a fresh machine:

wonda stealth-browser install   # downloads Chromium (~300 MB), installs Patchright npm deps

This installs Patchright + Chromium for ALL four drivers (stealth-browser record, x --browser, linkedin --browser, brand extract) so you only run it once.

Today's only run-time subcommand is record (URL → webm). Use it for screen recordings of cookie-banner-gated pages (Notion public shares, pdf.js renders, any site where bare Playwright trips a bot check), marketing demo capture, etc.

wonda stealth-browser record https://example.notion.site/page \

  --output recording.webm \

  --duration 5 \

  --viewport 960x1080 \

  --inject-js scripts/page-script.mjs   # optional: runs after load, before timer starts

# Transcode webm to mp4 at 30 fps (Patchright records webm/VP8)

ffmpeg -y -i recording.webm -t 5 -r 30 -an \

  -c:v libx264 -pix_fmt yuv420p -crf 18 recording.mp4

The --inject-js file is wrapped in an async IIFE so top-level await works. It runs AFTER domcontentloaded + networkidle + 400 ms paint settle, BEFORE the duration timer starts. Any await inside counts against the recording window. Use it for dark-theme injection, cookie-banner removal, scroll animations, anything that needs to happen in page context.

Node.js requirement: wonda needs Node >= v20 on PATH. Brew users get it via the node dependency; npm users have it by definition; install.sh users may need brew install node (or any Node distribution). If Node is missing, wonda stealth-browser install fetches a private copy into ~/.wonda/node/.

X/Twitter

Supports reads, writes, and social graph.

# Auth setup (run `wonda x auth --help` for details)

wonda x auth set --auth-token <token> --ct0 <ct0>

wonda x auth set --account burner --auth-token <...> --ct0 <...>  # multi-account

wonda x auth check

# Read

wonda x search "sneakers" -n 20                     # Search tweets

wonda x user @nike                                   # User profile

wonda x user-tweets @nike -n 20                      # User's recent tweets

wonda x read <tweet-id-or-url>                       # Single tweet

wonda x replies <tweet-id-or-url>                    # Replies to a tweet

wonda x thread <tweet-id-or-url>                     # Full thread (author's self-replies)

wonda x home                                         # Home timeline (--following for Following tab)

wonda x bookmarks                                    # Your bookmarks

wonda x likes                                        # Your liked tweets

wonda x following @handle                            # Who a user follows

wonda x followers @handle                            # A user's followers

wonda x lists @handle                                # User's lists (--member-of for memberships)

wonda x list-timeline <list-id-or-url>               # Tweets from a list

wonda x news --tab trending                          # Trending topics (tabs: for_you, trending, news, sports, entertainment)

# Write (uses internal API — use on secondary accounts)

wonda x tweet "Hello world"                          # Post a tweet

wonda x tweet "Hello world" --browser                # Full stealth via real browser (Patchright)

wonda x tweet "Hello world" --attach ~/clip.mp4      # Attach image/gif/video (up to 4)

wonda x reply <tweet-id-or-url> "Great point"        # Reply

wonda x like <tweet-id-or-url>                       # Like

wonda x unlike <tweet-id-or-url>                     # Unlike

wonda x retweet <tweet-id-or-url>                    # Retweet

wonda x unretweet <tweet-id-or-url>                  # Unretweet

wonda x follow @handle                               # Follow

wonda x unfollow @handle                             # Unfollow

# Maintenance

wonda x refresh-ids                                  # Refresh cached GraphQL query IDs from X's JS bundles

All paginated commands support: -n <count>, --cursor, --all, --max-pages, --delay <ms>.

Tweet modes: The tweet command has two modes:

  • Default (API): X's internal GraphQL (CreateTweet for ≤280 chars, CreateNoteTweet for long-form Premium). Fast (<1s), supports --attach for media. Occasionally fails with error 226 when X rotates query IDs or feature flags — when that happens, recapture via twitter-tone-research/_artifacts/scripts/capture-ct-bw.mjs and bump the three knobs in xclient/.
  • **--browser (Patchright):** Launches a real undetected Chrome browser, opens x.com compose, types with human-style jitter, clicks Post. Supports --attach (image/gif/video, up to 4) — files are driven through the hidden compose input via Playwright's setInputFiles, no native picker dialog opens; the script waits for X's upload pipeline to finalize (up to 5 min for video) before submitting. Zero fingerprinting risk. Slower (~10s text, ~30-90s with video) but fully drift-proof — no queryIds, feature flags, or request shape to maintain. Patchright + Chromium auto-install to ~/.wonda/xclient/ on first invocation (~315 MB, one-time, shared with any Playwright install).

LinkedIn

Supports search, profiles, companies, messaging, and engagement.

# Auth setup (run `wonda linkedin auth --help` for details)

wonda linkedin auth set --li-at-value <v> --jsessionid-value <v>

wonda linkedin auth set --account brand-A --li-at-value <...> --jsessionid-value <...>  # multi-account

wonda linkedin auth check

# Read

wonda linkedin me                                    # Your identity

wonda linkedin search "data engineer" --type PEOPLE  # Search (types: PEOPLE, COMPANIES, ALL)

wonda linkedin profile johndoe                       # View profile (vanity name or URL)

wonda linkedin company google                        # View company page

wonda linkedin conversations                         # List message threads

wonda linkedin messages <conversation-urn>           # Read messages in a thread

wonda linkedin notifications -n 20                   # Recent notifications

wonda linkedin connections                           # Your connections

wonda linkedin reactions <activity-id>               # Reactions with reactor profiles + type

wonda linkedin browser-bootstrap                     # Inject stored cookies into patchright profile (one-time + on rotation)

wonda linkedin comments <activity-id> --browser      # Commenters with profile + vanity (daemon auto-starts if needed)

wonda linkedin search-posts "<keyword>" --date-range past-week  # Keyword to recent posts + author profile (patchright; for social listening see content-skills/linkedin-social-listening.md)

# Daemon lifecycle (the Patchright LinkedIn daemon)

wonda linkedin daemon start                          # Spawn detached + wait for ready (uses stored cookies)

wonda linkedin daemon status                         # Alive? uptime, cookie age, last health probe

wonda linkedin daemon stop                           # Graceful quit

wonda linkedin daemon logs --follow                  # tail driver.log

wonda linkedin enrich-engagers --activity-id <id>    # Scrape engagers + enrich each with profile + current employer (joined JSON)

# Write

wonda linkedin connect <vanity-name> --message "Hey!" # Send connection request with note

wonda linkedin connect <vanity-name> -m "Hey!" --browser  # Full stealth via real browser (Patchright)

wonda linkedin like <activity-urn>                   # Like a post

wonda linkedin unlike <activity-urn>                 # Remove a like

wonda linkedin send-message <conversation-urn> "Hi!" # Send a message

wonda linkedin post "Excited to announce..."         # Create a post

wonda linkedin delete-post <activity-id>             # Delete a post

Paginated commands support: -n <count>, --start, --all, --max-pages, --delay <ms>.

Connection request modes: The connect command has two modes:

  • Default (API): Voyager REST API with fingerprint mitigations (profile visit → drawer warm-up → connect). Fast (~3s), supports notes via customMessage.
  • **--browser (Patchright):** Drives the long-running Patchright daemon for full stealth. Daemon auto-starts on first --browser invocation if not already running (override with --no-auto-start). Zero fingerprinting risk. Slower (~10s) but fully safe. Use as a fallback when you need extra protection. Patchright + Chromium auto-install to ~/.wonda/patchright-li-driver/ on first invocation (~315 MB, one-time). Cookie pickup: daemon reads ~/.wonda/linkedin-cookies.json (what wonda linkedin auth set writes) at boot and watches for updates every 5 min, so you can rotate cookies without restarting the daemon. Restarting the daemon mid-session is penalized by LinkedIn anti-fraud — use wonda linkedin auth set + auto-refresh instead.

Engager enrichment: wonda linkedin enrich-engagers --activity-id <id> scrapes reactors (and optionally commenters via --comments), then fetches each engager's profile + current employer + company page, and emits a single joined JSON document keyed by vanity with profile and currentEmployer (industry, headcount, HQ, description, employee count) blocks per engager. Use --max-profiles N to cap the batch (default 25 on sensitive accounts, hard ceiling 100) and --out file.json to write to disk.

For ICP qualification of post engagers, see content-skills/linkedin-icp-qualify.md.

Reddit

Auth is optional — many reads work unauthenticated. Supports search, feeds, users, posts, trending, and chat/DMs.

# Auth setup (run `wonda reddit auth --help` for details)

wonda reddit auth set --session-value <jwt>

wonda reddit auth set --account burner-1 --session-value <jwt>  # multi-account

wonda reddit auth check

# Read (works without auth)

wonda reddit search "AI video" --sort top --time week   # Search posts (sort: relevance, hot, top, new, comments)

wonda reddit subreddit marketing                        # Subreddit info

wonda reddit feed marketing --sort hot                  # Subreddit posts (sort: hot, new, top, rising)

wonda reddit user spez                                  # User profile

wonda reddit user-posts spez --sort top                 # User's posts

wonda reddit user-comments spez                         # User's comments

wonda reddit post <id-or-url> -n 50                     # Post with comments

wonda reddit trending --sort hot                        # Popular/trending posts

# Read (requires auth)

wonda reddit home --sort best                           # Your home feed

# Write (requires auth)

wonda reddit submit marketing --title "Great tool" --text "Check this out..."  # Self post

wonda reddit submit marketing --title "Great tool" --url "https://..."         # Link post

wonda reddit comment <parent-fullname> --text "Nice post!"                     # Reply

wonda reddit vote <fullname> --up                       # Upvote (--down, --unvote)

wonda reddit subscribe marketing                        # Subscribe (--unsub to unsubscribe)

wonda reddit save <fullname>                            # Save a post or comment

wonda reddit unsave <fullname>                          # Unsave

wonda reddit delete <fullname>                          # Delete your post or comment

Paginated commands support: -n <count>, --after <cursor>, --all, --max-pages, --delay <ms>.

Reddit chat / DMs

Direct messaging via the Matrix protocol. Requires a separate chat token.

# Auth setup (run `wonda reddit chat auth-set --help` for details)

wonda reddit chat auth-set

# Read

wonda reddit chat inbox                                  # List DM conversations with latest messages

wonda reddit chat messages <room-id> -n 50               # Fetch messages from a room

wonda reddit chat all-rooms                              # List ALL joined rooms (not limited to sync window)

# Write

wonda reddit chat send <room-id> --text "Hey!"           # Send a DM (mimics browser typing behavior)

# Management

wonda reddit chat accept-all                             # Accept all pending chat requests

wonda reddit chat refresh                                # Force-refresh the Matrix chat token

Important: The chat token expires every ~24h. The CLI auto-refreshes on use, but if it expires fully, re-run auth-set. Rate limit DM sends to 15-20/day with varied text to avoid detection. The send command includes a typing delay (1-5s) to mimic human behavior.

Workflow &#x26; discovery

Brand extraction ( brand extract )

Extract a website's design system (colors, typography, radii, shadows, spacing, fonts, logo, hero decor, CSS pattern backgrounds, dashed/dotted border treatments, :root custom properties) into a DESIGN.md + tokens.json + assets/. Runs locally via the bundled Patchright + Chromium driver (same Chromium cache as wonda x tweet --browser, wonda linkedin connect --browser, wonda stealth-browser record).

Requires a one-time wonda stealth-browser install to download Patchright + Chromium (~300 MB, shared across all four browser drivers).

This is the in-house replacement for the previous npx-based brand-extraction CLI used in the slide-generation / slide-generation-system / creative-static-ads / premium-static-ads skills.

# Local-only — no auth, no credits, no API call

wonda brand extract https://linear.app                       # Writes ./output/linear.app/{DESIGN.md, tokens.json, assets/}

wonda brand extract https://stripe.com --output ./refs       # Writes ./refs/stripe.com/...

wonda brand extract https://vercel.com --screenshot          # Also writes page.png

wonda brand extract https://stripe.com --viewport 1440x900   # Override default 1920x1080

# Persist to the server (uploads assets via media presign + POSTs /brand/save)

wonda brand extract https://stripe.com --save                # Local + persist

wonda brand extract https://stripe.com --save --make-active  # Local + persist + activate (the common path)

wonda brand extract https://stripe.com --no-output --save    # Don't write to disk, persist only

# Move a persisted brand around

wonda brand save --from ./output/stripe.com --make-active    # Persist a previously-extracted dir

wonda brand pull <brand-id>                                  # Download a saved brand back to ./output/<domain>/

Flags:

  • --save: upload assets/ via the media presign flow and POST {tokens, mediaIds} to /api/v1/brand/save. Requires auth.
  • --make-active: implies --save. Sets the new brand as active.
  • --output <dir>: override the local output dir. Default is ./output/<domain>/. Mutually exclusive with --no-output.
  • --no-output: don't write to disk (in-memory extract for piping). Mutually exclusive with --output.
  • --name "Brand Name": override the brand name when persisting. Defaults to the domain stem capitalized.
  • --screenshot: also save page.png alongside DESIGN.md.
  • --viewport WxH: viewport size for the headless browser. Default 1920x1080.

Outputs (when --no-output is not set, always to <output-dir>/<domain>/):

  • DESIGN.md: Markdown summary of tokens, typography, hero decor, logo, CSS patterns, dashed borders, and root CSS variables. Read this in the slide / static-ad skills before composing HTML.
  • tokens.json: raw structured JSON of the extraction.
  • page.png: only when --screenshot is passed.
  • assets/: raw hero decor files plus assets/fonts/ for any non-Google @font-face URLs. Always written when not --no-output.

Prints written file paths to stdout. With --save, also prints the API response (brandId, sourceDomain, warnings). Non-zero exit on failure (network error, navigation timeout, browser crash, save failure).

Video analysis

Analyze a video to extract a composite frame grid (visual) and audio transcript (text). Useful for understanding video content before creating variations. Requires a full account (not anonymous) and costs credits based on video duration (ElevenLabs STT pricing).

If the video was just uploaded and is still normalizing, the CLI auto-retries until the media is ready.

# Analyze a video — returns composite grid image + transcript

ANALYSIS_JOB=$(wonda analyze video --media $VIDEO_MEDIA --wait --quiet)

# The job output contains:

# - compositeGrid: image showing 24 evenly-spaced frames

# - transcript: full text of any speech

# - wordTimestamps: word-level timing [{word, start, end}]

# - videoMetadata: {width, height, durationMs, fps, aspectRatio}

# Download the composite grid for visual inspection

wonda analyze video --media $VIDEO_MEDIA --wait -o /tmp/grid.jpg

# Get just the transcript

wonda analyze video --media $VIDEO_MEDIA --wait --jq '.outputs[] | select(.outputKey=="transcript") | .outputValue'

Error handling: 402 = insufficient credits, 409 = media still processing (CLI auto-retries).

Chat (AI assistant)

Interactive chat sessions for content creation — the AI handles generation, editing, and iteration.

wonda chat create --title "Product launch"            # New session

wonda chat list                                       # List sessions (--limit, --offset)

wonda chat messages <chatId>                          # Get messages

wonda chat send <chatId> --message "Create a UGC reaction video"

wonda chat send <chatId> --message "Edit it" --media <id>

wonda chat send <chatId> --message "..." --aspect-ratio 9:16 --quality-tier max

wonda chat send <chatId> --message "..." --style <styleId>

wonda chat send <chatId> --message "..." --passthrough-prompt  # Use exact prompt, no AI enhancement

Jobs &#x26; runs

wonda jobs get inference <id>                         # Inference job status

wonda jobs get editor <id>                            # Editor job status

wonda jobs get publish <id>                           # Publish job status

wonda jobs wait inference <id> --timeout 20m          # Wait for completion

wonda run get <runId>                                 # Run status

wonda run wait <runId> --timeout 30m                  # Wait for run completion

Discovery

wonda models list                                     # All available models

wonda models info <slug>                              # Model details and params

wonda operations list                                 # All editor operations

wonda operations info <operation>                     # Operation details

wonda capabilities                                    # Full platform capabilities

wonda pricing list                                    # Pricing for all models

wonda pricing estimate --model seedance-2 --prompt "..." # Cost estimate

wonda style list                                      # Available visual styles

wonda topup                                            # Top up credits (opens Stripe checkout)

Editing audio &#x26; images

# Edit audio

wonda edit audio --operation <op> --media <id> --wait -o out.mp3

For any image edit (crop, text overlay, img2img, background removal, vectorize) pull the dedicated skill: wonda skill get image-edit.

Alignment (timestamp extraction)

wonda alignment extract-timestamps --model <model> --attach <mediaId> --wait

Quality tiers

Tier

Image Model

Resolution

Video Model

When

Standard

gpt-image-2 (auto) — alt: nano-banana-2 1K

1024×1024 / 1024×1536 (gpt) / 1K (nano)

seedance-2 (high, 5s)

Default. gpt-image-2 for strongest prompt adherence + text-in-image; nano-banana-2 for faster Gemini iteration with multi-reference support.

High

gpt-image-2 (high) — alt: nano-banana-2 2K

1024×1024 / 1024×1536 (gpt) / 2K (nano)

seedance-2 (high, 15s)

Crisp output. Use --params '{"quality":"high"}' on gpt-image-2 or bump --params '{"resolution":"2K"}' on nano-banana-2. Also offer sora2pro.

Max

nano-banana-pro 4K — alt: nano-banana-2 4K

4K

seedance-2 (high, 15s)

True 4K (gpt-image-2 caps at 1536px). Use --params '{"resolution":"4K"}'. Also offer sora2pro (1080p) for video.

Troubleshooting

Symptom

Likely Cause

Fix

Sora rejected image

Person in image

Switch to kling_3_pro

Video adds objects not in source

Motion prompt describes elements not in image

Simplify to camera movement and atmosphere only

Text unreadable in video

AI tried to render text in generation

Remove text from video prompt, use textOverlay instead

Hands look wrong

Complex hand actions in prompt

Simplify to passive positions or frame to exclude

Style inconsistent across series

No shared anchor

Use same reference image via --attach

Changes to step A not in step B

Stale render

Re-run all downstream steps

Timing expectations

  • Image: 30s - 2min
  • Video (Sora): 2 - 5min
  • Video (Sora Pro): 5 - 10min
  • Video (Veo 3.1): 1 - 3min
  • Video (Kling): 3 - 8min
  • Video (Grok): 2 - 5min
  • Music (Suno): 1 - 3min
  • TTS: 10 - 30s
  • Editor operations: 30s - 2min
  • Lip sync: 1 - 3min
  • Video upscale: 2 - 5min

Error recovery

  • Unknown model: wonda models list
  • No API key: wonda auth login or set WONDA_API_KEY env var
  • Job failed: wonda jobs get inference <id> for error details
  • Bad params: wonda models info <slug> for valid params
  • Timeout: wonda jobs wait inference <id> --timeout 20m
  • Insufficient credits (402): wonda topup to add credits
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card