Make the video, let it post itself

How I built an AI content factory

I make short videos without opening an editor. Footage goes in, a captioned vertical clip comes out, and it posts itself to YouTube, Instagram and TikTok on a schedule. Most of it was written by an AI coding agent, and this page walks through how every part works, with the prompts you'd use to build your own.

00

Start here: what this actually is

This is the whole setup I use to make videos without ever opening an editor. Raw footage goes in one end (a baseball broadcast, a clip of someone's stream, a declassified UFO file) and a finished vertical video comes out the other, captioned and ready to go, then it posts itself to YouTube, Instagram and TikTok on a schedule while I'm doing something else.

I didn't type most of it. I told an AI coding agent (Claude Code) what I wanted and it wrote the scripts, which is also how you'd recreate any of this. That's why the page reads two ways at once. You can follow it to understand how each piece works, or you can grab the prompt at the bottom of each part and have Claude build your own version.

The short list of what you need looks like this. A computer that can run command-line tools, which is any Mac, any Linux box, or a Windows machine with Ubuntu inside it through WSL, which is what I run. A free program called ffmpeg that does the actual video cutting. Python and Node installed. And a handful of free API keys for the parts you can't do on your own machine, like the text-to-speech voice and the upload to YouTube. The next two sections lay all of that out in one place, with a prompt you can paste straight into Claude to get going.

If you've never opened a terminal, the code blocks here are meant to show you what's happening, not to scare you off. You can hand almost all of it to Claude and describe what you want in normal words. Nothing on this page exposes a real key or login either. Anywhere a secret would go you'll see a placeholder like <YOUR_KEY>, and the real values live in one file that never leaves my machine.

01

The kinds of video I make and automate

Before the how, here's the what. Everything below runs on the same shared parts, but each one is its own little channel with its own audience.

MLB Statcast highlight shorts. Short clips built around a single nasty pitch or a big defensive play, with the radar numbers and spin data laid over the footage. Two YouTube channels run these, and a scheduler posts a fresh one every few hours on its own.

Streamer clip shorts. Vertical clips pulled from a Kick streamer's long VODs, transcribed and captioned in that bouncing word-by-word style, with the swearing muted automatically. A safety check looks over every clip before it can go in the queue.

Beat-synced dance edits. Full-song music videos cut to the beat, where the cuts, zooms and effects land on the downbeats. The builder reads the tempo straight off the audio and places every cut for you.

Highlight reels. The best live moments of a single stream, trimmed down into one watchable piece.

Story videos. Short narrated arcs paced after the leaked MrBeast production guide, the internal write-up on hooks and retention that went around online, applied to the streamer's own footage.

UFO and UAP documentary shorts. Narrated mini-documentaries with an AI voice, rendered in React with a declassified-file look: redacted memos, grainy witness footage, rubber stamps that slam down on screen.

02

What you need: APIs, downloads, and the prompt

This is the shopping list for the whole stack. You won't need all of it for any single channel, so skip what doesn't apply to the one you're building. Most of these services have a free tier that's plenty to start with.

Every API in one place

These are the outside services the builders call. Each row links to that service's own docs, where you sign up and get a key.

Service / API | What it does here | Docs
YouTube Data API v3 | Primary upload target. videos.insert via multipart/related POST (uploadType=multipart, part=snippet,status); search.list (forMine) for same-title dedup; thumbnails.set, playlists.insert, playlistItems.insert, channels.list (verify auth), videos.delete (take down a bad cut). |
Google OAuth 2.0 (installed/desktop app) | Auth for YouTube. Authorization-code flow with access_type=offline + prompt=consent to mint a refresh token; refresh_token grant for each upload. Loopback-server flow and a WSL-safe manual paste/exchange flow. |
Instagram Graph API, Content Publishing (Reels) | Cross-post to Reels. Resumable upload: create REELS container (upload_type=resumable) -> POST raw bytes to rupload.facebook.com -> media_publish. Uses a long-lived FB Page token (FB_PAGE_TOKEN) + IG_USER_ID. |
TikTok Content Posting API | Cross-post to TikTok feed via PULL_FROM_URL (TikTok fetches a public MP4 URL), then poll publish/status until PUBLISH_COMPLETE. OAuth 2.0 with PKCE (S256), refresh-token grant, scopes user.info.basic,video.upload. |
ffmpeg / ffprobe | Pre-upload media sanity: ffprobe compares audio vs video stream duration; ffmpeg hard-cuts a frozen audio-only tail (stream copy + -t <video_dur>). Also extracts frames for vision-gate contact sheets. |
Discord REST API (channels/messages) | Cron job result notifications to an #errors channel after each IG post (success/failure with remaining-queue count). |
faster-whisper (CTranslate2 Whisper) | Word-level + segment-level speech-to-text for every clip; drives karaoke caption timing, swear-bleep span detection, and the lip-sync word-onset beat grid. Run with CUDA float16 (cuDNN/cuBLAS DLLs injected on Windows), CPU int8 fallback. |
librosa | Audio beat-tracking (librosa.beat.beat_track) to build the per-song beat grid for the dance AMV, and RMS energy analysis (librosa.feature.rms) to locate the song's loudest bar so the kaleidoscope/hero payoff lands on the drop. |
FFmpeg (libx264 + libass + filtergraphs) | All cutting, composition, ASS subtitle burning, vstack/xstack grids, zoompan punches, boxblur frosted margins, kaleidoscope mirror stacks, gif-burst overlays, audio bleep mixing and loudnorm. ASS (Advanced SubStation Alpha) is the caption format. |
yt-dlp | Fetches copyright-safe filler b-roll (cute-animal compilations) for split-screen shorts, and sources reaction-gif / meme beds for the AMV sticker layer. |
Google Vertex AI, Gemini (generateContent) | Multimodal art-director / QA advisor: ingests rendered QA montage grids (base64 inline images) and gives concrete edit feedback. Auth via Application Default Credentials. |
Kick Clips API | Source of the clip harvest, public stream clips with real view counts and ULID ids feed clip_library.json (the ranked, vibe-tagged reaction pool). |
Tenor GIF API | Reaction-meme cutaway pool (memes2/) popped over highlight beats and the AMV sticker scatter. |
MLB StatsAPI, feed/live | Per-game pitch-level truth: GET /api/v1.1/game/{gamePk}/feed/live → liveData.plays.allPlays[].playEvents[] gives playId, pitchData.coordinates (x0/z0 release, pX/pZ plate, pfxX/pfxZ movement), pitchData.breaks (breakHorizontal + breakVerticalInduced = real fan-facing break), startSpeed, extension, plateTime, strikeZoneTop/Bottom; matchup.pitcher/batter. Used to build the tunnel diagram, resolve a playId to (atBatIndex,pitchNumber), and read game state (inning/score). No official public doc. |
MLB StatsAPI, playByPlay | GET /api/v1/game/{gamePk}/playByPlay → allPlays[].result.description (regex-parsed for outfield assists since runners[].credits is empty in this sim feed) and about.captivatingIndex (tiebreak). No official public doc. |
MLB StatsAPI, schedule / teams | GET /api/v1/schedule?sportId=1&startDate=&endDate=&gameType=R to enumerate Final games in a window; GET /api/v1/teams?sportId=1 to map team id → abbreviation. No official public doc. |
MLB StatsAPI, game content (same-day highlights) | GET /api/v1/game/{gamePk}/content → highlights.items[].playbacks[] exposes broadcast highlight mp4s on mlb-cuts-diamond.mlb.com (FORGE assets) the SAME day a game ends, while Savant's per-play clips lag ~1 day. Used by hand to source the walk-off comp's footage (not called in-script). No official public doc. |
Baseball Savant, Statcast search CSV | GET /statcast_search/csv?... (params: all=true, hfSea=YEAR|, hfGT=R|, hfPT=PITCHTYPE|, type=details, player_type, pitchers_lookup[]=ID, game_date_gt/lt, pitch_speed_min/max). Returns pitch-level rows: release_speed, release_spin_rate, pfx_x/pfx_z, launch_speed, estimated_woba_using_speedangle, game_pk, at_bat_number, pitch_number, player_name(=batter). The discovery layer for spin/velo/movement rankings. |
Baseball Savant, sporty-videos (film room) | GET /sporty-videos?playId={playId} returns an HTML page; the per-play clip mp4 is scraped from the <source src=...mp4> tag (with &#xNN; / &amp; un-escaped). This is how every lane fetches its actual footage. |
Baseball Savant, arm-strength leaderboard | GET /leaderboard/arm-strength?type=outfield&csv=true → fielder_name + arm_of (season max arm strength mph), joined by name to label outfield-assist shorts (shown as the fielder's season arm, NOT this throw's velo, which no source provides). |
Sandlot DuckDB (Statcast Parquet) | Local data/sandlot.duckdb / data/pitches_{year}.parquet (~6M pitches/season). scan_statcast.py opens it read-only and runs three SQL lanes in one pass to pre-select candidates (hardest-hit, lowest-xwOBA hits, highest-spin swinging Ks) before bridging to playIds via StatsAPI. |
Pillow (PIL) | Renders the data insets as RGBA PNGs overlaid into the margin: the pitch-tunnel Bezier diagram, the catcher's-view strike-zone contact plot, and the walk-off stat card. |
YouTube Data API v3, videos.insert | Upload target (via a shared scripts/yt-shorts-upload.py + per-channel OAuth account). Snippet carries title/description/tags/categoryId=17 (Sports); status public, selfDeclaredMadeForKids=false. |
ElevenLabs TTS (perceived-velo VO) | Optional narration mp3 passed via --vo; the builder paces the video to the VO duration and appends a slow-mo replay so total length covers the narration. |
Remotion (core) | React-based programmatic video framework: Composition/Sequence/AbsoluteFill, useCurrentFrame, interpolate, spring, Audio/OffthreadVideo/Img, staticFile, registerRoot |
@remotion/cli | Headless render + studio CLI: remotion compositions, remotion render, remotion studio; configured via remotion.config.ts |
@remotion/media-utils, getAudioDurationInSeconds | Reads VO mp3 length inside calculateMetadata to compute durationInFrames so the video auto-fits the narration |
@remotion/google-fonts | Module-level font loading (Oswald/Special Elite/Share Tech Mono); Remotion waits for fonts before rendering. Weights MUST be restricted to avoid dozens of font requests per render |
ElevenLabs Text-to-Speech (with-timestamps) | Generates VO audio + char-level alignment in one call (POST /v1/text-to-speech/{voiceId}/with-timestamps); alignment arrays are chunked into caption cues. Adam voice id pNInz6obpgDQGcFmaJgB |
@remotion/captions | Caption helpers used in the ohtani long-form variant (words→lines) alongside the ElevenLabs alignment |
FFmpeg (filtergraph) | Audio mixing (amix/adelay/stream_loop) for the drone/SFX bed; in the ufo-shorts variant, the entire video: blurred-fill vertical framing, zoompan Ken-Burns, drawtext captions, doc-inset overlays |
CapCutAPI (sun-guannan/CapCutAPI) | Open-source Python tool that programmatically builds CapCut/JianYing drafts; exposes a Flask HTTP API and an MCP server. The core of this experiment. |
pyJianYingDraft | Underlying library (vendored into the repo) that models and serializes CapCut/JianYing's native draft_info.json, tracks, segments, materials, keyframes, plus a Windows UI-automation export controller. |
Model Context Protocol (MCP) | Protocol used by mcp_server.py to expose 11 editing tools (create_draft, add_video, add_text, save_draft, ...) over stdio JSON-RPC to an AI client. |
CapCut / JianYing (剪映) desktop app | The target editor. The generated draft folder is copied into its drafts directory; the final MP4 export happens here (manually, or via the orphaned UI-automation controller). |
Alibaba Cloud OSS (oss2 SDK) | Optional draft hosting: when is_upload_draft=true, the zipped draft is uploaded and a 24h pre-signed URL is returned. Credentials are config-only and were not populated here. |
Flask | HTTP server framework for capcut_server.py (the REST interface mirroring the MCP tools). |
Canva Connect MCP (claude.ai connector) | Available-but-unused. Surfaces only as a deferred MCP connector (authenticate/complete_authentication, OAuth-gated). It is NOT wired into any pipeline, see gotchas for the truth. |
ffmpeg filters (ffmpeg-filters.html) | Every filtergraph node used here: scale/crop/pad/setsar, boxblur, eq, overlay, drawbox, drawtext, ass, zoompan, split, vstack/hstack, concat (filter), volume, atempo, asetpts/setpts, adelay, amix, aresample, loudnorm, and the lavfi synth sources (color, noise, sine, anoisesrc, anullsrc). |
ffmpeg CLI (ffmpeg.html) | Seeking (-ss before vs after -i), -t hard-cut, stream mapping (-map 0:v:0/0:a:0), -stream_loop, the concat demuxer (-f concat -safe 0), -movflags +faststart, encoder flags (libx264/aac/pcm_s16le). |
ffprobe (ffprobe.html) | Reading stream-level duration (-select_streams a:0/v:0 -show_entries stream=duration) vs container duration (-show_entries format=duration), the basis of the frozen-tail guard. |
faster-whisper | Word-level timestamps (word_timestamps=True) that drive both the karaoke caption timing and the swear-bleep span list. |
Advanced SubStation Alpha (.ass / libass) | Subtitle format burned in via the ass filter; the [V4+ Styles] and [Events] Format lines, override tags (\fad, \t, \fscx, \pos, \c, \p1 vector drawings). No stable canonical spec URL, left blank rather than fabricated. |
Playwright | Drives a real logged-in Chrome so the bot can read the live chat off the page and type replies. There is no API for posting to a YouTube live chat, so the bot uses the page like a person. |
Claude Code CLI | The brain. Run in print mode, it reads one chat message plus the context and writes back a single reply. Uses the CLI login, so no API key. |
ffmpeg | Cuts the live audio into short chunks for transcription, and grabs a video frame so the bot can glance at the screen. |
YouTube live chat | Not an API. The actual web page the bot drives like a human, reading messages and clicking send. |

What to install on your machine

The free, local tools. On Windows I run these inside Ubuntu through WSL. On a Mac use Homebrew, and on Linux use your package manager. Pick the line that matches your system.

ffmpeg, the video engine

winget install Gyan.FFmpeg     # Windows
brew install ffmpeg            # macOS
sudo apt install ffmpeg        # Debian / Ubuntu

Python tools, for transcribing, beat detection and data

pip install yt-dlp faster-whisper librosa duckdb requests pillow numpy

yt-dlp grabs source videos and audio. faster-whisper turns speech into caption timings, and it runs much faster on an NVIDIA GPU with CUDA installed, falling back to the CPU if you don't have one. librosa finds the beat for the dance edits. duckdb holds the pitch database. requests talks to the APIs.

Node and Remotion, for the React-rendered documentaries

npm create video@latest        # scaffold a Remotion project

Playwright, for the live-chat browser bot

npm i playwright
npx playwright install chromium

CapCut draft API, optional (read the CapCut section for the catch)

# clone the open-source CapCutAPI project from GitHub, then:
pip install -r requirements.txt

Or just point Claude at it

You don't have to wire any of this up by hand. Open Claude Code in an empty folder, paste a prompt like the one below, and answer the questions it asks. It will tell you which keys and tools to install, write the scripts, and test them with you.

starter prompt

I want to build an automated short-form video pipeline. It should take a source video, cut a vertical 1080x1920 clip, and burn in captions from a transcript, then render the result with ffmpeg. After that it should upload the finished file to YouTube using the Data API v3 with an OAuth refresh token I'll provide. Before you write any code, tell me exactly which API keys and command-line tools I need and how to install them. I'm on <YOUR_OS>. Then build it step by step and run it on a test clip with me.

For a specific format, copy the matching starter prompt from the bottom of its section below. Each one describes that builder closely enough that Claude can rebuild it from scratch.

03

The ffmpeg Cookbook: Reusable Filtergraph Recipes Behind Every Builder

Every video factory in this workspace (MLB Statcast shorts, UAP documentary shorts, Kick/stream "brainrot" highlight reels) comes down to the same handful of ffmpeg filtergraphs reapplied. None of these pipelines touch a GUI editor or a heavyweight framework. They shell out to /usr/bin/ffmpeg from Python, building one big -filter_complex string per clip. I pulled the cross-cutting recipes out into this standalone cookbook so I can lift any one of them into a new builder.

The problem every recipe solves is taking arbitrary 16:9 source footage and turning it into a polished vertical 1080x1920 Short, with burned captions, on-clip stat overlays, mixed VO/music/SFX, and a clean tail, reproducibly and headlessly on both WSL Linux and Windows. The same shapes keep coming back: a blurred-fill or vstack split to reach 9:16 without stretching, a "frosted" translucent margin for captions and stat cards, libass .ass burn-in for animated karaoke, volume/sine gating for swear censorship, setpts/atempo for speed ramps, overlay ... enable='between(t,..)' for time-gated picture-in-picture, and an ffprobe-driven freeze-tail guard at the upload choke-point.

I've written the recipes with the f-string interpolation stripped out (the source is Python f-strings), as copy-pasteable commands with placeholder values. The load-bearing literals (boxblur=24:1, color=<hex>@0.66:t=fill, eval=frame, s=WxH) are preserved exactly, because those specific values are what I tuned over many render cycles.

APIs & services

Service / API | What it does here | Docs
ffmpeg filters (ffmpeg-filters.html) | Every filtergraph node used here: scale/crop/pad/setsar, boxblur, eq, overlay, drawbox, drawtext, ass, zoompan, split, vstack/hstack, concat (filter), volume, atempo, asetpts/setpts, adelay, amix, aresample, loudnorm, and the lavfi synth sources (color, noise, sine, anoisesrc, anullsrc). |
ffmpeg CLI (ffmpeg.html) | Seeking (-ss before vs after -i), -t hard-cut, stream mapping (-map 0:v:0/0:a:0), -stream_loop, the concat demuxer (-f concat -safe 0), -movflags +faststart, encoder flags (libx264/aac/pcm_s16le). |
ffprobe (ffprobe.html) | Reading stream-level duration (-select_streams a:0/v:0 -show_entries stream=duration) vs container duration (-show_entries format=duration), the basis of the frozen-tail guard. |
faster-whisper | Word-level timestamps (word_timestamps=True) that drive both the karaoke caption timing and the swear-bleep span list. |
Advanced SubStation Alpha (.ass / libass) | Subtitle format burned in via the ass filter; the [V4+ Styles] and [Events] Format lines, override tags (\fad, \t, \fscx, \pos, \c, \p1 vector drawings). No stable canonical spec URL, left blank rather than fabricated. |

How it's built, step by step

  1. Probe the source with ffprobe (dimensions + duration). Decide the 9:16 strategy: blurred-fill (centered undistorted clip over a blurred copy of itself) for near-full-frame looks, or a vstack split (cam on top, b-roll/duplicate on the bottom) for two-up layouts.
  2. Cut each window with a TWO-STAGE seek: fast keyframe -ss (start-3) before -i, then accurate -ss 3 -t len after -i. Re-encode every cut to IDENTICAL params (libx264, yuv420p, 30fps, 1080x1920, aac/pcm) so a later concat demuxer can -c copy.
  3. Transcribe each cut with faster-whisper (word_timestamps=True, language='en', vad off for short/noisy clips). Keep the word list for both karaoke timing and swear-span detection.
  4. Build the .ass subtitle file in UTF-8 with the Name-column-correct header. Strip emoji from text. Compute Dialogue start/end against MEASURED cumulative beat offsets (ffprobe each rendered intermediate, not the planned lengths).
  5. Compose the frame per clip: blurred-fill OR vstack to reach 1080x1920 -> drawbox frosted margin (navy scrim + gold divider) -> drawtext stat lines -> overlay PIP/stat-card/meme insets gated with enable='between(t,A,B)'.
  6. Censor audio: mute the voice across each swear span with volume='if(between(t,S,E)+...,0,1)':eval=frame and amix a 1kHz sine gated to the same spans (a clean beep instead of silence).
  7. Join segments: concat DEMUXER (-f concat -i list.txt -c copy) when intermediates are byte-compatible, or the concat FILTER (concat=n=N:v=1:a=1, re-encodes) when inputs differ.
  8. Mix audio in an AUDIO-ONLY pass: adelay-place each VO block, duck the music bed (volume=0.2), amix with normalize=0, then loudnorm=I=-14:TP=-1.5:LRA=11 ONCE -> .m4a. (Decoupled from the ass video graph to dodge the doubled-audio bug.)
  9. Final video pass: burn subs with -vf ass=file.ass:fontsdir=fonts, mux the premade audio with -c:a copy -shortest, write -movflags +faststart.
  10. At the upload choke-point, run the freeze-tail guard: compare audio vs video stream durations; if audio is longer, hard-cut both with -t <video_dur> -c copy and re-probe the container duration to confirm the dead tail is gone.
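Step 4's rule of timing captions against measured offsets comes down to a running sum over the probed durations. Here is a minimal Python sketch, assuming you already have each intermediate's duration from ffprobe. The names beat_offsets and shift_words are illustrative, not functions from the actual builders.

```python
from itertools import accumulate


def beat_offsets(measured_durations):
    """Return where each segment starts on the joined timeline, in seconds."""
    # accumulate(..., initial=0.0) yields [0, d0, d0+d1, ...]; the last value
    # is the total length, which is not a start offset, so drop it.
    starts = list(accumulate(measured_durations, initial=0.0))
    return starts[:-1]


def shift_words(words, offset):
    """Move per-clip (start, end, text) word timings onto the joined timeline."""
    return [(round(s + offset, 3), round(e + offset, 3), text) for s, e, text in words]
```

Each clip's whisper words are shifted by that clip's offset before the Dialogue lines are written.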

Under the hood

Recipe 1, Vertical 9:16 blurred-fill (the letterbox killer)

ffmpeg -i in.mp4 -filter_complex "\
[0:v]split=2[bg][fg];\
[bg]scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,boxblur=24:2,eq=saturation=1.5:brightness=-0.04[bgb];\
[fg]scale=1080:-2[fgs];\
[bgb][fgs]overlay=(W-w)/2:(H-h)/2[v]" \
-map "[v]" -map 0:a -c:v libx264 -preset veryfast -crf 20 -pix_fmt yuv420p -c:a aac -b:a 160k out.mp4

Why: fills a 9:16 frame from 16:9 source without black bars or stretching. split makes two copies of the input; one is scaled-to-cover + blurred + darkened as a background, the other is scaled to width (-2 keeps even-numbered height) and centered on top with overlay=(W-w)/2:(H-h)/2.
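Because the builders assemble these graphs as Python strings, the recipe wraps naturally in a small function. This is a sketch under assumptions: blurred_fill_graph and blurred_fill_cmd are hypothetical names, and the width, height and blur parameters only show where the literals sit.

```python
def blurred_fill_graph(width=1080, height=1920, blur="24:2"):
    """Build the -filter_complex string for the blurred-fill layout."""
    return (
        "[0:v]split=2[bg][fg];"
        # Background: scale to cover the frame, crop, blur, then grade.
        f"[bg]scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height},boxblur={blur},"
        "eq=saturation=1.5:brightness=-0.04[bgb];"
        # Foreground: fit to the frame width; -2 keeps the height even.
        f"[fg]scale={width}:-2[fgs];"
        # Center the sharp copy over the blurred one.
        "[bgb][fgs]overlay=(W-w)/2:(H-h)/2[v]"
    )


def blurred_fill_cmd(src, dst):
    """Return the argv list for ffmpeg; run it with subprocess.run(cmd, check=True)."""
    return [
        "ffmpeg", "-i", src,
        "-filter_complex", blurred_fill_graph(),
        "-map", "[v]", "-map", "0:a",
        "-c:v", "libx264", "-preset", "veryfast", "-crf", "20",
        "-pix_fmt", "yuv420p", "-c:a", "aac", "-b:a", "160k",
        dst,
    ]
```

Passing the command as a list rather than one shell string means the brackets, semicolons and parentheses in the graph need no extra quoting.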

Recipe 2, Split-screen vstack (cam top, looping b-roll bottom)

ffmpeg -i cam.mp4 -stream_loop -1 -i broll.mp4 -filter_complex "\
[0:v]scale=1080:960:force_original_aspect_ratio=increase,crop=1080:960,setsar=1[top];\
[1:v]scale=1080:960:force_original_aspect_ratio=increase,crop=1080:960,setsar=1[bot];\
[top][bot]vstack=inputs=2[v]" \
-map "[v]" -map 0:a -shortest -c:v libx264 -preset veryfast -crf 20 -pix_fmt yuv420p -c:a aac out.mp4

Why: two stacked 1080x960 halves = 1080x1920. -stream_loop -1 loops the b-roll; -shortest ends the output when the cam audio ends. CRITICAL: an uncapped -stream_loop -1 with no -shortest/-t produces an infinite file (an overnight run hit 6.8 GB with no moov atom, which looked like a "freeze").

Recipe 3, Frosted bottom caption margin (blurred video + navy scrim, not flat black)

ffmpeg -i clip.mp4 -loop 1 -i statcard.png -filter_complex "\
[0:v]split=2[va][vb];\
[va]scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,boxblur=24:1,eq=brightness=-0.28:saturation=0.55,setsar=1[bg];\
[vb]scale=1080:1200:force_original_aspect_ratio=increase:flags=lanczos,crop=1080:1200,setsar=1[scaled];\
[bg][scaled]overlay=0:0:shortest=1[v0];\
[v0]drawbox=x=0:y=1200:w=1080:h=720:color=<NAVY_HEX>@0.66:t=fill,\
drawbox=x=0:y=1198:w=1080:h=5:color=<GOLD_HEX>@<GOLD_ALPHA>:t=fill,\
drawtext=fontfile=/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf:text='98.4 MPH   SLIDER':fontcolor=0xa5dfff:fontsize=34:x=44:y=1404:shadowcolor=<SHADOW_COLOR>@<SHADOW_ALPHA>:shadowx=2:shadowy=2[v1];\
[1:v]format=rgba,scale=510:340[card];\
[v1][card]overlay=548:1340:shortest=1[v]" \
-map "[v]" -map 0:a -c:v libx264 -preset medium -pix_fmt yuv420p -c:a aac -b:a 192k out.mp4

Why: the clip is pinned to the top 1080x1200; the bottom 720px band is the SAME blurred/darkened footage showing through a translucent navy drawbox scrim (@0.66 alpha gives a glass tint rather than opaque black) with a thin gold divider just above it. Stat drawtext lines and a stat-card PNG overlay then live in that frosted panel. This was the user-approved "premium" look across the K-comp and nasty-tunnel MLB builders.

Recipe 4, ASS burn-in and the Name-column comma bug

Canonical CORRECT header (the [Events] Format: line includes Name, and Dialogue carries the matching empty slot Cap,,):

[Script Info]
ScriptType: v4.00+
PlayResX: 1080
PlayResY: 1920
WrapStyle: 0
ScaledBorderAndShadow: yes

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Cap, Anton, 92, &H00FFFFFF&, &H000000FF&, &H00000000&, &H64000000&, -1, 0, 0, 0, 100, 100, 0, 0, 1, 6, 2, 5, 70, 70, 0, 1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:01.00,0:00:03.00,Cap,,0,0,0,,{\fad(40,40)}HELLO WORLD

Burn it onto video:

ffmpeg -i in.mp4 -vf "ass=captions.ass:fontsdir=fonts" -c:v libx264 -crf 18 -pix_fmt yuv420p -c:a copy out.mp4

THE BUG: the [Events] Format: line MUST list the Name (actor) column between Style and MarginL, and every Dialogue: must carry the matching empty field (the Cap,, double-comma). If Format: omits Name but the Dialogue keeps the empty slot (or vice-versa), the field count shifts by one. In the first case the stray comma is parsed as the first character of Text, so every caption renders as ,POV / ,NOW. fontsdir=fonts is required so libass finds bundled TTFs (e.g. Anton/Bebas) on Windows; without it libass falls back to a generic sans. Animated karaoke is done entirely with override tags: {\fad(40,40)} for fades, {\t(0,250,\fscx112\fscy112)} for a pop-in scale, and per-word {\c&H0000F0FF&\fscx122\fscy122}WORD{\c&H00FFFFFF&\fscx100\fscy100} for the highlighted active word. The backslashes are single in the .ass file itself; they are only doubled inside a Python string.
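One way to keep the comma count honest is to generate every Dialogue line from a single helper and check it against the Format line before rendering. The sketch below is an illustration, not the builders' own code: ass_time, dialogue and columns_match are hypothetical names, and the check only catches the leading-comma symptom described above.

```python
EVENTS_FORMAT = "Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text"


def ass_time(seconds):
    """Format seconds as H:MM:SS.cc (ASS timestamps use centiseconds)."""
    cs = int(round(seconds * 100))
    hours, rem = divmod(cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, centis = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{centis:02d}"


def dialogue(start, end, text, style="Cap"):
    """Build one Dialogue line with the empty Name and Effect slots in place."""
    # Layer, Start, End, Style, Name(empty), MarginL, MarginR, MarginV, Effect(empty), Text
    return f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"


def columns_match(format_line, dialogue_line):
    """Heuristic check: split Dialogue into as many fields as Format declares.

    If Format is missing Name, the spare comma ends up at the front of Text.
    """
    columns = format_line.split(":", 1)[1].split(",")
    fields = dialogue_line.split(":", 1)[1].split(",", len(columns) - 1)
    return len(fields) == len(columns) and not fields[-1].startswith(",")
```

Splitting with a maximum count matters because Text can legitimately contain commas, as in \fad(40,40).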

Recipe 5, Audio swear-censor: volume gate + 1kHz beep

ffmpeg -i clip.mp4 -filter_complex "\
sine=f=1000:r=48000:d=30[bp];\
[0:a]volume='if(between(t,1.20,1.46)+between(t,3.10,3.40),0,1)':eval=frame[vv];\
[bp]volume='if(between(t,1.20,1.46)+between(t,3.10,3.40),0.35,0)':eval=frame[bb];\
[vv][bb]amix=inputs=2:normalize=0:duration=first[a]" \
-map 0:v -map "[a]" -c:v copy -c:a aac -b:a 160k out.mp4

Why: the between(t,S,E)+between(t,S2,E2) expression ORs each swear span. volume='if(...,0,1)':eval=frame zeros the voice exactly across those windows, while a parallel 1kHz sine is gated to 0.35 over the SAME spans and mixed back, so you hear a clean beep instead of a dropout. eval=frame is mandatory; without it volume evaluates the expression once at init and the gate never moves (the single most-dropped detail). Span timestamps come from a word-level whisper transcript, padded ~60ms each side. (For caption text, the matching trick is F**K-style starring of the same word list.)
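The gate expression is mechanical to generate from a span list, so a small builder keeps the voice mute and the beep on identical windows. This is a hedged sketch: gate_expr and censor_graph are made-up names, the spans are assumed to be (start, end) seconds from the word-level transcript, and the 0.06 default mirrors the ~60ms padding mentioned above.

```python
def gate_expr(spans, pad=0.06):
    """OR together one between(t,S,E) term per padded swear span."""
    terms = [
        f"between(t,{max(0.0, start - pad):.2f},{end + pad:.2f})"
        for start, end in spans
    ]
    # With no spans, "0" makes if(0,a,b) always pick b (voice on, beep off).
    return "+".join(terms) or "0"


def censor_graph(spans, total_dur, beep_level=0.35, pad=0.06):
    """Build the audio filtergraph: muted voice plus a gated 1 kHz beep."""
    gate = gate_expr(spans, pad)
    return (
        f"sine=f=1000:r=48000:d={total_dur:g}[bp];"
        # eval=frame re-evaluates the expression per frame so the gate moves.
        f"[0:a]volume='if({gate},0,1)':eval=frame[vv];"
        f"[bp]volume='if({gate},{beep_level:g},0)':eval=frame[bb];"
        "[vv][bb]amix=inputs=2:normalize=0:duration=first[a]"
    )
```

For example, gate_expr([(1.26, 1.40), (3.16, 3.34)]) reproduces the two windows used in the command above.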

Recipe 6, Slow-mo and speed ramps (setpts + atempo chain)

# ratio = source_len / output_len ; >1 = speed up, <1 = slow down.  Example: 2.5x faster
ffmpeg -i in.mp4 -filter_complex "\
[0:v]setpts=0.4000*PTS[v];\
[0:a]atempo=2.0,atempo=1.25[a]" \
-map "[v]" -map "[a]" -c:v libx264 -pix_fmt yuv420p -c:a aac out.mp4

Why: video speed is setpts=(1/ratio)*PTS. Audio uses atempo, which I keep inside 0.5–2.0 per instance (older ffmpeg builds reject anything outside that range, and newer ones accept larger values but skip samples above 2.0), so any ratio outside it is decomposed into a chain (2.5 → atempo=2.0,atempo=1.25; for half-speed slow-mo, setpts=2.0*PTS + atempo=0.5). Pair with reverse/areverse, negate (invert), hflip (mirror), or hue=H=2*PI*t for the "fx montage" effects.
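The decomposition into an atempo chain is easy to get wrong by hand, so here is a small sketch of it. The function names are illustrative, and the approach (peel off factors of 2.0 or 0.5 until the remainder fits inside the 0.5–2.0 window) is one simple way to do it, not necessarily how the builders do.

```python
def atempo_chain(ratio):
    """Split a speed ratio into atempo factors that each stay within 0.5-2.0."""
    ratio = float(ratio)
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    factors = []
    while ratio > 2.0:      # speeding up: peel off 2.0x steps
        factors.append(2.0)
        ratio /= 2.0
    while ratio < 0.5:      # slowing down: peel off 0.5x steps
        factors.append(0.5)
        ratio /= 0.5
    factors.append(ratio)   # remainder is now inside the window
    return ",".join(f"atempo={round(f, 4)}" for f in factors)


def speed_filters(ratio):
    """Return the (video, audio) filter strings for one speed ratio."""
    return f"setpts={1 / ratio:.4f}*PTS", atempo_chain(ratio)
```

speed_filters(2.5) gives the same pair of filters as the command above.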

Recipe 7, Ken Burns push-in (zoompan)

ffmpeg -i in.mp4 -filter_complex "\
[0:v]scale=1080:-2:flags=lanczos,setsar=1,\
zoompan=z='min(1.001+0.000700*on,1.09)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,\
format=yuv420p[v]" -map "[v]" -an out.mp4

Why: a slow programmatic push-in adds motion to static or slow footage (used for the UAP doc-photo "fast cut" feel). GOTCHA: zoompan's size arg uses an x: s=1080x1920. The colon form s=1080:1920 is a filterchain parse error. d=1 advances one input frame per output frame so on (output frame index) drives the zoom curve.

Recipe 8, Time-gated PIP / inset overlays

# Always-on stat-card inset pinned into the margin:
ffmpeg -i base.mp4 -loop 1 -i tunnel.png -filter_complex "\
[1:v]format=rgba,scale=510:340[card];\
[0:v][card]overlay=548:1340:shortest=1[v]" -map "[v]" -map 0:a out.mp4

# Reaction/recreation PIP shown ONLY between t=4.2s and t=9.8s, white-bordered:
ffmpeg -i base.mp4 -stream_loop -1 -i react.mp4 -filter_complex "\
[1:v]scale=760:428,setsar=1,pad=768:436:4:4:white,format=yuv420p[pip];\
[0:v][pip]overlay=(W-w)/2:560:enable='between(t,4.20,9.80)'[v]" \
-map "[v]" -map 0:a -t 12 out.mp4

Why: overlay=X:Y:enable='between(t,A,B)' shows a second source only during a window. That's the basis of stat-card insets, "AI RECREATION" PIP boxes, and reaction-meme cutaways. pad=...:white draws a border; -loop 1 for a still PNG, -stream_loop -1 + -t/-shortest for a looping clip. To gate multiple inset windows in one pass, chain overlay nodes, each with its own enable=.
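Chaining several gated overlays by hand means threading intermediate labels through each node. A minimal sketch of that bookkeeping follows, assuming input 0 is the base video and inputs 1..N are already-sized insets. chained_overlays and the [o1], [o2] labels are invented for the example.

```python
def chained_overlays(insets):
    """Build a chain of overlay nodes, one enable= window per inset.

    insets: list of (x, y, start, end); inset i is read from ffmpeg input i+1.
    """
    if not insets:
        raise ValueError("need at least one inset")
    parts = []
    prev = "[0:v]"
    for i, (x, y, start, end) in enumerate(insets, start=1):
        # The last node writes the final [v] label; earlier ones feed the next.
        out = "[v]" if i == len(insets) else f"[o{i}]"
        parts.append(
            f"{prev}[{i}:v]overlay={x}:{y}:"
            f"enable='between(t,{start:.2f},{end:.2f})'{out}"
        )
        prev = out
    return ";".join(parts)
```

The x and y values can be numbers or overlay expressions such as "(W-w)/2".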

Recipe 9, Freeze-tail guard (container-vs-video, -t hardcut)

# Detect: video vs audio stream duration
ffprobe -v error -select_streams v:0 -show_entries stream=duration -of csv=p=0 out.mp4
ffprobe -v error -select_streams a:0 -show_entries stream=duration -of csv=p=0 out.mp4
# If audio runs >~0.3s longer than video, the player holds the LAST frame while audio plays = frozen tail.
# Fix: hard-cut BOTH streams at the video duration, lossless copy:
ffmpeg -y -i out.mp4 -t 28.640 -map 0:v:0 -map 0:a:0 -c copy -movflags +faststart fixed.mp4
# Verify the muxed container is now ~= video duration:
ffprobe -v error -show_entries format=duration -of csv=p=0 fixed.mp4

Why: when audio outlasts video the tail freezes. -c copy -shortest is NOT reliable here. With stream copy it often leaves the long audio in and the tail still freezes. An explicit -t <video_dur> ceiling cuts both streams cleanly. Always confirm against format=duration (the real muxed length), because a stream's duration tag can be stale after a copy (ffmpeg carries the source stream duration over even when fewer packets are muxed). This guard fires fail-open (any probe/encode hiccup returns the original file) so tooling trouble never blocks a legit upload, and it caught a ~111-file batch that had shipped with a ~1.8s frozen tail.
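Wrapped in Python, the guard is a probe, a comparison, and a copy that falls back to the original file on any error. This is a sketch of that shape rather than the actual upload script: the function names are hypothetical, the 0.3s tolerance is taken from the comment in the recipe, and the final format=duration re-check is left out for brevity.

```python
import subprocess


def probe_duration(path, stream):
    """Read one stream's duration in seconds; stream is 'v:0' or 'a:0'."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", stream,
         "-show_entries", "stream=duration", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def needs_trim(video_dur, audio_dur, tolerance=0.3):
    """True when audio outlasts video by more than the tolerance."""
    return audio_dur - video_dur > tolerance


def trim_cmd(src, dst, video_dur):
    """Hard-cut both streams at the video duration with stream copy."""
    return ["ffmpeg", "-y", "-i", src, "-t", f"{video_dur:.3f}",
            "-map", "0:v:0", "-map", "0:a:0", "-c", "copy",
            "-movflags", "+faststart", dst]


def guard(src, dst):
    """Fail-open: return src unchanged if probing or cutting goes wrong."""
    try:
        video_dur = probe_duration(src, "v:0")
        audio_dur = probe_duration(src, "a:0")
        if not needs_trim(video_dur, audio_dur):
            return src
        subprocess.run(trim_cmd(src, dst, video_dur), check=True, capture_output=True)
        return dst
    except (OSError, ValueError, subprocess.CalledProcessError):
        # Missing binary, unreadable file, or a non-numeric duration.
        return src
```

The uploader then ships whatever path guard returns.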

Recipe 10, Concat: demuxer (copy) vs filter (re-encode)

# DEMUXER — near-instant, but EVERY input must share codec/res/fps/pixfmt/timebase:
printf "file 'seg01.mkv'\nfile 'seg02.mkv'\nfile 'seg03.mkv'\n" > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy out.mkv

# FILTER — re-encodes, tolerates differing inputs:
ffmpeg -i a.mp4 -i b.mp4 -i c.mp4 -filter_complex \
"[0:v][0:a][1:v][1:a][2:v][2:a]concat=n=3:v=1:a=1[outv][outa]" \
-map "[outv]" -map "[outa]" -c:v libx264 -preset medium -pix_fmt yuv420p -c:a aac -b:a 192k -movflags +faststart out.mp4

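Why: the demuxer is the fast path but requires byte-compatible segments, which is exactly why these pipelines render normalized intermediates (identical 1080x1920/30fps/yuv420p) before joining. The concat filter absorbs differing codecs, containers and timebases at the cost of a full re-encode, and it negotiates a common pixel format and audio sample format on its own. It still expects every segment to share one frame size, so add a scale step per input when resolutions differ. Use it when each segment was composed independently.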
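In Python, the list file and the demuxer-or-filter decision can be handled by two small helpers. Both are illustrative sketches with invented names. The per-segment parameter dicts are an assumed shape that you would fill from ffprobe, and the key set is a simplification of what "compatible" means.

```python
def concat_list_text(paths):
    """Render the concat demuxer list, one quoted file directive per line."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def can_stream_copy(segments):
    """True when every segment shares codec, size, fps and pixel format.

    segments: list of dicts with keys codec, width, height, fps, pix_fmt.
    """
    keys = ("codec", "width", "height", "fps", "pix_fmt")
    first = tuple(segments[0][k] for k in keys)
    return all(tuple(seg[k] for k in keys) == first for seg in segments[1:])
```

When can_stream_copy returns False, fall back to the concat filter or re-render the odd segment to the shared parameters.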

Recipe 11, Audio bed mixing (adelay + amix) and the loudnorm decoupling

ffmpeg -i base.mkv -i vo1.mp3 -i vo2.mp3 -i bed.mp3 -filter_complex "\
[1:a]adelay=2200|2200[n1];\
[2:a]adelay=12000|12000[n2];\
[3:a]volume=0.2[bed];\
[0:a][n1][n2][bed]amix=inputs=4:duration=longest:normalize=0[m];\
[m]loudnorm=I=-14:TP=-1.5:LRA=11,aresample=48000[a]" \
-map "[a]" -c:a aac -b:a 192k audio.m4a

Why: adelay=ms|ms (one value per channel) time-places each VO block at its offset; amix ... normalize=0 preserves the levels you set (default normalize=1 divides by input count and crushes loudness); a single loudnorm=I=-14:TP=-1.5:LRA=11 hits the YouTube loudness target. CRITICAL BUG: running a many-input amix+loudnorm in the SAME filtergraph as an ass video filter makes ffmpeg DOUBLE the audio (output ~2x length, half speed, mis-tagged bitrate). Fix = render audio in this audio-only pass to .m4a, then a second pass burns subs on video and muxes with -c:a copy -shortest. loudnorm also resamples up, so always append aresample=48000.
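With a variable number of VO blocks, the input indices and the amix input count have to stay in sync. A sketch of a builder for that, assuming input 0 is the base clip, inputs 1..N are the VO blocks in order, and the last input is the music bed. audio_mix_graph is a hypothetical name.

```python
def audio_mix_graph(vo_offsets_ms, bed_volume=0.2):
    """Build the audio-only graph: delayed VO blocks + ducked bed + loudnorm."""
    parts = []
    labels = ["[0:a]"]
    for i, ms in enumerate(vo_offsets_ms, start=1):
        # adelay takes one delay per channel, so repeat the value for stereo.
        parts.append(f"[{i}:a]adelay={ms}|{ms}[n{i}]")
        labels.append(f"[n{i}]")
    bed_index = len(vo_offsets_ms) + 1
    parts.append(f"[{bed_index}:a]volume={bed_volume:g}[bed]")
    labels.append("[bed]")
    joined = "".join(labels)
    # normalize=0 keeps the levels set above instead of rescaling per input.
    parts.append(f"{joined}amix=inputs={len(labels)}:duration=longest:normalize=0[m]")
    # loudnorm runs once, then the output is brought back to 48 kHz.
    parts.append("[m]loudnorm=I=-14:TP=-1.5:LRA=11,aresample=48000[a]")
    return ";".join(parts)
```

audio_mix_graph([2200, 12000]) yields the same graph as the command above.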

Recipe 12, Procedural transition assets via lavfi (license-free SFX/visuals)

# TV-static video flash:
ffmpeg -f lavfi -i "color=c=gray:s=1080x1920:r=30:d=0.6" -vf "noise=alls=100:allf=t+u,format=yuv420p" -c:v libx264 -pix_fmt yuv420p static.mp4
# White-noise burst:
ffmpeg -f lavfi -i "anoisesrc=d=0.5:c=white:a=0.6:r=48000" -c:a pcm_s16le static.wav
# Silent stereo bed for a still-image title card:
ffmpeg -f lavfi -i "anullsrc=r=48000:cl=stereo" -t 3 -c:a pcm_s16le silence.wav

Why: lavfi synth sources generate copyright-safe SFX and visual transitions (static flash, white-noise riser, the 1kHz censor beep, silent beds for -loop 1 PNG cards) entirely procedurally, so there's no asset licensing and the output is fully reproducible. Overlay the static clip on a "rewind" cut with [v][stat]overlay=0:0:enable='between(t,C-0.06,C+0.28)'.

Recipe 13, Two-stage seek (fast keyframe + accurate decode)

ffmpeg -ss 3722.000 -i source.mp4 -ss 3.000 -t 6.0 \
-c:v libx264 -preset veryfast -crf 18 -r 30 -c:a aac -b:a 160k -avoid_negative_ts make_zero cut.mp4

Why: a single -ss BEFORE -i is a fast keyframe seek but lands seconds off (captions then grab adjacent speech); a single -ss after -i is frame-accurate but slow over a long file. Two-stage = fast keyframe seek to (start-3) before the input, then a short accurate decode-seek of 3s after the input, which is fast AND frame-accurate. Note: over a multi-hour VOD even this drifts because the planning transcript's timestamps are non-linear, so windows stay approximate.
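A sketch of how the two -ss values fall out of one target start time. The function name and the pre_roll default are assumptions that mirror the 3s used in the recipe.

```python
def two_stage_seek_args(src, start, dur, pre_roll=3.0):
    """Coarse keyframe seek before -i, short accurate seek after it."""
    coarse = max(0.0, start - pre_roll)   # never seek before the start of the file
    fine = start - coarse                 # what is left for the decode-accurate seek
    return ["-ss", f"{coarse:.3f}", "-i", src,
            "-ss", f"{fine:.3f}", "-t", f"{dur:.3f}"]
```

For start=3725.0 this reproduces the recipe's -ss 3722.000 before the input and -ss 3.000 after it.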

Recipe 14, Tile-grid "multiply" (split → hstack → vstack)

ffmpeg -i in.mp4 -filter_complex "\
[0:v]split=9[a][b][c][d][e][f][g][h][i];\
[a]scale=360:640[a2];[b]scale=360:640[b2];[c]scale=360:640[c2];\
[d]scale=360:640[d2];[e]scale=360:640[e2];[f]scale=360:640[f2];\
[g]scale=360:640[g2];[h]scale=360:640[h2];[i]scale=360:640[i2];\
[a2][b2][c2]hstack=3[r0];[d2][e2][f2]hstack=3[r1];[g2][h2][i2]hstack=3[r2];\
[r0][r1][r2]vstack=3[v]" -map "[v]" -map 0:a out.mp4

Why: split into N×N streams, scale each to a (1080/N)×(1920/N) cell, hstack each row then vstack the rows, which gives the meme "multiply into infinity" effect. Step N up over successive segments (1→2→4→6...) for an escalating grid.

Recipe 15, Pre-crop a burned-in overlay (chat column) before reframing

ffmpeg -i cam.mp4 -filter_complex "\
[0:v]crop=iw*0.78:ih:iw*0.22:0,scale=-2:1200,crop=1080:1180,setsar=1[v]" -map "[v]" -map 0:a out.mp4

Why: stream VODs often have mobile chat burned into the lower-left of the frame. crop=iw*0.78:ih:iw*0.22:0 drops the left 22% (the chat column) BEFORE scaling/centering, so leaked chat never reaches the cam crop. Use even widths (scale=-2:H) to stay yuv420p-legal. If the source is 720p, this also upscales to fill.

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Python ffmpeg helper that reformats any 16:9 clip into a 1080x1920 Short by centering the undistorted video over a blurred, darkened copy of itself (split + scale-to-cover + boxblur + overlay), then burns word-level karaoke captions from a faster-whisper transcript using a libass .ass file. Make sure the Events Format line includes the Name column.

prompt

Add a 'frosted margin' template: pin the clip to the top 1080x1200, fill the bottom 720px with the blurred video under a translucent navy drawbox scrim plus a thin gold divider, then drawtext stat lines and overlay a stat-card PNG into that margin.

prompt

Write a swear-censor that takes word-level timestamps and, for each swear span, mutes the voice with volume='if(between(t,S,E)+...,0,1)':eval=frame while mixing a 1kHz sine gated to the same spans, then remux the video losslessly.

prompt

Add a frozen-tail guard before upload: use ffprobe to compare audio vs video stream durations and, if audio is longer, hard-cut both streams at the video duration with -t and -c copy, then re-probe the container duration to confirm the dead tail is gone. Make it fail open.

04

MLB Statcast Shorts Auto-Factory (two YouTube channels)

I built a Python + ffmpeg pipeline that turns free MLB data into vertical YouTube Shorts and auto-posts a subset on a cron. Each short follows one content lane, meaning a hook with a data "payoff" revealed late, and they all share one visual template: a clean broadcast clip pinned to the top of a 1080x1920 frame, with all text and a data inset living in a frosted bottom margin (blurred-and-darkened video showing through a navy scrim, a gold divider line, and a stacked drawtext block). I don't use cold-open title cards. The stat lands late as the payoff instead of being given away on the cover.

The lanes are: nasty-tunnel / UNHITTABLE (a filthy putaway pitch + a pitch-tunnel diagram), outfield-assist / GUNNED DOWN (an outfielder throwing a runner out + a strike-zone contact inset), perceived-velocity / PERCEIVED VELOCITY (radar reading vs. how fast it "felt" given the pitcher's extension, optionally paced to an ElevenLabs VO with a slow-mo replay), walk-off (a multi-scene debut comp ending on a stat card), plus countdown comps: nastiest-week Top-6, single-pitcher career top-spin, and single-game K-comp. Three more lanes are scanned straight out of the Sandlot DuckDB in one pass: HOW HARD?! (hardest-hit balls → reveal exit velo), EARNED OR LUCKY? (cheap hits with tiny xwOBA → reveal xwOBA), and UNHITTABLE (nastiest swinging-K → reveal pitch + RPM).

Only two lanes actually fire on the cron. A 4-hour dispatcher (post_sportsstats_4h.py) alternates nasty ↔ assist, picking the opposite of whatever was posted last (read from a shared posted.txt ledger) and falling back to the other lane if the chosen one has nothing fresh. The other lanes are one-offs or reach a channel through a manual pending/ queue that the dispatcher drains first. A shared, playId-keyed ledger guarantees the same clip is never posted to both channels.
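The alternation rule is easy to get subtly wrong, so here is a minimal sketch of just the decision. It assumes the caller has already read the last posted lane from posted.txt and built per-lane lists of un-posted shorts sorted freshest first; the function name and data shapes are hypothetical, not the real post_sportsstats_4h.py.

```python
def pick_lane(last_posted, fresh, pending=()):
    """Return (lane, item) to post next, or None if nothing is available.

    last_posted: "nasty", "assist", or None
    fresh: dict lane -> list of un-posted shorts, freshest first
    pending: manually queued shorts, drained before anything else
    """
    if pending:
        return ("pending", pending[0])
    preferred = "assist" if last_posted == "nasty" else "nasty"
    other = "nasty" if preferred == "assist" else "assist"
    for lane in (preferred, other):      # fall back if the preferred lane is empty
        if fresh.get(lane):
            return (lane, fresh[lane][0])
    return None

def unposted(candidates, posted_play_ids):
    """Drop anything whose playId is already in the shared ledger."""
    return [c for c in candidates if c["playId"] not in posted_play_ids]
```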

APIs & services

| Service / API | What it does here | Docs |
|---|---|---|
| MLB StatsAPI, feed/live | Per-game pitch-level truth: GET /api/v1.1/game/{gamePk}/feed/live → liveData.plays.allPlays[].playEvents[] gives playId, pitchData.coordinates (x0/z0 release, pX/pZ plate, pfxX/pfxZ movement), pitchData.breaks (breakHorizontal + breakVerticalInduced = real fan-facing break), startSpeed, extension, plateTime, strikeZoneTop/Bottom; matchup.pitcher/batter. Used to build the tunnel diagram, resolve a playId to (atBatIndex,pitchNumber), and read game state (inning/score). | No official public doc. |
| MLB StatsAPI, playByPlay | GET /api/v1/game/{gamePk}/playByPlay → allPlays[].result.description (regex-parsed for outfield assists since runners[].credits is empty in this sim feed) and about.captivatingIndex (tiebreak). | No official public doc. |
| MLB StatsAPI, schedule / teams | GET /api/v1/schedule?sportId=1&startDate=&endDate=&gameType=R to enumerate Final games in a window; GET /api/v1/teams?sportId=1 to map team id → abbreviation. | No official public doc. |
| MLB StatsAPI, game content (same-day highlights) | GET /api/v1/game/{gamePk}/content → highlights.items[].playbacks[] exposes broadcast highlight mp4s on mlb-cuts-diamond.mlb.com (FORGE assets) the SAME day a game ends, while Savant's per-play clips lag ~1 day. Used by hand to source the walk-off comp's footage (not called in-script). | No official public doc. |
| Baseball Savant, Statcast search CSV | GET /statcast_search/csv?... (params: all=true, hfSea=YEAR, hfGT=R, hfPT=PITCHTYPE, each of those three values carrying a trailing pipe delimiter; type=details, player_type, pitchers_lookup[]=ID, game_date_gt/lt, pitch_speed_min/max). Returns pitch-level rows: release_speed, release_spin_rate, pfx_x/pfx_z, launch_speed, estimated_woba_using_speedangle, game_pk, at_bat_number, pitch_number, player_name(=batter). The discovery layer for spin/velo/movement rankings. | |
| Baseball Savant, sporty-videos (film room) | GET /sporty-videos?playId={playId} returns an HTML page; the per-play clip mp4 is scraped from the <source src=...mp4> tag (with &#xNN; / &amp; un-escaped). This is how every lane fetches its actual footage. | |
| Baseball Savant, arm-strength leaderboard | GET /leaderboard/arm-strength?type=outfield&csv=true → fielder_name + arm_of (season max arm strength mph), joined by name to label outfield-assist shorts (shown as the fielder's season arm, NOT this throw's velo, which no source provides). | |
| Sandlot DuckDB (Statcast Parquet) | Local data/sandlot.duckdb / data/pitches_{year}.parquet (~6M pitches/season). scan_statcast.py opens it read-only and runs three SQL lanes in one pass to pre-select candidates (hardest-hit, lowest-xwOBA hits, highest-spin swinging Ks) before bridging to playIds via StatsAPI. | |
| ffmpeg / ffprobe | All rendering: filter_complex graphs (scale/crop/boxblur/eq, overlay, drawbox, drawtext), concat for multi-segment comps, libx264 yuv420p + aac. ffprobe reads clip duration to size each segment. | |
| Pillow (PIL) | Renders the data insets as RGBA PNGs overlaid into the margin: the pitch-tunnel Bezier diagram, the catcher's-view strike-zone contact plot, and the walk-off stat card. | |
| YouTube Data API v3, videos.insert | Upload target (via a shared scripts/yt-shorts-upload.py + per-channel OAuth account). Snippet carries title/description/tags/categoryId=17 (Sports); status public, selfDeclaredMadeForKids=false. | |
| ElevenLabs TTS (perceived-velo VO) | Optional narration mp3 passed via --vo; the builder paces the video to the VO duration and appends a slow-mo replay so total length covers the narration. | |

How it's built, step by step

  1. SCAN/DISCOVER: pick candidates per lane. The Statcast-search lanes (nasty-week, pitcher-topspin) pull a Baseball Savant CSV filtered by season/pitch-type/date and rank by release_spin_rate or 12*hypot(pfx_x,pfx_z) movement. The three DuckDB lanes (scan_statcast.py) query data/sandlot.duckdb directly. Outfield assists get regex-parsed from StatsAPI playByPlay descriptions, and perceived-velo is computed from feed/live extension. Each scanner writes data/<lane>/recent.json (or a date-keyed file) in a shared play schema.
  2. RESOLVE PLAY → IDS: for a chosen play, hit StatsAPI feed/live to map playId → (atBatIndex+1, pitchNumber), pull every pitch of the at-bat's trajectory, and tag is_target on the exact pitch so the tunnel/zone inset highlights the right one rather than whatever ended the AB.
  3. FETCH FILM (playId-keyed cache): fetch_film() scrapes baseballsavant.mlb.com/sporty-videos?playId=… for the mp4 and writes it to raw_{idx:02d}_{playId[:8]}.mp4. The playId hash in the filename is load-bearing, because the rotating 'recent' datasets reshuffle which play sits at index N, so an index-only glob would serve stale footage under new text.
  4. RENDER INSET PNG: Pillow draws the lane's data graphic. That's a zoomed pitch-tunnel (Bezier release→plate curves, target pitch gold/thick) for nasty/K/topspin, a catcher's-view strike-zone dot plot for assist/perceived-velo, or a full-frame stat card for walk-off.
  5. COMPOSE SEGMENT (frosted template): one ffmpeg filter_complex does it all. Split the clip, build a blurred/darkened 1080x1920 background plus a sharp 1080x1200 top crop overlaid at y=0, lay a navy drawbox scrim and gold divider across y=1200..1920, stack the drawtext margin block (centered gold kicker header, then star name / stat line / gold payoff / matchup+date), and overlay the inset PNG into the bottom-right of the margin.
  6. CONCAT (countdown comps): the multi-K / Top-N / multi-scene builds render each segment, then concat=n=N:v=1:a=1 worst→best (or scene order) so the video climaxes on #1. Encoded libx264 4M/yuv420p, +faststart.
  7. POST: post_sportsstats_4h.py runs every 4h. It drains any manual pending/ short first, otherwise it alternates nasty↔assist (opposite of last in posted.txt, fallback to the other lane). The chosen lane's poster picks the freshest un-posted short from review/batch_*, generates a playbook-compliant title (title_ledger dedupes phrasing across both channels), writes the YouTube snippet sidecar, and uploads via the channel's OAuth account. The shared playId-keyed posted.txt prevents cross-channel reposts.

Under the hood

The frosted-margin filtergraph (the shared look)

Every single-clip lane composes with the same ffmpeg filter_complex. The clip is split: one copy is blown up to fill 1080x1920 and heavily blurred+darkened as the background, the other is scaled to a sharp 1080x1200 top panel. A semi-opaque navy drawbox from y=1200 down forms the frosted margin (the blurred video still shows through it), capped by a thin gold divider, then the text block is drawn into it.

[0:v]trim=start=1.00:duration=DUR,setpts=PTS-STARTPTS,fps=30,split=2[va][vb];
[va]scale=1080:1920:force_original_aspect_ratio=increase,
    crop=1080:1920,boxblur=24:1,eq=brightness=-0.28:saturation=0.55,setsar=1[bg];
[vb]scale=1080:1200:force_original_aspect_ratio=increase:flags=lanczos,
    crop=1080:1200,setsar=1[scaled];
[0:a]atrim=start=1.00:duration=DUR,asetpts=PTS-STARTPTS,aresample=44100[atrim];
[1:v]format=rgba,scale=510:340,trim=duration=DUR,setpts=PTS-STARTPTS[tunnel];
[bg][scaled]overlay=0:0:shortest=1[v0];
[v0]
drawbox=x=0:y=1200:w=1080:h=720:color=<NAVY_HEX>@<ALPHA>:t=fill,
drawbox=x=0:y=1198:w=1080:h=5:color=<GOLD_HEX>@<ALPHA>:t=fill,
drawtext=fontfile=DejaVuSans-Bold.ttf:text='UNHITTABLE':
  fontcolor=0xffd166:fontsize=32:x=(w-text_w)/2:y=1222:shadowcolor=<COLOR>@<ALPHA>:shadowx=2:shadowy=2
[v1];
[v1]drawtext=...:text='PITCHER NAME':fontsize=44:x=44:y=1286...,
    drawtext=...:text='88 MPH   SWEEPER':fontcolor=0xa5dfff:fontsize=30:x=44:y=1342...,
    drawtext=...:text='3,348 RPM':fontcolor=0xffd166:fontsize=50:x=44:y=1404...,   # gold payoff
    drawtext=...:text='vs HITTER':x=44:y=1470...,
    drawtext=...:text='18\" BREAK':x=46:y=1520...,
    drawtext=...:text='LAA @ HOU   |   2026-06-09':fontcolor=<COLOR>@<ALPHA>:x=46:y=1562...
[v2];
[v2][tunnel]overlay=548:1340:shortest=1[v]

Key constants: brand gold 0xffd166, info-blue 0xa5dfff, and a navy scrim written in ffmpeg's color@opacity form (<NAVY_HEX>@<ALPHA>). The clip is trimmed start=1.0 to skip the windup; DUR = min(12, max(4, clip_dur-SS)). Long victim names are auto-shrunk with vs_fs = max(26, min(46, int(500/(0.68*len(vs))))) so they never run under the inset (which starts at x≈548). Margin text sits at y=1222..1604, lifted off the very bottom, which is hidden behind the Shorts progress bar / action buttons.
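Those two clamps are worth seeing with numbers. A minimal sketch, assuming SS is the 1.0s trim start; the function names and the empty-string guard are additions for illustration.

```python
def segment_duration(clip_dur, ss=1.0):
    """DUR = min(12, max(4, clip_dur - SS)): never shorter than 4s, never longer than 12s."""
    return min(12, max(4, clip_dur - ss))

def victim_font_size(vs):
    """Shrink the 'vs HITTER' line as the string gets longer, clamped to 26..46."""
    n = max(1, len(vs))          # guard against an empty string (not in the original formula)
    return max(26, min(46, int(500 / (0.68 * n))))
```

A 9.5s clip yields an 8.5s segment, a 3s clip is padded up to the 4s floor, and a 20-character victim line drops from the 46 default to size 36.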

drawtext escaping (load-bearing)

def _esc(s):
    return (s.replace("\\","\\\\").replace(":","\\:").replace("'","").replace("%","\\%"))

Names are also ASCII-folded (unicodedata.normalize("NFKD", ...)) so accents don't break the fontfile.
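A sketch of the fold-then-escape order, reusing the same replacements as _esc above. The encode("ascii", "ignore") step is an assumption about how the folding is finished; note that letters with no decomposition (ø, for example) are dropped by it rather than mapped.

```python
import unicodedata

def ascii_fold(name):
    """Split accented letters into base + combining mark, then drop the marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def esc(s):
    """Same replacements as _esc: escape backslash, colon, percent; strip apostrophes."""
    return s.replace("\\", "\\\\").replace(":", "\\:").replace("'", "").replace("%", "\\%")

def drawtext_safe(name):
    return esc(ascii_fold(name))
```

Folding first matters: escaping only handles ffmpeg's own metacharacters, while the fold handles glyphs the font may not have.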

Savant discovery, real CSV param set

import urllib.parse

SAVANT_CSV = "https://baseballsavant.mlb.com/statcast_search/csv"
# nasty-week, per breaking/offspeed pitch type, swinging strikes in a date window:
params = {"all":"true","hfSea":"2026|","hfGT":"R|","type":"details",
          "hfPT":"SL|","game_date_gt":"2026-05-18","game_date_lt":"2026-05-24"}
# per-pitcher career top-spin:
params = {"all":"true","hfGT":"R|","hfSea":"2018|","player_type":"pitcher",
          "pitchers_lookup[]":"545333","type":"details"}
url = SAVANT_CSV + "?" + urllib.parse.urlencode(params, safe='|')   # keep the | delimiters

Movement = 12 * math.hypot(pfx_x, pfx_z) inches. A SPIN ranking excludes {CH,FS,FO,KN,EP}, since high spin on those is a Statcast misclassification rather than skill. The real fan-facing break shown on screen comes from the feed's breaks object (hypot(breakHorizontal, breakVerticalInduced)), NOT the smaller pfx metric.
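A minimal sketch of both rankings over already-parsed CSV rows. It assumes each row is a dict with the Savant columns pitch_type, pfx_x, pfx_z and release_spin_rate already converted to numbers; the function names and the top=6 default are illustrative.

```python
import math

SPIN_EXCLUDE = {"CH", "FS", "FO", "KN", "EP"}   # high spin here is misclassification

def movement_inches(pfx_x, pfx_z):
    """pfx_x / pfx_z are in feet, so multiply the combined movement by 12."""
    return 12 * math.hypot(pfx_x, pfx_z)

def rank_by_spin(rows, top=6):
    ok = [r for r in rows
          if r["pitch_type"] not in SPIN_EXCLUDE
          and r.get("release_spin_rate") is not None]
    return sorted(ok, key=lambda r: r["release_spin_rate"], reverse=True)[:top]

def rank_by_movement(rows, top=6):
    ok = [r for r in rows if r.get("pfx_x") is not None and r.get("pfx_z") is not None]
    return sorted(ok, key=lambda r: movement_inches(r["pfx_x"], r["pfx_z"]), reverse=True)[:top]
```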

Film fetch + playId-keyed cache

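def fetch_film(play_id, dest):
    # GET / GET_bytes are HTTP helpers defined elsewhere in the script (text / raw bytes).
    if dest.exists() and dest.stat().st_size > 100_000: return True    # already cached
    html = GET(f"https://baseballsavant.mlb.com/sporty-videos?playId={play_id}")
    m = re.search(r'<source[^>]*src="([^"]+)"', html)
    if not m: return False                                             # page has no clip
    url = re.sub(r'&#x([0-9a-fA-F]+);', lambda x: chr(int(x.group(1),16)), m.group(1)).replace('&amp;','&')
    dest.write_bytes(GET_bytes(url))
    return dest.stat().st_size > 100_000                               # same size check as the cache hit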

# Cache path MUST carry the playId hash — never a bare index glob:
clip = clips_dir / f"raw_{idx:02d}_{(play_id or 'x')[:8]}.mp4"

This is the fix for the wrong-footage-under-right-text desync: because fetch_film skips download when a same-named file exists and the 'recent' datasets re-order plays every scan, an index-only name would silently reuse a prior batch's clip.

Sandlot DuckDB, the three lanes in one pass

con = duckdb.connect(str(Path.home()/"sandlot/data/sandlot.duckdb"), read_only=True)

# HOW HARD?!  — hardest-hit balls, reveal exit velo
"select game_pk, game_date, batter, inning, home_team, away_team, events, "
"round(launch_speed,1) ls from pitches where {win} and launch_speed >= 106.0 "
"order by launch_speed desc limit 60"

# EARNED OR LUCKY? — cheap hits (tiny xwOBA that fell), reveal xwOBA
"select ... round(estimated_woba_using_speedangle,3) xw from pitches where {win} "
"and events in ('single','double','triple') "
"and estimated_woba_using_speedangle <= 0.130 "
"order by estimated_woba_using_speedangle asc limit 50"

# UNHITTABLE — nastiest swinging strikeout, reveal pitch + RPM
"select ... pitch_name, round(release_speed,1) velo, round(release_spin_rate) rpm "
"from pitches where {win} and events='strikeout' and description ilike '%swinging%' "
"and pitch_name in ('Sweeper','Curveball','Slider',...) and release_spin_rate is not null "
"order by release_spin_rate desc limit 50"

{win} is game_date >= DATE 'YYYY-MM-DD' (rolling window) or game_year = YYYY. Each row's lane_score/payoff is baked in, then bridged to a StatsAPI playId (find_batted matches launchSpeed within 0.6; find_strikeout takes the K-pitch playId) for film fetch.
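The batted-ball bridge can be sketched as a nearest-match search. This assumes the feed/live JSON exposes exit velocity at playEvents[].hitData.launchSpeed next to playId, which matches the feed as I understand it but is not spelled out above; the function body is an illustration, not the real find_batted.

```python
def find_batted(feed, launch_speed, tol=0.6):
    """Return the playId whose hitData.launchSpeed is closest to launch_speed, within tol."""
    best = None
    for play in feed["liveData"]["plays"]["allPlays"]:
        for ev in play.get("playEvents", []):
            ls = ev.get("hitData", {}).get("launchSpeed")
            if ls is None or "playId" not in ev:
                continue
            diff = abs(ls - launch_speed)
            if diff <= tol and (best is None or diff < best[0]):
                best = (diff, ev["playId"])
    return best[1] if best else None
```

In practice you would also narrow by batter and inning before matching, since two balls in one game can leave the bat at nearly the same speed.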

Perceived velocity (radar vs. felt)

RUBBER = 60.5
avg_ext = mean(extension over THIS league's feed)   # league-relative, not hardcoded 6.0
pv_ratio = startSpeed * (RUBBER - avg_ext) / (RUBBER - extension)   # constant-velo ratio
pv_time  = ((RUBBER - avg_ext) / plateTime) * 0.681818              # ft/s -> mph cross-check
gap = pv_ratio - startSpeed     # "+2.6 MPH OVER RADAR"

With --vo, the clip plays once at speed then a slow-mo replay (setpts=slowfac*(PTS-STARTPTS), no frame interpolation / loop seam), slowfac chosen so total length ≈ VO duration; audio is the VO only, apad-padded.
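A worked sketch of both calculations. The perceived-velocity part follows the formulas above directly; slow_factor is an assumption about how "total length ≈ VO duration" is solved (one pass at speed plus one replay stretched by slowfac), and min_fac is a made-up floor so the replay never runs faster than real time.

```python
RUBBER = 60.5            # feet from the rubber to the plate
FPS_TO_MPH = 0.681818    # ft/s -> mph

def perceived_velocity(start_speed, extension, avg_ext, plate_time=None):
    """Return (pv_ratio, gap_over_radar, pv_time_crosscheck)."""
    pv_ratio = start_speed * (RUBBER - avg_ext) / (RUBBER - extension)
    pv_time = None
    if plate_time is not None:
        pv_time = (RUBBER - avg_ext) / plate_time * FPS_TO_MPH
    return pv_ratio, pv_ratio - start_speed, pv_time

def slow_factor(clip_dur, vo_dur, min_fac=1.0):
    """Solve clip_dur + slowfac * clip_dur = vo_dur for slowfac."""
    return max(min_fac, (vo_dur - clip_dur) / clip_dur)
```

A pitcher with league-average extension gets a gap of exactly zero, and anyone releasing closer to the plate reads as faster than the radar gun.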

Replay-cut auto-detector (long fan-interference clips)

detect_replay_cuts.py runs select='gt(scene,0.3)',showinfo, keeps cuts ≥ FLOOR(13s), clusters timestamps within 0.6s; a cluster of ≥2 = a dissolve/graphic = replay start → cut there. Ambiguous long clips get a safe DEFAULT_CAP=26s flagged for spot-check. Overrides are written playId-keyed.
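The clustering step can be sketched on its own, given the scene-cut timestamps already parsed out of showinfo. The chaining rule (each cut within 0.6s of the previous one joins the cluster) and the return shape are assumptions; the constants are the ones quoted above.

```python
def replay_cut(cut_times, floor=13.0, window=0.6, default_cap=26.0):
    """Return (cut_at_seconds, needs_spot_check).

    A run of 2+ scene cuts within `window` of each other, at or after `floor`,
    is treated as a dissolve/graphic marking the start of the replay.
    """
    late = sorted(t for t in cut_times if t >= floor)
    cluster = []
    for t in late:
        if cluster and t - cluster[-1] <= window:
            cluster.append(t)
            continue
        if len(cluster) >= 2:
            return cluster[0], False
        cluster = [t]
    if len(cluster) >= 2:
        return cluster[0], False
    return default_cap, True      # ambiguous: safe cap, flag for a human look
```

The caller should still clamp the result to the clip's real duration, since the default cap only makes sense for long clips.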

Title playbook (baked into the posters)

40-55 chars, 4-7 words, real player last name (SEO), ONE all-caps power phrase, ONE emoji at the END, one ! max and no ?, never lead with a digit. #Shorts goes in the DESCRIPTION (title hashtags steal feed space), 5-6 tags = 2 general + 2 niche + 1 player. Titles are drawn from a randomized template pool and run through title_ledger.pick_unique(... keys=...) so no phrasing repeats back-to-back and no title is reused across either channel. Example renders: Suzuki Guns Down Herrera at Third 🎯, Skubal's Slider Froze Witt 🥶.

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Python builder that makes a vertical 1080x1920 MLB short from a single Baseball Savant film-room clip: pin the clip to the top as a sharp 1080x1200 panel, fill the rest with a heavily blurred + darkened copy of the same clip, and lay a frosted bottom margin (navy drawbox scrim + thin gold divider) holding a stacked drawtext block: centered gold kicker, pitcher name, velo+pitch, a big gold stat payoff, the victim, and matchup/date. Do it all in one ffmpeg filter_complex. No title card; the stat is the late payoff.

prompt

Add a nasty-pitch lane: query the Baseball Savant statcast_search CSV per breaking/offspeed pitch type for swinging strikes in a date window, rank by release_spin_rate (exclude changeups/splitters/forks/knuckles as spin-sensor glitches), resolve each to a StatsAPI feed/live playId, draw a pitch-tunnel diagram (Bezier release→plate curves with the putaway pitch highlighted gold) with Pillow, and overlay it into the bottom-right of the frosted margin. Cache the fetched clip by playId hash, never by bare index.

prompt

Write a 4-hour cron dispatcher that alternates two YouTube Shorts lanes (nasty and outfield-assist) by posting the opposite of whatever was posted last (read from a shared posted.txt ledger), falling back to the other lane if the chosen one has nothing fresh, and draining a manual pending/ queue first. Use a shared playId-keyed ledger so the same clip is never posted to both channels, and generate titles from a randomized template pool deduped against a cross-channel title ledger.

prompt

Add a Sandlot DuckDB scanner that opens data/sandlot.duckdb read-only and runs three lanes in one pass: HOW HARD?! (launch_speed >= 106 desc), EARNED OR LUCKY? (singles/doubles/triples with estimated_woba_using_speedangle <= 0.130 asc), and UNHITTABLE (swinging-strikeout breaking balls by release_spin_rate desc). Bridge each candidate to a StatsAPI playId via playByPlay (match launchSpeed within 0.6, or take the K-pitch playId) and write a recent.json with the lane payoff and score baked in.

05

A Kick Streamer: VODs into Brainrot Shorts and Dance Edits

A two-pronged short-form video factory I built for a Kick streamer. It turns multi-hour stream VODs into vertical (1080×1920) content for YouTube Shorts / TikTok. Both engines share one clip library and one safety system.

1. Brainrot shorts + EDL highlight reels. I pull the loudest moments out of a VOD, transcribe them word-by-word with faster-whisper, burn karaoke captions plus meme cutaways and SFX, and render 9:16. The mature engine is vod3/build_highlight.py, a JSON-EDL-driven builder where each "beat" declares a source window, framing, FX, narrative caption, emphasis words, reaction memes, animated graphic stamps, and SFX. Captions and audio both run through a multi-tier censor (swears → starred + audio-bleeped; slurs blanked; a crisis/negativity blocklist that *drops the whole beat*).

2. Beat-synced "Go" dance AMV (ohtani_supercut/build_supercut_go_streamer.py). This is a forked Chemical Brothers "Go" supercut engine that cuts a pool of ~70 short dance clips on the musical beat (librosa beat-tracking), with escalating FX (zoom punches, hue shifts, RGB-split accents, 2×2/4×4 mirror kaleidoscopes, grids, a hero-10 mosaic) and the biggest spectacle landing on the song's energy peak, then trimming to a phrase-aligned loop you can play over and over without a visible seam. overlay_gif_bursts.py scatters beat-synced reaction-gif stickers on top.

Because the creator is 21+, the vision safety gate works as content moderation. It judges *movement and framing* rather than skin coverage, runs two adversarial passes, and extracts *full-resolution* frames across the whole clip (not tiled thumbnails) so it catches clips that look "solo/safe" in a thumbnail but aren't at full res.

From one VOD the same rig also cuts two other formats. One is plain highlight reels, where I pull the best live moments of a stream into a tight few minutes. The other is what I call story videos, short narrated arcs paced after the leaked MrBeast production guide, the internal write-up on hooks and retention that went around online. Stories pull B-roll from the clip library but always lay fresh footage on top, so the channel doesn't show you the same moment twice.

The safety check is there for a practical reason. YouTube quietly limits how far it pushes a clip it reads as too revealing, so a borderline frame just doesn't travel, and on top of that the AI model doing the edit will turn down anything it considers too sexual. The gate keeps each clip clear of both limits. It looks at how the body is moving and how the shot is framed rather than measuring how much skin is on screen, since that read is what actually lines up with both the algorithm and the model's own refusals.

APIs & services

| Service / API | What it does here | Docs |
|---|---|---|
| faster-whisper (CTranslate2 Whisper) | Word-level + segment-level speech-to-text for every clip; drives karaoke caption timing, swear-bleep span detection, and the lip-sync word-onset beat grid. Run with CUDA float16 (cuDNN/cuBLAS DLLs injected on Windows), CPU int8 fallback. | |
| librosa | Audio beat-tracking (librosa.beat.beat_track) to build the per-song beat grid for the dance AMV, and RMS energy analysis (librosa.feature.rms) to locate the song's loudest bar so the kaleidoscope/hero payoff lands on the drop. | |
| FFmpeg (libx264 + libass + filtergraphs) | All cutting, composition, ASS subtitle burning, vstack/xstack grids, zoompan punches, boxblur frosted margins, kaleidoscope mirror stacks, gif-burst overlays, audio bleep mixing and loudnorm. ASS (Advanced SubStation Alpha) is the caption format. | |
| yt-dlp | Fetches copyright-safe filler b-roll (cute-animal compilations) for split-screen shorts, and sources reaction-gif / meme beds for the AMV sticker layer. | |
| Google Vertex AI, Gemini (generateContent) | Multimodal art-director / QA advisor: ingests rendered QA montage grids (base64 inline images) and gives concrete edit feedback. Auth via Application Default Credentials. | |
| Kick Clips API | Source of the clip harvest, public stream clips with real view counts and ULID ids feed clip_library.json (the ranked, vibe-tagged reaction pool). | |
| Tenor GIF API | Reaction-meme cutaway pool (memes2/) popped over highlight beats and the AMV sticker scatter. | |

How it's built, step by step

  1. INGEST: download the Kick VOD to a local Windows C: drive instead of the WSL UNC path, since large-file IO is much faster there. Extract a low-rate mono WAV with ffmpeg -i VOD.mp4 -vn -ac 1 -ar 8000 audio8k.wav for peak scanning, plus a 16kHz WAV for transcription.
  2. CANDIDATE PICK (auto): compute RMS energy on 0.5s windows, smooth it, then take the top-N peaks. For long VODs I use an even-bucket spread (vod7/candidate_scan.py: split the timeline into ~30 buckets and take the top 1-2 peaks per bucket with a 25s min-gap) so the candidates aren't all clustered in one hot hour. Writes candidates.json.
  3. FULL TRANSCRIBE: run faster-whisper (medium/large-v3) over the whole stream with VAD → segments.json ({s,e,t} per line). I use that to locate quotable moments by text search.
  4. DRIFT RE-PROBE (critical): segments.json timestamps DRIFT +10..25s vs. the real source time. Before trusting any timestamp, run _scan/probe.py: cut a WIDE window (start-10s .. +30s), accurate two-stage seek, re-transcribe, fuzzy-match the expected text, and emit a corrected start. Library entries are tagged v:1 (drift-verified) or v:0 (probe before use).
  5. VISION SAFETY GATE: extract FULL-RESOLUTION frames across the whole candidate clip (_scan/qa_grab.py), build a montage sheet (qa_sheet.py), and judge movement/framing in two adversarial passes (Gemini multimodal advisor + Claude). Full-res across the timeline is mandatory, because tiled thumbnails can fake a 'solo/safe' read. Failed keys are dropped; survivors get pre-starred captions baked in.
  6. AUTHOR THE EDL: write highlight.json, which is a beats[] array. Each beat = a source window + framing (cam / split / card), optional FX, narrative thread, emphasis words, meme(s), graphics stamps, sfx, echo, multiply.
  7. BUILD (highlight): build_highlight.py PASS 1 = cut + transcribe + SAFETY FILTER (bleep swears, drop crisis beats). PASS 2 = render PCM intermediates → concat → ONE global ASS → final pass mixes SFX + loudnorm ONCE, burns subtitles, overlays TV-static rewind flashes. Output is one warm or brainrot reel.
  8. DANCE AMV (alt track): prep_song.py --src song.mp3 --name foo → librosa beat grid beats_foo.json. Then build_supercut_go_streamer.py --pool clip_pool_streamer.json --song foo.mp3 --beats beats_foo.json cuts the dance pool on-beat with escalating FX, payoff on the energy peak, phrase-aligned loop trim. Optionally overlay_gif_bursts.py --in amv.mp4 --beats beats_foo.json scatters beat-synced sticker bursts.
  9. SHIP: rendered MP4s land in shorts_out/streamer_go/ (AMVs) or per-VOD out/ dirs (highlights). Cron pickup posts to YouTube @StreamerClips / TikTok with native-audio or music-feature-eligible tagging.

Under the hood

1. Auto candidate-pick (RMS energy, even-bucket spread)

I find loud moments cheaply on an 8kHz WAV. The long-VOD version spreads picks across the timeline so they don't cluster:

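# vod7/candidate_scan.py
import wave
import numpy as np

BUCKETS, PER_BUCKET, MIN_GAP = 30, 2, 25.0   # 30 buckets, top-2 peaks each, 25s min gap
w = wave.open(WAV, "rb"); sr = w.getframerate()            # WAV = path to the 8kHz mono extract
a = np.abs(np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float32))
win = int(sr * 0.5); m = (len(a)//win)*win                 # 0.5s windows, drop the ragged tail
rms = np.sqrt((a[:m].reshape(-1, win)**2).mean(axis=1) + 1)
sm  = np.convolve(rms, np.ones(8)/8, mode="same")          # smooth
chosen = []
edges = np.linspace(0, len(sm), BUCKETS + 1, dtype=int)    # even bucket bounds across the stream
for bs, be in zip(edges[:-1], edges[1:]):
    taken = 0
    for idx in np.argsort(sm[bs:be])[::-1] + bs:           # loudest window first within the bucket
        t = idx*0.5
        if any(abs(t-c["start"]) < MIN_GAP for c in chosen): continue
        chosen.append({"start": round(float(t),2), "len": 8.0, "rms": round(float(sm[idx]),1)})
        taken += 1
        if taken >= PER_BUCKET: break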

2. faster-whisper on Windows, the cuBLAS/cuDNN DLL injection

nvidia ships as a namespace package (no __file__), so the GPU DLL dirs must be hand-added or CUDA transcription silently falls back to CPU:

import os, site
import nvidia
nvbases = list(getattr(nvidia, "__path__", []))
for sp in site.getsitepackages():
    c = os.path.join(sp, "nvidia")
    if os.path.isdir(c): nvbases.append(c)
for nvbase in nvbases:
    for sub in ("cublas\\bin", "cudnn\\bin"):
        d = os.path.join(nvbase, sub)
        if os.path.isdir(d):
            os.add_dll_directory(d); os.environ["PATH"] = d + os.pathsep + os.environ["PATH"]
# then: WhisperModel("large-v3", device="cuda", compute_type="float16")
#       .transcribe(clip, language="en", word_timestamps=True, vad_filter=False)

3. The EDL ("segments") schema

highlight.json is the source of truth. One beat:

{"id":10, "kind":"cam", "format":"zoom_punch", "start":3732.5, "len":5.0, "punchAt":1.2,
 "thread":"SIX. SEVEN.", "sfx":"impact", "speech":true,
 "meme":"six_seven_hand_gesture_meme_0", "memeAt":1.5,
 "emphasis":["six","seven"],
 "graphics":[{"text":"6 7","style":"explosion","at":3.9,"dur":1.5}]}

kind ∈ {cam (webcam, blurred frosted margin), split (vstack with filler b-roll), card (title card)}. A srcLen that differs from len triggers a speed change; fx ∈ {reverse, invert, mirror, hue}; crop:[w,h,x,y] frames a sub-region (board-share/corner cam); multiply does the kaleidoscope tiling; rewind adds a VHS static cut.

4. Censor / safety system (load-bearing moderation)

Three tiers, applied at both the burned-caption and the audio level:

SWEAR  = {"fuck","shit","bitch","dick","pussy", ...}        # -> F**K on screen + audio bleep
CENSOR = {"<slur>", ...}                                     # -> blanked caption + audio bleep
SWEAR_AUDIO = SWEAR | CENSOR
# crisis / negativity -> DROP THE ENTIRE BEAT (audio + caption)
BLOCK_TOK    = {"die","kill","suicide","depressed","selfharm", ...}
BLOCK_PHRASE = ["want to die","kill myself","social battery", ...]

def star_token(t):                       # F**K
    m = re.match(r"^(\w)(\w*)(\w)$", t)
    return m.group(1)+"*"*max(1,len(m.group(2)))+m.group(3) if m else t

The audio bleep mutes the swear span and lays a 1000Hz sine over it in one filtergraph:

expr = "+".join(f"between(t,{s:.2f},{e:.2f})" for s,e in spans)
ff(["-i", clip, "-filter_complex",
    f"sine=f=1000:r=48000:d={LEN+0.6}[bp];"
    f"[0:a]volume='if({expr},0,1)':eval=frame[vv];"      # mute the swear
    f"[bp]volume='if({expr},0.35,0)':eval=frame[bb];"    # gate the beep
    f"[vv][bb]amix=inputs=2:normalize=0:duration=first[a]",
    "-map","0:v","-map","[a]","-c:v","copy","-c:a","aac", tmp])

A backstop: harvested library captions are *pre-starred deterministically* (build_lib.py), so even a render-time whisper misfire can never put an un-starred swear on screen.
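How the three tiers combine for one beat can be sketched like this. The word sets are abbreviated stand-ins for the real lists, censor_beat is a hypothetical name, and uppercasing the starred token is an assumption based on the F**K shown above.

```python
import re

SWEAR = {"fuck", "shit"}                 # abbreviated for the example
CENSOR = {"<slur>"}                      # placeholder, as in the lists above
BLOCK_TOK = {"die", "kill", "suicide"}
BLOCK_PHRASE = ["want to die", "kill myself"]

def star_token(t):
    m = re.match(r"^(\w)(\w*)(\w)$", t)
    return m.group(1) + "*" * max(1, len(m.group(2))) + m.group(3) if m else t

def censor_beat(words):
    """Return None to drop the whole beat, else a list of (caption_text, bleep_audio)."""
    norm = [re.sub(r"[^\w<>]", "", w.lower()) for w in words]
    text = " ".join(norm)
    if any(t in BLOCK_TOK for t in norm) or any(p in text for p in BLOCK_PHRASE):
        return None                                  # crisis tier: drop audio + caption
    out = []
    for raw, tok in zip(words, norm):
        if tok in CENSOR:
            out.append(("", True))                   # slur: blank caption, bleep
        elif tok in SWEAR:
            out.append((star_token(tok).upper(), True))   # swear: starred, bleep
        else:
            out.append((raw, False))
    return out
```

Checking the drop tier first matters, because a beat that will be removed should never reach the caption or bleep stages at all.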

5. The "segments drift → re-probe" lesson

# _scan/probe.py  — segments.json starts drift +10..25s, so search a WIDE window
PRE, POST = 10.0, 30.0
# accurate TWO-STAGE seek: coarse keyframe seek to start-3, then fine decode +3
ffmpeg -ss {s0-3} -i src -ss 3 -t {dur} _probe.mp4
# re-transcribe, fuzzy word-overlap match against the expected text, suggest corrected start
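The last comment line is the interesting part. Here is one way the fuzzy word-overlap match could look, as a sketch only: the function names, the sliding-window scoring, and the input shape are my assumptions rather than what probe.py actually does.

```python
import re

def tokens(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def suggest_start(expected_text, probe_words, probe_origin):
    """probe_words: [(word, start_sec_within_probe), ...] from the re-transcription.
    probe_origin: absolute source time where the probe clip begins.
    Returns (corrected_absolute_start, overlap_score) or None."""
    want = tokens(expected_text)
    got = [(tokens(w)[0], t) for w, t in probe_words if tokens(w)]
    if not want or not got:
        return None
    best = None
    for i in range(len(got)):
        window = {w for w, _ in got[i:i + len(want)]}
        score = len(window & set(want)) / len(set(want))   # share of expected words found
        if best is None or score > best[1]:
            best = (probe_origin + got[i][1], score)
    return best

probe = [("um", 0.5), ("wait", 12.0), ("WHAT", 12.4), ("okay", 13.0)]
print(suggest_start("wait what", probe, 3700.0))
```

A low best score is a signal to widen PRE/POST and probe again rather than trust the suggestion.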

6. Brainrot composition filtergraphs (build_highlight.py)

Frosted purple margin (chat column pre-cropped away), centered foreground:

PRECROP = "crop=iw*0.78:ih:iw*0.22:0"          # drop Kick mobile chat column (left 22%)
FROST_BG = ("scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,"
            "boxblur=22:2,eq=brightness=0.05:saturation=1.65,"
            "drawbox=x=0:y=0:w=1080:h=1920:color=<PURPLE>@<ALPHA>:t=fill")  # purple tint; color and opacity are placeholders

Kaleidoscope "multiply into infinity" replicates one treated frame into a growing n×n grid (1→2→4→6→8→10):

def tile_grid(src, dst, n):
    cw, ch = 1080 // n, 1920 // n
    # split the treated frame into n*n copies, then scale each copy to one cell
    parts = [f"{src}split={n*n}" + "".join(f"[t{i}]" for i in range(n*n))]
    for i in range(n*n):
        parts.append(f"[t{i}]scale={cw}:{ch}[g{i}]")
    rows = []
    for r in range(n):
        # hstack one row of n cells
        parts.append("".join(f"[g{r*n+c}]" for c in range(n)) + f"hstack={n}[row{r}]")
        rows.append(f"[row{r}]")
    # vstack the n rows into the finished grid
    parts.append("".join(rows) + f"vstack={n}{dst}")
    return ";".join(parts)

Karaoke is one *global* ASS, offset to measured cumulative beat starts; the active word gets a neon pop + wiggle, while the emphasis word stays green. Audio is mixed and loudnormed ONCE in a separate audio-only pass. Combining the many-input amix+loudnorm with the ASS video filter in one graph triggers an ffmpeg audio-doubling bug.
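"Offset to measured cumulative beat starts" is easier to see in code. A minimal sketch, assuming each beat records its measured rendered duration and word timings local to that beat; the data shape and both helper names are hypothetical.

```python
from itertools import accumulate

def global_word_times(beats):
    """beats: [{"dur": measured_seconds, "words": [(text, start, end), ...]}]
    with word times local to each beat. Returns words on the global timeline."""
    starts = [0.0] + list(accumulate(b["dur"] for b in beats))[:-1]
    out = []
    for offset, beat in zip(starts, beats):
        for text, s, e in beat["words"]:
            out.append((text, round(offset + s, 3), round(offset + e, 3)))
    return out

def ass_time(t):
    """Seconds -> ASS timestamp in H:MM:SS.cc form."""
    cs = int(round(t * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

beats = [{"dur": 5.0, "words": [("six", 1.0, 1.4)]},
         {"dur": 3.5, "words": [("seven", 0.5, 0.9)]}]
for text, s, e in global_word_times(beats):
    print(text, ass_time(s), ass_time(e))
```

Using measured durations rather than the planned `len` values is the point: any speed change or trim shifts every later caption if the offsets come from the plan.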

7. Beat-synced "Go" dance AMV engine

prep_song.py beat-tracks any song with librosa:

y, sr = librosa.load(song, sr=22050, mono=True)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr, units="frames")
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
# -> beats_<name>.json {tempo, beat_times, hook_start, hook_end, bars, duration}

build_supercut_go_streamer.py finds the loudest ~1.5s bar (the payoff anchor) and builds a bar-quantized escalation plan:

def energy_peak_rel(song, hook_start, hook_end):
    y, sr = librosa.load(song, sr=22050, mono=True)
    rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
    win = int(1.5*sr/512); sm = np.convolve(rms, np.ones(win)/win, mode="same")
    ta = np.arange(len(sm))*512/sr
    scored = np.where((ta>=hook_start)&(ta<hook_end), sm, -1.0)
    return float(ta[int(np.argmax(scored))] - hook_start)

Scene modes: SOLO / ZOOMIN / ZOOMOUT / HUE / INVERT / RGBSPLIT per beat; phrase boundaries (every 4th bar) escalate GRID4→GRID8→GRID9→GRID16 and MULT4→MULT16; the peak bar gets MULT16 then HERO10 (720px hero + nine 180px minis via chained overlays); the finale is a rapid KHR montage. The kaleidoscope is nested 2×2 mirroring:

# render_multiply: levels=2 -> 16 copies
fc += (f"[{cur}]split=4[a][b][c][d];[b]hflip[b2];[c]vflip[c2];[d]hflip,vflip[d2];"
       f"[a][b2]hstack[top];[c2][d2]hstack[bot];[top][bot]vstack[out];")

Photosensitivity guard: INVERT/RGBSPLIT are single-beat accents only, never chained (a fx_accent_cooldown enforces it) because sustained strobing gets down-ranked. The background is a Philippines blue→red gradient (hidden behind the footage band) with scattered confetti. Output is trimmed to LOOP_SECS (phrase-aligned, ~3×8-bar phrases ≈ 49–50.5s) so the loop has no visible seam. --clean collapses all grids/mults to SOLO for a lip-sync edit.
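The loop length falls straight out of the tempo. A quick sketch of the arithmetic, assuming 4 beats per bar (the source doesn't state the time signature) and hypothetical helper names:

```python
def loop_secs(tempo_bpm, phrases=3, bars_per_phrase=8, beats_per_bar=4):
    """Length in seconds of a phrase-aligned loop at a steady tempo."""
    beats = phrases * bars_per_phrase * beats_per_bar
    return beats * 60.0 / tempo_bpm

def loop_window(beat_times, first_beat=0, phrases=3, bars_per_phrase=8, beats_per_bar=4):
    """Snap the loop to tracked beats instead of trusting a constant tempo."""
    last = first_beat + phrases * bars_per_phrase * beats_per_bar
    if last >= len(beat_times):
        raise ValueError("song has too few tracked beats for this loop")
    return beat_times[first_beat], beat_times[last]

print(loop_secs(115))   # roughly 50 seconds
```

Under that 4/4 assumption, three 8-bar phrases are 96 beats, so a 49 to 50.5s loop corresponds to a tempo somewhere around 114 to 118 BPM.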

8. Beat-synced gif-sticker bursts

overlay_gif_bursts.py pops a fresh sticker on every Nth beat at rotating scatter slots. Inputs stay bounded by grouping bursts by (file,size): each group is one looped input, scaled+padded with a neon border once, then split per burst:

fc.append(f"[{in_idx}:v]scale={size}:-1:flags=lanczos,setsar=1,"
          f"pad=iw+{2*BORDER}:ih+{2*BORDER}:{BORDER}:{BORDER}:color={acc}[{pg}]")
fc.append(f"[{pg}]split={n}{outs}")
# each overlay gated to its beat window with a sine bob:
fc.append(f"[{cur}][{lbl}]overlay=x={b['x']}:y='{b['y']}+9*sin(6*t)':"
          f"enable='between(t,{b['t0']},{b['t1']})'[{nxt}]")

Hard no-kids rule is enforced in source selection: any gif depicting minors is removed from the pool by hand, and "extra" categories are structurally human-free (explosions / car crashes / dancing cats).
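The (file,size) grouping that keeps the ffmpeg input count bounded can be sketched in a few lines. The burst dict shape and the label naming here are assumptions for illustration, not the script's real ones.

```python
from collections import defaultdict

def group_bursts(bursts):
    """Group bursts so each distinct (file, size) pair becomes ONE ffmpeg input."""
    groups = defaultdict(list)
    for b in bursts:
        groups[(b["file"], b["size"])].append(b)
    return dict(groups)

def split_labels(group_idx, n):
    """Output labels for the split that fans one prepared sticker out to n bursts."""
    return "".join(f"[g{group_idx}_{k}]" for k in range(n))

bursts = [
    {"file": "cat.gif", "size": 320, "t0": 1.0, "t1": 1.5},
    {"file": "cat.gif", "size": 320, "t0": 3.0, "t1": 3.5},
    {"file": "boom.gif", "size": 480, "t0": 2.0, "t1": 2.5},
]
groups = group_bursts(bursts)
for gi, ((path, size), members) in enumerate(groups.items()):
    print(path, size, f"split={len(members)}{split_labels(gi, len(members))}")
```

Three bursts here need only two inputs; without grouping, a long song with a burst every few beats would open one input per burst.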

9. clip_library.json schema (the harvest)

{"id":"clip_01J...ULID", "file":"000094_Pelly_clip_...mp4", "title":"...",
 "views":12702, "mature":false, "is_streamer":true,
 "category":"reaction", "vibe":["gasp","shocked","deadpan"],
 "safe":true, "flags":["none"], "win":[14,18], "caption":"wait WHAT", "score":4}

win=[start,end] in clip-local seconds. Sibling files: clip_library_gold.json (the 26 best reaction beats) and streamer_reactions.json (cross-VOD reactions you can drop straight into an EDL as a beat; they carry forceText, src, start, len, bleep, and a v verification flag).
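As a rough illustration of how this schema gets used, here is a sketch that pulls safe reaction clips for a given vibe and ranks them. The function names and the ranking rule (score, then views) are my own assumptions.

```python
def pick_reactions(library, vibe, min_score=3):
    """Return safe, non-mature clips tagged with `vibe`, best first."""
    hits = [c for c in library
            if c.get("safe") and not c.get("mature")
            and vibe in c.get("vibe", [])
            and c.get("score", 0) >= min_score]
    return sorted(hits, key=lambda c: (-c["score"], -c.get("views", 0)))

def window_len(clip):
    """Length of the usable window in clip-local seconds."""
    start, end = clip["win"]
    return end - start

library = [
    {"id": "a", "safe": True, "mature": False, "vibe": ["gasp", "shocked"],
     "score": 4, "views": 12702, "win": [14, 18]},
    {"id": "b", "safe": False, "mature": False, "vibe": ["gasp"],
     "score": 5, "views": 900, "win": [0, 3]},
]
print([c["id"] for c in pick_reactions(library, "gasp")])
```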

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Python pipeline that takes a multi-hour Kick stream VOD, finds the loudest moments by RMS energy (spread evenly across the timeline so they don't cluster), cuts them, transcribes each with faster-whisper word timestamps, and renders 1080x1920 brainrot shorts with neon karaoke captions where the active word pops and a punchy emoji title sits on top. Make two render styles: a blurred-background centered 'full' style and a 'split' style with cute-animal filler b-roll on the bottom half.

prompt

Add a JSON-EDL highlight builder where each beat declares a source window, framing (webcam / split / title card), optional speed/reverse/hue FX, a narrative caption thread, emphasis words to highlight green, reaction-meme cutaways, animated graphic stamps, and SFX. Run a censor that stars swears in the burned text AND bleeps them in the audio with a 1000Hz tone, blanks slurs, and DROPS any beat whose speech trips a crisis/self-harm blocklist. Mix all SFX and loudnorm once in a separate audio pass to avoid the ffmpeg audio-doubling bug.

prompt

Build a beat-synced dance AMV engine: beat-track any song with librosa into a beats.json, then cut a pool of short vertical dance clips on the beat with escalating FX (zoom punches, hue shifts, single-beat invert/RGB-split accents, 2x2 and 4x4 mirror kaleidoscopes, grids, and a hero+9-minis mosaic), and land the biggest spectacle on the song's loudest bar (RMS energy peak). Trim the output to a phrase-aligned loop with no visible seam. Add a separate pass that scatters beat-synced reaction-gif stickers at rotating slots, grouping by (file,size) to keep the ffmpeg input count bounded.

prompt

Build a content-moderation safety gate for a 21+ creator's clip harvest that judges movement and framing rather than skin coverage. Extract full-resolution frames across the whole clip (not thumbnails) into a montage, run two adversarial vision passes (a Vertex AI Gemini multimodal advisor plus a second judge), drop any clip that fails, and merge survivors into a vibe-tagged clip_library.json with real view counts, a clip window, a pre-starred caption, and a drift-verification flag.

06

Remotion Video Factories: React-Rendered Documentary Shorts (a UAP channel)

I built a Remotion 4 video factory that renders dark "declassified dossier" documentary shorts about the Department of War UAP releases (the war.gov/ufo / PURSUE theme) for the uap-channel channel. Each episode is a 1920x1080, 30fps React composition. An ElevenLabs voiceover drives the runtime, burned-in captions ride the VO timeline, and the visuals are 100% procedural React/SVG with no stock footage: glowing UAP orbs, FLIR reticles, recon maps with pulsing crosshairs, silhouetted witnesses, redacted memos, and a persistent CLASSIFIED//PURSUE chrome frame. A low eerie drone bed sits under everything.

The architecture I settled on is a shared design-system library plus a per-video isolated entry point. src/lib/ is a barrel of reusable dossier components (DossierFrame, Orb, MapPing, RedactedDoc, Captions, Grain, …). Every video lives in its own src/videos/<name>/ folder with its OWN index.ts that calls registerRoot(), so renders never cross-couple and multiple agents can build episodes in parallel without touching each other's files. Audio duration is read at bundle time via calculateMetadata + getAudioDurationInSeconds, so the composition length auto-fits the narration.

The pattern generalizes. ohtani-longform is the same skeleton scaled up to multi-chapter long-form essays (per-chapter VO durations summed in calculateMetadata, props injected into a <Series>). ufo-shorts-claudec is the pure-ffmpeg contrast: same documentary DNA (ElevenLabs narration, Ken-Burns cuts, redacted-doc insets, UNRESOLVED stamps) but assembled entirely with an ffmpeg filtergraph instead of React, output as vertical 1080x1920 Shorts.

APIs & services

Service / API | What it does here
Remotion (core) | React-based programmatic video framework: Composition/Sequence/AbsoluteFill, useCurrentFrame, interpolate, spring, Audio/OffthreadVideo/Img, staticFile, registerRoot
@remotion/cli | Headless render + studio CLI: remotion compositions, remotion render, remotion studio; configured via remotion.config.ts
@remotion/media-utils, getAudioDurationInSeconds | Reads VO mp3 length inside calculateMetadata to compute durationInFrames so the video auto-fits the narration
@remotion/google-fonts | Module-level font loading (Oswald/Special Elite/Share Tech Mono); Remotion waits for fonts before rendering. Weights MUST be restricted to avoid dozens of font requests per render
ElevenLabs Text-to-Speech (with-timestamps) | Generates VO audio + char-level alignment in one call (POST /v1/text-to-speech/{voiceId}/with-timestamps); alignment arrays are chunked into caption cues. Adam voice id pNInz6obpgDQGcFmaJgB
@remotion/captions | Caption helpers used in the ohtani long-form variant (words→lines) alongside the ElevenLabs alignment
FFmpeg (filtergraph) | Audio mixing (amix/adelay/stream_loop) for the drone/SFX bed; in the ufo-shorts variant, the entire video: blurred-fill vertical framing, zoompan Ken-Burns, drawtext captions, doc-inset overlays

How it's built, step by step

  1. Write the script as audio-friendly prose in public/<name>/narration.txt (e.g. spell out 'war dot gov, slash, U F O' so the TTS reads it cleanly).
  2. Generate VO + captions: ELEVENLABS_API_KEY=<key> node tools/gen_vo.mjs pNInz6obpgDQGcFmaJgB public/<name>. That calls the ElevenLabs /with-timestamps endpoint and writes vo.mp3, cues.json (caption chunks), and meta.json (duration). pNInz6… = Adam, the house narrator.
  3. Scaffold the video by copying src/videos/potato/ as the structural template: index.ts (registerRoot, NO JSX), Root.tsx (one <Composition> with calculateMetadata), <Name>Video.tsx (the timeline), plus any video-specific hero components (e.g. OrbSplit.tsx).
  4. Build the timeline by composing shared-lib components from ../../lib inside <Sequence> blocks, using a local seg(T,a,b) helper to lay scenes out as fractions of total duration T; wrap each in <FadeScene>; import cues via import cuesData from '../../../public/<name>/cues.json'.
  5. Validate the bundle: npx remotion compositions src/videos/<name>/index.ts (catches missing barrel exports / React #130 before a full render).
  6. Render headless: npx remotion render src/videos/<name>/index.ts <Id> out/<name>.mp4 --log=error (Chrome headless, proven working on Windows local C:).
  7. Verify visually: extract scene frames with ffmpeg -i out/<name>.mp4 -vf "select='not(mod(n\,200))'" -vsync vfr out/<name>_%02d.png and eyeball each scene (no blank frames, captions visible, graphics present).
  8. Compress for upload: ffmpeg -i out/<name>.mp4 -c:v libx264 -preset slow -crf 21 -pix_fmt yuv420p -c:a aac -b:a 192k out/<name>_web.mp4 (the default render comes out around 450MB for 2 min).

Under the hood

1. Remotion 4 setup & the per-video isolation registry

remotion.config.ts is tiny. It sets JPEG frames and overwrite:

import {Config} from '@remotion/cli/config';
Config.setVideoImageFormat('jpeg');
Config.setOverwriteOutput(true);

The key architectural move is that each video is its own Remotion root. There is no single registry of all videos. Instead every src/videos/<name>/index.ts registers exactly one root, and you point the CLI at that file. That is what lets parallel agents build episodes without merge collisions.

// src/videos/potato/index.ts  — MUST contain no JSX
import {registerRoot} from 'remotion';
import {RemotionRoot} from './Root';
registerRoot(RemotionRoot);

2. Audio-driven duration via calculateMetadata

The composition length is computed from the VO file at bundle time, so the video is exactly as long as the narration (plus a 2s tail). calculateMetadata runs before render and can return both durationInFrames and injected props.

// src/videos/potato/Root.tsx
import {Composition, staticFile} from 'remotion';
import {getAudioDurationInSeconds} from '@remotion/media-utils';
import {PotatoVideo} from './PotatoVideo';

export const RemotionRoot = () => (
  <Composition
    id="Potato" component={PotatoVideo}
    fps={30} width={1920} height={1080}
    durationInFrames={3972}            // placeholder; overridden below
    calculateMetadata={async () => {
      const d = await getAudioDurationInSeconds(staticFile('potato/vo.mp3'));
      return {durationInFrames: Math.ceil(d * 30) + 60};
    }}
  />
);

The long-form variant (ohtani) uses the same hook at a larger scale: it fetches a vo/index.json of per-chapter durations, reads each chapter's char-alignment to build caption lines, rotates a b-roll pool per chapter, sums everything, and returns {durationInFrames: total, props: {chapters}} to drive a <Series>:

calculateMetadata={async () => {
  const index = await fetchJson('vo/index.json');
  // byId (derivation not shown) maps chapter id -> VO duration in seconds, taken from index
  const chapters = [];
  for (const spec of SCRIPT) {
    const align = await fetchJson(`vo/${spec.id}.align.json`);
    chapters.push({...spec, durationSec: byId[spec.id], lines: linesFromWords(wordsFromAlign(align))});
  }
  const total = introF + chapters.reduce((s,c)=> s + Math.round(c.durationSec*FPS), 0);
  return {durationInFrames: total, props: {chapters}};
}}

3. The timeline: fraction-based scene layout

PotatoVideo.tsx lays scenes out as fractions of the total duration T with a local helper, wrapping each in a <Sequence> + <FadeScene>:

const seg = (T:number, a:number, b:number) =>
  ({from: Math.floor(a*T), durationInFrames: Math.floor((b-a)*T)});

export const PotatoVideo = () => {
  const {durationInFrames: T} = useVideoConfig();
  return (
    <DossierFrame fileId="FBI / FD-302" readout="FORT CARSON, CO · 15 FEB 2022">
      <Narration src="potato/vo.mp3" />
      <MusicBed src="audio/drone.mp3" volume={0.13} />
      <Sequence {...seg(T,0,0.12)}><FadeScene><Mountain/></FadeScene></Sequence>
      <Sequence {...seg(T,0.12,0.24)}><FadeScene><MapPing map="us" nx={0.335} ny={0.5} .../></FadeScene></Sequence>
      {/* ...more scenes... */}
      <Captions cues={cues} />
      <Grain intensity={0.1} />
    </DossierFrame>
  );
};
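One detail of the seg helper is worth checking by hand. This is a Python port of the same arithmetic (Python only so it matches the other sketches on this page), plus a hypothetical `seg_tight` variant of my own that derives each duration from the two floored endpoints:

```python
import math

def seg(T, a, b):
    """Port of the TypeScript helper: floor the start and the length independently."""
    return {"from": math.floor(a * T), "durationInFrames": math.floor((b - a) * T)}

def seg_tight(T, a, b):
    """Variant: length = floored end - floored start, so adjacent scenes always meet."""
    start = math.floor(a * T)
    return {"from": start, "durationInFrames": math.floor(b * T) - start}

T = 3972
print(seg(T, 0.12, 0.24))        # starts at frame 476, runs 476 frames, ends at 952
print(seg(T, 0.24, 0.36))        # starts at frame 953
print(seg_tight(T, 0.12, 0.24))  # runs 477 frames, ends at 953
```

Flooring both values independently can leave a single uncovered frame between neighbouring scenes. With a fade on every scene that is likely invisible, but the tight variant removes the gap if it ever shows up.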

4. useCurrentFrame animation (the procedural visuals)

All motion is useCurrentFrame() + interpolate/spring. The glowing UAP orb is pure CSS radial-gradients with frame-driven flicker/drift and a seeded random():

const flicker = 0.85 + 0.15*Math.sin(frame/4) + 0.05*(random(`${seed}-${Math.floor(frame/3)}`)-0.5);
const dx = Math.sin(frame/40)*drift + (random(`${seed}x`)-0.5)*10;
// core: radial-gradient(circle at 38% 35%, #fff 0%, ${color} 45% ...) + boxShadow bloom

Film grain is an animated SVG feTurbulence whose seed shifts every frame so the grain crawls (note the per-frame filter id to force re-render):

<filter id={`noise-${frame}`}>
  <feTurbulence type="fractalNoise" baseFrequency="0.9" numOctaves="2" seed={seed} stitchTiles="stitch"/>
  <feColorMatrix type="saturate" values="0"/>
</filter>

The RedactedDoc reveals memo lines one-by-one (interpolate(frame, [i*revealPerLine, i*revealPerLine+6], [0,1])) then springs in a rotated UNRESOLVED stamp; redaction bars are {redact: 0..1} width spans.

5. ElevenLabs TTS → caption cues (the real teachable)

tools/gen_vo.mjs calls the with-timestamps endpoint, decodes base64 audio, then chunks the char-level alignment into caption cues (end at sentence punctuation, or at a comma past 58 chars):

const res = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${voice}/with-timestamps?output_format=mp3_44100_128`,
  {method:'POST', headers:{'xi-api-key': KEY, 'Content-Type':'application/json'},
   body: JSON.stringify({
     text,
     model_id: 'eleven_multilingual_v2',
     voice_settings: {stability:0.5, similarity_boost:0.9, style:0.18, use_speaker_boost:true},
   })});
const data = await res.json();
fs.writeFileSync(`${outDir}/vo.mp3`, Buffer.from(data.audio_base64, 'base64'));
const {characters, character_start_times_seconds: st, character_end_times_seconds: en} = data.alignment;
// chunk text -> [{start, end, text}] cues using st[]/en[]
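The chunking rule (end a cue at sentence punctuation, or at a comma once the cue is past 58 characters) can be sketched like this. It's a Python illustration of the rule as described, not a port of gen_vo.mjs; `chunk_cues` and its handling of whitespace are assumptions.

```python
def chunk_cues(chars, starts, ends, soft_limit=58):
    """chars/starts/ends: the three parallel alignment arrays.
    Returns [{"start", "end", "text"}] caption cues."""
    cues, buf, cue_start = [], [], None
    for ch, s, e in zip(chars, starts, ends):
        if cue_start is None:
            if ch.isspace():
                continue                 # skip whitespace between cues
            cue_start = s
        buf.append(ch)
        if ch in ".!?" or (ch == "," and len(buf) > soft_limit):
            cues.append({"start": cue_start, "end": e, "text": "".join(buf).strip()})
            buf, cue_start = [], None
    if buf:                              # trailing text with no closing punctuation
        cues.append({"start": cue_start, "end": ends[-1], "text": "".join(buf).strip()})
    return cues

text = "Hi there. Go now!"
starts = [i * 100 for i in range(len(text))]        # fake timings, in ms for readability
ends = [(i + 1) * 100 for i in range(len(text))]
print(chunk_cues(list(text), starts, ends))
```

A naive rule like this also splits on the dot in "3.5" or "U.S.", which is one more reason to spell numbers and abbreviations out in narration.txt.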

Caption rendering boils down to "find the active cue for the current second and fade it":

const t = frame / fps;
const active = cues.find(c => t >= c.start && t < c.end);
// opacity = interpolate(local, [0,0.12,dur-0.12,dur], [0,1,1,0])
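The same lookup and fade, written out as a Python sketch so the edge case is visible. `caption_at` is a hypothetical name; the 0.12s fade comes from the comment above, and the shrinking ramp for short cues is my addition.

```python
def caption_at(cues, t, fade=0.12):
    """Return (text, opacity) for time t in seconds, or (None, 0.0) if no cue is active."""
    for cue in cues:
        if cue["start"] <= t < cue["end"]:
            local = t - cue["start"]
            dur = cue["end"] - cue["start"]
            ramp = min(fade, dur / 2)      # shrink the fade for very short cues
            opacity = min(1.0, local / ramp, (dur - local) / ramp)
            return cue["text"], max(0.0, opacity)
    return None, 0.0

cues = [{"start": 1.0, "end": 3.0, "text": "FORT CARSON, 2022"}]
print(caption_at(cues, 2.0))
```

The ramp guard matters because a cue shorter than 0.24s would make the `[0, 0.12, dur-0.12, dur]` input range non-increasing, which, as far as I know, Remotion's interpolate rejects.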

The key is never hardcoded. The technique is to grep it out of the env file at run time: KEY=$(grep -m1 ELEVENLABS_API_KEY <YOUR_ENV_FILE> | sed -E 's/.*=//' | tr -d '\"'). The pattern generalizes by swapping voiceId/model_id: uap-channel = Adam (pNInz6obpgDQGcFmaJgB, multilingual_v2); ohtani = Brian; ufo-shorts = josh on eleven_v3. From inside WSL, TLS connections to api.elevenlabs.io never complete, so the ohtani gen_vo.py runs on Windows Python and reads the key over the \\wsl.localhost share.

6. Audio bed: what's actually in source

The drone bed is a baked, looped mp3 mixed under VO with a fade envelope. There is NO synthesis command checked into the repo (README literally calls it "procedural, ffmpeg, unclaimable"; the generator was a one-off and was not committed). The Remotion side only loops and fades:

// MusicBed: <Audio loop> with an interpolate volume envelope
const v = interpolate(frame, [0, fadeFrames, durationInFrames-fadeFrames, durationInFrames],
                      [0, volume, volume, 0], {extrapolateLeft:'clamp', extrapolateRight:'clamp'});
return <Audio src={staticFile(src)} volume={v} loop />;

The pure-ffmpeg variant mixes a looped bed the same way (add_music.sh):

ffmpeg -i "$IN" -stream_loop -1 -i "$BED" \
  -filter_complex "[1:a]volume=0.12[m];[0:a][m]amix=inputs=2:duration=first:normalize=0[a]" \
  -map 0:v -map "[a]" -c:v copy -c:a aac -b:a 192k "$OUT"

If you need to *generate* an eerie bed (illustrative, not in this repo): layer detuned low sines plus filtered noise, e.g. ffmpeg -f lavfi -i "sine=f=55:d=120" -f lavfi -i "anoisesrc=d=120:c=brown:a=0.08" -filter_complex "[0][1]amix,lowpass=f=200,tremolo=f=0.2:d=0.6,aformat=..." drone.mp3.

7. The ffmpeg-only contrast (ufo-shorts build_shorts.py)

Same documentary genre, no React. One config dict per short, then a single big filtergraph. The techniques worth stealing: blurred-fill vertical framing (undistorted footage centered over a blurred cover of itself, so 16:9 source doesn't stretch in 9:16) plus a gentle zoompan push:

f"[0:v]trim={ip}:{ip+ln},setpts=PTS-STARTPTS,fps={FPS},split=2[bz][fz];"
f"[bz]scale={W}:{H}:force_original_aspect_ratio=increase,crop={W}:{H},boxblur=26:2[bgb];"
f"[fz]scale={W}:-2:flags=lanczos[fgb];"
f"[bgb][fgb]overlay=(W-w)/2:(H-h)/2[ov];"
f"[ov]zoompan=z='min(1.001+{rate}*on,1.09)':d=1:s={W}x{H}:fps={FPS}[s];"

Per-block VO is delayed onto the timeline and mixed with SFX:

f"[{2+i}:a]adelay={ms}|{ms}[n{i}];"          # each narration block at its offset
# ... boom/whoosh/bed ...
f"amix=inputs={n+3}:duration=longest:normalize=0,aresample=44100[aout]"

Captions/title/UNRESOLVED stamp are drawtext with enable='between(t,a,b)', and a deterministic glyph-aware fit_pt() (via PIL ImageFont.getbbox) sizes the title so it never overflows the 960px safe width.
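For the per-block VO delay shown earlier in this part, here is a sketch of how the adelay chain and the final amix line might be generated from a list of block offsets. The function name and the `[boom]`, `[whoosh]`, `[bed]` labels are placeholders I made up; those three streams would have to be defined elsewhere in the same graph.

```python
def narration_mix(offsets_sec, first_input=2, extra_labels=("[boom]", "[whoosh]", "[bed]")):
    """Build the audio half of the filtergraph for n narration blocks."""
    parts, labels = [], []
    for i, off in enumerate(offsets_sec):
        ms = int(round(off * 1000))                  # adelay takes milliseconds
        parts.append(f"[{first_input + i}:a]adelay={ms}|{ms}[n{i}]")
        labels.append(f"[n{i}]")
    n_inputs = len(labels) + len(extra_labels)
    mix = ("".join(labels) + "".join(extra_labels) +
           f"amix=inputs={n_inputs}:duration=longest:normalize=0,aresample=44100[aout]")
    return ";".join(parts + [mix])

print(narration_mix([0.0, 4.25, 9.8]))
```

The `ms|ms` form delays both stereo channels by the same amount; giving only one value would leave the second channel undelayed.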

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Remotion 4 project that renders 1920x1080 30fps documentary shorts in a 'classified dossier' style. Make a shared src/lib/ barrel of reusable components (a persistent CLASSIFIED frame with corner brackets + timecode, a glowing UAP orb with CSS-gradient bloom and frame-driven flicker, a recon map with a pulsing crosshair at normalized coords, a redacted memo that reveals lines one-by-one and stamps UNRESOLVED, burned-in captions, film grain). Each video gets its OWN folder under src/videos/<name>/ with its own index.ts that calls registerRoot, so renders never cross-couple.

prompt

Wire ElevenLabs TTS so the video length auto-fits the narration: write a node script that POSTs narration.txt to the /v1/text-to-speech/{voiceId}/with-timestamps endpoint (Adam voice, eleven_multilingual_v2), decodes the base64 audio to vo.mp3, and chunks the char-level alignment into a cues.json of {start,end,text} caption segments. In Root.tsx use calculateMetadata + getAudioDurationInSeconds(staticFile('vo.mp3')) to set durationInFrames = ceil(d*30)+60.

prompt

Lay out the timeline as fractions of total duration: a seg(T,a,b) helper feeding <Sequence> blocks wrapped in a FadeScene, with a low looped drone bed mixed under the VO via <Audio loop> and an interpolate volume fade. Animate everything with useCurrentFrame + interpolate/spring, no stock footage. Then render headless with npx remotion render <entry> <Id> out.mp4.

prompt

Make a pure-ffmpeg vertical (1080x1920) Shorts variant of the same documentary format: one Python config dict per short (narration blocks, captions, b-roll in-points), ElevenLabs VO per block, blurred-fill framing (undistorted footage centered over a blurred cover of itself) with a zoompan push, drawtext captions/title/UNRESOLVED stamp gated by enable='between(t,a,b)', and real declassified-document insets on the file-citing beats.

From the renders

Rendered UAP archive frame: a red orb beside a redacted Department-of-War memo
A real render frame. The whole archive look, CLASSIFIED // PURSUE header, corner brackets, a live REC timecode, a witness-video red orb, and a Department-of-War memo with blacked-out lines and a springing UNRESOLVED stamp, is drawn entirely in React and rendered by Remotion.
Handheld night witness footage of a red orb with a white plasma core
The "handheld witness video" treatment: a vertical night clip composited inside the dossier frame, VO-synced lower-third caption burned in from the timestamped narration.
A 1948 Department-of-War memo with a redacted line and a red FLYING DISC stamp
The redacted-document component, a procedurally typeset memo, one line blacked out, and a rotated rubber stamp that springs in on a spring() animation keyed to the frame.
Daytime digital recreation of a metallic potato-shaped object over mountains
A "digital recreation" scene over real terrain, same composition system, daytime palette, caption box pinned bottom-center. One Remotion project, many looks.
07

Editor-API Experiments: CapCut/JianYing Draft Automation (and the Canva Non-Story)

I wanted to see whether I could drive a consumer video editor (CapCut International, or its China twin 剪映 JianYing) entirely from code, letting an AI agent assemble a full timeline of video, audio, text, subtitles, stickers, effects, and keyframes without ever opening the editor UI. The tool I used is the open-source CapCutAPI (sun-guannan/CapCutAPI), a Python project that exposes the same capability set two ways: a Flask HTTP API (capcut_server.py, ~9 endpoints on port 9000/9001) and an MCP server (mcp_server.py, 11 tools) that plugs straight into an MCP client like Claude. Internally it wraps a fork of the pyJianYingDraft library, which reads and writes CapCut/JianYing's native project file format.

The thing that kept this an experiment is what CapCutAPI actually produces. It does not render a finished MP4. Instead it builds a draft, meaning an editable project (a draft_info.json plus a folder of downloaded media assets) that you drop into CapCut/JianYing's drafts directory and open. The agent can compose the whole edit, but a human still presses Export. The repo can optionally zip the draft and upload it to Aliyun OSS to hand back a download URL, while the actual "turn this draft into a video automatically" capability lives in a closed-source cloud-render module that the maintainer never open-sourced.

That closed module is the blocker. The task framing called it a "WeChat dependency," and once I dug into the code the real picture turned out sharper: there is no WeChat API anywhere in the codebase. WeChat shows up exactly once, as the maintainer's contact handle in the Chinese README. The README says plainly that three pieces (the MCP editing agent, the web editing client, and 云渲染 (cloud rendering)) are not open source, and the only way to get at them is to message the author. WeChat is the human gate to a closed feature rather than a programmatic integration. Automated export was blocked by a capability that isn't in the box, not by an auth wall I could write code against.

APIs & services

Service / API | What it does here
CapCutAPI (sun-guannan/CapCutAPI) | Open-source Python tool that programmatically builds CapCut/JianYing drafts; exposes a Flask HTTP API and an MCP server. The core of this experiment.
pyJianYingDraft | Underlying library (vendored into the repo) that models and serializes CapCut/JianYing's native draft_info.json, tracks, segments, materials, keyframes, plus a Windows UI-automation export controller.
Model Context Protocol (MCP) | Protocol used by mcp_server.py to expose 11 editing tools (create_draft, add_video, add_text, save_draft, ...) over stdio JSON-RPC to an AI client.
CapCut / JianYing (剪映) desktop app | The target editor. The generated draft folder is copied into its drafts directory; the final MP4 export happens here (manually, or via the orphaned UI-automation controller).
FFmpeg / ffprobe | Called by save_draft to probe real width/height/duration of each downloaded media file and fix segment timeranges before writing the draft.
Alibaba Cloud OSS (oss2 SDK) | Optional draft hosting: when is_upload_draft=true, the zipped draft is uploaded and a 24h pre-signed URL is returned. Credentials are config-only and were not populated here.
Flask | HTTP server framework for capcut_server.py (the REST interface mirroring the MCP tools).
Canva Connect MCP (claude.ai connector) | Available-but-unused. Surfaces only as a deferred MCP connector (authenticate/complete_authentication, OAuth-gated). It is NOT wired into any pipeline, see gotchas for the truth.

How it's built, step by step

  1. INSTALL: clone CapCutAPI and run pip install -r requirements.txt (plus requirements-mcp.txt if you want MCP). Copy config.json.example to config.json, then set is_capcut_env (true for CapCut International, false for JianYing China, which swaps every effect/transition/animation enum table), set port, and leave is_upload_draft:false for the local-only flow.
  2. START: run python capcut_server.py for the HTTP API, or python mcp_server.py for the stdio MCP server. Register the MCP server in the client's mcp_config.json with command: python, args: [mcp_server.py], and cwd/PYTHONPATH pointed at the repo.
  3. CREATE: call create_draft(width, height). This builds an in-memory Script_file and registers it in a process-global DRAFT_CACHE under a generated id like dfd_cat_<unixtime>_<uuid8>. Nothing is written to disk yet.
  4. COMPOSE: call add_video / add_audio / add_image / add_text / add_subtitle / add_effect / add_sticker / add_video_keyframe, each passing the same draft_id so they mutate the cached script. Each media call records a remote_url (the source) and a content-hashed material_name (e.g. video_<sha256[:16]>.mp4). The bytes are NOT downloaded yet. Times are given in seconds and converted to microseconds internally.
  5. SAVE: call save_draft(draft_id, draft_folder). The server (a) duplicates the bundled template/template_jianying folder, (b) runs ffprobe/imageio over every remote_url to fill real dimensions/durations and repair overlapping segments + pending keyframes, (c) downloads all assets concurrently (ThreadPoolExecutor, 16 workers) into dfd_cat_*/assets/{audio,image,video}/, and (d) dumps draft_info.json.
  6. DELIVER (local): the resulting dfd_cat_* folder is copied into the CapCut/JianYing 'drafts' directory. Open the app → the draft appears as a fully editable project.
  7. DELIVER (hosted, optional): if is_upload_draft:true, the draft is zipped and pushed to Aliyun OSS; save_draft returns a 24h signed draft_url. The MCP path instead returns a URL into the maintainer's hosted downloader service (install-ai-guider.top/draft/downloader).
  8. EXPORT (the wall): to get an MP4 you either (a) press Export by hand in the desktop app, (b) use the unwired Jianying_controller.export_draft() UI-automation path (Windows-only, JianYing 6-and-below, VIP-gated), or (c) use the closed-source cloud-render module, which is not in the repo and is only reachable by contacting the maintainer via WeChat. This step is where automation stops.

Under the hood

The draft JSON model (what actually gets written)

A CapCut/JianYing project is one draft_info.json. pyJianYingDraft's Script_file loads a template, mutates it, and serializes it via dumps(). Here's the real top-level shape (from script_file.py):

def dumps(self) -> str:
    self.content["fps"]           = self.fps
    self.content["duration"]      = self.duration            # microseconds
    self.content["canvas_config"] = {"width": self.width, "height": self.height, "ratio": "original"}
    self.content["materials"]     = self.materials.export_json()
    self.content["platform"]      = {"app_id": 359289, "app_source": "cc", ...}  # "cc" = CapCut
    # imported materials/tracks merged, tracks sorted by render_index
    self.content["tracks"] = [t.export_json() for t in track_list]
    return json.dumps(self.content, ensure_ascii=False, indent=4)

materials is a bag of typed lists (notice how many empty arrays the format demands):

result = {
    "audios":   [...], "videos": [...],        # videos also hold material_type=="photo" images
    "texts":    [...], "stickers": [...],
    "effects":  [...],                          # filters + text bubbles/flowers
    "masks":    [...], "transitions": [...],
    "speeds":   [...], "canvases": [...],       # canvases = background fill/blur
    "audio_effects": [...], "audio_fades": [...], "animations": [...], "video_effects": [...],
    "ai_translates": [], "beats": [], "chromas": [], ...   # required-but-empty keys
}

The key invariant: everything is microseconds. A segment carries a source_timerange (where it sits in the source clip) and a target_timerange (where it sits on the timeline), and the two diverge by speed:

source_duration = video_end - start
target_duration = source_duration / speed
source_timerange = trange(f"{start}s", f"{source_duration}s")
target_timerange = trange(f"{target_start}s", f"{target_duration}s")
video_segment = draft.Video_segment(video_material,
    target_timerange=target_timerange, source_timerange=source_timerange,
    speed=speed, clip_settings=Clip_settings(transform_x, transform_y, scale_x, scale_y),
    volume=volume)
script.add_segment(video_segment, track_name=track_name)
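The speed arithmetic is worth running once with real numbers. This sketch mirrors the two duration formulas above and converts to microseconds; the `timeranges` helper and its `start`/`duration` dict keys are illustrative assumptions, not the library's own objects.

```python
SEC = 1_000_000  # microseconds per second

def timeranges(start, end, speed, target_start):
    """All inputs in seconds. Returns (source_timerange, target_timerange) in microseconds."""
    source_duration = end - start
    target_duration = source_duration / speed        # faster playback = shorter on the timeline
    us = lambda x: int(round(x * SEC))
    source = {"start": us(start), "duration": us(source_duration)}
    target = {"start": us(target_start), "duration": us(target_duration)}
    return source, target

# 4 seconds of source played at 2x, placed 10 seconds into the timeline
print(timeranges(2.0, 6.0, 2.0, 10.0))
```

At 2x, four seconds of source occupy two seconds of timeline, which is the divergence between the two timeranges described above.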

Drafts live in a process-global cache, keyed by a generated id

draft_id = f"dfd_cat_{int(time.time())}_{uuid.uuid4().hex[:8]}"
script = draft.Script_file(width, height)
update_cache(draft_id, script)        # DRAFT_CACHE[draft_id] = script

Every subsequent add_* call does get_or_create_draft(draft_id) to fetch the same in-memory script. This is why a draft only hits disk at save_draft time, and why the server is stateful (restart = lose all in-flight drafts).
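A stripped-down model of that cache shows why a restart loses work. A plain dict stands in for Script_file here, and how the real get_or_create_draft treats an unknown id isn't shown in this excerpt, so the "create an empty one" behaviour below is an assumption.

```python
import time
import uuid

DRAFT_CACHE = {}   # process-global: lives only as long as the server process

def new_script(width, height):
    return {"width": width, "height": height, "segments": []}   # stand-in for Script_file

def create_draft(width, height):
    draft_id = f"dfd_cat_{int(time.time())}_{uuid.uuid4().hex[:8]}"
    DRAFT_CACHE[draft_id] = new_script(width, height)
    return draft_id

def get_or_create_draft(draft_id, width=1080, height=1920):
    if draft_id not in DRAFT_CACHE:
        DRAFT_CACHE[draft_id] = new_script(width, height)       # assumed behaviour
    return DRAFT_CACHE[draft_id]

draft_id = create_draft(1080, 1920)
get_or_create_draft(draft_id)["segments"].append("intro.mp4")
DRAFT_CACHE.clear()                                              # simulate a server restart
print(get_or_create_draft(draft_id)["segments"])                 # the earlier segment is gone
```

For an agent-driven workflow this means every add_* call and the final save_draft need to reach the same live server process.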

save_draft: metadata repair + concurrent download, then optional upload

save_draft_background is the workhorse. It duplicates a template, probes media, downloads everything, and dumps the JSON. The media probe uses ffprobe and back-patches durations into the segments (materials are created with duration=0, width=0, height=0 and get filled in here):

command = ['ffprobe','-v','error','-select_streams','v:0',
           '-show_entries','stream=width,height,duration',
           '-show_entries','format=duration','-of','json', remote_url]
# ... video.width/height/duration set; segment source/target timeranges clamped to real duration

Downloads are parallel:

with ThreadPoolExecutor(max_workers=16) as executor:
    future_to_task = {executor.submit(t['func'], *t['args']): t for t in download_tasks}
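
The excerpt stops at the submit step. A plausible way to collect the results, sketched under the assumption that each task is a dict with 'func' and 'args' as above, is to drain the futures with as_completed and keep failures separate so one bad URL doesn't sink the whole save. The function name run_download_tasks is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_download_tasks(download_tasks, max_workers=16):
    """Run each {'func', 'args'} task in a thread pool.

    Returns (done, failed): done holds (task, result) pairs, failed holds
    (task, exception) pairs.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_task = {
            executor.submit(t["func"], *t["args"]): t for t in download_tasks
        }
        for future in as_completed(future_to_task):
            task = future_to_task[future]
            try:
                done.append((task, future.result()))
            except Exception as exc:  # a failed download should not kill the batch
                failed.append((task, exc))
    return done, failed
```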

The upload branch is gated and was left OFF (is_upload_draft:false):

if IS_UPLOAD_DRAFT:
    zip_path  = zip_draft(draft_id)            # shutil.make_archive
    draft_url = upload_to_oss(zip_path)        # oss2, returns 24h sign_url('GET', ...)

OSS credentials are read from config.json into OSS_CONFIG/MP4_OSS_CONFIG and were never populated; they're placeholders only (access_key_id: <YOUR_OSS_KEY>, etc.).

The MCP surface

mcp_server.py is a hand-rolled JSON-RPC loop over stdin/stdout (no SDK). It declares 11 tools, captures stdout so debug prints don't corrupt the JSON stream, and dispatches tools/call to the same impl functions the Flask server uses. Its create_draft and save_draft hand back URLs into the maintainer's hosted service:

result = {"draft_id": str(draft_id),
          "draft_url": f"https://www.install-ai-guider.top/draft/downloader?draft_id={draft_id}"}
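
For readers who haven't seen a hand-rolled stdio server, this is roughly what the loop looks like. It is a simplified sketch, not the repo's code: the TOOLS table is hypothetical, and the result shape is trimmed down from what a full MCP server returns.

```python
import contextlib
import io
import json
import sys

# Hypothetical tool table; the real server maps tool names to its impl functions.
TOOLS = {"echo": lambda arguments: {"echo": arguments}}


def _error(req_id, code, message):
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle_request(req):
    """Turn one decoded JSON-RPC request into a response dict."""
    if not isinstance(req, dict):
        return _error(None, -32600, "invalid request")
    req_id = req.get("id")
    method = req.get("method")
    if method == "tools/list":
        return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": sorted(TOOLS)}}
    if method == "tools/call":
        params = req.get("params") or {}
        impl = TOOLS.get(params.get("name"))
        if impl is None:
            return _error(req_id, -32602, "unknown tool")
        # Swallow anything the tool prints so it never reaches the JSON stream.
        with contextlib.redirect_stdout(io.StringIO()):
            result = impl(params.get("arguments") or {})
        return {"jsonrpc": "2.0", "id": req_id, "result": result}
    return _error(req_id, -32601, "method not found")


def serve(stdin=None, stdout=None):
    """Read one JSON request per line and write one JSON response per line."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            response = handle_request(json.loads(line))
        except json.JSONDecodeError:
            response = _error(None, -32700, "parse error")
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()
```

The redirect around the tool call is the important part: one stray print inside a tool would otherwise land in the middle of the protocol stream and break the client's parser.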

The export wall (why automation stops)

There IS a real desktop-export automation. pyJianYingDraft/jianying_controller.py has Jianying_controller.export_draft(), which drives the JianYing window via uiautomation (clicks the draft, the export button, sets resolution/framerate enums, polls the % progress text, moves the output file). Three facts make it a dead end for this pipeline:

  1. It is orphaned: no HTTP or MCP endpoint calls it, and it ships as library dead code.
  2. Its own docstring limits it: 目前仅支持剪映6及以下版本 (JianYing v6 and below only) and 需要确认有导出草稿的权限(不使用VIP功能或已开通VIP), 否则可能陷入死循环 (must have export rights / VIP or it can infinite-loop).
  3. It targets the China JianYing build and assumes the app is already open at the home page.

So the only automated render path is the closed-source cloud-render module, which the README says was never released. The hosted endpoints the code points at (install-ai-guider.top/draft/downloader and a query-script Alibaba Function-Compute URL, ...fcapp.run/query_script) are the maintainer's own servers, and the example client even sends a trial license_key (redacted here as <LICENSE_KEY>) on every request, which confirms the hosted flow is license-gated. Contact for that closed capability is the maintainer's WeChat handle published in the Chinese README. *(Inference, not proven by code: because JianYing China login is account-based via WeChat/Douyin, even the manual desktop-export fallback effectively rides on that same account auth, though the repo contains no login automation, so treat this as context rather than a code fact.)*

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Set up the open-source CapCutAPI repo as an MCP server I can call from Claude. Install the Python deps, copy config.json.example to config.json with is_capcut_env=true and is_upload_draft=false, start mcp_server.py, and register it in my MCP client config so I get the create_draft / add_video / add_text / save_draft tools.

prompt

Using the CapCutAPI HTTP endpoints, write a script that builds a 1080x1920 vertical draft: create_draft, add a background video from a URL with a fade-in transition, overlay a styled title with a drop shadow for the first 5 seconds, add an SRT subtitle track, then save_draft into my CapCut drafts folder so I can open it and review.

prompt

Investigate why I can build a CapCut draft programmatically but can't export an MP4 from the API. Read save_draft_impl.py, mcp_server.py, settings/local.py, and jianying_controller.py, and tell me exactly which part of the render/export path is open source vs. closed, and what the draft_url actually points to.

prompt

Audit this repo for whether it can fully automate video export end-to-end. Confirm whether the export automation is reachable from any HTTP/MCP endpoint, what version and licensing constraints it has, and whether any cloud-render service is included or has to be obtained from the maintainer.

08

The Auto-Poster: One Render, Three Platforms (YouTube + Instagram Reels + TikTok)

This is the upload-and-schedule layer underneath every short-form factory in the workspace. A render finishes as <slug>_shorts.mp4 plus a sibling <slug>.json metadata sidecar ({snippet:{title,description,tags,categoryId}, status:{privacyStatus,selfDeclaredMadeForKids}}). From that single pair I fan the clip out to YouTube Shorts, Instagram Reels, and TikTok, each through its own platform API and OAuth slot, with the caption reshaped to fit. I wrote it all as stdlib-only Python and dependency-free Node so it runs on a bare WSL box, and every secret lives in one ~/.nemoclaw_env file that gets parsed but never committed.

The main piece is ~/scripts/yt-shorts-upload.py, a YouTube Data API v3 uploader that doubles as the OAuth re-auth helper. The channel posters (kickclips daily, the baseball dual-channel dispatcher, the UFO channel) never re-implement upload. They stage a <slug>_shorts.mp4 + <slug>.json and shell out to the shared uploader with --account <slot>, so dedup, the WSL-blackhole retry loop, the frozen-tail guard, and thumbnail logic all live in one place. Posting runs on cron with flock guards so a hung network call can never wedge the next slot, and a two-pass adversarial vision gate vets the riskiest (creator) clips before any of them reach a queue.

The theme I kept hitting was defensive networking. WSL's outbound TLS to Google/Meta blackholes at random, so every external call is wrapped in a bounded timeout plus a re-dial retry. And because a multipart upload is not idempotent, every retry first re-checks the channel by title, so a stalled *response* (where the upload actually landed server-side) never double-posts.

APIs & services

Service / API | What it does here
YouTube Data API v3 | Primary upload target. videos.insert via multipart/related POST (uploadType=multipart, part=snippet,status); search.list (forMine) for same-title dedup; thumbnails.set, playlists.insert, playlistItems.insert, channels.list (verify auth), videos.delete (take down a bad cut).
Google OAuth 2.0 (installed/desktop app) | Auth for YouTube. Authorization-code flow with access_type=offline + prompt=consent to mint a refresh token; refresh_token grant for each upload. Loopback-server flow and a WSL-safe manual paste/exchange flow.
Instagram Graph API, Content Publishing (Reels) | Cross-post to Reels. Resumable upload: create REELS container (upload_type=resumable) -> POST raw bytes to rupload.facebook.com -> media_publish. Uses a long-lived FB Page token (FB_PAGE_TOKEN) + IG_USER_ID.
TikTok Content Posting API | Cross-post to TikTok feed via PULL_FROM_URL (TikTok fetches a public MP4 URL), then poll publish/status until PUBLISH_COMPLETE. OAuth 2.0 with PKCE (S256), refresh-token grant, scopes user.info.basic,video.upload.
ffmpeg / ffprobe | Pre-upload media sanity: ffprobe compares audio vs video stream duration; ffmpeg hard-cuts a frozen audio-only tail (stream copy + -t <video_dur>). Also extracts frames for vision-gate contact sheets.
Discord REST API (channels/messages) | Cron job result notifications to an #errors channel after each IG post (success/failure with remaining-queue count).

How it's built, step by step

  1. Render step (upstream factory) drops <slug>_shorts.mp4 and a sibling <slug>.json metadata file into a per-channel directory. The JSON is the universal contract: {snippet:{title,description,tags,categoryId}, status:{privacyStatus,selfDeclaredMadeForKids}}.
  2. One-time per channel: create an OAuth slot. Run python3 yt-shorts-upload.py --auth --account <slot> (loopback) or the WSL-safe --authurl then --exchange '<pasted redirect URL>'. The refresh token is written to ~/.nemoclaw_env under a per-account key (default YOUTUBE_OAUTH_REFRESH_TOKEN, or that name suffixed with the upper-cased account: _UFO, _SPORTSSTATS, _KICKCLIPS). client_id/secret are shared across all channels; only the refresh token differs.
  3. For creator (face-on-camera) content only: build contact-sheet montages (ffmpeg extracts ~37 frames/clip, tiled into one grid PNG per clip; map.json records slug->sheet). A vision model judges every sheet against a strict boolean schema (pass 1 -> gate_result.json), then adversarially re-checks only the clips it marked safe (pass 2 -> adv_result.json). Only survivors are written into the posting queue.
  4. Stage + upload to YouTube: poster copies the approved MP4 to <slug>_shorts.mp4, writes <slug>.json, then calls yt-shorts-upload.py --upload <slug> --dir <staging> --account <slot>. The uploader runs the frozen-tail guard, a same-title dedup search, then the multipart POST with timeout+retry+adopt-on-stall.
  5. Cross-post the same MP4: IG via post-to-ig-direct.js --video ... --metadata ... (resumable to rupload.facebook.com, then publish-with-retry) and TikTok via tt-shorts-upload.py --upload <slug> --url <public_mp4_url> (PULL_FROM_URL + status poll). Captions are reshaped from the same sidecar (title + intro + a capped hashtag stack).
  6. Schedule via cron, each wrapped in a flock so overlapping runs no-op: baseball dual-channel dispatcher every 4h to @sportsstats (0 */4) and offset 2h to SportsTwo (0 2,6,10,14,18,22) -> combined ~every 2h alternating channels; kickclips daily-ish (0 */6). The dispatcher picks the lane opposite the last post and shares one playId/title ledger across both channels so a clip is never posted to both.
  7. Record + dedup: on a confirmed post, append the slug/title to an append-only ledger (posted.txt) and the cross-channel titles.txt/templates.txt. The next pick reads these immediately (the YouTube search index lags minutes), so back-to-back slots never repeat a title or phrasing.
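
Step 5's caption reshaping can be sketched like this. It is only an illustration of "title + intro + a capped hashtag stack": the function name, the five-tag cap, the choice of the description's first line as the intro, and the 2200-character ceiling are all my assumptions, not values taken from the real posters.

```python
def reshape_caption(meta, max_tags=5, max_len=2200):
    """Build a cross-post caption from the shared sidecar.

    Uses the title, the first line of the description as the intro, and up to
    max_tags hashtags made from snippet.tags (deduplicated case-insensitively).
    """
    snippet = meta["snippet"]
    title = snippet.get("title", "").strip()
    description = snippet.get("description", "").strip()
    intro = description.split("\n", 1)[0].strip() if description else ""

    hashtags, seen = [], set()
    for tag in snippet.get("tags", []):
        word = "".join(ch for ch in tag if ch.isalnum())
        if not word or word.lower() in seen:
            continue
        seen.add(word.lower())
        hashtags.append("#" + word)
        if len(hashtags) == max_tags:
            break

    parts = [p for p in (title, intro, " ".join(hashtags)) if p]
    return "\n\n".join(parts)[:max_len]
```

Because every platform reads from the same sidecar, a fix to the title or tags in the JSON shows up in all three captions on the next run.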

Under the hood

1. YouTube OAuth, the setup how-to (and the 7-day trap)

The OAuth client is a Desktop/installed app in a Google Cloud project, shared across every channel; only the per-channel *refresh token* differs. The authorize step asks for offline access + forced consent so Google returns a refresh token:

SCOPES = ("https://www.googleapis.com/auth/youtube "
          "https://www.googleapis.com/auth/youtube.upload "
          "https://www.googleapis.com/auth/youtube.readonly")

auth_url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode({
    "client_id": cid,
    "redirect_uri": redirect,          # http://localhost:<port>/callback
    "response_type": "code",
    "scope": SCOPES,
    "access_type": "offline",          # <- required for a refresh_token
    "prompt": "consent",               # <- force it even on re-auth
})

--auth spins a loopback HTTP server on a random free port, prints the URL, and captures ?code=... on the callback. It then exchanges the code at https://oauth2.googleapis.com/token (grant_type=authorization_code) and verifies the channel via channels?part=snippet&mine=true. Per-account token key:

def token_key(account=None):
    if not account or account.lower() in ("default", "sportstwo"):
        return "YOUTUBE_OAUTH_REFRESH_TOKEN"
    return f"YOUTUBE_OAUTH_REFRESH_TOKEN_{account.upper()}"   # e.g. _UFO, _KICKCLIPS

The trap (teachable): while the OAuth app's publishing status is Testing, refresh tokens silently expire after 7 days. Symptom: one channel uploads fine while another dies with 400 invalid_grant "Token has been expired or revoked". Diagnose by hitting the token endpoint with each channel's refresh token plus the *shared* client_id/secret. If one channel refreshes OK, the client is alive and only that token is dead. Re-auth may itself be blocked by 401 invalid_client "OAuth client was not found" even though the client shows in the Console. The fix for both is the same one-time action: Google Auth Platform -> Audience -> Publish App (Testing -> Production). Production removes the 7-day expiry permanently and clears the authorize block; YouTube scopes are "sensitive", so an *unverified* Production app still works (you click past the "Google hasn't verified this app" warning). Do not recreate the client first. A new client inherits the same consent screen and fails identically until you publish.

WSL-safe manual exchange (loopback callbacks don't forward from a Windows browser): --authurl prints the URL against a fixed http://localhost:8080/callback, you sign in, the browser fails to load localhost (fine), you copy the address-bar URL and feed it to --exchange, which parses the code= out. The redirect_uri in the exchange must exactly match the one in the authorize URL, and the code has a ~10-minute TTL (invalid_grant on exchange = expired or reused code).
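
The parsing half of --exchange is small enough to sketch. This is an illustration rather than the uploader's actual code; extract_auth_code is a name I made up, and the redirect default simply mirrors the fixed URL mentioned above.

```python
from urllib.parse import parse_qs, urlparse


def extract_auth_code(pasted_url, expected_redirect="http://localhost:8080/callback"):
    """Pull the one-time code= value out of the URL copied from the address bar."""
    parsed = urlparse(pasted_url.strip())
    base = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    if base != expected_redirect:
        # The token exchange would be rejected anyway if the redirect differs.
        raise ValueError(f"unexpected redirect: {base}")
    params = parse_qs(parsed.query)
    if "error" in params:
        raise ValueError(f"authorization failed: {params['error'][0]}")
    codes = params.get("code")
    if not codes:
        raise ValueError("no code= parameter in the pasted URL")
    return codes[0]  # parse_qs has already percent-decoded it
```

Checking the redirect up front gives a clearer error than waiting for the token endpoint to reject a mismatch.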

2. The YouTube upload, multipart, not resumable

YouTube uses a single multipart/related POST (metadata part + raw MP4 part), hand-rolled with urllib so there is zero SDK dependency:

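boundary = "yt-shorts-boundary-mlb"
# sep/end were not shown in the original excerpt; this is the standard multipart/related framing
sep = f"--{boundary}\r\n".encode()          # opens each part
end = f"\r\n--{boundary}--\r\n".encode()    # closes the body
json_part = ("Content-Type: application/json; charset=UTF-8\r\n\r\n"
             + json.dumps({"snippet": meta["snippet"], "status": meta["status"]}) + "\r\n").encode()
mp4_hdr = b"Content-Type: video/mp4\r\n\r\n"
body = sep + json_part + sep + mp4_hdr + video_path.read_bytes() + end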

url = "https://www.googleapis.com/upload/youtube/v3/videos?uploadType=multipart&part=snippet,status"
req = urllib.request.Request(url, data=body, method="POST", headers={
    "Authorization": f"Bearer {access}",
    "Content-Type": f"multipart/related; boundary={boundary}",
    "Content-Length": str(len(body)),
})

3. Defensive networking, flock-hang fix + non-idempotent retry

WSL -> Google edge IPs intermittently blackhole the TLS handshake. A naive blocking POST then hangs forever and holds the cron flock, killing every later slot. The fix is a bounded timeout plus a re-dial retry. But because a multipart upload is not idempotent, a stalled *response* may mean the upload actually landed, so each retry first re-checks the channel by title and adopts the existing video id instead of uploading a second copy:

for i in range(1, attempts + 1):
    try:
        with urllib.request.urlopen(req, timeout=180) as r:
            resp = json.load(r); break
    except (urllib.error.URLError, socket.timeout, TimeoutError) as e:
        landed = find_dup_by_title(title, access)      # did the prior attempt actually post?
        if landed:
            vid = landed; break                        # adopt — do NOT re-upload
        if i < attempts:
            time.sleep(5)                              # re-dial: usually a healthier IP
        else:
            sys.exit("abort so the flock releases and the next run can retry")

Token refresh gets the same treatment (timeout=60, 3 tries) so a blackholed token endpoint can't hold the lock either. The cron wrappers take a non-blocking flock and no-op if a previous run is still active:

exec 9>"$LOCK"; flock -n 9 || exit 0
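
If you'd rather take the lock inside the Python script than in the shell wrapper, the same non-blocking behaviour is available through fcntl. This is a sketch of the equivalent, assuming a Linux/WSL host; try_lock is a hypothetical helper, not something the uploader ships.

```python
import fcntl
import os


def try_lock(lock_path):
    """Take an exclusive, non-blocking flock on lock_path.

    Returns the open file descriptor on success (keep it open for the life of
    the run), or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron entry point would call try_lock first and exit 0 when it returns None, which matches the `|| exit 0` in the shell one-liner. The kernel drops the lock when the process exits, so a crashed run never leaves a stale lock behind.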

Note these are three *distinct* fallback strategies: (a) the in-WSL timeout+retry above is the only automatic one; (b) a separate manual technique runs the API call from Windows PowerShell (Invoke-RestMethod over TLS 1.2 writing into \\wsl.localhost\...) when WSL's TLS is fully blackholed; (c) yt-mobile-upload.sh pushes the MP4 to a phone over ADB and drives the YouTube app's UI to reach the *trending-sound* picker (an algo boost only available in-app).

4. Frozen-tail guard (ffprobe/ffmpeg)

If audio runs meaningfully longer than video, the player freezes on the last frame, a real defect that shipped on ~111 reposts once. Before the bytes are read into the POST, compare stream durations and hard-cut. -c copy -shortest is unreliable here, so an explicit -t <video_dur> ceiling is used; the guard fails open (any probe/encode hiccup returns the original) so tooling trouble never blocks a legit upload:

v = _stream_dur(path, "v:0"); a = _stream_dur(path, "a:0")
if a - v > 0.3:
    subprocess.run([FFMPEG, "-y", "-i", path, "-t", f"{v:.3f}",
                    "-map", "0:v:0", "-map", "0:a:0", "-c", "copy",
                    "-movflags", "+faststart", fixed])
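
The excerpt leans on a _stream_dur helper that isn't shown. Here is one way it could look, along with the fail-open decision pulled out into a pure function. Both are sketches under my own naming (stream_dur, tail_to_cut); the real helper may differ.

```python
import json
import subprocess


def stream_dur(path, selector, ffprobe="ffprobe"):
    """Return one stream's duration in seconds, or None if it can't be probed."""
    cmd = [ffprobe, "-v", "error", "-select_streams", selector,
           "-show_entries", "stream=duration", "-of", "json", str(path)]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=30, check=True).stdout
        return float(json.loads(out)["streams"][0]["duration"])
    except (OSError, subprocess.SubprocessError,
            ValueError, KeyError, IndexError, TypeError):
        return None  # fail open: the caller uploads the original file


def tail_to_cut(video_dur, audio_dur, tolerance=0.3):
    """Return the -t ceiling to trim to, or None when no trim is needed."""
    if video_dur is None or audio_dur is None:
        return None
    return video_dur if audio_dur - video_dur > tolerance else None
```

Keeping the comparison separate from the probing makes the fail-open rule easy to see: any None on the way in means "leave the file alone".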

5. Dual-channel round-robin + shared ledger

post_sportsstats_4h.py posts to @sportsstats; the SportsTwo cron runs the *same* dispatcher with --account sportstwo, offset 2h. The lane is chosen as the opposite of the last post (read straight from the shared ledger, so it self-corrects and needs no extra state), with a fallback so a slot is never wasted:

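last = last_lane()                       # reads tail of posted.txt
chosen = "assist" if last == "nasty" else "nasty"
other = "nasty" if chosen == "assist" else "assist"   # fallback lane if the chosen one has nothing to post
for lane in (chosen, other):
    if run(lane, dry, account): return 0  # posted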

Both channels share one posted.txt keyed by playId/per-short hash, so a clip posted to either channel is never posted to the other. A second ledger, title_ledger.py, prevents duplicate *titles/phrasings* cross-channel and is written the instant a post succeeds (YouTube's search-based dedup lags by minutes). Titles are normalized (lowercase, strip emoji/punctuation) so a different trailing emoji isn't treated as a different title, and the last N template-keys are avoided so "X Made Y Look Silly" can't fire twice in a row.
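
The normalization step is the part that's easy to get subtly wrong, so here is a minimal sketch of it. The function names are mine, and the exact rules in title_ledger.py may be stricter or looser than this.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase a title and drop emoji/punctuation so near-duplicates compare equal."""
    text = unicodedata.normalize("NFKC", title).lower()
    # Anything that is not a letter, digit or whitespace becomes a space.
    kept = [ch if (ch.isalnum() or ch.isspace()) else " " for ch in text]
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def is_repeat(title, ledger_titles):
    """True if the normalized title already appears in the ledger."""
    seen = {normalize_title(t) for t in ledger_titles}
    return normalize_title(title) in seen
```

With this in place, a title that differs only by a trailing emoji or extra exclamation marks is caught as a repeat.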

6. Cross-post: IG resumable + TikTok PULL_FROM_URL

IG Reels uses the resumable path straight to Meta (no Drive dependency in the current post-to-ig-direct.js): create container -> POST bytes to rupload.facebook.com -> publish. The working header set is finicky; extra or upper-cased headers trigger ProcessingFailedError:

// 1) create container
graphPost(`/v21.0/${IG_USER_ID}/media`, {media_type:"REELS", upload_type:"resumable", caption, share_to_feed:"true"})
// 2) raw bytes to the returned uri (rupload.facebook.com)
headers = {Authorization:`OAuth ${TOKEN}`, "Accept":"*/*", "offset":"0", "file_size":String(size), ...}
// 3) publish (this page token can PUBLISH but cannot READ the container status,
//    so skip the status poll and retry media_publish while IG finishes processing)
graphPost(`/v21.0/${IG_USER_ID}/media_publish`, {creation_id: containerId})

The duration guard skips cleanly (exit 0, not a failure) when a source exceeds the IG Reels API cap (~60s on this account). TikTok uses PULL_FROM_URL (TikTok pulls a public MP4 URL), then polls publish/status/fetch until PUBLISH_COMPLETE; its OAuth uses PKCE (96-char verifier, S256 challenge).
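
For the PKCE piece, this is the generic RFC 7636 recipe: a random verifier from the unreserved character set and a base64url-encoded SHA-256 challenge. Treat it as a sketch of the standard form. Some providers document their own encoding for the challenge, so check TikTok's current docs before relying on it; the function names are mine.

```python
import base64
import hashlib
import secrets
import string

# RFC 7636 "unreserved" characters allowed in a code verifier.
ALPHABET = string.ascii_letters + string.digits + "-._~"


def make_verifier(length=96):
    """Return a random code verifier (RFC 7636 allows 43 to 128 characters)."""
    if not 43 <= length <= 128:
        raise ValueError("verifier length must be between 43 and 128")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def s256_challenge(verifier):
    """Return the S256 challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The verifier stays on your machine until the token exchange; only the challenge goes into the authorize URL.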

7. Two-pass adversarial vision gate

The unattended cron is public, so creator clips pass a gate before they can enter any queue. It runs as a workflow rather than a standalone binary: ffmpeg extracts ~37 frames/clip tiled into one grid PNG per clip (sheets/map.json maps slug->sheet), then a vision LLM judges each sheet against a strict boolean schema and writes gate_result.json:

{"second_person_in_camera": true, "sexual_movement_or_framing": false,
 "minor": false, "onscreen_profanity_or_slur": false,
 "male_friend_or_white_adidas": true}

The rule that makes it work is "genuine uncertainty resolves against safe": if it can't tell a draped jacket from a second person, it flags unsafe. Pass 2 is the adversarial step. It re-examines *only* the clips pass 1 called safe and tries to refute each one (single-cell crops, brightness/contrast zooms to clear the riskiest frames). On one 35-clip batch, pass 1 rejected 19 outright and pass 2 refuted 4 of the 16 "safe", so 23/35 were killed, and those 4 are exactly the false-safes a single pass would have shipped.
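
The decision logic around the model calls is simple to sketch. The flag names below come from the example schema above; the functions themselves (is_safe, two_pass) are hypothetical, and the model calls that produce the verdict dicts are left out.

```python
FLAGS = (
    "second_person_in_camera",
    "sexual_movement_or_framing",
    "minor",
    "onscreen_profanity_or_slur",
    "male_friend_or_white_adidas",
)


def is_safe(verdict):
    """Safe only if every flag is explicitly False.

    A missing key, None, or any non-boolean answer counts as unsafe, which is
    how "genuine uncertainty resolves against safe" shows up in code.
    """
    return all(verdict.get(flag) is False for flag in FLAGS)


def two_pass(pass1, pass2):
    """Return slugs that pass 1 called safe AND pass 2 failed to refute."""
    survivors = []
    for slug, verdict in pass1.items():
        if not is_safe(verdict):
            continue  # rejected outright; pass 2 never looks at it
        second = pass2.get(slug)
        if second is not None and is_safe(second):
            survivors.append(slug)
    return survivors
```

Note that a clip with no pass-2 verdict at all is dropped too, so a crashed second pass can't accidentally wave everything through.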

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a stdlib-only Python uploader for YouTube Shorts via Data API v3. It should read client_id/secret/refresh-token from ~/.nemoclaw_env, support per-channel token slots (default + named accounts), upload an MP4 with its JSON metadata sidecar via a hand-rolled multipart/related POST, and include an --auth mode that runs a loopback OAuth flow plus a WSL-safe --authurl/--exchange manual paste flow.

prompt

Make the uploader survive WSL's flaky outbound TLS: wrap every Google call in a bounded timeout with a re-dial retry, and because multipart upload isn't idempotent, before each retry re-search the channel by exact title and adopt the existing video id instead of uploading a duplicate. Wrap the cron job in a non-blocking flock so a hung call can't wedge later slots.

prompt

Write a dual-channel round-robin dispatcher that posts one short every 4h to each of two YouTube channels offset by 2h, alternating two content lanes. Use one shared append-only ledger keyed by clip id so a video is never posted to both channels, plus a separate normalized-title ledger so no title or phrasing repeats back-to-back across channels.

prompt

Add a two-pass adversarial vision safety gate before queuing creator clips: extract ~37 frames per clip into one contact-sheet PNG, have a vision model judge each sheet against a boolean schema (second person, sexual framing, minor, on-screen profanity) with 'genuine uncertainty resolves against safe' (if it can't tell, flag it unsafe), then run a second pass that re-checks only the clips marked safe and tries to refute each one. Write gate_result.json and adv_result.json.

prompt

Add Instagram Reels and TikTok cross-posting that reuse the same MP4 + JSON sidecar. For IG, use the Graph API resumable upload (create REELS container, POST bytes to rupload.facebook.com, then media_publish with retry, skipping the status poll). For TikTok, use the Content Posting API PULL_FROM_URL flow with PKCE OAuth and poll publish status to completion. Reshape captions from the sidecar with a capped hashtag stack.

09

A live-chat AI bot that runs a real browser

Most of this page is about making videos. This one is the odd one out. Instead of building a clip, it sits inside a YouTube live chat and talks to people while the stream is happening, the way a regular viewer would. It also listens to the streamer's own voice through the broadcast audio, so when they say something out loud the bot can react to it in chat.

The awkward part is that YouTube gives you no clean way to post into a live chat from a script. So the bot does the human thing. It runs a real Chrome browser, opens the chat, reads the messages on the screen, types a reply, and clicks send. The actual thinking runs through the Claude command-line tool, which means there is no extra AI bill on top of it.

I built this for a friend's stream. The streamer gave the bot a nickname, and after a while the chat treated it like one of the regulars, which was the whole point.

APIs & services

Service / API | What it does here
Playwright | Drives a real logged-in Chrome so the bot can read the live chat off the page and type replies. There is no API for posting to a YouTube live chat, so the bot uses the page like a person.
Claude Code CLI | The brain. Run in print mode, it reads one chat message plus the context and writes back a single reply. Uses the CLI login, so no API key.
faster-whisper | Turns the stream's audio into text on the GPU so the bot knows what the streamer just said.
yt-dlp | Resolves the live stream's audio so ffmpeg can read it in real time.
ffmpeg | Cuts the live audio into short chunks for transcription, and grabs a video frame so the bot can glance at the screen.
YouTube live chat | Not an API. The actual web page the bot drives like a human, reading messages and clicking send.

How it's built, step by step

  1. Sign in once by hand. A small helper opens a visible Chrome pointed at a fresh profile folder. You log into the YouTube account the bot will speak as, then close the window, and the login stays saved in that folder.
  2. Copy the signed-in profile into a second folder just for the bot, so its browser never fights your other browsers over the profile lock.
  3. Install the browser engine with npm i playwright and npx playwright install chromium, then start the bot pointed at the live video's ID.
  4. Each cycle the bot opens the live-chat popout page, waits for the message box to load, and reads the last twenty-five messages off the page.
  5. It drops messages it has already seen and anything from its own account, then picks the newest message worth answering, preferring one that names the bot or asks a question.
  6. It builds a prompt from its personality, the streamer's current speech, the recent chat, and that one message, and pipes it into the Claude CLI to get back a single line.
  7. It types the reply into the box, clicks send, closes the browser, and waits for the next cycle.
  8. Alongside all of this the audio listener follows the same stream, so the bot always knows what the streamer is talking about.

Under the hood

The browser opens and closes every single cycle

This is the one choice everything else hangs on. A long-running headless Chrome falls over after about thirty seconds on my machine, no matter which flags I throw at it. A browser that opens, does one job, and closes never gets the chance to. So the bot works in short cycles. Every fifteen seconds or so it launches Chrome with the bot's saved login, opens the live-chat page, reads the latest messages, posts at most one reply, and shuts the whole thing down again.

To set up that saved login you run a small helper once that opens a visible Chrome. You sign in as the account you want the bot to speak as, close it, and the cookies stay in a profile folder on disk. A second script copies that folder into a separate one for the bot, so the bot's browser and any other browser you have open never fight over the same lock file.

Reading the chat and picking who to answer

Once the chat page is open, the bot pulls the last twenty-five messages straight out of the page. It keeps a list of message IDs it has already handled so it never answers the same line twice, and it always skips its own account so it cannot talk to itself. Out of the new messages it favours one that names the bot or ends in a question mark, and falls back to the newest otherwise. Idle chatter is rate-limited, but a direct question is allowed to jump that gap so the bot stays responsive.
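
Here is the selection rule as a small sketch. The real bot may well do this in Node next to Playwright; this version only shows the logic, with an assumed message shape ({"id", "author", "text"}, oldest first) and a made-up function name. The rate limiting is left out.

```python
def pick_message(messages, seen_ids, bot_author, bot_names):
    """Pick the one message to answer this cycle, or None if nothing is new.

    messages: list of {"id", "author", "text"} dicts, oldest first.
    """
    fresh = [
        m for m in messages
        if m["id"] not in seen_ids and m["author"] != bot_author
    ]
    if not fresh:
        return None

    def is_priority(message):
        text = message["text"].lower()
        names_bot = any(name.lower() in text for name in bot_names)
        return names_bot or text.rstrip().endswith("?")

    # Newest priority message wins; otherwise fall back to the newest message.
    for message in reversed(fresh):
        if is_priority(message):
            return message
    return fresh[-1]
```

After replying, the caller adds the chosen id to seen_ids so the same line is never answered twice.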

The prompt and the reply

For the chosen message the bot writes a single prompt: a short description of its personality, whatever the streamer is saying right now from the audio side, the last few chat lines for context, and the one message to answer. It pipes that into the Claude CLI in print mode and reads back one line. The personality tells it to keep replies short and casual, and to answer with the word (skip) when a message is not worth a reply, like a dropped link or a lone emoji. The stronger model sometimes writes its reasoning first, so a small cleanup step keeps only the last real line before it gets typed.
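
The cleanup step might look something like this. It is a sketch of "keep only the last real line", with the (skip) convention from above; the function name and the 200-character cap are my assumptions.

```python
def clean_reply(raw, max_len=200):
    """Reduce raw model output to a single chat line, or None to stay silent."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return None
    reply = lines[-1].strip('"')  # models sometimes wrap the line in quotes
    if not reply or reply.lower() == "(skip)":
        return None
    return reply[:max_len]
```

Returning None for both an empty answer and (skip) means the caller has one simple check before it types anything into the chat box.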

Listening to the stream's voice

The audio half runs as its own program. yt-dlp resolves the live stream's audio, ffmpeg reads it live and chops it into fifteen-second pieces, and faster-whisper turns each piece into text on the GPU. The running transcript goes into a small file that the chat bot reads for context. When the transcription suggests the streamer is talking to the bot directly, the audio side drops a tiny trigger file, and the next chat cycle picks it up and answers. It can also grab one video frame every twenty-five seconds, so the bot can look at what is on screen before it replies.
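
Transcription tends to garble a nickname, so the "is the streamer talking to the bot" check needs to be fuzzy. One simple way to do that, sketched with the standard library's difflib; the name "robo" and the 0.75 cutoff are placeholders, not the real values.

```python
import difflib
import re


def mentions_bot(transcript, bot_name, cutoff=0.75):
    """True if any word in the transcript is a close match for the bot's name."""
    name = bot_name.lower()
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    return any(
        difflib.SequenceMatcher(None, word, name).ratio() >= cutoff
        for word in words
    )
```

When this returns True for a fresh chunk, the listener would write the trigger file for the next chat cycle to pick up. A lower cutoff catches more mangled spellings at the cost of more false triggers.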

One-time setup

npm i playwright
npx playwright install chromium
pip install yt-dlp faster-whisper

Gotchas & hard-won lessons

Prompts to build it yourself

The actual kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

I want a bot that sits in a YouTube live chat and replies to viewers in real time. There is no API for posting to live chat, so use Playwright to drive a real logged-in Chrome: open the live-chat popout, read the recent messages off the page, type a reply, and click send. Use the Claude CLI in print mode as the brain so there is no API key to manage. Open a fresh browser each cycle instead of keeping one open, because a long-lived headless browser is unstable. Skip the bot's own messages, never answer the same line twice, and prefer messages that ask a question. First walk me through signing the bot's account in once with a saved browser profile.

prompt

Now add a listener for the same stream's audio. Use yt-dlp to pull the live audio, ffmpeg to cut it into short chunks, and faster-whisper on the GPU to transcribe each chunk into a rolling transcript file. When the streamer addresses the bot out loud, write a small trigger file that the chat bot reads so it answers in chat. Drop whisper's silence hallucinations, and match the bot's name loosely since transcription garbles it.
