I Spent 200+ Hours Editing Videos Last Year
Every content creator faces the same problem: record once, then edit the same footage over and over. You need a TikTok short (9:16, 60 seconds, karaoke captions), a LinkedIn version (16:9, burned-in subtitles), a blog post for SEO, and social posts for five platforms. Every asset needs scheduling, previewing, and tracking.
That was five weeks of my life last year — chopping silence, syncing captions, extracting highlights, writing posts. The content was valuable. The process was brutal.
So I automated it. I built an agentic video pipeline that takes one recording and outputs everything a creator needs. Then I used it for my own work. It even processed the video of me building it.
Today, I’m open-sourcing it. It’s called vidpipe, and it’s on GitHub.
What vidpipe Actually Does
vidpipe is a CLI tool that watches a folder and runs every video you drop in through a 15-stage pipeline. It’s the “record once, publish everywhere” workflow industry guides keep recommending, but actually built.
Here’s the 15-stage pipeline:
| Stage | What It Produces |
|---|---|
| 1. Ingestion | Validates codecs, extracts metadata |
| 2. Transcription | Full word-level transcript via Whisper |
| 3. Video Cleaning | ProducerAgent + Gemini vision trims dead air, filler, bad takes (max 20%) |
| 4. Captions | Timed SRT/VTT subtitles with word-level sync |
| 5. Caption Burn | FFmpeg hard-codes karaoke-style captions |
| 6. Shorts | 6 vertical clips (15-60s, multi-segment) |
| 7. Medium Clips | 3 clips (1-3 min, crossfade transitions) |
| 8. Chapters | Detects topic shifts, generates timestamps |
| 9. Summary | Concise video summary |
| 10. Social Media | Post variants for 5 platforms (TikTok, YouTube, Instagram, LinkedIn, X) |
| 11. Short Posts | Platform-native text for each short clip (30 posts total) |
| 12. Medium Clip Posts | Platform-native text for medium clips (15 posts total) |
| 13. Blog | 800–1200 word article with web-sourced links via Exa AI |
| 14. Queue Build | Generates posting schedule for Late API |
| 15. Git Push | Auto-commits outputs and posts to GitHub |
Drop a 20-minute recording in the watched folder. Walk away. Come back to 54 assets ready to publish.
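If you’re curious what that watch-and-process loop looks like in code, here’s a minimal sketch using chokidar. It’s illustrative only; the stage names and wiring are not vidpipe’s actual source.

```typescript
import { watch } from "chokidar";
import path from "node:path";

// Illustrative stage list -- the real pipeline passes artifacts and context
// between stages instead of just logging.
const stages = ["ingest", "transcribe", "clean", "captions", "shorts", "blog"] as const;

async function runPipeline(videoPath: string): Promise<void> {
  for (const stage of stages) {
    console.log(`[${path.basename(videoPath)}] running stage: ${stage}`);
    // each stage reads the previous stage's outputs and writes its own
  }
}

// Watch a folder and process any new .mp4 that appears.
watch(process.env.WATCH_DIR ?? "./recordings", {
  ignoreInitial: true,
  awaitWriteFinish: true, // wait until the recording has finished writing
}).on("add", (file) => {
  if (file.endsWith(".mp4")) {
    runPipeline(file).catch((err) => console.error(err));
  }
});
```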
The Agentic Architecture
I wrote about the future of agentic video editing last week. vidpipe is what that looks like in production.
It’s built on the GitHub Copilot SDK, which exposes the same production-tested agent runtime behind Copilot CLI. Every stage is a specialized AI agent:
- ProducerAgent — cleans video by trimming dead air, filler words, bad takes, and redundant content
- ShortsAgent — analyzes transcripts + Gemini clip direction for viral moments, plans 6 multi-segment shorts
- MediumVideoAgent — identifies chapter-length content with Gemini clip direction, assembles clips with transitions
- ChapterAgent, SummaryAgent, SocialMediaAgent, BlogAgent
Each agent uses structured tool calls like detect_silence, plan_shorts, write_summary, search_links. The Copilot SDK handles JSON-RPC communication and multi-turn conversations automatically.
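To make that concrete, here’s a conceptual sketch of the tool-call pattern. The types and names below are illustrative only; they are not the Copilot SDK’s actual API surface.

```typescript
// Conceptual sketch of structured tool calls. Types and names are
// illustrative, not the Copilot SDK's actual API.
interface ToolDefinition<Args, Result> {
  name: string;
  description: string;
  handler: (args: Args) => Promise<Result>;
}

interface SilenceWindow {
  start: number; // seconds
  end: number;   // seconds
}

// An agent is essentially a system prompt plus a set of typed tools the model
// can invoke across a multi-turn conversation.
const detectSilence: ToolDefinition<{ videoPath: string; minGapSec: number }, SilenceWindow[]> = {
  name: "detect_silence",
  description: "Return silent windows longer than minGapSec seconds",
  handler: async ({ videoPath, minGapSec }) => {
    // in practice: run FFmpeg's silencedetect filter and parse its log output
    return [];
  },
};

const producerTools = [detectSilence /* plan_shorts, write_summary, search_links, ... */];
```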
This follows the multi-agent collaboration pattern Microsoft Research outlined in their Magentic-One paper. Recent surveys position the Copilot SDK alongside CrewAI and LangChain as production-ready agentic platforms.
Transcription That Actually Works
vidpipe uses OpenAI Whisper, trained on 680,000 hours of audio. Benchmarks show Whisper achieves 8.06% word error rate, outperforming Google Speech-to-Text (16.51%–20.63% WER).
Whisper supports 99+ languages and provides word-level timestamps — essential for karaoke captions. The turbo model processes ~8x faster than large, so vidpipe defaults to turbo.
Word-level accuracy is the foundation. If the transcript is wrong, everything downstream breaks. Whisper’s 9.7/10 grammar accuracy means clean input for AI agents.
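For reference, here’s roughly what the transcription call looks like if you shell out to the openai-whisper CLI with word timestamps enabled. This is a simplified sketch; vidpipe’s actual wrapper may differ.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { readFile } from "node:fs/promises";
import path from "node:path";

const run = promisify(execFile);

// Simplified sketch: invoke the openai-whisper CLI with word-level timestamps
// and read back the JSON it writes to the output directory.
async function transcribe(videoPath: string, outDir = "transcripts") {
  await run("whisper", [
    videoPath,
    "--model", "turbo",            // ~8x faster than large
    "--word_timestamps", "True",   // required for karaoke captions
    "--output_format", "json",
    "--output_dir", outDir,
  ]);

  const base = path.basename(videoPath, path.extname(videoPath));
  const raw = await readFile(path.join(outDir, `${base}.json`), "utf8");
  return JSON.parse(raw) as {
    text: string;
    segments: {
      start: number;
      end: number;
      words?: { word: string; start: number; end: number }[];
    }[];
  };
}
```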
Karaoke Captions That Look Good
Over 90% of Gen Z and Millennials watch videos on mute. Karaoke captions — where each word highlights in sync — significantly improve engagement and accessibility.
vidpipe generates SRT/VTT caption files with word-level timestamps and uses FFmpeg filters to burn them into video. FFmpeg is the industry standard for media processing. The implementation follows the same timing approach AI Engineer documented, but runs server-side for batch processing.
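The per-word highlighting itself is usually done with ASS karaoke tags, where each word carries a {\k} duration in centiseconds, and FFmpeg’s subtitles filter burns the result in. Here’s a sketch of the idea; vidpipe’s styling and file assembly are more elaborate.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface Word { word: string; start: number; end: number } // seconds

// Build one ASS "Dialogue" line where each word carries a {\k} duration in
// centiseconds. A complete .ass file also needs the standard [Script Info],
// [V4+ Styles] and [Events] header sections.
function karaokeLine(words: Word[]): string {
  const toAss = (s: number) => {
    const h = Math.floor(s / 3600);
    const m = Math.floor((s % 3600) / 60);
    const sec = (s % 60).toFixed(2).padStart(5, "0");
    return `${h}:${String(m).padStart(2, "0")}:${sec}`;
  };
  const text = words
    .map((w) => `{\\k${Math.round((w.end - w.start) * 100)}}${w.word} `)
    .join("");
  return `Dialogue: 0,${toAss(words[0].start)},${toAss(words[words.length - 1].end)},Default,,0,0,0,,${text}`;
}

// Hard-code the captions with FFmpeg's subtitles filter (requires libass).
async function burnCaptions(video: string, assFile: string, out: string) {
  await run("ffmpeg", ["-i", video, "-vf", `subtitles=${assFile}`, "-c:a", "copy", out]);
}
```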
AI Video Cleaning (Not Just Silence Removal)
Most silence removal tools are binary: detect silence, cut it out. This destroys pacing and misses the bigger picture — filler words, false starts, and repeated explanations are just as wasteful as dead air.
vidpipe’s ProducerAgent takes a two-AI approach. First, Gemini 2.5 Flash watches the raw video and produces editorial notes — frame-by-frame observations about what’s happening: where the speaker stumbles, repeats themselves, stares at the screen, or fills time. Gemini sees the video; text-only LLMs can’t.
Then the ProducerAgent (running on your configured LLM) interprets those notes alongside the transcript. It decides what to cut:
- Dead air and long pauses between topics
- Filler words and false starts (“um”, “so basically”, “let me try that again”)
- Bad takes and repeated explanations
- Redundant content that adds nothing
The agent is conservative — it caps removal at 10–20% of total runtime and preserves intentional pauses, demonstrations, and natural rhythm. Cuts are executed via singlePassEdit() for frame-accurate results in a single FFmpeg re-encode.
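For the curious, frame-accurate multi-cut editing in a single re-encode is commonly done with FFmpeg’s select/aselect filters plus a timestamp rebuild. Here’s a sketch of that technique; it is not necessarily how singlePassEdit() is implemented internally.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface Segment { start: number; end: number } // seconds to KEEP

// Keep only the listed segments in one re-encode: select the wanted frames
// and audio samples, then rebuild timestamps so playback is continuous.
async function applyCuts(input: string, keep: Segment[], output: string) {
  const expr = keep.map((s) => `between(t,${s.start},${s.end})`).join("+");
  const filter =
    `[0:v]select='${expr}',setpts=N/FRAME_RATE/TB[v];` +
    `[0:a]aselect='${expr}',asetpts=N/SR/TB[a]`;
  await run("ffmpeg", [
    "-i", input,
    "-filter_complex", filter,
    "-map", "[v]",
    "-map", "[a]",
    output,
  ]);
}
```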
Shorts Extraction (6 Per Video)
Short-form video generates 12x more shares than traditional content. With TikTok holding ~40% market share, Reels at ~20%, and YouTube Shorts at ~20%, vertical content is essential.
After cleaning, Gemini makes a second pass over the cleaned video to generate clip direction — suggested hooks, timestamps, platform fit, and engagement potential for short-form content. This is multimodal analysis: Gemini sees the visuals, not just the transcript.
The ShortsAgent receives this clip direction as supplementary context but makes its own decisions. Each short runs 15–60 seconds, can span multiple non-contiguous segments from the cleaned video, renders in 9:16 with captions, and comes with 5 platform-native posts. That’s 30 ready-to-publish assets from shorts alone.
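Rendering a multi-segment short boils down to another filtergraph: pick the planned segments from the cleaned video, rebuild timestamps, then crop and scale to 9:16. A sketch with illustrative timings; vidpipe’s face-aware and split-screen layouts go further.

```typescript
// Segment times are illustrative.
const segments = [
  { start: 42.0, end: 61.5 },
  { start: 130.2, end: 148.0 },
];
const select = segments.map((s) => `between(t,${s.start},${s.end})`).join("+");

// Centre-crop the 16:9 source to 9:16, then scale to 1080x1920.
const filtergraph =
  `[0:v]select='${select}',setpts=N/FRAME_RATE/TB,` +
  `crop=ih*9/16:ih,scale=1080:1920[v];` +
  `[0:a]aselect='${select}',asetpts=N/SR/TB[a]`;

// ffmpeg -i cleaned.mp4 -filter_complex "<filtergraph>" -map "[v]" -map "[a]" short-01.mp4
```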
Medium Clips (Chapter-Length Content)
The MediumVideoAgent also receives Gemini’s clip direction and identifies complete ideas that need 1–3 minutes. Each clip is cut from the cleaned video, spans a single topic, includes crossfade transitions if needed, renders in 16:9, and generates 5 platform-native posts. That’s 15 more assets.
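When a medium clip stitches two topic segments together, the join is an FFmpeg crossfade. A sketch of the filtergraph, with illustrative durations:

```typescript
// Join two pre-cut segment files (same resolution and frame rate) with a
// half-second crossfade. `offset` is measured from the start of the first
// input, i.e. its duration minus the transition length.
const firstClipDuration = 75.0; // seconds, illustrative
const fade = 0.5;

const filtergraph =
  `[0:v][1:v]xfade=transition=fade:duration=${fade}:offset=${firstClipDuration - fade}[v];` +
  `[0:a][1:a]acrossfade=d=${fade}[a]`;

// ffmpeg -i part1.mp4 -i part2.mp4 -filter_complex "<filtergraph>" -map "[v]" -map "[a]" medium-01.mp4
```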
Face Detection & Blog Posts
If you record with a webcam, vidpipe uses ONNX-based face detection to automatically crop the webcam region for clean split-screen layouts.
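The detection itself runs through an ONNX model, but the interesting step is turning a face bounding box into a crop. Here’s a sketch of that step; the helper and numbers are illustrative, not vidpipe’s internals.

```typescript
interface Box { x: number; y: number; w: number; h: number } // pixels

// Given a detected face box, derive an FFmpeg crop for the webcam region with
// some padding, clamped to the frame.
function webcamCrop(face: Box, frameW: number, frameH: number, pad = 1.8): string {
  const w = Math.min(frameW, Math.round(face.w * pad));
  const h = Math.min(frameH, Math.round(face.h * pad));
  const x = Math.max(0, Math.min(frameW - w, Math.round(face.x + face.w / 2 - w / 2)));
  const y = Math.max(0, Math.min(frameH - h, Math.round(face.y + face.h / 2 - h / 2)));
  return `crop=${w}:${h}:${x}:${y}`;
}

// e.g. webcamCrop({ x: 1500, y: 60, w: 220, h: 220 }, 1920, 1080)
//   -> "crop=396:396:1412:0", ready to drop into a -vf chain
```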
The BlogAgent writes 800–1200 word articles enriched with links from Exa AI’s embeddings-based search. Not SEO spam — real content with proper attribution, ready for your blog or Dev.to.
Auto-Publish Scheduling
vidpipe integrates with Late API, supporting 13+ platforms from a single endpoint. Approve the queue in vidpipe’s review web app; Late handles publication across TikTok, YouTube, Instagram, LinkedIn, X, Facebook, and more.
Built for Modern Node.js
vidpipe is TypeScript (ES2022, ESM) with modern packaging practices. It requires Node.js 20+, matching Commander.js v14’s minimum, and uses TypeScript’s "module": "nodenext" with pure ESM for simpler packaging.
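If you’re setting up a similar package, the relevant settings look roughly like this (abridged; check the repo for the real configs):

```jsonc
// tsconfig.json (abridged)
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "nodenext",
    "moduleResolution": "nodenext"
  }
}

// package.json (abridged): "type": "module" makes the package pure ESM,
// and the engines field pins the Node.js 20+ requirement.
// {
//   "type": "module",
//   "engines": { "node": ">=20" }
// }
```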
Core dependencies:
- Commander.js — most widely used CLI framework
- Sharp — 4x-5x faster than ImageMagick, 31,891 stars
- Winston — structured logging
- Chokidar — file system watching
Full stack documented in the GitHub repo.
Getting Started (3 Commands)
Install from npm:

```bash
npm install -g vidpipe
```

Run the interactive setup:

```bash
vidpipe init
```

Process a video:

```bash
vidpipe path/to/video.mp4
```

That’s it. The first run will check prerequisites (ffmpeg, ffprobe, API keys) and guide you through configuration. After that, it’s just `vidpipe [video-path]`.
Want continuous processing? Use watch mode:
```bash
vidpipe --watch-dir ~/Videos/Recordings
```
vidpipe monitors the folder and processes new videos automatically as they appear.
Check system health:

```bash
vidpipe --doctor
```

Review generated posts before publishing:

```bash
vidpipe review
```

View the posting schedule:

```bash
vidpipe schedule
```
Full documentation is at htekdev.github.io/vidpipe.
Multi-LLM Support
vidpipe defaults to GitHub Copilot with Claude Opus 4.6, but supports BYOK (Bring Your Own Key) for OpenAI and Anthropic — useful for teams with existing AI infrastructure.
Gemini handles video understanding separately via the @google/genai SDK. Two dedicated passes — editorial notes for cleaning and clip direction for shorts/mediums — use gemini-2.5-flash by default. Gemini is purpose-built for multimodal video analysis, so it runs outside the Copilot/OpenAI/Claude provider abstraction.
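Conceptually, the provider split looks like this. The environment-variable handling below is illustrative only, not vidpipe’s actual configuration surface.

```typescript
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import { GoogleGenAI } from "@google/genai";

// Illustrative BYOK sketch: pick a text-LLM client from whatever key is set.
// Variable names and fallback behaviour are not vidpipe's actual config.
function createTextClient(): OpenAI | Anthropic {
  if (process.env.OPENAI_API_KEY) {
    return new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  }
  if (process.env.ANTHROPIC_API_KEY) {
    return new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
  }
  throw new Error("No BYOK key set; the default Copilot provider is handled separately");
}

// Video understanding always goes through Gemini, outside that abstraction.
const gemini = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
```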
Why I’m Open-Sourcing This
I built vidpipe for me. It saved 150+ hours in the first month.
The creator economy is exploding: 74% of organizations increased creator marketing investment in 2024, campaigns up 17% YoY, $79M paid to creators (47% increase). Brands published 9.5 social posts per day in 2024 — an “uncomfortable math problem.”
Nearly half of B2B marketers say inability to scale content blocks productivity. The “record once, publish everywhere” strategy is widely recommended but rarely automated.
vidpipe automates it. Now it’s yours.
What I Hope You Build
This is production code. It’s ISC licensed. You can:
- Use it as-is for your own content creation
- Fork it and add your own agents (thumbnail generation with DALL-E? YouTube upload automation? Podcast conversion?)
- Study the agentic architecture and build your own pipelines
- Contribute back improvements, bug fixes, and new features
I’m particularly interested in seeing:
- Integration with more platforms (Bluesky, Mastodon, Threads native API)
- Support for multi-speaker detection (interviews, panels, podcasts)
- AI-generated B-roll and visual overlays
- Voice cloning for dubbed translations
- A hosted version for non-technical creators
The GitHub repo has contribution guidelines and a roadmap. The npm package is published. The docs are live.
Your recordings shouldn’t sit on your hard drive. Now they won’t.