
What It Does

Clip Hand is an AI-powered shorts factory that takes any video URL or file and transforms it into 3-5 viral short clips (30-90 seconds each) with burned-in captions, vertical formatting (9:16 for TikTok/Reels/Shorts), thumbnails, and optional AI voice-over. This is an 8-phase pipeline: Download → Transcribe → Analyze Content → Pick Viral Segments → Extract & Crop → Add Captions → Generate Thumbnails → Optionally Publish to Telegram/WhatsApp.

Key Features

  • Content-based clipping: Reads transcript to pick segments based on hooks, emotional peaks, and insight density—not just visual scene changes
  • 5 STT backends: YouTube auto-subs, Groq Whisper (fast/free), OpenAI Whisper, Deepgram Nova-2, local Whisper
  • Vertical formatting: Auto-crops to 1080x1920 (9:16) for mobile
  • Styled captions: Burned-in SRT subtitles with customizable fonts and positioning
  • Optional TTS: AI voice-over with Edge TTS (free), OpenAI TTS, or ElevenLabs
  • Auto-publish: Send finished clips to Telegram channels or WhatsApp contacts

Activation

Clip Hand requires FFmpeg, ffprobe, and yt-dlp. See the Requirements section below.
# Activate Clip Hand
openfang hand activate clip

# Activate with specific STT provider
openfang hand activate clip --settings "stt_provider=groq_whisper"

# Activate with TTS voice-over
openfang hand activate clip --settings "tts_provider=edge_tts"

Requirements

1. Install FFmpeg

macOS:
brew install ffmpeg
Windows:
winget install Gyan.FFmpeg
Linux (Debian/Ubuntu):
sudo apt install ffmpeg
Or download from ffmpeg.org/download.html.
Estimated time: 2-5 minutes.
Note: ffprobe ships bundled with FFmpeg.
2. Install yt-dlp

macOS:
brew install yt-dlp
Windows:
winget install yt-dlp.yt-dlp
Linux (Debian/Ubuntu):
sudo apt install yt-dlp
Or via pip:
pip install yt-dlp
Estimated time: 1-2 minutes
3. (Optional) Install Local Whisper

Only needed if you want local transcription (no API keys):
pip install openai-whisper
Note: Requires GPU for fast transcription. CPU-only is very slow.
4. Verify Installation

ffmpeg -version
ffprobe -version
yt-dlp --version
All should return version numbers.
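The verification above can be wrapped in a quick preflight check. This is a sketch, not part of Clip Hand — `missing_tools` is a hypothetical helper name, and any POSIX shell works:

```shell
# Preflight sketch (hypothetical helper, not part of Clip Hand): prints the
# subset of its arguments that are not found on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  echo "$missing"
}

# Empty output means everything Clip Hand needs is installed.
missing_tools ffmpeg ffprobe yt-dlp
```

Empty output means all three tools are installed; any name printed is missing from PATH.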

Configuration Settings

stt_provider (select, default: "auto")
How audio is transcribed to text for captions and clip selection:
  • auto: Auto-detect (tries YouTube subs first, then Groq/OpenAI/local Whisper)
  • whisper_local: Local Whisper (requires GPU for speed)
  • groq_whisper: Groq Whisper API (fast, free tier) - requires GROQ_API_KEY
  • openai_whisper: OpenAI Whisper API - requires OPENAI_API_KEY
  • deepgram: Deepgram Nova-2 - requires DEEPGRAM_API_KEY

tts_provider (select, default: "none")
Optional voice-over or narration generation for clips:
  • none: Disabled (captions only) - default
  • edge_tts: Edge TTS (free, no API key)
  • openai_tts: OpenAI TTS - requires OPENAI_API_KEY
  • elevenlabs: ElevenLabs - requires ELEVENLABS_API_KEY

elevenlabs_api_key (text)
API key from elevenlabs.io for high-quality text-to-speech. Required when ElevenLabs TTS is selected.

publish_target (select, default: "local_only")
Where to send finished clips after processing:
  • local_only: Local files only (no publishing) - default
  • telegram: Telegram channel
  • whatsapp: WhatsApp contact/group
  • both: Telegram + WhatsApp

telegram_bot_token (text)
Bot token from @BotFather on Telegram (e.g., 123456:ABC-DEF...). Bot must be admin in the target channel.

telegram_chat_id (text)
Channel: -100XXXXXXXXXX or @channelname. Group: numeric ID. Get it via @userinfobot.

whatsapp_token (text)
Permanent access token from Meta Business Settings > System Users. Temporary tokens expire in 24h.

whatsapp_phone_id (text)
Phone Number ID from Meta Developer Portal > WhatsApp > API Setup (e.g., 1234567890).

whatsapp_recipient (text)
Phone number in international format, no + or spaces (e.g., 14155551234).

Required Tools

Clip Hand requires access to these tools (all built-in):
  • shell_exec — Platform detection and FFmpeg/yt-dlp commands
  • file_read, file_write, file_list — Transcript and clip files
  • web_fetch — Metadata extraction
  • memory_store, memory_recall — State persistence

System Prompt Overview

Clip Hand operates in 8 phases:
Phase 1: Platform Detection

Detects OS (Windows/macOS/Linux) to adapt command syntax. Verifies FFmpeg, ffprobe, and yt-dlp are installed.
Phase 2: Intake

Detects input type (URL or local file). For URLs, extracts metadata with yt-dlp --dump-json. For files, analyzes with ffprobe. Warns if video >2 hours.
Phase 3: Download

For URLs: downloads video with yt-dlp (up to 1080p). Attempts to grab existing YouTube auto-subs (saves transcription time). For local files: verifies playability.
Phase 4: Transcribe

Tries transcription paths in order: (A) YouTube auto-subs if available, (B) Groq Whisper API, (C) OpenAI Whisper API, (D) Deepgram Nova-2, (E) Local Whisper. If all five fail, (F) falls back to scene/silence detection (no transcript text). Produces word-level timing.
Phase 5: Analyze & Pick Segments

This is the core value. Reads full transcript, identifies 3-5 segments worth clipping based on: hook in first 3 seconds, self-contained story, emotional peaks, controversial takes, insight density, clean ending. Each 30-90 seconds.
Phase 6: Extract & Process

For each segment: (1) Extract clip with FFmpeg, (2) Crop to vertical 9:16, (3) Generate SRT captions from transcript, (4) Burn captions onto video with styled text, (5) Optionally add TTS voice-over, (6) Generate thumbnail.
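The 9:16 crop in this phase can be sketched numerically. This is illustrative only — the hand's actual FFmpeg filter may differ. For a 1920x1080 source, a centered 9:16 window is 607px wide, rounded down to an even width for H.264:

```shell
# 9:16 crop geometry for a 1920x1080 source (a sketch, not Clip Hand's code):
src_w=1920; src_h=1080
crop_w=$(( src_h * 9 / 16 ))        # 607
crop_w=$(( crop_w - crop_w % 2 ))   # 606: H.264 needs even dimensions
crop_x=$(( (src_w - crop_w) / 2 ))  # 657: center the window horizontally
echo "crop=${crop_w}:${src_h}:${crop_x}:0,scale=1080:1920"
```

This prints `crop=606:1080:657:0,scale=1080:1920`, a filter value FFmpeg's `-vf` option accepts for cropping to a centered vertical window and scaling up to 1080x1920.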
Phase 7: Publish (Optional)

If publishing is configured: uploads clips to Telegram (max 50MB) and/or WhatsApp (max 16MB). Re-encodes if needed. Respects rate limits.
Phase 8: Report

Generates summary table: clip #, title, file path, duration, file size, thumbnail path. Updates dashboard statistics.

Usage Examples

Basic Clipping (YouTube URL)

openfang chat clip
> "Turn this video into shorts: https://youtube.com/watch?v=dQw4w9WgXcQ"
Clip Hand will:
  1. Download the video
  2. Grab YouTube auto-subs (if available) or transcribe with Groq Whisper
  3. Analyze transcript and pick 3-5 viral segments
  4. Extract clips, crop to vertical, add captions, generate thumbnails
  5. Save to clip_1_final.mp4, clip_2_final.mp4, etc.

With Voice-Over

openfang hand configure clip --set tts_provider="edge_tts"
openfang chat clip
> "Create shorts with AI voice-over from: https://youtube.com/watch?v=..."
Each clip will have the original audio (reduced to 30% volume) mixed with AI narration reading the captions.
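That mix corresponds to a standard FFmpeg filtergraph. The following is an illustrative sketch, not the hand's exact command — file names are examples:

```shell
# Duck the original audio to 30% and mix in the TTS narration track
# (illustrative filtergraph; Clip Hand's actual command may differ):
filter='[0:a]volume=0.3[bg];[bg][1:a]amix=inputs=2:duration=first[mixed]'
echo "ffmpeg -i clip.mp4 -i narration.mp3 -filter_complex \"$filter\" -map 0:v -map \"[mixed]\" -c:v copy -c:a aac clip_vo.mp4"
```

The `volume` filter reduces the original track, and `amix` blends it with the narration, keeping the first input's duration.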

Local File

openfang chat clip
> "Clip this file into shorts: /path/to/recording.mp4"
Works the same, but skips the download step.

Custom Clip Count

openfang chat clip
> "Create exactly 5 clips from this video: https://..."

Custom Timestamps

openfang chat clip
> "Extract clips at these timestamps: 1:23-2:15, 5:40-6:30, 10:00-11:00"
Skips the analysis phase, uses your exact timestamps.
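Timestamps like `1:23` map to seconds for FFmpeg's `-ss`/`-to` flags. A sketch of the conversion — `to_sec` is a hypothetical helper; the hand does this parsing internally:

```shell
# Convert "M:SS" (as in "1:23-2:15") to seconds (hypothetical helper):
to_sec() {
  m=${1%%:*}                     # minutes before the colon
  s=${1##*:}                     # seconds after the colon
  echo $(( m * 60 + ${s#0} ))    # strip a leading zero so "08" isn't read as octal
}
to_sec 1:23   # 83
to_sec 2:15   # 135
```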

Publish to Telegram

openfang hand configure clip \
  --set publish_target="telegram" \
  --set telegram_bot_token="123456:ABC-DEF..." \
  --set telegram_chat_id="@mychannel"

openfang chat clip
> "Clip this and send to Telegram: https://..."
After processing, clips are auto-uploaded to your Telegram channel.

Viral Segment Selection

Clip Hand identifies viral segments using these criteria:

Hook in the first 3 seconds
A surprising claim, question, or emotional statement that makes people watch.
Good hooks:
  • “I almost quit 3 years ago. Then I discovered…”
  • “90% of startups fail because of this one mistake”
  • “This changed everything:”
Bad hooks:
  • “Hey guys, welcome back to my channel”
  • “So, um, today I want to talk about…”

Self-contained story
Makes sense without the full video context. Doesn’t require “you had to be there” knowledge.

Emotional peaks
Moments of laughter, surprise, anger, vulnerability, or triumph. Emotion drives shares.

Controversial takes
Things people want to share or argue about. “Unpopular opinion: …” format.

Insight density
High ratio of interesting ideas per second. No filler, no rambling.

Clean ending
Ends on a punchline, conclusion, or dramatic pause. Doesn’t trail off mid-sentence.

Dashboard Metrics

Clip Hand tracks five key metrics:

Jobs Completed

Total video processing jobs finished.

Clips Generated

Total short clips produced.

Total Duration

Cumulative duration of all clips (in seconds).

Published to Telegram

Clips successfully sent to Telegram.

Published to WhatsApp

Clips successfully sent to WhatsApp.
View in the dashboard at http://localhost:4200/hands/clip.

STT Provider Comparison

| Provider | Speed | Cost | Quality | API Key Required |
|---|---|---|---|---|
| YouTube auto-subs | Instant | Free | Good | No |
| Groq Whisper | Very fast | Free tier | Excellent | Yes (GROQ_API_KEY) |
| OpenAI Whisper | Fast | $0.006/min | Excellent | Yes (OPENAI_API_KEY) |
| Deepgram Nova-2 | Fastest | Paid | Excellent | Yes (DEEPGRAM_API_KEY) |
| Local Whisper | Slow (CPU) / Fast (GPU) | Free | Excellent | No |
Recommendation: Use auto (default). It tries YouTube subs first (instant), then Groq (fast + free), then falls back to others.

TTS Provider Comparison

| Provider | Quality | Cost | API Key Required |
|---|---|---|---|
| Edge TTS | Good | Free | No |
| OpenAI TTS | Excellent | $15/1M chars | Yes (OPENAI_API_KEY) |
| ElevenLabs | Outstanding | Paid | Yes (ELEVENLABS_API_KEY) |
Recommendation: Start with Edge TTS (free, no setup). Upgrade to ElevenLabs for premium voice quality.

Output Files

For each clip, Clip Hand produces:
  • clip_N_final.mp4: The finished clip (1080x1920, captions burned in, optional TTS)
  • clip_N.srt: SRT subtitle file (word-level timing)
  • thumb_N.jpg: Thumbnail (frame at 2 seconds)
All files are saved in the same directory as the source video (or the current directory for URLs).
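The `.srt` files use the standard `HH:MM:SS,mmm` timestamp format. A sketch of how a word's millisecond offset maps onto it — `srt_ts` is a hypothetical helper, not part of Clip Hand:

```shell
# Format a millisecond offset as an SRT timestamp (hypothetical helper):
srt_ts() {
  ms=$(( $1 % 1000 ))
  s=$(( $1 / 1000 ))
  printf '%02d:%02d:%02d,%03d\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 )) "$ms"
}
srt_ts 83500   # 00:01:23,500
```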

Publishing

Telegram

Requires:
  1. Bot Token: Create a bot via @BotFather
  2. Chat ID: Your channel ID (e.g., -100123456789 or @channelname)
  3. Bot must be admin in the channel
File size limit: 50MB. Clips larger than 50MB are automatically re-encoded.
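The size-driven re-encode implies a simple bitrate budget. A sketch of the arithmetic — the duration and audio bitrate below are example values, not the hand's actual logic:

```shell
# Bitrate budget for a Telegram-safe re-encode (illustrative numbers):
duration=60                            # clip length in seconds
limit_bytes=$(( 49 * 1024 * 1024 ))    # aim just under the 50MB cap
audio_kbps=128
total_kbps=$(( limit_bytes * 8 / duration / 1000 ))
video_kbps=$(( total_kbps - audio_kbps ))
echo "target video bitrate: ${video_kbps}k"
```

A 60-second clip can therefore afford roughly a 6.7 Mbps video stream before hitting the limit; longer clips get proportionally less.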

WhatsApp

Requires:
  1. Access Token: Permanent token from Meta Business Settings > System Users
  2. Phone Number ID: From Meta Developer Portal > WhatsApp > API Setup
  3. Recipient: Phone number in international format (e.g., 14155551234)
File size limit: 16MB. Clips larger than 16MB are automatically re-encoded. 24-hour window: WhatsApp requires the recipient to have messaged you within the last 24 hours (for non-template messages).

Best Practices

  • Real output only: Clip Hand will never fabricate command output. All FFmpeg/yt-dlp operations are run with actual commands. If a command fails, it reports the real error.
  • Long videos: For videos over 1 hour, specify which segment to focus on: “Clip the first 30 minutes” or “Focus on the Q&A section starting at 45:00”.
  • Free transcription: YouTube auto-subs (when available) are instant and free. Keep stt_provider=auto to try them first.
  • Repeat creators: If you’re clipping the same creator repeatedly, save their YouTube channel URL and let Clip Hand pull the latest video:
openfang chat clip
> "Clip the latest video from @MrBeast"

Advanced Configuration

Custom Caption Styling

Edit ~/.openfang/hands/clip.toml to customize caption appearance:
[agent.captions]
font_size = 24
font_name = "Arial"
primary_color = "&H00FFFFFF"  # White
outline_color = "&H00000000"  # Black
outline_thickness = 2
alignment = 2  # Bottom center
margin_v = 40  # Pixels from bottom
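These values correspond to standard ASS subtitle style fields, which FFmpeg's subtitles filter accepts via force_style when burning captions. A sketch of the mapping — illustrative only; the hand's exact filter invocation is not documented here:

```shell
# Map the TOML caption settings onto FFmpeg's subtitles force_style option
# (illustrative; field names are standard ASS style fields):
style="FontName=Arial,FontSize=24,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Alignment=2,MarginV=40"
echo "-vf \"subtitles=clip_1.srt:force_style='${style}'\""
```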

Batch Processing

Process multiple videos in one command:
openfang chat clip
> "Clip these videos: https://youtube.com/watch?v=1, https://youtube.com/watch?v=2, https://youtube.com/watch?v=3"
Clip Hand will process them sequentially.

Integration with Content Calendar

Schedule daily clipping:
openfang hand configure clip --set publish_target="telegram"

# Schedule daily at 6 AM
echo "Clip the latest video from @channel and publish" | \
  openfang schedule create --hand clip --cron "0 6 * * *"

Example Output

# Clip Job: "How I Built a $1M SaaS in 6 Months"
**Source**: https://youtube.com/watch?v=dQw4w9WgXcQ
**Duration**: 45:23 | **STT**: YouTube auto-subs | **TTS**: None
**Clips Generated**: 4

| # | Title | File | Duration | Size |
|---|-------|------|----------|------|
| 1 | "The $1M Idea" | clip_1_final.mp4 | 42s | 8.2MB |
| 2 | "Biggest Mistake" | clip_2_final.mp4 | 51s | 9.8MB |
| 3 | "First $100K" | clip_3_final.mp4 | 38s | 7.1MB |
| 4 | "Advice for Founders" | clip_4_final.mp4 | 46s | 8.9MB |

## Publishing
- Telegram: 4/4 sent successfully
- WhatsApp: Not configured

All clips saved to: /Users/you/clips/

Troubleshooting

Error: “Unable to extract video data”
Fix: Update yt-dlp:
pip install --upgrade yt-dlp
Or:
brew upgrade yt-dlp

Issue: Local Whisper on CPU is 10-50x slower than real-time.
Fix: Use Groq Whisper API (fast + free) instead:
export GROQ_API_KEY="your_key_here"
openfang hand configure clip --set stt_provider="groq_whisper"

Issue: Caption text is too long for the 1080px frame width.
Fix: Reduce font_size or enable word wrapping in SRT generation.

Issue: Telegram upload fails because the clip exceeds the 50MB limit.
Fix: Clip Hand will automatically re-encode to <50MB. If it still fails, manually raise the CRF (lower quality, smaller file):
ffmpeg -i clip_N_final.mp4 -fs 49M -c:v libx264 -crf 30 -preset fast -c:a aac -y clip_N_tg.mp4

Next Steps

Twitter Hand

Share your clips on Twitter/X

Researcher Hand

Research trending topics to clip
