Prerequisites
Before starting, ensure you have:
- Python 3.8 or higher
- FFmpeg (required for audio extraction)
- 4GB+ available RAM (8GB+ recommended for GPU acceleration)
- Optional: NVIDIA GPU with CUDA support for faster processing
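The Python version requirement above can be checked from the interpreter itself. This is a minimal sketch; the helper name meets_min_python is made up for illustration and is not part of the project.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the prerequisites


def meets_min_python(version=None, minimum=MIN_PYTHON):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_min_python() else "too old"
    print(f"Python {current}: {status}")
```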
System Dependencies
Install FFmpeg
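FFmpeg is required for extracting audio from video files.
- macOS: install with a package manager, for example Homebrew (brew install ffmpeg)
- Ubuntu/Debian: install with apt (sudo apt install ffmpeg)
- Windows: download from ffmpeg.org and add it to PATH
Verify the installation by running ffmpeg -version in a new terminal.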
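The same check can be done from Python, which is useful because the scripts need to find FFmpeg through the same PATH lookup. The sketch below uses only the standard library; the function name ffmpeg_version is illustrative and not part of the project.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not usable."""
    executable = shutil.which("ffmpeg")  # same PATH lookup a shell would do
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version if version else "FFmpeg not found on PATH")
```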
Install Python Dependencies
The project uses PyTorch, CLIP, Whisper, and FastAPI. Install all of these dependencies before running the scripts.
For GPU acceleration, install PyTorch with CUDA support following the instructions at pytorch.org.
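After installing, a quick import check can show which packages are still missing. This sketch assumes the usual import names for the libraries above (torch, clip, whisper, fastapi) plus cv2 for OpenCV, which the Troubleshooting section mentions. The helper missing_modules is hypothetical.

```python
import importlib.util

# Assumed import names; adjust if the project's requirements differ.
REQUIRED_MODULES = ["torch", "clip", "whisper", "fastapi", "cv2"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```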
Directory Structure
The project expects a fixed directory layout rooted at data/Favorites/videos/.
Organize Your Videos
Place your TikTok videos in the appropriate locations:
Labeled Videos (for training):
- Create a subfolder for each category: data/Favorites/videos/[category-name]/
- Move videos into their respective category folders
- Examples: soccer/, cooking/, funny/, motivational/
Unlabeled Videos (to be sorted):
- Place directly in data/Favorites/videos/
- These will be automatically sorted by the model
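To confirm the layout before running anything, the folder can be scanned for labeled and unlabeled videos. This is a sketch, not the project's own loader: the function scan_videos and the extension list are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: which file extensions count as videos; adjust to your files.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}


def scan_videos(videos_root):
    """Split the videos folder into labeled (per category) and unlabeled files."""
    root = Path(videos_root)
    labeled = {}    # category name -> sorted list of video file names
    unlabeled = []  # video files placed directly in the root folder
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            labeled[entry.name] = sorted(
                p.name
                for p in entry.iterdir()
                if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
            )
        elif entry.is_file() and entry.suffix.lower() in VIDEO_EXTENSIONS:
            unlabeled.append(entry.name)
    return labeled, unlabeled
```

Calling scan_videos("data/Favorites/videos") returns a dictionary of categories and a list of files still waiting to be sorted. An empty category list is an easy way to spot a folder with no training examples.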
Configuration
The scripts use hardcoded paths relative to the script location. If you need to customize paths, edit the path constants in extract_features.py.
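Paths relative to the script location are commonly built with pathlib, as sketched below. The actual constant names in extract_features.py are not shown here, so project_paths, script_dir, and videos_dir are hypothetical. Placing the data folder next to the script is also an assumption.

```python
from pathlib import Path


def project_paths(script_file):
    """Build data paths relative to the directory containing `script_file`."""
    script_dir = Path(script_file).resolve().parent
    # Assumption: the data folder sits alongside the script.
    videos_dir = script_dir / "data" / "Favorites" / "videos"
    return {"script_dir": script_dir, "videos_dir": videos_dir}
```

Inside a script this would be called as project_paths(__file__), so the paths stay valid regardless of the directory the script is launched from.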
Model Configuration Options
CLIP Models:
- ViT-B/32 - Default, balanced speed/accuracy (512-d embeddings)
- ViT-B/16 - Higher accuracy, slower
- RN50 - ResNet-50 backbone alternative
Whisper Models:
- tiny - Fastest, least accurate (~1GB VRAM)
- base - Default, good balance (~1GB VRAM)
- small - Better transcription (~2GB VRAM)
- medium - High accuracy (~5GB VRAM)
- large - Best quality (~10GB VRAM)
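The Whisper VRAM figures above can be turned into a simple selection rule: pick the largest model whose approximate requirement fits the memory you have. This is only a sketch. The helper pick_whisper_model is hypothetical, and the numbers are the rough estimates from the list, not measured values.

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
WHISPER_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_whisper_model(available_vram_gb):
    """Return the largest listed model that fits, or None if none fits."""
    choice = None
    for name, required_gb in WHISPER_VRAM_GB:
        if required_gb <= available_vram_gb:
            choice = name  # later entries are larger, so keep overwriting
    return choice
```

CLIP and any other GPU work share the same memory, so in practice it is safer to leave headroom and pass in less than the card's full capacity.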
GPU Acceleration
The scripts automatically detect and use CUDA if available. Expected timings:
- CPU: ~45-60 minutes for feature extraction
- GPU (RTX 3060): ~8-12 minutes for feature extraction
Training is fast (~5-30 seconds) regardless of device since it only trains on extracted features.
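Device detection of this kind is typically a single check with PyTorch. The sketch below shows the common pattern and falls back to CPU when PyTorch is not installed. The function name pick_device is illustrative and may not match the scripts' own code.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Using device:", pick_device())
```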
Troubleshooting
ImportError: No module named 'clip'
CLIP must be installed from the openai/CLIP GitHub repository, not from PyPI: pip install git+https://github.com/openai/CLIP.git
FFmpeg not found error
Ensure FFmpeg is in your PATH. If it is not found, reinstall it and restart your terminal.
CUDA out of memory
Reduce batch size or use smaller models:
- Switch Whisper from base to tiny
- Use CPU instead: set device = "cpu" in scripts
- Process videos in smaller batches
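Processing videos in smaller batches can be sketched as a generator that yields fixed-size slices of the file list. The name in_batches is hypothetical, and how each batch is fed to the extraction script depends on the project.

```python
def in_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    videos = ["a.mp4", "b.mp4", "c.mp4"]
    for batch in in_batches(videos, 2):
        print(batch)
```

Smaller batches lower peak memory use at the cost of more iterations, so halving the batch size is a reasonable first step when the out-of-memory error appears.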
ModuleNotFoundError: No module named 'cv2'
Install OpenCV, which provides the cv2 module: pip install opencv-python
Next Steps
With your environment set up, proceed to:
- Feature Extraction: Extract multimodal embeddings from your video collection