This guide covers setting up your development environment, installing dependencies, and organizing your data directory structure.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher
  • FFmpeg (required for audio extraction)
  • 4GB+ available RAM (8GB+ recommended for GPU acceleration)
  • Optional: NVIDIA GPU with CUDA support for faster processing
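The Python and FFmpeg prerequisites can be sanity-checked with a short script. This is a minimal sketch (the model-loading checks come later in this guide); it only inspects the interpreter version and the PATH:

```python
import shutil
import sys

def check_prerequisites():
    """Return a list of problems with the local environment, empty if none."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append(f"Python 3.8+ required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg not found on PATH")
    return problems

if __name__ == "__main__":
    issues = check_prerequisites()
    print("OK" if not issues else "\n".join(issues))
```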

System Dependencies

1. Install FFmpeg

FFmpeg is required for extracting audio from video files.

macOS:
brew install ffmpeg
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
Windows: Download from ffmpeg.org and add it to your PATH.

Verify the installation:
ffmpeg -version
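The project's actual FFmpeg invocation lives in extract_features.py; as an illustration of the kind of command involved, here is a hypothetical helper that builds an audio-extraction command producing 16 kHz mono WAV, the sample rate Whisper expects:

```python
def ffmpeg_audio_cmd(video_path, audio_path):
    """Build an FFmpeg command extracting 16 kHz mono WAV audio from a video.

    Run the returned command with subprocess.run(cmd, check=True).
    """
    return [
        "ffmpeg",
        "-y",                 # overwrite output without prompting
        "-i", str(video_path),
        "-vn",                # drop the video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate (Whisper's native rate)
        str(audio_path),
    ]

cmd = ffmpeg_audio_cmd("data/Favorites/videos/123456789.mp4", "tmp.wav")
print(" ".join(cmd))
```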
2. Install Python Dependencies

The project uses PyTorch, CLIP, Whisper, and FastAPI. Install all dependencies:
pip install torch torchvision torchaudio
pip install openai-whisper
pip install git+https://github.com/openai/CLIP.git
pip install opencv-python pillow numpy
pip install scikit-learn
pip install fastapi uvicorn
pip install tqdm
For GPU acceleration, install PyTorch with CUDA support by following the instructions at pytorch.org.
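If you prefer a single install step, the packages above can be collected into a requirements.txt. The package names come from this guide; no versions are pinned here, and the CLIP line uses pip's direct-from-git syntax:

```text
torch
torchvision
torchaudio
openai-whisper
git+https://github.com/openai/CLIP.git
opencv-python
pillow
numpy
scikit-learn
fastapi
uvicorn
tqdm
```

Then install everything with pip install -r requirements.txt.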
3. Verify Installation

Test that all models can be loaded:
import torch
import clip
import whisper

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load CLIP
clip_model, preprocess = clip.load("ViT-B/32", device=device)
print("CLIP loaded successfully")

# Load Whisper
whisper_model = whisper.load_model("base", device=device)
print("Whisper loaded successfully")
Expected output:
Using device: cuda  # or cpu
CLIP loaded successfully
Whisper loaded successfully

Directory Structure

The project expects the following directory layout:
tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── index.html
├── data/
│   └── Favorites/
│       └── videos/
│           ├── 123456789.mp4        # Unsorted videos (root level)
│           ├── 987654321.mp4
│           ├── soccer/               # Category folder
│           │   ├── 111111111.mp4
│           │   └── 222222222.mp4
│           ├── cooking/
│           │   └── 333333333.mp4
│           └── funny/
│               └── 444444444.mp4
└── artifacts/                        # Created automatically
    ├── labeled_embeddings.pt
    ├── unlabeled_embeddings.pt
    ├── transcripts.json
    ├── model.pt
    ├── model_config.json
    └── predictions.json
1. Create Directory Structure

mkdir -p data/Favorites/videos
mkdir -p artifacts
2. Organize Your Videos

Place your TikTok videos in the appropriate locations.

Labeled Videos (for training):
  • Create a subfolder for each category: data/Favorites/videos/[category-name]/
  • Move videos into their respective category folders
  • Examples: soccer/, cooking/, funny/, motivational/
Unlabeled Videos (for prediction):
  • Place directly in data/Favorites/videos/
  • These will be automatically sorted by the model
You need at least 5-10 labeled videos per category for meaningful training results. Categories with fewer examples may not be learned effectively.
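To check whether each category clears that threshold, a small scan of the data directory can report per-category counts. This helper is a sketch, not part of the project's scripts; it treats files at the root as unlabeled and files in subfolders as labeled by their folder name:

```python
from collections import Counter
from pathlib import Path

def count_videos(videos_dir, min_per_category=5, exts=(".mp4",)):
    """Count labeled videos per category folder and unlabeled videos at the root."""
    videos_dir = Path(videos_dir)
    labeled = Counter()
    unlabeled = 0
    for path in videos_dir.rglob("*"):
        if path.suffix.lower() not in exts:
            continue
        if path.parent == videos_dir:
            unlabeled += 1          # root-level file: unlabeled
        else:
            labeled[path.parent.name] += 1  # subfolder name is the category
    for category, n in sorted(labeled.items()):
        flag = "" if n >= min_per_category else "  <-- needs more examples"
        print(f"{category}: {n}{flag}")
    print(f"unlabeled: {unlabeled}")
    return labeled, unlabeled
```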
3. Verify Data Structure

Check your setup:
ls -R data/Favorites/videos/
Expected output:
data/Favorites/videos/:
123456789.mp4  987654321.mp4  soccer/  cooking/  funny/

data/Favorites/videos/soccer:
111111111.mp4  222222222.mp4

data/Favorites/videos/cooking:
333333333.mp4

data/Favorites/videos/funny:
444444444.mp4

Configuration

The scripts use hardcoded paths relative to the script location. If you need to customize paths, edit these constants:

extract_features.py:
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5              # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"   # CLIP model variant
WHISPER_MODEL = "base"    # Whisper model size (tiny/base/small/medium/large)
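N_FRAMES controls how many frames are sampled per video. One plausible way to space the samples evenly across a clip is shown below; this is an illustrative sketch, not necessarily the exact logic in extract_features.py:

```python
def sample_frame_indices(total_frames, n_frames=5):
    """Return n_frames evenly spaced frame indices across a video.

    Each index sits at the midpoint of its segment, so the samples
    cover the whole clip rather than clustering at the start.
    """
    if total_frames <= 0:
        return []
    n = min(n_frames, total_frames)
    step = total_frames / n
    return [int(i * step + step / 2) for i in range(n)]

print(sample_frame_indices(300, 5))  # [30, 90, 150, 210, 270]
```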
train.py, predict.py, server.py:
ARTIFACTS_DIR = Path(__file__).parent / "artifacts"
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
CLIP Models:
  • ViT-B/32 - Default, balanced speed/accuracy (512-d embeddings)
  • ViT-B/16 - Higher accuracy, slower
  • RN50 - ResNet-50 backbone alternative
Whisper Models:
  • tiny - Fastest, least accurate (~1GB VRAM)
  • base - Default, good balance (~1GB VRAM)
  • small - Better transcription (~2GB VRAM)
  • medium - High accuracy (~5GB VRAM)
  • large - Best quality (~10GB VRAM)
Larger models improve feature quality but increase extraction time significantly.
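As a rough rule of thumb, the VRAM figures above can be turned into a helper that picks the largest Whisper model that fits your GPU. The thresholds are illustrative estimates from this guide, not official requirements:

```python
def pick_whisper_model(vram_gb):
    """Pick the largest Whisper model whose rough VRAM need fits in vram_gb."""
    # (model, approximate VRAM needed in GB) from largest to smallest
    tiers = [("large", 10), ("medium", 5), ("small", 2), ("base", 1)]
    for model, need in tiers:
        if vram_gb >= need:
            return model
    return "tiny"

print(pick_whisper_model(6))    # medium
print(pick_whisper_model(0.5))  # tiny
```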

GPU Acceleration

The scripts automatically detect and use CUDA if available:
# Check if PyTorch can see your GPU
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"
Performance comparison (600 videos):
  • CPU: ~45-60 minutes for feature extraction
  • GPU (RTX 3060): ~8-12 minutes for feature extraction
Training is fast (~5-30 seconds) regardless of device since it only trains on extracted features.

Troubleshooting

If importing clip fails, remember that CLIP must be installed from GitHub, not PyPI:
pip install git+https://github.com/openai/CLIP.git
If audio extraction fails, ensure FFmpeg is in your PATH:
which ffmpeg  # macOS/Linux
where ffmpeg  # Windows
If not found, reinstall and restart your terminal.
If you run out of memory, reduce the batch size or use smaller models:
  • Switch Whisper from base to tiny
  • Use CPU instead: set device = "cpu" in scripts
  • Process videos in smaller batches
If importing cv2 fails, install OpenCV:
pip install opencv-python

Next Steps

With your environment set up, proceed to:

Feature Extraction

Extract multimodal embeddings from your video collection
