## Overview

This guide will walk you through installing dependencies, organizing your videos, extracting features, training a classifier, and generating your first predictions.

## Prerequisites

You'll need:

- Python 3.8 or higher
- A CUDA-compatible GPU (optional but recommended for faster processing)
- FFmpeg for audio extraction
- TikTok videos saved locally
## Installation

### Install FFmpeg

FFmpeg is required to extract audio from videos.

**macOS:**

```bash
brew install ffmpeg
```

**Ubuntu/Debian:**

```bash
sudo apt update && sudo apt install ffmpeg
```

**Windows:** Download a build from [ffmpeg.org](https://ffmpeg.org/download.html) and add it to your `PATH`.
### Install Python dependencies

Install the required Python packages:

```bash
pip install torch torchvision
pip install git+https://github.com/openai/CLIP.git
pip install openai-whisper
pip install opencv-python pillow
pip install scikit-learn numpy
pip install fastapi uvicorn pydantic
pip install tqdm
```
### Set up project structure

Create the following directory structure:

```bash
mkdir -p tiktok-sorter/data/Favorites/videos
mkdir -p tiktok-sorter/artifacts
cd tiktok-sorter
```

Your project should look like:

```
tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── artifacts/          # Model checkpoints, embeddings
└── data/
    └── Favorites/
        └── videos/     # Your TikTok videos go here
```
### Organize your videos

Place your TikTok videos in the `data/Favorites/videos/` directory. To create labeled training data, organize some videos into subfolders by category:

```
data/Favorites/videos/
├── soccer/              # Your category folders
│   ├── 1234567890.mp4
│   └── 9876543210.mp4
├── cooking/
│   └── 5555555555.mp4
├── funny/
│   └── 7777777777.mp4
└── 1111111111.mp4       # Unlabeled videos stay in root
```
Start with at least 20-30 videos per category for good results. The system handles class imbalance well, so you don’t need perfectly balanced data.
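The folder layout above is what defines your labels: subfolder name becomes the category, and files left in the root stay unlabeled. A minimal sketch of that discovery logic (the `discover_videos` helper is illustrative, not part of the shipped scripts):

```python
from pathlib import Path

def discover_videos(root):
    """Walk the videos directory: subfolder name = label, root-level files = unlabeled."""
    root = Path(root)
    labeled, unlabeled = [], []
    for path in sorted(root.rglob("*.mp4")):
        if path.parent == root:
            unlabeled.append(path)                    # stays unsorted until predicted
        else:
            labeled.append((path, path.parent.name))  # e.g. ("soccer/123.mp4", "soccer")
    return labeled, unlabeled
```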
## Extract Features

Now extract multimodal features from your videos using CLIP and Whisper:

```bash
python extract_features.py
```
This script:

- Samples 5 frames uniformly from each video
- Encodes frames with the CLIP (ViT-B/32) visual encoder
- Extracts audio and transcribes it with Whisper
- Encodes transcripts with the CLIP text encoder
- Combines visual and audio features into 1024-d vectors
From `extract_features.py`:

```python
import cv2
import numpy as np
import torch
from PIL import Image

# Sample frames uniformly from the video
def extract_visual_features(video_path, clip_model, preprocess, device, n_frames=5):
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, n_frames, dtype=int)
    embeddings = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue  # skip unreadable frames
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        img_input = preprocess(img).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = clip_model.encode_image(img_input)
        embeddings.append(emb.cpu())
    cap.release()
    # Average pool across frames → single 512-d vector
    stacked = torch.cat(embeddings, dim=0)
    return stacked.mean(dim=0)
```
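The audio half of the pipeline is not shown above. A sketch of how it might look (the function names and the silent-clip fallback are assumptions; the real `extract_features.py` may differ) — transcribe with Whisper, embed the transcript with CLIP's text encoder, then concatenate the two 512-d vectors into the 1024-d feature:

```python
import torch

def extract_audio_features(video_path, clip_model, whisper_model, device):
    """Transcribe audio with Whisper, then embed the transcript with CLIP's text encoder."""
    import clip  # lazy import: only needed on this path
    result = whisper_model.transcribe(str(video_path))      # Whisper invokes FFmpeg internally
    text = result["text"].strip() or "no speech"            # fallback for silent clips
    tokens = clip.tokenize(text, truncate=True).to(device)  # respect CLIP's 77-token limit
    with torch.no_grad():
        emb = clip_model.encode_text(tokens)
    return emb.squeeze(0).cpu()                             # 512-d text embedding

def fuse_features(visual_emb, audio_emb):
    """Concatenate the 512-d visual and 512-d text embeddings into one 1024-d vector."""
    return torch.cat([visual_emb, audio_emb], dim=0)
```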
**Processing time:** Expect ~10 minutes for 600 videos on a modern GPU, or ~30-45 minutes on CPU.
The script saves:

- `artifacts/labeled_embeddings.pt` - Features for videos in category folders
- `artifacts/unlabeled_embeddings.pt` - Features for videos in the root directory
- `artifacts/transcripts.json` - Whisper transcriptions for inspection
## Train the Classifier

Train a classifier on your labeled videos:

```bash
python train.py
```

The training script:

- Compares three approaches (k-NN, Logistic Regression, MLP)
- Uses stratified k-fold cross-validation
- Selects the best model based on validation accuracy
- Retrains on all labeled data
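The model-comparison step can be sketched with scikit-learn's `StratifiedKFold` (a simplified version: the `compare_models` name is illustrative, and the MLP candidate is omitted here for brevity):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def compare_models(X, y, n_splits=5, seed=42):
    """Score each candidate model with stratified k-fold CV; return mean accuracies."""
    candidates = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "logreg": LogisticRegression(max_iter=1000),
    }
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {name: [] for name in candidates}
    for train_idx, val_idx in skf.split(X, y):
        for name, model in candidates.items():
            model.fit(X[train_idx], y[train_idx])
            scores[name].append(model.score(X[val_idx], y[val_idx]))
    return {name: float(np.mean(s)) for name, s in scores.items()}
```

Stratification keeps every fold's class proportions close to the full dataset's, which matters when some categories have far fewer videos than others.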
### MLP Architecture

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```
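To handle the class imbalance mentioned earlier, the training loss can weight each class inversely to its frequency. A minimal sketch (the helper name and the exact weighting scheme are assumptions; `train.py` may weight differently):

```python
import torch
import torch.nn as nn

def class_weighted_loss(labels, num_classes):
    """Build a CrossEntropyLoss whose class weights are inverse to class frequency."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))  # rare classes get larger weights
    return nn.CrossEntropyLoss(weight=weights)

# Usage: criterion = class_weighted_loss(train_labels, num_classes)
```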
Expected output:

```
Loaded 213 samples, 8 classes: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']
Feature dimension: 1024
Using 5-fold stratified cross-validation

Fold 1: kNN=87.2%  LogReg=88.4%  MLP=90.7%
Fold 2: kNN=86.0%  LogReg=89.5%  MLP=91.9%
Fold 3: kNN=85.4%  LogReg=87.2%  MLP=89.5%
Fold 4: kNN=88.1%  LogReg=90.7%  MLP=93.0%
Fold 5: kNN=87.2%  LogReg=88.4%  MLP=90.7%

Cross-validation results (mean accuracy):
  knn:    86.8% (+/- 1.0%)
  logreg: 88.8% (+/- 1.3%)
  mlp:    91.2% (+/- 1.4%)

Best model: mlp (91.2%)
```
Training typically completes in under 30 seconds. The MLP usually outperforms k-NN and Logistic Regression on multimodal features.
The script saves:

- `artifacts/model.pt` or `artifacts/model.pkl` - Trained model weights
- `artifacts/model_config.json` - Model metadata and label mappings
## Generate Predictions

Predict folders for your unlabeled videos:

```bash
python predict.py
```

This generates predictions with confidence scores for all unsorted videos in the root directory. The script supports basic prediction, prediction with a confidence threshold (`--threshold`), and auto-sorting videos into their predicted folders.
Example output:

```
Predicting folders for 47 unsorted videos
Model: mlp | Categories: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']

[ASSIGN] 7234567890123456.mp4 → soccer (94%)   [soccer: 94% | gaming: 3% | tech: 2%]
[ASSIGN] 7234567890234567.mp4 → cooking (87%)  [cooking: 87% | travel: 8% | funny: 3%]
[SKIP  ] 7234567890345678.mp4 → funny (45%)    [funny: 45% | gaming: 32% | tech: 18%]
[ASSIGN] 7234567890456789.mp4 → quran (99%)    [quran: 99% | news: 1% | travel: 0%]

Summary:
  soccer  : 18 videos
  cooking : 12 videos
  gaming  : 8 videos
  quran   : 5 videos
  tech    : 3 videos
  travel  : 1 videos
  SKIPPED : 14 videos (below 0% threshold)
  TOTAL   : 47 videos
```
Use `--threshold` to auto-assign only videos where the model is confident. Videos below the threshold can be reviewed manually.

The script saves `artifacts/predictions.json` with detailed predictions for all videos.
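The ASSIGN/SKIP decision in the output above boils down to a single threshold check on the top class probability. A minimal sketch (the function name and default threshold are illustrative):

```python
def assign_or_skip(probs, labels, threshold=0.6):
    """Assign the top category if its confidence clears the threshold; otherwise flag for review."""
    top = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    status = "ASSIGN" if probs[top] >= threshold else "SKIP"
    return status, labels[top], probs[top]

# assign_or_skip([0.94, 0.03, 0.02], ["soccer", "gaming", "tech"])
# → ("ASSIGN", "soccer", 0.94)
```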
## Launch Interactive UI

Start the web interface for interactive labeling and active learning:

```bash
python server.py
```

Then open http://localhost:8000 in your browser.
The UI provides:

- A full-screen video player modeled after TikTok's interface
- Real-time model predictions with top-3 confidence scores
- Keyboard shortcuts (1-8) for rapid labeling
- Visual highlighting of the predicted folder
- One-click retraining that triggers the full pipeline
## Next Steps

- **Feature Extraction Deep Dive**: Learn how CLIP and Whisper work together to extract multimodal features
- **Training Configuration**: Customize the training pipeline and hyperparameters
- **Active Learning**: Improve accuracy through iterative labeling and retraining
- **Deployment**: Deploy as a browser extension or API service