
Overview

This guide will walk you through installing dependencies, organizing your videos, extracting features, training a classifier, and generating your first predictions.

Prerequisites

You’ll need:
  • Python 3.8 or higher
  • CUDA-compatible GPU (optional but recommended for faster processing)
  • FFmpeg for audio extraction
  • TikTok videos saved locally

Installation

1. Install FFmpeg

FFmpeg is required for audio extraction from videos.
brew install ffmpeg
On Linux, use your distribution's package manager instead (e.g. sudo apt install ffmpeg).
2. Install Python dependencies

Install the required Python packages:
pip install torch torchvision
pip install git+https://github.com/openai/CLIP.git
pip install openai-whisper
pip install opencv-python pillow
pip install scikit-learn numpy
pip install fastapi uvicorn pydantic
pip install tqdm
For GPU support, install PyTorch with CUDA following the official instructions.
3. Set up project structure

Create the following directory structure:
mkdir -p tiktok-sorter/data/Favorites/videos
mkdir -p tiktok-sorter/artifacts
cd tiktok-sorter
Your project should look like:
tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── artifacts/          # Model checkpoints, embeddings
└── data/
    └── Favorites/
        └── videos/      # Your TikTok videos go here
4. Organize your videos

Place your TikTok videos in the data/Favorites/videos/ directory. To create labeled training data, organize some videos into subfolders by category:
data/Favorites/videos/
├── soccer/           # Your category folders
│   ├── 1234567890.mp4
│   └── 9876543210.mp4
├── cooking/
│   └── 5555555555.mp4
├── funny/
│   └── 7777777777.mp4
└── 1111111111.mp4    # Unlabeled videos stay in root
Start with at least 20-30 videos per category for good results. The system handles class imbalance well, so you don’t need perfectly balanced data.
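The labeled/unlabeled split follows directly from this layout: anything inside a subfolder is labeled with that folder's name, and anything left in the root is unlabeled. A minimal sketch of that scan (the helper name `scan_videos` and the `.mp4`-only filter are illustrative, not part of the actual scripts):

```python
from pathlib import Path

def scan_videos(root):
    """Split videos under `root` into labeled (in a subfolder) and unlabeled (in root)."""
    root = Path(root)
    labeled, unlabeled = {}, []
    for path in root.rglob("*.mp4"):
        if path.parent == root:
            unlabeled.append(path)  # stays in root → no label yet
        else:
            # folder name doubles as the category label
            labeled.setdefault(path.parent.name, []).append(path)
    return labeled, unlabeled
```

With the tree above, `labeled["soccer"]` would hold the two soccer clips and `unlabeled` the single root-level video.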

Extract Features

Now extract multimodal features from your videos using CLIP and Whisper:
python extract_features.py
This script:
  1. Samples 5 frames uniformly from each video
  2. Encodes frames with CLIP (ViT-B/32) visual encoder
  3. Extracts audio and transcribes with Whisper
  4. Encodes transcripts with CLIP text encoder
  5. Combines visual and audio features into 1024-d vectors
import cv2
import numpy as np
import torch
from PIL import Image

# Sample frames uniformly from video
def extract_visual_features(video_path, clip_model, preprocess, device, n_frames=5):
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    indices = np.linspace(0, total_frames - 1, n_frames, dtype=int)
    embeddings = []

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if not ret:
            continue
        # OpenCV decodes to BGR; CLIP's preprocess expects an RGB PIL image
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        img_input = preprocess(img).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = clip_model.encode_image(img_input)
        embeddings.append(emb.cpu())
    cap.release()

    if not embeddings:
        raise ValueError(f"Could not read any frames from {video_path}")

    # Average pool across frames → single 512-d vector
    stacked = torch.cat(embeddings, dim=0)
    return stacked.mean(dim=0)
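The audio branch mirrors this: Whisper produces a transcript, CLIP's text encoder turns it into a second 512-d vector, and the two halves are concatenated into the final 1024-d feature. A sketch of those last two steps, with the tokenize-and-encode step passed in as a callable and both function names purely illustrative:

```python
import torch

def encode_transcript(text, encode_text, dim=512):
    """Encode a Whisper transcript into a CLIP-text embedding.

    `encode_text` is the tokenize+encode step (e.g. CLIP's text encoder)
    passed in as a callable; silent videos fall back to a zero vector.
    """
    text = text.strip()
    if not text:
        return torch.zeros(dim)  # no speech → neutral audio embedding
    with torch.no_grad():
        emb = encode_text(text)
    return emb.squeeze(0).cpu()

def fuse(visual_emb, audio_emb):
    # Concatenate 512-d visual + 512-d text features → 1024-d multimodal vector
    return torch.cat([visual_emb, audio_emb], dim=0)
```

The zero-vector fallback is one reasonable choice for videos with no speech; it keeps the feature dimension fixed without injecting spurious text content.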
Processing time: Expect ~10 minutes for 600 videos on a modern GPU, or ~30-45 minutes on CPU.
The script saves:
  • artifacts/labeled_embeddings.pt - Features for videos in category folders
  • artifacts/unlabeled_embeddings.pt - Features for videos in root directory
  • artifacts/transcripts.json - Whisper transcriptions for inspection

Train the Classifier

Train a classifier on your labeled videos:
python train.py
The training script:
  1. Compares three approaches (k-NN, Logistic Regression, MLP)
  2. Uses stratified k-fold cross-validation
  3. Selects the best model based on validation accuracy
  4. Retrains on all labeled data
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        return self.net(x)
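The comparison-and-selection step can be sketched with scikit-learn's StratifiedKFold. This is a simplified illustration (the MLP branch is omitted for brevity, and `compare_models` is an assumed name, not the actual train.py API):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def compare_models(X, y, n_splits=5):
    """Return (best model name, mean CV accuracy per model)."""
    # Stratified folds keep per-class proportions stable across splits
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    models = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "logreg": LogisticRegression(max_iter=1000),
    }
    scores = {name: cross_val_score(m, X, y, cv=cv).mean()
              for name, m in models.items()}
    return max(scores, key=scores.get), scores
```

Stratification matters here because the category folders can be imbalanced; plain k-fold could leave a small class entirely out of a training fold.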
Expected output:
Loaded 213 samples, 8 classes: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']
Feature dimension: 1024

Using 5-fold stratified cross-validation
  Fold 1: kNN=87.2%  LogReg=88.4%  MLP=90.7%
  Fold 2: kNN=86.0%  LogReg=89.5%  MLP=91.9%
  Fold 3: kNN=85.4%  LogReg=87.2%  MLP=89.5%
  Fold 4: kNN=88.1%  LogReg=90.7%  MLP=93.0%
  Fold 5: kNN=87.2%  LogReg=88.4%  MLP=90.7%

Cross-validation results (mean accuracy):
      knn: 86.8% (+/- 1.0%)
   logreg: 88.8% (+/- 1.3%)
      mlp: 91.2% (+/- 1.4%)

Best model: mlp (91.2%)
Training typically completes in under 30 seconds. The MLP usually outperforms k-NN and Logistic Regression on multimodal features.
The script saves:
  • artifacts/model.pt or artifacts/model.pkl - Trained model weights
  • artifacts/model_config.json - Model metadata and label mappings

Generate Predictions

Predict folders for your unlabeled videos:
python predict.py
This generates predictions with confidence scores for all unsorted videos in the root directory.
Example output:
Predicting folders for 47 unsorted videos
Model: mlp | Categories: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']

  [ASSIGN] 7234567890123456.mp4 → soccer (94%)  [soccer: 94% | gaming: 3% | tech: 2%]
  [ASSIGN] 7234567890234567.mp4 → cooking (87%)  [cooking: 87% | travel: 8% | funny: 3%]
  [SKIP  ] 7234567890345678.mp4 → funny (45%)  [funny: 45% | gaming: 32% | tech: 18%]
  [ASSIGN] 7234567890456789.mp4 → quran (99%)  [quran: 99% | news: 1% | travel: 0%]

Summary:
  soccer         :   18 videos
  cooking        :   12 videos
  gaming         :    8 videos
  quran          :    5 videos
  tech           :    3 videos
  travel         :    1 videos
  SKIPPED        :   14 videos (below confidence threshold)
  TOTAL          :   47 videos
Use --threshold to auto-assign only videos where the model is confident; videos below the threshold are left unsorted for manual review.
The script saves artifacts/predictions.json with detailed predictions for all videos.
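The ASSIGN/SKIP decision above reduces to an argmax-plus-threshold rule. A sketch (the function name and the 0.8 default are illustrative, not the script's actual default):

```python
import numpy as np

def assign_or_skip(probs, labels, threshold=0.8):
    """Return the predicted label if its probability clears the threshold, else None."""
    top = int(np.argmax(probs))
    if probs[top] >= threshold:
        return labels[top]  # confident → auto-assign to this folder
    return None             # below threshold → leave for manual review
```

For the sample output above, a 0.8 threshold would assign the 94% soccer and 99% quran predictions and skip the 45% funny one.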

Launch Interactive UI

Start the web interface for interactive labeling and active learning:
python server.py
Then open http://localhost:8000 in your browser. The UI provides:
  • Full-screen video player modeled after TikTok’s interface
  • Real-time model predictions with top-3 confidence scores
  • Keyboard shortcuts (1-8) for rapid labeling
  • Visual highlighting of predicted folder
  • One-click retraining that triggers the full pipeline
Interactive UI Preview

Next Steps

Feature Extraction Deep Dive

Learn how CLIP and Whisper work together to extract multimodal features

Training Configuration

Customize the training pipeline and hyperparameters

Active Learning

Improve accuracy through iterative labeling and retraining

Deployment

Deploy as a browser extension or API service
