What is TikTok Auto Collection Sorter?
A multimodal machine learning system that automatically categorizes TikTok videos into your personal folders by analyzing visual and audio content. The system achieves ~90% accuracy on personal video collections using transfer learning from foundation models.

The Problem
TikTok’s folder organization requires three taps (save → view folders → select folder), creating enough friction that most users abandon the feature. This project solves that by predicting the correct folder at save time, reducing the flow to a single confirmation tap.

The Solution
By extracting multimodal features from videos and training a lightweight classifier, the system learns your personal organizational taxonomy. Each folder category has signals consistent enough across the visual, audio, and speech modalities that a classifier can accurately predict where videos belong.

Key Features
Multimodal Analysis
Combines visual features (CLIP) and audio transcription (Whisper) for robust classification
90% Accuracy
Achieves ~90% accuracy on 213 labeled videos across 8 categories with minimal training data
Fast Training
Trains in seconds using transfer learning from foundation models (CLIP + Whisper)
Interactive UI
Full-screen video player with real-time predictions and keyboard shortcuts for rapid labeling
How It Works
The system uses a three-stage pipeline:
- Feature Extraction: Sample 5 frames from each video → encode with CLIP → extract audio → transcribe with Whisper → combine into 1024-d vector
- Training: Train a lightweight MLP classifier on labeled examples with class-weighted loss for imbalanced data
- Prediction: Generate folder predictions with confidence scores for unlabeled videos
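The prediction stage above boils down to turning classifier logits into a folder name plus a confidence score via a softmax. A minimal sketch (the function and folder names here are illustrative, not from the project's code):

```python
import numpy as np

def predict_with_confidence(logits, folder_names):
    """Map raw classifier logits to a (folder, confidence) prediction."""
    # Softmax converts logits into a probability distribution over folders;
    # subtracting the max first keeps the exponentials numerically stable.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return folder_names[best], float(probs[best])

folders = ["Recipes", "Quran", "Fitness"]
label, conf = predict_with_confidence(np.array([0.2, 3.1, -1.0]), folders)
# label == "Quran"; conf is the softmax probability assigned to that folder
```

Low-confidence predictions can then be routed to the labeling UI instead of being auto-filed.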
Architecture Overview
Multimodal Feature Extraction
- Visual Features: Sample 5 frames uniformly from each video → encode with CLIP (ViT-B/32) → average-pool to 512-d vector
- Audio Features: Extract audio track → transcribe with Whisper → encode transcript with CLIP’s text encoder → 512-d vector
- Combined Representation: Concatenate both modalities → 1024-d vector, L2-normalized per modality to prevent dominance
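The combined-representation step can be sketched in a few lines of NumPy: each 512-d modality vector is L2-normalized independently before concatenation, so neither modality's raw magnitudes dominate the 1024-d result (function name and `eps` guard are illustrative):

```python
import numpy as np

def combine_features(visual_512, text_512, eps=1e-8):
    """L2-normalize each modality separately, then concatenate to 1024-d.

    Per-modality normalization keeps a modality with larger raw feature
    magnitudes from dominating the concatenated vector.
    """
    v = visual_512 / (np.linalg.norm(visual_512) + eps)
    t = text_512 / (np.linalg.norm(text_512) + eps)
    return np.concatenate([v, t])  # shape (1024,)

rng = np.random.default_rng(0)
feat = combine_features(rng.normal(size=512), rng.normal(size=512))
```

In the real pipeline the first argument would be the average-pooled CLIP image embeddings and the second the CLIP text embedding of the Whisper transcript.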
Classification Pipeline
- Two-layer MLP (256 → 128 → N classes)
- Class-weighted cross-entropy loss to handle imbalanced data
- Trains in seconds; feature extraction (~10 min for 600 videos) is the bottleneck
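The classifier described above is small enough to sketch in full. This is an assumed reconstruction from the stated dimensions (1024-d input, 256 → 128 hidden units, class-weighted cross-entropy), not the project's actual code; the class name and inverse-frequency weighting scheme are illustrative:

```python
import torch
import torch.nn as nn

class FolderClassifier(nn.Module):
    """Lightweight MLP head over the 1024-d multimodal features."""
    def __init__(self, n_classes, in_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Class-weighted cross-entropy: rare folders get proportionally larger weight.
labels = torch.tensor([0, 0, 0, 1, 2])           # toy imbalanced label set
counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
loss_fn = nn.CrossEntropyLoss(weight=weights)

model = FolderClassifier(n_classes=3)
logits = model(torch.randn(5, 1024))             # batch of 5 feature vectors
loss = loss_fn(logits, labels)
```

With only ~200 examples and a head this small, a few hundred gradient steps finish in seconds on CPU, which is why feature extraction dominates total runtime.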
Tech Stack
ML/AI
PyTorch, CLIP (OpenAI), Whisper (OpenAI)
Backend
FastAPI with automatic retraining API
Frontend
Vanilla HTML/CSS/JavaScript (single file)
Language
Python end-to-end (~900 lines total)
Get Started
Quickstart
Get up and running in minutes with installation and first prediction
Feature Extraction
Learn how CLIP and Whisper extract multimodal features from videos
Training
Understand the classification pipeline and model architecture
Interactive UI
Explore the labeling interface and active learning workflow
Performance Highlights
- Accuracy: ~90% on 213 labeled videos across 8 categories
- Initial Performance: 93.8% cross-validation accuracy on 128 videos (6 categories)
- Per-Category Performance: Categories with strong audiovisual signatures (e.g., Quran recitation with distinct visual framing and Arabic speech) achieve near-perfect recall
- Data Efficiency: Strong performance with relatively small labeled datasets
Why This Approach?
Instead of training a video model from scratch, this project leverages pretrained foundation models (CLIP, Whisper) for feature extraction and trains a lightweight classifier on top. This provides:
- Efficiency: Fast training and inference
- Effectiveness: Strong performance with minimal labeled data
- Transferability: Foundation models capture rich semantic representations