
What is TikTok Auto Collection Sorter?

A multimodal machine learning system that automatically categorizes TikTok videos into your personal folders by analyzing visual and audio content. The system achieves ~90% accuracy on personal video collections using transfer learning from foundation models.

The Problem

TikTok’s folder organization requires three taps (save → view folders → select folder), creating enough friction that most users abandon the feature. This project solves that by predicting the correct folder at save time, reducing the flow to a single confirmation tap.

The Solution

By extracting multimodal features from videos and training a lightweight classifier, the system learns your personal organizational taxonomy. Each folder category has consistent enough signals across visual, audio, and speech modalities that a classifier can accurately predict where videos belong.

Key Features

Multimodal Analysis

Combines visual features (CLIP) and audio transcription (Whisper) for robust classification

90% Accuracy

Achieves ~90% accuracy on 213 labeled videos across 8 categories with minimal training data

Fast Training

Trains in seconds using transfer learning from foundation models (CLIP + Whisper)

Interactive UI

Full-screen video player with real-time predictions and keyboard shortcuts for rapid labeling

How It Works

The system uses a three-stage pipeline:
  1. Feature Extraction: Sample 5 frames from each video → encode with CLIP → extract audio → transcribe with Whisper → combine into 1024-d vector
  2. Training: Train a lightweight MLP classifier on labeled examples with class-weighted loss for imbalanced data
  3. Prediction: Generate folder predictions with confidence scores for unlabeled videos
import torch

# Combined representation: [visual_512 | audio_512] = 1024-d
# L2-normalize each modality separately so neither dominates the other
vis_emb = vis_emb / (vis_emb.norm() + 1e-8)
audio_emb = audio_emb / (audio_emb.norm() + 1e-8)
combined = torch.cat([vis_emb, audio_emb], dim=0)
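Stage 3 turns the classifier's raw logits into a folder prediction with a confidence score via a softmax. A minimal sketch of that step; the logit values and folder names below are made up for illustration and do not come from the project:

```python
import torch

# Hypothetical logits from the trained classifier for one video
# over 8 folders (values invented for this example).
logits = torch.tensor([2.1, 0.3, -1.0, 0.5, 3.4, -0.2, 0.0, 1.1])

# Softmax converts logits into a probability distribution; the top
# probability serves as the confidence score shown in the UI.
probs = torch.softmax(logits, dim=0)
conf, idx = probs.max(dim=0)
print(f"predicted folder index: {idx.item()} (confidence {conf.item():.2f})")
```

Low-confidence predictions are natural candidates to surface for manual labeling, which is how an active-learning loop over the remaining unlabeled videos can be driven.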

Architecture Overview

Multimodal Feature Extraction

  • Visual Features: Sample 5 frames uniformly from each video → encode with CLIP (ViT-B/32) → average-pool to 512-d vector
  • Audio Features: Extract audio track → transcribe with Whisper → encode transcript with CLIP’s text encoder → 512-d vector
  • Combined Representation: Concatenate both modalities → 1024-d vector, L2-normalized per modality to prevent dominance
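The sampling and fusion logic above can be sketched in a few lines. This is an illustrative reconstruction, not the repo's code: the helper names (`sample_indices`, `combine`) are hypothetical, and the stand-in random tensors take the place of real CLIP image/text embeddings:

```python
import torch

def sample_indices(n_frames: int, k: int = 5) -> list[int]:
    """Uniformly spaced frame indices across a video with n_frames frames."""
    step = n_frames / k
    return [int(i * step + step / 2) for i in range(k)]

def combine(vis_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """L2-normalize each 512-d modality, then concatenate to 1024-d."""
    vis_emb = vis_emb / (vis_emb.norm() + 1e-8)
    audio_emb = audio_emb / (audio_emb.norm() + 1e-8)
    return torch.cat([vis_emb, audio_emb], dim=0)

# Stand-ins: in the real pipeline, vis comes from CLIP's image encoder
# averaged over the 5 sampled frames, and aud from CLIP's text encoder
# applied to the Whisper transcript.
vis = torch.randn(512)
aud = torch.randn(512)
feat = combine(vis, aud)
print(feat.shape)  # torch.Size([1024])
```

Per-modality normalization matters here: without it, whichever encoder happens to produce larger-magnitude embeddings would dominate distances in the concatenated space.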

Classification Pipeline

  • MLP with two hidden layers (1024-d input → 256 → 128 → N classes)
  • Class-weighted cross-entropy loss to handle imbalanced data
  • Trains in seconds; feature extraction (~10 min for 600 videos) is the bottleneck
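A minimal sketch of this pipeline in PyTorch. The layer sizes follow the description above; the label counts, variable names, and single toy training step are illustrative assumptions, not the project's actual training loop:

```python
import torch
import torch.nn as nn

N_CLASSES = 8  # number of folder categories (matches the doc's example)

# Two hidden layers on top of the 1024-d multimodal feature vector.
model = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, N_CLASSES),
)

# Class-weighted cross-entropy: inverse-frequency weights so rare folders
# are not drowned out by the most common category. Counts are made up.
counts = torch.tensor([80.0, 40.0, 30.0, 20.0, 15.0, 12.0, 10.0, 6.0])
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step on random features/labels to show the training shape.
x = torch.randn(16, 1024)
y = torch.randint(0, N_CLASSES, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the features are precomputed 1024-d vectors, each epoch is just a few matrix multiplies over a few hundred rows, which is why training completes in seconds while feature extraction dominates the runtime.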

Tech Stack

ML/AI

PyTorch, CLIP (OpenAI), Whisper (OpenAI)

Backend

FastAPI with automatic retraining API

Frontend

Vanilla HTML/CSS/JavaScript (single file)

Language

Python end-to-end (~900 lines total)

Get Started

Quickstart

Get up and running in minutes with installation and first prediction

Feature Extraction

Learn how CLIP and Whisper extract multimodal features from videos

Training

Understand the classification pipeline and model architecture

Interactive UI

Explore the labeling interface and active learning workflow

Performance Highlights

  • Accuracy: ~90% on 213 labeled videos across 8 categories
  • Initial Performance: 93.8% cross-validation accuracy on 128 videos (6 categories)
  • Per-Category Performance: Categories with strong audiovisual signatures (e.g., Quran recitation with distinct visual framing and Arabic speech) achieve near-perfect recall
  • Data Efficiency: Strong performance with relatively small labeled datasets

Why This Approach?

Instead of training a video model from scratch, this project leverages pretrained foundation models (CLIP, Whisper) for feature extraction and trains a lightweight classifier on top. This provides:
  • Efficiency: Fast training and inference
  • Effectiveness: Strong performance with minimal labeled data
  • Transferability: Foundation models capture rich semantic representations
