
What is TikTok Auto Collection Sorter?

A multimodal machine learning system that automatically categorizes TikTok videos into your personal folders by analyzing visual and audio content. The system achieves ~90% accuracy on personal video collections using transfer learning from foundation models.

The Problem

TikTok’s folder organization requires three taps (save → view folders → select folder), creating enough friction that most users abandon the feature. This project solves that by predicting the correct folder at save time, reducing the flow to a single confirmation tap.

The Solution

By extracting multimodal features from videos and training a lightweight classifier, the system learns your personal organizational taxonomy. Each folder category has consistent enough signals across visual, audio, and speech modalities that a classifier can accurately predict where videos belong.

Key Features

Multimodal Analysis

Combines visual features (CLIP) and audio transcription (Whisper) for robust classification

90% Accuracy

Achieves ~90% accuracy on 213 labeled videos across 8 categories with minimal training data

Fast Training

Trains in seconds using transfer learning from foundation models (CLIP + Whisper)

Interactive UI

Full-screen video player with real-time predictions and keyboard shortcuts for rapid labeling

How It Works

The system uses a three-stage pipeline:
  1. Feature Extraction: Sample 5 frames from each video → encode with CLIP → extract audio → transcribe with Whisper → combine into 1024-d vector
  2. Training: Train a lightweight MLP classifier on labeled examples with class-weighted loss for imbalanced data
  3. Prediction: Generate folder predictions with confidence scores for unlabeled videos
import torch

# Combined representation: [visual_512 | audio_512] = 1024-d
# L2-normalize each modality separately so neither dominates the other
vis_emb = vis_emb / (vis_emb.norm() + 1e-8)
audio_emb = audio_emb / (audio_emb.norm() + 1e-8)
combined = torch.cat([vis_emb, audio_emb], dim=0)
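Stage 3 turns the classifier's raw logits into a folder prediction with a confidence score via a softmax. A minimal sketch of that step; the logit values and folder names below are made up for illustration and do not come from the project:

```python
import torch

# Hypothetical logits from the trained classifier for one video
# over 8 folders (values invented for this example).
logits = torch.tensor([2.1, 0.3, -1.0, 0.5, 3.4, -0.2, 0.0, 1.1])

# Softmax converts logits into a probability distribution; the top
# probability serves as the confidence score shown in the UI.
probs = torch.softmax(logits, dim=0)
conf, idx = probs.max(dim=0)
print(f"predicted folder index: {idx.item()} (confidence {conf.item():.2f})")
```

Low-confidence predictions are natural candidates to surface for manual labeling, which is how an active-learning loop over the remaining unlabeled videos can be driven.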

Architecture Overview

Multimodal Feature Extraction

  • Visual Features: Sample 5 frames uniformly from each video → encode with CLIP (ViT-B/32) → average-pool to 512-d vector
  • Audio Features: Extract audio track → transcribe with Whisper → encode transcript with CLIP’s text encoder → 512-d vector
  • Combined Representation: Concatenate both modalities → 1024-d vector, L2-normalized per modality to prevent dominance
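The sampling and fusion logic above can be sketched in a few lines. This is an illustrative reconstruction, not the repo's code: the helper names (`sample_indices`, `combine`) are hypothetical, and the stand-in random tensors take the place of real CLIP image/text embeddings:

```python
import torch

def sample_indices(n_frames: int, k: int = 5) -> list[int]:
    """Uniformly spaced frame indices across a video with n_frames frames."""
    step = n_frames / k
    return [int(i * step + step / 2) for i in range(k)]

def combine(vis_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """L2-normalize each 512-d modality, then concatenate to 1024-d."""
    vis_emb = vis_emb / (vis_emb.norm() + 1e-8)
    audio_emb = audio_emb / (audio_emb.norm() + 1e-8)
    return torch.cat([vis_emb, audio_emb], dim=0)

# Stand-ins: in the real pipeline, vis comes from CLIP's image encoder
# averaged over the 5 sampled frames, and aud from CLIP's text encoder
# applied to the Whisper transcript.
vis = torch.randn(512)
aud = torch.randn(512)
feat = combine(vis, aud)
print(feat.shape)  # torch.Size([1024])
```

Per-modality normalization matters here: without it, whichever encoder happens to produce larger-magnitude embeddings would dominate distances in the concatenated space.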

Classification Pipeline

  • MLP with two hidden layers (1024-d input → 256 → 128 → N classes)
  • Class-weighted cross-entropy loss to handle imbalanced data
  • Trains in seconds; feature extraction (~10 min for 600 videos) is the bottleneck
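A minimal sketch of this pipeline in PyTorch. The layer sizes follow the description above; the label counts, variable names, and single toy training step are illustrative assumptions, not the project's actual training loop:

```python
import torch
import torch.nn as nn

N_CLASSES = 8  # number of folder categories (matches the doc's example)

# Two hidden layers on top of the 1024-d multimodal feature vector.
model = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, N_CLASSES),
)

# Class-weighted cross-entropy: inverse-frequency weights so rare folders
# are not drowned out by the most common category. Counts are made up.
counts = torch.tensor([80.0, 40.0, 30.0, 20.0, 15.0, 12.0, 10.0, 6.0])
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step on random features/labels to show the training shape.
x = torch.randn(16, 1024)
y = torch.randint(0, N_CLASSES, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the features are precomputed 1024-d vectors, each epoch is just a few matrix multiplies over a few hundred rows, which is why training completes in seconds while feature extraction dominates the runtime.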

Tech Stack

ML/AI

PyTorch, CLIP (OpenAI), Whisper (OpenAI)

Backend

FastAPI with automatic retraining API

Frontend

Vanilla HTML/CSS/JavaScript (single file)

Language

Python end-to-end (~900 lines total)

Get Started

Quickstart

Get up and running in minutes with installation and first prediction

Feature Extraction

Learn how CLIP and Whisper extract multimodal features from videos

Training

Understand the classification pipeline and model architecture

Interactive UI

Explore the labeling interface and active learning workflow

Performance Highlights

  • Accuracy: ~90% on 213 labeled videos across 8 categories
  • Initial Performance: 93.8% cross-validation accuracy on 128 videos (6 categories)
  • Per-Category Performance: Categories with strong audiovisual signatures (e.g., Quran recitation with distinct visual framing and Arabic speech) achieve near-perfect recall
  • Data Efficiency: Strong performance with relatively small labeled datasets

Why This Approach?

Instead of training a video model from scratch, this project leverages pretrained foundation models (CLIP, Whisper) for feature extraction and trains a lightweight classifier on top. This provides:
  • Efficiency: Fast training and inference
  • Effectiveness: Strong performance with minimal labeled data
  • Transferability: Foundation models capture rich semantic representations
