Overview
The TikTok Auto Collection Sorter is a multimodal AI system that automatically categorizes TikTok videos into user-defined folders. The system combines visual content analysis (CLIP) and audio transcription (Whisper) to create rich feature representations, then trains a classifier to predict appropriate categories.
Architecture Components
The system consists of three main stages:
1. Feature Extraction (extract_features.py)
Transforms raw videos into fixed-dimensional embeddings by combining visual and audio modalities.
Key Parameters:
- CLIP_MODEL: "ViT-B/32" - Vision Transformer backbone
- WHISPER_MODEL: "base" - Audio transcription model
- N_FRAMES: 5 - Number of frames sampled per video
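The parameters above imply evenly spaced frame sampling. A minimal sketch of the index math for N_FRAMES=5 (decoding the frames themselves would use a video library such as OpenCV, which is assumed here and not shown):

```python
N_FRAMES = 5  # matches the N_FRAMES parameter above

def frame_indices(total_frames: int, n: int = N_FRAMES) -> list[int]:
    """Evenly spaced frame indices across the video, endpoints included."""
    if total_frames <= n:
        return list(range(total_frames))
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]

indices = frame_indices(300)  # e.g. a 10-second clip at 30 fps
```

Each sampled frame is then encoded with CLIP, and the per-frame embeddings are pooled into one visual vector.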
2. Model Training (train.py)
Trains multiple classifier architectures using cross-validation and selects the best performer.
Models Evaluated:
- k-Nearest Neighbors (k=5, cosine distance)
- Logistic Regression (L2 regularization, balanced class weights)
- Multi-Layer Perceptron (2 hidden layers)
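The selection loop can be sketched with scikit-learn; hyperparameters here mirror the list above, but the synthetic data and exact settings (fold count, MLP layer sizes) are illustrative assumptions, not the tool's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in for the fused video embeddings and folder labels.
X, y = make_classification(n_samples=120, n_features=32, n_classes=3,
                           n_informative=8, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                         random_state=0),
}

# Mean cross-validation accuracy per candidate; keep the best performer.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
```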
3. Prediction (predict.py)
Applies the trained model to unsorted videos, providing confidence scores and optional automatic file organization.
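A sketch of the confidence-gated organization step; the threshold value and function names are assumptions for illustration, not the script's actual API:

```python
import shutil
from pathlib import Path

CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff; the real default may differ

def organize(video: Path, probs: dict[str, float],
             dest_root: Path, move: bool = False) -> tuple[str, float]:
    """Pick the top-probability category; move the file only if enabled
    and the confidence clears the cutoff."""
    category, confidence = max(probs.items(), key=lambda kv: kv[1])
    if move and confidence >= CONFIDENCE_THRESHOLD:
        target = dest_root / category
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(video), str(target / video.name))
    return category, confidence

# Dry run: report the prediction without touching the file.
category, confidence = organize(Path("clip.mp4"),
                                {"cooking": 0.2, "fitness": 0.8},
                                Path("sorted"), move=False)
```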
Data Flow
Directory Structure
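Based on the Dataset Discovery rules below, a hypothetical layout looks like this (file and folder names are illustrative):

```
videos/
├── cooking/          # labeled: folder name = category
│   └── clip_001.mp4
├── fitness/
│   └── clip_002.mp4
└── clip_099.mp4      # unlabeled: sits in the root videos/ directory
```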
Key Design Decisions
Multimodal Fusion Strategy
The system uses late fusion by concatenating normalized embeddings from each modality. Both visual and audio embeddings are L2-normalized before concatenation to ensure balanced contributions.
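The fusion step described above can be sketched in a few lines (embedding dimensions are illustrative; the epsilon guard is an assumption to avoid division by zero):

```python
import numpy as np

def fuse(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Late fusion: L2-normalize each modality, then concatenate."""
    v = visual / (np.linalg.norm(visual) + 1e-8)
    a = audio / (np.linalg.norm(audio) + 1e-8)
    return np.concatenate([v, a])

# e.g. a 512-d visual embedding fused with a 512-d audio/text embedding
fused = fuse(np.ones(512), np.ones(512))
```

Because each half has unit norm, neither modality can dominate the classifier purely through embedding scale.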
Graceful Degradation
Dataset Discovery
The system automatically discovers:
- Labeled data: Videos in subdirectories (folder name = category label)
- Unlabeled data: Videos in the root videos/ directory
- Category names: Automatically extracted from folder names
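The discovery rules above amount to a short directory scan; a sketch with pathlib (the recognized extensions are assumptions, not the tool's documented list):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".webm"}  # assumed set of video extensions

def discover(root: Path) -> tuple[dict[str, list[Path]], list[Path]]:
    """Split videos under `root` into labeled (in subfolders, keyed by
    folder name) and unlabeled (directly in the root)."""
    labeled: dict[str, list[Path]] = {}
    unlabeled: list[Path] = []
    for p in root.rglob("*"):
        if p.suffix.lower() not in VIDEO_EXTS:
            continue
        if p.parent == root:
            unlabeled.append(p)  # sits directly in videos/
        else:
            labeled.setdefault(p.parent.name, []).append(p)
    return labeled, unlabeled
```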
Processing Pipeline
- Discovery Phase: Scan directory structure to identify labeled and unlabeled videos
- Feature Extraction: Process all videos through CLIP + Whisper pipeline
- Training Phase: Cross-validation with multiple model architectures
- Model Selection: Choose best performer based on CV accuracy
- Prediction Phase: Generate category predictions with confidence scores
- Organization (optional): Automatically move files to predicted folders
Performance Considerations
GPU Acceleration: Both CLIP and Whisper support CUDA acceleration. The system automatically detects available GPUs and falls back to CPU if necessary.
Batch Processing: Feature extraction processes videos sequentially, but model training uses mini-batch gradient descent (batch_size=32).
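The device detection described above is the standard PyTorch pattern; a minimal sketch:

```python
import torch

# Prefer CUDA when a GPU is present; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical usage: allocate a batch of fused embeddings on that device
# (batch_size=32 and the 1024-d fused dimension are illustrative).
batch = torch.randn(32, 1024, device=device)
```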
Output Artifacts
| File | Description | Format |
|---|---|---|
| labeled_embeddings.pt | Training data features | PyTorch tensor dict |
| unlabeled_embeddings.pt | Unsorted video features | PyTorch tensor dict |
| model.pt / model.pkl | Trained classifier weights | PyTorch / Pickle |
| model_config.json | Model metadata & hyperparameters | JSON |
| predictions.json | Prediction results with confidence | JSON |
| transcripts.json | Audio transcriptions for inspection | JSON |
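As an example of consuming these artifacts, predictions.json might be read back to flag low-confidence results for manual review. The record fields and the 0.6 cutoff below are illustrative assumptions, not the tool's documented schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record shape for predictions.json.
records = [
    {"video": "videos/clip_001.mp4", "category": "cooking", "confidence": 0.91},
    {"video": "videos/clip_002.mp4", "category": "fitness", "confidence": 0.47},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "predictions.json"
    path.write_text(json.dumps(records, indent=2))
    # Keep only predictions below the assumed confidence cutoff.
    review = [r for r in json.loads(path.read_text())
              if r["confidence"] < 0.6]
```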
Next Steps
- Learn about Multimodal Features to understand CLIP and Whisper integration
- Explore the Training Pipeline for model architecture and optimization details