Overview

The TikTok Auto Collection Sorter is a multimodal AI system that automatically categorizes TikTok videos into user-defined folders. The system combines visual content analysis (CLIP) and audio transcription (Whisper) to create rich feature representations, then trains a classifier to predict appropriate categories.

Architecture Components

The system consists of three main stages:

1. Feature Extraction (extract_features.py)

Transforms raw videos into fixed-dimensional embeddings by combining visual and audio modalities.

Key Parameters:
  • CLIP_MODEL: "ViT-B/32" - Vision Transformer backbone
  • WHISPER_MODEL: "base" - Audio transcription model
  • N_FRAMES: 5 - Number of frames sampled per video
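Frame sampling can be done in several ways; a minimal sketch, assuming frames are drawn at evenly spaced positions across the video (the function name and signature are illustrative, not taken from extract_features.py):

```python
import numpy as np

N_FRAMES = 5  # number of frames sampled per video, matching the parameter above

def sample_frame_indices(total_frames: int, n_frames: int = N_FRAMES) -> list[int]:
    """Pick n_frames indices spread evenly across the video."""
    if total_frames <= n_frames:
        # Short clip: use every frame we have
        return list(range(total_frames))
    # np.linspace yields evenly spaced positions; round them to frame indices
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int).tolist()
```

Each sampled frame would then be encoded with the ViT-B/32 CLIP image encoder and the per-frame embeddings averaged into a single visual vector.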

2. Model Training (train.py)

Trains multiple classifier architectures using cross-validation and selects the best performer.

Models Evaluated:
  • k-Nearest Neighbors (k=5, cosine distance)
  • Logistic Regression (L2 regularization, balanced class weights)
  • Multi-Layer Perceptron (2 hidden layers)
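The selection step can be sketched with scikit-learn; the toy embeddings, hidden-layer sizes, and fold count below are illustrative stand-ins (train.py's exact hyperparameters are only partially listed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the fused CLIP+Whisper embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))
y = rng.integers(0, 2, size=60)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}

# Mean cross-validation accuracy per model; the best one is kept
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
best = max(scores, key=scores.get)
```

On random labels like these the scores hover near chance; on real labeled embeddings the gap between models is what drives the selection.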

3. Prediction (predict.py)

Applies the trained model to unsorted videos, providing confidence scores and optional automatic file organization.
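Turning class probabilities into a labeled prediction with a confidence score might look like the following sketch; the threshold parameter is a hypothetical addition, not a documented option of predict.py:

```python
import numpy as np

def predict_with_confidence(probs: np.ndarray, categories: list[str],
                            threshold: float = 0.5):
    """Map each row of class probabilities to (category, confidence).

    Returns None as the category when the top probability falls below
    the (hypothetical) confidence threshold, leaving the video unsorted.
    """
    results = []
    for row in probs:
        i = int(row.argmax())
        conf = float(row[i])
        results.append((categories[i] if conf >= threshold else None, conf))
    return results
```

Videos left as None could stay in the root videos/ directory for manual review instead of being moved automatically.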

Data Flow

Directory Structure

workspace/source/
├── data/Favorites/videos/
│   ├── soccer/              # Labeled videos for training
│   │   └── *.mp4
│   ├── funny/
│   │   └── *.mp4
│   └── *.mp4                # Unlabeled videos to sort
├── artifacts/               # Generated outputs
│   ├── labeled_embeddings.pt
│   ├── unlabeled_embeddings.pt
│   ├── model.pt / model.pkl
│   ├── model_config.json
│   ├── predictions.json
│   └── transcripts.json
├── extract_features.py
├── train.py
└── predict.py

Key Design Decisions

Multimodal Fusion Strategy

The system uses late fusion by concatenating normalized embeddings from each modality. Both visual and audio embeddings are L2-normalized before concatenation to ensure balanced contributions.
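A minimal sketch of this late-fusion step (the helper name is illustrative):

```python
import numpy as np

def fuse(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """L2-normalize each modality, then concatenate (late fusion)."""
    def l2norm(v: np.ndarray) -> np.ndarray:
        n = np.linalg.norm(v)
        return v / n if n > 0 else v  # leave zero vectors untouched
    return np.concatenate([l2norm(visual), l2norm(audio)])
```

Normalizing before concatenation means each modality contributes a unit-length sub-vector, so neither the visual nor the audio embedding can dominate purely by having a larger raw magnitude.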

Graceful Degradation

If audio extraction fails (corrupted audio, silent videos), the system automatically uses a zero vector fallback for the audio modality, allowing the classifier to rely solely on visual features.
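The fallback can be sketched as a thin wrapper; the 512 dimension and function names here are assumptions for illustration:

```python
import numpy as np

AUDIO_DIM = 512  # hypothetical audio embedding size

def safe_audio_embedding(video_path: str, extract_fn) -> np.ndarray:
    """Return the audio embedding, or a zero vector if extraction fails."""
    try:
        return extract_fn(video_path)
    except Exception:
        # Corrupted or silent audio: fall back to zeros so the fused
        # vector keeps its shape and the classifier leans on visuals
        return np.zeros(AUDIO_DIM, dtype=np.float32)
```

Because the zero vector survives L2 normalization unchanged (see the fusion strategy above), a failed audio track simply contributes nothing rather than crashing the pipeline.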

Dataset Discovery

The system automatically discovers:
  • Labeled data: Videos in subdirectories (folder name = category label)
  • Unlabeled data: Videos in the root videos/ directory
  • Category names: Automatically extracted from folder names
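The discovery logic amounts to a directory walk; a minimal sketch, assuming one level of category subfolders as shown in the directory tree above:

```python
from pathlib import Path

def discover_videos(root: Path):
    """Split *.mp4 files under root into labeled and unlabeled sets.

    Files directly in root are unlabeled; files in a subfolder are
    labeled with that subfolder's name as the category.
    """
    labeled: dict[str, list[Path]] = {}
    unlabeled: list[Path] = []
    for p in sorted(root.rglob("*.mp4")):
        if p.parent == root:
            unlabeled.append(p)
        else:
            labeled.setdefault(p.parent.name, []).append(p)
    return labeled, unlabeled
```

The keys of the labeled dict double as the category names, so adding a new folder under videos/ is all it takes to introduce a new category.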

Processing Pipeline

  1. Discovery Phase: Scan directory structure to identify labeled and unlabeled videos
  2. Feature Extraction: Process all videos through CLIP + Whisper pipeline
  3. Training Phase: Cross-validation with multiple model architectures
  4. Model Selection: Choose best performer based on CV accuracy
  5. Prediction Phase: Generate category predictions with confidence scores
  6. Organization (optional): Automatically move files to predicted folders

Performance Considerations

GPU Acceleration: Both CLIP and Whisper support CUDA acceleration. The system automatically detects available GPUs and falls back to CPU if necessary.

Batch Processing: Feature extraction processes videos sequentially, but model training uses mini-batch gradient descent (batch_size=32).
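The device check is the usual PyTorch idiom; a minimal sketch:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when available, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

Both the CLIP and Whisper models would be moved to this device once at startup, so per-video extraction pays no transfer cost beyond the input tensors.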

Output Artifacts

File                      Description                           Format
labeled_embeddings.pt     Training data features                PyTorch tensor dict
unlabeled_embeddings.pt   Unsorted video features               PyTorch tensor dict
model.pt / model.pkl      Trained classifier weights            PyTorch / Pickle
model_config.json         Model metadata & hyperparameters      JSON
predictions.json          Prediction results with confidence    JSON
transcripts.json          Audio transcriptions for inspection   JSON
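The JSON artifacts are plain files readable with the standard library; the schema below is a hypothetical illustration, not the documented layout of predictions.json:

```python
import json
from pathlib import Path

# Hypothetical schema: {filename: {"category": ..., "confidence": ...}}
def save_predictions(results: dict, path: Path) -> None:
    """Write prediction results as pretty-printed JSON."""
    path.write_text(json.dumps(results, indent=2))

def load_predictions(path: Path) -> dict:
    """Read prediction results back for inspection or file organization."""
    return json.loads(path.read_text())
```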
