Overview

The TikTok Auto Collection Sorter is a multimodal AI system that automatically categorizes TikTok videos into user-defined folders. The system combines visual content analysis (CLIP) and audio transcription (Whisper) to create rich feature representations, then trains a classifier to predict appropriate categories.

Architecture Components

The system consists of three main stages:

1. Feature Extraction (extract_features.py)

Transforms raw videos into fixed-dimensional embeddings by combining visual and audio modalities.

Key Parameters:
  • CLIP_MODEL: "ViT-B/32" - Vision Transformer backbone
  • WHISPER_MODEL: "base" - Audio transcription model
  • N_FRAMES: 5 - Number of frames sampled per video
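Frame sampling can be done in several ways; a minimal sketch, assuming frames are drawn at evenly spaced positions across the video (the function name and signature are illustrative, not taken from extract_features.py):

```python
import numpy as np

N_FRAMES = 5  # number of frames sampled per video, matching the parameter above

def sample_frame_indices(total_frames: int, n_frames: int = N_FRAMES) -> list[int]:
    """Pick n_frames indices spread evenly across the video."""
    if total_frames <= n_frames:
        # Short clip: use every frame we have
        return list(range(total_frames))
    # np.linspace yields evenly spaced positions; round them to frame indices
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int).tolist()
```

Each sampled frame would then be encoded with the ViT-B/32 CLIP image encoder and the per-frame embeddings averaged into a single visual vector.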

2. Model Training (train.py)

Trains multiple classifier architectures using cross-validation and selects the best performer.

Models Evaluated:
  • k-Nearest Neighbors (k=5, cosine distance)
  • Logistic Regression (L2 regularization, balanced class weights)
  • Multi-Layer Perceptron (2 hidden layers)
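The selection step can be sketched with scikit-learn; the toy embeddings, hidden-layer sizes, and fold count below are illustrative stand-ins (train.py's exact hyperparameters are only partially listed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the fused CLIP+Whisper embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))
y = rng.integers(0, 2, size=60)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}

# Mean cross-validation accuracy per model; the best one is kept
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
best = max(scores, key=scores.get)
```

On random labels like these the scores hover near chance; on real labeled embeddings the gap between models is what drives the selection.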

3. Prediction (predict.py)

Applies the trained model to unsorted videos, providing confidence scores and optional automatic file organization.
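Turning class probabilities into a labeled prediction with a confidence score might look like the following sketch; the threshold parameter is a hypothetical addition, not a documented option of predict.py:

```python
import numpy as np

def predict_with_confidence(probs: np.ndarray, categories: list[str],
                            threshold: float = 0.5):
    """Map each row of class probabilities to (category, confidence).

    Returns None as the category when the top probability falls below
    the (hypothetical) confidence threshold, leaving the video unsorted.
    """
    results = []
    for row in probs:
        i = int(row.argmax())
        conf = float(row[i])
        results.append((categories[i] if conf >= threshold else None, conf))
    return results
```

Videos left as None could stay in the root videos/ directory for manual review instead of being moved automatically.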

Data Flow

Directory Structure

workspace/source/
├── data/Favorites/videos/
│   ├── soccer/              # Labeled videos for training
│   │   └── *.mp4
│   ├── funny/
│   │   └── *.mp4
│   └── *.mp4                # Unlabeled videos to sort
├── artifacts/               # Generated outputs
│   ├── labeled_embeddings.pt
│   ├── unlabeled_embeddings.pt
│   ├── model.pt / model.pkl
│   ├── model_config.json
│   ├── predictions.json
│   └── transcripts.json
├── extract_features.py
├── train.py
└── predict.py

Key Design Decisions

Multimodal Fusion Strategy

The system uses late fusion by concatenating normalized embeddings from each modality. Both visual and audio embeddings are L2-normalized before concatenation to ensure balanced contributions.
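A minimal sketch of this late-fusion step (the helper name is illustrative):

```python
import numpy as np

def fuse(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """L2-normalize each modality, then concatenate (late fusion)."""
    def l2norm(v: np.ndarray) -> np.ndarray:
        n = np.linalg.norm(v)
        return v / n if n > 0 else v  # leave zero vectors untouched
    return np.concatenate([l2norm(visual), l2norm(audio)])
```

Normalizing before concatenation means each modality contributes a unit-length sub-vector, so neither the visual nor the audio embedding can dominate purely by having a larger raw magnitude.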

Graceful Degradation

If audio extraction fails (corrupted audio, silent videos), the system automatically uses a zero vector fallback for the audio modality, allowing the classifier to rely solely on visual features.
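The fallback can be sketched as a thin wrapper; the 512 dimension and function names here are assumptions for illustration:

```python
import numpy as np

AUDIO_DIM = 512  # hypothetical audio embedding size

def safe_audio_embedding(video_path: str, extract_fn) -> np.ndarray:
    """Return the audio embedding, or a zero vector if extraction fails."""
    try:
        return extract_fn(video_path)
    except Exception:
        # Corrupted or silent audio: fall back to zeros so the fused
        # vector keeps its shape and the classifier leans on visuals
        return np.zeros(AUDIO_DIM, dtype=np.float32)
```

Because the zero vector survives L2 normalization unchanged (see the fusion strategy above), a failed audio track simply contributes nothing rather than crashing the pipeline.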

Dataset Discovery

The system automatically discovers:
  • Labeled data: Videos in subdirectories (folder name = category label)
  • Unlabeled data: Videos in the root videos/ directory
  • Category names: Automatically extracted from folder names
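The discovery logic amounts to a directory walk; a minimal sketch, assuming one level of category subfolders as shown in the directory tree above:

```python
from pathlib import Path

def discover_videos(root: Path):
    """Split *.mp4 files under root into labeled and unlabeled sets.

    Files directly in root are unlabeled; files in a subfolder are
    labeled with that subfolder's name as the category.
    """
    labeled: dict[str, list[Path]] = {}
    unlabeled: list[Path] = []
    for p in sorted(root.rglob("*.mp4")):
        if p.parent == root:
            unlabeled.append(p)
        else:
            labeled.setdefault(p.parent.name, []).append(p)
    return labeled, unlabeled
```

The keys of the labeled dict double as the category names, so adding a new folder under videos/ is all it takes to introduce a new category.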

Processing Pipeline

  1. Discovery Phase: Scan directory structure to identify labeled and unlabeled videos
  2. Feature Extraction: Process all videos through CLIP + Whisper pipeline
  3. Training Phase: Cross-validation with multiple model architectures
  4. Model Selection: Choose best performer based on CV accuracy
  5. Prediction Phase: Generate category predictions with confidence scores
  6. Organization (optional): Automatically move files to predicted folders

Performance Considerations

GPU Acceleration: Both CLIP and Whisper support CUDA acceleration. The system automatically detects available GPUs and falls back to CPU if necessary.

Batch Processing: Feature extraction processes videos sequentially, but model training uses mini-batch gradient descent (batch_size=32).
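The device check is the usual PyTorch idiom; a minimal sketch:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when available, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

Both the CLIP and Whisper models would be moved to this device once at startup, so per-video extraction pays no transfer cost beyond the input tensors.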

Output Artifacts

File                      Description                           Format
labeled_embeddings.pt     Training data features                PyTorch tensor dict
unlabeled_embeddings.pt   Unsorted video features               PyTorch tensor dict
model.pt / model.pkl      Trained classifier weights            PyTorch / Pickle
model_config.json         Model metadata & hyperparameters      JSON
predictions.json          Prediction results with confidence    JSON
transcripts.json          Audio transcriptions for inspection   JSON
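The JSON artifacts are plain files readable with the standard library; the schema below is a hypothetical illustration, not the documented layout of predictions.json:

```python
import json
from pathlib import Path

# Hypothetical schema: {filename: {"category": ..., "confidence": ...}}
def save_predictions(results: dict, path: Path) -> None:
    """Write prediction results as pretty-printed JSON."""
    path.write_text(json.dumps(results, indent=2))

def load_predictions(path: Path) -> dict:
    """Read prediction results back for inspection or file organization."""
    return json.loads(path.read_text())
```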
