## Overview

This guide will walk you through installing dependencies, organizing your videos, extracting features, training a classifier, and generating your first predictions.

## Prerequisites

You'll need:

- Python 3.8 or higher
- A CUDA-compatible GPU (optional but recommended for faster processing)
- FFmpeg for audio extraction
- TikTok videos saved locally
## Installation

### Install FFmpeg

FFmpeg is required to extract audio from videos.

**macOS:**

```bash
brew install ffmpeg
```

**Ubuntu/Debian:**

```bash
sudo apt update && sudo apt install ffmpeg
```

**Windows:** Download a build from [ffmpeg.org](https://ffmpeg.org/download.html) and add it to your `PATH`.
### Install Python dependencies

Install the required Python packages:

```bash
pip install torch torchvision
pip install git+https://github.com/openai/CLIP.git
pip install openai-whisper
pip install opencv-python pillow
pip install scikit-learn numpy
pip install fastapi uvicorn pydantic
pip install tqdm
```
### Set up project structure

Create the following directory structure:

```bash
mkdir -p tiktok-sorter/data/Favorites/videos
mkdir -p tiktok-sorter/artifacts
cd tiktok-sorter
```

Your project should look like:

```
tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── artifacts/          # Model checkpoints, embeddings
└── data/
    └── Favorites/
        └── videos/     # Your TikTok videos go here
```
### Organize your videos

Place your TikTok videos in the `data/Favorites/videos/` directory. To create labeled training data, organize some videos into subfolders by category:

```
data/Favorites/videos/
├── soccer/              # Your category folders
│   ├── 1234567890.mp4
│   └── 9876543210.mp4
├── cooking/
│   └── 5555555555.mp4
├── funny/
│   └── 7777777777.mp4
└── 1111111111.mp4       # Unlabeled videos stay in root
```
Start with at least 20-30 videos per category for good results. The system handles class imbalance well, so you don’t need perfectly balanced data.
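The folder layout above is what defines your labels: subfolder name becomes the category, and files left in the root stay unlabeled. A minimal sketch of that discovery logic (the `discover_videos` helper is illustrative, not part of the shipped scripts):

```python
from pathlib import Path

def discover_videos(root):
    """Walk the videos directory: subfolder name = label, root-level files = unlabeled."""
    root = Path(root)
    labeled, unlabeled = [], []
    for path in sorted(root.rglob("*.mp4")):
        if path.parent == root:
            unlabeled.append(path)                    # stays unsorted until predicted
        else:
            labeled.append((path, path.parent.name))  # e.g. ("soccer/123.mp4", "soccer")
    return labeled, unlabeled
```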
## Extract Features

Now extract multimodal features from your videos using CLIP and Whisper:

```bash
python extract_features.py
```
This script:

- Samples 5 frames uniformly from each video
- Encodes frames with the CLIP (ViT-B/32) visual encoder
- Extracts audio and transcribes it with Whisper
- Encodes transcripts with the CLIP text encoder
- Combines visual and audio features into 1024-d vectors
From `extract_features.py`:

```python
import cv2
import numpy as np
import torch
from PIL import Image

# Sample frames uniformly from the video
def extract_visual_features(video_path, clip_model, preprocess, device, n_frames=5):
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, n_frames, dtype=int)
    embeddings = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue  # skip unreadable frames
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        img_input = preprocess(img).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = clip_model.encode_image(img_input)
        embeddings.append(emb.cpu())
    cap.release()
    # Average pool across frames → single 512-d vector
    stacked = torch.cat(embeddings, dim=0)
    return stacked.mean(dim=0)
```
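The audio half of the pipeline is not shown above. A sketch of how it might look (the function names and the silent-clip fallback are assumptions; the real `extract_features.py` may differ) — transcribe with Whisper, embed the transcript with CLIP's text encoder, then concatenate the two 512-d vectors into the 1024-d feature:

```python
import torch

def extract_audio_features(video_path, clip_model, whisper_model, device):
    """Transcribe audio with Whisper, then embed the transcript with CLIP's text encoder."""
    import clip  # lazy import: only needed on this path
    result = whisper_model.transcribe(str(video_path))      # Whisper invokes FFmpeg internally
    text = result["text"].strip() or "no speech"            # fallback for silent clips
    tokens = clip.tokenize(text, truncate=True).to(device)  # respect CLIP's 77-token limit
    with torch.no_grad():
        emb = clip_model.encode_text(tokens)
    return emb.squeeze(0).cpu()                             # 512-d text embedding

def fuse_features(visual_emb, audio_emb):
    """Concatenate the 512-d visual and 512-d text embeddings into one 1024-d vector."""
    return torch.cat([visual_emb, audio_emb], dim=0)
```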
**Processing time:** Expect ~10 minutes for 600 videos on a modern GPU, or ~30-45 minutes on CPU.
The script saves:

- `artifacts/labeled_embeddings.pt` - Features for videos in category folders
- `artifacts/unlabeled_embeddings.pt` - Features for videos in the root directory
- `artifacts/transcripts.json` - Whisper transcriptions for inspection
## Train the Classifier

Train a classifier on your labeled videos:

```bash
python train.py
```

The training script:

- Compares three approaches (k-NN, Logistic Regression, MLP)
- Uses stratified k-fold cross-validation
- Selects the best model based on validation accuracy
- Retrains on all labeled data
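The model-comparison step can be sketched with scikit-learn's `StratifiedKFold` (a simplified version: the `compare_models` name is illustrative, and the MLP candidate is omitted here for brevity):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def compare_models(X, y, n_splits=5, seed=42):
    """Score each candidate model with stratified k-fold CV; return mean accuracies."""
    candidates = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "logreg": LogisticRegression(max_iter=1000),
    }
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {name: [] for name in candidates}
    for train_idx, val_idx in skf.split(X, y):
        for name, model in candidates.items():
            model.fit(X[train_idx], y[train_idx])
            scores[name].append(model.score(X[val_idx], y[val_idx]))
    return {name: float(np.mean(s)) for name, s in scores.items()}
```

Stratification keeps every fold's class proportions close to the full dataset's, which matters when some categories have far fewer videos than others.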
### MLP Architecture

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```
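To handle the class imbalance mentioned earlier, the training loss can weight each class inversely to its frequency. A minimal sketch (the helper name and the exact weighting scheme are assumptions; `train.py` may weight differently):

```python
import torch
import torch.nn as nn

def class_weighted_loss(labels, num_classes):
    """Build a CrossEntropyLoss whose class weights are inverse to class frequency."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))  # rare classes get larger weights
    return nn.CrossEntropyLoss(weight=weights)

# Usage: criterion = class_weighted_loss(train_labels, num_classes)
```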
Expected output:

```
Loaded 213 samples, 8 classes: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']
Feature dimension: 1024
Using 5-fold stratified cross-validation

Fold 1: kNN=87.2%  LogReg=88.4%  MLP=90.7%
Fold 2: kNN=86.0%  LogReg=89.5%  MLP=91.9%
Fold 3: kNN=85.4%  LogReg=87.2%  MLP=89.5%
Fold 4: kNN=88.1%  LogReg=90.7%  MLP=93.0%
Fold 5: kNN=87.2%  LogReg=88.4%  MLP=90.7%

Cross-validation results (mean accuracy):
  knn:    86.8% (+/- 1.0%)
  logreg: 88.8% (+/- 1.3%)
  mlp:    91.2% (+/- 1.4%)

Best model: mlp (91.2%)
```
Training typically completes in under 30 seconds. The MLP usually outperforms k-NN and Logistic Regression on multimodal features.
The script saves:

- `artifacts/model.pt` or `artifacts/model.pkl` - Trained model weights
- `artifacts/model_config.json` - Model metadata and label mappings
## Generate Predictions

Predict folders for your unlabeled videos:

```bash
python predict.py
```

This generates predictions with confidence scores for all unsorted videos in the root directory. The script supports basic prediction, prediction with a confidence threshold (`--threshold`), and auto-sorting videos into their predicted folders.
Example output:

```
Predicting folders for 47 unsorted videos
Model: mlp | Categories: ['cooking', 'funny', 'gaming', 'news', 'quran', 'soccer', 'tech', 'travel']

[ASSIGN] 7234567890123456.mp4 → soccer (94%)   [soccer: 94% | gaming: 3% | tech: 2%]
[ASSIGN] 7234567890234567.mp4 → cooking (87%)  [cooking: 87% | travel: 8% | funny: 3%]
[SKIP  ] 7234567890345678.mp4 → funny (45%)    [funny: 45% | gaming: 32% | tech: 18%]
[ASSIGN] 7234567890456789.mp4 → quran (99%)    [quran: 99% | news: 1% | travel: 0%]

Summary:
  soccer  : 18 videos
  cooking : 12 videos
  gaming  : 8 videos
  quran   : 5 videos
  tech    : 3 videos
  travel  : 1 videos
  SKIPPED : 14 videos (below 0% threshold)
  TOTAL   : 47 videos
```
Use `--threshold` to auto-assign only videos where the model is confident. Videos below the threshold can be reviewed manually.

The script saves `artifacts/predictions.json` with detailed predictions for all videos.
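The ASSIGN/SKIP decision in the output above boils down to a single threshold check on the top class probability. A minimal sketch (the function name and default threshold are illustrative):

```python
def assign_or_skip(probs, labels, threshold=0.6):
    """Assign the top category if its confidence clears the threshold; otherwise flag for review."""
    top = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    status = "ASSIGN" if probs[top] >= threshold else "SKIP"
    return status, labels[top], probs[top]

# assign_or_skip([0.94, 0.03, 0.02], ["soccer", "gaming", "tech"])
# → ("ASSIGN", "soccer", 0.94)
```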
## Launch Interactive UI

Start the web interface for interactive labeling and active learning:

```bash
python server.py
```

Then open http://localhost:8000 in your browser.
The UI provides:

- A full-screen video player modeled after TikTok's interface
- Real-time model predictions with top-3 confidence scores
- Keyboard shortcuts (1-8) for rapid labeling
- Visual highlighting of the predicted folder
- One-click retraining that triggers the full pipeline
## Next Steps

- **Feature Extraction Deep Dive**: Learn how CLIP and Whisper work together to extract multimodal features
- **Training Configuration**: Customize the training pipeline and hyperparameters
- **Active Learning**: Improve accuracy through iterative labeling and retraining
- **Deployment**: Deploy as a browser extension or API service