Overview

Active learning is a future direction for the TikTok Auto Collection Sorter that will enable more efficient labeling by prioritizing the most informative videos. Instead of labeling videos randomly, the system identifies samples where the model is least confident and presents those for labeling first.

Why Active Learning?

Labeling data is time-consuming, and active learning helps you get the most value from each labeled sample by focusing on:
  • Uncertain predictions: Videos where the model has low confidence
  • Decision boundaries: Samples near the boundary between classes
  • Underrepresented categories: Minority classes that need more examples
This approach can significantly reduce the number of labels needed to achieve high accuracy.

Current Implementation

The system already outputs confidence scores for all predictions (see predict.py:71-73):
logits = model(torch.FloatTensor(features).to(device))
probs = F.softmax(logits, dim=1).cpu().numpy()
These probability distributions contain the information needed for active learning strategies.
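To make that concrete, here is a small sketch (with made-up softmax outputs, not real model predictions) showing how the three uncertainty signals used below can all be read off the same probability matrix:

```python
import numpy as np

# Hypothetical softmax output for 4 videos over 3 classes
probs = np.array([
    [0.90, 0.07, 0.03],   # confident prediction
    [0.40, 0.35, 0.25],   # spread-out, uncertain
    [0.55, 0.44, 0.01],   # close call between the top two classes
    [0.34, 0.33, 0.33],   # near-uniform: maximally uncertain
])

# Max probability per row: the 'confidence' used by uncertainty sampling
confidence = probs.max(axis=1)

# Gap between the two largest probabilities: the signal for margin sampling
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]

# Shannon entropy: another common uncertainty measure (higher = less certain)
entropy = -(probs * np.log(probs)).sum(axis=1)

print(confidence)  # row 3 has the lowest max confidence
print(margin)      # row 3 also has the smallest margin (0.34 vs 0.33)
print(entropy)     # and the highest entropy
```

All three measures agree that the near-uniform row is the best labeling candidate, but they can disagree on real data, which is why the strategies below are worth comparing.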

Implementation Strategy

1. Uncertainty Sampling

Select videos with the lowest maximum confidence:
import numpy as np
import json

# Load predictions
with open('artifacts/predictions.json') as f:
    predictions = json.load(f)

# Sort by confidence (ascending)
uncertain_videos = sorted(
    predictions,
    key=lambda x: x['confidence']
)

# Present top 10 most uncertain videos for labeling
for video in uncertain_videos[:10]:
    print(f"{video['video']}: {video['confidence']:.2%} confident")
    print(f"  Top predictions: {video['top_predictions'][:3]}")

2. Margin Sampling

Select videos where the top two predictions are very close:
def calculate_margin(video):
    """Margin between the top two predicted classes for one video."""
    top_probs = video['top_predictions'][:2]
    return top_probs[0]['confidence'] - top_probs[1]['confidence']

# Sort by smallest margin
close_calls = sorted(
    predictions,
    key=calculate_margin
)

for video in close_calls[:10]:
    margin = calculate_margin(video)
    print(f"{video['video']}: margin={margin:.2%}")
    print(f"  {video['top_predictions'][0]['folder']} vs {video['top_predictions'][1]['folder']}")

3. Class Balance Strategy

Combine uncertainty with class distribution to ensure balanced labeling:
from collections import Counter
import torch

# Load current labels
data = torch.load('artifacts/labeled_embeddings.pt', weights_only=False)
label_counts = Counter(data['labels'].tolist())
label_names = data['label_names']

# Calculate sampling weights
def get_priority_score(video, label_counts):
    confidence = video['confidence']
    predicted_idx = label_names.index(video['predicted_folder'])
    class_count = label_counts.get(predicted_idx, 0)
    
    # Lower confidence = higher priority
    # Lower class count = higher priority
    uncertainty_score = 1 - confidence
    balance_score = 1 / (class_count + 1)
    
    return uncertainty_score * 0.7 + balance_score * 0.3

prioritized = sorted(
    predictions,
    key=lambda x: get_priority_score(x, label_counts),
    reverse=True
)

Integration with UI

To integrate active learning into the labeling interface (server.py), modify the /unlabeled endpoint:
@app.get("/unlabeled")
def get_unlabeled_videos():
    # Load predictions
    with open(ARTIFACTS_DIR / "predictions.json") as f:
        predictions = json.load(f)
    
    # Sort by uncertainty (lowest confidence first)
    predictions.sort(key=lambda x: x['confidence'])
    
    # Return sorted list
    return {"videos": predictions}
Active learning works best with batch labeling. Label at least 10-20 videos before retraining to ensure the model learns meaningful patterns. Single-label updates can lead to overfitting.
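The batch-labeling advice above can be sketched as a small helper that picks the next batch, assuming the predictions.json structure shown earlier (a list of dicts with `video` and `confidence` keys); the `already_labeled` set is a hypothetical input tracking which videos have labels:

```python
def next_labeling_batch(predictions, already_labeled, batch_size=15):
    """Pick the next batch to label: most uncertain first,
    skipping anything that already has a label."""
    candidates = [p for p in predictions if p['video'] not in already_labeled]
    candidates.sort(key=lambda p: p['confidence'])
    return candidates[:batch_size]

# Toy prediction records for illustration
predictions = [
    {'video': 'a.mp4', 'confidence': 0.95},
    {'video': 'b.mp4', 'confidence': 0.41},
    {'video': 'c.mp4', 'confidence': 0.63},
    {'video': 'd.mp4', 'confidence': 0.38},
]
batch = next_labeling_batch(predictions, already_labeled={'c.mp4'}, batch_size=2)
print([p['video'] for p in batch])  # → ['d.mp4', 'b.mp4']
```

A `batch_size` in the 10-20 range matches the retraining guidance above; anything smaller risks the single-label overfitting problem.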

Measuring Impact

Track these metrics to evaluate active learning effectiveness:
  1. Accuracy vs. labeled samples: Plot model accuracy against the number of labels
  2. Average confidence: Monitor how model confidence changes over iterations
  3. Per-class accuracy: Ensure minority classes improve
Example tracking:
# After each retrain, log metrics
metrics = {
    "num_labeled": len(labeled_data),
    "cv_accuracy": cv_results['mean_accuracy'],
    "avg_confidence": np.mean([p['confidence'] for p in predictions]),
    "class_distribution": dict(Counter(labels))
}

with open('artifacts/training_history.json', 'a') as f:
    f.write(json.dumps(metrics) + '\n')

Next Steps

  1. Implement uncertainty-based sorting in server.py
  2. Add UI indicators for high-priority videos
  3. Track labeling efficiency metrics
  4. Experiment with different sampling strategies
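As one example of step 4, entropy sampling is a common alternative that uses the full probability distribution rather than just the top one or two scores. A minimal sketch, assuming each prediction record carries a full per-class `probs` list (an assumption: the current predictions.json schema only exposes `confidence` and `top_predictions`):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a probability distribution; higher = more uncertain."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

# Toy records; the 'probs' key is hypothetical, not part of the
# current predictions.json output.
predictions = [
    {'video': 'a.mp4', 'probs': [0.9, 0.05, 0.05]},
    {'video': 'b.mp4', 'probs': [0.4, 0.3, 0.3]},
]
most_uncertain = max(predictions, key=lambda p: entropy(p['probs']))
print(most_uncertain['video'])  # → b.mp4
```

Exposing the full probability vector would be a small change to predict.py, since the softmax output shown earlier already contains it.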
