Active learning is a future direction for the TikTok Auto Collection Sorter that will enable more efficient labeling by prioritizing the most informative videos. Instead of labeling videos randomly, the system identifies samples where the model is least confident and presents those for labeling first.
```python
import json

# Load the latest model predictions
with open('artifacts/predictions.json') as f:
    predictions = json.load(f)

# Sort by confidence (ascending) so the least certain videos come first
uncertain_videos = sorted(predictions, key=lambda x: x['confidence'])

# Present the 10 most uncertain videos for labeling
for video in uncertain_videos[:10]:
    print(f"{video['video']}: {video['confidence']:.2%} confident")
    print(f"  Top predictions: {video['top_predictions'][:3]}")
```
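Top-class confidence is only one uncertainty signal; predictive entropy considers the full class distribution and ranks a video as more uncertain when probability mass is spread across several collections. A minimal sketch, assuming each prediction also carries a `probs` list of per-class probabilities (a hypothetical field, not part of the `predictions.json` schema shown above):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predictions carrying full per-class probabilities
predictions = [
    {"video": "a.mp4", "probs": [0.90, 0.05, 0.05]},  # confident
    {"video": "b.mp4", "probs": [0.40, 0.35, 0.25]},  # spread out
]

# Most uncertain (highest entropy) first
ranked = sorted(predictions, key=lambda x: entropy(x["probs"]), reverse=True)
```

Entropy and top-class confidence often agree, but entropy distinguishes a video split between two plausible collections from one split across many.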
To integrate active learning into the labeling interface (`server.py`), modify the `/unlabeled` endpoint:
```python
@app.get("/unlabeled")
def get_unlabeled_videos():
    # Load predictions
    with open(ARTIFACTS_DIR / "predictions.json") as f:
        predictions = json.load(f)
    # Sort by uncertainty (lowest confidence first)
    predictions.sort(key=lambda x: x['confidence'])
    # Return sorted list
    return {"videos": predictions}
```
Active learning works best with batch labeling. Label at least 10-20 videos before retraining to ensure the model learns meaningful patterns. Single-label updates can lead to overfitting.
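One way to enforce this batching is a simple gate that blocks retraining until enough new labels have accumulated. This is a sketch; `MIN_NEW_LABELS` and `should_retrain` are illustrative names, not part of the existing codebase:

```python
MIN_NEW_LABELS = 10  # assumed threshold, per the 10-20 guideline above

def should_retrain(num_labels_since_last_train: int) -> bool:
    """Only retrain once a full batch of new labels has accumulated,
    avoiding the overfitting risk of single-label updates."""
    return num_labels_since_last_train >= MIN_NEW_LABELS
```

The labeling loop would call this after each saved label and trigger retraining (and reset the counter) only when it returns `True`.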
```python
import json
from collections import Counter

import numpy as np

# After each retrain, append metrics to a JSON-lines history file
metrics = {
    "num_labeled": len(labeled_data),
    "cv_accuracy": cv_results['mean_accuracy'],
    "avg_confidence": np.mean([p['confidence'] for p in predictions]),
    "class_distribution": dict(Counter(labels)),
}
with open('artifacts/training_history.json', 'a') as f:
    f.write(json.dumps(metrics) + '\n')
```
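Because each retrain appends one JSON object per line, the history file can be read back as JSON Lines to track how accuracy and confidence evolve as labels accumulate. A small sketch, with `load_history` as a hypothetical helper:

```python
import json

def load_history(path):
    """Parse a JSON-lines training history into a list of metric dicts,
    in the order the retrains were logged."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Plotting `num_labeled` against `cv_accuracy` from this list shows whether uncertainty-driven labeling is actually improving the model faster than random labeling would.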