Two inference paths are available:

- **Flask web app**: API-based inference with a browser UI. Start the server and classify videos through HTTP endpoints.
- **Command-line script**: Interactive script for classifying individual videos or running batch accuracy tests from the terminal.
## Prerequisites
Before running inference, ensure the project's Python packages are installed and the following files are available:

| File | Purpose |
|---|---|
| `features_enhanced/test_features_multiscale.h5` | Pre-extracted test features (412 videos, shape `[412, 73, 1280]`) |
| `models_enhanced/best_ensemble_model_{1-4}.pt` | Trained model checkpoints |
| `data/processed/test/{category}/{subcategory}/processed_data.pt` | Video name index used to map filenames to positions in the `.h5` file |
## Directory layout
Place files relative to `app.py` (or `test_already_extracted.py`).
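Based on the paths listed under Prerequisites, the expected layout is roughly the following (a sketch; `{category}` and `{subcategory}` stand for the actual class directories):

```
project-root/
├── app.py
├── test_already_extracted.py
├── features_enhanced/
│   └── test_features_multiscale.h5
├── models_enhanced/
│   ├── best_ensemble_model_1.pt
│   ├── best_ensemble_model_2.pt
│   ├── best_ensemble_model_3.pt
│   └── best_ensemble_model_4.pt
└── data/
    └── processed/
        └── test/
            └── {category}/
                └── {subcategory}/
                    └── processed_data.pt
```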
## Pre-extracted features approach
Both inference paths use a pre-extracted features strategy rather than running the full CNN backbone at inference time. Features were extracted during training with a multi-scale EfficientNet-V2 pipeline and stored in a compressed HDF5 file.

- Feature shape per video: `[73, 1280]` (73 frames, 1280-dim per frame)
- Multi-scale extraction (scales 1.0, 0.85, and 1.15, averaged) for spatial robustness
- Loading a single video's features reads only the relevant slice of the `.h5` file

As a result, the temporal model (`SuperEnhancedTemporalModel`) runs in milliseconds on GPU, and feature loading from disk is the dominant cost.
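Slice-based loading can be sketched with `h5py`. The dataset name `"features"` is an assumption; check the actual keys in `test_features_multiscale.h5` before relying on it. The demo below uses a small synthetic file in place of the real one:

```python
import h5py
import numpy as np

def load_video_features(h5_path: str, index: int) -> np.ndarray:
    """Read one video's [73, 1280] feature slice; h5py fetches only that slice from disk."""
    with h5py.File(h5_path, "r") as f:
        return f["features"][index]

# Demo: a 4-video synthetic file standing in for test_features_multiscale.h5
with h5py.File("demo_features.h5", "w") as f:
    f.create_dataset(
        "features",
        data=np.zeros((4, 73, 1280), dtype=np.float32),
        compression="gzip",  # the real file is also stored compressed
    )

feats = load_video_features("demo_features.h5", 2)
print(feats.shape)  # (73, 1280)
```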
## Selected checkpoints
Both inference paths load the same four checkpoints, defined in `SELECTED_CHECKPOINTS`. Each checkpoint is a `SuperEnhancedTemporalModel` with the following architecture:
| Parameter | Value |
|---|---|
| `feature_dim` | 1280 |
| `hidden_dim` | 768 |
| `num_classes` | 4 |
| `num_lstm_layers` | 4 |
| `num_attention_heads` | 12 |
| `dropout` | 0.4 |
| `bidirectional` | True |
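As a plain-Python sketch, these hyperparameters fit together consistently: with a bidirectional LSTM the per-frame representation is `2 * hidden_dim`, and that width divides evenly across the attention heads. (The divisibility check assumes attention operates on the LSTM output, which is the usual wiring but is not confirmed by the source.)

```python
# Hyperparameters from the table above, collected as a config dict.
MODEL_CONFIG = {
    "feature_dim": 1280,
    "hidden_dim": 768,
    "num_classes": 4,
    "num_lstm_layers": 4,
    "num_attention_heads": 12,
    "dropout": 0.4,
    "bidirectional": True,
}

# Bidirectional LSTM output width: forward + backward hidden states.
attn_dim = MODEL_CONFIG["hidden_dim"] * (2 if MODEL_CONFIG["bidirectional"] else 1)
head_dim = attn_dim // MODEL_CONFIG["num_attention_heads"]

print(attn_dim, head_dim)  # 1536 128
```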
Validation metrics for each checkpoint, as recorded in `configuration_analysis.json`:
| Checkpoint | Best val accuracy | Weighted F1 |
|---|---|---|
| `best_ensemble_model_1.pt` | 72.7% | 64.8% |
| `best_ensemble_model_2.pt` | 92.1% | 92.1% |
| `best_ensemble_model_3.pt` | 91.9% | 91.9% |
| `best_ensemble_model_4.pt` | 91.9% | 91.9% |
Model 1 was saved at an earlier training epoch (epoch 52) with lower accuracy. The ensemble averages probabilities across all four models, so models 2–4 dominate the final prediction.
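The averaging behaviour can be sketched in NumPy. `ensemble_predict` here is a hypothetical helper, not the project's actual function; it mirrors the described strategy of averaging per-model softmax probabilities:

```python
import numpy as np

def ensemble_predict(per_model_logits):
    """Average softmax probabilities across models and return (class, avg probs)."""
    probs = []
    for logits in per_model_logits:
        logits = np.asarray(logits, dtype=float)
        e = np.exp(logits - logits.max())  # numerically stable softmax
        probs.append(e / e.sum())
    avg = np.mean(probs, axis=0)
    return int(avg.argmax()), avg

# Toy 4-class logits: the weaker model 1 favours class 0,
# while models 2-4 agree on class 1.
logits = [
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]
pred, avg = ensemble_predict(logits)
print(pred)  # 1 -- the agreeing majority dominates the averaged probabilities
```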
## GPU vs CPU inference
Device selection is automatic. The classifier checks `torch.cuda.is_available()` at initialization, in both the Flask app (`get_classifier()`) and the CLI script (`SingleVideoClassifier.__init__`). If no CUDA device is detected, inference falls back to CPU automatically.
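A minimal sketch of that check, assuming only that the constructor picks a device up front (the class body here is a stand-in, not the real `SingleVideoClassifier`):

```python
import torch

class SingleVideoClassifier:
    def __init__(self):
        # Use the GPU when available; otherwise fall back to CPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

clf = SingleVideoClassifier()
print(clf.device.type)  # "cuda" on a GPU machine, "cpu" otherwise
```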
## Lazy singleton loading
The Flask app uses lazy singletons to avoid reloading models on every request: the first call to `/api/videos` or `/api/classify` triggers model loading, and subsequent requests reuse the same objects.
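The pattern can be sketched without Flask. `ModelEnsemble` and the load counter are hypothetical stand-ins for the real checkpoint-loading code; the point is that construction happens once, on first access:

```python
_classifier = None  # module-level singleton slot
load_count = 0      # tracks how many times the expensive load ran

class ModelEnsemble:
    def __init__(self):
        global load_count
        load_count += 1  # checkpoint loading would happen here

def get_classifier():
    """Load the ensemble on the first call; every later call reuses it."""
    global _classifier
    if _classifier is None:
        _classifier = ModelEnsemble()
    return _classifier

a = get_classifier()  # triggers the one-time load
b = get_classifier()  # reuses the same object
print(a is b, load_count)  # True 1
```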
## Next steps

- **Flask app**: Configure and start the Flask server, then call the API endpoints.
- **CLI script**: Run interactive single-video classification or batch testing from the terminal.