The project provides two ways to run inference on pre-classified video content. Both paths use the same underlying ensemble of four trained checkpoints and the same pre-extracted feature pipeline.

Flask web app

API-based inference with a browser UI. Start the server and classify videos through HTTP endpoints.

Command-line script

Interactive script for classifying individual videos or running batch accuracy tests from the terminal.

Prerequisites

Before running inference, ensure the following are available.

Python packages
pip install torch torchvision flask h5py numpy opencv-python requests yt-dlp
Required files
| File | Purpose |
| --- | --- |
| features_enhanced/test_features_multiscale.h5 | Pre-extracted test features (412 videos, shape [412, 73, 1280]) |
| models_enhanced/best_ensemble_model_{1-4}.pt | Trained model checkpoints |
| data/processed/test/{category}/{subcategory}/processed_data.pt | Video name index used to map filenames to h5 positions |
All four checkpoint files must be present for full ensemble accuracy. The classifier will skip any missing checkpoints and log a warning, but inference quality degrades with fewer ensemble members.
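As a pre-flight sanity check, a minimal sketch (the helper itself is hypothetical; file names come from the table above) could report which checkpoints are actually on disk, mirroring the classifier's skip-and-warn behavior:

```python
from pathlib import Path
import warnings

# Checkpoint file names from the project layout.
CHECKPOINTS = [Path(f"models_enhanced/best_ensemble_model_{i}.pt") for i in range(1, 5)]

def available_checkpoints(checkpoints):
    """Return checkpoints found on disk, warning about any that are missing."""
    present = []
    for ckpt in checkpoints:
        if ckpt.exists():
            present.append(ckpt)
        else:
            # Mirrors the classifier's behavior: skip and warn, don't crash.
            warnings.warn(f"missing checkpoint: {ckpt}; ensemble quality degrades")
    return present
```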

Directory layout

Place files relative to app.py (or test_already_extracted.py):
Flask Local New/
├── app.py
├── test_already_extracted.py
├── model_train_new.py
├── index.html
├── features_enhanced/
│   └── test_features_multiscale.h5
├── models_enhanced/
│   ├── best_ensemble_model_1.pt
│   ├── best_ensemble_model_2.pt
│   ├── best_ensemble_model_3.pt
│   └── best_ensemble_model_4.pt
└── data/
    └── processed/
        └── test/
            ├── Animation/
            │   └── {subcategory}/
            │       └── processed_data.pt
            ├── Flat_Content/
            │   └── {subcategory}/
            │       └── processed_data.pt
            ├── Gaming/
            │   └── {subcategory}/
            │       └── processed_data.pt
            └── Natural_Content/
                └── {subcategory}/
                    └── processed_data.pt
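The video-name index is built by walking this tree. A minimal sketch of collecting the per-subcategory files (the walk helper is hypothetical; the real loader also reads each processed_data.pt to recover filenames):

```python
from pathlib import Path

def find_processed_files(test_root):
    """Collect category/subcategory/processed_data.pt paths under the test directory."""
    # One glob level per tree depth: {category}/{subcategory}/processed_data.pt
    return sorted(Path(test_root).glob("*/*/processed_data.pt"))
```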

Pre-extracted features approach

Both inference paths use a pre-extracted features strategy rather than running the full CNN backbone at inference time. Features were extracted during training with a multi-scale EfficientNet-V2 pipeline and stored in a compressed HDF5 file.
  • Feature shape per video: [73, 1280] (73 frames, 1280-dim per frame)
  • Multi-scale extraction (scales 1.0, 0.85, 1.15 averaged) for spatial robustness
  • Loading a single video’s features reads only the relevant slice of the .h5 file
This makes inference fast: the temporal model (SuperEnhancedTemporalModel) runs in milliseconds on GPU, and feature loading from disk is the dominant cost.
For external videos (URLs or local files not in the test set), the script falls back to on-the-fly EfficientNet-B0 feature extraction. See Single-video classification for details.
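Reading a single slice can be sketched as follows (the dataset name "features" and the helper are assumptions; the real loader may use a different key):

```python
import h5py
import numpy as np

def load_video_features(h5_path, index, dataset="features"):
    """Read a single video's [73, 1280] feature slice without loading the whole file.

    h5py reads only the requested rows from disk, which keeps per-video
    loading cheap even for a large test-set file.
    """
    with h5py.File(h5_path, "r") as f:
        return np.asarray(f[dataset][index], dtype=np.float32)
```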

Selected checkpoints

Both inference paths load the same four checkpoints defined in SELECTED_CHECKPOINTS:
SELECTED_CHECKPOINTS = [
    "best_ensemble_model_1.pt",
    "best_ensemble_model_2.pt",
    "best_ensemble_model_3.pt",
    "best_ensemble_model_4.pt",
]
Each checkpoint is a SuperEnhancedTemporalModel with the following architecture:
| Parameter | Value |
| --- | --- |
| feature_dim | 1280 |
| hidden_dim | 768 |
| num_classes | 4 |
| num_lstm_layers | 4 |
| num_attention_heads | 12 |
| dropout | 0.4 |
| bidirectional | True |
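An illustrative stand-in wired with these hyperparameters (the actual SuperEnhancedTemporalModel architecture is not shown here and likely differs in its internals):

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Hypothetical sketch of a bidirectional-LSTM + attention classifier
    using the hyperparameters from the table above."""

    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_attention_heads=12, dropout=0.4):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=num_lstm_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        # Bidirectional output is 2 * hidden_dim wide.
        self.attn = nn.MultiheadAttention(hidden_dim * 2, num_attention_heads,
                                          dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):  # x: [batch, 73, 1280]
        h, _ = self.lstm(x)
        a, _ = self.attn(h, h, h)
        return self.head(a.mean(dim=1))  # logits: [batch, num_classes]
```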
Checkpoint validation accuracies from configuration_analysis.json:
| Checkpoint | Best val accuracy | Weighted F1 |
| --- | --- | --- |
| best_ensemble_model_1.pt | 72.7% | 64.8% |
| best_ensemble_model_2.pt | 92.1% | 92.1% |
| best_ensemble_model_3.pt | 91.9% | 91.9% |
| best_ensemble_model_4.pt | 91.9% | 91.9% |
Model 1 was saved at an earlier training epoch (epoch 52) with lower accuracy. The ensemble averages probabilities across all four models, so models 2–4 dominate the final prediction.
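The averaging step can be sketched as follows (the ensemble_predict helper is hypothetical; each model is assumed to map [1, 73, 1280] features to [1, 4] logits):

```python
import torch

def ensemble_predict(models, features):
    """Average softmax probabilities across ensemble members.

    features: [1, 73, 1280] pre-extracted clip features.
    Returns the argmax class index and the mean probability vector.
    """
    probs = []
    with torch.no_grad():
        for model in models:
            logits = model(features)          # assumed shape: [1, 4]
            probs.append(torch.softmax(logits, dim=-1))
    # Averaging probabilities (not logits) lets stronger members dominate
    # without any single model vetoing the prediction.
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=-1).item(), mean_probs
```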

GPU vs CPU inference

Device selection is automatic. The classifier checks torch.cuda.is_available() at initialization:
device_arg = 'cuda' if torch.cuda.is_available() and device == 'cuda' else 'cpu'
_classifier = user_module.SingleVideoClassifier(checkpoint_paths=paths, device=device_arg)
This applies to both the Flask app (get_classifier()) and the CLI script (SingleVideoClassifier.__init__). If no CUDA device is detected, inference falls back to CPU automatically.

Lazy singleton loading

The Flask app uses lazy singletons to avoid loading models on every request:
_feature_loader = None
_classifier = None

def get_feature_loader():
    global _feature_loader
    if _feature_loader is None:
        _feature_loader = user_module.SingleVideoFeatureLoader(
            features_dir=str(FEATURES_DIR),
            processed_test_dir=str(PROCESSED_TEST_DIR)
        )
    return _feature_loader

def get_classifier(device="cuda"):
    global _classifier
    if _classifier is None:
        # ... build checkpoint paths ...
        device_arg = 'cuda' if torch.cuda.is_available() and device == 'cuda' else 'cpu'
        _classifier = user_module.SingleVideoClassifier(checkpoint_paths=paths, device=device_arg)
        _classifier.load_models()
    return _classifier
The first request to /api/videos or /api/classify triggers model loading. Subsequent requests reuse the same objects.

Next steps

Flask app

Configure and start the Flask server, then call the API endpoints.

CLI script

Run interactive single-video classification or batch testing from the terminal.
