test_already_extracted.py provides an interactive command-line interface for classifying videos. It supports dataset videos (via pre-extracted features), external video URLs, and local video files outside the dataset.
Running the script
Interactive mode
After listing available videos, the script prompts for input:

| Input | Behavior |
|---|---|
| A number (e.g. `42`) | Selects the video at that position in the displayed list |
| A filename (e.g. `video_001.mp4`) | Looks up the video by name in the feature index |
| A URL or local path | Downloads/reads the video and extracts features on the fly |
| `test` | Runs a batch accuracy test on a random sample |
Quotes around filenames are stripped automatically, so `'video_001.mp4'` and `"video_001.mp4"` both work when pasted.

SingleVideoFeatureLoader
SingleVideoFeatureLoader manages reading pre-extracted features from the HDF5 file and building the filename-to-index mapping from processed_data.pt files.
Initialization
- Features directory: path to the directory containing the HDF5 features file. The loader looks for any file matching `test_features*.h5` inside this directory.
- Processed-data directory: path to `data/processed/test`. The loader iterates category and subcategory subdirectories in sorted order to map video filenames to their position in the `.h5` file.

On initialization, the loader:

- Finds `test_features*.h5` in the features directory (exits if not found).
- Reads `category_mapping` from the file's HDF5 attributes.
- Calls `_build_video_index()` to scan `processed_data.pt` files and construct `video_index: dict[str, (int, int, str)]`, mapping video filename to `(h5_index, label, category_name)`.
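The indexing scheme can be sketched as follows. This is assumed code, not the script's exact implementation: the real `_build_video_index()` reads filenames out of `processed_data.pt`, while here each `.pt` file simply stands in for one video.

```python
from pathlib import Path

def build_video_index(processed_dir: str) -> dict:
    """Map video filename -> (h5_index, label, category_name), assigning
    sequential HDF5 row indices in sorted category/subcategory order."""
    index = {}
    h5_index = 0
    categories = sorted(p for p in Path(processed_dir).iterdir() if p.is_dir())
    for label, category in enumerate(categories):
        for sub in sorted(p for p in category.iterdir() if p.is_dir()):
            for pt_file in sorted(sub.glob("*.pt")):
                # In the real script the names come from processed_data.pt;
                # here each .pt file represents one video.
                index[pt_file.stem] = (h5_index, label, category.name)
                h5_index += 1
    return index
```

Because both the category directories and the files within them are iterated in sorted order, the assigned `h5_index` values line up with the row order used when the features were written.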
list_available_videos()
Returns:

- `all_videos`: flat list of all video filenames in sorted-category order.
- `videos_by_category`: `dict[str, list[str]]` mapping category name to sorted video filenames.
load_video_features(video_name)
- `video_name`: exact filename as it appears in `video_index`. Returns `(None, None, None)` if the name is not found.

Returns:

- `features`: `torch.Tensor` of shape `[T, 1280]`, where `T` is the number of valid frames for that video (up to 73).
- `label`: integer class index from the HDF5 labels array.
- `category_name`: string category name (e.g. `"Gaming"`).
SingleVideoClassifier
SingleVideoClassifier loads the ensemble of trained checkpoints and provides standard and TTA prediction methods.
Initialization
- Checkpoint paths: list of paths to `.pt` checkpoint files. Paths can be strings or `Path` objects. Order affects the `per_model_scores` breakdown but not the final ensemble result.
- Device: preferred compute device. The actual device used is `cuda` only when `torch.cuda.is_available()` returns `True`; otherwise it falls back to `cpu` regardless of this parameter.

Call `load_models()` before calling any prediction method.
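The device fallback described above amounts to a few lines; a minimal sketch of the assumed logic (the function name `resolve_device` is not from the script):

```python
import torch

def resolve_device(preferred: str = "cuda") -> torch.device:
    """Use CUDA only when it is actually available; otherwise fall back to CPU."""
    if preferred == "cuda" and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")
```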
load_models()
Loads each checkpoint, reconstructs the `SuperEnhancedTemporalModel` from the saved `model_config`, and calls `model.eval()`.

After loading:

- `classifier.models`: list of loaded `SuperEnhancedTemporalModel` instances.
- `classifier.class_names`: `['Animation', 'Flat_Content', 'Gaming', 'Natural_Content']`.
- `classifier.feature_dim`: `1280`.
- `classifier.num_classes`: `4`.
predict_standard(features)
Runs standard ensemble inference without augmentation. Each model processes `features` independently, and the softmax probabilities are averaged across all models.

Returns a `torch.Tensor` of shape `[num_classes]` with the averaged softmax probabilities.
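The averaging rule can be illustrated without the models themselves. A minimal stdlib sketch (the real code operates on `torch` tensors; function names here are for illustration only):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(per_model_logits):
    """Average per-model softmax distributions into one ensemble distribution."""
    dists = [softmax(logits) for logits in per_model_logits]
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
```

Averaging probabilities (rather than logits) means a single over-confident model cannot dominate the ensemble.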
predict_with_tta(features)
Runs Test-Time Augmentation using 4 augmentation modes. Each mode independently calls `predict_standard`, then the probabilities are averaged.
| TTA mode | Description |
|---|---|
| Mode 1: Original | Features as-is |
| Mode 2: Reverse | `torch.flip(features, dims=[0])` — reversed temporal order |
| Mode 3: Speed up | Subsample to `T//2` frames uniformly (requires `T > 10`) |
| Mode 4: Speed down | Upsample to `T*1.5` frames by interpolating indices (requires `T > 10`) |
Returns a `torch.Tensor` of shape `[num_classes]` with TTA-averaged probabilities.
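The frame-index arithmetic behind the four modes can be sketched as follows. This is assumed index math for illustration; the real code manipulates the feature tensor directly with `torch` ops.

```python
def tta_frame_orders(T: int) -> dict:
    """Frame-index sequences for each TTA mode, given T input frames."""
    modes = {
        "original": list(range(T)),
        "reverse": list(range(T - 1, -1, -1)),  # torch.flip(..., dims=[0])
    }
    if T > 10:
        half = T // 2
        # Mode 3: uniformly subsample to T // 2 frames.
        modes["speed_up"] = [round(i * (T - 1) / (half - 1)) for i in range(half)]
        longer = int(T * 1.5)
        # Mode 4: stretch to ~1.5x length by interpolating indices.
        modes["speed_down"] = [round(i * (T - 1) / (longer - 1)) for i in range(longer)]
    return modes
```

For short clips (`T <= 10`) only the original and reversed orders are used, which matches the `T > 10` guard in the table above.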
classify_video(features, true_label, video_name, use_tta)
Runs inference and returns a structured result dictionary.

Parameters:

- `features`: frame features tensor of shape `[T, feature_dim]`.
- `true_label`: ground-truth class index. Pass `None` for external videos where the true label is unknown. When provided, the result includes `is_correct`.
- `video_name`: display name used in console output and the result dictionary.
- `use_tta`: when `True`, calls `predict_with_tta`; otherwise calls `predict_standard`.

The result dictionary contains:

- The value passed as `video_name`.
- The class name corresponding to `true_label`, or `null` if `true_label` was `None`.
- The class name with the highest ensemble probability.
- The confidence of the predicted class as a percentage.
- `is_correct`: whether the prediction matched `true_label`; `null` when `true_label` was not provided.
- The ensemble probability for every class, keyed by class name, as percentages.
- An ISO 8601 timestamp of when inference ran.
- Whether TTA was applied.
- The number of models used in the ensemble.
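A sketch of how such a result dictionary might be assembled. The field names here are assumptions: the source describes the contents of each field but not the exact keys.

```python
from datetime import datetime, timezone

def build_result(video_name, class_names, probs, true_label=None,
                 used_tta=False, num_models=1):
    """Assemble a classify_video()-style result dict (key names assumed)."""
    pred = max(range(len(probs)), key=probs.__getitem__)
    return {
        "video_name": video_name,
        "true_category": class_names[true_label] if true_label is not None else None,
        "predicted_category": class_names[pred],
        "confidence": round(probs[pred] * 100, 2),      # percentage
        "is_correct": (pred == true_label) if true_label is not None else None,
        "all_probabilities": {n: round(p * 100, 2) for n, p in zip(class_names, probs)},
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "used_tta": used_tta,
        "num_models": num_models,
    }
```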
External video inference
When the input is a URL or a local file path not present in the feature index, the script downloads or reads the video and extracts features on the fly before classifying.

download_video_if_needed(url_or_path)
If `url_or_path` is an existing local file, it is returned unchanged. Otherwise:

- yt-dlp: attempts to download the best MP4 using `yt_dlp.YoutubeDL`. Works with YouTube URLs and other yt-dlp-supported platforms.
- requests fallback: if yt-dlp fails or is not installed, streams the URL directly via `requests.get`. This only works for direct `.mp4` links.
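A sketch of the two-tier download logic (assumed structure; the output filename and the format selector are illustrative, not taken from the script):

```python
import os

def download_video_if_needed(url_or_path: str,
                             out_path: str = "downloaded_video.mp4") -> str:
    """Return a local video path, downloading the URL if necessary."""
    if os.path.isfile(url_or_path):
        return url_or_path  # already a local file: return unchanged
    try:
        import yt_dlp  # preferred: handles YouTube and many other platforms
        opts = {"format": "best[ext=mp4]/best", "outtmpl": out_path}
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.download([url_or_path])
    except Exception:
        import requests  # fallback: only works for direct .mp4 links
        resp = requests.get(url_or_path, stream=True, timeout=60)
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return out_path
```

Catching a broad `Exception` around the yt-dlp branch covers both `ImportError` (yt-dlp not installed) and download failures, matching the fallback behavior described above.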
extract_video_features(video_path, model_feature_dim, num_frames, device)
Extracts `[T, 1280]` features from a raw video file using an EfficientNet-B0 backbone.
- `video_path`: path to the local video file.
- `model_feature_dim`: expected output feature dimension. Should match `classifier.feature_dim` (1280). If the backbone output dimension differs, a linear projection is applied automatically.
- `num_frames`: number of frames to sample uniformly from the video. Uses the same count as the pre-extracted test features.
- `device`: compute device for the EfficientNet-B0 backbone. Accepts `'cpu'`, `'cuda'`, or a `torch.device` object.

The function:

- Opens the video with OpenCV and counts total frames.
- Computes `num_frames` uniformly spaced frame indices.
- Loads pretrained EfficientNet-B0 (`torchvision.models.efficientnet_b0(pretrained=True)`).
- Applies the EfficientNet default preprocessing: resize to 256, center crop to 224, normalize with ImageNet mean/std.
- Passes frames through `eff.features` (the convolutional backbone) and applies global average pooling to get `[T, 1280]` features.
- Returns the features on CPU as a `torch.Tensor`.
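The steps above can be sketched as follows. This is assumed code, not the script's exact implementation: the dimension-mismatch projection is omitted, only `uniform_frame_indices` is dependency-free, and the rest requires OpenCV and torchvision.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """num_frames uniformly spaced indices across [0, total_frames - 1]."""
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

def extract_video_features(video_path, model_feature_dim=1280,
                           num_frames=73, device="cpu"):
    # Heavy deps imported lazily so the index helper above stays standalone.
    import cv2
    import torch
    from torchvision import models, transforms

    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    eff = models.efficientnet_b0(pretrained=True).to(device).eval()
    feats = []
    with torch.no_grad():
        for idx in uniform_frame_indices(total, num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue  # skip unreadable frames, so T may be < num_frames
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0).to(device)
            fmap = eff.features(x)                               # [1, 1280, 7, 7]
            feats.append(fmap.mean(dim=[2, 3]).squeeze(0).cpu())  # global avg pool
    cap.release()
    return torch.stack(feats)  # [T, 1280] on CPU
```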
EfficientNet-B0 feature extraction can take 1–5 minutes for a typical video depending on hardware. The pre-extracted HDF5 approach is significantly faster for test-set videos.
Full external video example
Batch test mode
Enter `test` at the video selection prompt to run an accuracy test on a random sample of the test set.
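A sketch of what such a batch test might compute (assumed logic; `classify_fn` stands in for a call to `classify_video` with a known true label):

```python
import random

def batch_accuracy(video_names, classify_fn, sample_size=20, rng=None):
    """Accuracy (%) of classify_fn over a random sample of video names.

    classify_fn(name) must return a dict with an 'is_correct' bool,
    as classify_video() does when a true label is provided."""
    rng = rng or random.Random()
    sample = rng.sample(list(video_names), min(sample_size, len(video_names)))
    results = [classify_fn(name) for name in sample]
    correct = sum(1 for r in results if r["is_correct"])
    return 100.0 * correct / len(results)
```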