test_already_extracted.py provides an interactive command-line interface for classifying videos. It supports dataset videos (via pre-extracted features), external video URLs, and local video files outside the dataset.
Running the script
Interactive mode
After listing available videos, the script prompts for input:

| Input | Behavior |
|---|---|
| A number (e.g. `42`) | Selects the video at that position in the displayed list |
| A filename (e.g. `video_001.mp4`) | Looks up the video by name in the feature index |
| A URL or local path | Downloads/reads the video and extracts features on the fly |
| `test` | Runs a batch accuracy test on a random sample |
Quotes around filenames are stripped automatically, so `'video_001.mp4'` and `"video_001.mp4"` both work when pasted.

SingleVideoFeatureLoader
SingleVideoFeatureLoader manages reading pre-extracted features from the HDF5 file and building the filename-to-index mapping from processed_data.pt files.
Initialization
- Features directory: path to the directory containing the HDF5 features file. The loader looks for any file matching `test_features*.h5` inside this directory.
- Processed-data directory: path to `data/processed/test`. The loader iterates category and subcategory subdirectories in sorted order to map video filenames to their position in the `.h5` file.

On initialization, the loader:

- Finds `test_features*.h5` in the features directory (exits if not found).
- Reads `category_mapping` from the file's HDF5 attributes.
- Calls `_build_video_index()` to scan `processed_data.pt` files and construct `video_index: dict[str, (int, int, str)]`, mapping video filename to `(h5_index, label, category_name)`.
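The indexing scheme can be sketched as follows. This is assumed code, not the script's exact implementation: the real `_build_video_index()` reads filenames out of `processed_data.pt`, while here each `.pt` file simply stands in for one video.

```python
from pathlib import Path

def build_video_index(processed_dir: str) -> dict:
    """Map video filename -> (h5_index, label, category_name), assigning
    sequential HDF5 row indices in sorted category/subcategory order."""
    index = {}
    h5_index = 0
    categories = sorted(p for p in Path(processed_dir).iterdir() if p.is_dir())
    for label, category in enumerate(categories):
        for sub in sorted(p for p in category.iterdir() if p.is_dir()):
            for pt_file in sorted(sub.glob("*.pt")):
                # In the real script the names come from processed_data.pt;
                # here each .pt file represents one video.
                index[pt_file.stem] = (h5_index, label, category.name)
                h5_index += 1
    return index
```

Because both the category directories and the files within them are iterated in sorted order, the assigned `h5_index` values line up with the row order used when the features were written.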
list_available_videos()
Returns:

- `all_videos`: flat list of all video filenames in sorted-category order.
- `videos_by_category`: `dict[str, list[str]]` mapping category name to sorted video filenames.
load_video_features(video_name)
- `video_name`: exact filename as it appears in `video_index`. Returns `(None, None, None)` if the name is not found.

Returns:

- `features`: `torch.Tensor` of shape `[T, 1280]`, where `T` is the number of valid frames for that video (up to 73).
- `label`: integer class index from the HDF5 labels array.
- `category_name`: string category name (e.g. `"Gaming"`).
SingleVideoClassifier
SingleVideoClassifier loads the ensemble of trained checkpoints and provides standard and TTA prediction methods.
Initialization
- Checkpoint paths: list of paths to `.pt` checkpoint files. Paths can be strings or `Path` objects. Order affects the `per_model_scores` breakdown but not the final ensemble result.
- Device: preferred compute device. The actual device used is `cuda` only when `torch.cuda.is_available()` returns `True`; otherwise it falls back to `cpu` regardless of this parameter.

Call `load_models()` before calling any prediction method.
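The device fallback described above amounts to a few lines; a minimal sketch of the assumed logic (the function name `resolve_device` is not from the script):

```python
import torch

def resolve_device(preferred: str = "cuda") -> torch.device:
    """Use CUDA only when it is actually available; otherwise fall back to CPU."""
    if preferred == "cuda" and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")
```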
load_models()
Loads each checkpoint, reconstructs the `SuperEnhancedTemporalModel` from the saved `model_config`, and calls `model.eval()`.

After loading:

- `classifier.models`: list of loaded `SuperEnhancedTemporalModel` instances.
- `classifier.class_names`: `['Animation', 'Flat_Content', 'Gaming', 'Natural_Content']`.
- `classifier.feature_dim`: `1280`.
- `classifier.num_classes`: `4`.
predict_standard(features)
Runs standard ensemble inference without augmentation. Each model processes `features` independently, and the softmax probabilities are averaged across all models.

Returns a `torch.Tensor` of shape `[num_classes]` with the averaged softmax probabilities.
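The averaging rule can be illustrated without the models themselves. A minimal stdlib sketch (the real code operates on `torch` tensors; function names here are for illustration only):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(per_model_logits):
    """Average per-model softmax distributions into one ensemble distribution."""
    dists = [softmax(logits) for logits in per_model_logits]
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
```

Averaging probabilities (rather than logits) means a single over-confident model cannot dominate the ensemble.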
predict_with_tta(features)
Runs Test-Time Augmentation using 4 augmentation modes. Each mode independently calls `predict_standard`, then the probabilities are averaged.
| TTA mode | Description |
|---|---|
| Mode 1: Original | Features as-is |
| Mode 2: Reverse | `torch.flip(features, dims=[0])` — reversed temporal order |
| Mode 3: Speed up | Subsample to `T//2` frames uniformly (requires `T > 10`) |
| Mode 4: Speed down | Upsample to `T*1.5` frames by interpolating indices (requires `T > 10`) |
Returns a `torch.Tensor` of shape `[num_classes]` with TTA-averaged probabilities.
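The frame-index arithmetic behind the four modes can be sketched as follows. This is assumed index math for illustration; the real code manipulates the feature tensor directly with `torch` ops.

```python
def tta_frame_orders(T: int) -> dict:
    """Frame-index sequences for each TTA mode, given T input frames."""
    modes = {
        "original": list(range(T)),
        "reverse": list(range(T - 1, -1, -1)),  # torch.flip(..., dims=[0])
    }
    if T > 10:
        half = T // 2
        # Mode 3: uniformly subsample to T // 2 frames.
        modes["speed_up"] = [round(i * (T - 1) / (half - 1)) for i in range(half)]
        longer = int(T * 1.5)
        # Mode 4: stretch to ~1.5x length by interpolating indices.
        modes["speed_down"] = [round(i * (T - 1) / (longer - 1)) for i in range(longer)]
    return modes
```

For short clips (`T <= 10`) only the original and reversed orders are used, which matches the `T > 10` guard in the table above.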
classify_video(features, true_label, video_name, use_tta)
Runs inference and returns a structured result dictionary.

Parameters:

- `features`: frame features tensor of shape `[T, feature_dim]`.
- `true_label`: ground-truth class index. Pass `None` for external videos where the true label is unknown. When provided, the result includes `is_correct`.
- `video_name`: display name used in console output and the result dictionary.
- `use_tta`: when `True`, calls `predict_with_tta`; otherwise calls `predict_standard`.

The result dictionary contains:

- The value passed as `video_name`.
- The class name corresponding to `true_label`, or `null` if `true_label` was `None`.
- The class name with the highest ensemble probability.
- The confidence of the predicted class as a percentage.
- `is_correct`: whether the prediction matched `true_label`; `null` when `true_label` was not provided.
- The ensemble probability for every class, keyed by class name, as percentages.
- An ISO 8601 timestamp of when inference ran.
- Whether TTA was applied.
- The number of models used in the ensemble.
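A sketch of how such a result dictionary might be assembled. The field names here are assumptions: the source describes the contents of each field but not the exact keys.

```python
from datetime import datetime, timezone

def build_result(video_name, class_names, probs, true_label=None,
                 used_tta=False, num_models=1):
    """Assemble a classify_video()-style result dict (key names assumed)."""
    pred = max(range(len(probs)), key=probs.__getitem__)
    return {
        "video_name": video_name,
        "true_category": class_names[true_label] if true_label is not None else None,
        "predicted_category": class_names[pred],
        "confidence": round(probs[pred] * 100, 2),      # percentage
        "is_correct": (pred == true_label) if true_label is not None else None,
        "all_probabilities": {n: round(p * 100, 2) for n, p in zip(class_names, probs)},
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "used_tta": used_tta,
        "num_models": num_models,
    }
```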
External video inference
When the input is a URL or a local file path not present in the feature index, the script downloads or reads the video and extracts features on the fly before classifying.

download_video_if_needed(url_or_path)
If `url_or_path` is an existing local file, it is returned unchanged. Otherwise:

- yt-dlp: attempts to download the best MP4 using `yt_dlp.YoutubeDL`. Works with YouTube URLs and other yt-dlp-supported platforms.
- requests fallback: if yt-dlp fails or is not installed, streams the URL directly via `requests.get`. This only works for direct `.mp4` links.
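A sketch of the two-tier download logic (assumed structure; the output filename and the format selector are illustrative, not taken from the script):

```python
import os

def download_video_if_needed(url_or_path: str,
                             out_path: str = "downloaded_video.mp4") -> str:
    """Return a local video path, downloading the URL if necessary."""
    if os.path.isfile(url_or_path):
        return url_or_path  # already a local file: return unchanged
    try:
        import yt_dlp  # preferred: handles YouTube and many other platforms
        opts = {"format": "best[ext=mp4]/best", "outtmpl": out_path}
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.download([url_or_path])
    except Exception:
        import requests  # fallback: only works for direct .mp4 links
        resp = requests.get(url_or_path, stream=True, timeout=60)
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return out_path
```

Catching a broad `Exception` around the yt-dlp branch covers both `ImportError` (yt-dlp not installed) and download failures, matching the fallback behavior described above.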
extract_video_features(video_path, model_feature_dim, num_frames, device)
Extracts `[T, 1280]` features from a raw video file using an EfficientNet-B0 backbone.
- `video_path`: path to the local video file.
- `model_feature_dim`: expected output feature dimension. Should match `classifier.feature_dim` (1280). If the backbone output dimension differs, a linear projection is applied automatically.
- `num_frames`: number of frames to sample uniformly from the video. Uses the same count as the pre-extracted test features.
- `device`: compute device for the EfficientNet-B0 backbone. Accepts `'cpu'`, `'cuda'`, or a `torch.device` object.

The function:

- Opens the video with OpenCV and counts total frames.
- Computes `num_frames` uniformly spaced frame indices.
- Loads pretrained EfficientNet-B0 (`torchvision.models.efficientnet_b0(pretrained=True)`).
- Applies the EfficientNet default preprocessing: resize to 256, center crop to 224, normalize with ImageNet mean/std.
- Passes frames through `eff.features` (the convolutional backbone) and applies global average pooling to get `[T, 1280]` features.
- Returns the features on CPU as a `torch.Tensor`.
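The steps above can be sketched as follows. This is assumed code, not the script's exact implementation: the dimension-mismatch projection is omitted, only `uniform_frame_indices` is dependency-free, and the rest requires OpenCV and torchvision.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """num_frames uniformly spaced indices across [0, total_frames - 1]."""
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

def extract_video_features(video_path, model_feature_dim=1280,
                           num_frames=73, device="cpu"):
    # Heavy deps imported lazily so the index helper above stays standalone.
    import cv2
    import torch
    from torchvision import models, transforms

    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    eff = models.efficientnet_b0(pretrained=True).to(device).eval()
    feats = []
    with torch.no_grad():
        for idx in uniform_frame_indices(total, num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue  # skip unreadable frames, so T may be < num_frames
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0).to(device)
            fmap = eff.features(x)                               # [1, 1280, 7, 7]
            feats.append(fmap.mean(dim=[2, 3]).squeeze(0).cpu())  # global avg pool
    cap.release()
    return torch.stack(feats)  # [T, 1280] on CPU
```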
EfficientNet-B0 feature extraction can take 1–5 minutes for a typical video depending on hardware. The pre-extracted HDF5 approach is significantly faster for test-set videos.
Full external video example
Batch test mode
Enter `test` at the video selection prompt to run an accuracy test on a random sample of the test set.
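A sketch of what such a batch test might compute (assumed logic; `classify_fn` stands in for a call to `classify_video` with a known true label):

```python
import random

def batch_accuracy(video_names, classify_fn, sample_size=20, rng=None):
    """Accuracy (%) of classify_fn over a random sample of video names.

    classify_fn(name) must return a dict with an 'is_correct' bool,
    as classify_video() does when a true label is provided."""
    rng = rng or random.Random()
    sample = rng.sample(list(video_names), min(sample_size, len(video_names)))
    results = [classify_fn(name) for name in sample]
    correct = sum(1 for r in results if r["is_correct"])
    return 100.0 * correct / len(results)
```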