Preprocessing is split into two sequential stages: converting raw video files into .pt tensors, then running a CNN backbone to extract frame-level features stored in HDF5 files for fast training.

Stage 1 — Video to Tensor

Configuration

| Parameter | Value |
| --- | --- |
| frames_per_video | 64 |
| image_size | 256 × 256 |
| processing_mode | gpu |
| clip_duration | 10 seconds |

Running the Preprocessor

python video_classification_project/src/data/preprocess_videos.py
This launches an interactive CLI that prompts for processing mode (CPU or GPU) and a per-subcategory frame sampling strategy.

Frame Sampling Strategies

The uniform strategy samples frames evenly across the entire video duration. It is best for content with consistent visual characteristics throughout.
start_frame = 0
end_frame = total_frames
step = total_frames / max_frames
frame_indices = [start_frame + int(i * step) for i in range(max_frames)]
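As a quick check of the arithmetic, here is the uniform strategy as a self-contained sketch (the values of total_frames and max_frames below are illustrative, not from the project):

```python
# Uniform frame sampling: pick max_frames indices evenly spaced
# across a video of total_frames frames.
total_frames = 300   # illustrative
max_frames = 64      # matches frames_per_video in the config

step = total_frames / max_frames
frame_indices = [int(i * step) for i in range(max_frames)]

print(frame_indices[:4])   # [0, 4, 9, 14]
print(len(frame_indices))  # 64
print(frame_indices[-1])   # 295 -- always within the video
```

Because `step` is kept as a float and truncated per index, the last index never exceeds `total_frames - 1`, so no bounds clamping is needed.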

GPU Batch Processing

On the NVIDIA A100 MIG partition, the preprocessor batches frame tensors onto the GPU for normalization using ImageNet statistics:
# Set once on the preprocessor (e.g. in __init__):
self.normalize_mean = torch.tensor([0.485, 0.456, 0.406], device='cuda:0').view(3, 1, 1)
self.normalize_std  = torch.tensor([0.229, 0.224, 0.225], device='cuda:0').view(3, 1, 1)

def normalize_on_gpu(self, tensor_batch):
    return (tensor_batch - self.normalize_mean) / self.normalize_std
Batch size is computed dynamically from available GPU memory (using 40% of free VRAM) with a hard cap of 4 for MIG safety.
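The dynamic batch-size rule can be sketched as follows. This is a hypothetical reconstruction, not the project's exact code: in practice `free_bytes` would come from `torch.cuda.mem_get_info()`, and the per-sample memory estimate is an assumption.

```python
def compute_batch_size(free_bytes: int, bytes_per_sample: int,
                       fraction: float = 0.40, cap: int = 4) -> int:
    """Derive a batch size from available VRAM: spend `fraction` of free
    memory, never exceeding `cap` (MIG safety), never dropping below 1.
    `free_bytes` would come from torch.cuda.mem_get_info() in practice."""
    budget = int(free_bytes * fraction)
    return max(1, min(cap, budget // bytes_per_sample))

# One 64-frame clip of 3 x 256 x 256 float32 frames:
clip_bytes = 64 * 3 * 256 * 256 * 4           # ~48 MiB per video

print(compute_batch_size(10 * 1024**3, clip_bytes))   # ample VRAM -> capped at 4
print(compute_batch_size(100 * 1024**2, clip_bytes))  # tight VRAM -> floor of 1
```

The cap dominates whenever memory is plentiful, so on a healthy MIG slice the effective batch size is simply 4.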

Output Format

Each subcategory produces a processed_data.pt file:
data_dict = {
    'videos': processed_data,        # Tensor [N, T, C, H, W]
    'labels': torch.tensor(labels),  # Tensor [N]
    'filenames': filenames,          # List[str]
    'category_mapping': {category: category_idx}
}
torch.save(data_dict, split_output_dir / 'processed_data.pt')
A companion metadata.json records frames per video, image size, sampling strategy, processing mode, and augmentation flag.
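Downstream code can round-trip this format with torch.save/torch.load; the sketch below uses tiny dummy shapes (2 videos, 4 frames at 8 × 8) in place of the real 64-frame, 256 × 256 data, and the field names follow the dict above:

```python
import os
import tempfile
import torch

# Miniature stand-in for a subcategory's processed_data.pt.
data_dict = {
    'videos': torch.zeros(2, 4, 3, 8, 8),       # [N, T, C, H, W]
    'labels': torch.tensor([0, 1]),             # [N]
    'filenames': ['a.mp4', 'b.mp4'],            # List[str]
    'category_mapping': {'cats': 0, 'dogs': 1},
}

path = os.path.join(tempfile.mkdtemp(), 'processed_data.pt')
torch.save(data_dict, path)

loaded = torch.load(path)
# All per-video fields stay aligned on the first axis.
assert loaded['videos'].shape[0] == loaded['labels'].shape[0] == len(loaded['filenames'])
print(tuple(loaded['videos'].shape))  # (2, 4, 3, 8, 8)
```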

Stage 2 — CNN Feature Extraction

EnhancedFeatureExtractor runs a frozen pretrained CNN backbone over every .pt file and writes per-split HDF5 feature files.

Supported Backbones

| Backbone | Feature Dim | Weights |
| --- | --- | --- |
| ResNet-50 | 2048 | IMAGENET1K_V2 |
| ResNet-101 | 2048 | IMAGENET1K_V2 |
| EfficientNet-V2-S | 1280 | IMAGENET1K_V1 |
| EfficientNet-V2-M | 1280 | IMAGENET1K_V1 |
The recommended backbone is EfficientNet-V2-S (feature dim 1280), which is what the production checkpoints were trained on.

Instantiating the Extractor

from model_train_new import EnhancedFeatureExtractor

extractor = EnhancedFeatureExtractor(
    data_dir="data/processed",
    output_dir="data/features",
    backbone="efficientnet_v2_s",
    device="cuda",
    multi_scale=True
)

# Extract features for all three splits
extractor.extract_all_splits(batch_size=24)
extract_all_splits() calls extract_features_from_split() sequentially for train, val, and test.

Multi-Scale Feature Extraction

With multi_scale=True (the default), the extractor runs the backbone at three temporal scales per video and averages the resulting frame features:
scales = [1.0, 0.85, 1.15]
scale_features = []

for scale in scales:
    if scale != 1.0:
        new_length = max(int(num_frames * scale), 5)
        indices = np.linspace(0, num_frames - 1, new_length).astype(int)
        scaled_video = video[indices]
    else:
        scaled_video = video

    # ... run CNN on scaled_video in mini-batches of 24 frames ...
    scale_features.append(scale_feat)

# Pad to the same length, then average across scales
max_len = max(sf.shape[0] for sf in scale_features)
padded_scales = []  # zero-pad shorter sequences
for sf in scale_features:
    if sf.shape[0] < max_len:
        padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
        sf = torch.cat([sf, padding], dim=0)
    padded_scales.append(sf)

video_features = torch.stack(padded_scales).mean(dim=0)  # [T, feature_dim]
Scales 0.85 and 1.15 simulate slower and faster playback, giving the model exposure to temporal variations at extraction time.
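The pad-and-average step can be run in isolation with dummy per-scale features (feature dim 8 rather than 1280, and hand-picked lengths, purely for brevity):

```python
import torch

# Three per-scale feature sequences of different temporal lengths.
scale_features = [torch.ones(10, 8), torch.ones(8, 8), torch.ones(12, 8)]

max_len = max(sf.shape[0] for sf in scale_features)
padded_scales = []
for sf in scale_features:
    if sf.shape[0] < max_len:
        # Zero-pad the shorter sequences up to the longest scale.
        padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
        sf = torch.cat([sf, padding], dim=0)
    padded_scales.append(sf)

video_features = torch.stack(padded_scales).mean(dim=0)  # [T, feature_dim]
print(video_features.shape)         # torch.Size([12, 8])
print(video_features[0, 0].item())  # 1.0: every scale contributes at frame 0
print(video_features[11, 0].item()) # ~0.33: only the longest scale reaches frame 11
```

Note the consequence of zero-padding: late frames are averaged against zeros from the shorter scales, so their feature magnitudes are attenuated relative to early frames.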

HDF5 Output Format

Features are zero-padded to a common frame length and saved with gzip compression:
padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim),
    dtype=np.float32
)  # shape: [N_videos, max_frames, feature_dim]

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset('features',   data=padded_features,
                        compression='gzip', compression_opts=4)
h5_file.create_dataset('labels',     data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))

h5_file.attrs['num_videos']       = num_videos
h5_file.attrs['max_frames']       = max_frames
h5_file.attrs['feature_dim']      = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale']      = self.multi_scale
h5_file.close()
For the test split the resulting file (test_features_multiscale.h5) contains 412 videos at shape [412, 73, 1280].
GPU cache is cleared every 50 files during extraction (torch.cuda.empty_cache()) to prevent memory fragmentation on the MIG partition.
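A training loop can read the features back with h5py; the dataset and attribute names below follow the writer above, while the tiny shapes and temp path are stand-ins for the real file:

```python
import json
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'test_features_multiscale.h5')

# Write a miniature file in the same layout as the extractor.
with h5py.File(path, 'w') as f:
    f.create_dataset('features', data=np.zeros((3, 5, 16), dtype=np.float32),
                     compression='gzip', compression_opts=4)
    f.create_dataset('labels', data=np.array([0, 1, 0], dtype=np.int64))
    f.create_dataset('num_frames', data=np.array([5, 4, 3], dtype=np.int32))
    f.attrs['category_mapping'] = json.dumps({'cats': 0, 'dogs': 1})

# Read it back the way a training loop would.
with h5py.File(path, 'r') as f:
    features = f['features'][:]        # [N, max_frames, feature_dim]
    labels = f['labels'][:]
    num_frames = f['num_frames'][:]    # true lengths, for masking the padding
    mapping = json.loads(f.attrs['category_mapping'])

print(features.shape, labels.tolist(), mapping)
```

The `num_frames` dataset matters at training time: since `features` is zero-padded to a common length, the model should mask frames beyond each video's true length.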
