Preprocessing is split into two sequential stages: converting raw video files into .pt tensors, then running a CNN backbone to extract frame-level features stored in HDF5 files for fast training.

Stage 1 — Video to Tensor

Configuration

| Parameter | Value |
| --- | --- |
| frames_per_video | 64 |
| image_size | 256 × 256 |
| processing_mode | gpu |
| clip_duration | 10 seconds |

Running the Preprocessor

python video_classification_project/src/data/preprocess_videos.py
This launches an interactive CLI that prompts for processing mode (CPU or GPU) and a per-subcategory frame sampling strategy.

Frame Sampling Strategies

The uniform strategy samples frames evenly across the entire video duration. It is best for content with consistent visual characteristics throughout.
start_frame = 0
end_frame = total_frames
step = total_frames / max_frames
frame_indices = [start_frame + int(i * step) for i in range(max_frames)]
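As a quick check of the arithmetic, here is the uniform strategy as a self-contained sketch (the values of total_frames and max_frames below are illustrative, not from the project):

```python
# Uniform frame sampling: pick max_frames indices evenly spaced
# across a video of total_frames frames.
total_frames = 300   # illustrative
max_frames = 64      # matches frames_per_video in the config

step = total_frames / max_frames
frame_indices = [int(i * step) for i in range(max_frames)]

print(frame_indices[:4])   # [0, 4, 9, 14]
print(len(frame_indices))  # 64
print(frame_indices[-1])   # 295 -- always within the video
```

Because `step` is kept as a float and truncated per index, the last index never exceeds `total_frames - 1`, so no bounds clamping is needed.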

GPU Batch Processing

On the NVIDIA A100 MIG partition, the preprocessor batches frame tensors onto the GPU for normalization using ImageNet statistics:
# Set once on the preprocessor (e.g. in __init__):
self.normalize_mean = torch.tensor([0.485, 0.456, 0.406], device='cuda:0').view(3, 1, 1)
self.normalize_std  = torch.tensor([0.229, 0.224, 0.225], device='cuda:0').view(3, 1, 1)

def normalize_on_gpu(self, tensor_batch):
    return (tensor_batch - self.normalize_mean) / self.normalize_std
Batch size is computed dynamically from available GPU memory (using 40% of free VRAM) with a hard cap of 4 for MIG safety.
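The dynamic batch-size rule can be sketched as follows. This is a hypothetical reconstruction, not the project's exact code: in practice `free_bytes` would come from `torch.cuda.mem_get_info()`, and the per-sample memory estimate is an assumption.

```python
def compute_batch_size(free_bytes: int, bytes_per_sample: int,
                       fraction: float = 0.40, cap: int = 4) -> int:
    """Derive a batch size from available VRAM: spend `fraction` of free
    memory, never exceeding `cap` (MIG safety), never dropping below 1.
    `free_bytes` would come from torch.cuda.mem_get_info() in practice."""
    budget = int(free_bytes * fraction)
    return max(1, min(cap, budget // bytes_per_sample))

# One 64-frame clip of 3 x 256 x 256 float32 frames:
clip_bytes = 64 * 3 * 256 * 256 * 4           # ~48 MiB per video

print(compute_batch_size(10 * 1024**3, clip_bytes))   # ample VRAM -> capped at 4
print(compute_batch_size(100 * 1024**2, clip_bytes))  # tight VRAM -> floor of 1
```

The cap dominates whenever memory is plentiful, so on a healthy MIG slice the effective batch size is simply 4.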

Output Format

Each subcategory produces a processed_data.pt file:
data_dict = {
    'videos': processed_data,        # Tensor [N, T, C, H, W]
    'labels': torch.tensor(labels),  # Tensor [N]
    'filenames': filenames,          # List[str]
    'category_mapping': {category: category_idx}
}
torch.save(data_dict, split_output_dir / 'processed_data.pt')
A companion metadata.json records frames per video, image size, sampling strategy, processing mode, and augmentation flag.
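Downstream code can round-trip this format with torch.save/torch.load; the sketch below uses tiny dummy shapes (2 videos, 4 frames at 8 × 8) in place of the real 64-frame, 256 × 256 data, and the field names follow the dict above:

```python
import os
import tempfile
import torch

# Miniature stand-in for a subcategory's processed_data.pt.
data_dict = {
    'videos': torch.zeros(2, 4, 3, 8, 8),       # [N, T, C, H, W]
    'labels': torch.tensor([0, 1]),             # [N]
    'filenames': ['a.mp4', 'b.mp4'],            # List[str]
    'category_mapping': {'cats': 0, 'dogs': 1},
}

path = os.path.join(tempfile.mkdtemp(), 'processed_data.pt')
torch.save(data_dict, path)

loaded = torch.load(path)
# All per-video fields stay aligned on the first axis.
assert loaded['videos'].shape[0] == loaded['labels'].shape[0] == len(loaded['filenames'])
print(tuple(loaded['videos'].shape))  # (2, 4, 3, 8, 8)
```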

Stage 2 — CNN Feature Extraction

EnhancedFeatureExtractor runs a frozen pretrained CNN backbone over every .pt file and writes per-split HDF5 feature files.

Supported Backbones

| Backbone | Feature Dim | Weights |
| --- | --- | --- |
| ResNet-50 | 2048 | IMAGENET1K_V2 |
| ResNet-101 | 2048 | IMAGENET1K_V2 |
| EfficientNet-V2-S | 1280 | IMAGENET1K_V1 |
| EfficientNet-V2-M | 1280 | IMAGENET1K_V1 |
The recommended backbone is EfficientNet-V2-S (feature dim 1280), which is what the production checkpoints were trained on.

Instantiating the Extractor

from model_train_new import EnhancedFeatureExtractor

extractor = EnhancedFeatureExtractor(
    data_dir="data/processed",
    output_dir="data/features",
    backbone="efficientnet_v2_s",
    device="cuda",
    multi_scale=True
)

# Extract features for all three splits
extractor.extract_all_splits(batch_size=24)
extract_all_splits() calls extract_features_from_split() sequentially for train, val, and test.

Multi-Scale Feature Extraction

With multi_scale=True (the default), the extractor runs the backbone at three temporal scales per video and averages the resulting frame features:
scales = [1.0, 0.85, 1.15]
scale_features = []

for scale in scales:
    if scale != 1.0:
        new_length = max(int(num_frames * scale), 5)
        indices = np.linspace(0, num_frames - 1, new_length).astype(int)
        scaled_video = video[indices]
    else:
        scaled_video = video

    # ... run CNN on scaled_video in mini-batches of 24 frames ...
    scale_features.append(scale_feat)

# Pad to the same length, then average across scales
max_len = max(sf.shape[0] for sf in scale_features)
padded_scales = []  # zero-pad shorter sequences
for sf in scale_features:
    if sf.shape[0] < max_len:
        padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
        sf = torch.cat([sf, padding], dim=0)
    padded_scales.append(sf)

video_features = torch.stack(padded_scales).mean(dim=0)  # [T, feature_dim]
Scales 0.85 and 1.15 simulate slower and faster playback, giving the model exposure to temporal variations at extraction time.
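The pad-and-average step can be run in isolation with dummy per-scale features (feature dim 8 rather than 1280, and hand-picked lengths, purely for brevity):

```python
import torch

# Three per-scale feature sequences of different temporal lengths.
scale_features = [torch.ones(10, 8), torch.ones(8, 8), torch.ones(12, 8)]

max_len = max(sf.shape[0] for sf in scale_features)
padded_scales = []
for sf in scale_features:
    if sf.shape[0] < max_len:
        # Zero-pad the shorter sequences up to the longest scale.
        padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
        sf = torch.cat([sf, padding], dim=0)
    padded_scales.append(sf)

video_features = torch.stack(padded_scales).mean(dim=0)  # [T, feature_dim]
print(video_features.shape)         # torch.Size([12, 8])
print(video_features[0, 0].item())  # 1.0: every scale contributes at frame 0
print(video_features[11, 0].item()) # ~0.33: only the longest scale reaches frame 11
```

Note the consequence of zero-padding: late frames are averaged against zeros from the shorter scales, so their feature magnitudes are attenuated relative to early frames.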

HDF5 Output Format

Features are zero-padded to a common frame length and saved with gzip compression:
padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim),
    dtype=np.float32
)  # shape: [N_videos, max_frames, feature_dim]

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset('features',   data=padded_features,
                        compression='gzip', compression_opts=4)
h5_file.create_dataset('labels',     data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))

h5_file.attrs['num_videos']       = num_videos
h5_file.attrs['max_frames']       = max_frames
h5_file.attrs['feature_dim']      = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale']      = self.multi_scale
h5_file.close()
For the test split the resulting file (test_features_multiscale.h5) contains 412 videos at shape [412, 73, 1280].
GPU cache is cleared every 50 files during extraction (torch.cuda.empty_cache()) to prevent memory fragmentation on the MIG partition.
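A training loop can read the features back with h5py; the dataset and attribute names below follow the writer above, while the tiny shapes and temp path are stand-ins for the real file:

```python
import json
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'test_features_multiscale.h5')

# Write a miniature file in the same layout as the extractor.
with h5py.File(path, 'w') as f:
    f.create_dataset('features', data=np.zeros((3, 5, 16), dtype=np.float32),
                     compression='gzip', compression_opts=4)
    f.create_dataset('labels', data=np.array([0, 1, 0], dtype=np.int64))
    f.create_dataset('num_frames', data=np.array([5, 4, 3], dtype=np.int32))
    f.attrs['category_mapping'] = json.dumps({'cats': 0, 'dogs': 1})

# Read it back the way a training loop would.
with h5py.File(path, 'r') as f:
    features = f['features'][:]        # [N, max_frames, feature_dim]
    labels = f['labels'][:]
    num_frames = f['num_frames'][:]    # true lengths, for masking the padding
    mapping = json.loads(f.attrs['category_mapping'])

print(features.shape, labels.tolist(), mapping)
```

The `num_frames` dataset matters at training time: since `features` is zero-padded to a common length, the model should mask frames beyond each video's true length.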
