Preprocessing is split into two sequential stages: converting raw video files into .pt tensors, then running a CNN backbone to extract frame-level features stored in HDF5 files for fast training.
Stage 1 — Video to Tensor
Configuration
| Parameter | Value |
|---|---|
| frames_per_video | 64 |
| image_size | 256 × 256 |
| processing_mode | gpu |
| clip_duration | 10 seconds |
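For reference, the table above maps onto a plain configuration dict like the following sketch; the key names are illustrative assumptions, not the script's actual variable names.

```python
# Hypothetical mirror of the preprocessing configuration table above;
# key names are assumptions, not taken from preprocess_videos.py itself.
PREPROCESS_CONFIG = {
    "frames_per_video": 64,
    "image_size": (256, 256),   # height x width in pixels
    "processing_mode": "gpu",
    "clip_duration": 10.0,      # seconds
}
```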
Running the Preprocessor
python video_classification_project/src/data/preprocess_videos.py
This launches an interactive CLI that prompts for processing mode (CPU or GPU) and a per-subcategory frame sampling strategy.
Frame Sampling Strategies
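The actual per-subcategory strategy is chosen at the CLI prompt; as one plausible example, a uniform temporal strategy picks 64 evenly spaced frames across the clip. The sketch below is an assumption about how such a strategy could be implemented, not the script's real code.

```python
import numpy as np

def uniform_sample_indices(total_frames: int, frames_per_video: int = 64) -> np.ndarray:
    """Evenly spaced frame indices across a clip (one plausible sampling strategy).

    Short clips are padded by repeating the last frame so every video yields
    exactly `frames_per_video` indices.
    """
    if total_frames <= frames_per_video:
        idx = np.arange(total_frames)
        pad = np.full(frames_per_video - total_frames, total_frames - 1)
        return np.concatenate([idx, pad])
    return np.linspace(0, total_frames - 1, frames_per_video).astype(int)
```

For a 10-second clip at 30 fps (300 frames), this selects every ~4.7th frame from the first to the last.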
GPU Batch Processing
On the NVIDIA A100 MIG partition the preprocessor batches frame tensors onto the GPU for normalization, using ImageNet statistics:
normalize_mean = torch.tensor([0.485, 0.456, 0.406], device='cuda:0').view(3, 1, 1)
normalize_std = torch.tensor([0.229, 0.224, 0.225], device='cuda:0').view(3, 1, 1)
def normalize_on_gpu(self, tensor_batch):
    return (tensor_batch - self.normalize_mean) / self.normalize_std
Batch size is computed dynamically from available GPU memory (using 40% of free VRAM) with a hard cap of 4 for MIG safety.
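The dynamic batch sizing can be sketched as follows. The function name and the explicit `free_bytes` parameter are assumptions for illustration; in the real script the free-byte count would come from `torch.cuda.mem_get_info()`.

```python
def compute_gpu_batch_size(free_bytes: int,
                           frames_per_video: int = 64,
                           image_size: int = 256,
                           vram_fraction: float = 0.40,
                           hard_cap: int = 4) -> int:
    """How many videos fit in `vram_fraction` of free VRAM, capped for MIG safety.

    In practice `free_bytes` would be torch.cuda.mem_get_info()[0].
    """
    # One float32 video tensor: T x C x H x W x 4 bytes (~48 MiB at 64x3x256x256)
    bytes_per_video = frames_per_video * 3 * image_size * image_size * 4
    fit = int((free_bytes * vram_fraction) // bytes_per_video)
    return max(1, min(fit, hard_cap))
```

With several gigabytes free, far more than 4 videos would fit, but the hard cap keeps the batch at 4 on the MIG partition; with very little free VRAM the floor of 1 keeps the pipeline moving.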
Each subcategory produces a processed_data.pt file:
data_dict = {
    'videos': processed_data,         # Tensor [N, T, C, H, W]
    'labels': torch.tensor(labels),   # Tensor [N]
    'filenames': filenames,           # List[str]
    'category_mapping': {category: category_idx}
}
torch.save(data_dict, split_output_dir / 'processed_data.pt')
A companion metadata.json records frames per video, image size, sampling strategy, processing mode, and augmentation flag.
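A plausible shape for that metadata file is sketched below; the field names are assumptions based on the description above, not the exact schema, and the temp directory stands in for the real split output directory.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical shape of the companion metadata.json; field names are
# assumptions, not the script's real schema.
metadata = {
    "frames_per_video": 64,
    "image_size": 256,
    "sampling_strategy": "uniform",  # whichever strategy was chosen at the prompt
    "processing_mode": "gpu",
    "augmentation": False,
}

split_output_dir = Path(tempfile.mkdtemp())  # stands in for the real split directory
(split_output_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
```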
Stage 2 — Feature Extraction
EnhancedFeatureExtractor runs a frozen pretrained CNN backbone over every .pt file and writes per-split HDF5 feature files.
Supported Backbones
| Backbone | Feature Dim | Weights |
|---|---|---|
| ResNet-50 | 2048 | IMAGENET1K_V2 |
| ResNet-101 | 2048 | IMAGENET1K_V2 |
| EfficientNet-V2-S | 1280 | IMAGENET1K_V1 |
| EfficientNet-V2-M | 1280 | IMAGENET1K_V1 |
The recommended backbone is EfficientNet-V2-S (feature dim 1280), which is what the production checkpoints were trained on.
from model_train_new import EnhancedFeatureExtractor

extractor = EnhancedFeatureExtractor(
    data_dir="data/processed",
    output_dir="data/features",
    backbone="efficientnet_v2_s",
    device="cuda",
    multi_scale=True,
)

# Extract features for all three splits
extractor.extract_all_splits(batch_size=24)
extract_all_splits() calls extract_features_from_split() sequentially for train, val, and test.
With multi_scale=True (the default), the extractor runs the backbone at three temporal scales per video and averages the resulting frame features:
scales = [1.0, 0.85, 1.15]
scale_features = []
for scale in scales:
    if scale != 1.0:
        new_length = max(int(num_frames * scale), 5)
        indices = np.linspace(0, num_frames - 1, new_length).astype(int)
        scaled_video = video[indices]
    else:
        scaled_video = video
    # ... run CNN on scaled_video in mini-batches of 24 frames -> scale_feat ...
    scale_features.append(scale_feat)

# Pad to the same length, then average across scales
max_len = max(sf.shape[0] for sf in scale_features)
padded_scales = []  # zero-pad shorter sequences
for sf in scale_features:
    if sf.shape[0] < max_len:
        padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
        sf = torch.cat([sf, padding], dim=0)
    padded_scales.append(sf)
video_features = torch.stack(padded_scales).mean(dim=0)  # [T, feature_dim]
Scale 0.85 compresses the frame sequence (simulating faster playback) and 1.15 stretches it (simulating slower playback), giving the model exposure to temporal variations at extraction time.
Features are zero-padded to a common frame length and saved with gzip compression:
padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim),
    dtype=np.float32,
)  # shape: [N_videos, max_frames, feature_dim]

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset('features', data=padded_features,
                       compression='gzip', compression_opts=4)
h5_file.create_dataset('labels', data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))
h5_file.attrs['num_videos'] = num_videos
h5_file.attrs['max_frames'] = max_frames
h5_file.attrs['feature_dim'] = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale'] = self.multi_scale
h5_file.close()
For the test split the resulting file (test_features_multiscale.h5) contains 412 videos at shape [412, 73, 1280].
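Reading such a file back at training time is a straightforward round trip. The sketch below writes a tiny file in the same layout (toy sizes, illustrative path) and reads it the way a training loader would.

```python
import json
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Round-trip sketch: write a tiny feature file in the layout shown above,
# then read it back. Sizes here are toy values, not the real 412-video split.
N, T, D = 3, 5, 1280
demo_path = Path(tempfile.mkdtemp()) / "demo_features.h5"

with h5py.File(demo_path, "w") as f:
    f.create_dataset("features", data=np.zeros((N, T, D), dtype=np.float32),
                     compression="gzip", compression_opts=4)
    f.create_dataset("labels", data=np.array([0, 1, 0], dtype=np.int64))
    f.create_dataset("num_frames", data=np.array([5, 3, 4], dtype=np.int32))
    f.attrs["feature_dim"] = D
    f.attrs["category_mapping"] = json.dumps({"cat_a": 0, "cat_b": 1})

with h5py.File(demo_path, "r") as f:
    features = f["features"][:]       # [N, max_frames, feature_dim]
    labels = f["labels"][:]
    num_frames = f["num_frames"][:]   # true (unpadded) length per video
    mapping = json.loads(f.attrs["category_mapping"])
```

The `num_frames` dataset lets the training loop mask out the zero-padded tail of each sequence.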
The GPU cache is cleared every 50 files during extraction (via torch.cuda.empty_cache()) to prevent memory fragmentation on the MIG partition.
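The periodic flush amounts to a counter inside the extraction loop, roughly as below; the function and argument names are illustrative assumptions, and the CUDA call is guarded so the sketch also runs on CPU-only machines.

```python
import torch

def extract_with_cache_flush(pt_files, process_file, flush_every: int = 50) -> int:
    """Run `process_file` over every .pt file, flushing CUDA's caching
    allocator every `flush_every` files (names here are illustrative).

    Returns the number of flushes performed.
    """
    flushes = 0
    for i, pt_file in enumerate(pt_files, start=1):
        process_file(pt_file)
        if i % flush_every == 0:
            if torch.cuda.is_available():
                # Release cached blocks back to the driver to limit
                # fragmentation on the MIG partition.
                torch.cuda.empty_cache()
            flushes += 1
    return flushes
```

Note that empty_cache() does not free tensors still referenced by Python; it only returns unused cached blocks, which is exactly what helps against fragmentation.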