Stage 1 of the pipeline converts raw video frames into compact, semantically rich feature vectors using a frozen pretrained CNN. These vectors are saved to disk once and reused throughout training, eliminating the cost of running the backbone on every epoch.
EnhancedFeatureExtractor (defined in model_train_new.py, line 195) orchestrates backbone loading, multi-scale feature extraction, and HDF5 serialization.
```python
class EnhancedFeatureExtractor:
    """Extract features with multiple backbones and scales."""

    def __init__(self, data_dir, output_dir, backbone='resnet101',
                 device='cuda', multi_scale=True):
        self.data_dir = Path(data_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.device = torch.device(device if torch.cuda.is_available() else 'cpu')
        self.multi_scale = multi_scale
```
On instantiation the class:

- Loads the chosen pretrained backbone from `torchvision.models`.
- Removes the classification head, replacing it with `nn.Identity()`.
- Freezes all parameters (`requires_grad=False`).
- Moves the model to GPU.
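The head-stripping and freezing pattern can be sketched in isolation. This is a minimal illustration using a stand-in module, not the project's real backbone (which comes from `torchvision.models`):

```python
import torch
import torch.nn as nn

# Stand-in "backbone" with a final classification head, used only to
# demonstrate the strip-and-freeze pattern described above.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),               # stand-in classification head
)
backbone[-1] = nn.Identity()        # strip the head, keep pooled features
for p in backbone.parameters():
    p.requires_grad = False         # freeze all remaining weights
backbone.eval()                     # disable dropout / BN statistics updates

feats = backbone(torch.randn(2, 3, 32, 32))
print(feats.shape)  # torch.Size([2, 8])
```

Because every parameter has `requires_grad=False`, no gradients are computed through the backbone, and the same extracted features can be reused across all training epochs.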
Supported Backbones

**ResNet-50**

```python
self.cnn = models.resnet50(weights='IMAGENET1K_V2')
self.feature_dim = self.cnn.fc.in_features  # 2048
self.cnn.fc = nn.Identity()
```

- Feature dim: 2048
- Tradeoff: fastest extraction, good baseline accuracy (~92–94% end-to-end)

**ResNet-101**

```python
self.cnn = models.resnet101(weights='IMAGENET1K_V2')
self.feature_dim = self.cnn.fc.in_features  # 2048
self.cnn.fc = nn.Identity()
```

- Feature dim: 2048
- Tradeoff: +2–3% accuracy over ResNet-50, moderate extraction time

**EfficientNet-V2-S**

```python
self.cnn = models.efficientnet_v2_s(weights='IMAGENET1K_V1')
self.feature_dim = self.cnn.classifier[1].in_features  # 1280
self.cnn.classifier = nn.Identity()
```

- Feature dim: 1280
- Tradeoff: efficient compute, accuracy similar to ResNet-50

**EfficientNet-V2-M**

```python
self.cnn = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
self.feature_dim = self.cnn.classifier[1].in_features  # 1280
self.cnn.classifier = nn.Identity()
```

- Feature dim: 1280
- Tradeoff: best accuracy of the four backbones (+3–4% over ResNet-50), slowest extraction
The production ensemble checkpoints (best_ensemble_model_1–4.pt) were trained with feature_dim=1280, which corresponds to an EfficientNet-V2 backbone. Before loading a new checkpoint, confirm that the chosen backbone's feature dimension matches the feature_dim attribute stored in the .h5 file.
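One way to perform that check is a small helper along these lines (illustrative, not part of the repository's code; the path and function name are hypothetical):

```python
import h5py

def check_feature_dim(h5_path, expected_dim):
    """Fail fast if a cached feature file doesn't match a checkpoint's input size.

    Illustrative helper: reads the feature_dim attribute written by the
    extractor and compares it to the dimension the checkpoint expects.
    """
    with h5py.File(h5_path, 'r') as f:
        stored = int(f.attrs['feature_dim'])
    if stored != expected_dim:
        raise ValueError(
            f'feature cache is {stored}-d but checkpoint expects {expected_dim}-d'
        )
    return stored
```

Calling `check_feature_dim('train_features_multiscale.h5', 1280)` before restoring an EfficientNet-V2 checkpoint catches a backbone mismatch before it surfaces as a shape error mid-training.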
Feature Dimensions per Backbone
| Backbone | Feature Dim | ImageNet Weights | Relative Accuracy |
|---|---|---|---|
| resnet50 | 2048 | IMAGENET1K_V2 | Baseline |
| resnet101 | 2048 | IMAGENET1K_V2 | +2–3% |
| efficientnet_v2_s | 1280 | IMAGENET1K_V1 | ~Baseline |
| efficientnet_v2_m | 1280 | IMAGENET1K_V1 | +3–4% |
Preprocessing
Before the CNN sees any frame, videos are preprocessed offline:
| Parameter | Value |
|---|---|
| Image size | 256 × 256 pixels |
| Frames per video | 64 (uniformly sampled) |
| Processing device | GPU (CUDA) |
| Normalization | ImageNet mean/std |
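The normalization step can be sketched as follows, assuming the standard ImageNet statistics (the actual offline preprocessing script is separate and not shown here; `normalize_frames` is an illustrative name):

```python
import torch

# Standard ImageNet channel statistics, reshaped to broadcast over [T, 3, H, W].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize_frames(frames):
    """Normalize a frame stack; frames: float tensor [T, 3, H, W] in [0, 1]."""
    return (frames - IMAGENET_MEAN) / IMAGENET_STD

video = torch.rand(64, 3, 256, 256)   # 64 uniformly sampled 256x256 frames
normalized = normalize_frames(video)
```

Matching the backbone's pretraining normalization matters: a frozen ImageNet model fed unnormalized pixels produces systematically shifted features.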
The configuration_analysis.json records these settings for reproducibility:
```json
{
  "preprocessing_info": {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu"
  }
}
```
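A quick reproducibility check can parse these settings back and compare them to what the current run expects. In the sketch below the JSON is inlined as a string; in practice you would read configuration_analysis.json from disk:

```python
import json

# Parse the recorded preprocessing settings and verify them against the
# values this pipeline assumes (64 frames at 256x256).
config = json.loads('''
{
  "preprocessing_info": {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu"
  }
}
''')
info = config['preprocessing_info']
assert info['frames_per_video'] == 64
assert tuple(info['image_size']) == (256, 256)
```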
When multi_scale=True, each video is processed at three temporal scales to capture motion dynamics at different speeds.
```python
if self.multi_scale:
    scales = [1.0, 0.85, 1.15]
    scale_features = []
    for scale in scales:
        if scale != 1.0:
            new_length = max(int(num_frames * scale), 5)
            indices = np.linspace(0, num_frames - 1, new_length).astype(int)
            scaled_video = video[indices]
        else:
            scaled_video = video

        # Extract features for this scale
        scale_num_frames = scaled_video.shape[0]
        frame_features = []
        for i in range(0, scale_num_frames, batch_size):
            batch = scaled_video[i:i + batch_size].to(self.device)
            features = self.cnn(batch)
            frame_features.append(features.cpu())
            del batch
        scale_feat = torch.cat(frame_features, dim=0)
        scale_features.append(scale_feat)

    # Concatenate multi-scale features: pad every scale to the longest length
    max_len = max(sf.shape[0] for sf in scale_features)
    padded_scales = []
    for sf in scale_features:
        if sf.shape[0] < max_len:
            padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
            sf = torch.cat([sf, padding], dim=0)
        padded_scales.append(sf)

    # Average across scales
    video_features = torch.stack(padded_scales).mean(dim=0)
```
The 0.85× scale simulates faster motion (fewer frames), while 1.15× simulates slower motion (more frames via repetition). Averaging across all three scales makes the resulting feature vector robust to playback speed variation.
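The index arithmetic can be checked in isolation. Note that 64 × 1.15 rounds to 73, which matches the max_frames value recorded for the cached test split below:

```python
import numpy as np

num_frames = 64  # frames per video after preprocessing
lengths = {}
for scale in (1.0, 0.85, 1.15):
    new_length = max(int(num_frames * scale), 5)
    indices = np.linspace(0, num_frames - 1, new_length).astype(int)
    lengths[scale] = (new_length, len(np.unique(indices)))

# 0.85x keeps 54 of 64 frames (subsampling = faster apparent motion);
# 1.15x yields 73 indices over 64 frames, so some frames repeat
# (slower apparent motion).
print(lengths)  # {1.0: (64, 64), 0.85: (54, 54), 1.15: (73, 64)}
```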
The main extraction loop scans a data split directory, processes each video, and accumulates features:
```python
def extract_features_from_split(self, split='train', batch_size=24):
    """Extract features for an entire split with multi-scale support."""
    split_dir = self.data_dir / split
    suffix = '_multiscale' if self.multi_scale else ''
    output_file = self.output_dir / f'{split}_features{suffix}.h5'

    # Scan all video files
    video_files = []
    labels = []
    category_mapping = {}
    category_dirs = sorted([d for d in split_dir.glob("*") if d.is_dir()])
    for cat_idx, category_dir in enumerate(category_dirs):
        category_name = category_dir.name
        category_mapping[category_name] = cat_idx
        for subcat_dir in sorted(category_dir.glob("*")):
            data_file = subcat_dir / 'processed_data.pt'
            if data_file.exists():
                video_files.append(data_file)
                labels.append(cat_idx)

    # Extract features with gradients disabled
    all_features = []
    all_labels = []
    all_num_frames = []
    with torch.no_grad():
        for file_idx, video_file in enumerate(tqdm(video_files)):
            data = torch.load(video_file, map_location='cpu')
            videos = data['videos'] if isinstance(data, dict) else data
            for video_idx in range(videos.shape[0]):
                video = videos[video_idx]  # [T, C, H, W]
                # ... multi-scale extraction ...
                all_features.append(video_features.numpy())
                all_labels.append(labels[file_idx])
                all_num_frames.append(video_features.shape[0])
```
Saving to HDF5
After processing all videos, features are zero-padded to uniform length and written to a compressed HDF5 file:
```python
# Save to HDF5
max_frames = max(all_num_frames)
num_videos = len(all_features)
padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim), dtype=np.float32
)
for i, features in enumerate(all_features):
    padded_features[i, :features.shape[0], :] = features

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset(
    'features', data=padded_features,
    compression='gzip', compression_opts=4
)
h5_file.create_dataset('labels', data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))
h5_file.attrs['num_videos'] = num_videos
h5_file.attrs['max_frames'] = max_frames
h5_file.attrs['feature_dim'] = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale'] = self.multi_scale
h5_file.close()
```
The resulting .h5 file structure for the test split (from configuration_analysis.json):
| Attribute | Value |
|---|---|
| features shape | [412, 73, 1280] |
| labels shape | [412] |
| num_frames stats | min=73, max=73, mean=73 |
| feature_dim | 1280 |
| multi_scale | true |
| Compression | gzip level 4 |
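Reading a cached split back follows the same layout. The helper below is an illustrative sketch (the function name is not from the repo), assuming the file was written by the saving code above:

```python
import json
import h5py

def load_split(h5_path):
    """Load a cached feature split; returns arrays plus the label mapping."""
    with h5py.File(h5_path, 'r') as f:
        features = f['features'][:]        # [num_videos, max_frames, feature_dim]
        labels = f['labels'][:]            # [num_videos]
        num_frames = f['num_frames'][:]    # true (unpadded) lengths per video
        mapping = json.loads(f.attrs['category_mapping'])
    return features, labels, num_frames, mapping
```

The num_frames array matters downstream: because features are zero-padded to max_frames, a sequence model should mask positions beyond each video's true length.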
To extract train, val, and test in sequence:
```python
extractor = EnhancedFeatureExtractor(
    data_dir="/path/to/processed",
    output_dir="/path/to/features_enhanced",
    backbone='efficientnet_v2_m',
    multi_scale=True,
    device='cuda'
)
extractor.extract_all_splits(batch_size=20)
```
Extraction is a one-time cost of 3–5 hours on an NVIDIA A100 MIG partition. GPU memory usage during extraction is 4–5 GB. Use batch_size=20 or lower if you encounter out-of-memory errors.
Memory Management
The extractor performs explicit GPU cache clearing every 50 videos to prevent memory fragmentation:
```python
if file_idx % 50 == 0:
    torch.cuda.empty_cache()
    gc.collect()
```
Each frame batch is deleted immediately after processing (`del batch`) to minimize peak VRAM usage.