Stage 1 of the pipeline converts raw video frames into compact, semantically rich feature vectors using a frozen pretrained CNN. These vectors are saved to disk once and reused throughout training, eliminating the cost of re-running the backbone in every epoch.

The EnhancedFeatureExtractor Class

EnhancedFeatureExtractor (defined in model_train_new.py, line 195) orchestrates backbone loading, multi-scale feature extraction, and HDF5 serialization.
class EnhancedFeatureExtractor:
    """Extract features with multiple backbones and scales"""
    
    def __init__(self, data_dir, output_dir, backbone='resnet101', 
                 device='cuda', multi_scale=True):
        self.data_dir = Path(data_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.device = torch.device(device if torch.cuda.is_available() else 'cpu')
        self.multi_scale = multi_scale
On instantiation, the class:
  1. Loads the chosen pretrained backbone from torchvision.models.
  2. Removes the classification head, replacing it with nn.Identity().
  3. Freezes all parameters (requires_grad=False).
  4. Moves the model to GPU.

Supported Backbones

For example, with backbone='resnet50' the model is loaded and its head removed as follows:
self.cnn = models.resnet50(weights='IMAGENET1K_V2')
self.feature_dim = self.cnn.fc.in_features  # 2048
self.cnn.fc = nn.Identity()
  • Feature dim: 2048
  • Tradeoff: Fastest extraction, good baseline accuracy (~92–94% end-to-end)
The production ensemble checkpoints (best_ensemble_model_1–4.pt) were trained with feature_dim=1280, corresponding to an EfficientNet-V2 backbone. Confirm the backbone matches feature_dim stored in the .h5 file attributes before loading a new checkpoint.
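One way to guard against a backbone/checkpoint mismatch is to compare the feature_dim attribute stored in the .h5 file against the dimension the checkpoint expects. A minimal sketch (the helper name check_feature_dim is hypothetical, not from the source):

```python
import h5py

def check_feature_dim(h5_path: str, expected_dim: int) -> int:
    """Read feature_dim from the HDF5 attributes and fail fast on mismatch.

    Hypothetical helper: illustrates the check, not code from the source.
    """
    with h5py.File(h5_path, "r") as f:
        stored = int(f.attrs["feature_dim"])
    if stored != expected_dim:
        raise ValueError(
            f"Checkpoint expects feature_dim={expected_dim}, "
            f"but {h5_path} stores feature_dim={stored}"
        )
    return stored
```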

Feature Dimensions per Backbone

| Backbone | Feature Dim | ImageNet Weights | Relative Accuracy |
|---|---|---|---|
| resnet50 | 2048 | IMAGENET1K_V2 | Baseline |
| resnet101 | 2048 | IMAGENET1K_V2 | +2–3% |
| efficientnet_v2_s | 1280 | IMAGENET1K_V1 | ~Baseline |
| efficientnet_v2_m | 1280 | IMAGENET1K_V1 | +3–4% |
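A small lookup table mirroring the rows above can keep backbone construction and downstream feature_dim handling in sync. This is an illustrative sketch; the dict name BACKBONE_SPECS does not appear in the source:

```python
# (feature_dim, torchvision weight tag) per supported backbone
BACKBONE_SPECS = {
    "resnet50":          (2048, "IMAGENET1K_V2"),
    "resnet101":         (2048, "IMAGENET1K_V2"),
    "efficientnet_v2_s": (1280, "IMAGENET1K_V1"),
    "efficientnet_v2_m": (1280, "IMAGENET1K_V1"),
}

def feature_dim_for(backbone: str) -> int:
    """Return the feature dimension a backbone produces after head removal."""
    try:
        return BACKBONE_SPECS[backbone][0]
    except KeyError:
        raise ValueError(f"Unsupported backbone: {backbone}") from None
```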

Preprocessing

Before the CNN sees any frame, videos are preprocessed offline:
| Parameter | Value |
|---|---|
| Image size | 256 × 256 pixels |
| Frames per video | 64 (uniformly sampled) |
| Processing device | GPU (CUDA) |
| Normalization | ImageNet mean/std |
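ImageNet normalization uses the standard per-channel statistics. A numpy sketch of what this step applies to each frame (the mean/std values are the widely used ImageNet constants, not taken from the source):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """Normalize an [H, W, 3] frame with values in [0, 1] to ImageNet stats."""
    return (frame - IMAGENET_MEAN) / IMAGENET_STD
```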
The configuration_analysis.json records these settings for reproducibility:
{
  "preprocessing_info": {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu"
  }
}
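A quick reproducibility check is to parse the recorded settings and compare them against the values the extractor expects. A sketch; the validation helper is hypothetical:

```python
import json

# Expected values, per the preprocessing table above
EXPECTED = {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu",
}

def validate_preprocessing(config_text: str) -> bool:
    """Return True iff the recorded preprocessing_info matches expectations."""
    info = json.loads(config_text)["preprocessing_info"]
    return all(info.get(k) == v for k, v in EXPECTED.items())
```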

Multi-Scale Extraction

When multi_scale=True, each video is processed at three temporal scales to capture motion dynamics at different speeds.
if self.multi_scale:
    scales = [1.0, 0.85, 1.15]
    scale_features = []
    
    for scale in scales:
        if scale != 1.0:
            new_length = max(int(num_frames * scale), 5)
            indices = np.linspace(0, num_frames - 1, new_length).astype(int)
            scaled_video = video[indices]
        else:
            scaled_video = video
        
        # Extract features for this scale
        scale_num_frames = scaled_video.shape[0]
        frame_features = []
        
        for i in range(0, scale_num_frames, batch_size):
            batch = scaled_video[i:i+batch_size].to(self.device)
            features = self.cnn(batch)
            frame_features.append(features.cpu())
            del batch
        
        scale_feat = torch.cat(frame_features, dim=0)
        scale_features.append(scale_feat)
    
    # Concatenate multi-scale features
    max_len = max(sf.shape[0] for sf in scale_features)
    padded_scales = []
    for sf in scale_features:
        if sf.shape[0] < max_len:
            padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
            sf = torch.cat([sf, padding], dim=0)
        padded_scales.append(sf)
    
    # Average across scales
    video_features = torch.stack(padded_scales).mean(dim=0)
The 0.85× scale simulates faster motion (fewer frames), while 1.15× simulates slower motion (more frames via repetition). Averaging across all three scales makes the resulting feature vector robust to playback speed variation.
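The resampling arithmetic can be checked in isolation: with 64 input frames, the 0.85× scale yields int(64 × 0.85) = 54 indices, while the 1.15× scale yields int(64 × 1.15) = 73, so the longer sequence necessarily repeats some frames. A small sketch mirroring the extractor's index logic:

```python
import numpy as np

def resample_indices(num_frames: int, scale: float) -> np.ndarray:
    """Frame indices for one temporal scale, mirroring the extractor's logic."""
    new_length = max(int(num_frames * scale), 5)
    return np.linspace(0, num_frames - 1, new_length).astype(int)
```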

Full extract_features_from_split Method

The main extraction loop scans a data split directory, processes each video, and accumulates features:
def extract_features_from_split(self, split='train', batch_size=24):
    """Extract features for entire split with multi-scale support"""
    
    split_dir = self.data_dir / split
    suffix = '_multiscale' if self.multi_scale else ''
    output_file = self.output_dir / f'{split}_features{suffix}.h5'
    
    # Scan all video files
    video_files = []
    labels = []
    category_mapping = {}
    
    category_dirs = sorted([d for d in split_dir.glob("*") if d.is_dir()])
    
    for cat_idx, category_dir in enumerate(category_dirs):
        category_name = category_dir.name
        category_mapping[category_name] = cat_idx
        
        for subcat_dir in sorted(category_dir.glob("*")):
            data_file = subcat_dir / 'processed_data.pt'
            if data_file.exists():
                video_files.append(data_file)
                labels.append(cat_idx)
    
    # Extract features with torch.no_grad()
    all_features = []
    all_labels = []
    all_num_frames = []
    
    with torch.no_grad():
        for file_idx, video_file in enumerate(tqdm(video_files)):
            data = torch.load(video_file, map_location='cpu')
            videos = data['videos'] if isinstance(data, dict) else data
            
            for video_idx in range(videos.shape[0]):
                video = videos[video_idx]  # [T, C, H, W]
                # ... multi-scale extraction ...
                all_features.append(video_features.numpy())
                all_labels.append(labels[file_idx])
                all_num_frames.append(video_features.shape[0])

Saving to HDF5

After processing all videos, features are zero-padded to uniform length and written to a compressed HDF5 file:
# Save to HDF5
max_frames = max(all_num_frames)
num_videos = len(all_features)

padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim), dtype=np.float32
)

for i, features in enumerate(all_features):
    padded_features[i, :features.shape[0], :] = features

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset(
    'features', data=padded_features,
    compression='gzip', compression_opts=4
)
h5_file.create_dataset('labels', data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))

h5_file.attrs['num_videos'] = num_videos
h5_file.attrs['max_frames'] = max_frames
h5_file.attrs['feature_dim'] = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale'] = self.multi_scale

h5_file.close()
The resulting .h5 file attributes for the test split (from configuration_analysis.json); note that max_frames = 73 matches the longest temporal scale, int(64 × 1.15) = 73:
| Attribute | Value |
|---|---|
| features shape | [412, 73, 1280] |
| labels shape | [412] |
| num_frames stats | min=73, max=73, mean=73 |
| feature_dim | 1280 |
| multi_scale | true |
| Compression | gzip level 4 |
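Reading the file back follows the same structure. A sketch of loading the datasets and attributes with h5py (the function name load_split is illustrative):

```python
import json

import h5py

def load_split(h5_path: str):
    """Load padded features, labels, and metadata from an extracted split."""
    with h5py.File(h5_path, "r") as f:
        features = f["features"][:]    # [num_videos, max_frames, feature_dim]
        labels = f["labels"][:]        # [num_videos]
        num_frames = f["num_frames"][:]
        category_mapping = json.loads(f.attrs["category_mapping"])
    return features, labels, num_frames, category_mapping
```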

Extracting All Splits

To extract train, val, and test in sequence:
extractor = EnhancedFeatureExtractor(
    data_dir="/path/to/processed",
    output_dir="/path/to/features_enhanced",
    backbone='efficientnet_v2_m',
    multi_scale=True,
    device='cuda'
)

extractor.extract_all_splits(batch_size=20)
Extraction is a one-time cost of 3–5 hours on an NVIDIA A100 MIG partition. GPU memory usage during extraction is 4–5 GB. Use batch_size=20 or lower if you encounter out-of-memory errors.
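The body of extract_all_splits is not shown above; assuming it simply runs extract_features_from_split for each split in order, a minimal sketch of the presumed behavior:

```python
def extract_all_splits(extractor, batch_size: int = 20,
                       splits=("train", "val", "test")):
    """Run feature extraction for each split in sequence.

    A sketch of the presumed behavior of the method on
    EnhancedFeatureExtractor; the real implementation may differ.
    """
    for split in splits:
        extractor.extract_features_from_split(split=split, batch_size=batch_size)
```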

Memory Management

The extractor performs explicit GPU cache clearing every 50 video files (each file may contain multiple videos) to prevent memory fragmentation:
if file_idx % 50 == 0:
    torch.cuda.empty_cache()
    gc.collect()
Each frame batch is deleted immediately after processing (del batch) to minimize peak VRAM usage.
