Stage 1 of the pipeline converts raw video frames into compact, semantically rich feature vectors using a frozen pretrained CNN. These vectors are saved to disk once and reused throughout training, eliminating the cost of running the backbone on every epoch.
EnhancedFeatureExtractor (defined in model_train_new.py, line 195) orchestrates backbone loading, multi-scale feature extraction, and HDF5 serialization.
```python
class EnhancedFeatureExtractor:
    """Extract features with multiple backbones and scales."""

    def __init__(self, data_dir, output_dir, backbone='resnet101',
                 device='cuda', multi_scale=True):
        self.data_dir = Path(data_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.device = torch.device(device if torch.cuda.is_available() else 'cpu')
        self.multi_scale = multi_scale
```
On instantiation the class:

- Loads the chosen pretrained backbone from `torchvision.models`.
- Removes the classification head, replacing it with `nn.Identity()`.
- Freezes all parameters (`requires_grad=False`).
- Moves the model to GPU.
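The head-stripping and freezing pattern can be sketched in isolation. This is a minimal illustration using a stand-in module, not the project's real backbone (which comes from `torchvision.models`):

```python
import torch
import torch.nn as nn

# Stand-in "backbone" with a final classification head, used only to
# demonstrate the strip-and-freeze pattern described above.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),               # stand-in classification head
)
backbone[-1] = nn.Identity()        # strip the head, keep pooled features
for p in backbone.parameters():
    p.requires_grad = False         # freeze all remaining weights
backbone.eval()                     # disable dropout / BN statistics updates

feats = backbone(torch.randn(2, 3, 32, 32))
print(feats.shape)  # torch.Size([2, 8])
```

Because every parameter has `requires_grad=False`, no gradients are computed through the backbone, and the same extracted features can be reused across all training epochs.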
Supported Backbones

**ResNet-50**

```python
self.cnn = models.resnet50(weights='IMAGENET1K_V2')
self.feature_dim = self.cnn.fc.in_features  # 2048
self.cnn.fc = nn.Identity()
```

- Feature dim: 2048
- Tradeoff: fastest extraction, good baseline accuracy (~92–94% end-to-end)

**ResNet-101**

```python
self.cnn = models.resnet101(weights='IMAGENET1K_V2')
self.feature_dim = self.cnn.fc.in_features  # 2048
self.cnn.fc = nn.Identity()
```

- Feature dim: 2048
- Tradeoff: +2–3% accuracy over ResNet-50, moderate extraction time

**EfficientNet-V2-S**

```python
self.cnn = models.efficientnet_v2_s(weights='IMAGENET1K_V1')
self.feature_dim = self.cnn.classifier[1].in_features  # 1280
self.cnn.classifier = nn.Identity()
```

- Feature dim: 1280
- Tradeoff: efficient compute, accuracy similar to ResNet-50

**EfficientNet-V2-M**

```python
self.cnn = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
self.feature_dim = self.cnn.classifier[1].in_features  # 1280
self.cnn.classifier = nn.Identity()
```

- Feature dim: 1280
- Tradeoff: best accuracy of the four backbones (+3–4% over ResNet-50), slowest extraction
The production ensemble checkpoints (best_ensemble_model_1–4.pt) were trained with feature_dim=1280, which corresponds to an EfficientNet-V2 backbone. Before loading a new checkpoint, confirm that the chosen backbone's feature dimension matches the feature_dim attribute stored in the .h5 file.
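One way to perform that check is a small helper along these lines (illustrative, not part of the repository's code; the path and function name are hypothetical):

```python
import h5py

def check_feature_dim(h5_path, expected_dim):
    """Fail fast if a cached feature file doesn't match a checkpoint's input size.

    Illustrative helper: reads the feature_dim attribute written by the
    extractor and compares it to the dimension the checkpoint expects.
    """
    with h5py.File(h5_path, 'r') as f:
        stored = int(f.attrs['feature_dim'])
    if stored != expected_dim:
        raise ValueError(
            f'feature cache is {stored}-d but checkpoint expects {expected_dim}-d'
        )
    return stored
```

Calling `check_feature_dim('train_features_multiscale.h5', 1280)` before restoring an EfficientNet-V2 checkpoint catches a backbone mismatch before it surfaces as a shape error mid-training.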
Feature Dimensions per Backbone
| Backbone | Feature Dim | ImageNet Weights | Relative Accuracy |
|---|---|---|---|
| resnet50 | 2048 | IMAGENET1K_V2 | Baseline |
| resnet101 | 2048 | IMAGENET1K_V2 | +2–3% |
| efficientnet_v2_s | 1280 | IMAGENET1K_V1 | ~Baseline |
| efficientnet_v2_m | 1280 | IMAGENET1K_V1 | +3–4% |
Preprocessing
Before the CNN sees any frame, videos are preprocessed offline:
| Parameter | Value |
|---|---|
| Image size | 256 × 256 pixels |
| Frames per video | 64 (uniformly sampled) |
| Processing device | GPU (CUDA) |
| Normalization | ImageNet mean/std |
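The normalization step can be sketched as follows, assuming the standard ImageNet statistics (the actual offline preprocessing script is separate and not shown here; `normalize_frames` is an illustrative name):

```python
import torch

# Standard ImageNet channel statistics, reshaped to broadcast over [T, 3, H, W].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize_frames(frames):
    """Normalize a frame stack; frames: float tensor [T, 3, H, W] in [0, 1]."""
    return (frames - IMAGENET_MEAN) / IMAGENET_STD

video = torch.rand(64, 3, 256, 256)   # 64 uniformly sampled 256x256 frames
normalized = normalize_frames(video)
```

Matching the backbone's pretraining normalization matters: a frozen ImageNet model fed unnormalized pixels produces systematically shifted features.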
The configuration_analysis.json records these settings for reproducibility:
```json
{
  "preprocessing_info": {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu"
  }
}
```
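A quick reproducibility check can parse these settings back and compare them to what the current run expects. In the sketch below the JSON is inlined as a string; in practice you would read configuration_analysis.json from disk:

```python
import json

# Parse the recorded preprocessing settings and verify them against the
# values this pipeline assumes (64 frames at 256x256).
config = json.loads('''
{
  "preprocessing_info": {
    "frames_per_video": 64,
    "image_size": [256, 256],
    "processing_mode": "gpu"
  }
}
''')
info = config['preprocessing_info']
assert info['frames_per_video'] == 64
assert tuple(info['image_size']) == (256, 256)
```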
When multi_scale=True, each video is processed at three temporal scales to capture motion dynamics at different speeds.
```python
if self.multi_scale:
    scales = [1.0, 0.85, 1.15]
    scale_features = []
    for scale in scales:
        if scale != 1.0:
            new_length = max(int(num_frames * scale), 5)
            indices = np.linspace(0, num_frames - 1, new_length).astype(int)
            scaled_video = video[indices]
        else:
            scaled_video = video

        # Extract features for this scale
        scale_num_frames = scaled_video.shape[0]
        frame_features = []
        for i in range(0, scale_num_frames, batch_size):
            batch = scaled_video[i:i + batch_size].to(self.device)
            features = self.cnn(batch)
            frame_features.append(features.cpu())
            del batch
        scale_feat = torch.cat(frame_features, dim=0)
        scale_features.append(scale_feat)

    # Concatenate multi-scale features: pad every scale to the longest length
    max_len = max(sf.shape[0] for sf in scale_features)
    padded_scales = []
    for sf in scale_features:
        if sf.shape[0] < max_len:
            padding = torch.zeros(max_len - sf.shape[0], sf.shape[1])
            sf = torch.cat([sf, padding], dim=0)
        padded_scales.append(sf)

    # Average across scales
    video_features = torch.stack(padded_scales).mean(dim=0)
```
The 0.85× scale simulates faster motion (fewer frames), while 1.15× simulates slower motion (more frames via repetition). Averaging across all three scales makes the resulting feature vector robust to playback speed variation.
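The index arithmetic can be checked in isolation. Note that 64 × 1.15 rounds to 73, which matches the max_frames value recorded for the cached test split below:

```python
import numpy as np

num_frames = 64  # frames per video after preprocessing
lengths = {}
for scale in (1.0, 0.85, 1.15):
    new_length = max(int(num_frames * scale), 5)
    indices = np.linspace(0, num_frames - 1, new_length).astype(int)
    lengths[scale] = (new_length, len(np.unique(indices)))

# 0.85x keeps 54 of 64 frames (subsampling = faster apparent motion);
# 1.15x yields 73 indices over 64 frames, so some frames repeat
# (slower apparent motion).
print(lengths)  # {1.0: (64, 64), 0.85: (54, 54), 1.15: (73, 64)}
```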
The main extraction loop scans a data split directory, processes each video, and accumulates features:
```python
def extract_features_from_split(self, split='train', batch_size=24):
    """Extract features for an entire split with multi-scale support."""
    split_dir = self.data_dir / split
    suffix = '_multiscale' if self.multi_scale else ''
    output_file = self.output_dir / f'{split}_features{suffix}.h5'

    # Scan all video files
    video_files = []
    labels = []
    category_mapping = {}
    category_dirs = sorted([d for d in split_dir.glob("*") if d.is_dir()])
    for cat_idx, category_dir in enumerate(category_dirs):
        category_name = category_dir.name
        category_mapping[category_name] = cat_idx
        for subcat_dir in sorted(category_dir.glob("*")):
            data_file = subcat_dir / 'processed_data.pt'
            if data_file.exists():
                video_files.append(data_file)
                labels.append(cat_idx)

    # Extract features with gradients disabled
    all_features = []
    all_labels = []
    all_num_frames = []
    with torch.no_grad():
        for file_idx, video_file in enumerate(tqdm(video_files)):
            data = torch.load(video_file, map_location='cpu')
            videos = data['videos'] if isinstance(data, dict) else data
            for video_idx in range(videos.shape[0]):
                video = videos[video_idx]  # [T, C, H, W]
                # ... multi-scale extraction ...
                all_features.append(video_features.numpy())
                all_labels.append(labels[file_idx])
                all_num_frames.append(video_features.shape[0])
```
Saving to HDF5
After processing all videos, features are zero-padded to uniform length and written to a compressed HDF5 file:
```python
# Save to HDF5
max_frames = max(all_num_frames)
num_videos = len(all_features)
padded_features = np.zeros(
    (num_videos, max_frames, self.feature_dim), dtype=np.float32
)
for i, features in enumerate(all_features):
    padded_features[i, :features.shape[0], :] = features

h5_file = h5py.File(output_file, 'w')
h5_file.create_dataset(
    'features', data=padded_features,
    compression='gzip', compression_opts=4
)
h5_file.create_dataset('labels', data=np.array(all_labels, dtype=np.int64))
h5_file.create_dataset('num_frames', data=np.array(all_num_frames, dtype=np.int32))
h5_file.attrs['num_videos'] = num_videos
h5_file.attrs['max_frames'] = max_frames
h5_file.attrs['feature_dim'] = self.feature_dim
h5_file.attrs['category_mapping'] = json.dumps(category_mapping)
h5_file.attrs['multi_scale'] = self.multi_scale
h5_file.close()
```
The resulting .h5 file structure for the test split (from configuration_analysis.json):
| Attribute | Value |
|---|---|
| features shape | [412, 73, 1280] |
| labels shape | [412] |
| num_frames stats | min=73, max=73, mean=73 |
| feature_dim | 1280 |
| multi_scale | true |
| Compression | gzip level 4 |
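Reading a cached split back follows the same layout. The helper below is an illustrative sketch (the function name is not from the repo), assuming the file was written by the saving code above:

```python
import json
import h5py

def load_split(h5_path):
    """Load a cached feature split; returns arrays plus the label mapping."""
    with h5py.File(h5_path, 'r') as f:
        features = f['features'][:]        # [num_videos, max_frames, feature_dim]
        labels = f['labels'][:]            # [num_videos]
        num_frames = f['num_frames'][:]    # true (unpadded) lengths per video
        mapping = json.loads(f.attrs['category_mapping'])
    return features, labels, num_frames, mapping
```

The num_frames array matters downstream: because features are zero-padded to max_frames, a sequence model should mask positions beyond each video's true length.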
To extract train, val, and test in sequence:
```python
extractor = EnhancedFeatureExtractor(
    data_dir="/path/to/processed",
    output_dir="/path/to/features_enhanced",
    backbone='efficientnet_v2_m',
    multi_scale=True,
    device='cuda'
)
extractor.extract_all_splits(batch_size=20)
```
Extraction is a one-time cost of 3–5 hours on an NVIDIA A100 MIG partition. GPU memory usage during extraction is 4–5 GB. Use batch_size=20 or lower if you encounter out-of-memory errors.
Memory Management
The extractor performs explicit GPU cache clearing every 50 videos to prevent memory fragmentation:
```python
if file_idx % 50 == 0:
    torch.cuda.empty_cache()
    gc.collect()
```
Each frame batch is deleted immediately after processing (`del batch`) to minimize peak VRAM usage.