Feature extraction is the first step in the pipeline. It processes videos through CLIP (visual) and Whisper (audio/text) to create 1024-dimensional embeddings that capture both visual content and spoken/audio information.
How It Works
The extract_features.py script performs multimodal feature extraction:
Visual Features (512-d):
Sample 5 frames uniformly across each video
Encode each frame with CLIP’s vision encoder (ViT-B/32)
Average-pool frame embeddings into a single 512-d vector
Audio Features (512-d):
Extract audio track using FFmpeg
Transcribe speech with Whisper
Encode transcript text with CLIP’s text encoder
L2-normalize to match visual feature scale
Combined Embedding (1024-d):
Concatenate [visual_512 | audio_512]
Both modalities normalized before concatenation to prevent dominance
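The pooling, normalization, and concatenation steps above can be sketched as follows. This is a minimal illustration, not the script's actual code; the variable names (`frame_embs`, `vis_emb`, `audio_emb`) are assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-video inputs: five 512-d CLIP frame embeddings and
# one 512-d CLIP text embedding of the Whisper transcript.
frame_embs = torch.randn(5, 512)
audio_emb = torch.randn(512)

# Average-pool frame embeddings into a single 512-d visual vector.
vis_emb = frame_embs.mean(dim=0)

# L2-normalize both modalities so neither dominates the concatenation.
vis_emb = F.normalize(vis_emb, dim=0)
audio_emb = F.normalize(audio_emb, dim=0)

# Concatenate into the final 1024-d feature vector.
combined = torch.cat([vis_emb, audio_emb])
print(combined.shape)  # torch.Size([1024])
```

Because each half is unit-length, the visual and audio components contribute equally to distances computed on the combined vector.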
Prepare Your Data
Ensure your videos are organized according to the setup guide: data/Favorites/videos/
├── 123456789.mp4 # Unlabeled
├── 987654321.mp4 # Unlabeled
├── soccer/
│ ├── 111111111.mp4 # Labeled as "soccer"
│ └── 222222222.mp4
└── cooking/
└── 333333333.mp4 # Labeled as "cooking"
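The labeled/unlabeled split follows directly from this layout: top-level .mp4 files are unlabeled, and each subfolder name becomes a category label. A minimal sketch of that discovery logic (built against a throwaway copy of the layout; the script's actual implementation may differ):

```python
import tempfile
from pathlib import Path

# Build a throwaway copy of the expected layout for illustration.
root = Path(tempfile.mkdtemp())
videos_dir = root / "data" / "Favorites" / "videos"
for p in ["123456789.mp4", "soccer/111111111.mp4",
          "soccer/222222222.mp4", "cooking/333333333.mp4"]:
    f = videos_dir / p
    f.parent.mkdir(parents=True, exist_ok=True)
    f.touch()

# Top-level .mp4 files are unlabeled; subfolder names are category labels.
unlabeled = sorted(videos_dir.glob("*.mp4"))
labeled = {sub.name: sorted(sub.glob("*.mp4"))
           for sub in sorted(videos_dir.iterdir()) if sub.is_dir()}

print(f"Categories: {sorted(labeled)}")       # ['cooking', 'soccer']
print(f"Unlabeled videos: {len(unlabeled)}")  # 1
```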
Run Extraction
python extract_features.py
The script will:
Auto-detect CUDA/CPU
Load CLIP and Whisper models
Discover labeled and unlabeled videos
Extract features with progress bars
Save artifacts to artifacts/
Monitor Progress
Expected terminal output:
Using device: cuda
Loading CLIP (ViT-B/32)...
Loading Whisper (base)...
Dataset summary:
Labeled videos: 156
Unlabeled videos: 42
Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']
============================================================
Extracting features for labeled videos...
============================================================
Labeled: 100%|███████████████████| 156/156 [08:23<00:00, 3.23s/it]
Saved labeled embeddings: torch.Size([156, 1024])
============================================================
Extracting features for unlabeled videos...
============================================================
Unlabeled: 100%|█████████████████| 42/42 [02:11<00:00, 3.12s/it]
Saved unlabeled embeddings: torch.Size([42, 1024])
Saved 198 transcripts
Done! Artifacts saved to: artifacts
Output Artifacts
Feature extraction creates three files in artifacts/:
labeled_embeddings.pt
PyTorch tensor file containing features for videos in category folders.
import torch

data = torch.load("artifacts/labeled_embeddings.pt")

# Structure:
{
    "features": torch.Tensor,    # Shape: [N, 1024] - feature vectors
    "labels": torch.Tensor,      # Shape: [N] - category indices (0, 1, 2, ...)
    "label_names": List[str],    # ['cooking', 'funny', 'soccer', ...]
    "video_paths": List[str],    # Full paths to video files
}
Example inspection:
print(f"Num videos: {data['features'].shape[0]}")
print(f"Feature dim: {data['features'].shape[1]}")
print(f"Categories: {data['label_names']}")
print(f"Class distribution: {torch.bincount(data['labels'])}")
Output:
Num videos: 156
Feature dim: 1024
Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']
Class distribution: tensor([12, 8, 15, 18, 24, 45, 20, 14])
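The class distribution is worth checking: categories with very few examples tend to train poorly. A quick sketch for flagging thin categories, using the counts from the sample output above (the `MIN_EXAMPLES` threshold is an arbitrary illustration, not a value from the script):

```python
import torch

# Counts and names taken from the sample output above.
counts = torch.tensor([12, 8, 15, 18, 24, 45, 20, 14])
names = ['cooking', 'funny', 'motivational', 'pets', 'quran',
         'soccer', 'tiktok', 'travel']

# Flag under-represented categories.
MIN_EXAMPLES = 10  # illustrative threshold
for name, n in zip(names, counts.tolist()):
    if n < MIN_EXAMPLES:
        print(f"Warning: only {n} labeled videos for '{name}'")
```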
unlabeled_embeddings.pt
Features for unsorted videos in the root directory.
{
    "features": torch.Tensor,    # Shape: [M, 1024]
    "video_paths": List[str],    # Paths to unlabeled videos
}
These will be used by predict.py to generate folder predictions.
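To give a feel for how such predictions could work, here is a nearest-centroid sketch on synthetic stand-in tensors. This is only an illustration of using the two artifact files together; predict.py's actual method may differ:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for the two artifact files; real code would
# torch.load labeled_embeddings.pt and unlabeled_embeddings.pt.
torch.manual_seed(0)
labeled_feats = torch.randn(30, 1024)
labels = torch.arange(30) % 3
label_names = ["cooking", "soccer", "travel"]
unlabeled_feats = torch.randn(5, 1024)

# Nearest-centroid assignment: average each category's features, then
# match each unlabeled video to its most cosine-similar centroid.
centroids = torch.stack([labeled_feats[labels == i].mean(0) for i in range(3)])
sims = F.normalize(unlabeled_feats, dim=1) @ F.normalize(centroids, dim=1).T
preds = sims.argmax(dim=1)
print([label_names[i] for i in preds])
```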
transcripts.json
Whisper transcriptions for all videos (useful for debugging and analysis).
{
    "data/Favorites/videos/soccer/123456789.mp4": "goal! what an incredible finish!",
    "data/Favorites/videos/cooking/987654321.mp4": "today we're making pasta carbonara",
    "data/Favorites/videos/456789123.mp4": "subscribe for more content"
}
# View all transcripts
cat artifacts/transcripts.json | python -m json.tool | less
# Search for keywords
cat artifacts/transcripts.json | grep -i "recipe"
Processing Time
Extraction time varies by hardware and dataset size:
| Hardware | Videos | Time | Videos/sec |
|---|---|---|---|
| RTX 3060 (GPU) | 200 | ~10 min | 0.3 |
| M1 Mac (CPU) | 200 | ~45 min | 0.07 |
| CPU (Intel i7) | 200 | ~60 min | 0.055 |
Bottlenecks:
Whisper transcription: ~70% of processing time
Video decoding: ~20%
CLIP encoding: ~10%
Processing is sequential (one video at a time). The script deliberately avoids batch processing to keep VRAM usage bounded.
Handling Edge Cases
Videos Without Audio
If audio extraction fails (silent video, corrupt audio, etc.), the script uses a zero vector for the audio component:
if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    # Fallback: zero vector for missing audio
    audio_emb = torch.zeros(vis_emb.shape[0])
This ensures all videos get 1024-d features even without audio.
Corrupt or Unreadable Videos
Videos that can’t be opened by OpenCV are skipped:
Skipping broken_video.mp4 (no visual features)
Check terminal output for skipped files and remove/replace them.
Empty Transcripts
If Whisper produces no text (e.g., instrumental music only), the audio embedding is also zeroed:
text = transcription["text"].strip()
if not text:
    return None  # Will trigger zero-vector fallback
Advanced Configuration
Edit constants in extract_features.py:29-33 to customize extraction:
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5             # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"  # CLIP model variant
WHISPER_MODEL = "base"   # Whisper model size
Sampling More Frames
Increase N_FRAMES for videos with rapid scene changes:
N_FRAMES = 10 # More granular temporal coverage
Trade-off: better temporal coverage vs. roughly double the frame-decoding and CLIP-encoding time (a modest share of the total, since Whisper transcription dominates).
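Uniform sampling picks evenly spaced frame indices across the clip, so raising N_FRAMES tightens the spacing. A sketch of how such indices might be chosen (the script's exact sampling code may differ; `sample_frame_indices` is a hypothetical helper):

```python
import numpy as np

def sample_frame_indices(total_frames: int, n_frames: int) -> list[int]:
    """Evenly spaced frame indices spanning the whole clip."""
    if total_frames <= n_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int).tolist()

# A 30-second clip at 30 fps (900 frames):
print(sample_frame_indices(900, 5))   # [0, 225, 450, 674, 899]
print(sample_frame_indices(900, 10))  # ten indices, denser coverage
```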
Using Larger Models
CLIP_MODEL = "ViT-B/16"   # Higher-resolution patches, typically more accurate, ~2x slower
WHISPER_MODEL = "small"   # Better transcription, ~1.5x slower
When to upgrade :
CLIP: Videos with fine visual details (text overlays, small objects)
Whisper: Non-English content, heavy accents, background noise
Re-running Extraction
You typically re-run extraction after:
Adding new labeled videos : Move more videos into category folders
Adding new categories : Create new subfolders
Adding unlabeled videos : Drop new videos in root for prediction
Re-running overwrites artifacts/labeled_embeddings.pt and artifacts/unlabeled_embeddings.pt. Previous embeddings are lost unless you back them up.
# Backup before re-extraction
cp -r artifacts artifacts_backup_ $( date +%Y%m%d )
# Run extraction
python extract_features.py
Troubleshooting
Audio extraction failed: timeout
Switch to a smaller model or force CPU:
WHISPER_MODEL = "tiny"  # Smallest model
# Or force CPU:
device = "cpu"  # In get_device() function
No videos found in DATA_DIR
Verify the path is correct: print(DATA_DIR.absolute())  # Add at line 147
Ensure videos are .mp4 files (the script only processes *.mp4).
Skipping many videos (no visual features)
Indicates corrupt video files. Check with: ffmpeg -v error -i video.mp4 -f null - 2>&1
Re-download or re-encode problematic videos: ffmpeg -i input.mp4 -c:v libx264 -c:a aac output.mp4
Inspecting Features
Visualize the learned embeddings to understand what CLIP captures:
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load embeddings
data = torch.load("artifacts/labeled_embeddings.pt")
X = data["features"].numpy()
y = data["labels"].numpy()
labels = data["label_names"]

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

# Plot
plt.figure(figsize=(10, 8))
for i, label in enumerate(labels):
    mask = y == i
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=label, alpha=0.6)
plt.legend()
plt.title("t-SNE Visualization of Video Embeddings")
plt.savefig("embeddings_tsne.png", dpi=150)
Well-separated clusters indicate distinct categories; overlapping clusters suggest ambiguous or similar content.
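To put a number on cluster separation rather than eyeballing the plot, a silhouette score over the raw 1024-d features can help. A sketch on synthetic stand-in data; with real embeddings you would pass the X and y loaded above:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (X, y): two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 1024)),
               rng.normal(5.0, 0.5, (50, 1024))])
y = np.repeat([0, 1], 50)

# Scores near 1 indicate tight, well-separated clusters; near 0, overlap.
score = silhouette_score(X, y)
print(f"Silhouette score: {score:.3f}")
```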
Next Steps
Training Train a classifier on the extracted features