Feature extraction is the first step in the pipeline. It processes videos through CLIP (visual) and Whisper (audio/text) to create 1024-dimensional embeddings that capture both visual content and spoken/audio information.

How It Works

The extract_features.py script performs multimodal feature extraction:
  1. Visual Features (512-d):
    • Sample 5 frames uniformly across each video
    • Encode each frame with CLIP’s vision encoder (ViT-B/32)
    • Average-pool frame embeddings into a single 512-d vector
  2. Audio Features (512-d):
    • Extract audio track using FFmpeg
    • Transcribe speech with Whisper
    • Encode transcript text with CLIP’s text encoder
    • L2-normalize to match visual feature scale
  3. Combined Embedding (1024-d):
    • Concatenate [visual_512 | audio_512]
    • Both modalities normalized before concatenation to prevent dominance
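The normalize-then-concatenate step can be sketched as follows (a minimal illustration, not the script's actual code; `combine` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

# Each 512-d modality is L2-normalized before concatenation so that
# neither visual nor audio dominates the 1024-d combined embedding.
def combine(visual_512: torch.Tensor, audio_512: torch.Tensor) -> torch.Tensor:
    v = F.normalize(visual_512, dim=-1)
    a = F.normalize(audio_512, dim=-1)
    return torch.cat([v, a], dim=-1)

emb = combine(torch.randn(512), torch.randn(512))
print(emb.shape)  # torch.Size([1024])
```

After normalization, each 512-d half has unit length, so cosine-style distances on the combined vector weight both modalities equally.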

Running Feature Extraction

Step 1: Prepare Your Data

Ensure your videos are organized according to the setup guide:
data/Favorites/videos/
├── 123456789.mp4          # Unlabeled
├── 987654321.mp4          # Unlabeled
├── soccer/
│   ├── 111111111.mp4      # Labeled as "soccer"
│   └── 222222222.mp4
└── cooking/
    └── 333333333.mp4      # Labeled as "cooking"
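This layout drives labeling: top-level .mp4 files are treated as unlabeled, while each subfolder name becomes a category label. A sketch of how such discovery might look (assumed paths; not the script's actual code):

```python
from pathlib import Path

DATA_DIR = Path("data/Favorites/videos")  # assumed location from the setup guide

unlabeled, labeled = [], {}
if DATA_DIR.is_dir():
    # Top-level .mp4 files have no category yet.
    unlabeled = sorted(DATA_DIR.glob("*.mp4"))
    # Each subfolder name doubles as the category label for the videos inside it.
    labeled = {d.name: sorted(d.glob("*.mp4"))
               for d in DATA_DIR.iterdir() if d.is_dir()}
print(f"{len(unlabeled)} unlabeled videos, {len(labeled)} categories")
```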
Step 2: Run Extraction

python extract_features.py
The script will:
  • Auto-detect CUDA/CPU
  • Load CLIP and Whisper models
  • Discover labeled and unlabeled videos
  • Extract features with progress bars
  • Save artifacts to artifacts/
Step 3: Monitor Progress

Expected terminal output:
Using device: cuda
Loading CLIP (ViT-B/32)...
Loading Whisper (base)...

Dataset summary:
  Labeled videos: 156
  Unlabeled videos: 42
  Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']

============================================================
Extracting features for labeled videos...
============================================================
Labeled: 100%|███████████████████| 156/156 [08:23<00:00,  3.23s/it]

Saved labeled embeddings: torch.Size([156, 1024])

============================================================
Extracting features for unlabeled videos...
============================================================
Unlabeled: 100%|█████████████████| 42/42 [02:11<00:00,  3.12s/it]

Saved unlabeled embeddings: torch.Size([42, 1024])
Saved 198 transcripts

Done! Artifacts saved to: artifacts

Output Artifacts

Feature extraction creates three files in artifacts/:

labeled_embeddings.pt

PyTorch tensor file containing features for videos in category folders.
import torch
data = torch.load("artifacts/labeled_embeddings.pt")

# Structure:
{
    "features": torch.Tensor,      # Shape: [N, 1024] - feature vectors
    "labels": torch.Tensor,        # Shape: [N] - category indices (0, 1, 2, ...)
    "label_names": List[str],      # ['cooking', 'funny', 'soccer', ...]
    "video_paths": List[str],      # Full paths to video files
}
Example inspection:
print(f"Num videos: {data['features'].shape[0]}")
print(f"Feature dim: {data['features'].shape[1]}")
print(f"Categories: {data['label_names']}")
print(f"Class distribution: {torch.bincount(data['labels'])}")
Output:
Num videos: 156
Feature dim: 1024
Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']
Class distribution: tensor([12, 8, 15, 18, 24, 45, 20, 14])

unlabeled_embeddings.pt

Features for unsorted videos in the root directory.
{
    "features": torch.Tensor,      # Shape: [M, 1024]
    "video_paths": List[str],      # Paths to unlabeled videos
}
These will be used by predict.py to generate folder predictions.
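Although predict.py's internals aren't shown here, a nearest-centroid scoring pass over these two artifacts might look like this (random placeholder tensors stand in for the saved embeddings; this is a sketch, not the actual prediction code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
labeled = torch.randn(156, 1024)       # stand-in for labeled "features"
labels = torch.randint(0, 8, (156,))   # stand-in for labeled "labels"
unlabeled = torch.randn(42, 1024)      # stand-in for unlabeled "features"

# Mean embedding per category, then cosine similarity per unlabeled video.
centroids = torch.stack([labeled[labels == c].mean(dim=0) for c in range(8)])
sims = F.normalize(unlabeled, dim=-1) @ F.normalize(centroids, dim=-1).T
pred = sims.argmax(dim=1)              # predicted category index per video
print(pred.shape)  # torch.Size([42])
```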

transcripts.json

Whisper transcriptions for all videos (useful for debugging and analysis).
{
  "data/Favorites/videos/soccer/123456789.mp4": "goal! what an incredible finish!",
  "data/Favorites/videos/cooking/987654321.mp4": "today we're making pasta carbonara",
  "data/Favorites/videos/456789123.mp4": "subscribe for more content"
}
# View all transcripts (pretty-printed)
python -m json.tool artifacts/transcripts.json | less

# Search for keywords
grep -i "recipe" artifacts/transcripts.json

Processing Time

Extraction time varies by hardware and dataset size:
Hardware          Videos   Time      Videos/sec
RTX 3060 (GPU)    200      ~10 min   0.3
M1 Mac (CPU)      200      ~45 min   0.07
CPU (Intel i7)    200      ~60 min   0.055
Bottlenecks:
  • Whisper transcription: ~70% of processing time
  • Video decoding: ~20%
  • CLIP encoding: ~10%
Processing is sequential (one video at a time); the script deliberately avoids batch processing to keep peak VRAM usage bounded.
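To verify where time goes on your own hardware, you could wrap each stage in a small timer like the one below (a hypothetical helper, not part of the script):

```python
import time
from contextlib import contextmanager

totals = {}

@contextmanager
def stage(name):
    # Accumulate wall-clock time per pipeline stage.
    t0 = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - t0

with stage("decode"):
    time.sleep(0.01)  # stand-in for video decoding / Whisper / CLIP calls

print({k: round(v, 3) for k, v in totals.items()})
```

Wrapping the decode, Whisper, and CLIP calls separately would confirm (or refute) the rough 20/70/10 split quoted above for your setup.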

Handling Edge Cases

Videos Without Audio

If audio extraction fails (silent video, corrupt audio, etc.), the script uses a zero vector for the audio component:
if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    # Fallback: zero vector matching the visual embedding's shape and device
    audio_emb = torch.zeros_like(vis_emb)
This ensures all videos get 1024-d features even without audio.

Corrupt or Unreadable Videos

Videos that can’t be opened by OpenCV are skipped:
Skipping broken_video.mp4 (no visual features)
Check terminal output for skipped files and remove/replace them.

Empty Transcripts

If Whisper produces no text (e.g., instrumental music only), the audio embedding is also zeroed:
text = transcription["text"].strip()
if not text:
    return None  # Will trigger zero-vector fallback

Advanced Configuration

Edit constants in extract_features.py:29-33 to customize extraction:
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5              # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"   # CLIP model variant
WHISPER_MODEL = "base"    # Whisper model size

Sampling More Frames

Increase N_FRAMES for videos with rapid scene changes:
N_FRAMES = 10  # More granular temporal coverage
Trade-off: Better temporal coverage vs. 2x processing time.
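Uniform sampling picks evenly spaced frame indices across the clip, so doubling N_FRAMES doubles the CLIP forward passes. A sketch of the indexing (`frame_indices` is a hypothetical helper; the real script may differ):

```python
import numpy as np

def frame_indices(total_frames: int, n_frames: int) -> np.ndarray:
    # Evenly spaced indices from the first frame to the last.
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int)

print(frame_indices(300, 5))   # [  0  75 150 224 299]
print(frame_indices(300, 10))  # twice the frames -> twice the encoding work
```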

Using Larger Models

CLIP_MODEL = "ViT-B/16"      # Higher resolution, +30% accuracy, 2x slower
WHISPER_MODEL = "small"      # Better transcription, 1.5x slower
When to upgrade:
  • CLIP: Videos with fine visual details (text overlays, small objects)
  • Whisper: Non-English content, heavy accents, background noise

Re-running Extraction

You typically re-run extraction after:
  1. Adding new labeled videos: Move more videos into category folders
  2. Adding new categories: Create new subfolders
  3. Adding unlabeled videos: Drop new videos in root for prediction
Re-running overwrites artifacts/labeled_embeddings.pt and artifacts/unlabeled_embeddings.pt. Previous embeddings are lost unless you back them up.
# Backup before re-extraction
cp -r artifacts artifacts_backup_$(date +%Y%m%d)

# Run extraction
python extract_features.py

Troubleshooting

Audio extraction times out

Long videos (>5 min) may exceed the 30-second FFmpeg timeout. Edit extract_features.py:95 to increase it:
result = subprocess.run(..., timeout=120)  # Increase to 120 seconds

Out-of-memory errors

Switch to smaller models or force CPU:
WHISPER_MODEL = "tiny"  # Smallest model
# Or force CPU:
device = "cpu"  # In get_device() function

No videos found

Verify the path is correct:
print(DATA_DIR.absolute())  # Add at line 147
Also ensure videos are .mp4 files (the script only processes *.mp4).

Decode errors during extraction

These usually indicate corrupt video files. Check with:
ffmpeg -v error -i video.mp4 -f null - 2>&1
Re-download or re-encode problematic videos:
ffmpeg -i input.mp4 -c:v libx264 -c:a aac output.mp4

Inspecting Features

Visualize the learned embeddings to understand what CLIP captures:
import torch
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load embeddings
data = torch.load("artifacts/labeled_embeddings.pt")
X = data["features"].numpy()
y = data["labels"].numpy()
labels = data["label_names"]

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

# Plot
plt.figure(figsize=(10, 8))
for i, label in enumerate(labels):
    mask = y == i
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=label, alpha=0.6)
plt.legend()
plt.title("t-SNE Visualization of Video Embeddings")
plt.savefig("embeddings_tsne.png", dpi=150)
Well-separated clusters indicate distinct categories; overlapping clusters suggest ambiguous or similar content.

Next Steps

Training

Train a classifier on the extracted features
