Feature extraction is the first step in the pipeline. It processes videos through CLIP (visual) and Whisper (audio/text) to create 1024-dimensional embeddings that capture both visual content and spoken/audio information.
How It Works
The extract_features.py script performs multimodal feature extraction:
Visual Features (512-d):
Sample 5 frames uniformly across each video
Encode each frame with CLIP’s vision encoder (ViT-B/32)
Average-pool frame embeddings into a single 512-d vector
Audio Features (512-d):
Extract audio track using FFmpeg
Transcribe speech with Whisper
Encode transcript text with CLIP’s text encoder
L2-normalize to match visual feature scale
Combined Embedding (1024-d):
Concatenate [visual_512 | audio_512]
Both modalities normalized before concatenation to prevent dominance
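The pooling, normalization, and concatenation steps above can be sketched as follows. This is a minimal illustration, not the script's actual code; the variable names (`frame_embs`, `vis_emb`, `audio_emb`) are assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-video inputs: five 512-d CLIP frame embeddings and
# one 512-d CLIP text embedding of the Whisper transcript.
frame_embs = torch.randn(5, 512)
audio_emb = torch.randn(512)

# Average-pool frame embeddings into a single 512-d visual vector.
vis_emb = frame_embs.mean(dim=0)

# L2-normalize both modalities so neither dominates the concatenation.
vis_emb = F.normalize(vis_emb, dim=0)
audio_emb = F.normalize(audio_emb, dim=0)

# Concatenate into the final 1024-d feature vector.
combined = torch.cat([vis_emb, audio_emb])
print(combined.shape)  # torch.Size([1024])
```

Because each half is unit-length, the visual and audio components contribute equally to distances computed on the combined vector.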
Prepare Your Data
Ensure your videos are organized according to the setup guide: data/Favorites/videos/
├── 123456789.mp4 # Unlabeled
├── 987654321.mp4 # Unlabeled
├── soccer/
│ ├── 111111111.mp4 # Labeled as "soccer"
│ └── 222222222.mp4
└── cooking/
└── 333333333.mp4 # Labeled as "cooking"
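The labeled/unlabeled split follows directly from this layout: top-level .mp4 files are unlabeled, and each subfolder name becomes a category label. A minimal sketch of that discovery logic (built against a throwaway copy of the layout; the script's actual implementation may differ):

```python
import tempfile
from pathlib import Path

# Build a throwaway copy of the expected layout for illustration.
root = Path(tempfile.mkdtemp())
videos_dir = root / "data" / "Favorites" / "videos"
for p in ["123456789.mp4", "soccer/111111111.mp4",
          "soccer/222222222.mp4", "cooking/333333333.mp4"]:
    f = videos_dir / p
    f.parent.mkdir(parents=True, exist_ok=True)
    f.touch()

# Top-level .mp4 files are unlabeled; subfolder names are category labels.
unlabeled = sorted(videos_dir.glob("*.mp4"))
labeled = {sub.name: sorted(sub.glob("*.mp4"))
           for sub in sorted(videos_dir.iterdir()) if sub.is_dir()}

print(f"Categories: {sorted(labeled)}")       # ['cooking', 'soccer']
print(f"Unlabeled videos: {len(unlabeled)}")  # 1
```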
Run Extraction
python extract_features.py
The script will:
Auto-detect CUDA/CPU
Load CLIP and Whisper models
Discover labeled and unlabeled videos
Extract features with progress bars
Save artifacts to artifacts/
Monitor Progress
Expected terminal output:
Using device: cuda
Loading CLIP (ViT-B/32)...
Loading Whisper (base)...
Dataset summary:
Labeled videos: 156
Unlabeled videos: 42
Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']
============================================================
Extracting features for labeled videos...
============================================================
Labeled: 100%|███████████████████| 156/156 [08:23<00:00, 3.23s/it]
Saved labeled embeddings: torch.Size([156, 1024])
============================================================
Extracting features for unlabeled videos...
============================================================
Unlabeled: 100%|█████████████████| 42/42 [02:11<00:00, 3.12s/it]
Saved unlabeled embeddings: torch.Size([42, 1024])
Saved 198 transcripts
Done! Artifacts saved to: artifacts
Output Artifacts
Feature extraction creates three files in artifacts/:
labeled_embeddings.pt
PyTorch tensor file containing features for videos in category folders.
import torch

data = torch.load("artifacts/labeled_embeddings.pt")

# Structure:
{
    "features": torch.Tensor,    # Shape: [N, 1024] - feature vectors
    "labels": torch.Tensor,      # Shape: [N] - category indices (0, 1, 2, ...)
    "label_names": List[str],    # ['cooking', 'funny', 'soccer', ...]
    "video_paths": List[str],    # Full paths to video files
}
Example inspection:
print(f"Num videos: {data['features'].shape[0]}")
print(f"Feature dim: {data['features'].shape[1]}")
print(f"Categories: {data['label_names']}")
print(f"Class distribution: {torch.bincount(data['labels'])}")
Output:
Num videos: 156
Feature dim: 1024
Categories: ['cooking', 'funny', 'motivational', 'pets', 'quran', 'soccer', 'tiktok', 'travel']
Class distribution: tensor([12, 8, 15, 18, 24, 45, 20, 14])
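The class distribution is worth checking: categories with very few examples tend to train poorly. A quick sketch for flagging thin categories, using the counts from the sample output above (the `MIN_EXAMPLES` threshold is an arbitrary illustration, not a value from the script):

```python
import torch

# Counts and names taken from the sample output above.
counts = torch.tensor([12, 8, 15, 18, 24, 45, 20, 14])
names = ['cooking', 'funny', 'motivational', 'pets', 'quran',
         'soccer', 'tiktok', 'travel']

# Flag under-represented categories.
MIN_EXAMPLES = 10  # illustrative threshold
for name, n in zip(names, counts.tolist()):
    if n < MIN_EXAMPLES:
        print(f"Warning: only {n} labeled videos for '{name}'")
```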
unlabeled_embeddings.pt
Features for unsorted videos in the root directory.
{
    "features": torch.Tensor,    # Shape: [M, 1024]
    "video_paths": List[str],    # Paths to unlabeled videos
}
These will be used by predict.py to generate folder predictions.
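To give a feel for how such predictions could work, here is a nearest-centroid sketch on synthetic stand-in tensors. This is only an illustration of using the two artifact files together; predict.py's actual method may differ:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for the two artifact files; real code would
# torch.load labeled_embeddings.pt and unlabeled_embeddings.pt.
torch.manual_seed(0)
labeled_feats = torch.randn(30, 1024)
labels = torch.arange(30) % 3
label_names = ["cooking", "soccer", "travel"]
unlabeled_feats = torch.randn(5, 1024)

# Nearest-centroid assignment: average each category's features, then
# match each unlabeled video to its most cosine-similar centroid.
centroids = torch.stack([labeled_feats[labels == i].mean(0) for i in range(3)])
sims = F.normalize(unlabeled_feats, dim=1) @ F.normalize(centroids, dim=1).T
preds = sims.argmax(dim=1)
print([label_names[i] for i in preds])
```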
transcripts.json
Whisper transcriptions for all videos (useful for debugging and analysis).
{
    "data/Favorites/videos/soccer/123456789.mp4": "goal! what an incredible finish!",
    "data/Favorites/videos/cooking/987654321.mp4": "today we're making pasta carbonara",
    "data/Favorites/videos/456789123.mp4": "subscribe for more content"
}
# View all transcripts
cat artifacts/transcripts.json | python -m json.tool | less
# Search for keywords
cat artifacts/transcripts.json | grep -i "recipe"
Processing Time
Extraction time varies by hardware and dataset size:
| Hardware | Videos | Time | Videos/sec |
|---|---|---|---|
| RTX 3060 (GPU) | 200 | ~10 min | 0.3 |
| M1 Mac (CPU) | 200 | ~45 min | 0.07 |
| CPU (Intel i7) | 200 | ~60 min | 0.055 |
Bottlenecks:
Whisper transcription: ~70% of processing time
Video decoding: ~20%
CLIP encoding: ~10%
Processing is sequential (one video at a time). The script deliberately avoids batch processing to keep VRAM usage bounded.
Handling Edge Cases
Videos Without Audio
If audio extraction fails (silent video, corrupt audio, etc.), the script uses a zero vector for the audio component:
if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    # Fallback: zero vector for missing audio
    audio_emb = torch.zeros(vis_emb.shape[0])
This ensures all videos get 1024-d features even without audio.
Corrupt or Unreadable Videos
Videos that can’t be opened by OpenCV are skipped:
Skipping broken_video.mp4 (no visual features)
Check terminal output for skipped files and remove/replace them.
Empty Transcripts
If Whisper produces no text (e.g., instrumental music only), the audio embedding is also zeroed:
text = transcription["text"].strip()
if not text:
    return None  # Will trigger zero-vector fallback
Advanced Configuration
Edit constants in extract_features.py:29-33 to customize extraction:
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5             # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"  # CLIP model variant
WHISPER_MODEL = "base"   # Whisper model size
Sampling More Frames
Increase N_FRAMES for videos with rapid scene changes:
N_FRAMES = 10 # More granular temporal coverage
Trade-off: better temporal coverage vs. roughly double the frame-decoding and CLIP-encoding time (a modest share of the total, since Whisper transcription dominates).
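Uniform sampling picks evenly spaced frame indices across the clip, so raising N_FRAMES tightens the spacing. A sketch of how such indices might be chosen (the script's exact sampling code may differ; `sample_frame_indices` is a hypothetical helper):

```python
import numpy as np

def sample_frame_indices(total_frames: int, n_frames: int) -> list[int]:
    """Evenly spaced frame indices spanning the whole clip."""
    if total_frames <= n_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int).tolist()

# A 30-second clip at 30 fps (900 frames):
print(sample_frame_indices(900, 5))   # [0, 225, 450, 674, 899]
print(sample_frame_indices(900, 10))  # ten indices, denser coverage
```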
Using Larger Models
CLIP_MODEL = "ViT-B/16"   # Higher-resolution patches, typically more accurate, ~2x slower
WHISPER_MODEL = "small"   # Better transcription, ~1.5x slower
When to upgrade :
CLIP: Videos with fine visual details (text overlays, small objects)
Whisper: Non-English content, heavy accents, background noise
Re-running Extraction
You typically re-run extraction after:
Adding new labeled videos : Move more videos into category folders
Adding new categories : Create new subfolders
Adding unlabeled videos : Drop new videos in root for prediction
Re-running overwrites artifacts/labeled_embeddings.pt and artifacts/unlabeled_embeddings.pt. Previous embeddings are lost unless you back them up.
# Backup before re-extraction
cp -r artifacts artifacts_backup_ $( date +%Y%m%d )
# Run extraction
python extract_features.py
Troubleshooting
Audio extraction failed: timeout
Switch to a smaller model or force CPU:
WHISPER_MODEL = "tiny"  # Smallest model
# Or force CPU:
device = "cpu"  # In get_device() function
No videos found in DATA_DIR
Verify the path is correct: print(DATA_DIR.absolute())  # Add at line 147
Ensure videos are .mp4 files (the script only processes *.mp4).
Skipping many videos (no visual features)
Indicates corrupt video files. Check with: ffmpeg -v error -i video.mp4 -f null - 2>&1
Re-download or re-encode problematic videos: ffmpeg -i input.mp4 -c:v libx264 -c:a aac output.mp4
Inspecting Features
Visualize the learned embeddings to understand what CLIP captures:
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load embeddings
data = torch.load("artifacts/labeled_embeddings.pt")
X = data["features"].numpy()
y = data["labels"].numpy()
labels = data["label_names"]

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

# Plot
plt.figure(figsize=(10, 8))
for i, label in enumerate(labels):
    mask = y == i
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=label, alpha=0.6)
plt.legend()
plt.title("t-SNE Visualization of Video Embeddings")
plt.savefig("embeddings_tsne.png", dpi=150)
Well-separated clusters indicate distinct categories; overlapping clusters suggest ambiguous or similar content.
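To put a number on cluster separation rather than eyeballing the plot, a silhouette score over the raw 1024-d features can help. A sketch on synthetic stand-in data; with real embeddings you would pass the X and y loaded above:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (X, y): two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 1024)),
               rng.normal(5.0, 0.5, (50, 1024))])
y = np.repeat([0, 1], 50)

# Scores near 1 indicate tight, well-separated clusters; near 0, overlap.
score = silhouette_score(X, y)
print(f"Silhouette score: {score:.3f}")
```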
Next Steps
Training Train a classifier on the extracted features