Overview

The feature extraction pipeline transforms raw TikTok videos into fixed-dimensional embeddings by combining two complementary modalities:
  1. Visual Features: Extracted using OpenAI’s CLIP vision encoder
  2. Audio Features: Extracted by transcribing audio with Whisper, then encoding the transcript with CLIP’s text encoder
This multimodal approach captures both “what the video shows” and “what is being said,” providing rich semantic representations for classification.

Visual Feature Extraction

Frame Sampling Strategy

N_FRAMES = 5  # Sample 5 frames uniformly across video duration
Instead of processing every frame (computationally expensive) or just the first frame (may miss key content), the system samples 5 frames uniformly distributed across the video timeline.
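As a concrete illustration (a hypothetical 300-frame video, not from the source), `np.linspace` spreads the sample indices evenly from the first frame to the last:

```python
import numpy as np

# Hypothetical example: a 300-frame video sampled at n_frames = 5
total_frames = 300
indices = np.linspace(0, total_frames - 1, 5, dtype=int)
print(indices.tolist())  # [0, 74, 149, 224, 299]
```

The endpoints are always included, so the sample covers the opening and closing moments of the video as well as its middle.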

CLIP Vision Encoder

Model: ViT-B/32 (Vision Transformer with 32×32 patch size)
Why ViT-B/32?
  • Good balance between accuracy and speed
  • Trained on 400M image-text pairs
  • Produces semantic embeddings aligned with natural language
  • Output: 512-dimensional vectors

Implementation Details

From extract_features.py:52-82:
def extract_visual_features(video_path, clip_model, preprocess, device, n_frames=5):
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Sample frames uniformly
    indices = np.linspace(0, total_frames - 1, n_frames, dtype=int)
    embeddings = []
    
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue
        
        # Convert BGR → RGB and preprocess
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        img_input = preprocess(img).unsqueeze(0).to(device)
        
        with torch.no_grad():
            emb = clip_model.encode_image(img_input)
        embeddings.append(emb.cpu())
    
    cap.release()
    
    if not embeddings:
        return None  # every sampled frame failed to decode
    
    # Average pool across frames → single 512-d vector
    stacked = torch.cat(embeddings, dim=0)
    return stacked.mean(dim=0)
Key Steps:
  1. Open video with OpenCV
  2. Calculate uniform frame indices using np.linspace
  3. For each sampled frame:
    • Convert BGR to RGB (OpenCV → PIL format)
    • Apply CLIP preprocessing (resize, normalize)
    • Encode with CLIP vision encoder
  4. Average pool across all frame embeddings to get a single 512-d vector
Average pooling provides a robust summary of visual content across the entire video, reducing sensitivity to individual frame noise or transient content.
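The pooling step can be sketched in isolation; here NumPy stands in for PyTorch, and the 4-d vectors are toy stand-ins for the real 512-d CLIP embeddings:

```python
import numpy as np

# Toy stand-ins for per-frame CLIP embeddings (real ones are 512-d)
frame_embeddings = np.array([
    [1.0, 0.0, 0.0, 2.0],  # frame 1
    [0.0, 1.0, 0.0, 2.0],  # frame 2
    [0.0, 0.0, 1.0, 2.0],  # frame 3
])

# Average pooling: one fixed-size vector no matter how many frames were read.
# Content shared across frames (the 2.0 channel) survives intact, while
# per-frame variation is averaged down.
video_embedding = frame_embeddings.mean(axis=0)
```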

Audio Feature Extraction

Two-Stage Pipeline

Stage 1: Audio Transcription with Whisper

Model: base (74M parameters)
Why Whisper Base?
  • Fast inference (~5-10x faster than large models)
  • Good accuracy for English content
  • Lightweight for CPU-only environments
  • Sufficient quality for semantic understanding
Audio Preprocessing:
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output.wav
Parameters:
  • -vn: No video (audio only)
  • -ar 16000: Resample to 16kHz (Whisper’s expected sample rate)
  • -ac 1: Convert to mono
  • -acodec pcm_s16le: Uncompressed 16-bit PCM

Stage 2: Text Encoding with CLIP

Instead of using Whisper’s audio embeddings directly, we transcribe audio to text and then encode the text with CLIP’s text encoder. This approach:
  1. Aligns audio features with the same semantic space as visual features
  2. Leverages CLIP’s powerful language-vision alignment
  3. Makes the text embedding comparable to the visual embedding
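Because both embeddings land in CLIP's joint space, standard cosine similarity between them is meaningful. A dependency-light sketch, with NumPy in place of torch and 3-d toy vectors in place of 512-d embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for a CLIP image embedding and a CLIP text embedding
visual_emb = np.array([0.9, 0.1, 0.4])
text_emb = np.array([0.8, 0.2, 0.5])

sim = cosine_similarity(visual_emb, text_emb)
assert -1.0 <= sim <= 1.0  # cosine similarity is always in [-1, 1]
```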

Implementation Details

From extract_features.py:85-119:
def extract_audio_features(video_path, whisper_model, clip_model, device):
    # Extract audio to temp WAV file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp_path = tmp.name
    
    try:
        # ffmpeg: extract audio track
        result = subprocess.run(
            ["ffmpeg", "-y", "-i", str(video_path), "-vn", "-acodec", "pcm_s16le",
             "-ar", "16000", "-ac", "1", tmp_path],
            capture_output=True, timeout=30
        )
        
        if result.returncode != 0:
            return None
        
        # Whisper transcription
        transcription = whisper_model.transcribe(tmp_path, fp16=(device == "cuda"))
        text = transcription["text"].strip()
        
        if not text:
            return None
        
        # CLIP text encoding; truncate=True clips input to CLIP's 77-token context
        tokens = clip.tokenize([text], truncate=True).to(device)
        with torch.no_grad():
            text_emb = clip_model.encode_text(tokens)
        
        return text_emb.cpu().squeeze(0), text
    
    except Exception:
        return None
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
Key Steps:
  1. Extract audio track to temporary WAV file using ffmpeg
  2. Transcribe audio with Whisper (returns dict with “text” key)
  3. Truncate transcript to 77 tokens (CLIP’s maximum context length)
  4. Tokenize and encode with CLIP text encoder → 512-d vector
  5. Return both embedding and transcript text (for inspection)
Token Limit: CLIP’s text encoder has a maximum context length of 77 tokens. Transcripts are truncated to fit, which works well for short-form TikTok videos (typically 15-60 seconds).

Feature Fusion

Normalization and Concatenation

From extract_features.py:189-193:
# Normalize each modality before concatenation
vis_emb = vis_emb / vis_emb.norm()
audio_emb = audio_emb / (audio_emb.norm() + 1e-8)

# Concatenate: [visual_512 | audio_512] = 1024-d
combined = torch.cat([vis_emb, audio_emb], dim=0)
Why Normalize?
L2 normalization ensures balanced contributions from both modalities. Without normalization, one modality might dominate the feature space due to differing magnitude scales. The + 1e-8 term prevents division by zero in the zero-vector fallback case.
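A small sketch (hypothetical 2-d vectors, NumPy in place of torch) shows why: the raw visual norm below is 100× the audio norm, so the visual half would dominate any distance computation until both are projected onto the unit sphere:

```python
import numpy as np

# Hypothetical raw embeddings with mismatched magnitudes
vis_emb = np.array([3.0, 4.0])      # norm = 5.0
audio_emb = np.array([0.03, 0.04])  # norm = 0.05

vis_emb = vis_emb / np.linalg.norm(vis_emb)
audio_emb = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)  # eps guards zero vectors

# After L2 normalization both modalities contribute on the same scale
print(np.linalg.norm(vis_emb), np.linalg.norm(audio_emb))
```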

Handling Missing Modalities

Graceful Degradation: If audio extraction fails (corrupted audio, silent video, no speech), the system uses a zero vector for the audio modality:
if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    # Zero vector fallback for missing audio
    audio_emb = torch.zeros(vis_emb.shape[0])  # 512-d zeros
The classifier learns to handle this pattern during training, effectively treating it as “no audio information available.”
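The fallback path can be sketched end-to-end (NumPy in place of torch; the random visual embedding is a stand-in for a real CLIP output):

```python
import numpy as np

DIM = 512  # per-modality embedding size

rng = np.random.default_rng(0)
vis_emb = rng.standard_normal(DIM)
vis_emb = vis_emb / np.linalg.norm(vis_emb)

audio_result = None  # simulate a failed audio extraction

if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    audio_emb = np.zeros(DIM)  # "no audio information available"

combined = np.concatenate([vis_emb, audio_emb])
assert combined.shape == (1024,)  # fixed size either way
assert not combined[DIM:].any()   # audio half is all zeros
```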

Final Feature Vector

┌─────────────────────────────────────┐
│   Visual Features (512-d)           │  ← CLIP ViT-B/32 vision encoder
│   [normalized L2]                   │     (averaged across 5 frames)
├─────────────────────────────────────┤
│   Audio Features (512-d)            │  ← Whisper → CLIP text encoder
│   [normalized L2]                   │     (or zeros if no audio)
└─────────────────────────────────────┘
         = 1024-d combined vector
This representation is:
  • Fixed-dimensional: Always 1024-d regardless of video length
  • Semantic: Captures high-level content, not raw pixels/waveforms
  • Multimodal: Combines complementary information sources
  • Robust: Handles missing audio gracefully

Model Loading

From extract_features.py:42-49:
def load_models(device):
    print(f"Loading CLIP ({CLIP_MODEL})...")
    clip_model, clip_preprocess = clip.load(CLIP_MODEL, device=device)
    
    print(f"Loading Whisper ({WHISPER_MODEL})...")
    whisper_model = whisper.load_model(WHISPER_MODEL, device=device)
    
    return clip_model, clip_preprocess, whisper_model
First Run: Models are automatically downloaded from OpenAI on first use:
  • CLIP ViT-B/32: ~350 MB
  • Whisper Base: ~140 MB
Subsequent runs load from the local cache (~/.cache/clip/ and ~/.cache/whisper/).

Output Format

Labeled Embeddings

Saved to artifacts/labeled_embeddings.pt:
{
    "features": torch.Tensor,      # Shape: (N, 1024)
    "labels": torch.Tensor,        # Shape: (N,) - integer class indices
    "label_names": List[str],      # Category names (e.g., ["soccer", "funny", ...])
    "video_paths": List[str],      # Paths to source videos
}
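A minimal sketch of consuming this dict, with plain Python lists standing in for tensors so it runs without torch (in practice you would read the real file with torch.load). The paths and categories below are made up for illustration:

```python
# Plain-list stand-in for the dict stored in artifacts/labeled_embeddings.pt
data = {
    "features": [[0.1] * 1024, [0.2] * 1024],   # (N, 1024)
    "labels": [0, 1],                            # integer class indices
    "label_names": ["soccer", "funny"],          # index → category name
    "video_paths": ["/videos/goal.mp4", "/videos/cat.mp4"],
}

# Map each video to its human-readable category
categories = {
    path: data["label_names"][label]
    for path, label in zip(data["video_paths"], data["labels"])
}
print(categories)  # {'/videos/goal.mp4': 'soccer', '/videos/cat.mp4': 'funny'}
```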

Unlabeled Embeddings

Saved to artifacts/unlabeled_embeddings.pt:
{
    "features": torch.Tensor,      # Shape: (M, 1024)
    "video_paths": List[str],      # Paths to unsorted videos
}

Transcripts

Saved to artifacts/transcripts.json for inspection:
{
  "/path/to/video1.mp4": "Check out this amazing goal!",
  "/path/to/video2.mp4": "This is so funny haha",
  ...
}
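Since this is a plain JSON mapping, inspection needs nothing beyond the standard library; a small sketch using the two example entries above:

```python
import json

# Mirrors the shape of artifacts/transcripts.json
raw = """{
  "/path/to/video1.mp4": "Check out this amazing goal!",
  "/path/to/video2.mp4": "This is so funny haha"
}"""

transcripts = json.loads(raw)
for path, text in sorted(transcripts.items()):
    print(f"{path}: {text}")
```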
