Overview

The feature extraction pipeline transforms raw TikTok videos into fixed-dimensional embeddings by combining two complementary modalities:
  1. Visual Features: Extracted using OpenAI’s CLIP vision encoder
  2. Audio Features: Extracted by transcribing audio with Whisper, then encoding the transcript with CLIP’s text encoder
This multimodal approach captures both “what the video shows” and “what is being said,” providing rich semantic representations for classification.

Visual Feature Extraction

Frame Sampling Strategy

N_FRAMES = 5  # Sample 5 frames uniformly across video duration
Instead of processing every frame (computationally expensive) or just the first frame (may miss key content), the system samples 5 frames uniformly distributed across the video timeline.
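As a concrete illustration (a hypothetical 300-frame video, not from the source), `np.linspace` spreads the sample indices evenly from the first frame to the last:

```python
import numpy as np

# Hypothetical example: a 300-frame video sampled at n_frames = 5
total_frames = 300
indices = np.linspace(0, total_frames - 1, 5, dtype=int)
print(indices.tolist())  # [0, 74, 149, 224, 299]
```

The endpoints are always included, so the sample covers the opening and closing moments of the video as well as its middle.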

CLIP Vision Encoder

Model: ViT-B/32 (Vision Transformer with 32×32 patch size)
Why ViT-B/32?
  • Good balance between accuracy and speed
  • Trained on 400M image-text pairs
  • Produces semantic embeddings aligned with natural language
  • Output: 512-dimensional vectors

Implementation Details

From extract_features.py:52-82:
def extract_visual_features(video_path, clip_model, preprocess, device, n_frames=5):
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Sample frames uniformly
    indices = np.linspace(0, total_frames - 1, n_frames, dtype=int)
    embeddings = []
    
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if not ret:
            continue
        
        # Convert BGR → RGB and preprocess
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        img_input = preprocess(img).unsqueeze(0).to(device)
        
        with torch.no_grad():
            emb = clip_model.encode_image(img_input)
        embeddings.append(emb.cpu())
    
    cap.release()
    
    if not embeddings:
        return None  # every sampled frame failed to decode
    
    # Average pool across frames → single 512-d vector
    stacked = torch.cat(embeddings, dim=0)
    return stacked.mean(dim=0)
Key Steps:
  1. Open video with OpenCV
  2. Calculate uniform frame indices using np.linspace
  3. For each sampled frame:
    • Convert BGR to RGB (OpenCV → PIL format)
    • Apply CLIP preprocessing (resize, normalize)
    • Encode with CLIP vision encoder
  4. Average pool across all frame embeddings to get a single 512-d vector
Average pooling provides a robust summary of visual content across the entire video, reducing sensitivity to individual frame noise or transient content.
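The pooling step can be sketched in isolation; here NumPy stands in for PyTorch, and the 4-d vectors are toy stand-ins for the real 512-d CLIP embeddings:

```python
import numpy as np

# Toy stand-ins for per-frame CLIP embeddings (real ones are 512-d)
frame_embeddings = np.array([
    [1.0, 0.0, 0.0, 2.0],  # frame 1
    [0.0, 1.0, 0.0, 2.0],  # frame 2
    [0.0, 0.0, 1.0, 2.0],  # frame 3
])

# Average pooling: one fixed-size vector no matter how many frames were read.
# Content shared across frames (the 2.0 channel) survives intact, while
# per-frame variation is averaged down.
video_embedding = frame_embeddings.mean(axis=0)
```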

Audio Feature Extraction

Two-Stage Pipeline

Stage 1: Audio Transcription with Whisper

Model: base (74M parameters)
Why Whisper Base?
  • Fast inference (~5-10x faster than large models)
  • Good accuracy for English content
  • Lightweight for CPU-only environments
  • Sufficient quality for semantic understanding
Audio Preprocessing:
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output.wav
Parameters:
  • -vn: No video (audio only)
  • -ar 16000: Resample to 16kHz (Whisper’s expected sample rate)
  • -ac 1: Convert to mono
  • -acodec pcm_s16le: Uncompressed 16-bit PCM

Stage 2: Text Encoding with CLIP

Instead of using Whisper’s audio embeddings directly, we transcribe audio to text and then encode the text with CLIP’s text encoder. This approach:
  1. Aligns audio features with the same semantic space as visual features
  2. Leverages CLIP’s powerful language-vision alignment
  3. Makes the text embedding comparable to the visual embedding
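Because both embeddings land in CLIP's joint space, standard cosine similarity between them is meaningful. A dependency-light sketch, with NumPy in place of torch and 3-d toy vectors in place of 512-d embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for a CLIP image embedding and a CLIP text embedding
visual_emb = np.array([0.9, 0.1, 0.4])
text_emb = np.array([0.8, 0.2, 0.5])

sim = cosine_similarity(visual_emb, text_emb)
assert -1.0 <= sim <= 1.0  # cosine similarity is always in [-1, 1]
```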

Implementation Details

From extract_features.py:85-119:
def extract_audio_features(video_path, whisper_model, clip_model, device):
    # Extract audio to temp WAV file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp_path = tmp.name
    
    try:
        # ffmpeg: extract audio track
        result = subprocess.run(
            ["ffmpeg", "-y", "-i", str(video_path), "-vn", "-acodec", "pcm_s16le",
             "-ar", "16000", "-ac", "1", tmp_path],
            capture_output=True, timeout=30
        )
        
        if result.returncode != 0:
            return None
        
        # Whisper transcription
        transcription = whisper_model.transcribe(tmp_path, fp16=(device == "cuda"))
        text = transcription["text"].strip()
        
        if not text:
            return None
        
        # CLIP text encoding; truncate=True clips input to CLIP's 77-token context
        tokens = clip.tokenize([text], truncate=True).to(device)
        with torch.no_grad():
            text_emb = clip_model.encode_text(tokens)
        
        return text_emb.cpu().squeeze(0), text
    
    except Exception:
        return None
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
Key Steps:
  1. Extract audio track to temporary WAV file using ffmpeg
  2. Transcribe audio with Whisper (returns dict with “text” key)
  3. Truncate transcript to 77 tokens (CLIP’s maximum context length)
  4. Tokenize and encode with CLIP text encoder → 512-d vector
  5. Return both embedding and transcript text (for inspection)
Token Limit: CLIP’s text encoder has a maximum context length of 77 tokens. Transcripts are truncated to fit, which works well for short-form TikTok videos (typically 15-60 seconds).

Feature Fusion

Normalization and Concatenation

From extract_features.py:189-193:
# Normalize each modality before concatenation
vis_emb = vis_emb / vis_emb.norm()
audio_emb = audio_emb / (audio_emb.norm() + 1e-8)

# Concatenate: [visual_512 | audio_512] = 1024-d
combined = torch.cat([vis_emb, audio_emb], dim=0)
Why Normalize?
L2 normalization ensures balanced contributions from both modalities. Without normalization, one modality might dominate the feature space due to differing magnitude scales. The + 1e-8 term prevents division by zero in the zero-vector fallback case.
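A small sketch (hypothetical 2-d vectors, NumPy in place of torch) shows why: the raw visual norm below is 100× the audio norm, so the visual half would dominate any distance computation until both are projected onto the unit sphere:

```python
import numpy as np

# Hypothetical raw embeddings with mismatched magnitudes
vis_emb = np.array([3.0, 4.0])      # norm = 5.0
audio_emb = np.array([0.03, 0.04])  # norm = 0.05

vis_emb = vis_emb / np.linalg.norm(vis_emb)
audio_emb = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)  # eps guards zero vectors

# After L2 normalization both modalities contribute on the same scale
print(np.linalg.norm(vis_emb), np.linalg.norm(audio_emb))
```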

Handling Missing Modalities

Graceful Degradation: If audio extraction fails (corrupted audio, silent video, no speech), the system uses a zero vector for the audio modality:
if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    # Zero vector fallback for missing audio
    audio_emb = torch.zeros(vis_emb.shape[0])  # 512-d zeros
The classifier learns to handle this pattern during training, effectively treating it as “no audio information available.”
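The fallback path can be sketched end-to-end (NumPy in place of torch; the random visual embedding is a stand-in for a real CLIP output):

```python
import numpy as np

DIM = 512  # per-modality embedding size

rng = np.random.default_rng(0)
vis_emb = rng.standard_normal(DIM)
vis_emb = vis_emb / np.linalg.norm(vis_emb)

audio_result = None  # simulate a failed audio extraction

if audio_result is not None:
    audio_emb, transcript = audio_result
else:
    audio_emb = np.zeros(DIM)  # "no audio information available"

combined = np.concatenate([vis_emb, audio_emb])
assert combined.shape == (1024,)  # fixed size either way
assert not combined[DIM:].any()   # audio half is all zeros
```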

Final Feature Vector

┌─────────────────────────────────────┐
│   Visual Features (512-d)           │  ← CLIP ViT-B/32 vision encoder
│   [normalized L2]                   │     (averaged across 5 frames)
├─────────────────────────────────────┤
│   Audio Features (512-d)            │  ← Whisper → CLIP text encoder
│   [normalized L2]                   │     (or zeros if no audio)
└─────────────────────────────────────┘
         = 1024-d combined vector
This representation is:
  • Fixed-dimensional: Always 1024-d regardless of video length
  • Semantic: Captures high-level content, not raw pixels/waveforms
  • Multimodal: Combines complementary information sources
  • Robust: Handles missing audio gracefully

Model Loading

From extract_features.py:42-49:
def load_models(device):
    print(f"Loading CLIP ({CLIP_MODEL})...")
    clip_model, clip_preprocess = clip.load(CLIP_MODEL, device=device)
    
    print(f"Loading Whisper ({WHISPER_MODEL})...")
    whisper_model = whisper.load_model(WHISPER_MODEL, device=device)
    
    return clip_model, clip_preprocess, whisper_model
First Run: Models are automatically downloaded from OpenAI on first use:
  • CLIP ViT-B/32: ~350 MB
  • Whisper Base: ~140 MB
Subsequent runs load from the local cache (~/.cache/clip/ and ~/.cache/whisper/).

Output Format

Labeled Embeddings

Saved to artifacts/labeled_embeddings.pt:
{
    "features": torch.Tensor,      # Shape: (N, 1024)
    "labels": torch.Tensor,        # Shape: (N,) - integer class indices
    "label_names": List[str],      # Category names (e.g., ["soccer", "funny", ...])
    "video_paths": List[str],      # Paths to source videos
}
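A minimal sketch of consuming this dict, with plain Python lists standing in for tensors so it runs without torch (in practice you would read the real file with torch.load). The paths and categories below are made up for illustration:

```python
# Plain-list stand-in for the dict stored in artifacts/labeled_embeddings.pt
data = {
    "features": [[0.1] * 1024, [0.2] * 1024],   # (N, 1024)
    "labels": [0, 1],                            # integer class indices
    "label_names": ["soccer", "funny"],          # index → category name
    "video_paths": ["/videos/goal.mp4", "/videos/cat.mp4"],
}

# Map each video to its human-readable category
categories = {
    path: data["label_names"][label]
    for path, label in zip(data["video_paths"], data["labels"])
}
print(categories)  # {'/videos/goal.mp4': 'soccer', '/videos/cat.mp4': 'funny'}
```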

Unlabeled Embeddings

Saved to artifacts/unlabeled_embeddings.pt:
{
    "features": torch.Tensor,      # Shape: (M, 1024)
    "video_paths": List[str],      # Paths to unsorted videos
}

Transcripts

Saved to artifacts/transcripts.json for inspection:
{
  "/path/to/video1.mp4": "Check out this amazing goal!",
  "/path/to/video2.mp4": "This is so funny haha",
  ...
}
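Since this is a plain JSON mapping, inspection needs nothing beyond the standard library; a small sketch using the two example entries above:

```python
import json

# Mirrors the shape of artifacts/transcripts.json
raw = """{
  "/path/to/video1.mp4": "Check out this amazing goal!",
  "/path/to/video2.mp4": "This is so funny haha"
}"""

transcripts = json.loads(raw)
for path, text in sorted(transcripts.items()):
    print(f"{path}: {text}")
```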
