Overview
The feature extraction pipeline transforms raw TikTok videos into fixed-dimensional embeddings by combining two complementary modalities:
- Visual Features: Extracted using OpenAI’s CLIP vision encoder
- Audio Features: Extracted by transcribing audio with Whisper, then encoding the transcript with CLIP’s text encoder
Visual Feature Extraction
Frame Sampling Strategy
CLIP Vision Encoder
Model: ViT-B/32 (Vision Transformer with 32×32 patch size)
Why ViT-B/32?
- Good balance between accuracy and speed
- Trained on 400M image-text pairs
- Produces semantic embeddings aligned with natural language
- Output: 512-dimensional vectors
Implementation Details
From extract_features.py:52-82:
- Open video with OpenCV
- Calculate uniform frame indices using np.linspace
- For each sampled frame:
- Convert BGR to RGB (OpenCV → PIL format)
- Apply CLIP preprocessing (resize, normalize)
- Encode with CLIP vision encoder
- Average pool across all frame embeddings to get a single 512-d vector
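The steps above can be sketched as follows. This is a hedged reconstruction, not the code from extract_features.py: the helper names, the 8-frame sample count, and CPU-only inference are assumptions.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_samples: int = 8) -> np.ndarray:
    """Uniformly spaced frame indices via np.linspace (sample count is assumed)."""
    return np.linspace(0, total_frames - 1, num_samples).astype(int)


def pool_frame_embeddings(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average-pool per-frame 512-d CLIP embeddings into one 512-d vector."""
    return frame_embeddings.mean(axis=0)


def extract_visual_features(video_path: str, num_samples: int = 8) -> np.ndarray:
    """Sketch of the full loop; requires opencv-python, torch, and clip."""
    import cv2
    import torch
    import clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    embeddings = []
    for idx in sample_frame_indices(total, num_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      # OpenCV BGR -> RGB
        image = preprocess(Image.fromarray(rgb)).unsqueeze(0)  # CLIP resize/normalize
        with torch.no_grad():
            embeddings.append(model.encode_image(image).squeeze(0).numpy())
    cap.release()
    return pool_frame_embeddings(np.stack(embeddings))    # single 512-d vector
```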
Audio Feature Extraction
Two-Stage Pipeline
Stage 1: Audio Transcription with Whisper
Model: base (74M parameters)
Why Whisper Base?
- Fast inference (~5-10x faster than large models)
- Good accuracy for English content
- Lightweight for CPU-only environments
- Sufficient quality for semantic understanding
The audio track is extracted with ffmpeg using:
- -vn: No video (audio only)
- -ar 16000: Resample to 16 kHz (Whisper’s expected sample rate)
- -ac 1: Convert to mono
- -acodec pcm_s16le: Uncompressed 16-bit PCM
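A minimal sketch of how these flags might be assembled into an ffmpeg call from Python; the function names and paths are hypothetical, not taken from extract_features.py:

```python
import subprocess


def ffmpeg_audio_cmd(video_path: str, wav_path: str) -> list[str]:
    """Build an ffmpeg command using the flags documented above."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # no video: keep only the audio stream
        "-ar", "16000",          # resample to 16 kHz (Whisper's expected rate)
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg; requires the ffmpeg binary on PATH."""
    subprocess.run(ffmpeg_audio_cmd(video_path, wav_path), check=True)
```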
Stage 2: Text Encoding with CLIP
Instead of using Whisper’s audio embeddings directly, we transcribe audio to text and then encode the text with CLIP’s text encoder. This approach:
- Aligns audio features with the same semantic space as visual features
- Leverages CLIP’s powerful language-vision alignment
- Makes the text embedding comparable to the visual embedding
Implementation Details
From extract_features.py:85-119:
- Extract audio track to temporary WAV file using ffmpeg
- Transcribe audio with Whisper (returns dict with “text” key)
- Truncate transcript to 77 tokens (CLIP’s maximum context length)
- Tokenize and encode with CLIP text encoder → 512-d vector
- Return both embedding and transcript text (for inspection)
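The two-stage pipeline above can be sketched like this. It is an illustrative reconstruction, not the actual implementation: loading both models inside the function is for self-containment only (real code would load them once), and the function name is assumed.

```python
def extract_audio_features(wav_path: str):
    """Transcribe audio with Whisper, then encode the transcript with CLIP's
    text encoder. Requires openai-whisper, torch, and clip."""
    import torch
    import clip
    import whisper

    asr = whisper.load_model("base")
    transcript = asr.transcribe(wav_path)["text"]   # result dict has a "text" key

    clip_model, _ = clip.load("ViT-B/32", device="cpu")
    # truncate=True clips the transcript to CLIP's 77-token context window
    tokens = clip.tokenize([transcript], truncate=True)
    with torch.no_grad():
        embedding = clip_model.encode_text(tokens).squeeze(0).numpy()  # 512-d

    # Return the transcript too, so it can be saved for inspection
    return embedding, transcript
```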
Feature Fusion
Normalization and Concatenation
From extract_features.py:189-193:
L2 normalization ensures balanced contributions from both modalities. Without normalization, one modality might dominate the feature space due to different magnitude scales. The + 1e-8 term prevents division by zero for the zero-vector fallback case.
Handling Missing Modalities
Graceful Degradation: If audio extraction fails (corrupted audio, silent video, no speech), the system uses a zero vector for the audio modality.
Final Feature Vector
- Fixed-dimensional: Always 1024-d regardless of video length
- Semantic: Captures high-level content, not raw pixels/waveforms
- Multimodal: Combines complementary information sources
- Robust: Handles missing audio gracefully
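The fusion and fallback logic described above can be sketched as a single function (assuming 512-d inputs and the 1e-8 epsilon mentioned earlier; the function name is hypothetical):

```python
import numpy as np


def fuse(visual: np.ndarray, audio=None, eps: float = 1e-8) -> np.ndarray:
    """L2-normalize each 512-d modality and concatenate into a 1024-d vector.

    If audio extraction failed, `audio` is None and a zero vector is substituted;
    `eps` keeps the division well-defined in that fallback case.
    """
    if audio is None:
        audio = np.zeros_like(visual)          # graceful degradation
    visual = visual / (np.linalg.norm(visual) + eps)
    audio = audio / (np.linalg.norm(audio) + eps)
    return np.concatenate([visual, audio])     # fixed 1024-d output
```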
Model Loading
From extract_features.py:42-49:
First Run: Models are automatically downloaded from OpenAI on first use:
- CLIP ViT-B/32: ~350 MB
- Whisper Base: ~140 MB
Subsequent runs load from the local cache (~/.cache/clip/ and ~/.cache/whisper/).
Output Format
Labeled Embeddings
Saved to artifacts/labeled_embeddings.pt:
Unlabeled Embeddings
Saved to artifacts/unlabeled_embeddings.pt:
Transcripts
Saved to artifacts/transcripts.json for inspection:
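The exact schema of transcripts.json is not shown in these docs; a hypothetical round-trip example, assuming a simple mapping from video filename to Whisper transcript:

```python
import json
import os
import tempfile

# Hypothetical schema -- the real layout may differ
transcripts = {"video_001.mp4": "check out this quick pasta recipe"}

# Stand-in path for artifacts/transcripts.json
path = os.path.join(tempfile.gettempdir(), "transcripts.json")
with open(path, "w") as f:
    json.dump(transcripts, f, indent=2)

# Inspect later by loading it back
with open(path) as f:
    loaded = json.load(f)
```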
Next Steps
- Learn about the Training Pipeline to understand how these features are used for classification
- See System Architecture for the complete data flow