Overview
Extracts multi-modal features from TikTok videos by combining visual and audio embeddings. For each video, the script samples frames uniformly, encodes them with CLIP’s vision encoder, extracts audio for transcription with Whisper, and concatenates the embeddings into a 1024-dimensional feature vector. Location:source/extract_features.py
Output Files:
artifacts/labeled_embeddings.pt- Features for videos in category foldersartifacts/unlabeled_embeddings.pt- Features for unsorted videosartifacts/transcripts.json- Audio transcriptions
Configuration Constants
Directory containing TikTok videos. Videos in subfolders are labeled, videos in root are unlabeled.
Directory where embeddings and transcripts are saved.
Number of frames to sample uniformly from each video for visual feature extraction.
CLIP model architecture to use for encoding visual and text features.
Whisper model size for audio transcription (options: tiny, base, small, medium, large).
Functions
get_device()
Determines whether to use GPU or CPU for model inference. Returns:str - Either "cuda" if GPU is available, otherwise "cpu"
load_models(device)
Loads CLIP and Whisper models for feature extraction.Device to load models on (
"cuda" or "cpu")tuple - (clip_model, clip_preprocess, whisper_model)
extract_visual_features(video_path, clip_model, preprocess, device, n_frames=N_FRAMES)
Samples frames uniformly from a video and encodes them with CLIP vision encoder.Path to the video file
Loaded CLIP model instance
CLIP preprocessing function for images
Device to run inference on
Number of frames to sample from the video
torch.Tensor or None - 512-dimensional visual embedding (average-pooled across frames), or None if extraction fails
extract_audio_features(video_path, whisper_model, clip_model, device)
Extracts audio from video, transcribes it with Whisper, and encodes the transcript with CLIP text encoder.Path to the video file
Loaded Whisper model instance
Loaded CLIP model instance for encoding transcript
Device to run inference on
tuple or None - (audio_embedding, transcript_text) where audio_embedding is a 512-dimensional tensor, or None if extraction fails
Process:
- Extracts audio to temporary WAV file using ffmpeg (16kHz, mono, PCM)
- Transcribes audio with Whisper
- Encodes transcript with CLIP text encoder (max 77 tokens)
- Returns embedding and transcript text
discover_dataset(data_dir)
Scans the data directory to find labeled and unlabeled videos.Root directory containing videos and category subfolders
tuple - (labeled, unlabeled, label_names) where:
labeled: list of(video_path, folder_name)tuples for videos in subfoldersunlabeled: list ofvideo_pathfor videos in root directorylabel_names: sorted list of category folder names
main()
Main execution function that orchestrates the complete feature extraction pipeline. Pipeline:- Detects available device (GPU/CPU)
- Loads CLIP and Whisper models
- Discovers dataset (labeled and unlabeled videos)
- Extracts features for labeled videos:
- Visual features (512-d) from sampled frames
- Audio features (512-d) from transcribed audio
- Normalizes and concatenates to 1024-d vector
- Extracts features for unlabeled videos
- Saves embeddings and transcripts to artifacts directory
labeled_embeddings.pt: Dictionary with keys:features: Tensor of shape[N, 1024]labels: Tensor of integer labelslabel_names: List of category namesvideo_paths: List of video file paths
unlabeled_embeddings.pt: Dictionary with keys:features: Tensor of shape[M, 1024]video_paths: List of video file paths
transcripts.json: Dictionary mapping video paths to transcript text
Usage
- Load CLIP (ViT-B/32) and Whisper (base) models
- Process all videos in
data/Favorites/videos/ - Save embeddings to
artifacts/labeled_embeddings.ptandartifacts/unlabeled_embeddings.pt - Save transcripts to
artifacts/transcripts.json
Feature Vector Structure
Each video is represented as a 1024-dimensional vector:- Dimensions 0-511: Visual features (CLIP vision encoder, average-pooled across frames)
- Dimensions 512-1023: Audio features (CLIP text encoder of Whisper transcription)