
Overview

Extracts multi-modal features from TikTok videos by combining visual and audio embeddings. For each video, the script samples frames uniformly, encodes them with CLIP’s vision encoder, extracts audio for transcription with Whisper, and concatenates the embeddings into a 1024-dimensional feature vector.
Location: source/extract_features.py
Output Files:
  • artifacts/labeled_embeddings.pt - Features for videos in category folders
  • artifacts/unlabeled_embeddings.pt - Features for unsorted videos
  • artifacts/transcripts.json - Audio transcriptions

Configuration Constants

DATA_DIR (Path, default: "data/Favorites/videos")
Directory containing TikTok videos. Videos in subfolders are labeled; videos in the root are unlabeled.
OUTPUT_DIR (Path, default: "artifacts")
Directory where embeddings and transcripts are saved.
N_FRAMES (int, default: 5)
Number of frames to sample uniformly from each video for visual feature extraction.
CLIP_MODEL (str, default: "ViT-B/32")
CLIP model architecture used to encode visual and text features.
WHISPER_MODEL (str, default: "base")
Whisper model size for audio transcription (options: tiny, base, small, medium, large).
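The constants above map to module-level definitions; a minimal sketch of how they might appear at the top of the script (names and defaults are taken from this page, the surrounding layout is assumed):

```python
from pathlib import Path

# Root of the video collection; subfolders are treated as category labels.
DATA_DIR = Path("data/Favorites/videos")
# Where embeddings and transcripts are written.
OUTPUT_DIR = Path("artifacts")
# Frames sampled uniformly per video.
N_FRAMES = 5
# CLIP architecture used for both the vision and text encoders.
CLIP_MODEL = "ViT-B/32"
# Whisper size: tiny, base, small, medium, or large.
WHISPER_MODEL = "base"
```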

Functions

get_device()

Determines whether to use GPU or CPU for model inference.
Returns: str - "cuda" if a GPU is available, otherwise "cpu"
device = get_device()
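A plausible implementation, assuming the standard PyTorch check `torch.cuda.is_available()`; the import guard is an illustrative addition so the sketch degrades gracefully when torch is absent:

```python
def get_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # Illustrative fallback only; the real script requires torch.
        return "cpu"
```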

load_models(device)

Loads CLIP and Whisper models for feature extraction.
device (str, required)
Device to load the models on ("cuda" or "cpu")
Returns: tuple - (clip_model, clip_preprocess, whisper_model)
clip_model, preprocess, whisper_model = load_models("cuda")

extract_visual_features(video_path, clip_model, preprocess, device, n_frames=N_FRAMES)

Samples frames uniformly from a video and encodes them with CLIP vision encoder.
video_path (Path, required)
Path to the video file
clip_model (CLIP, required)
Loaded CLIP model instance
preprocess (callable, required)
CLIP preprocessing function for images
device (str, required)
Device to run inference on
n_frames (int, default: 5)
Number of frames to sample from the video
Returns: torch.Tensor or None - 512-dimensional visual embedding (average-pooled across frames), or None if extraction fails
visual_emb = extract_visual_features(
    video_path=Path("video.mp4"),
    clip_model=clip_model,
    preprocess=preprocess,
    device="cuda",
    n_frames=5
)
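The "samples frames uniformly" step can be sketched without the CLIP dependency: given a video's frame count, pick n evenly spaced indices. The helper name `sample_frame_indices` is hypothetical, not taken from the script:

```python
def sample_frame_indices(total_frames: int, n_frames: int = 5) -> list[int]:
    """Return up to n_frames indices spread evenly across [0, total_frames - 1]."""
    if total_frames <= 0:
        return []
    n = min(n_frames, total_frames)
    if n == 1:
        return [total_frames // 2]
    # Evenly spaced positions from the first frame to the last.
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]
```

The selected frames would then be decoded, run through `preprocess`, and encoded as a batch, with the per-frame embeddings mean-pooled into one 512-d vector.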

extract_audio_features(video_path, whisper_model, clip_model, device)

Extracts audio from video, transcribes it with Whisper, and encodes the transcript with CLIP text encoder.
video_path (Path, required)
Path to the video file
whisper_model (Whisper, required)
Loaded Whisper model instance
clip_model (CLIP, required)
Loaded CLIP model instance for encoding the transcript
device (str, required)
Device to run inference on
Returns: tuple or None - (audio_embedding, transcript_text), where audio_embedding is a 512-dimensional tensor, or None if extraction fails
Process:
  1. Extracts audio to a temporary WAV file using ffmpeg (16 kHz, mono, PCM)
  2. Transcribes the audio with Whisper
  3. Encodes the transcript with the CLIP text encoder (max 77 tokens)
  4. Returns the embedding and transcript text
result = extract_audio_features(
    video_path=Path("video.mp4"),
    whisper_model=whisper_model,
    clip_model=clip_model,
    device="cuda"
)
if result:
    audio_emb, transcript = result
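Step 1 of the process above typically shells out to ffmpeg; a sketch of the command construction, matching the 16 kHz mono PCM settings described here (the helper name `build_ffmpeg_cmd` is hypothetical):

```python
def build_ffmpeg_cmd(video_path: str, wav_path: str) -> list[str]:
    """Build an ffmpeg invocation that extracts 16 kHz mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the temp file if it exists
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate, what Whisper expects
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM WAV
        wav_path,
    ]
```

The command would be run with something like `subprocess.run(cmd, check=True, capture_output=True)`, after which the WAV path is handed to `whisper_model.transcribe`.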

discover_dataset(data_dir)

Scans the data directory to find labeled and unlabeled videos.
data_dir (Path, required)
Root directory containing videos and category subfolders
Returns: tuple - (labeled, unlabeled, label_names) where:
  • labeled: list of (video_path, folder_name) tuples for videos in subfolders
  • unlabeled: list of video_path for videos in root directory
  • label_names: sorted list of category folder names
Directory Structure Expected:
data/Favorites/videos/
├── category1/
│   ├── video1.mp4
│   └── video2.mp4
├── category2/
│   └── video3.mp4
└── unsorted_video.mp4  # unlabeled
labeled, unlabeled, categories = discover_dataset(DATA_DIR)
print(f"Found {len(labeled)} labeled, {len(unlabeled)} unlabeled videos")
print(f"Categories: {categories}")
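The directory walk can be sketched with pathlib alone. The extension filter and sort order here are assumptions; the real script may accept other formats:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".webm"}  # assumed extension filter


def discover_dataset(data_dir: Path):
    """Return (labeled, unlabeled, label_names) as described above."""
    labeled, unlabeled = [], []
    for entry in sorted(data_dir.iterdir()):
        if entry.is_dir():
            # Videos inside a subfolder are labeled with the folder name.
            for video in sorted(entry.iterdir()):
                if video.suffix.lower() in VIDEO_EXTS:
                    labeled.append((video, entry.name))
        elif entry.suffix.lower() in VIDEO_EXTS:
            # Videos in the root directory are unlabeled.
            unlabeled.append(entry)
    label_names = sorted({name for _, name in labeled})
    return labeled, unlabeled, label_names
```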

main()

Main execution function that orchestrates the complete feature extraction pipeline.
Pipeline:
  1. Detects available device (GPU/CPU)
  2. Loads CLIP and Whisper models
  3. Discovers dataset (labeled and unlabeled videos)
  4. Extracts features for labeled videos:
    • Visual features (512-d) from sampled frames
    • Audio features (512-d) from transcribed audio
    • Normalizes and concatenates to 1024-d vector
  5. Extracts features for unlabeled videos
  6. Saves embeddings and transcripts to artifacts directory
Saved Artifacts:
  • labeled_embeddings.pt: Dictionary with keys:
    • features: Tensor of shape [N, 1024]
    • labels: Tensor of integer labels
    • label_names: List of category names
    • video_paths: List of video file paths
  • unlabeled_embeddings.pt: Dictionary with keys:
    • features: Tensor of shape [M, 1024]
    • video_paths: List of video file paths
  • transcripts.json: Dictionary mapping video paths to transcript text
if __name__ == "__main__":
    main()

Usage

# Extract features from all videos
python source/extract_features.py
The script will:
  1. Load CLIP (ViT-B/32) and Whisper (base) models
  2. Process all videos in data/Favorites/videos/
  3. Save embeddings to artifacts/labeled_embeddings.pt and artifacts/unlabeled_embeddings.pt
  4. Save transcripts to artifacts/transcripts.json

Feature Vector Structure

Each video is represented as a 1024-dimensional vector:
  • Dimensions 0-511: Visual features (CLIP vision encoder, average-pooled across frames)
  • Dimensions 512-1023: Audio features (CLIP text encoder of Whisper transcription)
Both modalities are L2-normalized before concatenation to ensure equal contribution.
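The normalize-then-concatenate step can be illustrated with NumPy (the script presumably operates on torch tensors, but the arithmetic is identical):

```python
import numpy as np


def fuse(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """L2-normalize each 512-d modality, then concatenate into a 1024-d vector."""
    v = visual / np.linalg.norm(visual)
    a = audio / np.linalg.norm(audio)
    return np.concatenate([v, a])


rng = np.random.default_rng(0)
vec = fuse(rng.normal(size=512), rng.normal(size=512))
```

Because each half has unit norm after normalization, neither modality can dominate nearest-neighbor or classifier distances purely through embedding scale.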
