
Overview

Extracts multi-modal features from TikTok videos by combining visual and audio embeddings. For each video, the script samples frames uniformly, encodes them with CLIP’s vision encoder, extracts audio for transcription with Whisper, and concatenates the embeddings into a 1024-dimensional feature vector.
Location: source/extract_features.py
Output Files:
  • artifacts/labeled_embeddings.pt - Features for videos in category folders
  • artifacts/unlabeled_embeddings.pt - Features for unsorted videos
  • artifacts/transcripts.json - Audio transcriptions

Configuration Constants

DATA_DIR (Path, default: "data/Favorites/videos")
Directory containing TikTok videos. Videos in subfolders are labeled; videos in the root are unlabeled.
OUTPUT_DIR (Path, default: "artifacts")
Directory where embeddings and transcripts are saved.
N_FRAMES (int, default: 5)
Number of frames to sample uniformly from each video for visual feature extraction.
CLIP_MODEL (str, default: "ViT-B/32")
CLIP model architecture used to encode visual and text features.
WHISPER_MODEL (str, default: "base")
Whisper model size for audio transcription (options: tiny, base, small, medium, large).
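The constants above map to module-level definitions; a minimal sketch of how they might appear at the top of the script (names and defaults are taken from this page, the surrounding layout is assumed):

```python
from pathlib import Path

# Root of the video collection; subfolders are treated as category labels.
DATA_DIR = Path("data/Favorites/videos")
# Where embeddings and transcripts are written.
OUTPUT_DIR = Path("artifacts")
# Frames sampled uniformly per video.
N_FRAMES = 5
# CLIP architecture used for both the vision and text encoders.
CLIP_MODEL = "ViT-B/32"
# Whisper size: tiny, base, small, medium, or large.
WHISPER_MODEL = "base"
```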

Functions

get_device()

Determines whether to use GPU or CPU for model inference.
Returns: str - "cuda" if a GPU is available, otherwise "cpu"
device = get_device()
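A plausible implementation, assuming the standard PyTorch check `torch.cuda.is_available()`; the import guard is an illustrative addition so the sketch degrades gracefully when torch is absent:

```python
def get_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # Illustrative fallback only; the real script requires torch.
        return "cpu"
```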

load_models(device)

Loads CLIP and Whisper models for feature extraction.
device (str, required)
Device to load the models on ("cuda" or "cpu")
Returns: tuple - (clip_model, clip_preprocess, whisper_model)
clip_model, preprocess, whisper_model = load_models("cuda")

extract_visual_features(video_path, clip_model, preprocess, device, n_frames=N_FRAMES)

Samples frames uniformly from a video and encodes them with CLIP vision encoder.
video_path (Path, required)
Path to the video file
clip_model (CLIP, required)
Loaded CLIP model instance
preprocess (callable, required)
CLIP preprocessing function for images
device (str, required)
Device to run inference on
n_frames (int, default: 5)
Number of frames to sample from the video
Returns: torch.Tensor or None - 512-dimensional visual embedding (average-pooled across frames), or None if extraction fails
visual_emb = extract_visual_features(
    video_path=Path("video.mp4"),
    clip_model=clip_model,
    preprocess=preprocess,
    device="cuda",
    n_frames=5
)
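The "samples frames uniformly" step can be sketched without the CLIP dependency: given a video's frame count, pick n evenly spaced indices. The helper name `sample_frame_indices` is hypothetical, not taken from the script:

```python
def sample_frame_indices(total_frames: int, n_frames: int = 5) -> list[int]:
    """Return up to n_frames indices spread evenly across [0, total_frames - 1]."""
    if total_frames <= 0:
        return []
    n = min(n_frames, total_frames)
    if n == 1:
        return [total_frames // 2]
    # Evenly spaced positions from the first frame to the last.
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]
```

The selected frames would then be decoded, run through `preprocess`, and encoded as a batch, with the per-frame embeddings mean-pooled into one 512-d vector.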

extract_audio_features(video_path, whisper_model, clip_model, device)

Extracts audio from video, transcribes it with Whisper, and encodes the transcript with CLIP text encoder.
video_path (Path, required)
Path to the video file
whisper_model (Whisper, required)
Loaded Whisper model instance
clip_model (CLIP, required)
Loaded CLIP model instance for encoding the transcript
device (str, required)
Device to run inference on
Returns: tuple or None - (audio_embedding, transcript_text), where audio_embedding is a 512-dimensional tensor, or None if extraction fails
Process:
  1. Extracts audio to a temporary WAV file using ffmpeg (16 kHz, mono, PCM)
  2. Transcribes the audio with Whisper
  3. Encodes the transcript with the CLIP text encoder (max 77 tokens)
  4. Returns the embedding and transcript text
result = extract_audio_features(
    video_path=Path("video.mp4"),
    whisper_model=whisper_model,
    clip_model=clip_model,
    device="cuda"
)
if result:
    audio_emb, transcript = result
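Step 1 of the process above typically shells out to ffmpeg; a sketch of the command construction, matching the 16 kHz mono PCM settings described here (the helper name `build_ffmpeg_cmd` is hypothetical):

```python
def build_ffmpeg_cmd(video_path: str, wav_path: str) -> list[str]:
    """Build an ffmpeg invocation that extracts 16 kHz mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the temp file if it exists
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate, what Whisper expects
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM WAV
        wav_path,
    ]
```

The command would be run with something like `subprocess.run(cmd, check=True, capture_output=True)`, after which the WAV path is handed to `whisper_model.transcribe`.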

discover_dataset(data_dir)

Scans the data directory to find labeled and unlabeled videos.
data_dir (Path, required)
Root directory containing videos and category subfolders
Returns: tuple - (labeled, unlabeled, label_names) where:
  • labeled: list of (video_path, folder_name) tuples for videos in subfolders
  • unlabeled: list of video_path for videos in root directory
  • label_names: sorted list of category folder names
Directory Structure Expected:
data/Favorites/videos/
├── category1/
│   ├── video1.mp4
│   └── video2.mp4
├── category2/
│   └── video3.mp4
└── unsorted_video.mp4  # unlabeled
labeled, unlabeled, categories = discover_dataset(DATA_DIR)
print(f"Found {len(labeled)} labeled, {len(unlabeled)} unlabeled videos")
print(f"Categories: {categories}")
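The directory walk can be sketched with pathlib alone. The extension filter and sort order here are assumptions; the real script may accept other formats:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".webm"}  # assumed extension filter


def discover_dataset(data_dir: Path):
    """Return (labeled, unlabeled, label_names) as described above."""
    labeled, unlabeled = [], []
    for entry in sorted(data_dir.iterdir()):
        if entry.is_dir():
            # Videos inside a subfolder are labeled with the folder name.
            for video in sorted(entry.iterdir()):
                if video.suffix.lower() in VIDEO_EXTS:
                    labeled.append((video, entry.name))
        elif entry.suffix.lower() in VIDEO_EXTS:
            # Videos in the root directory are unlabeled.
            unlabeled.append(entry)
    label_names = sorted({name for _, name in labeled})
    return labeled, unlabeled, label_names
```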

main()

Main execution function that orchestrates the complete feature extraction pipeline.
Pipeline:
  1. Detects available device (GPU/CPU)
  2. Loads CLIP and Whisper models
  3. Discovers dataset (labeled and unlabeled videos)
  4. Extracts features for labeled videos:
    • Visual features (512-d) from sampled frames
    • Audio features (512-d) from transcribed audio
    • Normalizes and concatenates to 1024-d vector
  5. Extracts features for unlabeled videos
  6. Saves embeddings and transcripts to artifacts directory
Saved Artifacts:
  • labeled_embeddings.pt: Dictionary with keys:
    • features: Tensor of shape [N, 1024]
    • labels: Tensor of integer labels
    • label_names: List of category names
    • video_paths: List of video file paths
  • unlabeled_embeddings.pt: Dictionary with keys:
    • features: Tensor of shape [M, 1024]
    • video_paths: List of video file paths
  • transcripts.json: Dictionary mapping video paths to transcript text
if __name__ == "__main__":
    main()

Usage

# Extract features from all videos
python source/extract_features.py
The script will:
  1. Load CLIP (ViT-B/32) and Whisper (base) models
  2. Process all videos in data/Favorites/videos/
  3. Save embeddings to artifacts/labeled_embeddings.pt and artifacts/unlabeled_embeddings.pt
  4. Save transcripts to artifacts/transcripts.json

Feature Vector Structure

Each video is represented as a 1024-dimensional vector:
  • Dimensions 0-511: Visual features (CLIP vision encoder, average-pooled across frames)
  • Dimensions 512-1023: Audio features (CLIP text encoder of Whisper transcription)
Both modalities are L2-normalized before concatenation to ensure equal contribution.
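The normalize-then-concatenate step can be illustrated with NumPy (the script presumably operates on torch tensors, but the arithmetic is identical):

```python
import numpy as np


def fuse(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """L2-normalize each 512-d modality, then concatenate into a 1024-d vector."""
    v = visual / np.linalg.norm(visual)
    a = audio / np.linalg.norm(audio)
    return np.concatenate([v, a])


rng = np.random.default_rng(0)
vec = fuse(rng.normal(size=512), rng.normal(size=512))
```

Because each half has unit norm after normalization, neither modality can dominate nearest-neighbor or classifier distances purely through embedding scale.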
