Overview
The transcription system uses Faster-Whisper, an optimized implementation of OpenAI's Whisper model with CUDA acceleration. It automatically extracts audio from videos and generates timestamped transcription segments for highlight selection and subtitle generation.

How It Works
Function Signature
Components/Transcription.py
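The exact source isn't reproduced on this page. A minimal sketch of what the function in `Components/Transcription.py` likely looks like, assembled from the parameters described below (the function name and the dict-based return shape are assumptions, not the repo's exact code):

```python
# Sketch only: names and return shape are assumptions, not the repo's exact code.
def transcribe_audio(audio_path: str) -> list:
    """Transcribe an audio file; returns [{'text', 'start', 'end'}, ...]."""
    try:
        # Imported lazily so this sketch degrades to [] when the
        # dependencies are missing (mirrors the fail-soft error handling).
        import torch
        from faster_whisper import WhisperModel

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = WhisperModel("base.en", device=device)
        segments, _info = model.transcribe(
            audio_path,
            beam_size=5,
            language="en",
            max_new_tokens=128,
            condition_on_previous_text=False,
        )
        return [
            {"text": s.text, "start": s.start, "end": s.end} for s in segments
        ]
    except Exception as e:
        print(f"Transcription error: {e}")
        return []
```

Any failure (missing file, missing dependency, CUDA error) falls through to the `except` branch and yields an empty list, as described under Error Handling below.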
Device Detection
The system automatically selects the optimal processing device (`Components/Transcription.py:7-8`).
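The check is presumably the standard PyTorch CUDA test; a sketch (the `ImportError` fallback is only for this self-contained example):

```python
# Pick the GPU when CUDA is available, otherwise fall back to CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed at all -> CPU-only fallback (sketch only)
    device = "cpu"
print(device)
```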
CUDA Acceleration: GPU transcription is ~10-15× faster than CPU. A 5-minute video takes ~30 seconds on GPU vs ~5 minutes on CPU.
Model Configuration
Faster-Whisper is initialized with specific parameters (`Components/Transcription.py:9`):
- Model variant: `base.en` is optimized for English-only transcription with a good speed/accuracy balance.
- Processing device: automatically selected based on CUDA availability.
Available Model Sizes
| Model | Parameters | VRAM | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| tiny.en | 39M | ~1GB | Fastest | Lower | Quick drafts |
| base.en | 74M | ~1GB | Fast | Good | Default (recommended) |
| small.en | 244M | ~2GB | Medium | Better | Higher accuracy |
| medium.en | 769M | ~5GB | Slow | Best | Professional quality |
Transcription Parameters
The transcription process is configured with specific parameters (`Components/Transcription.py:11`):
- Audio path: path to the audio file to transcribe.
- Beam size: beam search width. Higher values (10+) improve accuracy but slow processing; 5 is a good balance.
- Language: source language code. Set to `"en"` for English; use `None` for auto-detection.
- Max new tokens: maximum tokens per segment. 128 allows roughly 20-30 words per segment.
- Condition on previous text: whether to use previous segments as context. `False` prevents cascading errors in long videos.

Transcription Segments
The output is a list of timestamped text segments (`Components/Transcription.py:12-14`).
Segment Structure
Each segment contains:

- `text`: transcribed text content
- `start`: start time in seconds (float)
- `end`: end time in seconds (float)
Example Output
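An illustrative output list (hand-written for this page, not from a real run; assumes segments are stored as dicts):

```python
# Illustrative segment list; timestamps are floats in seconds.
segments = [
    {"text": " Welcome back to the channel.", "start": 0.0, "end": 2.4},
    {"text": " Today we're setting up CUDA.", "start": 2.4, "end": 5.1},
]
# Note the leading space in each text (see the note below).
assert segments[0]["end"] <= segments[1]["start"]
```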
Leading Spaces: Whisper often adds a leading space to transcribed text. This is stripped automatically during subtitle generation (`text.strip()`).

Performance Benchmarks
GPU (CUDA) Performance
Tested on an NVIDIA RTX 3080 (10GB VRAM):

| Video Length | Model | Processing Time | Real-time Factor |
|---|---|---|---|
| 5 minutes | base.en | ~30 seconds | 10× faster |
| 10 minutes | base.en | ~60 seconds | 10× faster |
| 30 minutes | base.en | ~3 minutes | 10× faster |
| 5 minutes | small.en | ~50 seconds | 6× faster |
CPU Performance
Tested on an Intel i7-10700K (8 cores):

| Video Length | Model | Processing Time | Real-time Factor |
|---|---|---|---|
| 5 minutes | base.en | ~5 minutes | 1× (real-time) |
| 10 minutes | base.en | ~10 minutes | 1× (real-time) |
Error Handling
The function includes comprehensive error handling (`Components/Transcription.py:17-19`).
On failure, the function returns an empty list (`[]`) rather than raising an exception, allowing the pipeline to fail gracefully.
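The fail-soft pattern can be sketched as follows (`run_whisper` is a hypothetical stand-in for the real Faster-Whisper call, written here to always fail so the fallback path is visible):

```python
def run_whisper(audio_path):
    # Hypothetical stand-in for the real Faster-Whisper call;
    # it always raises so the except branch below is exercised.
    raise FileNotFoundError(audio_path)

def transcribe(audio_path):
    try:
        return run_whisper(audio_path)
    except Exception as e:
        print(f"Transcription error: {e}")
        return []  # empty list instead of an exception: downstream steps keep running
```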
Audio Extraction
Before transcription, audio must be extracted from the video; this is typically done in the main pipeline.

WAV Format: Whisper works best with uncompressed WAV audio. MP3/AAC may work but can have timing issues.
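One way to do the extraction is via ffmpeg, resampling to the 16 kHz mono WAV that Whisper expects (the actual pipeline command may differ; file names here are placeholders):

```python
import subprocess

video_path, audio_path = "input.mp4", "audio.wav"  # hypothetical paths
cmd = [
    "ffmpeg", "-y",
    "-i", video_path,
    "-vn",           # drop the video stream
    "-ac", "1",      # mono
    "-ar", "16000",  # 16 kHz sample rate, what Whisper expects
    audio_path,
]
# subprocess.run(cmd, check=True)  # uncomment to actually run the extraction
```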
Output Format for Downstream Tasks
For Highlight Selection
The transcription must be formatted as a timestamped string.

For Subtitle Generation
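A minimal sketch of consuming the raw segment list to build SRT-style subtitles (the helper names and dict keys are assumptions; note the `text.strip()` handling Whisper's leading spaces):

```python
def format_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Turn [{'text', 'start', 'end'}, ...] into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_ts(seg['start'])} --> {format_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"   # strip Whisper's leading space
        )
    return "\n".join(blocks)
```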
The raw list format is used directly.

Customizing Transcription
Change Model Size

Edit `Components/Transcription.py:9` and pass a different model name (e.g. `small.en`) when constructing the model.

Improve Accuracy

Edit `Components/Transcription.py:11` and raise the beam size (e.g. to 10).

Enable Language Auto-Detection

Edit `Components/Transcription.py:11` and set the language to `None`.

Force CPU Processing

Edit `Components/Transcription.py:9` and set the device to `"cpu"`.

Troubleshooting
CUDA Out of Memory

If you get CUDA out-of-memory errors with larger models, drop down to a smaller model (see the VRAM column in the table above) or force CPU processing.

Incorrect Timestamps
If subtitle timing is off:

- Ensure audio is extracted at the correct sample rate (Whisper expects 16kHz)
- Verify the video FPS matches the source video
- Check that `video_start_time` is correctly set when using cropped clips
Poor Transcription Quality
- Use a larger model (`small.en` or `medium.en`)
- Increase `beam_size` to 10 or higher
- Ensure audio quality is good (no heavy compression, clear speech)
- For non-English content, use multilingual models without the `.en` suffix
Dependencies
requirements.txt
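The full pin list from `requirements.txt` isn't reproduced here. For the transcription component alone, something like the following would be needed (names only, versions deliberately unpinned; this is an assumption, not the repo's actual file):

```text
faster-whisper
torch          # for CUDA device detection
```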
