Overview

The transcription system uses Faster-Whisper, an optimized implementation of OpenAI’s Whisper model with CUDA acceleration. It automatically extracts audio from videos and generates timestamped transcription segments for highlight selection and subtitle generation.

How It Works

1. Audio Extraction: MoviePy extracts audio from the video file and converts it to WAV format.
2. Device Detection: Automatically detects whether CUDA (GPU) is available and falls back to CPU if not.
3. Model Loading: Loads the Faster-Whisper base.en model, optimized for English transcription.
4. Transcription: Processes the audio and generates timestamped text segments.

Function Signature

Components/Transcription.py
def transcribeAudio(audio_path):
    """
    Transcribe audio file using Faster-Whisper with CUDA acceleration.
    
    Args:
        audio_path: Path to audio file (typically .wav format)
    
    Returns:
        List of [text, start, end] tuples. Example:
        [
            ["Hello everyone", 0.0, 2.5],
            ["Today we're discussing AI", 2.5, 5.8],
            ...
        ]
        Returns empty list [] on error.
    """

Device Detection

The system automatically selects the optimal processing device:
Components/Transcription.py:7-8
Device = "cuda" if torch.cuda.is_available() else "cpu"
print(Device)
CUDA Acceleration: GPU transcription is ~10-15× faster than CPU. A 5-minute video takes ~30 seconds on GPU vs ~5 minutes on CPU.

Model Configuration

Faster-Whisper is initialized with specific parameters:
Components/Transcription.py:9
model = WhisperModel("base.en", device="cuda" if torch.cuda.is_available() else "cpu")
  • model_size (string, default "base.en"): Whisper model variant. base.en is optimized for English-only transcription with a good speed/accuracy balance.
  • device (string, default "cuda" or "cpu"): Processing device, automatically selected based on CUDA availability.

Available Model Sizes

| Model | Parameters | VRAM | Speed | Accuracy | Use Case |
|-----------|------------|------|---------|----------|------------------------|
| tiny.en | 39M | ~1GB | Fastest | Lower | Quick drafts |
| base.en | 74M | ~1GB | Fast | Good | Default (recommended) |
| small.en | 244M | ~2GB | Medium | Better | Higher accuracy |
| medium.en | 769M | ~5GB | Slow | Best | Professional quality |
The base.en model provides the best balance of speed and accuracy for YouTube Shorts generation. Upgrade to small.en or medium.en for better accuracy if you have a powerful GPU.
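If you want to pick a variant programmatically, a minimal sketch could map the VRAM figures from the table above to a model name. The `pick_model_size` helper below is hypothetical and not part of the project, which always uses base.en:

```python
def pick_model_size(vram_gb, has_cuda=True):
    """Hypothetical helper: choose the largest .en Whisper variant that
    fits in the given VRAM, per the model-size table above."""
    if not has_cuda:
        return "base.en"  # CPU: keep the fast default
    if vram_gb >= 5:
        return "medium.en"  # professional quality
    if vram_gb >= 2:
        return "small.en"   # higher accuracy
    return "base.en"        # safe default for ~1GB GPUs
```

On an RTX 3080 (10GB), this would select medium.en; you could then pass the result straight to `WhisperModel(...)`.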

Transcription Parameters

The transcription process is configured with specific parameters:
Components/Transcription.py:11
segments, info = model.transcribe(
    audio=audio_path, 
    beam_size=5, 
    language="en", 
    max_new_tokens=128, 
    condition_on_previous_text=False
)
  • audio (string, required): Path to the audio file to transcribe.
  • beam_size (int, default 5): Beam search width. Higher values (10+) improve accuracy but slow processing; 5 is a good balance.
  • language (string, default "en"): Source language code. Set to "en" for English, or None for auto-detection.
  • max_new_tokens (int, default 128): Maximum tokens per segment. 128 allows roughly 20-30 words per segment.
  • condition_on_previous_text (bool, default False): Whether to use previous segments as context. False prevents cascading errors in long videos.
Context Conditioning: Setting condition_on_previous_text=True can improve coherence but may cause cascading errors if early segments are misrecognized. Keep it False for robust transcription.

Transcription Segments

The output is a list of timestamped text segments:
Components/Transcription.py:12-14
segments = list(segments)
extracted_texts = [[segment.text, segment.start, segment.end] for segment in segments]
print(f"✓ Transcription complete: {len(extracted_texts)} segments extracted")

Segment Structure

Each segment contains:
  • text: Transcribed text content
  • start: Start time in seconds (float)
  • end: End time in seconds (float)

Example Output

[
    [" Hello everyone, welcome back to the channel.", 0.0, 3.2],
    [" Today we're going to discuss machine learning.", 3.2, 6.8],
    [" Let's start with the basics of neural networks.", 6.8, 10.5],
    ...
]
Leading Spaces: Whisper often adds a leading space to transcribed text. This is stripped automatically in the subtitle generation (text.strip()).
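Because each segment already carries start and end times in seconds, converting the list to a standard subtitle format is straightforward. The sketch below is a hypothetical `segments_to_srt` helper, not part of Components/; note how it applies `text.strip()` to drop Whisper's leading spaces:

```python
def segments_to_srt(transcriptions):
    """Render a list of [text, start, end] segments as SRT-formatted text."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, (text, start, end) in enumerate(transcriptions, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(blocks)
```

For example, `segments_to_srt([[" Hello everyone", 0.0, 2.5]])` yields a block whose timing line reads `00:00:00,000 --> 00:00:02,500`.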

Performance Benchmarks

GPU (CUDA) Performance

Tested on NVIDIA RTX 3080 (10GB VRAM):
| Video Length | Model | Processing Time | Real-time Factor |
|--------------|----------|-----------------|------------------|
| 5 minutes | base.en | ~30 seconds | 10× faster |
| 10 minutes | base.en | ~60 seconds | 10× faster |
| 30 minutes | base.en | ~3 minutes | 10× faster |
| 5 minutes | small.en | ~50 seconds | 6× faster |

CPU Performance

Tested on Intel i7-10700K (8 cores):
| Video Length | Model | Processing Time | Real-time Factor |
|--------------|---------|-----------------|------------------|
| 5 minutes | base.en | ~5 minutes | 1× (real-time) |
| 10 minutes | base.en | ~10 minutes | 1× (real-time) |
CUDA Acceleration: GPU processing is ~10× faster than CPU. For production use, a GPU with at least 4GB VRAM is highly recommended.

Error Handling

The function includes comprehensive error handling:
Components/Transcription.py:17-19
except Exception as e:
    print("Transcription Error:", e)
    return []
Errors return an empty list [] rather than raising exceptions, allowing the pipeline to fail gracefully.

Audio Extraction

Before transcription, audio must be extracted from the video. This is typically done in the main pipeline:
from moviepy.editor import VideoFileClip

# Extract audio from video
video = VideoFileClip("video.mp4")
audio_path = "audio.wav"
video.audio.write_audiofile(audio_path)
video.close()

# Transcribe
from Components.Transcription import transcribeAudio
transcriptions = transcribeAudio(audio_path)
WAV Format: Whisper works best with uncompressed WAV audio. MP3/AAC may work but can have timing issues.

Output Format for Downstream Tasks

For Highlight Selection

The transcription must be formatted as a timestamped string:
TransText = ""
for text, start, end in transcriptions:
    TransText += f"{start} - {end}: {text}\n"

# Pass to GPT-4o-mini for highlight selection
from Components.LanguageTasks import GetHighlight
start_time, end_time = GetHighlight(TransText)

For Subtitle Generation

The raw list format is used directly:
from Components.Subtitles import add_subtitles_to_video

add_subtitles_to_video(
    input_video="cropped_video.mp4",
    output_video="final_video.mp4",
    transcriptions=transcriptions,  # Raw list of [text, start, end]
    video_start_time=68  # If video was cropped from 68 seconds
)
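Whisper segments can be quite granular for subtitles. If you want fewer, longer subtitle lines, a small post-processing pass can merge very short segments into their predecessor before the list reaches add_subtitles_to_video. This `merge_short_segments` helper is a hypothetical sketch, not part of the project:

```python
def merge_short_segments(transcriptions, min_duration=1.0):
    """Hypothetical helper: fold segments shorter than min_duration seconds
    into the previous segment, extending its end time."""
    merged = []
    for text, start, end in transcriptions:
        if merged and (end - start) < min_duration:
            prev_text, prev_start, _ = merged[-1]
            merged[-1] = [prev_text + " " + text.strip(), prev_start, end]
        else:
            merged.append([text.strip(), start, end])
    return merged
```

The merged list keeps the same [text, start, end] shape, so it is a drop-in replacement for the transcriptions argument.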

Customizing Transcription

Edit Components/Transcription.py:9:
# Use small model for better accuracy
model = WhisperModel("small.en", device="cuda" if torch.cuda.is_available() else "cpu")

# Use multilingual base model
model = WhisperModel("base", device="cuda" if torch.cuda.is_available() else "cpu")
Edit Components/Transcription.py:11:
segments, info = model.transcribe(
    audio=audio_path,
    beam_size=10,  # Increase beam search (slower but more accurate)
    language="en",
    max_new_tokens=256,  # Allow longer segments
    condition_on_previous_text=True  # Use context (may help with coherence)
)
Edit Components/Transcription.py:9 and 11:
# Use multilingual model
model = WhisperModel("base", device="cuda" if torch.cuda.is_available() else "cpu")

segments, info = model.transcribe(
    audio=audio_path,
    beam_size=5,
    language=None,  # Auto-detect language
    max_new_tokens=128,
    condition_on_previous_text=False
)
Edit Components/Transcription.py:9:
# Always use CPU (useful for testing or systems without CUDA)
model = WhisperModel("base.en", device="cpu")

Troubleshooting

CUDA Out of Memory

If you get CUDA OOM errors with larger models:
# Use a smaller model
model = WhisperModel("tiny.en", device="cuda")

# Or fall back to CPU
model = WhisperModel("base.en", device="cpu")

Incorrect Timestamps

If subtitle timing is off:
  • Ensure audio is extracted at the correct sample rate (Whisper expects 16kHz)
  • Verify the video FPS matches the source video
  • Check that video_start_time is correctly set when using cropped clips
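To rule out a bad extraction step, you can inspect the WAV header with Python's standard wave module. This `check_wav_sample_rate` helper is a hypothetical diagnostic, not part of the project; faster-whisper resamples internally, so a different source rate is a clue rather than an error in itself:

```python
import wave

def check_wav_sample_rate(audio_path, expected_rate=16000):
    """Hypothetical diagnostic: read a WAV file's sample rate and flag
    when it differs from the 16kHz rate Whisper operates at."""
    with wave.open(audio_path, "rb") as wav:
        rate = wav.getframerate()
    if rate != expected_rate:
        print(f"Note: {audio_path} is {rate} Hz (Whisper operates at {expected_rate} Hz)")
    return rate
```

Running it on the extracted audio.wav before transcription makes sample-rate mismatches easy to spot in the console.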

Poor Transcription Quality

  • Use a larger model (small.en or medium.en)
  • Increase beam_size to 10 or higher
  • Ensure audio quality is good (no heavy compression, clear speech)
  • For non-English content, use multilingual models without .en suffix

Dependencies

requirements.txt
faster-whisper>=0.9.0
torch>=2.0.0
torchaudio>=2.0.0
CUDA Setup: For GPU acceleration, ensure PyTorch is installed with CUDA support:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118