Overview
The transcription system uses Faster-Whisper, an optimized implementation of OpenAI's Whisper model with CUDA acceleration. It automatically extracts audio from videos and generates timestamped transcription segments for highlight selection and subtitle generation.

How It Works
Function Signature
Components/Transcription.py
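The exact source isn't reproduced on this page. A minimal sketch of what the function in `Components/Transcription.py` likely looks like, assembled from the parameters described below (the function name and the dict-based return shape are assumptions, not the repo's exact code):

```python
# Sketch only: names and return shape are assumptions, not the repo's exact code.
def transcribe_audio(audio_path: str) -> list:
    """Transcribe an audio file; returns [{'text', 'start', 'end'}, ...]."""
    try:
        # Imported lazily so this sketch degrades to [] when the
        # dependencies are missing (mirrors the fail-soft error handling).
        import torch
        from faster_whisper import WhisperModel

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = WhisperModel("base.en", device=device)
        segments, _info = model.transcribe(
            audio_path,
            beam_size=5,
            language="en",
            max_new_tokens=128,
            condition_on_previous_text=False,
        )
        return [
            {"text": s.text, "start": s.start, "end": s.end} for s in segments
        ]
    except Exception as e:
        print(f"Transcription error: {e}")
        return []
```

Any failure (missing file, missing dependency, CUDA error) falls through to the `except` branch and yields an empty list, as described under Error Handling below.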
Device Detection
The system automatically selects the optimal processing device (`Components/Transcription.py:7-8`).
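The check is presumably the standard PyTorch CUDA test; a sketch (the `ImportError` fallback is only for this self-contained example):

```python
# Pick the GPU when CUDA is available, otherwise fall back to CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed at all -> CPU-only fallback (sketch only)
    device = "cpu"
print(device)
```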
CUDA Acceleration: GPU transcription is ~10-15× faster than CPU. A 5-minute video takes ~30 seconds on GPU vs ~5 minutes on CPU.
Model Configuration
Faster-Whisper is initialized with specific parameters (`Components/Transcription.py:9`):
- Model variant: `base.en` is optimized for English-only transcription with a good speed/accuracy balance.
- Processing device: automatically selected based on CUDA availability.
Available Model Sizes
| Model | Parameters | VRAM | Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| tiny.en | 39M | ~1GB | Fastest | Lower | Quick drafts |
| base.en | 74M | ~1GB | Fast | Good | Default (recommended) |
| small.en | 244M | ~2GB | Medium | Better | Higher accuracy |
| medium.en | 769M | ~5GB | Slow | Best | Professional quality |
Transcription Parameters
The transcription process is configured with specific parameters (`Components/Transcription.py:11`):
- Audio path: path to the audio file to transcribe.
- Beam size: beam search width. Higher values (10+) improve accuracy but slow processing; 5 is a good balance.
- Language: source language code. Set to `"en"` for English; use `None` for auto-detection.
- Max new tokens: maximum tokens per segment. 128 allows roughly 20-30 words per segment.
- Condition on previous text: whether to use previous segments as context. `False` prevents cascading errors in long videos.

Transcription Segments
The output is a list of timestamped text segments (`Components/Transcription.py:12-14`).
Segment Structure
Each segment contains:

- `text`: transcribed text content
- `start`: start time in seconds (float)
- `end`: end time in seconds (float)
Example Output
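An illustrative output list (hand-written for this page, not from a real run; assumes segments are stored as dicts):

```python
# Illustrative segment list; timestamps are floats in seconds.
segments = [
    {"text": " Welcome back to the channel.", "start": 0.0, "end": 2.4},
    {"text": " Today we're setting up CUDA.", "start": 2.4, "end": 5.1},
]
# Note the leading space in each text (see the note below).
assert segments[0]["end"] <= segments[1]["start"]
```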
Leading Spaces: Whisper often adds a leading space to transcribed text. This is stripped automatically during subtitle generation (`text.strip()`).

Performance Benchmarks
GPU (CUDA) Performance
Tested on an NVIDIA RTX 3080 (10GB VRAM):

| Video Length | Model | Processing Time | Real-time Factor |
|---|---|---|---|
| 5 minutes | base.en | ~30 seconds | 10× faster |
| 10 minutes | base.en | ~60 seconds | 10× faster |
| 30 minutes | base.en | ~3 minutes | 10× faster |
| 5 minutes | small.en | ~50 seconds | 6× faster |
CPU Performance
Tested on an Intel i7-10700K (8 cores):

| Video Length | Model | Processing Time | Real-time Factor |
|---|---|---|---|
| 5 minutes | base.en | ~5 minutes | 1× (real-time) |
| 10 minutes | base.en | ~10 minutes | 1× (real-time) |
Error Handling
The function includes comprehensive error handling (`Components/Transcription.py:17-19`).
On failure, the function returns an empty list (`[]`) rather than raising an exception, allowing the pipeline to fail gracefully.
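The fail-soft pattern can be sketched as follows (`run_whisper` is a hypothetical stand-in for the real Faster-Whisper call, written here to always fail so the fallback path is visible):

```python
def run_whisper(audio_path):
    # Hypothetical stand-in for the real Faster-Whisper call;
    # it always raises so the except branch below is exercised.
    raise FileNotFoundError(audio_path)

def transcribe(audio_path):
    try:
        return run_whisper(audio_path)
    except Exception as e:
        print(f"Transcription error: {e}")
        return []  # empty list instead of an exception: downstream steps keep running
```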
Audio Extraction
Before transcription, audio must be extracted from the video; this is typically done in the main pipeline.

WAV Format: Whisper works best with uncompressed WAV audio. MP3/AAC may work but can have timing issues.
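One way to do the extraction is via ffmpeg, resampling to the 16 kHz mono WAV that Whisper expects (the actual pipeline command may differ; file names here are placeholders):

```python
import subprocess

video_path, audio_path = "input.mp4", "audio.wav"  # hypothetical paths
cmd = [
    "ffmpeg", "-y",
    "-i", video_path,
    "-vn",           # drop the video stream
    "-ac", "1",      # mono
    "-ar", "16000",  # 16 kHz sample rate, what Whisper expects
    audio_path,
]
# subprocess.run(cmd, check=True)  # uncomment to actually run the extraction
```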
Output Format for Downstream Tasks
For Highlight Selection
The transcription must be formatted as a timestamped string.

For Subtitle Generation
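A minimal sketch of consuming the raw segment list to build SRT-style subtitles (the helper names and dict keys are assumptions; note the `text.strip()` handling Whisper's leading spaces):

```python
def format_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Turn [{'text', 'start', 'end'}, ...] into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_ts(seg['start'])} --> {format_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"   # strip Whisper's leading space
        )
    return "\n".join(blocks)
```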
The raw list format is used directly.

Customizing Transcription
Change Model Size

Edit `Components/Transcription.py:9` and pass a different model name (e.g. `small.en`) when constructing the model.

Improve Accuracy

Edit `Components/Transcription.py:11` and raise the beam size (e.g. to 10).

Enable Language Auto-Detection

Edit `Components/Transcription.py:11` and set the language to `None`.

Force CPU Processing

Edit `Components/Transcription.py:9` and set the device to `"cpu"`.

Troubleshooting
CUDA Out of Memory

If you get CUDA out-of-memory errors with larger models, drop down to a smaller model (see the VRAM column in the table above) or force CPU processing.

Incorrect Timestamps
If subtitle timing is off:

- Ensure audio is extracted at the correct sample rate (Whisper expects 16kHz)
- Verify the video FPS matches the source video
- Check that `video_start_time` is correctly set when using cropped clips
Poor Transcription Quality
- Use a larger model (`small.en` or `medium.en`)
- Increase `beam_size` to 10 or higher
- Ensure audio quality is good (no heavy compression, clear speech)
- For non-English content, use multilingual models without the `.en` suffix
Dependencies
requirements.txt
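The full pin list from `requirements.txt` isn't reproduced here. For the transcription component alone, something like the following would be needed (names only, versions deliberately unpinned; this is an assumption, not the repo's actual file):

```text
faster-whisper
torch          # for CUDA device detection
```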
