
Overview

Performs speech recognition using Faster Whisper, a reimplementation of OpenAI’s Whisper model built on the CTranslate2 inference engine. It provides up to 4x faster inference than the standard Whisper implementation with similar accuracy.
This method works completely offline - no internet connection is required after the model has been downloaded.

Method Signature

recognize_faster_whisper(
    audio_data: AudioData,
    model: str = "base",
    show_dict: bool = False,
    init_options: dict | None = None,
    language: str | None = None,
    task: Literal["transcribe", "translate"] = "transcribe",
    beam_size: int = 5,
    **transcribe_options
) -> str | dict

Parameters

audio_data
AudioData
required
The audio data to recognize. Must be an AudioData instance.
model
str
default:"base"
Whisper model size to use. Available models:
  • "tiny" - Smallest, fastest (39M parameters)
  • "base" - Good balance (74M parameters)
  • "small" - Better accuracy (244M parameters)
  • "medium" - High accuracy (769M parameters)
  • "large" or "large-v3" - Best accuracy (1550M parameters)
  • "turbo" - Pruned large-v3 variant optimized for speed (809M parameters)
show_dict
bool
default:"False"
If True, returns a dictionary with full transcription details including detected language and segments. If False, returns only the transcription text.
init_options
dict | None
default:"None"
Options for model initialization:
  • device: "cpu", "cuda", or "auto" (default: auto)
  • compute_type: "int8", "float16", "float32" (default: auto-selected)
  • download_root: Directory to cache models (default: ~/.cache/huggingface/hub)
language
str | None
default:"None"
Language code (e.g., "en", "es", "fr"). If not specified, the language is automatically detected.
task
str
default:"transcribe"
Task to perform:
  • "transcribe" - Transcribe audio in its original language
  • "translate" - Transcribe and translate to English
beam_size
int
default:"5"
Beam size for beam search decoding. Higher values may improve accuracy but increase computation time.
**transcribe_options
kwargs
Additional options passed to Faster Whisper’s transcribe method. See Faster Whisper documentation for all available options.
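As a sketch, two commonly useful options that faster-whisper’s `transcribe` method accepts are `vad_filter` and `initial_prompt` (verify against the Faster Whisper documentation for your installed version):

```python
# Extra keyword arguments are forwarded verbatim to faster-whisper's
# transcribe() method; vad_filter and initial_prompt are two of its options.
transcribe_options = {
    "vad_filter": True,  # skip silent stretches using voice activity detection
    "initial_prompt": "A technical talk about speech recognition.",  # biases decoding vocabulary
}

# Usage, assuming `r` is a Recognizer and `audio` is an AudioData instance:
# text = r.recognize_faster_whisper(audio, model="base", **transcribe_options)
```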

Return Value

Simple Mode (show_dict=False)

text
str
The transcribed text from the audio.

Dictionary Mode (show_dict=True)

response
dict
Dictionary containing:
  • text (str): The transcribed text
  • segments (list): List of segment objects with timestamps and text
  • language (str): Detected or specified language code

Exceptions

UnknownValueError
Exception
Raised if the speech is unintelligible or transcription fails.
RequestError
Exception
Raised if there’s an error loading the model or processing audio.

Setup

Installation

Install Faster Whisper:
pip install SpeechRecognition[faster-whisper]
Or install the package directly:
pip install faster-whisper
For GPU acceleration, also install CUDA and cuDNN. See Faster Whisper installation guide.

Model Download

Models are automatically downloaded on first use and cached locally. The first run downloads the selected model, which can take a few minutes depending on model size and connection speed.

Examples

Basic Usage

import speech_recognition as sr

r = sr.Recognizer()

# From microphone
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

try:
    text = r.recognize_faster_whisper(audio, model="base")
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Error: {e}")

With Language Specification

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source)

# Specify language for faster processing
text = r.recognize_faster_whisper(
    audio,
    model="small",
    language="en"
)
print(f"English transcription: {text}")

From Audio File

import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile('audio.wav') as source:
    audio = r.record(source)

text = r.recognize_faster_whisper(audio, model="medium")
print(f"Transcription: {text}")

With Full Response Details

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source)

# Get detailed response with segments and timestamps
result = r.recognize_faster_whisper(
    audio,
    model="base",
    show_dict=True
)

print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
print(f"Segments: {len(result['segments'])}")

# Print each segment with timestamps
for segment in result['segments']:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Translation to English

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    print("Say something in any language...")
    audio = r.listen(source)

# Transcribe and translate to English
text = r.recognize_faster_whisper(
    audio,
    model="base",
    task="translate"
)
print(f"English translation: {text}")

GPU Acceleration

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source)

# Use GPU with float16 precision for faster processing
text = r.recognize_faster_whisper(
    audio,
    model="large-v3",
    init_options={
        "device": "cuda",
        "compute_type": "float16"
    }
)
print(f"GPU-accelerated: {text}")

Custom Model Cache Location

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source)

# Specify custom download location
text = r.recognize_faster_whisper(
    audio,
    model="base",
    init_options={
        "download_root": "/path/to/models"
    }
)

Performance Comparison

Faster Whisper provides significant performance improvements over standard Whisper:
| Model  | Standard Whisper | Faster Whisper | Speedup |
|--------|------------------|----------------|---------|
| tiny   | 32s              | 6s             | 5.3x    |
| base   | 46s              | 10s            | 4.6x    |
| small  | 83s              | 18s            | 4.6x    |
| medium | 152s             | 32s            | 4.8x    |
| large  | 251s             | 55s            | 4.6x    |

Times are approximate for 1 minute of audio on CPU (Intel i7-12700K).
Use GPU acceleration for even faster processing. GPU inference can be 10-30x faster than CPU.

Model Selection Guide

  • tiny - Testing, prototyping (lowest accuracy)
  • base - General use, real-time applications
  • small - Good balance of speed and accuracy
  • medium - High accuracy needed, acceptable latency
  • large/large-v3 - Maximum accuracy, offline processing
  • turbo - Near-large accuracy at substantially higher speed

Language Support

Supports 99 languages including:
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Dutch (nl)
  • Russian (ru)
  • Chinese (zh)
  • Japanese (ja)
  • Korean (ko)
  • Arabic (ar)
  • Hindi (hi)
  • And 86 more…

Compute Types

Different compute types offer trade-offs between speed, memory, and accuracy:
  • int8 - 4x smaller, 2x faster, slight accuracy loss
  • float16 - 2x smaller, 2x faster (GPU only), minimal accuracy loss
  • float32 - Full precision, slower, maximum accuracy
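The trade-offs above map directly onto init_options; a minimal sketch (parameter names as documented under init_options):

```python
# Typical compute-type choices per device (see trade-offs above).
cpu_init = {"device": "cpu", "compute_type": "int8"}      # smallest and fastest on CPU
gpu_init = {"device": "cuda", "compute_type": "float16"}  # float16 is GPU-only

# Usage, assuming `r` is a Recognizer and `audio` is an AudioData instance:
# text = r.recognize_faster_whisper(audio, model="base", init_options=cpu_init)
```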

Best Practices

  • Start with the base model and increase to small or medium if accuracy is insufficient.
  • Specify the language when known to improve both speed and accuracy.
  • Use the int8 compute type for CPU inference to reduce memory usage and improve speed.
  • The large models require significant RAM (8GB+); use smaller models on resource-constrained devices.
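The practices above can be combined in one place; a minimal sketch (build_kwargs is a hypothetical helper, not part of the library):

```python
# Hypothetical helper that assembles recognize_faster_whisper keyword
# arguments following the best practices above.
def build_kwargs(known_language=None, on_gpu=False):
    kwargs = {
        "model": "base",  # start small; move to "small"/"medium" if accuracy is insufficient
        "init_options": {
            "device": "cuda" if on_gpu else "cpu",
            "compute_type": "float16" if on_gpu else "int8",  # int8 for CPU inference
        },
    }
    if known_language:
        kwargs["language"] = known_language  # skipping language detection saves time
    return kwargs

# Usage: text = r.recognize_faster_whisper(audio, **build_kwargs("en"))
```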

External Resources