Endpoint
POST /v1/audio/transcriptions
Transcribes audio into text using speech-to-text models.
Request
The request body must be multipart/form-data.
- x-portkey-provider (header, required): The AI provider to use (e.g., openai)
- x-portkey-api-key (header, required): Your API key for the specified provider
- file (required): The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. File size limit: 25 MB.
- model (required): The model to use for transcription (e.g., whisper-1)
- language (optional): Language of the audio in ISO-639-1 format (e.g., en, fr, es). Providing the language improves accuracy and reduces latency.
- prompt (optional): Text to guide the model's style or to continue a previous audio segment. The prompt should match the audio language.
- response_format (optional): Format of the response: json, text, srt, verbose_json, or vtt
- temperature (optional): Sampling temperature between 0 and 1. Higher values produce more random output.
- timestamp_granularities (optional): word and/or segment (only available with response_format=verbose_json)
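Because every field travels as a multipart form part, a small helper that adds only the optional fields the caller supplies keeps client code tidy. A minimal sketch (the helper name and structure are illustrative, not part of the API):

```python
def build_transcription_form(model, **options):
    """Assemble the non-file form fields for POST /v1/audio/transcriptions.

    Includes an optional field (language, prompt, response_format,
    temperature) only when the caller supplies it; the audio file itself
    is sent as a separate multipart part.
    """
    form = {"model": model}
    for field in ("language", "prompt", "response_format", "temperature"):
        if field in options:
            form[field] = str(options[field])
    return form

# Example: request a French transcription rendered as SRT
fields = build_transcription_form("whisper-1", language="fr", response_format="srt")
```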
Response
- text: The transcribed text
- task: Type of task (transcribe)
- language: Language of the audio
- duration: Duration of the audio in seconds
- segments: Array of transcription segments with timestamps; each segment carries the transcribed text for that span
- words: Array of words with timestamps (when timestamp_granularities includes word)
Fields other than text are returned only with response_format=verbose_json.
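For orientation, a verbose_json response has roughly the following shape, sketched here as a Python dict. All values are made up, and real responses include additional per-segment metadata:

```python
# Illustrative shape of a verbose_json response (values are invented)
verbose_response = {
    "task": "transcribe",
    "language": "english",
    "duration": 4.2,
    "text": "Hello, this is a test.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 4.2, "text": "Hello, this is a test."}
    ],
    # words is present only when timestamp_granularities includes "word"
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.4}
    ],
}
```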
Examples
Basic Transcription
curl http://localhost:8787/v1/audio/transcriptions \
-H "x-portkey-provider: openai" \
-H "x-portkey-api-key: sk-..." \
-F file="@audio.mp3" \
-F model="whisper-1"
Response
{
  "text": "Hello, this is a test transcription of audio to text."
}
Python SDK
from portkey_ai import Portkey

client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("audio.mp3", "rb")

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)

print(transcription.text)
JavaScript SDK
import Portkey from 'portkey-ai';
import fs from 'fs';
const client = new Portkey({
  provider: 'openai',
  Authorization: 'sk-...'
});

const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('audio.mp3'),
  model: 'whisper-1'
});
console.log(transcription.text);
With Language Specification
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("french_audio.mp3", "rb")

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="fr"  # French
)
print(transcription.text)
Verbose JSON with Timestamps
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("audio.mp3", "rb")

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["segment"]
)
print(f"Language: {transcription.language}")
print(f"Duration: {transcription.duration}s")
print(f"\nText: {transcription.text}\n")
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
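Segment timestamps like these are enough to build subtitles yourself when you want formatting the built-in srt output does not offer; the only real work is converting seconds to the HH:MM:SS,mmm timestamp format. A sketch (helper names are illustrative):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render (start, end, text) tuples as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```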
Word-Level Timestamps
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("audio.mp3", "rb")

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"]
)

for word in transcription.words:
    print(f"{word.start:.2f}s: {word.word}")
SRT Subtitles
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("video_audio.mp3", "rb")

srt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

# Save as subtitle file
with open("subtitles.srt", "w") as f:
    f.write(srt.text)
WebVTT Subtitles
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("video_audio.mp3", "rb")

vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)

# Save as WebVTT subtitle file
with open("subtitles.vtt", "w") as f:
    f.write(vtt.text)
With Prompt for Context
from portkey_ai import Portkey
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_file = open("technical_audio.mp3", "rb")

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    prompt="This is a technical discussion about Portkey AI Gateway, APIs, and machine learning."
)
print(transcription.text)
Process Multiple Files
from portkey_ai import Portkey
from pathlib import Path

client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

audio_dir = Path("audio_files")

for audio_file_path in audio_dir.glob("*.mp3"):
    with open(audio_file_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )

    # Save transcription
    output_path = audio_file_path.with_suffix(".txt")
    output_path.write_text(transcription.text)
    print(f"Transcribed: {audio_file_path.name}")
Microphone Recording and Transcription
from portkey_ai import Portkey
import pyaudio
import wave
client = Portkey(
    provider="openai",
    Authorization="sk-..."
)

def record_audio(filename, duration=5):
    """Record audio from microphone"""
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)

    print("Recording...")
    frames = []
    for i in range(0, int(RATE / CHUNK * duration)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("Finished recording")

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save as WAV
    wf = wave.open(filename, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

# Record and transcribe
record_audio("recording.wav", duration=5)

with open("recording.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
print(f"Transcription: {transcription.text}")
Supported Audio Formats
- mp3: MPEG audio
- mp4: MPEG-4 audio
- mpeg: MPEG audio
- mpga: MPEG audio
- m4a: MPEG-4 audio
- wav: Waveform audio
- webm: WebM audio
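Both the format and the 25 MB size constraint can be checked cheaply on the client before uploading. A sketch (the helper name is illustrative):

```python
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit

def check_upload(filename, size_bytes):
    """Raise ValueError if the file would be rejected by the endpoint."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: .{ext}")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB limit; split it first")
```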
Supported Languages
Whisper supports 90+ languages including:
- English (en), Spanish (es), French (fr), German (de)
- Chinese (zh), Japanese (ja), Korean (ko)
- Arabic (ar), Hindi (hi), Portuguese (pt)
- Russian (ru), Italian (it), Dutch (nl)
See the provider's documentation for the full language list.
Best Practices
- Specify Language: Improves accuracy and reduces latency
- Audio Quality: Use clear audio with minimal background noise
- File Size: Keep files under 25 MB (split longer audio if needed)
- Use Prompts: Provide context for technical terms or proper nouns
- Format Selection: Use verbose_json for timestamps, srt/vtt for subtitles
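On the file-size point: WAV recordings can be split with the standard library alone, while compressed formats like mp3 need an external tool such as ffmpeg. A sketch of a WAV splitter (helper name is illustrative); each resulting chunk can then be transcribed in its own request:

```python
import wave

def split_wav(path, chunk_seconds, prefix="chunk"):
    """Split a WAV file into pieces of at most chunk_seconds each,
    so every piece stays under the 25 MB upload limit.
    Returns the list of chunk filenames written."""
    names = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{prefix}_{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)  # header is patched on close
                dst.writeframes(frames)
            names.append(name)
            index += 1
    return names
```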
Use Cases
- Meeting Transcriptions: Convert meeting recordings to text
- Subtitles: Generate subtitles for videos
- Voice Notes: Transcribe voice memos and notes
- Accessibility: Create text versions of audio content
- Content Analysis: Process podcasts and interviews
- Call Center: Transcribe customer service calls