Unlike other providers, this does not require credentials in the dashboard. Instead, you configure your service endpoint using environment variables.
How it works
Configuration
Set these environment variables on your Attendee server:
- CUSTOM_ASYNC_TRANSCRIPTION_URL (required): The full URL of your transcription endpoint (e.g., https://192.168.0.1/transcribe)
- CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT (optional): Request timeout in seconds (default: 120)
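As a rough illustration of how these two variables might be read and validated, here is a minimal Python sketch. The `load_transcription_config` helper is hypothetical and is not Attendee's actual implementation; only the variable names and the 120-second default come from the description above.

```python
import os


def load_transcription_config(env=None):
    """Return (url, timeout_seconds) from environment-style settings.

    Hypothetical helper for illustration only; not Attendee's own code.
    """
    env = os.environ if env is None else env

    url = env.get("CUSTOM_ASYNC_TRANSCRIPTION_URL")
    if not url:
        raise ValueError("CUSTOM_ASYNC_TRANSCRIPTION_URL is required")

    # Optional variable; falls back to the documented default of 120 seconds.
    timeout = int(env.get("CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT", "120"))
    return url, timeout
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.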
Expected API format
Your transcription service must accept a POST request with multipart/form-data containing:
- audio: The audio file (sent as raw PCM audio, 16-bit linear PCM)
- sample_rate: The sample rate of the audio file in Hz
- Any additional custom parameters you specify in transcription_settings
Audio format details
- Format: Raw PCM (Pulse Code Modulation)
- Sample width: 16-bit
- Encoding: linear16
- Sample rate: Depends on the meeting source (typically 16000 Hz or 32000 Hz)
- Channels: 1 (mono)
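Because the audio arrives without a container header, many transcription engines will need it wrapped before they can read it. The following is a minimal sketch, using only the Python standard library, of how a service might wrap the raw bytes in a WAV container and compute their duration. The function names are hypothetical, and the sketch assumes the 16-bit mono layout listed above.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate):
    """Wrap raw 16-bit mono PCM in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_bytes, sample_rate):
    """Duration of raw 16-bit mono PCM: bytes / (2 bytes per sample * rate)."""
    return len(pcm_bytes) / (2 * sample_rate)
```

The sample rate must come from the `sample_rate` form field, since raw PCM carries no metadata of its own.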
Example request from Attendee to your service
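To try out an endpoint locally, a request resembling the one described above can be assembled by hand. This Python sketch builds (but does not send) a multipart/form-data POST with the `audio` and `sample_rate` fields plus any extra parameters. It is an approximation for testing, not a capture of what Attendee actually sends: the `build_transcription_request` name, the `audio.pcm` file name, and the `application/octet-stream` content type are all assumptions.

```python
import urllib.request
import uuid


def build_transcription_request(url, pcm_bytes, sample_rate, extra_params=None):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex

    # Plain form fields: the sample rate plus any custom parameters.
    fields = {"sample_rate": str(sample_rate)}
    for key, value in (extra_params or {}).items():
        fields[key] = str(value)

    parts = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")

    # File name and content type below are assumptions, not documented values.
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio"; filename="audio.pcm"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + pcm_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen(request, timeout=120)` to send it to a test server.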
Expected response format
Your service must return a JSON response with this structure:
Response fields
- status: Must be "done" for successful transcription, or "error" for failures
- result.transcription.full_transcript: The complete transcription text
- result.transcription.utterances: Array of utterance objects
- result.transcription.utterances[].words: Array of word objects with timestamps
- result.transcription.utterances[].words[].word: The word text
- result.transcription.utterances[].words[].start: Start time in seconds
- result.transcription.utterances[].words[].end: End time in seconds
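The field paths above imply a nested JSON object. As a minimal sketch, a service written in Python might assemble a success payload as follows. The `build_success_response` helper is hypothetical, the nesting is inferred from the field paths listed here, and the real format may allow further fields that this sketch does not cover.

```python
import json


def build_success_response(utterances):
    """Build a success payload from a list of utterances.

    Each utterance is a list of (word, start_seconds, end_seconds) tuples.
    """
    utterance_objects = [
        {"words": [{"word": w, "start": s, "end": e} for w, s, e in words]}
        for words in utterances
    ]
    # Join every word in order to form the complete transcript text.
    full_transcript = " ".join(
        w for words in utterances for w, _, _ in words
    )
    return {
        "status": "done",
        "result": {
            "transcription": {
                "full_transcript": full_transcript,
                "utterances": utterance_objects,
            }
        },
    }
```

Calling `json.dumps(build_success_response(...))` produces the response body to return from the endpoint.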
Error response format
Usage example
When creating a bot, specify the custom_async provider in transcription_settings:
Any parameters you include under custom_async are sent as form data to your service along with the audio file. You can add any custom parameters your service needs.
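A bot-creation payload that selects this provider could be assembled along these lines. This is a sketch under stated assumptions: `build_bot_payload` is a hypothetical helper, the `meeting_url` and `bot_name` field names are assumed rather than taken from this page, and `language` is an invented custom parameter that only matters if your own service reads it.

```python
import json


def build_bot_payload(meeting_url, bot_name, custom_params):
    """Build a bot-creation payload that selects the custom_async provider."""
    return {
        # Field names outside transcription_settings are assumptions.
        "meeting_url": meeting_url,
        "bot_name": bot_name,
        "transcription_settings": {
            # Everything under custom_async is forwarded to your service.
            "custom_async": dict(custom_params),
        },
    }


payload = build_bot_payload(
    "https://example.com/meeting/123",  # placeholder meeting URL
    "Transcription Bot",
    {"language": "en"},                 # hypothetical custom parameter
)
body = json.dumps(payload)
```

An empty dictionary for `custom_params` selects the provider without sending any extra form fields.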
Notes
- No credentials are needed in the Attendee dashboard
- Your service must respond asynchronously within the timeout period
- Audio is sent as raw PCM format (16-bit linear PCM, mono)
- The sample rate varies based on the meeting source (typically 16000 Hz or 32000 Hz)
- Word-level timestamps are supported if your service provides them
- You have full control over the transcription model, language detection, and processing