The custom async transcription provider allows you to use your own self-hosted transcription service. This is ideal if you want full control over your transcription infrastructure, need to keep audio data on-premises, or want to use a custom transcription model.
Unlike other providers, this one does not require credentials in the Attendee dashboard. Instead, you configure your service endpoint using environment variables.

How it works

  1. Send audio: Attendee sends audio segments as raw PCM audio via HTTP POST to your configured endpoint.
  2. Process audio: Your service processes the audio and returns the transcription asynchronously.
  3. Return result: The response must follow the expected format (see below).

Configuration

Set these environment variables on your Attendee server:
  • CUSTOM_ASYNC_TRANSCRIPTION_URL (required): The full URL of your transcription endpoint (e.g., https://192.168.0.1/transcribe)
  • CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT (optional): Request timeout in seconds (default: 120)

Expected API format

Your transcription service must accept a POST request with multipart/form-data containing:
  • audio: The audio file (sent as raw PCM audio, 16-bit linear PCM)
  • sample_rate: The sample rate of the audio file in Hz
  • Any additional custom parameters you specify in transcription_settings
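
To make the request shape concrete, here is a minimal sketch of a receiving endpoint in Python. Flask is an arbitrary framework choice and run_transcription is a hypothetical helper standing in for your own model; neither is part of Attendee.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # "audio" arrives as a multipart file field containing raw 16-bit PCM bytes
    pcm_bytes = request.files["audio"].read()
    # "sample_rate" arrives as an ordinary form field, in Hz
    sample_rate = int(request.form["sample_rate"])
    # any extra keys from transcription_settings arrive as form fields too
    language = request.form.get("language", "en")

    # run_transcription is hypothetical: it should return a dict in the
    # response format described below
    response = run_transcription(pcm_bytes, sample_rate, language)
    return jsonify(response)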

Audio format details

  • Format: Raw PCM (Pulse Code Modulation)
  • Sample width: 16-bit
  • Encoding: linear16
  • Sample rate: Depends on the meeting source (typically 16000 Hz or 32000 Hz)
  • Channels: 1 (mono)
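
Because the audio arrives as headerless PCM rather than a WAV or MP3 file, most transcription models need it converted first. A sketch of that conversion in Python, assuming NumPy for in-memory samples and the standard-library wave module for a WAV container:

import io
import wave

import numpy as np

def pcm_to_samples(pcm_bytes: bytes) -> np.ndarray:
    # 16-bit signed samples, one channel, scaled to [-1.0, 1.0]
    # (assumes little-endian byte order, the common linear16 layout)
    return np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int) -> bytes:
    # Wrap the raw PCM in a WAV header for tools that expect a file
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()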

Example request from Attendee to your service

curl -X POST 'http://your-service.com/transcribe' \
  -F 'audio=@audio.pcm' \
  -F 'sample_rate=16000' \
  -F 'language=fr-FR' \
  -F 'custom_param=value'

Expected response format

Your service must return a JSON response with this structure:
{
  "status": "done",
  "result": {
    "transcription": {
      "full_transcript": "The complete transcription text",
      "utterances": [
        {
          "words": [
            {
              "word": "hello",
              "start": 0.0,
              "end": 0.5
            },
            {
              "word": "world",
              "start": 0.6,
              "end": 1.0
            }
          ]
        }
      ]
    }
  }
}

Response fields

  • status: Must be "done" for successful transcription, or "error" for failures
  • result.transcription.full_transcript: The complete transcription text
  • result.transcription.utterances: Array of utterance objects
  • result.transcription.utterances[].words: Array of word objects with timestamps
  • result.transcription.utterances[].words[].word: The word text
  • result.transcription.utterances[].words[].start: Start time in seconds
  • result.transcription.utterances[].words[].end: End time in seconds
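
Putting these fields together, the success payload can be built from whatever word timings your model produces. A short Python sketch, assuming a hypothetical list of (word, start, end) tuples grouped into a single utterance:

def build_success_response(full_transcript, word_timings):
    # word_timings: list of (word, start_seconds, end_seconds) tuples
    return {
        "status": "done",
        "result": {
            "transcription": {
                "full_transcript": full_transcript,
                "utterances": [
                    {
                        "words": [
                            {"word": w, "start": s, "end": e}
                            for (w, s, e) in word_timings
                        ]
                    }
                ],
            }
        },
    }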

Error response format

{
  "status": "error",
  "error_code": "TRANSCRIPTION_FAILED"
}
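
In a handler, the two shapes can be combined by catching failures and falling back to the error format. A sketch, reusing the hypothetical run_transcription helper from above:

def transcribe_or_error(pcm_bytes, sample_rate, language):
    try:
        # Success: return the "done" payload described above
        return run_transcription(pcm_bytes, sample_rate, language)
    except Exception:
        # Failure: return the documented error shape; the error_code string
        # here is just the example value shown above
        return {"status": "error", "error_code": "TRANSCRIPTION_FAILED"}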

Usage example

When creating a bot, specify the custom_async provider in transcription_settings:
{
  "meeting_url": "https://zoom.us/j/123456789",
  "bot_name": "My Bot",
  "transcription_settings": {
    "custom_async": {
      "language": "fr-FR",
      "model": "whisper-large-v3",
      "custom_param": "any_value"
    }
  }
}
All properties inside custom_async will be sent as form data to your service along with the audio file. You can add any custom parameters your service needs.
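
As an illustration, creating such a bot from a Python script might look like the sketch below. It assumes Attendee's standard bot-creation endpoint (POST /api/v1/bots on your Attendee host, authenticated with an API key); adjust the base URL and auth header to match your deployment.

import requests

response = requests.post(
    "https://app.attendee.dev/api/v1/bots",  # replace with your Attendee host
    headers={
        "Authorization": "Token YOUR_ATTENDEE_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "meeting_url": "https://zoom.us/j/123456789",
        "bot_name": "My Bot",
        "transcription_settings": {
            "custom_async": {
                "language": "fr-FR",
                "model": "whisper-large-v3",
                "custom_param": "any_value",
            }
        },
    },
)
print(response.json())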

Notes

  • No credentials are needed in the Attendee dashboard
  • Your service must return its response within the timeout period (CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT, default 120 seconds)
  • Audio is sent as raw PCM format (16-bit linear PCM, mono)
  • The sample rate varies based on the meeting source (typically 16000 Hz or 32000 Hz)
  • Word-level timestamps are supported if your service provides them
  • You have full control over the transcription model, language detection, and processing
