Overview
The AssemblyAI Real-Time Transcription Browser Example uses a three-tier architecture that separates concerns between the backend server, the frontend client, and AssemblyAI’s streaming service.
Architecture Components
1. Express Server (Backend)
The Express server (server.js) acts as a security layer and token provider.
The server’s primary responsibility is generating temporary tokens for secure client-side connections to AssemblyAI. It never exposes your API key to the browser.
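A minimal sketch of the token-provider role, not the example's actual server.js. It assumes AssemblyAI's streaming token endpoint lives at the URL shown, accepts an expires_in_seconds query parameter, and returns a JSON body with a token field; the helper names buildTokenRequest and fetchTempToken and the ASSEMBLYAI_API_KEY variable are hypothetical. Check the endpoint against the current AssemblyAI documentation before relying on it.

```javascript
// Assumed endpoint for AssemblyAI's streaming token API; verify against the docs.
const TOKEN_ENDPOINT = "https://streaming.assemblyai.com/v3/token";

// Build the request the server sends to AssemblyAI.
// The API key travels in a server-side header and never reaches the browser.
function buildTokenRequest(apiKey, expiresInSeconds = 60) {
  const url = new URL(TOKEN_ENDPOINT);
  url.searchParams.set("expires_in_seconds", String(expiresInSeconds));
  return { url: url.toString(), options: { headers: { Authorization: apiKey } } };
}

// Fetch a temporary token. `fetchImpl` is injectable so this can be tested offline.
async function fetchTempToken(apiKey, expiresInSeconds = 60, fetchImpl = fetch) {
  const { url, options } = buildTokenRequest(apiKey, expiresInSeconds);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  const { token } = await response.json();
  return token;
}

// Hypothetical Express wiring (requires the `express` package):
// app.get("/token", async (req, res) => {
//   try {
//     res.json({ token: await fetchTempToken(process.env.ASSEMBLYAI_API_KEY) });
//   } catch (err) {
//     res.status(500).json({ error: "Failed to generate token" });
//   }
// });
```

The browser only ever sees the short-lived token returned by the /token route, so a leaked token stops being useful once it expires.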
2. Browser Client (Frontend)
The client (index.js) handles three main responsibilities:
- Audio capture using the Web Audio API and AudioWorklet
- Token retrieval from the Express server
- WebSocket communication with AssemblyAI’s real-time service
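To illustrate the audio-capture step: the Web Audio API delivers samples as floats in the range [-1, 1], while streaming speech services commonly expect 16-bit signed PCM. The sketch below shows one way to do that conversion. It assumes the service accepts 16-bit PCM, and the function name floatTo16BitPCM is hypothetical rather than taken from the example's code.

```javascript
// Convert Web Audio samples (floats in [-1, 1]) to 16-bit signed PCM.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Negative values scale toward -32768, positive values toward 32767.
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

In an AudioWorkletProcessor, a function like this would typically run inside process() before the resulting buffer is posted to the main thread and sent over the WebSocket.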
3. AssemblyAI Streaming Service
The AssemblyAI service receives audio data over a WebSocket connection and returns transcripts in real time as turn-based messages.
Data Flow
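As a rough illustration of the client-to-service leg of this flow, the client needs a WebSocket URL that carries the temporary token. The sketch below assumes the streaming endpoint and the sample_rate and token query parameter names shown; these are assumptions to verify against the AssemblyAI documentation, and buildStreamingUrl is a hypothetical helper.

```javascript
// Assumed WebSocket endpoint and query parameter names; check the AssemblyAI docs.
const STREAMING_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws";

// Build the connection URL from a temporary token and the capture sample rate.
function buildStreamingUrl(token, sampleRate = 16000) {
  const url = new URL(STREAMING_ENDPOINT);
  url.searchParams.set("sample_rate", String(sampleRate));
  url.searchParams.set("token", token); // temporary token, never the API key
  return url.toString();
}

// Browser usage (hypothetical): const ws = new WebSocket(buildStreamingUrl(token));
```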
Turn-Based Transcription
AssemblyAI returns transcripts as “turns”, natural speech segments that follow speaker turns. The client processes them in index.js.
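A minimal sketch of turn handling, assuming each Turn message is JSON with type, turn_order, and transcript fields. The helper names recordTurn and renderTurns are hypothetical. Turns are stored in a plain object keyed by turn_order and joined in ascending order for display, so a late-arriving or updated turn still lands in the right position.

```javascript
// Record a raw WebSocket message into `turns`, keyed by turn_order.
// Returns true if the message was a Turn and the store changed.
function recordTurn(turns, rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.type !== "Turn") return false;
  // A later message for the same turn_order replaces the earlier transcript.
  turns[msg.turn_order] = msg.transcript;
  return true;
}

// Join the stored transcripts in ascending turn_order for display.
function renderTurns(turns) {
  return Object.keys(turns)
    .sort((a, b) => Number(a) - Number(b))
    .map((key) => turns[key])
    .join(" ");
}

// Browser usage (hypothetical element name):
// ws.onmessage = (event) => {
//   if (recordTurn(turns, event.data)) messageEl.innerText = renderTurns(turns);
// };
```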
Turns may arrive out of order due to network conditions or processing delays. The application stores turns in an object and sorts them by turn_order for display.
Security Architecture
The token-based security model ensures:
- API keys remain secret on the server
- Clients receive time-limited access tokens
- Tokens expire automatically (60 seconds in this example)
- Each client session requires a new token
Connection Lifecycle
- Initialization: User clicks “Record” button
- Token Request: Client fetches temporary token from Express server
- WebSocket Connection: Client connects to AssemblyAI using token
- Audio Streaming: AudioWorklet processes and sends audio chunks
- Transcription: AssemblyAI returns Turn messages with transcripts
- Termination: User clicks “Stop”, client sends Terminate message and closes connection
The application maintains state through boolean flags (isRecording) and object references (ws, microphone) to coordinate between components.
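The lifecycle and state handling can be sketched as follows. This is an illustrative outline rather than the example's actual index.js: the deps object and its fetchToken, openSocket, and createMicrophone helpers are hypothetical stand-ins for the browser APIs, and the shape of the Terminate message is an assumption. A real client would also wait for the socket to open before streaming audio.

```javascript
// State mirroring the flag and references described above.
function createSessionState() {
  return { isRecording: false, ws: null, microphone: null };
}

// Start: fetch a token, open the socket, then start the microphone.
// `deps` holds hypothetical helpers so the flow can run outside a browser.
async function startSession(state, deps) {
  const token = await deps.fetchToken();
  state.ws = deps.openSocket(token);
  state.microphone = deps.createMicrophone((chunk) => {
    // Only forward audio while recording and while a socket exists.
    if (state.isRecording && state.ws) state.ws.send(chunk);
  });
  await state.microphone.start();
  state.isRecording = true;
}

// Stop: tell the service the session is over, then release resources.
function stopSession(state) {
  if (state.microphone) state.microphone.stop();
  if (state.ws) {
    state.ws.send(JSON.stringify({ type: "Terminate" })); // assumed message shape
    state.ws.close();
  }
  state.isRecording = false;
  state.ws = null;
  state.microphone = null;
}
```

Resetting the references in stopSession means a later click on “Record” starts from a clean state and requests a fresh token.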