
Overview

The Realtime API enables low-latency, multimodal conversational experiences using WebSocket connections. It supports text and audio as both input and output, as well as function calling. Key benefits:
  • Native speech-to-speech: Low latency by skipping intermediate text format
  • Natural voices: Models can laugh, whisper, and follow tone directions
  • Simultaneous multimodal output: Get both text and audio in real-time

Connection Setup

The Realtime API is a stateful, event-based API that communicates over WebSocket.

WebSocket Connection

from openai import OpenAI

client = OpenAI()

# Connect to the Realtime API
with client.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # Connection is now established
    # Send and receive events through the connection
    pass

Connection Parameters

model
string
The Realtime model to use. Required for Azure, optional for OpenAI. Examples: gpt-4o-realtime-preview, gpt-4o-realtime-preview-2024-10-01
call_id
string
Optional call identifier for tracking purposes
extra_query
dict
Additional query parameters for the WebSocket connection
extra_headers
dict
Additional headers for the WebSocket connection
websocket_connection_options
dict
WebSocket-specific connection options
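
To show where extra_query ends up, here is a minimal sketch of how query parameters are appended to a WebSocket URL (the endpoint shown is an assumption for illustration; the SDK assembles the real connection URL internally):

```python
from urllib.parse import urlencode

# Illustrative only: the SDK builds the real connection URL itself.
base_url = "wss://api.openai.com/v1/realtime"  # assumed endpoint
extra_query = {"model": "gpt-4o-realtime-preview"}

# extra_query entries are appended to the WebSocket URL as a query string.
url = f"{base_url}?{urlencode(extra_query)}"
```

extra_headers works analogously, attaching additional HTTP headers to the WebSocket upgrade request.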

Session Management

Update Session Configuration

Update session settings at any time during the connection.

with client.realtime.connect() as connection:
    # Update session configuration
    connection.session.update(
        session={
            "instructions": "You are a helpful assistant. Speak clearly and concisely.",
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 500
            },
            "tools": [
                {
                    "type": "function",
                    "name": "get_weather",
                    "description": "Get weather information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"}
                        },
                        "required": ["location"]
                    }
                }
            ]
        }
    )
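
Once a tool like get_weather above is registered, the model can request a call during a response. A sketch of turning the streamed arguments into the payload you would send back (the event and item names follow the Realtime API event reference; the call_id and the weather result are illustrative stubs):

```python
import json

# Suppose a "response.function_call_arguments.done" event delivered the
# model's arguments for the get_weather tool defined above.
arguments_json = '{"location": "Paris"}'  # accumulated argument JSON
call_id = "call_abc123"                   # illustrative call id from the event

args = json.loads(arguments_json)
weather = {"location": args["location"], "forecast": "sunny"}  # stub result

# Item payload to send back via connection.conversation.item.create(item=item)
item = {
    "type": "function_call_output",
    "call_id": call_id,
    "output": json.dumps(weather),
}
```

After creating the item, request the model's follow-up turn with connection.response.create().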

Session Configuration Options

session.instructions
string
System instructions for the model (e.g., “Be succinct”, “Speak quickly”)
session.voice
string
Voice for audio output. Options: alloy, echo, shimmer. Can only be updated before any audio output has been generated.
session.input_audio_format
string
Format for input audio: pcm16, g711_ulaw, or g711_alaw
session.output_audio_format
string
Format for output audio: pcm16, g711_ulaw, or g711_alaw
session.turn_detection
object
Voice Activity Detection (VAD) configuration:
  • type: "server_vad" or null to disable
  • threshold: Detection sensitivity (0-1)
  • prefix_padding_ms: Audio before speech starts
  • silence_duration_ms: Silence duration to end turn
session.tools
array
Function tools available to the model
session.tool_choice
string | object
How the model chooses tools: auto, none, required, or force a specific function
session.temperature
float
Sampling temperature (0-2). Higher = more random.
session.max_output_tokens
int | 'inf'
Maximum tokens per response (1-4096 or "inf")
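
Putting several of these options together, a minimal session payload sketch (values are illustrative) that could be passed to connection.session.update(session=session):

```python
# Sketch: a session payload combining the options above, ready to pass
# to connection.session.update(session=session).
session = {
    "instructions": "You are a helpful assistant.",
    "voice": "alloy",
    "temperature": 0.8,         # 0-2; higher = more random
    "max_output_tokens": 4096,  # 1-4096, or "inf" for no cap
    "tool_choice": "auto",      # let the model decide when to call tools
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.5,            # detection sensitivity, 0-1
        "prefix_padding_ms": 300,    # audio kept before speech starts
        "silence_duration_ms": 500,  # silence that ends the turn
    },
}
```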

Receiving Events

Iterate Through Events

with client.realtime.connect() as connection:
    # Iterate through server events
    for event in connection:
        if event.type == "session.created":
            print(f"Session created: {event.session.id}")
        elif event.type == "response.done":
            print("Response complete")
        elif event.type == "error":
            print(f"Error: {event.error}")

Receive Single Event

with client.realtime.connect() as connection:
    # Wait for next event
    event = connection.recv()
    print(f"Received: {event.type}")
    
    # Or receive raw bytes
    raw_data = connection.recv_bytes()
    event = connection.parse_event(raw_data)

Complete Example

from openai import OpenAI

client = OpenAI()

# Establish connection with model
with client.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # Configure the session
    connection.session.update(
        session={
            "instructions": "You are a helpful assistant. Be concise.",
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "silence_duration_ms": 500
            }
        }
    )
    
    # Send audio input
    import base64

    audio_data = b"..."  # PCM16 audio bytes
    connection.input_audio_buffer.append(
        audio=base64.b64encode(audio_data).decode('utf-8')
    )
    
    # Process events
    for event in connection:
        if event.type == "response.audio.delta":
            # Stream audio output
            audio_chunk = base64.b64decode(event.delta)
            # Play or save audio_chunk
        elif event.type == "response.done":
            print("Response complete")
            break
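
To do something useful with the streamed audio, the decoded response.audio.delta payloads can be buffered and written out, for example as a WAV file with Python's wave module (the 24 kHz mono PCM16 parameters are an assumption; confirm the sample rate for your model):

```python
import base64
import io
import wave

# Collect decoded response.audio.delta payloads as they arrive.
chunks = []

def on_audio_delta(delta_b64: str) -> None:
    chunks.append(base64.b64decode(delta_b64))

# Simulate two deltas of 16-bit silence (240 samples each).
on_audio_delta(base64.b64encode(b"\x00\x00" * 240).decode())
on_audio_delta(base64.b64encode(b"\x00\x00" * 240).decode())

# Write the accumulated PCM16 data as a playable WAV.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(24000)  # assumed 24 kHz sample rate
    wav.writeframes(b"".join(chunks))
```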

Async Usage

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def main() -> None:
    async with client.realtime.connect() as connection:
        await connection.session.update(
            session={"instructions": "Be helpful"}
        )

        async for event in connection:
            print(event.type)

asyncio.run(main())

Notes

  • Installation: Requires the openai[realtime] extra: pip install "openai[realtime]" (quote the brackets so your shell does not expand them)
  • Context manager: Connection is automatically closed when exiting the with block
  • Manual connection: Use the .enter() method if you need to manage the connection lifecycle manually
  • Azure: Model parameter is required for Azure Realtime API
  • Session configuration can be updated at any time, except model; voice can only be changed before any audio output has been generated
