The Gemini Live API enables low-latency bidirectional voice and video interactions with Gemini. The Live API can process text, audio, and video input in real-time, providing text and audio output for natural conversational experiences.

Overview

The Live API provides WebSocket-based streaming for real-time multimodal conversations with sub-second latency:

Real-Time Audio

Stream audio input and receive natural speech responses with native audio processing

Video Streaming

Send video frames for visual understanding in real-time conversations

Low Latency

Sub-second response times for natural, interactive experiences

Function Calling

Integrate tools and APIs during live conversations

Key Features

  • Bidirectional streaming: Send and receive data simultaneously
  • Native audio processing: 16-bit PCM audio (16kHz input, 24kHz output)
  • Barge-in support: Interrupt model responses naturally
  • Multimodal input: Combine text, audio, and video in the same session
  • Tool integration: Call functions during conversations

Getting Started

1. Install Dependencies

Install the WebSocket library:
pip install --upgrade websockets
2. Set Up Authentication

Configure your Google Cloud project:
import os

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-live-2.5-flash-native-audio"

# Generate access token
import subprocess
result = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True,
    text=True
)
access_token = result.stdout.strip()
3. Establish WebSocket Connection

Connect to the Live API endpoint:
import websockets
import json

api_host = f"{LOCATION}-aiplatform.googleapis.com"
service_url = (
    f"wss://{api_host}/ws/google.cloud.aiplatform.v1.LlmBidiService/"
    f"BidiGenerateContent"
)

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

async with websockets.connect(service_url, additional_headers=headers) as ws:
    print("Connected to Gemini Live API")
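The snippets on this page use top-level await, which works in notebooks (IPython and Colab run cells on an event loop). In a plain Python script, wrap the coroutine in asyncio.run instead; a minimal sketch, with the connection logic replaced by a placeholder:

```python
import asyncio

async def run_session():
    # Placeholder for the connect-and-converse logic shown above;
    # in a real script this is where the WebSocket would be opened.
    return "connected"

# Scripts need an explicit event loop entry point:
result = asyncio.run(run_session())
print(result)
```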

Session Establishment

A Live API session moves through four phases over a single WebSocket connection:

1. Handshake

Establish the WebSocket connection with OAuth 2.0 authentication:
import websockets

api_host = "us-central1-aiplatform.googleapis.com"
service_url = (
    f"wss://{api_host}/ws/google.cloud.aiplatform.v1.LlmBidiService/"
    f"BidiGenerateContent"
)

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

async with websockets.connect(service_url, additional_headers=headers) as ws:
    print("Handshake complete")

2. Setup

Configure the session with model parameters:
setup = {
    "setup": {
        "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Aoede"
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{"text": "You are a helpful assistant."}]
        }
    }
}

# Send setup message
await ws.send(json.dumps(setup))
setup_response = await ws.recv()
print("Setup complete")

3. Session Loop

Run bidirectional send and receive loops concurrently:
import asyncio
import base64
import numpy as np

async def main():
    async with websockets.connect(service_url, additional_headers=headers) as ws:
        # Send setup
        await ws.send(json.dumps(setup))
        await ws.recv()
        
        # Define send loop
        async def send_loop():
            try:
                while True:
                    # Simulate reading audio chunks (20ms PCM16; input audio is 16kHz mono)
                    # In production, read from microphone
                    await asyncio.sleep(0.02)
            except asyncio.CancelledError:
                pass
        
        # Define receive loop
        async def receive_loop():
            try:
                async for message in ws:
                    response = json.loads(message)
                    
                    # Handle audio output
                    if "serverContent" in response:
                        parts = response["serverContent"].get("modelTurn", {}).get("parts", [])
                        for part in parts:
                            if "inlineData" in part:
                                # Audio data is base64-encoded PCM
                                pcm_data = base64.b64decode(part["inlineData"]["data"])
                                # Play audio or buffer for playback
                                
                    # Handle turn completion
                    if response.get("serverContent", {}).get("turnComplete"):
                        print("Turn complete")
                        
                    # Handle interruption (barge-in)
                    if response.get("serverContent", {}).get("interrupted"):
                        print("Model interrupted - stop playback")
                        
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
        
        # Run both loops concurrently; cancel the sender once the receiver exits,
        # since send_loop otherwise runs forever and gather would never return
        send_task = asyncio.create_task(send_loop())
        try:
            await receive_loop()
        finally:
            send_task.cancel()

await main()
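The send loop above elides the actual audio message. Streamed microphone audio goes out as base64-encoded PCM inside a realtime input message; the sketch below assumes the realtime_input/media_chunks field names (newer API versions also accept a dedicated audio field, so check the BidiGenerateContent reference for your version):

```python
import base64
import json

def build_audio_message(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM16 chunk in a realtime input message (assumed schema)."""
    return json.dumps({
        "realtime_input": {
            "media_chunks": [{
                # Input audio is 16kHz mono PCM16 per the audio requirements
                "mime_type": "audio/pcm",
                "data": base64.b64encode(pcm_chunk).decode("ascii"),
            }]
        }
    })

# 20ms of silence at 16kHz mono PCM16: 320 samples x 2 bytes
silence = b"\x00\x00" * 320
message = build_audio_message(silence)
```

Inside send_loop, each captured chunk would be passed through build_audio_message and sent with await ws.send(message).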

4. Termination

Close the WebSocket connection (the async with blocks in the examples above do this automatically; close explicitly when you manage the connection yourself):
await ws.close()
print("Session terminated")

Message Types

Client Messages

Send text messages to the model:
async def send_text(ws, text_input: str):
    msg = {
        "client_content": {
            "turns": [
                {
                    "role": "user",
                    "parts": [{"text": text_input}]
                }
            ],
            "turn_complete": True
        }
    }
    await ws.send(json.dumps(msg))

await send_text(ws, "Hello, Gemini!")

Server Messages

Handle different types of responses from the server:
async def handle_server_message(message: str):
    response = json.loads(message)
    
    # Model's text/audio output
    if "serverContent" in response:
        model_turn = response["serverContent"].get("modelTurn", {})
        
        for part in model_turn.get("parts", []):
            # Text output
            if "text" in part:
                print(f"Model: {part['text']}")
            
            # Audio output (PCM16 at 24kHz)
            if "inlineData" in part:
                audio_data = base64.b64decode(part["inlineData"]["data"])
                # Play audio through speakers
                play_audio(audio_data)
        
        # Check if model finished speaking
        if response["serverContent"].get("turnComplete"):
            print("Model finished turn")
    
    # Model wants to call a function
    if "toolCall" in response:
        for call in response["toolCall"].get("functionCalls", []):
            function_name = call["name"]
            args = call["args"]
            
            # Execute function
            result = execute_function(function_name, args)
            
            # Send result back
            await send_tool_response(ws, function_name, result)
    
    # Model was interrupted (user started speaking)
    if response.get("serverContent", {}).get("interrupted"):
        print("Barge-in detected - stopping playback")
        stop_audio_playback()
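The handler above calls send_tool_response, which is not defined on this page. A minimal sketch of one possible implementation, assuming the tool_response/function_responses message shape (the production protocol also echoes back the id from the incoming functionCall, omitted here for brevity):

```python
import json

async def send_tool_response(ws, function_name: str, result: dict):
    """Send a function result back to the model (assumed message shape)."""
    msg = {
        "tool_response": {
            "function_responses": [{
                "name": function_name,
                "response": {"result": result},
            }]
        }
    }
    await ws.send(json.dumps(msg))
```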

Complete Example: Text to Speech

A simple text-to-speech example:
import asyncio
import websockets
import json
import base64
import numpy as np
from IPython.display import Audio, display

async def text_to_speech(text_input: str):
    async with websockets.connect(service_url, additional_headers=headers) as ws:
        # Setup
        await ws.send(json.dumps(setup))
        await ws.recv()
        
        # Send text
        msg = {
            "client_content": {
                "turns": [{"role": "user", "parts": [{"text": text_input}]}],
                "turn_complete": True
            }
        }
        await ws.send(json.dumps(msg))
        
        # Collect audio response
        audio_data = []
        async for message in ws:
            response = json.loads(message)
            
            # Extract audio
            if "serverContent" in response:
                parts = response["serverContent"].get("modelTurn", {}).get("parts", [])
                for part in parts:
                    if "inlineData" in part:
                        pcm_data = base64.b64decode(part["inlineData"]["data"])
                        audio_data.append(np.frombuffer(pcm_data, dtype=np.int16))
            
            # Check if complete
            if response.get("serverContent", {}).get("turnComplete"):
                break
        
        # Play audio
        if audio_data:
            display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

# Use it
await text_to_speech("Hello! How are you today?")
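Outside a notebook, IPython's Audio widget is unavailable; the raw PCM can instead be wrapped in a WAV container with the standard library. This stand-alone sketch generates a short sine tone in place of model audio:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # Live API audio output rate

def pcm16_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw mono PCM16 bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()

# 100ms 440Hz tone standing in for model output audio
samples = [int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
           for i in range(SAMPLE_RATE // 10)]
pcm = struct.pack(f"<{len(samples)}h", *samples)
wav_bytes = pcm16_to_wav(pcm)
```

The resulting bytes can be written straight to disk (open("out.wav", "wb").write(wav_bytes)) or streamed to any audio player.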

Function Calling in Live Sessions

Integrate tools and APIs during conversations:
# Define tools in setup
setup_with_tools = {
    "setup": {
        "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
        "tools": [
            {
                "function_declarations": [
                    {
                        "name": "get_weather",
                        "description": "Get current weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {
                                    "type": "string",
                                    "description": "City name"
                                }
                            },
                            "required": ["location"]
                        }
                    }
                ]
            }
        ]
    }
}

# Handle function calls in receive loop
async def receive_with_tools():
    async for message in ws:
        response = json.loads(message)
        
        if "toolCall" in response:
            for call in response["toolCall"].get("functionCalls", []):
                if call["name"] == "get_weather":
                    location = call["args"]["location"]
                    weather = fetch_weather(location)  # Your implementation
                    
                    # Send result back
                    await send_tool_response(ws, "get_weather", {
                        "temperature": weather["temp"],
                        "condition": weather["condition"]
                    })
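As more tools are declared, a dispatch table keeps the receive loop flat instead of growing an if/elif chain. A sketch, with fetch_weather stubbed as a stand-in for your real implementation:

```python
def fetch_weather(location: str) -> dict:
    # Stand-in for a real weather API call
    return {"temperature": 21, "condition": "sunny"}

# Map declared function names to local handlers
TOOL_HANDLERS = {
    "get_weather": lambda args: fetch_weather(args["location"]),
}

def execute_tool_call(call: dict) -> dict:
    """Run the handler registered for a functionCall payload."""
    handler = TOOL_HANDLERS.get(call["name"])
    if handler is None:
        return {"error": f"unknown function: {call['name']}"}
    return handler(call["args"])

result = execute_tool_call({"name": "get_weather", "args": {"location": "Paris"}})
```

The receive loop then reduces to calling execute_tool_call for each entry in toolCall.functionCalls and sending the result back.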

Use Cases

Voice Assistants

Build natural voice interfaces for customer support, information retrieval, and task automation

Real-Time Translation

Provide live translation services with audio input and output

Gaming NPCs

Create interactive game characters with natural voice responses

Visual Q&A

Answer questions about live video feeds or camera input

Customer Service

Handle customer inquiries with voice and screen sharing

Education

Interactive tutoring with multimodal explanations

Best Practices

Audio Format Requirements
  • Format: PCM16 (16-bit linear PCM, little-endian)
  • Sample rate: 16kHz for input audio, 24kHz for output audio
  • Channels: Mono (1 channel)
  • Chunk size: ~20ms recommended (320 samples at the 16kHz input rate)
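The chunk sizes follow directly from the PCM parameters: bytes per chunk = sample rate x chunk duration x 2 bytes per sample. A small rate-agnostic helper, so it covers both the input and the 24kHz output stream:

```python
def chunk_bytes(sample_rate: int, chunk_ms: int = 20, sample_width: int = 2) -> int:
    """Bytes in one mono PCM chunk of chunk_ms milliseconds."""
    return sample_rate * chunk_ms // 1000 * sample_width

def split_frames(pcm: bytes, sample_rate: int, chunk_ms: int = 20):
    """Split a PCM buffer into fixed-size chunks for streaming."""
    size = chunk_bytes(sample_rate, chunk_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]

input_chunk = chunk_bytes(16000)   # 640 bytes per 20ms of 16kHz input
output_chunk = chunk_bytes(24000)  # 960 bytes per 20ms of 24kHz output
```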
Handle Barge-In Properly

Always stop audio playback immediately when an interrupted message arrives. Continuing to play after an interruption makes the assistant feel unresponsive.
if response.get("serverContent", {}).get("interrupted"):
    audio_queue.clear()
    stop_playback()
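One way to make playback interruptible is to route audio through a queue that the receive loop can empty on barge-in; a minimal sketch:

```python
from collections import deque

class PlaybackQueue:
    """Buffers PCM chunks between the receive loop and the audio device."""

    def __init__(self):
        self._chunks = deque()

    def push(self, pcm_chunk: bytes):
        self._chunks.append(pcm_chunk)

    def next_chunk(self):
        # The audio output callback pulls from here; None means play silence
        return self._chunks.popleft() if self._chunks else None

    def clear(self):
        # Called on barge-in: drop everything not yet played
        self._chunks.clear()

queue = PlaybackQueue()
queue.push(b"\x00" * 960)
queue.push(b"\x01" * 960)
queue.clear()  # barge-in: pending audio discarded
```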

Performance Tips

  1. Use asyncio: Run send and receive loops concurrently for lowest latency
  2. Buffer management: Keep audio buffers small to minimize delay
  3. Error handling: Implement reconnection logic for network issues
  4. Token expiration: Refresh OAuth tokens before they expire (default 60 minutes)
# Refresh token before expiration
import time

token_expiry = time.time() + 3600  # 1 hour

if time.time() > token_expiry - 300:  # 5 min before expiry
    access_token = get_new_token()
    # Reconnect with new token
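This check is easy to centralize in a small cache that refreshes on demand. A sketch; fetch_token stands in for the gcloud call shown earlier, and the clock is injectable for testing:

```python
import time

class TokenCache:
    """Returns a cached OAuth token, refreshing shortly before expiry."""

    def __init__(self, fetch_token, lifetime_s=3600, margin_s=300, clock=time.time):
        self._fetch = fetch_token     # e.g. the gcloud print-access-token call
        self._lifetime = lifetime_s   # gcloud access tokens last about an hour
        self._margin = margin_s       # refresh 5 minutes early
        self._clock = clock
        self._token = None
        self._expiry = 0.0

    def get(self) -> str:
        if self._token is None or self._clock() > self._expiry - self._margin:
            self._token = self._fetch()
            self._expiry = self._clock() + self._lifetime
        return self._token
```

Before each (re)connect, call cache.get() and rebuild the Authorization header with the returned token.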

Supported Models

The Live API supports specific Gemini models optimized for real-time interaction:
  • gemini-live-2.5-flash-native-audio: Best for voice interactions
  • gemini-2.0-flash-exp: Experimental with multimodal support
See the official documentation for the latest model availability.

Next Steps

WebSocket Demo App

Complete reference implementation with React frontend

Native Audio SDK

Higher-level SDK for audio interactions

Function Calling

Learn more about integrating tools

Pricing

View Live API pricing details
