
Overview

Google Vertex AI provides access to Gemini models, embedding models, and Model Garden models through Google Cloud Platform, with enterprise-grade security, quotas, and SLAs.

Quick Start

1. Install LiteLLM

pip install litellm

2. Set Google Cloud Credentials

export VERTEX_PROJECT="your-project-id"
export VERTEX_LOCATION="us-central1"
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

3. Make Your First Call

from litellm import completion

response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Hello Gemini!"}]
)
print(response.choices[0].message.content)

Supported Models

Latest Gemini models with multimodal capabilities:
# Gemini 2.0 Flash (Experimental)
response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Analyze this data..."}]
)

# With thinking mode
response = completion(
    model="vertex_ai/gemini-2.0-flash-thinking-exp-01-21",
    messages=[{"role": "user", "content": "Complex problem..."}]
)

Authentication

LiteLLM authenticates to Vertex AI with standard Google Cloud Application Default Credentials: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key file, or run gcloud auth application-default login for local development.
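
Besides ambient Application Default Credentials, litellm also accepts credentials per request via the vertex_credentials parameter, which takes the service-account key as a JSON string. A minimal sketch, assuming a local key file at a hypothetical path:

```python
import json

def to_vertex_credentials(service_account_info: dict) -> str:
    """Serialize a service-account dict into the JSON string litellm expects."""
    return json.dumps(service_account_info)

if __name__ == "__main__":
    from litellm import completion

    # Hypothetical key path; substitute your own service-account file.
    with open("path/to/credentials.json") as f:
        creds = to_vertex_credentials(json.load(f))

    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Hello!"}],
        vertex_credentials=creds,
        vertex_project="your-project-id",
        vertex_location="us-central1",
    )
    print(response.choices[0].message.content)
```

Passing credentials explicitly is useful when one process needs to call Vertex AI on behalf of several projects or service accounts.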

Available Locations

Vertex AI is available in multiple regions:
Location           Code               Description
US Multi-Region    us-central1        US multi-region (recommended)
Europe             europe-west1       Belgium
Europe             europe-west4       Netherlands
Asia               asia-southeast1    Singapore
Asia               asia-northeast1    Tokyo
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}],
    vertex_location="europe-west1"
)

Multimodal (Vision)

Gemini models support images, videos, and audio:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)

Function Calling

Gemini supports function calling:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
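
To complete the loop, execute the requested function locally and send its result back in a tool role message so the model can produce a final natural-language answer. A sketch with a stubbed get_weather (the real lookup is up to you):

```python
import json

# Tool schema (same shape as above) plus a stubbed local implementation.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub; swap in a real weather lookup."""
    return {"location": location, "temp": 22, "unit": unit}

def run_tool_call(tool_call) -> dict:
    """Dispatch one model-requested tool call to the matching local function."""
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "get_weather":
        return get_weather(**args)
    raise ValueError(f"Unknown tool: {tool_call.function.name}")

if __name__ == "__main__":
    from litellm import completion

    messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
    response = completion(model="vertex_ai/gemini-1.5-pro",
                          messages=messages, tools=tools)
    tool_call = response.choices[0].message.tool_calls[0]

    # Append the assistant turn and the tool result, then ask for a final answer.
    messages.append(response.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(run_tool_call(tool_call)),
    })
    final = completion(model="vertex_ai/gemini-1.5-pro",
                       messages=messages, tools=tools)
    print(final.choices[0].message.content)
```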

Streaming

from litellm import completion

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
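
When you need the full text as well as live output, accumulate the deltas as you print them; a small helper:

```python
def collect_stream(chunks) -> str:
    """Print deltas as they arrive and return the assembled text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from litellm import completion

    stream = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Write a story"}],
        stream=True,
    )
    full_text = collect_stream(stream)
```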

Context Caching

Cache large contexts to reduce costs:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert in... " * 1000,  # Long prompt
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "Question 1"}
    ]
)

# Subsequent requests reuse cached context
response2 = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": [{...}]},  # Same cached content
        {"role": "user", "content": "Question 2"}
    ]
)
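
Cache hits require the cached block to match exactly, so one robust pattern is to build the system block once and reuse the same object across calls; a sketch:

```python
# Build the cached system block once and reuse it verbatim;
# cache hits require the content to be identical between requests.
LONG_PROMPT = "You are an expert in... " * 1000  # placeholder long prompt

cached_system = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": LONG_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

def ask(question: str):
    from litellm import completion
    return completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[cached_system, {"role": "user", "content": question}],
    )

if __name__ == "__main__":
    print(ask("Question 1").choices[0].message.content)
    print(ask("Question 2").choices[0].message.content)  # reuses cached context
```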

JSON Mode

Force JSON output:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": "Extract: John is 30 years old, lives in NYC"
    }],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
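
JSON mode constrains the output format, not its schema, so the model can still omit fields. It's worth validating before use; a minimal sketch with a hypothetical extraction shape (name, age, city):

```python
import json

def parse_person(raw: str) -> dict:
    """Parse model output and verify the fields we rely on are present."""
    data = json.loads(raw)
    missing = {"name", "age", "city"} - data.keys()
    if missing:
        raise ValueError(f"Model output missing fields: {missing}")
    return data

if __name__ == "__main__":
    from litellm import completion

    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{
            "role": "user",
            "content": "Extract name, age, city as JSON: John is 30 years old, lives in NYC"
        }],
        response_format={"type": "json_object"},
    )
    person = parse_person(response.choices[0].message.content)
```
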

Grounding

Ground responses in Google Search or Vertex AI Search:
# Google Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What are the latest AI developments?"}],
    tools=[{"googleSearchRetrieval": {}}]
)

# Vertex AI Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Search our docs"}],
    tools=[{
        "retrieval": {
            "vertexAiSearch": {
                "datastore": "projects/PROJECT/locations/LOCATION/collections/default_collection/dataStores/DATASTORE_ID"
            }
        }
    }]
)

Safety Settings

Configure content safety filters:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Generate content"}],
    safety_settings=[
        {
            "category": "HARM_CATEGORY_HARASSMENT",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        },
        {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        }
    ]
)

Embeddings

Generate embeddings:
from litellm import embedding

# Text embeddings
response = embedding(
    model="vertex_ai/text-embedding-005",
    input="Hello world"
)
print(len(response.data[0].embedding))  # 768 dimensions

# Multimodal embeddings (text + image)
response = embedding(
    model="vertex_ai/multimodalembedding",
    input={
        "text": "A cat",
        "image": {"url": "https://example.com/cat.jpg"}
    }
)

Advanced Parameters

Temperature and Sampling

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Be creative"}],
    temperature=0.9,
    top_p=0.95,
    top_k=40,
    max_tokens=2048
)

System Instructions

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

Stop Sequences

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Count to 10"}],
    stop=["5", "\n\n"]
)

Batch Prediction

Process large batches asynchronously:
from litellm import create_batch, retrieve_batch

batch = create_batch(
    custom_llm_provider="vertex_ai",
    input_file_id="gs://bucket/input.jsonl",
    output_uri_prefix="gs://bucket/output/",
    endpoint="/generateContent"
)

print(f"Batch ID: {batch.id}")

# Check on the batch later
status = retrieve_batch(batch_id=batch.id, custom_llm_provider="vertex_ai")
print(status.status)

Error Handling

from litellm import completion
from litellm.exceptions import (
    AuthenticationError,
    RateLimitError,
    APIError
)

try:
    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except AuthenticationError:
    print("Invalid Google Cloud credentials")
except RateLimitError:
    print("Quota exceeded")
except APIError as e:
    print(f"Vertex AI error: {e}")
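
RateLimitError is often transient, so a retry with exponential backoff is a common pattern. A generic sketch (the retryable exception types are a parameter, so the same helper works for other errors too):

```python
import time

def with_retries(fn, retry_on, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * 2 ** attempt)

if __name__ == "__main__":
    from litellm import completion
    from litellm.exceptions import RateLimitError

    response = with_retries(
        lambda: completion(
            model="vertex_ai/gemini-1.5-pro",
            messages=[{"role": "user", "content": "Hello!"}],
        ),
        retry_on=RateLimitError,
    )
```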

Cost Tracking

from litellm import completion, completion_cost

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}]
)

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

Model Garden

Use models from Vertex AI Model Garden:
response = completion(
    model="vertex_ai_model_garden/meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    vertex_project="your-project",
    vertex_location="us-central1"
)

Best Practices

Use Service Accounts

Use service accounts with minimal required permissions for production.

Enable Caching

Use context caching for large prompts to reduce costs.

Choose Right Model

Use Flash for speed, Pro for quality, Flash-8B for high throughput.

Set Safety Filters

Configure appropriate safety settings for your use case.

Vision

Work with images, videos, and PDFs

Function Calling

Implement tool use with Gemini

Embeddings

Generate embeddings on Vertex AI

Streaming

Stream responses in real-time
