
Overview

Google Vertex AI provides access to Gemini models, embedding models, and Model Garden models through Google Cloud Platform, with enterprise-grade security, quotas, and SLAs.

Quick Start

1. Install LiteLLM

pip install litellm

2. Set Google Cloud Credentials

export VERTEX_PROJECT="your-project-id"
export VERTEX_LOCATION="us-central1"
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

3. Make Your First Call

from litellm import completion

response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Hello Gemini!"}]
)
print(response.choices[0].message.content)

Supported Models

Latest Gemini models with multimodal capabilities:
# Gemini 2.0 Flash (Experimental)
response = completion(
    model="vertex_ai/gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "Analyze this data..."}]
)

# With thinking mode
response = completion(
    model="vertex_ai/gemini-2.0-flash-thinking-exp-01-21",
    messages=[{"role": "user", "content": "Complex problem..."}]
)

Authentication

LiteLLM authenticates to Vertex AI with standard Google Cloud Application Default Credentials: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key file, or run gcloud auth application-default login for local development.
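
Besides ambient Application Default Credentials, litellm also accepts credentials per request via the vertex_credentials parameter, which takes the service-account key as a JSON string. A minimal sketch, assuming a local key file at a hypothetical path:

```python
import json

def to_vertex_credentials(service_account_info: dict) -> str:
    """Serialize a service-account dict into the JSON string litellm expects."""
    return json.dumps(service_account_info)

if __name__ == "__main__":
    from litellm import completion

    # Hypothetical key path; substitute your own service-account file.
    with open("path/to/credentials.json") as f:
        creds = to_vertex_credentials(json.load(f))

    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Hello!"}],
        vertex_credentials=creds,
        vertex_project="your-project-id",
        vertex_location="us-central1",
    )
    print(response.choices[0].message.content)
```

Passing credentials explicitly is useful when one process needs to call Vertex AI on behalf of several projects or service accounts.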

Available Locations

Vertex AI is available in multiple regions:
Location           Code               Description
US Multi-Region    us-central1        US multi-region (recommended)
Europe             europe-west1       Belgium
Europe             europe-west4       Netherlands
Asia               asia-southeast1    Singapore
Asia               asia-northeast1    Tokyo
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}],
    vertex_location="europe-west1"
)

Multimodal (Vision)

Gemini models support images, videos, and audio:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)

Function Calling

Gemini supports function calling:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
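
To complete the loop, execute the requested function locally and send its result back in a tool role message so the model can produce a final natural-language answer. A sketch with a stubbed get_weather (the real lookup is up to you):

```python
import json

# Tool schema (same shape as above) plus a stubbed local implementation.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub; swap in a real weather lookup."""
    return {"location": location, "temp": 22, "unit": unit}

def run_tool_call(tool_call) -> dict:
    """Dispatch one model-requested tool call to the matching local function."""
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "get_weather":
        return get_weather(**args)
    raise ValueError(f"Unknown tool: {tool_call.function.name}")

if __name__ == "__main__":
    from litellm import completion

    messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
    response = completion(model="vertex_ai/gemini-1.5-pro",
                          messages=messages, tools=tools)
    tool_call = response.choices[0].message.tool_calls[0]

    # Append the assistant turn and the tool result, then ask for a final answer.
    messages.append(response.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(run_tool_call(tool_call)),
    })
    final = completion(model="vertex_ai/gemini-1.5-pro",
                       messages=messages, tools=tools)
    print(final.choices[0].message.content)
```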

Streaming

from litellm import completion

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
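
When you need the full text as well as live output, accumulate the deltas as you print them; a small helper:

```python
def collect_stream(chunks) -> str:
    """Print deltas as they arrive and return the assembled text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from litellm import completion

    stream = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Write a story"}],
        stream=True,
    )
    full_text = collect_stream(stream)
```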

Context Caching

Cache large contexts to reduce costs:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert in... " * 1000,  # Long prompt
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "Question 1"}
    ]
)

# Subsequent requests reuse cached context
response2 = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": [{...}]},  # Same cached content
        {"role": "user", "content": "Question 2"}
    ]
)
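
Cache hits require the cached block to match exactly, so one robust pattern is to build the system block once and reuse the same object across calls; a sketch:

```python
# Build the cached system block once and reuse it verbatim;
# cache hits require the content to be identical between requests.
LONG_PROMPT = "You are an expert in... " * 1000  # placeholder long prompt

cached_system = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": LONG_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

def ask(question: str):
    from litellm import completion
    return completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[cached_system, {"role": "user", "content": question}],
    )

if __name__ == "__main__":
    print(ask("Question 1").choices[0].message.content)
    print(ask("Question 2").choices[0].message.content)  # reuses cached context
```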

JSON Mode

Force JSON output:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": "Extract: John is 30 years old, lives in NYC"
    }],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
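
JSON mode constrains the output format, not its schema, so the model can still omit fields. It's worth validating before use; a minimal sketch with a hypothetical extraction shape (name, age, city):

```python
import json

def parse_person(raw: str) -> dict:
    """Parse model output and verify the fields we rely on are present."""
    data = json.loads(raw)
    missing = {"name", "age", "city"} - data.keys()
    if missing:
        raise ValueError(f"Model output missing fields: {missing}")
    return data

if __name__ == "__main__":
    from litellm import completion

    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{
            "role": "user",
            "content": "Extract name, age, city as JSON: John is 30 years old, lives in NYC"
        }],
        response_format={"type": "json_object"},
    )
    person = parse_person(response.choices[0].message.content)
```
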

Grounding

Ground responses in Google Search or Vertex AI Search:
# Google Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What are the latest AI developments?"}],
    tools=[{"googleSearchRetrieval": {}}]
)

# Vertex AI Search grounding
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Search our docs"}],
    tools=[{
        "retrieval": {
            "vertexAiSearch": {
                "datastore": "projects/PROJECT/locations/LOCATION/collections/default_collection/dataStores/DATASTORE_ID"
            }
        }
    }]
)

Safety Settings

Configure content safety filters:
response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Generate content"}],
    safety_settings=[
        {
            "category": "HARM_CATEGORY_HARASSMENT",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        },
        {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "threshold": "BLOCK_MEDIUM_AND_ABOVE"
        }
    ]
)

Embeddings

Generate embeddings:
from litellm import embedding

# Text embeddings
response = embedding(
    model="vertex_ai/text-embedding-005",
    input="Hello world"
)
print(len(response.data[0].embedding))  # 768 dimensions

# Multimodal embeddings (text + image)
response = embedding(
    model="vertex_ai/multimodalembedding",
    input={
        "text": "A cat",
        "image": {"url": "https://example.com/cat.jpg"}
    }
)

Advanced Parameters

Temperature and Sampling

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Be creative"}],
    temperature=0.9,
    top_p=0.95,
    top_k=40,
    max_tokens=2048
)

System Instructions

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

Stop Sequences

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Count to 10"}],
    stop=["5", "\n\n"]
)

Batch Prediction

Process large batches asynchronously:
from litellm import create_batch, retrieve_batch

batch = create_batch(
    custom_llm_provider="vertex_ai",
    input_file_id="gs://bucket/input.jsonl",
    output_uri_prefix="gs://bucket/output/",
    endpoint="/generateContent"
)

print(f"Batch ID: {batch.id}")

# Check on the batch later
status = retrieve_batch(batch_id=batch.id, custom_llm_provider="vertex_ai")
print(status.status)

Error Handling

from litellm import completion
from litellm.exceptions import (
    AuthenticationError,
    RateLimitError,
    APIError
)

try:
    response = completion(
        model="vertex_ai/gemini-1.5-pro",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except AuthenticationError:
    print("Invalid Google Cloud credentials")
except RateLimitError:
    print("Quota exceeded")
except APIError as e:
    print(f"Vertex AI error: {e}")
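
RateLimitError is often transient, so a retry with exponential backoff is a common pattern. A generic sketch (the retryable exception types are a parameter, so the same helper works for other errors too):

```python
import time

def with_retries(fn, retry_on, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * 2 ** attempt)

if __name__ == "__main__":
    from litellm import completion
    from litellm.exceptions import RateLimitError

    response = with_retries(
        lambda: completion(
            model="vertex_ai/gemini-1.5-pro",
            messages=[{"role": "user", "content": "Hello!"}],
        ),
        retry_on=RateLimitError,
    )
```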

Cost Tracking

from litellm import completion, completion_cost

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Hello!"}]
)

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

Model Garden

Use models from Vertex AI Model Garden:
response = completion(
    model="vertex_ai_model_garden/meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    vertex_project="your-project",
    vertex_location="us-central1"
)

Best Practices

Use Service Accounts

Use service accounts with minimal required permissions for production.

Enable Caching

Use context caching for large prompts to reduce costs.

Choose Right Model

Use Flash for speed, Pro for quality, Flash-8B for high throughput.

Set Safety Filters

Configure appropriate safety settings for your use case.

Vision

Work with images, videos, and PDFs

Function Calling

Implement tool use with Gemini

Embeddings

Generate embeddings on Vertex AI

Streaming

Stream responses in real-time
