
Overview

Vertex AI provides multiple options for serving open-source models with optimized inference performance. Choose the right serving solution based on your latency, throughput, and cost requirements.

Serving Options

vLLM (high-throughput serving with PagedAttention):
  • Best for: High-volume production workloads
  • Features: Continuous batching, KV cache optimization
  • Throughput: Up to 24x higher than naive Hugging Face Transformers serving
  • Models: Most LLMs (Llama, Gemma, Mistral, etc.)

Other options covered below: Text Generation Inference (TGI) for Hugging Face models, Ollama on Cloud Run for lightweight serving, and custom PyTorch handlers for specialized pre- and postprocessing.

vLLM Deployment

vLLM is the recommended option for high-performance LLM serving.

Basic vLLM Deployment

1. Install Dependencies

pip install --upgrade google-cloud-aiplatform huggingface_hub

2. Initialize Vertex AI

import vertexai
from vertexai import model_garden

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

3. Deploy with vLLM

# Models deployed through Model Garden SDK automatically use vLLM
model = model_garden.OpenModel("meta/llama3_1@llama-3.1-8b-instruct")  # publisher/model@version

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    accept_eula=True
)

4. Test Inference

response = endpoint.predict(
    instances=[{
        "prompt": "Explain machine learning",
        "max_tokens": 200,
        "temperature": 0.7
    }]
)

print(response.predictions[0])

vLLM with Multiple LoRA Adapters

Serve one base model with multiple task-specific adapters:
from huggingface_hub import snapshot_download
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download LoRA adapters
sql_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-sql",
    local_dir="./adapters/sql"
)

code_adapter_path = snapshot_download(
    repo_id="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
    local_dir="./adapters/code"
)

# Upload to GCS
BUCKET_URI = "gs://your-bucket"
!gcloud storage cp -r ./adapters/* {BUCKET_URI}/lora-adapters/
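For later steps it helps to record where each adapter landed in Cloud Storage. The helper below is illustrative bookkeeping (not part of any SDK), mirroring the gcloud storage cp destination above:

```python
# Map each adapter name to its Cloud Storage location after upload
# (helper and bucket name are illustrative)
BUCKET_URI = "gs://your-bucket"

def adapter_gcs_uris(bucket_uri, adapter_names):
    return {name: f"{bucket_uri}/lora-adapters/{name}" for name in adapter_names}

adapters = adapter_gcs_uris(BUCKET_URI, ["sql", "code"])
# → {"sql": "gs://your-bucket/lora-adapters/sql",
#    "code": "gs://your-bucket/lora-adapters/code"}
```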

Using Multiple Adapters

import openai
import google.auth
import google.auth.transport.requests

# Obtain an access token to authenticate against the endpoint
credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())
auth_token = credentials.token

client = openai.OpenAI(
    base_url=f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/{endpoint.resource_name}",
    api_key=auth_token
)

# Use SQL adapter
sql_response = client.chat.completions.create(
    model="sql",  # Specify adapter name
    messages=[{
        "role": "user",
        "content": "Write a SQL query to find top 10 customers by revenue"
    }]
)

# Use code adapter
code_response = client.chat.completions.create(
    model="code",  # Different adapter
    messages=[{
        "role": "user",
        "content": "Write a Python function to merge two sorted arrays"
    }]
)

Text Generation Inference (TGI)

Deploy Hugging Face models with TGI for optimized performance.

TGI Deployment

1. Authenticate with Hugging Face

from huggingface_hub import interpreter_login, get_token

# Login to Hugging Face
interpreter_login()

# Get token
hf_token = get_token()

2. Create Model Registry Entry

from google.cloud import aiplatform

# Upload model with TGI container
model = aiplatform.Model.upload(
    display_name="gemma-tgi",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-hf-tgi-serve:20240220_0936_RC01",
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",
        "HUGGING_FACE_HUB_TOKEN": hf_token,
        "DEPLOY_SOURCE": "notebook"
    },
    serving_container_ports=[7080]
)

3. Deploy to Endpoint

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    traffic_split={"0": 100},
    deploy_request_timeout=1800
)

4. Make Predictions

prediction = endpoint.predict(
    instances=[{
        "inputs": "Explain quantum computing",
        "parameters": {
            "max_new_tokens": 200,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }]
)

print(prediction.predictions[0])

TGI with Multiple LoRA Adapters

# Environment variables for TGI with LoRA
env_vars = {
    "MODEL_ID": "google/gemma-2-9b-it",
    "HUGGING_FACE_HUB_TOKEN": hf_token,
    "NUM_SHARD": "1",
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
    "LORA_ADAPTERS": "sql,code",  # Comma-separated adapter IDs
    "LORA_ADAPTER_sql": "google-cloud-partnership/gemma-2-9b-it-lora-sql",
    "LORA_ADAPTER_code": "google-cloud-partnership/gemma-2-9b-it-lora-magicoder"
}

model = aiplatform.Model.upload(
    display_name="gemma-tgi-multi-lora",
    serving_container_image_uri=TGI_IMAGE_URI,  # the TGI container image shown earlier
    serving_container_environment_variables=env_vars,
    serving_container_ports=[7080]
)
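The LORA_ADAPTERS / LORA_ADAPTER_&lt;name&gt; variables follow a fixed pattern, so they can be generated rather than hand-written. A minimal sketch (build_lora_env is our helper, not part of any SDK; the convention follows the env_vars dict above):

```python
def build_lora_env(base_env, adapters):
    """Build TGI environment variables for multi-LoRA serving.

    adapters maps an adapter name (used as the model name at request
    time) to its Hugging Face repo ID.
    """
    env = dict(base_env)
    env["LORA_ADAPTERS"] = ",".join(adapters)  # comma-separated adapter names
    for name, repo_id in adapters.items():
        env[f"LORA_ADAPTER_{name}"] = repo_id
    return env

env_vars = build_lora_env(
    {"MODEL_ID": "google/gemma-2-9b-it", "NUM_SHARD": "1"},
    {
        "sql": "google-cloud-partnership/gemma-2-9b-it-lora-sql",
        "code": "google-cloud-partnership/gemma-2-9b-it-lora-magicoder",
    },
)
```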

Ollama on Cloud Run

Deploy models with Ollama for lightweight serving:
FROM ollama/ollama:latest

# Listen on all interfaces so Cloud Run can route traffic to the server
ENV OLLAMA_HOST=0.0.0.0

# Copy model definition
COPY Modelfile /Modelfile

# Start a temporary server, pull the base model, and build the custom model
RUN ollama serve & \
    sleep 5 && \
    ollama pull gemma2:2b && \
    ollama create mymodel -f /Modelfile

EXPOSE 11434

CMD ["serve"]
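Once deployed, Cloud Run forwards HTTP traffic to Ollama's REST API. A sketch of building a request to the /api/generate endpoint with only the standard library (the service URL is a placeholder, not a real deployment):

```python
import json
import urllib.request

SERVICE_URL = "https://your-service-abc123-uc.a.run.app"  # placeholder Cloud Run URL

# Ollama's /api/generate endpoint takes the model name and prompt;
# stream=False returns one JSON response instead of a chunk stream
payload = {"model": "mymodel", "prompt": "Explain machine learning", "stream": False}

request = urllib.request.Request(
    f"{SERVICE_URL}/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # uncomment against a live service
```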

Custom PyTorch Handlers

Deploy models with custom preprocessing/postprocessing:
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

class CustomLLMHandler(BaseHandler):
    def initialize(self, context):
        self.manifest = context.manifest
        properties = context.system_properties

        # TorchServe's system_properties does not carry a model ID, so
        # read it from the environment with a sensible default
        model_id = os.environ.get("MODEL_ID", "google/gemma-2b")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        if self.tokenizer.pad_token is None:
            # Causal LMs often ship without a pad token; reuse EOS for batch padding
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )

        self.initialized = True

    def preprocess(self, data):
        """Extract prompts from the request batch and tokenize them"""
        prompts = [item.get("data") or item.get("body") for item in data]

        inputs = self.tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048
        ).to(self.model.device)

        return inputs
    
    def inference(self, inputs):
        """Model inference"""
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.7,
                top_p=0.9,
                do_sample=True
            )
        return outputs
    
    def postprocess(self, outputs):
        """Custom postprocessing"""
        responses = self.tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )
        return responses
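Because preprocess() is plain Python, its request-parsing logic can be sanity-checked locally without TorchServe. TorchServe delivers each request in the batch as a dict keyed by "data" or "body":

```python
# Local check of the request-parsing logic used in preprocess()
def extract_prompts(data):
    return [item.get("data") or item.get("body") for item in data]

batch = [{"data": "Explain machine learning"}, {"body": "Write a haiku"}]
prompts = extract_prompts(batch)
# → ["Explain machine learning", "Write a haiku"]
```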

Performance Optimization

Batching Strategies

vLLM automatically batches requests:
# No configuration needed - vLLM performs continuous batching automatically,
# which is where its throughput advantage over sequential serving comes from
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1
)
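To see why continuous batching helps, a toy simulation (ours, not vLLM's actual scheduler) compares static batching, where each batch occupies the GPU for its longest request, with continuous batching, where a finished request's slot is refilled immediately:

```python
import heapq

def static_batch_time(lengths, batch_size):
    # Each static batch runs for as long as its longest request
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_time(lengths, batch_size):
    # A freed slot is refilled immediately, so short requests
    # don't wait on long ones stuck in the same batch
    slots = [0] * batch_size  # completion time of each slot
    heapq.heapify(slots)
    for length in lengths:
        heapq.heappush(slots, heapq.heappop(slots) + length)
    return max(slots)

# A mix of long (100-step) and short (5-step) generations
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static = static_batch_time(lengths, batch_size=4)        # 200 steps
continuous = continuous_batch_time(lengths, batch_size=4)  # 105 steps
```

With four slots, the static schedule takes 200 steps while the continuous one takes 105 for the same work, which is the intuition behind vLLM's throughput numbers.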

Memory Optimization

# vLLM supports weight quantization; it is configured through container
# environment variables at model upload time, not at deploy time
model = aiplatform.Model.upload(
    display_name="llama-vllm-quantized",
    serving_container_image_uri=VLLM_IMAGE_URI,  # vLLM serving container
    serving_container_environment_variables={
        "QUANTIZATION": "awq",  # or "gptq", "squeezellm"
        "DTYPE": "float16"
    }
)
endpoint = model.deploy(machine_type="g2-standard-12")

Autoscaling Configuration

from google.cloud import aiplatform

# Deploy with autoscaling
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=10,
    # Scale based on CPU utilization...
    autoscaling_target_cpu_utilization=70,
    # ...or on GPU duty cycle (the SDK's name for accelerator utilization)
    autoscaling_target_accelerator_duty_cycle=80
)
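The target-utilization settings follow standard target-tracking logic: the replica count grows in proportion to how far observed utilization exceeds the target, clamped to the configured min/max. An illustrative calculation (not Vertex AI's exact algorithm):

```python
import math

def desired_replicas(current, observed_util, target_util, min_r=1, max_r=10):
    # Target tracking: scale replicas in proportion to observed vs. target
    # utilization, then clamp to the configured bounds
    desired = math.ceil(current * observed_util / target_util)
    return max(min_r, min(max_r, desired))

# 3 replicas at 95% utilization with a 70% target → scale out to 5
desired_replicas(3, observed_util=95, target_util=70)  # → 5
```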

Monitoring and Observability

Cloud Monitoring Integration

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# Query endpoint metrics over the last hour
# (list_time_series requires a time interval)
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'resource.type="aiplatform.googleapis.com/Endpoint"',
        "interval": interval,
    }
)

for result in results:
    print(f"Metric: {result.metric.type}")
    print(f"Value: {result.points[0].value.double_value}")

Logging

import logging
from google.cloud import logging as cloud_logging

# Setup Cloud Logging
client = cloud_logging.Client()
client.setup_logging()

logger = logging.getLogger(__name__)

# Log client-side prediction latency
import time

start = time.perf_counter()
response = endpoint.predict(instances=[{"prompt": "test"}])
latency_ms = (time.perf_counter() - start) * 1000
logger.info(f"Prediction latency: {latency_ms:.0f}ms")
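When tracking latency, percentiles are more informative than averages, because a few slow requests dominate user experience. A small nearest-rank percentile helper (ours, for illustration):

```python
import math

def percentile(samples, pct):
    # Nearest-rank percentile over a list of measured latencies
    ranked = sorted(samples)
    index = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[index]

# One slow outlier barely moves p50 but shows up clearly at p95
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 130, 115, 101]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```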

Best Practices

Choose Right Serving Engine

Use vLLM for high throughput, TGI for Hugging Face models, Ollama on Cloud Run for lightweight workloads, and custom handlers for specialized needs

Enable Autoscaling

Configure min/max replicas to handle traffic spikes efficiently

Optimize GPU Usage

Use tensor parallelism for large models, quantization for memory constraints

Monitor Performance

Track latency, throughput, and GPU utilization metrics

Use LoRA for Multi-Task

Serve multiple specialized models with shared base weights

Test Before Production

Load test endpoints to validate performance under expected traffic
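The load-testing practice above can be prototyped with a thread pool. In this sketch, fake_predict is a stand-in of ours to swap for a real endpoint.predict call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_predict(prompt):
    # Stand-in for endpoint.predict; replace with a real call when load testing
    time.sleep(0.01)
    return f"response to: {prompt}"

def load_test(num_requests, concurrency):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fake_predict,
                                [f"prompt {i}" for i in range(num_requests)]))
    elapsed = time.perf_counter() - start
    return results, num_requests / elapsed  # responses, requests/sec

results, throughput = load_test(num_requests=40, concurrency=8)
```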

Cost Optimization

1. Right-Size Compute

Start with smaller machine types and scale up based on metrics

2. Use Spot VMs

Enable spot VMs for up to 80% cost savings on fault-tolerant workloads

3. Scale to Zero

Use Cloud Run for infrequent workloads that can scale to zero

4. Batch Requests

Send multiple predictions in a single request to reduce overhead
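Request batching can be as simple as slicing the workload before calling the endpoint. batch_instances below is an illustrative helper; each returned batch would go into one endpoint.predict(instances=...) call:

```python
def batch_instances(instances, batch_size):
    # Split a workload into predict-call-sized batches so each network
    # round trip carries several instances instead of one
    return [instances[i:i + batch_size]
            for i in range(0, len(instances), batch_size)]

prompts = [{"prompt": f"question {i}"} for i in range(10)]
batches = batch_instances(prompts, batch_size=4)
# → 3 batches of sizes 4, 4, 2
```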

Next Steps

Model Garden

Explore models available for deployment

Fine-Tuning

Customize models before deployment

Example Notebooks

View serving examples on GitHub

Performance Guide

Learn more about optimization techniques
