Introduction

Vertex AI provides comprehensive support for deploying and managing open-source models at scale. Whether you’re working with language models, image generation models, or custom architectures, Google Cloud offers the infrastructure and tools to deploy, fine-tune, and serve these models efficiently.

Open Source Model Ecosystem

Vertex AI Model Garden serves as your gateway to a vast ecosystem of open-source models:

Model Garden

Browse and deploy pre-configured open models from Vertex AI Model Garden

Hugging Face Hub

Access over 1 million models from the Hugging Face Hub

Fine-Tuning

Customize models for your specific use cases

Optimized Serving

Deploy models with vLLM, TGI, and other inference engines

Key Capabilities

Model Discovery and Deployment

Vertex AI Model Garden SDK simplifies discovering and deploying open models:
from vertexai import model_garden
import vertexai

# Initialize Vertex AI with your project and region
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)

# List deployable models matching a name filter
models = model_garden.list_deployable_models(
    model_filter="gemma",
    list_hf_models=True  # Include Hugging Face models
)

# Deploy a model (accepting its license terms)
model = model_garden.OpenModel("google/gemma3@gemma-3-1b-it")
endpoint = model.deploy(accept_eula=True)
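Model IDs in the example above follow a `publisher/model@version` pattern. A minimal sketch of splitting such an ID into its parts (a hypothetical helper for illustration, not part of the SDK):

```python
def parse_model_id(model_id: str) -> dict:
    """Split a Model Garden ID like 'google/gemma3@gemma-3-1b-it'
    into publisher, model family, and version components."""
    path, _, version = model_id.partition("@")
    publisher, _, family = path.partition("/")
    return {"publisher": publisher, "family": family, "version": version}

print(parse_model_id("google/gemma3@gemma-3-1b-it"))
# {'publisher': 'google', 'family': 'gemma3', 'version': 'gemma-3-1b-it'}
```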

Supported Model Types

Vertex AI Model Garden supports various model architectures:
  • Gemma: Google’s lightweight, state-of-the-art open models
  • Llama: Meta’s family of large language models
  • DeepSeek: Advanced reasoning and instruction models
  • Qwen: Multilingual language models
  • Mistral: Efficient and powerful language models

Deployment Options

Vertex AI Endpoints

Deploy models to managed endpoints with automatic scaling:
endpoint = model.deploy(
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5
)
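The `min_replica_count`/`max_replica_count` bounds are typically sized from expected traffic. A back-of-the-envelope capacity sketch (illustrative numbers, not Vertex AI benchmarks):

```python
import math

def replicas_needed(peak_qps: float, qps_per_replica: float,
                    min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Estimate how many replicas are needed to serve peak traffic,
    clamped to the endpoint's autoscaling bounds."""
    needed = math.ceil(peak_qps / qps_per_replica)
    return max(min_replicas, min(needed, max_replicas))

# e.g. 12 QPS peak, each replica handling ~4 QPS -> 3 replicas
print(replicas_needed(peak_qps=12, qps_per_replica=4))  # 3
```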

Cloud Run

Deploy lightweight models on Cloud Run for serverless inference:
  • Pay only for actual usage
  • Automatic scaling to zero
  • Integrated with Cloud Load Balancing

GKE (Google Kubernetes Engine)

Deploy models on GKE for advanced orchestration:
  • Full control over infrastructure
  • Custom autoscaling policies
  • Multi-region deployments

Integration with Google Cloud Services

1. BigQuery ML

Use open models directly in BigQuery for SQL-based inference:
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.llama_model`,
  (SELECT "Explain quantum computing" AS prompt),
  STRUCT(TRUE AS flatten_json_output)
)

2. Vertex AI Pipelines

Orchestrate model training, evaluation, and deployment workflows

3. Cloud Storage

Store model artifacts, training data, and inference results

4. Vertex AI Experiments

Track fine-tuning experiments and compare model performance

Model Access and Authentication

Gated Models

Some models require accepting terms or providing authentication:
# Accept End User License Agreement
endpoint = model.deploy(accept_eula=True)

Organization Policies

Control which models can be deployed in your organization:
# Set allowed models policy
gcloud org-policies set-policy policy.yaml
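A policy file for this command might look like the following sketch. The constraint name and value format here are assumptions; confirm the exact Model Garden constraint with your organization administrator before applying it.

```yaml
# Hypothetical policy.yaml -- verify the constraint name and
# allowed-value format for your organization before use.
name: organizations/ORG_ID/policies/vertexai.allowedModels
spec:
  rules:
    - values:
        allowedValues:
          - "publishers/google/models/gemma3"
```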
Contact your organization administrator if you encounter policy constraint violations when deploying models.

Inference APIs

Vertex AI provides multiple APIs for model inference. For example, call a deployed endpoint with endpoint.predict:
prediction = endpoint.predict(
    instances=[{
        "prompt": "Tell me a joke",
        "temperature": 0.7,
        "max_tokens": 50
    }]
)
print(prediction.predictions[0])
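Under the hood, the SDK sends a JSON body to the endpoint's `:predict` route. A sketch of assembling that payload by hand (the exact field names vary by serving container, so treat this schema as an example):

```python
import json

def build_predict_body(prompt: str, temperature: float = 0.7,
                       max_tokens: int = 50) -> str:
    """Serialize a predict request body matching the instances
    shape used in the SDK example above."""
    body = {
        "instances": [{
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }]
    }
    return json.dumps(body)

print(build_predict_body("Tell me a joke"))
```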

Cost Optimization

Compute Options

Standard VMs

Predictable pricing for steady workloads

Spot VMs

Up to 80% cost savings for fault-tolerant workloads

Reserved Resources

Committed use discounts for long-running deployments

Autoscaling

Scale replicas based on traffic patterns
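The trade-offs above are easy to quantify. A rough cost sketch with illustrative prices (not actual Google Cloud rates; check the pricing page for your machine type and region):

```python
def monthly_cost(hourly_rate: float, hours: float = 730,
                 discount: float = 0.0) -> float:
    """Monthly cost of one replica, optionally discounted
    (e.g. discount=0.8 models Spot VMs' up-to-80% savings)."""
    return hourly_rate * hours * (1 - discount)

# Illustrative $1.00/hour machine, ~730 hours per month
on_demand = monthly_cost(hourly_rate=1.00)
spot = monthly_cost(hourly_rate=1.00, discount=0.8)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```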

GPU Selection

Choose the right GPU for your workload:
| GPU Type    | Best For                 | Memory   |
| ----------- | ------------------------ | -------- |
| NVIDIA L4   | Cost-effective inference | 24 GB    |
| NVIDIA T4   | Balanced workloads       | 16 GB    |
| NVIDIA A100 | Training & large models  | 40-80 GB |
| NVIDIA H100 | Highest performance      | 80 GB    |
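A quick way to match a model to a GPU from this table is to estimate its weight footprint: parameters times bytes per parameter, plus headroom for the KV cache and activations. A rule-of-thumb sketch (an approximation, not a guarantee):

```python
def fits_on_gpu(params_billion: float, gpu_memory_gb: float,
                bytes_per_param: int = 2, overhead: float = 1.2) -> bool:
    """Check whether a model's weights (fp16/bf16 = 2 bytes/param),
    padded ~20% for KV cache and activations, fit in GPU memory."""
    needed_gb = params_billion * bytes_per_param * overhead
    return needed_gb <= gpu_memory_gb

# A 7B model in bf16 needs ~16.8 GB: fits a 24 GB L4, not a 16 GB T4
print(fits_on_gpu(7, 24))  # True
print(fits_on_gpu(7, 16))  # False
```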

Next Steps

Explore Model Garden

Browse and deploy models from the catalog

Fine-Tune Models

Customize models for your use cases

Optimize Serving

Learn about inference optimization techniques

View Examples

Explore example notebooks on GitHub
