Overview
Vertex AI provides multiple options for serving open-source models with optimized inference performance. Choose a serving solution based on your latency, throughput, and cost requirements.

Serving Options
- vLLM
- Text Generation Inference (TGI)
- Ollama
- Custom Handlers
vLLM offers high-throughput serving built on PagedAttention:
- Best for: High-volume production workloads
- Features: Continuous batching, KV cache optimization
- Throughput: up to 24x higher than unoptimized Hugging Face Transformers serving (vLLM's reported benchmark)
- Models: Most LLMs (Llama, Gemma, Mistral, etc.)
vLLM Deployment
vLLM is the recommended option for high-performance LLM serving.

Basic vLLM Deployment
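A basic deployment uploads a vLLM serving container to Vertex AI and deploys it to a GPU endpoint. The sketch below only assembles the container arguments and deploy parameters as plain data so their shape is clear; the image URI, model ID, and machine shape are illustrative assumptions, and in practice these values are passed to `aiplatform.Model.upload(...)` and `model.deploy(...)`.

```python
# Sketch: configuration for a vLLM custom-container deployment on Vertex AI.
# URIs, model ID, and machine shape below are assumptions, not fixed values.

def vllm_container_args(model_id: str, tensor_parallel: int = 1,
                        max_model_len: int = 4096,
                        gpu_mem_util: float = 0.9) -> list[str]:
    """Arguments for vLLM's OpenAI-compatible server entrypoint."""
    return [
        "--model", model_id,
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_mem_util),
    ]

args = vllm_container_args("meta-llama/Llama-3.1-8B-Instruct")

# These kwargs would be passed to aiplatform.Model.upload / model.deploy;
# shown here as data only.
deploy_config = {
    "serving_container_image_uri":
        "us-docker.pkg.dev/my-project/serving/vllm:latest",  # assumed image
    "serving_container_args": args,
    "machine_type": "g2-standard-12",
    "accelerator_type": "NVIDIA_L4",
    "accelerator_count": 1,
}
```

Increasing `tensor_parallel` shards the model across multiple GPUs on the same replica, which is how models too large for one accelerator are served.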
vLLM with Multiple LoRA Adapters
Serve one base model with several task-specific adapters instead of deploying a separate endpoint per task.

Using Multiple Adapters
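With vLLM, LoRA support is enabled at launch and each request then picks an adapter by name. The adapter names and paths below are illustrative assumptions; the request body follows vLLM's OpenAI-compatible completions API, where the `model` field selects the adapter.

```python
# Sketch: vLLM multi-LoRA serving. Adapter names/paths are assumptions.

# Server side: vLLM is launched with LoRA enabled, e.g. these extra flags:
lora_flags = [
    "--enable-lora",
    "--lora-modules",
    "sql-adapter=/adapters/sql",
    "chat-adapter=/adapters/chat",
]

def completion_request(adapter: str, prompt: str) -> dict:
    """OpenAI-compatible request; the adapter is chosen via the model field."""
    return {"model": adapter, "prompt": prompt, "max_tokens": 128}

req = completion_request(
    "sql-adapter",
    "Translate to SQL: top five customers by revenue",
)
```

Because all adapters share the base model's weights in GPU memory, serving N adapters costs far less than serving N fully fine-tuned models.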
Text Generation Inference (TGI)
Deploy Hugging Face models with TGI for optimized performance.

TGI Deployment
TGI with Multiple LoRA Adapters
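Recent TGI versions can preload several LoRA adapters and select one per request. The adapter IDs below are assumptions, and the exact flag or environment variable for preloading adapters (e.g. `LORA_ADAPTERS`) should be verified against the TGI docs for your version; the per-request selection uses the `adapter_id` generation parameter.

```python
# Sketch: TGI /generate request selecting a preloaded LoRA adapter.
# Adapter IDs are assumptions; the server must have them preloaded
# (e.g. via a LORA_ADAPTERS setting -- check the TGI docs for your version).

def generate_request(prompt: str, adapter_id: str) -> dict:
    return {
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,   # which LoRA adapter to apply
            "max_new_tokens": 128,
        },
    }

req = generate_request("Write a haiku about autumn.", "chat-adapter")
```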
Ollama on Cloud Run
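Once an Ollama container is running on Cloud Run, clients talk to its REST API. The service URL and model tag below are assumptions; the request body follows Ollama's `/api/generate` endpoint, and the model must already be pulled into the image or an attached volume.

```python
import json

# Sketch: request body for an Ollama service on Cloud Run.
# The service URL and model tag are assumptions.
OLLAMA_URL = "https://ollama-service-xyz.a.run.app/api/generate"  # assumed

payload = {
    "model": "gemma:7b",          # must already be pulled into the container
    "prompt": "Why is the sky blue?",
    "stream": False,              # one JSON response instead of a token stream
}
body = json.dumps(payload)        # POSTed to OLLAMA_URL as application/json
```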
Deploy models with Ollama for lightweight, low-traffic serving.

Custom PyTorch Handlers
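A custom handler wraps the model call with your own preprocessing and postprocessing. The sketch below shows the TorchServe-style preprocess → inference → postprocess flow with a stub model so it is self-contained; in a real deployment you would subclass `ts.torch_handler.base_handler.BaseHandler` and load actual weights.

```python
# Sketch of a TorchServe-style custom handler. EchoModel is a stub standing
# in for a real model so the control flow is runnable end to end.

class EchoModel:
    def __call__(self, tokens):
        return [t.upper() for t in tokens]   # stand-in for model.forward

class CustomHandler:
    def __init__(self, model):
        self.model = model

    def preprocess(self, request: dict) -> list[str]:
        # e.g. input validation, prompt templating, tokenization
        return request["text"].split()

    def inference(self, tokens):
        return self.model(tokens)

    def postprocess(self, outputs) -> dict:
        # e.g. detokenization, filtering, response shaping
        return {"prediction": " ".join(outputs)}

    def handle(self, request: dict) -> dict:
        return self.postprocess(self.inference(self.preprocess(request)))

result = CustomHandler(EchoModel()).handle({"text": "hello vertex"})
# result == {"prediction": "HELLO VERTEX"}
```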
Custom handlers let you control the preprocessing and postprocessing logic around the model call.

Performance Optimization
Batching Strategies
- Continuous batching (vLLM): requests join and leave the running batch at every generation step
- Static batching: fixed-size batches are assembled up front and run to completion
- Dynamic batching: requests are buffered for a short window, then dispatched together
vLLM batches concurrent requests automatically; clients simply send independent requests.
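To see why continuous batching raises throughput, this toy simulation (not vLLM code) compares fixed batches, which wait for their longest sequence, against slot-level admission, where a new request starts the moment any slot frees up:

```python
import heapq

def static_batch_steps(lengths, batch_size):
    """Fixed batches: each batch runs for as long as its longest sequence."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """A new request is admitted the moment any slot finishes."""
    finish, t = [], 0
    for n in lengths:
        if len(finish) == batch_size:
            t = heapq.heappop(finish)    # wait only for the first free slot
        heapq.heappush(finish, t + n)
    return max(finish)

lengths = [100, 10, 10, 10]              # generation lengths in tokens
print(static_batch_steps(lengths, 2))        # -> 110 steps
print(continuous_batch_steps(lengths, 2))    # -> 100 steps
```

With one long sequence in the mix, static batching stalls the short requests behind it, while continuous batching slots them in as soon as capacity frees; the gap grows with more skewed length distributions.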
Memory Optimization
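Much of serving memory goes to the KV cache, so it is worth estimating before choosing `max-model-len` or a quantization scheme. The shape below assumes a Llama-2-7B-like model (32 layers, 32 KV heads, head dim 128, fp16); models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per generated token: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(32, 32, 128)   # 524288 B = 512 KiB
per_seq_4k_gib = per_token * 4096 / 2**30           # 2.0 GiB per 4k-token sequence
```

At roughly 2 GiB of cache per 4k-token sequence, a 24 GiB GPU holds only a handful of concurrent long sequences alongside the weights, which is why paged KV caches, shorter `max-model-len`, and quantization matter for throughput.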
Autoscaling Configuration
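Autoscaling is configured at deploy time. The sketch shows the relevant arguments as plain data; the parameter names follow the Vertex AI Python SDK's `model.deploy(...)`, but verify them against the current SDK reference before use.

```python
# Sketch: autoscaling arguments for model.deploy(...) in the Vertex AI SDK.
# Shown as data only; verify parameter names against the current SDK docs.
autoscaling_kwargs = {
    "min_replica_count": 1,   # keep one warm replica to avoid cold starts
    "max_replica_count": 5,   # cap spend during traffic spikes
    # add replicas when average GPU duty cycle crosses 60%
    "autoscaling_target_accelerator_duty_cycle": 60,
}
```

Setting `min_replica_count` to at least 1 trades idle cost for latency: scaling from zero means loading multi-gigabyte weights onto a GPU before the first request can be served.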
Monitoring and Observability
Cloud Monitoring Integration
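Endpoint metrics can be queried through Cloud Monitoring with a metric filter. The metric and resource label names below are assumptions to verify against the Cloud Monitoring metrics list for Vertex AI.

```python
# Sketch: a Cloud Monitoring filter for online-prediction request counts on
# one endpoint. Metric/label names are assumptions; check the metrics list.
def prediction_metric_filter(endpoint_id: str) -> str:
    return (
        'metric.type = '
        '"aiplatform.googleapis.com/prediction/online/prediction_count" '
        f'AND resource.labels.endpoint_id = "{endpoint_id}"'
    )

f = prediction_metric_filter("1234567890")
```

The resulting string is what you would pass as the `filter` in a `monitoring_v3` time-series query or a dashboard widget.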
Logging
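Container and request logs can be narrowed to a single endpoint with a Cloud Logging filter. The resource type below is an assumption to check against the Cloud Logging monitored-resource list.

```python
# Sketch: a Cloud Logging filter scoped to one Vertex AI endpoint.
# The resource type is an assumption; verify against the Logging docs.
def endpoint_log_filter(endpoint_id: str, min_severity: str = "WARNING") -> str:
    return (
        'resource.type="aiplatform.googleapis.com/Endpoint" '
        f'resource.labels.endpoint_id="{endpoint_id}" '
        f"severity>={min_severity}"
    )

log_filter = endpoint_log_filter("1234567890")
```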
Best Practices
- Choose the right serving engine: vLLM for high throughput, TGI for Hugging Face models, custom handlers for specialized needs
- Enable autoscaling: configure min/max replicas to handle traffic spikes efficiently
- Optimize GPU usage: use tensor parallelism for large models and quantization under memory constraints
- Monitor performance: track latency, throughput, and GPU utilization metrics
- Use LoRA for multi-task serving: serve multiple specialized models with shared base weights
- Test before production: load test endpoints to validate performance under expected traffic
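For the load-testing practice above, the measurement side can be as simple as collecting per-request latencies and summarizing percentiles. The request-sending code is omitted; the latencies below are simulated values standing in for a real run.

```python
import statistics

def summarize(latencies_ms):
    """Latency percentiles and mean from a list of per-request latencies."""
    xs = sorted(latencies_ms)

    def pct(p):
        return xs[min(len(xs) - 1, int(p * len(xs)))]

    return {
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),
        "p99_ms": pct(0.99),
        "mean_ms": statistics.fmean(xs),
    }

# Simulated latencies standing in for a real load-test run.
report = summarize([120, 135, 150, 180, 210, 95, 400, 160, 170, 140])
```

Tail percentiles (p95/p99) matter more than the mean for user-facing endpoints: one slow outlier, like the 400 ms sample here, barely moves the mean but dominates the tail.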
Cost Optimization
Next Steps
- Model Garden: explore models available for deployment
- Fine-Tuning: customize models before deployment
- Example Notebooks: view serving examples on GitHub
- Performance Guide: learn more about optimization techniques