## Overview

vLLM is a high-throughput, memory-efficient inference engine for large language models. It supports:

- **PagedAttention**: efficient memory management for the KV cache
- **Continuous batching**: dynamic request batching for high throughput
- **LoRA adapters**: runtime loading of fine-tuned adapters
- **OpenAI-compatible API**: drop-in replacement for OpenAI clients
## Architecture

Key components:

- **vLLM Server**: serves the base model behind an OpenAI-compatible API
- **LoRA Adapters**: task-specific fine-tuned weights
- **Model Registry**: W&B artifact storage
- **Client**: Python client for adapter management
## Server Configuration

### Command Line Setup

```bash
# Enable runtime LoRA updates
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

# Start server
vllm serve microsoft/Phi-3-mini-4k-instruct \
  --dtype auto \
  --max-model-len 512 \
  --enable-lora \
  --gpu-memory-utilization 0.8 \
  --download-dir ./vllm-storage
```
Parameter breakdown:

| Parameter | Purpose | Value |
|---|---|---|
| `--dtype` | Data type for weights | `auto` (FP16/BF16) |
| `--max-model-len` | Maximum sequence length | 512 tokens |
| `--enable-lora` | Enable LoRA adapter support | Required |
| `--gpu-memory-utilization` | GPU memory fraction | 0.8 (80%) |
| `--download-dir` | Model cache directory | `./vllm-storage` |
Set `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` before starting the server so adapters can be loaded after it is running.
### Storage Requirements

Disk space:

- Base model (Phi-3-mini): ~7 GB
- LoRA adapters: ~50-100 MB each
- Total recommended: 50 GB for multiple adapters
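These figures can be turned into a quick sizing check. The sketch below uses the estimates above; the 2x headroom factor and the helper name are assumptions for illustration, not part of the guide:

```python
# Rough disk-sizing helper based on the storage figures above.
BASE_MODEL_GB = 7.0  # Phi-3-mini weights, ~7 GB
ADAPTER_MB = 100.0   # upper-bound size per LoRA adapter


def required_disk_gb(num_adapters: int, headroom: float = 2.0) -> float:
    """Estimate disk needed: base model + adapters, times a safety headroom."""
    total = BASE_MODEL_GB + num_adapters * ADAPTER_MB / 1024
    return round(total * headroom, 1)


# Even ten adapters add only ~1 GB on top of the base model,
# so the 50 GB PVC used later leaves plenty of room to grow.
print(required_disk_gb(10))
```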
## Client Implementation

The client (`serving-llm/client.py`) provides commands for adapter management:

```python
import json
from pathlib import Path

import requests
import wandb
from openai import OpenAI

DEFAULT_BASE_URL = "http://localhost:8000/v1"


def load_from_registry(model_name: str, model_path: Path):
    with wandb.init() as run:
        artifact = run.use_artifact(model_name, type="model")
        artifact_dir = artifact.download(root=model_path)
        print(f"{artifact_dir}")


def load_adapter(lora_name: str, lora_path: str, url: str = DEFAULT_BASE_URL):
    url = f"{url}/load_lora_adapter"
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    response = requests.post(url, json=payload)
    print(response)


def list_of_models(url: str = DEFAULT_BASE_URL):
    url = f"{url}/models"
    response = requests.get(url)
    models = response.json()
    print(json.dumps(models, indent=4))


def test_client(
    model: str,
    context: str = EXAMPLE_CONTEXT,  # defined under "Example context" below
    query: str = EXAMPLE_QUERY,
    url: str = DEFAULT_BASE_URL,
):
    client = OpenAI(base_url=url, api_key="any-api-key")
    messages = [{"content": f"{context}\nInput: {query}", "role": "user"}]
    completion = client.chat.completions.create(model=model, messages=messages)
    print(completion.choices[0].message.content)
```
### Client Commands

**List models:**

```bash
python serving-llm/client.py list-of-models
```

Output:

```json
{
    "object": "list",
    "data": [
        {
            "id": "microsoft/Phi-3-mini-4k-instruct",
            "object": "model",
            "owned_by": "vllm"
        },
        {
            "id": "sql-default-model",
            "object": "model",
            "owned_by": "vllm"
        }
    ]
}
```

**Load from registry:**

```bash
python serving-llm/client.py load-from-registry \
  truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
  sql-default-model
```

Downloads the LoRA adapter from W&B to `./sql-default-model/`.

**Load adapter:**

```bash
python serving-llm/client.py load-adapter \
  sql-default-model \
  ./sql-default-model
```

Registers the adapter with the vLLM server at runtime.

**Test inference:**

```bash
# Base model
python serving-llm/client.py test-client \
  microsoft/Phi-3-mini-4k-instruct

# LoRA adapter
python serving-llm/client.py test-client \
  sql-default-model
```
## Adapter Management

### Loading Workflow

1. **Download from registry**

   ```bash
   python serving-llm/client.py load-from-registry \
     truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
     sql-default-model
   ```

   Downloads adapter weights to a local directory.

2. **Register with vLLM**

   ```bash
   python serving-llm/client.py load-adapter \
     sql-default-model \
     ./sql-default-model
   ```

   Sends a POST request to `/v1/load_lora_adapter`.

3. **Verify loading**

   ```bash
   python serving-llm/client.py list-of-models
   ```

   Check that the adapter appears in the model list.

4. **Test inference**

   ```bash
   python serving-llm/client.py test-client sql-default-model
   ```

   Runs a sample query with the adapter.
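The register-and-verify steps can also be scripted directly against the HTTP API. Below is a minimal sketch using only the standard library; the endpoint paths mirror the client above, but the server URL, function names, and adapter names are assumptions for illustration:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # assumes the vLLM server from this guide


def load_payload(lora_name: str, lora_path: str) -> dict:
    """Request body for POST /v1/load_lora_adapter."""
    return {"lora_name": lora_name, "lora_path": lora_path}


def adapter_is_listed(models_body: dict, lora_name: str) -> bool:
    """True if a GET /v1/models response body lists the adapter id."""
    return any(m.get("id") == lora_name for m in models_body.get("data", []))


def post_json(url: str, payload: dict) -> int:
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status


def register_and_verify(lora_name: str, lora_path: str) -> bool:
    post_json(f"{BASE_URL}/load_lora_adapter", load_payload(lora_name, lora_path))
    with request.urlopen(f"{BASE_URL}/models") as resp:
        models = json.load(resp)
    return adapter_is_listed(models, lora_name)
```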
### Unloading Adapters

```python
import json

import requests

DEFAULT_BASE_URL = "http://localhost:8000/v1"


def unload_adapter(lora_name: str, url: str = DEFAULT_BASE_URL):
    url = f"{url}/unload_lora_adapter"
    payload = {"lora_name": lora_name}
    response = requests.post(url, json=payload)
    result = response.json()
    print(json.dumps(result, indent=4))
```

Usage:

```bash
python serving-llm/client.py unload-adapter sql-default-model
```
## OpenAI Client Integration

### Chat Completions

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-api-key"  # Not validated by vLLM
)

# Use base model
response = client.chat.completions.create(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Translate to SQL: Show all users"}]
)
print(response.choices[0].message.content)
```
### Using LoRA Adapters

```python
# Use a loaded adapter; database_schema and natural_language_query are
# placeholders for your own context and question
response = client.chat.completions.create(
    model="sql-default-model",  # Adapter name
    messages=[
        {
            "role": "user",
            "content": f"{database_schema}\nInput: {natural_language_query}"
        }
    ]
)
print(response.choices[0].message.content)
```
Example context:

```python
EXAMPLE_CONTEXT = """
CREATE TABLE salesperson (
    salesperson_id INT,
    name TEXT,
    region TEXT
);
CREATE TABLE timber_sales (
    sales_id INT,
    salesperson_id INT,
    volume REAL,
    sale_date DATE
);
"""

EXAMPLE_QUERY = "What is the total volume of timber sold by each salesperson?"
```
## Kubernetes Deployment

### Manifest Structure

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-storage-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-vllm
  template:
    metadata:
      labels:
        app: app-vllm
    spec:
      containers:
        # Main vLLM server
        - name: app-vllm
          image: vllm/vllm-openai:latest
          env:
            - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
              value: "True"
          command: ["vllm"]
          args:
            - "serve"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--dtype"
            - "auto"
            - "--max-model-len"
            - "512"
            - "--enable-lora"
            - "--gpu-memory-utilization"
            - "0.8"
            - "--download-dir"
            - "/vllm-storage"
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: vllm-storage
              mountPath: /vllm-storage
        # Sidecar container that downloads and registers the adapter
        - name: model-loader
          image: ghcr.io/kyryl-opens-ml/app-fastapi:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "Waiting for vLLM server..."
              while ! curl -s http://localhost:8000/health >/dev/null; do
                sleep 5
              done
              echo "Loading adapter from registry..."
              python serving-llm/client.py load-from-registry \
                truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
                sql-default-model
              python serving-llm/client.py load-adapter \
                sql-default-model \
                ./sql-default-model
              echo "Adapter loaded successfully"
          volumeMounts:
            - name: vllm-storage
              mountPath: /vllm-storage
      volumes:
        - name: vllm-storage
          persistentVolumeClaim:
            claimName: vllm-storage-pvc
```
Architecture:

- **PVC**: persistent storage for models and adapters
- **app-vllm**: main container running the vLLM server
- **model-loader**: sidecar that downloads and registers adapters
### Deployment Steps

1. **Create a GPU cluster**

   ```bash
   # For Minikube with GPU support
   curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
   sudo dpkg -i minikube_latest_amd64.deb
   minikube start --driver docker --container-runtime docker --gpus all
   ```

2. **Create secrets**

   ```bash
   export WANDB_API_KEY='your-key'
   kubectl create secret generic wandb \
     --from-literal=WANDB_API_KEY=$WANDB_API_KEY
   ```

3. **Deploy vLLM**

   ```bash
   kubectl create -f k8s/vllm-inference.yaml
   ```

4. **Monitor startup**

   ```bash
   # Check main container
   kubectl logs -l app=app-vllm -c app-vllm -f

   # Check model loader
   kubectl logs -l app=app-vllm -c model-loader -f
   ```

5. **Port forward**

   ```bash
   kubectl port-forward --address 0.0.0.0 svc/app-vllm 8000:8000
   ```
### Testing the Deployment

```bash
# List models
python serving-llm/client.py list-of-models

# Test base model
python serving-llm/client.py test-client microsoft/Phi-3-mini-4k-instruct

# Test adapter
python serving-llm/client.py test-client sql-default-model
```
## Model Loader Sidecar

### Health Check Loop

```bash
while ! curl -s http://localhost:8000/health >/dev/null; do
  echo "vLLM server not ready. Retrying in 5 seconds..."
  sleep 5
done
```

Waits for the vLLM server to start before loading adapters.
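The same wait can be expressed in Python with an upper bound on retries, which the shell loop lacks; a crashed server then cannot hang the loader forever. This is a sketch, not part of the project; the `probe` callable is an assumption (e.g. an HTTP GET against `/health` that returns `True` on success):

```python
import time


def wait_for_server(probe, retries: int = 60, delay: float = 5.0) -> bool:
    """Poll probe() until it returns True, giving up after `retries` attempts."""
    for _ in range(retries):
        if probe():
            return True
        time.sleep(delay)
    return False
```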
### Adapter Loading

```bash
# Download from W&B
python serving-llm/client.py load-from-registry \
  truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
  sql-default-model
if [ $? -ne 0 ]; then
  echo "Failed to load model from registry."
  exit 1
fi

# Register with vLLM
python serving-llm/client.py load-adapter \
  sql-default-model \
  ./sql-default-model
if [ $? -ne 0 ]; then
  echo "Failed to load adapter."
  exit 1
fi
```
## PagedAttention

vLLM uses paged memory management for the KV cache:

```
Traditional:
[Request 1 KV: ████████░░░░] (wasted memory)
[Request 2 KV: ████░░░░░░░░] (wasted memory)

PagedAttention:
[Page 1][Page 2][Page 3] (shared pool)
```

Benefits:

- 2-4x higher throughput
- Near-zero waste in KV cache memory
- Dynamic memory allocation
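As a toy illustration of why paging wastes less memory, compare per-request preallocation with block-by-block allocation. The 16-token page size matches vLLM's default block size; the helper names are made up for this sketch:

```python
BLOCK_SIZE = 16  # tokens per KV-cache page (vLLM's default block size)


def paged_blocks(seq_len: int) -> int:
    """Pages needed when KV memory is allocated block by block."""
    return -(-seq_len // BLOCK_SIZE)  # ceiling division


def paged_waste(seq_len: int) -> int:
    """Unused token slots in the last, partially filled page."""
    return paged_blocks(seq_len) * BLOCK_SIZE - seq_len


def contiguous_waste(seq_len: int, max_model_len: int = 512) -> int:
    """Waste when each request preallocates max_model_len slots up front."""
    return max_model_len - seq_len


# A 100-token request wastes 412 slots with preallocation,
# but at most BLOCK_SIZE - 1 slots with paging:
print(contiguous_waste(100), paged_waste(100))  # 412 12
```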
## Continuous Batching

```
Static batching:
[Req1 Req2 Req3] -> Wait for all -> [Req4 Req5 Req6]

Continuous batching:
[Req1 Req2] -> [Req2 Req3] -> [Req3 Req4] (rolling)
```

Advantages:

- Higher GPU utilization
- Lower latency for short requests
- Better throughput for mixed workloads
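The latency advantage for short requests can be seen in a toy scheduler. This is not vLLM's actual token-level scheduler, just a sketch: static batching makes every request in a batch wait for the longest one, while continuous admission lets a request start as soon as a slot frees:

```python
import heapq


def static_batch_finish_times(lengths, batch_size):
    """Each batch runs until its longest request finishes; the next batch waits."""
    t, finish = 0, []
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        t += max(batch)
        finish.extend([t] * len(batch))
    return finish


def continuous_finish_times(lengths, slots):
    """A request is admitted as soon as a slot frees (one token per step per slot)."""
    free_at = [0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    finish = []
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
        finish.append(start + n)
    return finish


# One long request followed by three short ones, two slots:
lengths = [100, 10, 10, 10]
print(static_batch_finish_times(lengths, 2))  # [100, 100, 110, 110]
print(continuous_finish_times(lengths, 2))    # [100, 10, 20, 30]
```

The short requests finish in 10-30 steps instead of 100-110: they no longer stall behind the long one.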
## Memory Configuration

```bash
# Aggressive memory usage (higher throughput)
vllm serve model --gpu-memory-utilization 0.95

# Conservative (more stability)
vllm serve model --gpu-memory-utilization 0.7

# With tensor parallelism for large models
vllm serve model --tensor-parallel-size 2
```
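A rough sketch of what `--gpu-memory-utilization` leaves for the KV-cache pool: vLLM claims roughly that fraction of GPU memory in total, and whatever the weights do not occupy goes to the cache. The GPU capacity and weight size below are illustrative assumptions, and activation overhead is ignored:

```python
def kv_cache_budget_gb(gpu_gb: float, util: float, weights_gb: float) -> float:
    """Approximate memory left for the KV cache after loading the weights."""
    return round(gpu_gb * util - weights_gb, 1)


# 24 GB card, Phi-3-mini weights roughly 7.6 GB in FP16 (assumed figure):
print(kv_cache_budget_gb(24, 0.8, 7.6))   # 11.6 GB for KV cache
print(kv_cache_budget_gb(24, 0.95, 7.6))  # 15.2 GB, but less headroom for spikes
```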
## Multi-Adapter Serving

### Loading Multiple Adapters

```bash
# Load SQL adapter
python serving-llm/client.py load-from-registry \
  org/project/sql-adapter:v1 sql-adapter
python serving-llm/client.py load-adapter sql-adapter ./sql-adapter

# Load summarization adapter
python serving-llm/client.py load-from-registry \
  org/project/summarization-adapter:v1 summarization
python serving-llm/client.py load-adapter summarization ./summarization

# List all available models
python serving-llm/client.py list-of-models
```
### Routing Requests

```python
from openai import OpenAI


def route_request(task: str, query: str) -> str:
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")
    # Select adapter based on task; fall back to the base model
    model_map = {
        "sql": "sql-adapter",
        "summarize": "summarization",
        "general": "microsoft/Phi-3-mini-4k-instruct",
    }
    model = model_map.get(task, "microsoft/Phi-3-mini-4k-instruct")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content
```
## Monitoring and Debugging

### Server Logs

```bash
# vLLM server logs
kubectl logs <pod-name> -c app-vllm

# Adapter loader logs
kubectl logs <pod-name> -c model-loader
```

### Metrics Endpoint

```bash
curl http://localhost:8000/metrics
```
Key metrics:

- `vllm:num_requests_running`: active requests
- `vllm:num_requests_waiting`: queued requests
- `vllm:gpu_cache_usage_perc`: GPU KV-cache usage
- `vllm:time_to_first_token_seconds`: TTFT latency
- `vllm:time_per_output_token_seconds`: generation speed
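The endpoint serves the plain-text Prometheus exposition format. A small parser sketch for pulling these values into Python (label handling is simplified to "labels become part of the key"; the function name is an assumption):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple `name value` samples from Prometheus text output.

    Comment lines (# HELP / # TYPE) are skipped; labeled series keep their
    label string as part of the key.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip malformed samples
    return metrics


sample = """\
# HELP vllm:num_requests_running Number of running requests.
vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0
"""
print(parse_metrics(sample)["vllm:num_requests_running"])  # 3.0
```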
## Troubleshooting

**Problem: Model not appearing in the list after loading**

Solutions:

```bash
# Check server logs
kubectl logs <pod> -c app-vllm | grep -i lora

# Verify adapter files exist
kubectl exec <pod> -c model-loader -- ls -la ./sql-default-model

# Retry loading
python serving-llm/client.py load-adapter sql-default-model ./sql-default-model
```
**Problem: Out of memory during inference**

Solutions:

- Reduce `--gpu-memory-utilization` to 0.7
- Decrease `--max-model-len` to 256
- Use quantization: `--quantization awq`
- Enable tensor parallelism on multi-GPU machines
**Problem: High latency for requests**

Solutions:

- Check GPU utilization: `nvidia-smi`
- Increase `--max-num-seqs` for more batching
- Reduce `--max-model-len` if the full context is not needed
- Use speculative decoding: `--speculative-model`
## Best Practices

- **Adapter management**: version adapters in the registry and update them via CI/CD
- **Memory tuning**: start with 0.8 GPU utilization and adjust based on OOM errors
- **Monitoring**: track TTFT and tokens/second for performance
- **Batching**: rely on continuous batching to keep latency low under mixed workloads
## Next Steps

- **Practice tasks**: complete the Module 5 practice assignments

## Resources