Overview

vLLM is a high-throughput, memory-efficient inference engine for large language models. It supports:
  • PagedAttention: Efficient memory management for KV cache
  • Continuous batching: Dynamic request batching for high throughput
  • LoRA adapters: Runtime loading of fine-tuned adapters
  • OpenAI-compatible API: Drop-in replacement for OpenAI clients

Architecture

Key components:
  • vLLM Server: Serves base model with OpenAI API
  • LoRA Adapters: Task-specific fine-tuned weights
  • Model Registry: W&B artifact storage
  • Client: Python client for adapter management

Server Configuration

Command Line Setup

# Enable runtime LoRA updates
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

# Start server
vllm serve microsoft/Phi-3-mini-4k-instruct \
  --dtype auto \
  --max-model-len 512 \
  --enable-lora \
  --gpu-memory-utilization 0.8 \
  --download-dir ./vllm-storage
Parameter breakdown:

| Parameter | Purpose | Value |
| --- | --- | --- |
| --dtype | Data type for weights | auto (FP16/BF16) |
| --max-model-len | Maximum sequence length | 512 tokens |
| --enable-lora | Enable LoRA adapter support | Required |
| --gpu-memory-utilization | GPU memory fraction | 0.8 (80%) |
| --download-dir | Model cache directory | ./vllm-storage |
Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=True to load adapters after server start.

Storage Requirements

mkdir -p vllm-storage
Disk space:
  • Base model (Phi-3-mini): ~7 GB
  • LoRA adapters: ~50-100 MB each
  • Total recommended: 50 GB for multiple adapters
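The 50 GB recommendation leaves generous headroom; a quick back-of-envelope check (a sketch using the approximate sizes listed above) shows why:

```python
# Rough disk budget for the vllm-storage directory, using the
# approximate sizes listed above (illustrative, not measured).
base_model_gb = 7.0  # Phi-3-mini weights
adapter_mb = 100     # upper end of the 50-100 MB per-adapter range
n_adapters = 20

total_gb = base_model_gb + n_adapters * adapter_mb / 1024
print(f"{total_gb:.1f} GB")  # 9.0 GB for the base model plus 20 adapters
```

Even dozens of adapters add only a few gigabytes; most of the 50 GB budget is headroom for additional base models and download-cache files.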

Client Implementation

CLI Tool

The client (serving-llm/client.py) provides commands for adapter management:
serving-llm/client.py
import json
import requests
import wandb
from openai import OpenAI
from pathlib import Path

DEFAULT_BASE_URL = "http://localhost:8000/v1"

def load_from_registry(model_name: str, model_path: Path):
    with wandb.init() as run:
        artifact = run.use_artifact(model_name, type="model")
        artifact_dir = artifact.download(root=model_path)
        print(f"{artifact_dir}")

def load_adapter(lora_name: str, lora_path: str, url: str = DEFAULT_BASE_URL):
    url = f"{url}/load_lora_adapter"
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    response = requests.post(url, json=payload)
    print(response)

def list_of_models(url: str = DEFAULT_BASE_URL):
    url = f"{url}/models"
    response = requests.get(url)
    models = response.json()
    print(json.dumps(models, indent=4))

def test_client(
    model: str,
    context: str = EXAMPLE_CONTEXT,  # defined near the bottom of the file
    query: str = EXAMPLE_QUERY,
    url: str = DEFAULT_BASE_URL,
):
    client = OpenAI(base_url=url, api_key="any-api-key")
    messages = [{"content": f"{context}\n Input: {query}", "role": "user"}]
    completion = client.chat.completions.create(model=model, messages=messages)
    print(completion.choices[0].message.content)

Client Commands

python serving-llm/client.py list-of-models
Output:
{
  "object": "list",
  "data": [
    {
      "id": "microsoft/Phi-3-mini-4k-instruct",
      "object": "model",
      "owned_by": "vllm"
    },
    {
      "id": "sql-default-model",
      "object": "model",
      "owned_by": "vllm"
    }
  ]
}
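In scripts it is handy to inspect the `/v1/models` response programmatically rather than by eye; a minimal sketch (the `adapter_loaded` helper is hypothetical, operating on the payload shape shown above):

```python
def adapter_loaded(models_response: dict, adapter_name: str) -> bool:
    """Return True if adapter_name appears among the served model IDs."""
    return any(m["id"] == adapter_name for m in models_response.get("data", []))

# Sample payload mirroring the list-of-models output above:
models_response = {
    "object": "list",
    "data": [
        {"id": "microsoft/Phi-3-mini-4k-instruct", "object": "model", "owned_by": "vllm"},
        {"id": "sql-default-model", "object": "model", "owned_by": "vllm"},
    ],
}

print(adapter_loaded(models_response, "sql-default-model"))  # True
```

The same check works against a live server by feeding it `requests.get(f"{url}/models").json()`.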

Adapter Management

Loading Workflow

1. Download from registry

python serving-llm/client.py load-from-registry \
  truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
  sql-default-model

Downloads adapter weights to a local directory.

2. Register with vLLM

python serving-llm/client.py load-adapter \
  sql-default-model \
  ./sql-default-model

Sends a POST request to /v1/load_lora_adapter.

3. Verify loading

python serving-llm/client.py list-of-models

Check that the adapter appears in the model list.

4. Test inference

python serving-llm/client.py test-client sql-default-model

Runs a sample query with the adapter.

Unloading Adapters

def unload_adapter(lora_name: str, url: str = DEFAULT_BASE_URL):
    url = f"{url}/unload_lora_adapter"
    payload = {"lora_name": lora_name}
    response = requests.post(url, json=payload)
    result = response.json()
    print(json.dumps(result, indent=4))
Usage:
python serving-llm/client.py unload-adapter sql-default-model

OpenAI Client Integration

Chat Completions

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-api-key"  # Not validated by vLLM
)

# Use base model
response = client.chat.completions.create(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Translate to SQL: Show all users"}]
)

print(response.choices[0].message.content)

Using LoRA Adapters

# Use loaded adapter (database_schema and natural_language_query are
# placeholders for your own schema and question)
response = client.chat.completions.create(
    model="sql-default-model",  # Adapter name
    messages=[
        {
            "role": "user",
            "content": f"{database_schema}\n Input: {natural_language_query}"
        }
    ]
)

print(response.choices[0].message.content)
Example context:
EXAMPLE_CONTEXT = """
CREATE TABLE salesperson (
    salesperson_id INT,
    name TEXT,
    region TEXT
);

CREATE TABLE timber_sales (
    sales_id INT,
    salesperson_id INT,
    volume REAL,
    sale_date DATE
);
"""

EXAMPLE_QUERY = "What is the total volume of timber sold by each salesperson?"

Kubernetes Deployment

Manifest Structure

k8s/vllm-inference.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-storage-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-vllm
  template:
    metadata:
      labels:
        app: app-vllm
    spec:
      containers:
      containers:
        # Main vLLM server
        - name: app-vllm
          image: vllm/vllm-openai:latest
          env:
            - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
              value: "True"
          command: ["vllm"]
          args:
            - "serve"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--dtype"
            - "auto"
            - "--max-model-len"
            - "512"
            - "--enable-lora"
            - "--gpu-memory-utilization"
            - "0.8"
            - "--download-dir"
            - "/vllm-storage"
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: vllm-storage
              mountPath: /vllm-storage
        
        # Sidecar container for adapter loading
        - name: model-loader
          image: ghcr.io/kyryl-opens-ml/app-fastapi:latest
          envFrom:
            - secretRef:
                name: wandb  # provides WANDB_API_KEY for registry downloads
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "Waiting for vLLM server..."
              while ! curl -s http://localhost:8000/health >/dev/null; do
                sleep 5
              done
              
              echo "Loading adapter from registry..."
              python serving-llm/client.py load-from-registry \
                truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
                sql-default-model
              
              python serving-llm/client.py load-adapter \
                sql-default-model \
                ./sql-default-model
              
              echo "Adapter loaded successfully"
              # Keep the sidecar alive so the Pod is not restarted on exit
              sleep infinity
          volumeMounts:
            - name: vllm-storage
              mountPath: /vllm-storage
      
      volumes:
        - name: vllm-storage
          persistentVolumeClaim:
            claimName: vllm-storage-pvc
Architecture:
  • PVC: Persistent storage for models and adapters
  • app-vllm: Main container running vLLM server
  • model-loader: Sidecar that downloads and registers adapters

Deployment Steps

1. Create GPU cluster

# For Minikube with GPU support
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
sudo dpkg -i minikube_latest_amd64.deb
minikube start --driver docker --container-runtime docker --gpus all

2. Create secrets

export WANDB_API_KEY='your-key'
kubectl create secret generic wandb \
  --from-literal=WANDB_API_KEY=$WANDB_API_KEY

3. Deploy vLLM

kubectl create -f k8s/vllm-inference.yaml

4. Monitor startup

# Check main container
kubectl logs -l app=app-vllm -c app-vllm -f

# Check model loader
kubectl logs -l app=app-vllm -c model-loader -f

5. Port forward

kubectl port-forward --address 0.0.0.0 svc/app-vllm 8000:8000

Testing Deployment

# List models
python serving-llm/client.py list-of-models

# Test base model
python serving-llm/client.py test-client microsoft/Phi-3-mini-4k-instruct

# Test adapter
python serving-llm/client.py test-client sql-default-model

Model Loader Sidecar

Health Check Loop

while ! curl -s http://localhost:8000/health >/dev/null; do
  echo "vLLM server not ready. Retrying in 5 seconds..."
  sleep 5
done
Waits for vLLM server to start before loading adapters.
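The same readiness wait can be written in Python with an explicit timeout; a sketch (`wait_for_server` is a hypothetical helper, with the health probe injected as a callable so it can be exercised without a live server):

```python
import time

def wait_for_server(probe, timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll probe() (a zero-argument callable that returns True once the
    server is healthy) until success or until timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Against a live server the probe would be something like:
#   probe = lambda: requests.get("http://localhost:8000/health", timeout=2).ok
print(wait_for_server(lambda: True))                                      # True
print(wait_for_server(lambda: False, timeout_s=0.01, interval_s=0.001))   # False
```

Unlike the unbounded shell loop, this version fails fast if the server never comes up, which makes sidecar failures visible in the logs.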

Adapter Loading

# Download from W&B
python serving-llm/client.py load-from-registry \
  truskovskiyk/ml-in-production-practice/modal_generative_example:latest \
  sql-default-model

if [ $? -ne 0 ]; then
  echo "Failed to load model from registry."
  exit 1
fi

# Register with vLLM
python serving-llm/client.py load-adapter \
  sql-default-model \
  ./sql-default-model

if [ $? -ne 0 ]; then
  echo "Failed to load adapter."
  exit 1
fi

Performance Optimization

PagedAttention

vLLM uses paged memory management for KV cache:
Traditional:
[Request 1 KV: ████████░░░░] (wasted memory)
[Request 2 KV: ████░░░░░░░░] (wasted memory)

PagedAttention:
[Page 1][Page 2][Page 3] (shared pool)
Benefits:
  • 2-4x higher throughput
  • Near-zero waste in KV cache memory
  • Dynamic memory allocation
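The memory being managed is the per-token KV cache, whose size is easy to estimate; a back-of-envelope sketch (the layer and head counts below are illustrative, not the exact Phi-3 configuration):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V tensors (factor 2)
    across every layer, in FP16/BF16 (2 bytes) by default."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model with 32 KV heads of dimension 96:
per_token = kv_bytes_per_token(32, 32, 96)
print(per_token)        # 393216 bytes (384 KiB) per token
print(per_token * 512)  # 201326592 bytes (192 MiB) for a full 512-token sequence
```

With fixed per-request reservations, most of that 192 MiB is wasted whenever a request finishes early; PagedAttention instead allocates the cache in small pages on demand from a shared pool.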

Continuous Batching

Static batching:
[Req1 Req2 Req3] -> Wait for all -> [Req4 Req5 Req6]

Continuous batching:
[Req1 Req2] -> [Req2 Req3] -> [Req3 Req4] (rolling)
Advantages:
  • Higher GPU utilization
  • Lower latency for short requests
  • Better throughput for mixed workloads
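The rolling behavior can be illustrated with a toy scheduler (purely a sketch: one "step" generates one token for every running request, and freed batch slots are refilled immediately instead of waiting for the whole batch to drain):

```python
from collections import deque

def simulate_continuous_batching(token_counts, max_batch=2):
    """Toy model of continuous batching. token_counts[i] is how many
    tokens request i must generate; returns (total_steps, finish_order)."""
    queue = deque(range(len(token_counts)))
    remaining = list(token_counts)
    running, finished_order, steps = [], [], 0
    while queue or running:
        # Refill freed slots from the queue before each step.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1
        for r in running[:]:  # one token per running request
            remaining[r] -= 1
            if remaining[r] == 0:
                running.remove(r)
                finished_order.append(r)
    return steps, finished_order

steps, order = simulate_continuous_batching([1, 3, 1], max_batch=2)
print(steps, order)  # 3 [0, 2, 1]
```

The short request (id 2) slots in as soon as request 0 finishes, so all three complete in 3 steps; static batching of the same load ([0, 1] together, then [2]) would take max(1, 3) + 1 = 4 steps.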

Memory Configuration

# Aggressive memory usage (higher throughput)
vllm serve model --gpu-memory-utilization 0.95

# Conservative (more stability)
vllm serve model --gpu-memory-utilization 0.7

# With tensor parallelism for large models
vllm serve model --tensor-parallel-size 2

Multi-Adapter Serving

Loading Multiple Adapters

# Load SQL adapter
python serving-llm/client.py load-from-registry \
  org/project/sql-adapter:v1 sql-adapter
python serving-llm/client.py load-adapter sql-adapter ./sql-adapter

# Load summarization adapter
python serving-llm/client.py load-from-registry \
  org/project/summarization-adapter:v1 summarization
python serving-llm/client.py load-adapter summarization ./summarization

# List all available models
python serving-llm/client.py list-of-models

Routing Requests

def route_request(task: str, query: str) -> str:
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")
    
    # Select adapter based on task
    model_map = {
        "sql": "sql-adapter",
        "summarize": "summarization",
        "general": "microsoft/Phi-3-mini-4k-instruct"
    }
    
    model = model_map.get(task, "microsoft/Phi-3-mini-4k-instruct")
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}]
    )
    
    return response.choices[0].message.content

Monitoring and Debugging

Server Logs

# vLLM server logs
kubectl logs <pod-name> -c app-vllm

# Adapter loader logs
kubectl logs <pod-name> -c model-loader

Metrics Endpoint

curl http://localhost:8000/metrics
Key metrics:
  • vllm:num_requests_running: Active requests
  • vllm:num_requests_waiting: Queued requests
  • vllm:gpu_cache_usage_perc: GPU memory usage
  • vllm:time_to_first_token_seconds: TTFT latency
  • vllm:time_per_output_token_seconds: Generation speed
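For ad-hoc checks the Prometheus text format is easy to parse by hand; a minimal sketch (for production use the official prometheus_client parser instead — this one skips HELP/TYPE comment lines and leaves any label strings fused into the metric name):

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines from Prometheus text output into a dict,
    skipping comment lines and anything that is not a number."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

# Sample resembling `curl http://localhost:8000/metrics` output:
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 2.0
vllm:num_requests_waiting 0.0
vllm:gpu_cache_usage_perc 0.37
"""
m = parse_metrics(sample)
print(m["vllm:num_requests_running"])  # 2.0
```

A simple cron or sidecar loop over this parser is enough to alert on growing `vllm:num_requests_waiting` before latency degrades.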

Troubleshooting

Problem: Model not appearing in list after load

Solutions:
# Check server logs
kubectl logs <pod> -c app-vllm | grep -i lora

# Verify adapter files exist
kubectl exec <pod> -c model-loader -- ls -la ./sql-default-model

# Retry loading
python serving-llm/client.py load-adapter sql-default-model ./sql-default-model
Problem: Out of memory during inference

Solutions:
  • Reduce --gpu-memory-utilization to 0.7
  • Decrease --max-model-len to 256
  • Use quantization: --quantization awq
  • Enable tensor parallelism for multi-GPU
Problem: High latency for requests

Solutions:
  • Check GPU utilization: nvidia-smi
  • Increase --max-num-seqs for more batching
  • Reduce --max-model-len if not needed
  • Use speculative decoding: --speculative-model

Best Practices

Adapter Management

Version adapters in registry and update via CI/CD

Memory Tuning

Start with 0.8 GPU utilization and adjust based on OOM errors

Monitoring

Track TTFT and tokens/second for performance

Batching

Enable continuous batching for mixed workload latency

Next Steps

Practice Tasks

Complete Module 5 practice assignments
