Overview
VLLMOffline provides high-performance offline inference using the vLLM library. It supports distributed inference across multiple GPUs with tensor and pipeline parallelism, making it ideal for deploying large language models locally.
Class Definition
src/remem/llm/vllm_offline.py:36
Key Features
- Multi-GPU tensor and pipeline parallelism
- Automatic prefix caching for efficiency
- Batched inference with progress tracking
- JSON schema-guided generation
- Support for quantized models (BitsAndBytes)
- Chat template conversion
Initialization
- Global configuration object containing model settings
- Directory for cache files; defaults to {global_config.save_dir}/llm_cache
- Custom cache filename; defaults to {model_name}_cache.sqlite
- Additional configuration options:
  - model_name (str): Model name or path (required if not in global_config)
  - num_gpus (int): Number of GPUs to use
  - seed (int): Random seed (default: 0)
  - gpu_memory_utilization (float): GPU memory utilization (default: 0.93)
  - quantization (str): Quantization method (e.g., "bitsandbytes")
src/remem/llm/vllm_offline.py:41
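The documented cache-path defaults can be sketched as below. The helper name is hypothetical, and replacing "/" in the model name is an assumption so Hugging Face-style names stay a single path component:

```python
from pathlib import Path
from typing import Optional

def resolve_cache_path(save_dir: str, model_name: str,
                       cache_dir: Optional[str] = None,
                       cache_filename: Optional[str] = None) -> Path:
    """Apply the documented defaults for the cache location."""
    base = Path(cache_dir) if cache_dir else Path(save_dir) / "llm_cache"
    # Default filename: {model_name}_cache.sqlite; sanitizing "/" is an assumption.
    name = cache_filename or f"{model_name.replace('/', '_')}_cache.sqlite"
    return base / name
```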
Parallelism Configuration
The client automatically configures parallelism based on model size.
Small Models (4B, 7B, 8B):
src/remem/llm/vllm_offline.py:48
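The size-based selection might look like the sketch below. Only the small-model branch is stated in this document; the large-model branch and the 8-GPU grouping are assumptions for illustration:

```python
def choose_parallelism(num_gpus: int, model_name: str) -> dict:
    """Hypothetical sketch of size-based parallelism selection."""
    # Small models (4B/7B/8B): tensor parallelism only, no pipeline stages.
    if any(tag in model_name for tag in ("4B", "7B", "8B")):
        return {"tensor_parallel_size": num_gpus, "pipeline_parallel_size": 1}
    # Larger models: also split layers into pipeline stages (assumed policy).
    return {"tensor_parallel_size": min(num_gpus, 8),
            "pipeline_parallel_size": max(1, num_gpus // 8)}
```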
Core Methods
infer
- List of chat messages; each message is a dictionary with role and content keys
- Maximum number of tokens to generate
- Additional generation parameters (reserved for future use)
A tuple containing:
- response (str): The generated text
- metadata (dict): Contains:
  - prompt_tokens: Number of input tokens
  - completion_tokens: Number of generated tokens
  - prompt: Original input messages
- cache_hit: Always None (no caching for offline inference)
src/remem/llm/vllm_offline.py:82
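The return shape above can be assembled as in this sketch (the helper name is illustrative, not the class's actual code):

```python
def build_infer_result(response: str, prompt_ids: list, output_ids: list,
                       messages: list):
    """Assemble the documented (response, metadata, cache_hit) return value."""
    metadata = {
        "prompt_tokens": len(prompt_ids),      # number of input tokens
        "completion_tokens": len(output_ids),  # number of generated tokens
        "prompt": messages,                    # original input messages
    }
    return response, metadata, None  # cache_hit is always None offline
```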
Implementation Details:
The method converts chat messages to token IDs using the model's tokenizer (see Chat Template Conversion below).
batch_infer
- Batch of message sequences
- Maximum tokens per completion
- Name of the JSON schema template for guided generation; available templates are defined in remem.utils.llm_utils.JSON_SCHEMA

Returns a list of (response, metadata, cache_hit) tuples in the same order as input.
src/remem/llm/vllm_offline.py:98
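Pairing per-request outputs into the documented tuples, preserving input order, can be sketched as follows (helper name illustrative):

```python
def collect_batch_results(responses: list, metadatas: list):
    """Zip per-request outputs into (response, metadata, cache_hit) tuples.

    zip() walks both lists in input order, so result order matches the
    order of the submitted prompts.
    """
    return [(resp, meta, None) for resp, meta in zip(responses, metadatas)]
```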
JSON-Guided Generation
Use schema-guided generation for structured outputs:
src/remem/llm/vllm_offline.py:104
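A minimal consumer-side sketch of what schema-guided output enables: the response can be parsed as JSON and checked against the template's required keys. The `JSON_SCHEMA` registry below is a hypothetical stand-in for `remem.utils.llm_utils.JSON_SCHEMA`, whose actual contents are not shown in this document:

```python
import json

# Hypothetical template registry standing in for remem.utils.llm_utils.JSON_SCHEMA.
JSON_SCHEMA = {
    "qa": {"required": ["answer"]},
}

def check_guided_output(text: str, template: str) -> dict:
    """Parse a guided-generation response and verify required keys."""
    obj = json.loads(text)
    missing = [k for k in JSON_SCHEMA[template]["required"] if k not in obj]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return obj
```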
vLLM Configuration
The vLLM engine is initialized with:
- Automatic prefix caching enabled for repeated prompts
- CUDA graphs for better performance
- Permission to load models with custom code
- A cap on the maximum number of sequences to process in parallel
src/remem/llm/vllm_offline.py:60
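Assuming these settings map onto the standard vLLM constructor flags, the engine setup might look like the sketch below; the model name and numeric values are illustrative, not remem's exact defaults:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any local path or HF name
    enable_prefix_caching=True,   # automatic prefix caching for repeated prompts
    enforce_eager=False,          # keep CUDA graphs enabled for performance
    trust_remote_code=True,       # allow models that ship custom code
    max_num_seqs=256,             # max sequences processed in parallel
    gpu_memory_utilization=0.93,
    tensor_parallel_size=2,
    seed=0,
)
```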
Chat Template Conversion
Messages are converted to model-specific formats using the tokenizer:
- Apply chat template
- Tokenize without special tokens
- Return input IDs for vLLM
src/remem/llm/vllm_offline.py:19
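The three steps above can be sketched as below. `ToyTokenizer` is a stand-in so the sketch runs anywhere; the real code uses the model's own tokenizer, whose `apply_chat_template` produces the model-specific format:

```python
class ToyTokenizer:
    """Stand-in exposing the two calls the conversion needs."""
    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=True):
        text = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        return (text + "<|assistant|>") if add_generation_prompt else text
    def __call__(self, text, add_special_tokens=False):
        return {"input_ids": [ord(ch) for ch in text]}  # toy: one id per char

def to_input_ids(messages, tokenizer):
    """The documented steps: apply chat template, tokenize without
    special tokens, return input IDs for vLLM."""
    text = tokenizer.apply_chat_template(messages, tokenize=False,
                                         add_generation_prompt=True)
    return tokenizer(text, add_special_tokens=False)["input_ids"]
```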
Environment Configuration
The client sets:
src/remem/llm/vllm_offline.py:57
Batch Metadata
When using batch_infer, you get both per-request and overall metadata:
src/remem/llm/vllm_offline.py:124
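Rolling the per-request metadata up into overall totals can be sketched as follows (the aggregation keys are assumed to mirror the per-request metadata keys documented under infer):

```python
def summarize_batch(results):
    """Aggregate per-request metadata into overall token totals."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for _response, metadata, _cache_hit in results:
        totals["prompt_tokens"] += metadata["prompt_tokens"]
        totals["completion_tokens"] += metadata["completion_tokens"]
    return totals
```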
Offline vs Online Mode
VLLMOffline (Offline Mode):
- Runs models locally on your GPUs
- No API calls or network requests
- Full control over model and hardware
- Supports quantization and distributed inference
- No per-token costs
- Requires GPU resources
OpenAI LLM Client (Online Mode):
- Uses OpenAI API or compatible endpoints
- Requires internet connection and API key
- Pay-per-use pricing model
- No local GPU requirements
- Response caching to reduce costs
- Supports latest OpenAI models
Performance Tips
- Enable prefix caching for repeated prompts (enabled by default)
- Use batching for multiple requests to maximize GPU utilization
- Tune GPU memory utilization based on your hardware (default: 0.93)
- Configure parallelism appropriately for your model size
- Use quantization for large models on limited GPU memory
Error Handling
Batch inference includes error handling:
src/remem/llm/vllm_offline.py:110
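One common shape for such handling is per-request isolation, sketched below; this is a hypothetical illustration, as the document does not specify the actual strategy:

```python
def run_batch_safely(generate, batch):
    """Hypothetical per-request isolation: a failing request yields an
    error entry instead of aborting the whole batch."""
    results = []
    for messages in batch:
        try:
            results.append((generate(messages), None))
        except Exception as exc:  # the real handler may log or retry instead
            results.append((None, str(exc)))
    return results
```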
See Also
- BaseLLM Interface - Base class documentation
- OpenAI LLM Client - Online inference alternative
- Configuration - BaseConfig documentation
- vLLM Documentation - Official vLLM docs