Overview

VLLMOffline provides high-performance offline inference using the vLLM library. It supports distributed inference across multiple GPUs with tensor and pipeline parallelism, making it ideal for deploying large language models locally.

Class Definition

from remem.llm.vllm_offline import VLLMOffline
Location: src/remem/llm/vllm_offline.py:36

Key Features

  • Multi-GPU tensor and pipeline parallelism
  • Automatic prefix caching for efficiency
  • Batched inference with progress tracking
  • JSON schema-guided generation
  • Support for quantized models (BitsAndBytes)
  • Chat template conversion

Initialization

def __init__(self, global_config, cache_dir=None, cache_filename=None, **kwargs)
Parameters:
  • global_config (BaseConfig, required): Global configuration object containing model settings
  • cache_dir (str, default: None): Directory for cache files. Defaults to {global_config.save_dir}/llm_cache
  • cache_filename (str, default: None): Custom cache filename. Defaults to {model_name}_cache.sqlite
  • **kwargs (dict): Additional configuration options:
    • model_name (str): Model name or path (required if not in global_config)
    • num_gpus (int): Number of GPUs to use
    • seed (int): Random seed (default: 0)
    • gpu_memory_utilization (float): GPU memory utilization (default: 0.93)
    • quantization (str): Quantization method (e.g., "bitsandbytes")
Example:
from remem.utils.config_utils import BaseConfig
from remem.llm.vllm_offline import VLLMOffline

config = BaseConfig(
    llm_name="meta-llama/Llama-3.1-70B-Instruct",
    max_model_len=8192,
    max_num_seqs=256,
    vllm_tensor_parallel_size=4
)

llm = VLLMOffline(
    global_config=config,
    gpu_memory_utilization=0.90,
    seed=42
)
Location: src/remem/llm/vllm_offline.py:41
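For large models that do not fit in GPU memory unquantized, the quantization kwarg can be combined with num_gpus. A minimal sketch, assuming a BitsAndBytes-quantized checkpoint (the model path below is illustrative):

quant_config = BaseConfig(
    llm_name="unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit",  # illustrative quantized checkpoint
    max_model_len=8192,
    max_num_seqs=128
)

llm_quantized = VLLMOffline(
    global_config=quant_config,
    quantization="bitsandbytes",  # documented kwarg; selects the BnB load path
    num_gpus=2,                   # with BnB models, used for pipeline parallelism (see below)
    gpu_memory_utilization=0.90
)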

Parallelism Configuration

The client automatically configures parallelism based on model size.

Small Models (4B, 7B, 8B):
tensor_parallel_size = 1
pipeline_parallel_size = 1

Large Models:
tensor_parallel_size = global_config.vllm_tensor_parallel_size
pipeline_parallel_size = 1

Quantized Models (BNB):
tensor_parallel_size = 1
pipeline_parallel_size = num_gpus  # from kwargs or global_config
quantization = "bitsandbytes"
load_format = "bitsandbytes"
Location: src/remem/llm/vllm_offline.py:48
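The selection logic can be summarized in pseudocode (a simplified sketch of the rules above, not the verbatim source):

# Sketch of the parallelism selection rules described above (simplified).
def choose_parallelism(model_name, quantization, num_gpus, global_config):
    if quantization == "bitsandbytes":
        # Quantized models: distribute layers across GPUs via pipeline parallelism
        return {"tensor_parallel_size": 1, "pipeline_parallel_size": num_gpus}
    if any(size in model_name for size in ("4B", "7B", "8B")):
        # Small models fit comfortably on a single GPU
        return {"tensor_parallel_size": 1, "pipeline_parallel_size": 1}
    # Large models: shard each layer across GPUs via tensor parallelism
    return {
        "tensor_parallel_size": global_config.vllm_tensor_parallel_size,
        "pipeline_parallel_size": 1,
    }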

Core Methods

infer

def infer(
    self, 
    messages: List[TextChatMessage], 
    max_completion_tokens=2048, 
    **kwargs
) -> Tuple[str, Dict, None]
Perform single-request inference.

Parameters:
  • messages (List[TextChatMessage], required): List of chat messages. Each message is a dictionary with role and content keys
  • max_completion_tokens (int, default: 2048): Maximum number of tokens to generate
  • **kwargs (dict): Additional generation parameters (reserved for future use)

Returns:
  • Tuple[str, dict, None]: A tuple containing:
    • response (str): The generated text
    • metadata (dict): Contains:
      • prompt_tokens: Number of input tokens
      • completion_tokens: Number of generated tokens
      • prompt: Original input messages
    • cache_hit: Always None (no caching for offline inference)
Example:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

response, metadata, _ = llm.infer(messages, max_completion_tokens=512)
print(f"Response: {response}")
print(f"Tokens: {metadata['prompt_tokens']} input, {metadata['completion_tokens']} output")
Location: src/remem/llm/vllm_offline.py:82

Implementation Details:
The method converts chat messages to token IDs using the model's tokenizer:
prompt_ids = convert_text_chat_messages_to_input_ids(messages, self.tokenizer)
vllm_output = self.client.generate(
    prompt_token_ids=prompt_ids,
    sampling_params=SamplingParams(max_tokens=max_completion_tokens, temperature=0)
)
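The return tuple is then assembled from vLLM's RequestOutput object, roughly as follows (a sketch of the likely post-processing, not the verbatim source):

# Sketch: extracting the response text and token counts from the RequestOutput.
output = vllm_output[0]
response = output.outputs[0].text
metadata = {
    "prompt_tokens": len(output.prompt_token_ids),
    "completion_tokens": len(output.outputs[0].token_ids),
    "prompt": messages,
}
return response, metadata, None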

batch_infer

def batch_infer(
    self, 
    messages_list: List[List[TextChatMessage]], 
    max_tokens=2048, 
    json_template=None
) -> List[Tuple[str, Dict, None]]
Perform batched inference with optional JSON schema guidance.

Parameters:
  • messages_list (List[List[TextChatMessage]], required): Batch of message sequences
  • max_tokens (int, default: 2048): Maximum tokens per completion
  • json_template (str, default: None): Name of a JSON schema template for guided generation. Available templates are defined in remem.utils.llm_utils.JSON_SCHEMA

Returns:
  • results (List[Tuple[str, dict, None]]): List of (response, metadata, cache_hit) tuples in the same order as the input
Example:
messages_batch = [
    [{"role": "user", "content": "What is AI?"}],
    [{"role": "user", "content": "What is ML?"}],
    [{"role": "user", "content": "What is DL?"}]
]

results = llm.batch_infer(messages_batch, max_tokens=100)

for i, (response, metadata, _) in enumerate(results):
    print(f"Question {i+1}: {response}")
    print(f"  Tokens: {metadata['completion_tokens']}")
Location: src/remem/llm/vllm_offline.py:98

JSON-Guided Generation

Use schema-guided generation for structured outputs:
# Assuming JSON_SCHEMA contains a template named "entity_extraction"
results = llm.batch_infer(
    messages_batch,
    max_tokens=512,
    json_template="entity_extraction"
)

import json
for response, metadata, _ in results:
    entities = json.loads(response)
    print(entities)
Implementation:
if json_template is not None:
    from vllm.model_executor.guided_decoding.guided_fields import GuidedDecodingRequest
    guided = GuidedDecodingRequest(guided_json=JSON_SCHEMA[json_template])
    
    vllm_output = self.client.generate(
        prompt_token_ids=all_prompt_ids,
        sampling_params=SamplingParams(max_tokens=max_tokens, temperature=0),
        guided_options_request=guided
    )
Location: src/remem/llm/vllm_offline.py:104
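The schema templates themselves are plain JSON Schema dictionaries keyed by name. A hypothetical entry (the actual templates live in remem.utils.llm_utils.JSON_SCHEMA and may differ):

# Hypothetical illustration of a JSON_SCHEMA entry; see remem.utils.llm_utils for the real ones.
JSON_SCHEMA = {
    "entity_extraction": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["entities"]
    }
}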

vLLM Configuration

The vLLM engine is initialized with:
self.client = LLM(
    model=model_name,
    tensor_parallel_size=tensor_parallel_size,
    pipeline_parallel_size=pipeline_parallel_size,
    seed=kwargs.get("seed", 0),
    max_seq_len_to_capture=global_config.max_model_len,
    enable_prefix_caching=True,
    enforce_eager=False,
    gpu_memory_utilization=kwargs.get("gpu_memory_utilization", 0.93),
    max_model_len=global_config.max_model_len,
    quantization=kwargs.get("quantization", None),
    trust_remote_code=True,
    max_num_seqs=global_config.max_num_seqs
)
Key Settings:
  • enable_prefix_caching (bool, default: True): Enables automatic prefix caching for repeated prompts
  • enforce_eager (bool, default: False): When False, vLLM captures CUDA graphs for better performance
  • trust_remote_code (bool, default: True): Allows loading models with custom code
  • max_num_seqs (int): Maximum number of sequences to process in parallel
Location: src/remem/llm/vllm_offline.py:60

Chat Template Conversion

Messages are converted to model-specific formats using the tokenizer:
def convert_text_chat_messages_to_input_ids(
    messages: List[TextChatMessage], 
    tokenizer: PreTrainedTokenizer
) -> List[List[int]]
Process:
  1. Apply chat template
  2. Tokenize without special tokens
  3. Return input IDs for vLLM
Example template application:
prompt = tokenizer.apply_chat_template(
    conversation=messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
encoded = tokenizer(prompt, add_special_tokens=False)
return encoded["input_ids"]
Location: src/remem/llm/vllm_offline.py:19
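The helper can also be used on its own. A usage sketch, assuming a Hugging Face tokenizer for the configured model:

# Sketch: converting messages to token IDs outside the client.
from transformers import AutoTokenizer
from remem.llm.vllm_offline import convert_text_chat_messages_to_input_ids

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]

prompt_ids = convert_text_chat_messages_to_input_ids(messages, tokenizer)
# prompt_ids can be passed directly to LLM.generate(prompt_token_ids=prompt_ids)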

Environment Configuration

The client sets:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
This ensures proper multiprocessing behavior across different platforms.
Location: src/remem/llm/vllm_offline.py:57

Batch Metadata

When using batch_infer, you get both per-request and overall metadata:
overall_metadata = {
    "prompt_tokens": sum(all_prompt_tokens),
    "completion_tokens": sum(all_completion_tokens),
    "num_request": len(messages_list),
    "prompt": messages_list,
}
Location: src/remem/llm/vllm_offline.py:124
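Assuming each per-request metadata dict carries the same prompt_tokens and completion_tokens keys as infer, the overall totals can be reproduced from the returned list (a usage sketch):

# Sketch: re-deriving the overall totals from per-request metadata.
results = llm.batch_infer(messages_batch, max_tokens=100)

total_prompt = sum(meta["prompt_tokens"] for _, meta, _ in results)
total_completion = sum(meta["completion_tokens"] for _, meta, _ in results)
print(f"{len(results)} requests, {total_prompt} prompt tokens, {total_completion} completion tokens")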

Offline vs Online Mode

VLLMOffline (Offline Mode):
  • Runs models locally on your GPUs
  • No API calls or network requests
  • Full control over model and hardware
  • Supports quantization and distributed inference
  • No per-token costs
  • Requires GPU resources
CacheOpenAI (Online Mode):
  • Uses OpenAI API or compatible endpoints
  • Requires internet connection and API key
  • Pay-per-use pricing model
  • No local GPU requirements
  • Response caching to reduce costs
  • Supports latest OpenAI models

Performance Tips

  1. Enable prefix caching for repeated prompts (enabled by default)
  2. Use batching for multiple requests to maximize GPU utilization
  3. Tune GPU memory utilization based on your hardware (default: 0.93)
  4. Configure parallelism appropriately for your model size
  5. Use quantization for large models on limited GPU memory
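As a concrete illustration of tips 2 and 3, the sketch below lowers the memory fraction for a shared GPU and replaces a per-request loop with a single batched call (values are illustrative):

# Tip 3: reserve headroom when other processes share the GPU (illustrative value)
llm = VLLMOffline(global_config=config, gpu_memory_utilization=0.85)

# Tip 2: prefer one batched call over a Python loop of single-request infer() calls
# for messages in messages_batch:
#     llm.infer(messages, max_completion_tokens=256)   # slower: serial generate calls
results = llm.batch_infer(messages_batch, max_tokens=256)  # faster: one batched call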

Error Handling

Batch inference includes error handling:
try:
    all_prompt_ids = convert_text_chat_messages_to_input_ids(messages_list, self.tokenizer)
    vllm_output = self.client.generate(...)
except Exception as e:
    logger.error("vllm offline batch infer error", str(e))
Location: src/remem/llm/vllm_offline.py:110
