The `LLM` class provides a high-level Python interface for offline batch inference with Mini-SGLang. It extends the `Scheduler` class to handle request management and generation.
Constructor
- Path to the model weights directory or Hugging Face model identifier.
- Data type for model weights and computation. Common values: `torch.bfloat16` (default), `torch.float16`, `torch.float32`.
- Additional configuration options passed to `SchedulerConfig`, including:
  - `max_running_req`: maximum number of concurrent requests
  - `page_size`: KV cache page size
  - `attention_backend`: attention implementation (`"auto"`, `"fa"`, `"fi"`, `"trtllm"`)
  - `cuda_graph_bs`: batch sizes for CUDA graph capture
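A minimal construction sketch. The import path and the `dtype` keyword name are assumptions; the remaining keyword names come from the options listed above and are forwarded to `SchedulerConfig`:

```python
import torch
from mini_sglang import LLM  # import path assumed

# Extra keyword arguments are passed through to SchedulerConfig.
llm = LLM(
    "meta-llama/Llama-3.1-8B-Instruct",  # model path/identifier is illustrative
    dtype=torch.float16,       # parameter name assumed; default is torch.bfloat16
    max_running_req=64,        # cap on concurrent requests
    page_size=16,              # KV cache page size
    attention_backend="fa",    # or "auto", "fi", "trtllm"
)
```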
Methods
generate()
Generate text completions for a batch of prompts.

Parameters:

- Input prompts, as either:
  - a list of strings (tokenized automatically)
  - a list of token ID lists (pre-tokenized)
- Sampling configuration(s), as either:
  - a single `SamplingParams` object (applied to all prompts)
  - a list of `SamplingParams` objects (one per prompt)

Returns `List[Dict[str, str | List[int]]]`. Each dictionary contains:

- `"text"`: the generated text as a string
- `"token_ids"`: the generated token IDs as a list
Usage Examples
Basic Generation
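A minimal sketch, assuming the package exposes `LLM` and `SamplingParams` at the top level and that `SamplingParams` accepts `temperature` and `max_new_tokens` (names assumed):

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

# Load the model; dtype defaults to torch.bfloat16.
llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

# A single SamplingParams is applied to every prompt in the batch.
params = SamplingParams(temperature=0.8, max_new_tokens=64)  # field names assumed
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    params,
)

for out in outputs:
    print(out["text"])
```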
Batch Generation with Different Parameters
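Per-prompt sampling configuration is expressed by passing a list of `SamplingParams`, one per prompt. The sketch below pairs a creative-sampling configuration with a near-greedy one (field names assumed, as above):

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "Write a haiku about autumn.",
    "List three prime numbers.",
]
# One SamplingParams per prompt, matched by position.
params = [
    SamplingParams(temperature=1.0, max_new_tokens=48),  # creative sampling
    SamplingParams(temperature=0.0, max_new_tokens=16),  # deterministic decoding
]
outputs = llm.generate(prompts, params)
```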
Pre-tokenized Input
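Passing token ID lists skips the internal tokenization step, which is useful when prompts are already encoded. The IDs below are placeholders, not real token IDs for any tokenizer:

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

# Each inner list is one pre-tokenized prompt (placeholder IDs).
token_ids = [
    [128000, 9906, 11, 1917],
    [128000, 791, 6864, 315],
]
outputs = llm.generate(token_ids, SamplingParams(max_new_tokens=32))

# Each result carries both the decoded text and the generated token IDs.
for out in outputs:
    print(out["token_ids"], out["text"])
```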
Notes
- The `LLM` class runs in offline mode, processing batches synchronously
- Internally manages tokenization via the model's tokenizer
- Automatically handles request scheduling and batching
- Thread-safe for single-instance usage