The LLM class provides a high-level Python interface for offline batch inference with Mini-SGLang. It extends the Scheduler class to handle request management and generation.

Constructor

model_path
str
required
Path to the model weights directory or Hugging Face model identifier
dtype
torch.dtype
default:"torch.bfloat16"
Data type for model weights and computation. Common values:
  • torch.bfloat16 (default)
  • torch.float16
  • torch.float32
**kwargs
dict
Additional configuration options passed to SchedulerConfig, including:
  • max_running_req: Maximum number of concurrent requests
  • page_size: KV cache page size
  • attention_backend: Attention implementation ("auto", "fa", "fi", "trtllm")
  • cuda_graph_bs: Batch sizes for CUDA graph capture
from minisgl.llm import LLM
import torch

llm = LLM(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16
)
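A constructor call that forwards scheduler options through **kwargs might look like the following sketch. The option names are those listed above; the specific values are illustrative, not recommended defaults:

```python
from minisgl.llm import LLM
import torch

llm = LLM(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    max_running_req=64,        # cap on concurrently running requests
    page_size=16,              # KV cache page size
    attention_backend="auto",  # let the scheduler pick an implementation
)
```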

Methods

generate()

Generate text completions for a batch of prompts.
prompts
List[str] | List[List[int]]
required
Input prompts as either:
  • List of strings (will be tokenized automatically)
  • List of token ID lists (pre-tokenized)
sampling_params
SamplingParams | List[SamplingParams]
required
Sampling configuration(s). Can be:
  • A single SamplingParams object (applied to all prompts)
  • A list of SamplingParams (one per prompt)
Returns: List[Dict[str, str | List[int]]]
Each dictionary contains:
  • "text": Generated text as a string
  • "token_ids": Generated token IDs as a list
from minisgl.core import SamplingParams

prompts = [
    "What is the capital of France?",
    "Explain quantum computing in simple terms."
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output["text"])
    print(f"Generated {len(output['token_ids'])} tokens")
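When a single SamplingParams is passed with multiple prompts, it is applied to every prompt; a list must supply one entry per prompt. That broadcasting behavior can be sketched as follows, using a hypothetical stand-in dataclass rather than the real minisgl.core.SamplingParams:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class SamplingParams:  # stand-in for minisgl.core.SamplingParams
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 16

def broadcast_params(
    prompts: List[str],
    params: Union[SamplingParams, List[SamplingParams]],
) -> List[SamplingParams]:
    """Expand a single SamplingParams to one entry per prompt."""
    if isinstance(params, SamplingParams):
        return [params] * len(prompts)
    if len(params) != len(prompts):
        raise ValueError("need exactly one SamplingParams per prompt")
    return list(params)

per_prompt = broadcast_params(["a", "b", "c"], SamplingParams(temperature=0.8))
print(len(per_prompt))  # → 3
```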

Usage Examples

Basic Generation

from minisgl.llm import LLM
from minisgl.core import SamplingParams

llm = LLM(model_path="meta-llama/Llama-3.2-1B-Instruct")

prompts = ["Once upon a time"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

Batch Generation with Different Parameters

prompts = [
    "Write a haiku about coding",
    "Explain machine learning",
    "What is 2+2?"
]

sampling_params = [
    SamplingParams(temperature=0.9, max_tokens=50),   # Creative
    SamplingParams(temperature=0.7, max_tokens=200),  # Balanced
    SamplingParams(temperature=0.0, max_tokens=10),   # Greedy
]

outputs = llm.generate(prompts, sampling_params)

Pre-tokenized Input

# Using token IDs directly
token_ids = [[1, 2643, 338, 263, 4688]]  # Pre-tokenized prompt

sampling_params = SamplingParams(max_tokens=50)
outputs = llm.generate(token_ids, sampling_params)
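Because generate() accepts either a list of strings or a list of token-ID lists, a caller-side check of which form a batch takes can be sketched like this (a hypothetical helper for illustration, not part of minisgl):

```python
from typing import List, Union

def needs_tokenization(prompts: Union[List[str], List[List[int]]]) -> bool:
    """Return True if the batch is raw strings that still need tokenizing,
    False if it is already lists of token IDs."""
    if not prompts:
        raise ValueError("empty prompt batch")
    if all(isinstance(p, str) for p in prompts):
        return True
    if all(isinstance(p, list) and all(isinstance(t, int) for t in p)
           for p in prompts):
        return False
    raise TypeError("prompts must be all strings or all token-ID lists")

print(needs_tokenization(["Hello world"]))   # → True
print(needs_tokenization([[1, 2643, 338]]))  # → False
```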

Notes

  • The LLM class runs in offline mode, processing batches synchronously
  • Internally manages tokenization via the model’s tokenizer
  • Automatically handles request scheduling and batching
  • Thread-safe for single-instance usage
