The `LLM` class provides a high-level Python interface for offline batch inference with Mini-SGLang. It extends the `Scheduler` class to handle request management and generation.
Constructor
- Path to the model weights directory or Hugging Face model identifier.
- Data type for model weights and computation. Common values: `torch.bfloat16` (default), `torch.float16`, `torch.float32`.
- Additional configuration options passed to `SchedulerConfig`, including:
  - `max_running_req`: maximum number of concurrent requests
  - `page_size`: KV cache page size
  - `attention_backend`: attention implementation (`"auto"`, `"fa"`, `"fi"`, `"trtllm"`)
  - `cuda_graph_bs`: batch sizes for CUDA graph capture
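A minimal construction sketch. The import path and the `dtype` keyword name are assumptions; the remaining keyword names come from the options listed above and are forwarded to `SchedulerConfig`:

```python
import torch
from mini_sglang import LLM  # import path assumed

# Extra keyword arguments are passed through to SchedulerConfig.
llm = LLM(
    "meta-llama/Llama-3.1-8B-Instruct",  # model path/identifier is illustrative
    dtype=torch.float16,       # parameter name assumed; default is torch.bfloat16
    max_running_req=64,        # cap on concurrent requests
    page_size=16,              # KV cache page size
    attention_backend="fa",    # or "auto", "fi", "trtllm"
)
```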
Methods
generate()
Generate text completions for a batch of prompts.

Parameters:

- Input prompts, as either:
  - a list of strings (tokenized automatically)
  - a list of token ID lists (pre-tokenized)
- Sampling configuration(s), as either:
  - a single `SamplingParams` object (applied to all prompts)
  - a list of `SamplingParams` objects (one per prompt)

Returns `List[Dict[str, str | List[int]]]`. Each dictionary contains:

- `"text"`: the generated text as a string
- `"token_ids"`: the generated token IDs as a list
Usage Examples
Basic Generation
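A minimal sketch, assuming the package exposes `LLM` and `SamplingParams` at the top level and that `SamplingParams` accepts `temperature` and `max_new_tokens` (names assumed):

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

# Load the model; dtype defaults to torch.bfloat16.
llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

# A single SamplingParams is applied to every prompt in the batch.
params = SamplingParams(temperature=0.8, max_new_tokens=64)  # field names assumed
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    params,
)

for out in outputs:
    print(out["text"])
```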
Batch Generation with Different Parameters
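Per-prompt sampling configuration is expressed by passing a list of `SamplingParams`, one per prompt. The sketch below pairs a creative-sampling configuration with a near-greedy one (field names assumed, as above):

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "Write a haiku about autumn.",
    "List three prime numbers.",
]
# One SamplingParams per prompt, matched by position.
params = [
    SamplingParams(temperature=1.0, max_new_tokens=48),  # creative sampling
    SamplingParams(temperature=0.0, max_new_tokens=16),  # deterministic decoding
]
outputs = llm.generate(prompts, params)
```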
Pre-tokenized Input
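Passing token ID lists skips the internal tokenization step, which is useful when prompts are already encoded. The IDs below are placeholders, not real token IDs for any tokenizer:

```python
from mini_sglang import LLM, SamplingParams  # import path assumed

llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

# Each inner list is one pre-tokenized prompt (placeholder IDs).
token_ids = [
    [128000, 9906, 11, 1917],
    [128000, 791, 6864, 315],
]
outputs = llm.generate(token_ids, SamplingParams(max_new_tokens=32))

# Each result carries both the decoded text and the generated token IDs.
for out in outputs:
    print(out["token_ids"], out["text"])
```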
Notes
- The `LLM` class runs in offline mode, processing batches synchronously
- Internally manages tokenization via the model's tokenizer
- Automatically handles request scheduling and batching
- Thread-safe for single-instance usage