# Engine

The `Engine` class is the main entry point to the SGLang inference engine. It provides a Python API for text generation and embedding tasks.
## Architecture
The engine consists of three components:

- **TokenizerManager**: Tokenizes requests and sends them to the scheduler
- **Scheduler** (subprocess): Receives requests, schedules batches, runs the forward passes, and sends output tokens to the detokenizer
- **DetokenizerManager** (subprocess): Detokenizes output tokens and sends results back to the TokenizerManager

The HTTP server, `Engine`, and TokenizerManager all run in the main process. Inter-process communication is done over IPC using the ZMQ library.
## Initialization

### Constructor Parameters
- `model_path`: Path to the model on Hugging Face or the local filesystem.
- `tokenizer_path`: Path to the tokenizer. Defaults to `model_path` if not specified.
- `tp_size`: Tensor parallelism size, i.e. the number of GPUs to use for model parallelism.
- `trust_remote_code`: Whether to trust remote code when loading the model.
- `context_length`: Maximum context length. Auto-detected from the model config if not specified.
- `mem_fraction_static`: Fraction of GPU memory to use for static allocation (model weights + KV cache).
- `log_level`: Logging level. Options: "debug", "info", "warning", "error".
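Putting the parameters above together, a minimal construction might look like the sketch below. The model name is a placeholder; any supported Hugging Face checkpoint or local path works.

```python
import sglang as sgl

# Minimal sketch of constructing an Engine. Only model_path is required;
# the other keyword arguments show the constructor parameters listed above.
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # HF repo or local path (placeholder)
    tp_size=1,                   # number of GPUs for tensor parallelism
    mem_fraction_static=0.85,    # fraction of GPU memory for weights + KV cache
    log_level="info",
)

# ... use the engine ...

llm.shutdown()  # release GPU memory and subprocesses
```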
## Methods

### generate
Generate text completions synchronously.

**Parameters:**

- `prompt`: Input prompt(s). Can be a single string or a list of strings for batching.
- `sampling_params`: Sampling parameters. See SamplingParams for details.
- `input_ids`: Token IDs for the input text. Use either `prompt` or `input_ids`, not both.
- `image_data`: Image input(s) for multimodal models. Can be:
  - A single image (file path, URL, or base64 string)
  - A list of images (one per request)
  - A list of lists of images (multiple images per request)
- `return_logprob`: Whether to return log probabilities.
- `stream`: Whether to stream the response token by token.
- `data_parallel_rank`: Data parallel rank to route the request to when using data parallelism.
**Returns:** `Union[Dict, Iterator[Dict]]` — a response dictionary, or an iterator of dictionaries when streaming.

**Response format:**
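A non-streaming call returns a plain dictionary. The sketch below shows a plausible shape; the exact `meta_info` keys vary by version, so treat the field names here as illustrative rather than authoritative.

```python
# Illustrative (not authoritative) shape of a non-streaming generate() response.
response = {
    "text": " Paris.",                 # the generated completion
    "meta_info": {                     # per-request metadata (exact keys vary)
        "prompt_tokens": 6,
        "completion_tokens": 3,
        "finish_reason": {"type": "stop"},
    },
}

assert isinstance(response["text"], str)
```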
### async_generate
Generate text completions asynchronously. Parameters are identical to `generate()`. Returns `Union[Dict, AsyncIterator[Dict]]`; an async iterator is returned when streaming.

### encode
Generate embeddings for input text.

**Parameters:**

- `prompt`: Text or messages to encode.
- `image_data`: Image data for multimodal embedding models.
- `dimensions`: Output embedding dimensions (if the model supports dimensionality reduction).

**Returns:** `Dict` — an embeddings dictionary.
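As a sketch, embedding generation might look like the following. The model name is a placeholder for any embedding checkpoint, and the `is_embedding` flag plus the `"embedding"` result key are assumptions that may differ across versions.

```python
import sglang as sgl

# Hedged sketch: run an embedding model and encode a single string.
llm = sgl.Engine(
    model_path="intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    is_embedding=True,                             # assumed flag for embedding mode
)

result = llm.encode("What is the capital of France?")
vector = result["embedding"]   # assumed key holding the embedding vector

llm.shutdown()
```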
### async_encode

Asynchronous version of `encode()`.
### score
Score the probability of label tokens appearing after (query + item) pairs.

**Parameters:**

- `query`: Query text or pre-tokenized token IDs.
- `items`: Item text(s) or pre-tokenized token IDs to score.
- `label_token_ids`: List of token IDs to compute probabilities for.
- `apply_softmax`: Whether to normalize probabilities using softmax.
- `item_first`: If `True`, prepend items to the query; otherwise, append items to the query.

**Returns:** `ScoreResult` with `scores` and `prompt_tokens` fields.
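A hedged sketch of scoring, assuming an `Engine` instance `llm` already exists. The label token IDs below are placeholders; in practice you would look up the IDs of your label strings (e.g. "Yes"/"No") in the model's tokenizer.

```python
# Hypothetical sketch: score two candidate labels after each (query + item) pair.
query = "Is the following review positive? Review: "
items = ["I loved this movie.", "Terrible, a waste of time."]

yes_id, no_id = 9642, 2822  # placeholder token IDs; obtain real ones from the tokenizer

result = llm.score(
    query=query,
    items=items,
    label_token_ids=[yes_id, no_id],
    apply_softmax=True,   # normalize the two label probabilities per item
)

print(result.scores)         # one probability pair per item
print(result.prompt_tokens)  # token counts for the scored prompts
```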
## Session Management

### open_session
Open a session for multi-turn conversations with shared context.

**Parameters:**

- `capacity_of_str_len`: Maximum string length capacity for the session.
- `session_id`: Optional session ID. A UUID is generated if not provided.
- An option to use a low-overhead path for realtime streaming (append-only mode).
- An option to auto-close the session after a given number of seconds of inactivity.

**Returns:** `str` — the session ID.
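A minimal sketch of the session lifecycle, assuming an `Engine` instance `llm` already exists; the `capacity_of_str_len` value and the `close_session` call signature are assumptions based on the descriptions above.

```python
# Hypothetical session lifecycle sketch.
session_id = llm.open_session(capacity_of_str_len=4096)

# ... issue multi-turn generate() calls associated with this session ...

llm.close_session(session_id)  # release the session's resources
```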
### close_session

Close a session and release its resources.

## Weight Management
### update_weights_from_disk

Update model weights from disk without restarting the engine.

### load_lora_adapter
Load a LoRA adapter without restarting the engine.

### unload_lora_adapter

Unload a LoRA adapter.

## Profiling and Monitoring
### get_server_info

Get server configuration and runtime information.

### start_profile / stop_profile
Start and stop performance profiling.

### flush_cache

Flush the KV cache.

## Shutdown
### shutdown

Shut down the engine and all subprocesses.

You can also use the engine as a context manager:
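A sketch of context-manager usage; the model name and sampling parameters are placeholders.

```python
import sglang as sgl

# With a context manager, shutdown() runs automatically on exit.
with sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct") as llm:
    out = llm.generate("Hello, my name is", {"max_new_tokens": 16})
    print(out["text"])
# engine and subprocesses are shut down here
```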
## Usage Examples

### Basic Text Generation
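A minimal end-to-end sketch; the model name is a placeholder, and the sampling parameters are passed as a plain dict.

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 64}
output = llm.generate("The capital of France is", sampling_params)
print(output["text"])

llm.shutdown()
```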
### Streaming Generation
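A streaming sketch, assuming an `Engine` instance `llm` already exists. It also assumes each chunk's `"text"` field holds the cumulative output so far, so only the new suffix is printed; if your version yields incremental chunks instead, drop the offset bookkeeping.

```python
sampling_params = {"temperature": 0.8, "max_new_tokens": 64}

# stream=True turns the return value into an iterator of partial responses.
offset = 0
for chunk in llm.generate("Tell me a story:", sampling_params, stream=True):
    print(chunk["text"][offset:], end="", flush=True)  # print only the new text
    offset = len(chunk["text"])
print()
```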
### Batch Generation
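Passing a list of prompts batches them in one call, returning one response dict per prompt in the same order. This sketch assumes an `Engine` instance `llm` already exists.

```python
prompts = [
    "The capital of France is",
    "The largest planet in the solar system is",
]
sampling_params = {"temperature": 0.0, "max_new_tokens": 32}

# A list input yields a list of response dicts, one per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
```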
### Multimodal Generation
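A hedged sketch for a vision-language model. The model name, the `<image>` placeholder token, and the image URL are all assumptions; the correct image token depends on the checkpoint's chat template.

```python
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen2-VL-7B-Instruct")  # placeholder VLM

output = llm.generate(
    prompt="<image>Describe this image.",      # image token depends on the model
    sampling_params={"max_new_tokens": 64},
    image_data="https://example.com/cat.png",  # file path, URL, or base64 string
)
print(output["text"])

llm.shutdown()
```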
## See Also
- Runtime - HTTP server wrapper for the Engine
- SamplingParams - Sampling parameter configuration
- ServerArgs - Complete server configuration options
