
Engine

The Engine class is the main entry point to the SGLang inference engine. It provides a Python API for text generation and embedding tasks.

Architecture

The engine consists of three components:
  1. TokenizerManager: Tokenizes requests and sends them to the scheduler
  2. Scheduler (subprocess): Receives requests, schedules batches, forwards them, and sends output tokens to the detokenizer
  3. DetokenizerManager (subprocess): Detokenizes output tokens and sends results back to the tokenizer manager
  • The HTTP server, Engine, and TokenizerManager all run in the main process
  • Inter-process communication between these components uses the ZMQ library
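The request flow above can be mimicked with stdlib primitives. This is an illustrative analogy only: plain threads and in-process queues stand in for SGLang's actual subprocesses and ZMQ sockets.

```python
import queue
import threading

# Stand-ins for the channels between the three pipeline stages. In real
# SGLang these are ZMQ sockets crossing process boundaries.
to_scheduler = queue.Queue()
to_detokenizer = queue.Queue()
results = queue.Queue()

def tokenizer_manager(text):
    # Tokenize the request (fake tokenizer: one "token" per word) and
    # forward it to the scheduler.
    to_scheduler.put(text.split())

def scheduler():
    # Receive a request, "run" the model, forward output tokens.
    tokens = to_scheduler.get()
    to_detokenizer.put(tokens + ["<eos>"])

def detokenizer_manager():
    # Detokenize output tokens and send the result back.
    tokens = to_detokenizer.get()
    results.put(" ".join(t for t in tokens if t != "<eos>"))

threads = [threading.Thread(target=f) for f in (scheduler, detokenizer_manager)]
for t in threads:
    t.start()
tokenizer_manager("hello world")
for t in threads:
    t.join()
out = results.get()
print(out)  # hello world
```

The real pipeline differs in that the Scheduler and DetokenizerManager live in separate OS processes, so they survive GIL contention and can batch many concurrent requests.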

Initialization

from sglang import Engine

engine = Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)

Constructor Parameters

  • model_path (str, required): Path to the model on Hugging Face or the local filesystem.
  • tokenizer_path (Optional[str], default None): Path to the tokenizer. Defaults to model_path if not specified.
  • tp_size (int, default 1): Tensor parallelism size; the number of GPUs to use for model parallelism.
  • trust_remote_code (bool, default False): Whether to trust remote code when loading the model.
  • context_length (Optional[int], default None): Maximum context length. Auto-detected from the model config if not specified.
  • mem_fraction_static (Optional[float], default None): Fraction of GPU memory to use for static allocation (model weights + KV cache).
  • log_level (str, default "error"): Logging level. Options: "debug", "info", "warning", "error".
For the complete list of parameters, see ServerArgs.

Methods

generate

Generate text completions synchronously.
response = engine.generate(
    prompt="Once upon a time",
    sampling_params={"temperature": 0.8, "max_new_tokens": 128}
)
print(response["text"])
  • prompt (Optional[Union[List[str], str]]): Input prompt(s). Can be a single string or a list of strings for batching.
  • sampling_params (Optional[Union[List[Dict], Dict]]): Sampling parameters. See SamplingParams for details.
  • input_ids (Optional[Union[List[List[int]], List[int]]]): Pre-tokenized input. Use either prompt or input_ids, not both.
  • image_data (Optional[MultimodalDataInputFormat]): Image input(s) for multimodal models. Can be:
    • a single image (file path, URL, or base64 string)
    • a list of images (one per request)
    • a list of lists of images (multiple images per request)
  • return_logprob (Optional[Union[List[bool], bool]], default False): Whether to return log probabilities.
  • stream (bool, default False): Whether to stream the response token by token.
  • routed_dp_rank (Optional[int], default None): Data parallel rank to route the request to when using data parallelism.
Returns: Union[Dict, Iterator[Dict]] - a response dictionary, or an iterator of response dictionaries when stream=True.
Response format:
{
    "text": "generated text",
    "meta_info": {
        "prompt_tokens": 10,
        "completion_tokens": 20,
        "finish_reason": "stop"
    }
}
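Given a response shaped like the example above, the fields can be consumed directly. The dictionary below is sample data mirroring that format, not real engine output.

```python
# Sample response mirroring the documented format (not produced by a real engine).
response = {
    "text": "generated text",
    "meta_info": {
        "prompt_tokens": 10,
        "completion_tokens": 20,
        "finish_reason": "stop",
    },
}

# Pull out the generated text and token accounting.
meta = response["meta_info"]
total_tokens = meta["prompt_tokens"] + meta["completion_tokens"]
print(response["text"])
print(f"used {total_tokens} tokens, finished via {meta['finish_reason']}")
```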

async_generate

Generate text completions asynchronously.
import asyncio

async def generate_text():
    response = await engine.async_generate(
        prompt="Tell me a story",
        sampling_params={"temperature": 0.8, "max_new_tokens": 256}
    )
    print(response["text"])

asyncio.run(generate_text())
Parameters are identical to generate(). Returns Union[Dict, AsyncIterator[Dict]] - a response dictionary, or an async iterator of response dictionaries when stream=True.
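With stream=True the result is consumed with async for. The sketch below uses a stub async generator in place of the engine so it is self-contained; it assumes each chunk is a dict with a "text" field, as in the response format above. Check whether your SGLang version streams incremental pieces or cumulative text, and whether the call itself must be awaited first.

```python
import asyncio

async def fake_async_generate(prompt, sampling_params=None, stream=True):
    # Stub standing in for engine.async_generate(..., stream=True):
    # yields chunk dicts with a "text" field, like the real iterator.
    # Here the stub yields incremental pieces.
    for piece in ["Once ", "upon ", "a time"]:
        yield {"text": piece}

async def main():
    chunks = []
    async for chunk in fake_async_generate(prompt="Tell me a story"):
        chunks.append(chunk["text"])
    return "".join(chunks)

story = asyncio.run(main())
print(story)  # Once upon a time
```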

encode

Generate embeddings for input text.
embeddings = engine.encode(
    prompt="Hello, world!"
)
print(embeddings["embedding"])
  • prompt (Union[str, List[str], List[Dict], List[List[Dict]]], required): Text or messages to encode.
  • image_data (Optional[MultimodalDataInputFormat]): Image data for multimodal embedding models.
  • dimensions (Optional[int]): Output embedding dimensions (if the model supports dimensionality reduction).
Returns: Dict - Embeddings dictionary
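Embeddings are typically compared with cosine similarity. A self-contained sketch on made-up vectors; in practice the vectors would come from embeddings["embedding"].

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors:
    # dot(a, b) / (||a|| * ||b||), in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors; v2 is a scalar multiple of v1, so similarity is ~1.0.
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.6, 1.0]
sim = cosine_similarity(v1, v2)
print(sim)
```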

async_encode

Asynchronous version of encode().
import asyncio

async def get_embeddings():
    embeddings = await engine.async_encode(
        prompt="Hello, world!"
    )
    print(embeddings["embedding"])

asyncio.run(get_embeddings())

score

Score the probability of label tokens appearing after (query + item) pairs.
result = engine.score(
    query="Is the following city the capital of France?",
    items=["Paris", "London", "Berlin"],
    label_token_ids=[2332, 1223],  # Token IDs for "Yes" and "No"
    apply_softmax=True
)
print(result.scores)  # [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
  • query (Optional[Union[str, List[int]]], required): Query text or pre-tokenized token IDs.
  • items (Optional[Union[str, List[str], List[List[int]]]], required): Item text(s) or pre-tokenized token IDs to score.
  • label_token_ids (Optional[List[int]]): List of token IDs to compute probabilities for.
  • apply_softmax (bool, default False): Whether to normalize probabilities using softmax.
  • item_first (bool, default False): If True, prepend items to the query; otherwise append items to the query.
Returns: ScoreResult with scores and prompt_tokens fields.
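The effect of apply_softmax=True can be sketched in pure Python: for each (query + item) pair, the label-token scores are normalized so each row sums to 1. The log-probabilities below are made-up sample values, not real model output.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up log-probabilities of the label tokens ("Yes", "No") for one item.
label_logprobs = [-0.1, -2.3]
probs = softmax(label_logprobs)
print(probs)
```

After normalization the two label probabilities sum to 1, which is what makes rows like [0.9, 0.1] in the example output directly comparable across items.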

Session Management

open_session

Open a session for multi-turn conversation with shared context.
session_id = engine.open_session(
    capacity_of_str_len=10000,
    streaming=True,
    timeout=300.0
)
  • capacity_of_str_len (int, required): Maximum string length capacity for the session.
  • session_id (Optional[str]): Optional session ID. A UUID is generated if not provided.
  • streaming (bool, default False): Use a low-overhead path for realtime streaming (append-only mode).
  • timeout (Optional[float]): Auto-close the session after this many seconds of inactivity.
Returns: str - The session ID

close_session

Close a session and release its resources.
engine.close_session(session_id="my-session")
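To guarantee close_session runs even when the session body raises, the open/close pair can be wrapped in a context manager. The engine_session helper below is hypothetical (not part of SGLang), and a minimal stub engine stands in for a real Engine so the sketch is self-contained.

```python
from contextlib import contextmanager

@contextmanager
def engine_session(engine, **kwargs):
    # Hypothetical helper (not part of SGLang): pairs open_session with a
    # guaranteed close_session, even if the body raises.
    session_id = engine.open_session(**kwargs)
    try:
        yield session_id
    finally:
        engine.close_session(session_id=session_id)

# Minimal stub so the sketch runs standalone; a real Engine exposes the
# same open_session/close_session methods documented above.
class StubEngine:
    def __init__(self):
        self.closed = []
    def open_session(self, **kwargs):
        return "session-1"
    def close_session(self, session_id):
        self.closed.append(session_id)

stub = StubEngine()
with engine_session(stub, capacity_of_str_len=10000) as sid:
    pass  # multi-turn generate calls would go here
print(stub.closed)  # ["session-1"]
```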

Weight Management

update_weights_from_disk

Update model weights from disk without restarting the engine.
engine.update_weights_from_disk(
    model_path="/path/to/new/weights",
    load_format="safetensors"
)

load_lora_adapter

Load a LoRA adapter without restarting the engine.
engine.load_lora_adapter(
    lora_name="my_adapter",
    lora_path="/path/to/lora",
    pinned=True
)

unload_lora_adapter

Unload a LoRA adapter.
engine.unload_lora_adapter(lora_name="my_adapter")

Profiling and Monitoring

get_server_info

Get server configuration and runtime information.
info = engine.get_server_info()
print(info["model_path"])
print(info["internal_states"])

start_profile / stop_profile

Start and stop performance profiling.
engine.start_profile()
# Run your workload
engine.stop_profile()

flush_cache

Flush the KV cache.
engine.flush_cache()

Shutdown

shutdown

Shutdown the engine and all subprocesses.
engine.shutdown()
You can also use the engine as a context manager:
with Engine(model_path="meta-llama/Llama-3.1-8B-Instruct") as engine:
    response = engine.generate(prompt="Hello")
    print(response["text"])
# Engine is automatically shut down

Usage Examples

Basic Text Generation

from sglang import Engine

engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

response = engine.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0, "max_new_tokens": 32}
)

print(response["text"])
engine.shutdown()

Streaming Generation

response = engine.generate(
    prompt="Write a short story",
    sampling_params={"temperature": 0.8, "max_new_tokens": 256},
    stream=True
)

for chunk in response:
    print(chunk["text"], end="", flush=True)

Batch Generation

prompts = [
    "What is AI?",
    "Explain quantum computing",
    "What is machine learning?"
]

responses = engine.generate(
    prompt=prompts,
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)

for response in responses:
    print(response["text"])

Multimodal Generation

response = engine.generate(
    prompt="Describe this image",
    image_data="https://example.com/image.jpg",
    sampling_params={"max_new_tokens": 128}
)

print(response["text"])

See Also