
Engine

The Engine class is the main entry point to the SGLang inference engine. It provides a Python API for text generation and embedding tasks.

Architecture

The engine consists of three components:
  1. TokenizerManager: Tokenizes requests and sends them to the scheduler
  2. Scheduler (subprocess): Receives requests, schedules batches, forwards them, and sends output tokens to the detokenizer
  3. DetokenizerManager (subprocess): Detokenizes output tokens and sends results back to the tokenizer manager
  • The HTTP server, Engine, and TokenizerManager all run in the main process
  • Inter-process communication between these components uses the ZMQ library
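The request flow above can be mimicked with stdlib primitives. This is an illustrative analogy only: plain threads and in-process queues stand in for SGLang's actual subprocesses and ZMQ sockets.

```python
import queue
import threading

# Stand-ins for the channels between the three pipeline stages. In real
# SGLang these are ZMQ sockets crossing process boundaries.
to_scheduler = queue.Queue()
to_detokenizer = queue.Queue()
results = queue.Queue()

def tokenizer_manager(text):
    # Tokenize the request (fake tokenizer: one "token" per word) and
    # forward it to the scheduler.
    to_scheduler.put(text.split())

def scheduler():
    # Receive a request, "run" the model, forward output tokens.
    tokens = to_scheduler.get()
    to_detokenizer.put(tokens + ["<eos>"])

def detokenizer_manager():
    # Detokenize output tokens and send the result back.
    tokens = to_detokenizer.get()
    results.put(" ".join(t for t in tokens if t != "<eos>"))

threads = [threading.Thread(target=f) for f in (scheduler, detokenizer_manager)]
for t in threads:
    t.start()
tokenizer_manager("hello world")
for t in threads:
    t.join()
out = results.get()
print(out)  # hello world
```

The real pipeline differs in that the Scheduler and DetokenizerManager live in separate OS processes, so they survive GIL contention and can batch many concurrent requests.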

Initialization

from sglang import Engine

engine = Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)

Constructor Parameters

  • model_path (str, required): Path to the model on Hugging Face or the local filesystem.
  • tokenizer_path (Optional[str], default None): Path to the tokenizer. Defaults to model_path if not specified.
  • tp_size (int, default 1): Tensor parallelism size; the number of GPUs to use for model parallelism.
  • trust_remote_code (bool, default False): Whether to trust remote code when loading the model.
  • context_length (Optional[int], default None): Maximum context length. Auto-detected from the model config if not specified.
  • mem_fraction_static (Optional[float], default None): Fraction of GPU memory to use for static allocation (model weights + KV cache).
  • log_level (str, default "error"): Logging level. Options: "debug", "info", "warning", "error".
For the complete list of parameters, see ServerArgs.

Methods

generate

Generate text completions synchronously.
response = engine.generate(
    prompt="Once upon a time",
    sampling_params={"temperature": 0.8, "max_new_tokens": 128}
)
print(response["text"])
  • prompt (Optional[Union[List[str], str]]): Input prompt(s). Can be a single string or a list of strings for batching.
  • sampling_params (Optional[Union[List[Dict], Dict]]): Sampling parameters. See SamplingParams for details.
  • input_ids (Optional[Union[List[List[int]], List[int]]]): Pre-tokenized input. Use either prompt or input_ids, not both.
  • image_data (Optional[MultimodalDataInputFormat]): Image input(s) for multimodal models. Can be:
    • a single image (file path, URL, or base64 string)
    • a list of images (one per request)
    • a list of lists of images (multiple images per request)
  • return_logprob (Optional[Union[List[bool], bool]], default False): Whether to return log probabilities.
  • stream (bool, default False): Whether to stream the response token by token.
  • routed_dp_rank (Optional[int], default None): Data parallel rank to route the request to when using data parallelism.
Returns: Union[Dict, Iterator[Dict]] - a response dictionary, or an iterator of response dictionaries when stream=True.
Response format:
{
    "text": "generated text",
    "meta_info": {
        "prompt_tokens": 10,
        "completion_tokens": 20,
        "finish_reason": "stop"
    }
}
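Given a response shaped like the example above, the fields can be consumed directly. The dictionary below is sample data mirroring that format, not real engine output.

```python
# Sample response mirroring the documented format (not produced by a real engine).
response = {
    "text": "generated text",
    "meta_info": {
        "prompt_tokens": 10,
        "completion_tokens": 20,
        "finish_reason": "stop",
    },
}

# Pull out the generated text and token accounting.
meta = response["meta_info"]
total_tokens = meta["prompt_tokens"] + meta["completion_tokens"]
print(response["text"])
print(f"used {total_tokens} tokens, finished via {meta['finish_reason']}")
```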

async_generate

Generate text completions asynchronously.
import asyncio

async def generate_text():
    response = await engine.async_generate(
        prompt="Tell me a story",
        sampling_params={"temperature": 0.8, "max_new_tokens": 256}
    )
    print(response["text"])

asyncio.run(generate_text())
Parameters are identical to generate(). Returns Union[Dict, AsyncIterator[Dict]] - a response dictionary, or an async iterator of response dictionaries when stream=True.
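With stream=True the result is consumed with async for. The sketch below uses a stub async generator in place of the engine so it is self-contained; it assumes each chunk is a dict with a "text" field, as in the response format above. Check whether your SGLang version streams incremental pieces or cumulative text, and whether the call itself must be awaited first.

```python
import asyncio

async def fake_async_generate(prompt, sampling_params=None, stream=True):
    # Stub standing in for engine.async_generate(..., stream=True):
    # yields chunk dicts with a "text" field, like the real iterator.
    # Here the stub yields incremental pieces.
    for piece in ["Once ", "upon ", "a time"]:
        yield {"text": piece}

async def main():
    chunks = []
    async for chunk in fake_async_generate(prompt="Tell me a story"):
        chunks.append(chunk["text"])
    return "".join(chunks)

story = asyncio.run(main())
print(story)  # Once upon a time
```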

encode

Generate embeddings for input text.
embeddings = engine.encode(
    prompt="Hello, world!"
)
print(embeddings["embedding"])
  • prompt (Union[str, List[str], List[Dict], List[List[Dict]]], required): Text or messages to encode.
  • image_data (Optional[MultimodalDataInputFormat]): Image data for multimodal embedding models.
  • dimensions (Optional[int]): Output embedding dimensions (if the model supports dimensionality reduction).
Returns: Dict - Embeddings dictionary
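Embeddings are typically compared with cosine similarity. A self-contained sketch on made-up vectors; in practice the vectors would come from embeddings["embedding"].

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors:
    # dot(a, b) / (||a|| * ||b||), in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors; v2 is a scalar multiple of v1, so similarity is ~1.0.
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.6, 1.0]
sim = cosine_similarity(v1, v2)
print(sim)
```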

async_encode

Asynchronous version of encode().
import asyncio

async def get_embeddings():
    embeddings = await engine.async_encode(
        prompt="Hello, world!"
    )
    print(embeddings["embedding"])

asyncio.run(get_embeddings())

score

Score the probability of label tokens appearing after (query + item) pairs.
result = engine.score(
    query="Is the following city the capital of France?",
    items=["Paris", "London", "Berlin"],
    label_token_ids=[2332, 1223],  # Token IDs for "Yes" and "No"
    apply_softmax=True
)
print(result.scores)  # [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
  • query (Optional[Union[str, List[int]]], required): Query text or pre-tokenized token IDs.
  • items (Optional[Union[str, List[str], List[List[int]]]], required): Item text(s) or pre-tokenized token IDs to score.
  • label_token_ids (Optional[List[int]]): List of token IDs to compute probabilities for.
  • apply_softmax (bool, default False): Whether to normalize probabilities using softmax.
  • item_first (bool, default False): If True, prepend items to the query; otherwise append items to the query.
Returns: ScoreResult with scores and prompt_tokens fields.
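The effect of apply_softmax=True can be sketched in pure Python: for each (query + item) pair, the label-token scores are normalized so each row sums to 1. The log-probabilities below are made-up sample values, not real model output.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up log-probabilities of the label tokens ("Yes", "No") for one item.
label_logprobs = [-0.1, -2.3]
probs = softmax(label_logprobs)
print(probs)
```

After normalization the two label probabilities sum to 1, which is what makes rows like [0.9, 0.1] in the example output directly comparable across items.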

Session Management

open_session

Open a session for multi-turn conversation with shared context.
session_id = engine.open_session(
    capacity_of_str_len=10000,
    streaming=True,
    timeout=300.0
)
  • capacity_of_str_len (int, required): Maximum string length capacity for the session.
  • session_id (Optional[str]): Optional session ID. A UUID is generated if not provided.
  • streaming (bool, default False): Use a low-overhead path for realtime streaming (append-only mode).
  • timeout (Optional[float]): Auto-close the session after this many seconds of inactivity.
Returns: str - The session ID

close_session

Close a session and release its resources.
engine.close_session(session_id="my-session")
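To guarantee close_session runs even when the session body raises, the open/close pair can be wrapped in a context manager. The engine_session helper below is hypothetical (not part of SGLang), and a minimal stub engine stands in for a real Engine so the sketch is self-contained.

```python
from contextlib import contextmanager

@contextmanager
def engine_session(engine, **kwargs):
    # Hypothetical helper (not part of SGLang): pairs open_session with a
    # guaranteed close_session, even if the body raises.
    session_id = engine.open_session(**kwargs)
    try:
        yield session_id
    finally:
        engine.close_session(session_id=session_id)

# Minimal stub so the sketch runs standalone; a real Engine exposes the
# same open_session/close_session methods documented above.
class StubEngine:
    def __init__(self):
        self.closed = []
    def open_session(self, **kwargs):
        return "session-1"
    def close_session(self, session_id):
        self.closed.append(session_id)

stub = StubEngine()
with engine_session(stub, capacity_of_str_len=10000) as sid:
    pass  # multi-turn generate calls would go here
print(stub.closed)  # ["session-1"]
```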

Weight Management

update_weights_from_disk

Update model weights from disk without restarting the engine.
engine.update_weights_from_disk(
    model_path="/path/to/new/weights",
    load_format="safetensors"
)

load_lora_adapter

Load a LoRA adapter without restarting the engine.
engine.load_lora_adapter(
    lora_name="my_adapter",
    lora_path="/path/to/lora",
    pinned=True
)

unload_lora_adapter

Unload a LoRA adapter.
engine.unload_lora_adapter(lora_name="my_adapter")

Profiling and Monitoring

get_server_info

Get server configuration and runtime information.
info = engine.get_server_info()
print(info["model_path"])
print(info["internal_states"])

start_profile / stop_profile

Start and stop performance profiling.
engine.start_profile()
# Run your workload
engine.stop_profile()

flush_cache

Flush the KV cache.
engine.flush_cache()

Shutdown

shutdown

Shutdown the engine and all subprocesses.
engine.shutdown()
You can also use the engine as a context manager:
with Engine(model_path="meta-llama/Llama-3.1-8B-Instruct") as engine:
    response = engine.generate(prompt="Hello")
    print(response["text"])
# Engine is automatically shut down

Usage Examples

Basic Text Generation

from sglang import Engine

engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

response = engine.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0, "max_new_tokens": 32}
)

print(response["text"])
engine.shutdown()

Streaming Generation

response = engine.generate(
    prompt="Write a short story",
    sampling_params={"temperature": 0.8, "max_new_tokens": 256},
    stream=True
)

for chunk in response:
    print(chunk["text"], end="", flush=True)

Batch Generation

prompts = [
    "What is AI?",
    "Explain quantum computing",
    "What is machine learning?"
]

responses = engine.generate(
    prompt=prompts,
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)

for response in responses:
    print(response["text"])

Multimodal Generation

response = engine.generate(
    prompt="Describe this image",
    image_data="https://example.com/image.jpg",
    sampling_params={"max_new_tokens": 128}
)

print(response["text"])

See Also