
Runtime

The Runtime class is a wrapper for launching the SGLang HTTP server programmatically from Python. It’s primarily used with the SGLang frontend language.
For offline processing without the frontend language, use the Engine class instead.

RuntimeEndpoint

The RuntimeEndpoint class provides a client interface to communicate with a running SGLang server.

Initialization

from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint

endpoint = RuntimeEndpoint(
    base_url="http://localhost:30000",
    api_key="your-api-key"
)
- base_url (str, required): Base URL of the SGLang server.
- api_key (Optional[str], default None): API key for authentication.
- verify (Optional[str], default None): SSL certificate verification path.
- chat_template_name (Optional[str], default None): Name of the chat template to use. Auto-detected from the model if not specified.

Methods

get_model_name

Get the model path/name from the server.
model_name = endpoint.get_model_name()
print(model_name)  # "meta-llama/Llama-3.1-8B-Instruct"

get_server_info

Get server configuration and status information.
info = endpoint.get_server_info()
print(info["model_path"])
print(info["tp_size"])

flush_cache

Flush the KV cache on the server.
endpoint.flush_cache()

cache_prefix

Pre-cache a prefix string in the KV cache.
endpoint.cache_prefix("System: You are a helpful assistant.")

start_profile / stop_profile

Start and stop server profiling.
endpoint.start_profile()
# Run your workload
endpoint.stop_profile()

Runtime

The Runtime class launches an HTTP server in a separate process and provides an endpoint to interact with it.

Initialization

from sglang.lang.backend.runtime_endpoint import Runtime

runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    log_level="info"
)
- model_path (str, required): Path to the model on Hugging Face or the local filesystem. See ServerArgs for more details.
- log_level (str, default "error"): Log level for the server. Options: "debug", "info", "warning", "error".
- launch_timeout (float, default 300.0): Timeout in seconds to wait for the server to start.
- **kwargs: Additional keyword arguments are passed to ServerArgs.

Attributes

- url (str): Base URL of the launched server (e.g., "http://127.0.0.1:30000").
- generate_url (str): Full URL for the generate endpoint.
- endpoint (RuntimeEndpoint): RuntimeEndpoint instance for interacting with the server.

Methods

generate

Synchronous text generation.
response_json = runtime.generate(
    prompt="What is machine learning?",
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)

import json
response = json.loads(response_json)
print(response["text"])
- prompt (Union[str, List[str]], required): Input prompt(s).
- sampling_params (Optional[Dict]): Sampling parameters dictionary.
- return_logprob (Optional[Union[List[bool], bool]], default False): Whether to return log probabilities.
- lora_path (Optional[List[Optional[str]]]): LoRA adapter path(s) for each request.

Returns: str - JSON string containing the response

async_generate

Asynchronous streaming text generation.
import asyncio

async def stream_text():
    async for chunk in runtime.async_generate(
        prompt="Tell me a story",
        sampling_params={"temperature": 0.8, "max_new_tokens": 256}
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_text())
Alias: add_request can be used in place of async_generate.

encode

Generate embeddings.
embeddings_json = runtime.encode(
    prompt="Hello, world!"
)

import json
embeddings = json.loads(embeddings_json)
print(embeddings["embedding"])
- prompt (Union[str, List[str], List[Dict], List[List[Dict]]], required): Text to encode.

Returns: str - JSON string containing embeddings

get_server_info

Get server information asynchronously.
import asyncio

async def get_info():
    info = await runtime.get_server_info()
    print(info["model_path"])

asyncio.run(get_info())
Returns: Dict - Server information dictionary

get_tokenizer

Get the tokenizer used by the server.
tokenizer = runtime.get_tokenizer()
tokens = tokenizer.encode("Hello, world!")
Returns: Hugging Face tokenizer instance

start_profile / stop_profile

Start and stop server profiling.
runtime.start_profile()
# Run your workload
runtime.stop_profile()

cache_prefix

Pre-cache a prefix string.
runtime.cache_prefix("System: You are a helpful assistant.")

shutdown

Shutdown the server and clean up resources.
runtime.shutdown()
The shutdown method is automatically called when the Runtime object is deleted or when the Python program terminates.

Usage Examples

Basic Server Launch

from sglang.lang.backend.runtime_endpoint import Runtime
import json

# Launch the server
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    log_level="info"
)

# Generate text
response_json = runtime.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0, "max_new_tokens": 32}
)

response = json.loads(response_json)
print(response["text"])

# Shutdown
runtime.shutdown()

Streaming with Async

import asyncio
from sglang.lang.backend.runtime_endpoint import Runtime

async def main():
    runtime = Runtime(
        model_path="meta-llama/Llama-3.1-8B-Instruct",
        tp_size=1
    )
    
    # Stream the response
    async for chunk in runtime.async_generate(
        prompt="Write a poem about AI",
        sampling_params={"temperature": 0.8, "max_new_tokens": 200}
    ):
        print(chunk, end="", flush=True)
    
    runtime.shutdown()

asyncio.run(main())

Using RuntimeEndpoint with Existing Server

from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint

# Connect to an existing server
endpoint = RuntimeEndpoint(
    base_url="http://localhost:30000",
    api_key="my-secret-key"
)

# Get model info
model_name = endpoint.get_model_name()
print(f"Connected to: {model_name}")

# Get server info
info = endpoint.get_server_info()
print(f"TP Size: {info['tp_size']}")
print(f"Max Total Tokens: {info['max_total_tokens']}")

# Cache a common prefix
endpoint.cache_prefix("You are a helpful AI assistant.")

Batch Requests

from sglang.lang.backend.runtime_endpoint import Runtime
import json

runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)

prompts = [
    "What is AI?",
    "Explain machine learning",
    "What is deep learning?"
]

response_json = runtime.generate(
    prompt=prompts,
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)

responses = json.loads(response_json)
for i, response in enumerate(responses):
    print(f"Response {i+1}: {response['text']}")

runtime.shutdown()

Using with Frontend Language (SGLang)

import sglang as sgl
from sglang.lang.backend.runtime_endpoint import Runtime

# Launch runtime
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)

# Set backend
sgl.set_default_backend(runtime.endpoint)

# Use SGLang frontend
@sgl.function
def chatbot(s, user_message):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(user_message)
    s += sgl.assistant(sgl.gen("response", max_tokens=100))

state = chatbot.run(user_message="What is machine learning?")
print(state["response"])

runtime.shutdown()

Differences from Engine

| Feature      | Runtime                    | Engine               |
|--------------|----------------------------|----------------------|
| Use case     | Frontend language (SGLang) | Direct Python API    |
| Server       | Launches HTTP server       | In-process           |
| API          | HTTP-based                 | Direct function calls|
| Overhead     | Higher (HTTP serialization)| Lower (in-process)   |
| Multi-client | Yes (via HTTP)             | No (single process)  |
Use Engine for offline batch processing and Runtime when you need:
  • An HTTP server for multiple clients
  • Integration with the SGLang frontend language
  • Remote access to the model

See Also