Runtime
The Runtime class is a wrapper for launching the SGLang HTTP server programmatically from Python. It’s primarily used with the SGLang frontend language.
For offline processing without the frontend language, use the Engine class instead.
RuntimeEndpoint
The RuntimeEndpoint class provides a client interface to communicate with a running SGLang server.
Initialization
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
endpoint = RuntimeEndpoint(
    base_url="http://localhost:30000",
    api_key="your-api-key"
)
base_url
str
required
Base URL of the SGLang server.
api_key
Optional[str]
default:"None"
API key for authentication.
verify
Optional[str]
default:"None"
SSL certificate verification path.
chat_template_name
Optional[str]
default:"None"
Name of the chat template to use. Auto-detected from model if not specified.
Methods
get_model_name
Get the model path/name from the server.
model_name = endpoint.get_model_name()
print(model_name) # "meta-llama/Llama-3.1-8B-Instruct"
get_server_info
Get server configuration and status information.
info = endpoint.get_server_info()
print(info["model_path"])
print(info["tp_size"])
flush_cache
Flush the KV cache on the server.
cache_prefix
Pre-cache a prefix string in the KV cache.
endpoint.cache_prefix("System: You are a helpful assistant.")
start_profile / stop_profile
Start and stop server profiling.
endpoint.start_profile()
# Run your workload
endpoint.stop_profile()
Runtime
The Runtime class launches an HTTP server in a separate process and provides an endpoint to interact with it.
Initialization
from sglang.lang.backend.runtime_endpoint import Runtime
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    log_level="info"
)
model_path
str
required
Path to the model on Hugging Face or the local filesystem. See ServerArgs for more details.
log_level
str
Log level for the server. Options: "debug", "info", "warning", "error".
Timeout in seconds for waiting for the server to start.
Additional keyword arguments are passed to ServerArgs.
Attributes
generate_url
str
Full URL for the generate endpoint.
endpoint
RuntimeEndpoint
RuntimeEndpoint instance for interacting with the server.
Methods
generate
Synchronous text generation.
response_json = runtime.generate(
    prompt="What is machine learning?",
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)
import json
response = json.loads(response_json)
print(response["text"])
prompt
Union[str, List[str]]
required
Input prompt(s).
sampling_params
Optional[Dict]
default:"None"
Sampling parameters dictionary.
return_logprob
Optional[Union[List[bool], bool]]
default:"False"
Whether to return log probabilities.
lora_path
Optional[List[Optional[str]]]
default:"None"
LoRA adapter path(s) for each request.
Returns: str - JSON string containing the response
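Because the returned JSON string is an object for a single prompt and a list of objects for a batch, it can help to normalize both shapes before use. The helper below is hypothetical (not part of the SGLang API), and the payloads are illustrative, minimal versions of a real response:

```python
import json

def extract_texts(response_json: str) -> list[str]:
    """Collect the generated text fields from a generate() response.

    The server returns one JSON object for a single prompt and a JSON
    list of objects for a batch; normalize both shapes to a list.
    """
    parsed = json.loads(response_json)
    if isinstance(parsed, dict):
        parsed = [parsed]
    return [item["text"] for item in parsed]

# Illustrative payloads (a real response carries additional fields).
single = '{"text": "Paris is the capital of France."}'
batch = '[{"text": "AI is ..."}, {"text": "ML is ..."}]'
print(extract_texts(single))
print(extract_texts(batch))
```

This keeps downstream code identical whether you pass one prompt or many.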
async_generate
Asynchronous streaming text generation.
import asyncio
async def stream_text():
    async for chunk in runtime.async_generate(
        prompt="Tell me a story",
        sampling_params={"temperature": 0.8, "max_new_tokens": 256}
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_text())
Alias: add_request is an alias for async_generate.
encode
Generate embeddings.
embeddings_json = runtime.encode(
    prompt="Hello, world!"
)
import json
embeddings = json.loads(embeddings_json)
print(embeddings["embedding"])
prompt
Union[str, List[str], List[Dict], List[List[Dict]]]
required
Text to encode.
Returns: str - JSON string containing embeddings
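Once parsed, the embedding vector can be used for standard similarity computations. As a sketch, the snippet below compares two illustrative payloads in the shape shown above with plain-stdlib cosine similarity; real embeddings come from runtime.encode() and are model-sized vectors:

```python
import json
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative payloads (real embeddings are much longer).
emb_a = json.loads('{"embedding": [0.1, 0.2, 0.3]}')["embedding"]
emb_b = json.loads('{"embedding": [0.1, 0.2, 0.25]}')["embedding"]
print(round(cosine_similarity(emb_a, emb_b), 4))
```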
get_server_info
Get server information asynchronously.
import asyncio
async def get_info():
    info = await runtime.get_server_info()
    print(info["model_path"])

asyncio.run(get_info())
Returns: Dict - Server information dictionary
get_tokenizer
Get the tokenizer used by the server.
tokenizer = runtime.get_tokenizer()
tokens = tokenizer.encode("Hello, world!")
Returns: Hugging Face tokenizer instance
start_profile / stop_profile
Start and stop server profiling.
runtime.start_profile()
# Run your workload
runtime.stop_profile()
cache_prefix
Pre-cache a prefix string.
runtime.cache_prefix("System: You are a helpful assistant.")
shutdown
Shutdown the server and clean up resources.
The shutdown method is automatically called when the Runtime object is deleted or when the Python program terminates.
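Runtime is not documented to support the context-manager protocol, so a small wrapper (hypothetical, written with contextlib) can guarantee shutdown() runs even if the workload raises. It is demonstrated here with a stand-in object in place of a real Runtime:

```python
from contextlib import contextmanager

@contextmanager
def managed(server):
    """Yield any object with a shutdown() method, shutting it down on exit."""
    try:
        yield server
    finally:
        server.shutdown()

# Stand-in for a Runtime, so the pattern can be shown without a GPU.
class FakeRuntime:
    def __init__(self):
        self.closed = False
    def shutdown(self):
        self.closed = True

rt = FakeRuntime()
with managed(rt):
    pass  # run your workload here
print(rt.closed)  # True: shutdown() ran when the with-block exited
```

In practice you would pass a real Runtime instance to managed() and call generate() inside the with-block.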
Usage Examples
Basic Server Launch
from sglang.lang.backend.runtime_endpoint import Runtime
import json
# Launch the server
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    log_level="info"
)
# Generate text
response_json = runtime.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0, "max_new_tokens": 32}
)
response = json.loads(response_json)
print(response["text"])
# Shutdown
runtime.shutdown()
Streaming with Async
import asyncio
from sglang.lang.backend.runtime_endpoint import Runtime
async def main():
    runtime = Runtime(
        model_path="meta-llama/Llama-3.1-8B-Instruct",
        tp_size=1
    )
    # Stream the response
    async for chunk in runtime.async_generate(
        prompt="Write a poem about AI",
        sampling_params={"temperature": 0.8, "max_new_tokens": 200}
    ):
        print(chunk, end="", flush=True)
    runtime.shutdown()

asyncio.run(main())
Using RuntimeEndpoint with Existing Server
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
# Connect to an existing server
endpoint = RuntimeEndpoint(
    base_url="http://localhost:30000",
    api_key="my-secret-key"
)
# Get model info
model_name = endpoint.get_model_name()
print(f"Connected to: {model_name}")
# Get server info
info = endpoint.get_server_info()
print(f"TP Size: {info['tp_size']}")
print(f"Max Total Tokens: {info['max_total_tokens']}")
# Cache a common prefix
endpoint.cache_prefix("You are a helpful AI assistant.")
Batch Requests
import json
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)
prompts = [
    "What is AI?",
    "Explain machine learning",
    "What is deep learning?"
]
response_json = runtime.generate(
    prompt=prompts,
    sampling_params={"temperature": 0.7, "max_new_tokens": 100}
)
responses = json.loads(response_json)
for i, response in enumerate(responses):
    print(f"Response {i+1}: {response['text']}")
runtime.shutdown()
Using with Frontend Language (SGLang)
import sglang as sgl
from sglang.lang.backend.runtime_endpoint import Runtime
# Launch runtime
runtime = Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1
)
# Set backend
sgl.set_default_backend(runtime.endpoint)
# Use SGLang frontend
@sgl.function
def chatbot(s, user_message):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(user_message)
    s += sgl.assistant(sgl.gen("response", max_tokens=100))
state = chatbot.run(user_message="What is machine learning?")
print(state["response"])
runtime.shutdown()
Differences from Engine
| Feature | Runtime | Engine |
|---|---|---|
| Use case | Frontend language (SGLang) | Direct Python API |
| Server | Launches HTTP server | In-process |
| API | HTTP-based | Direct function calls |
| Overhead | Higher (HTTP serialization) | Lower (in-process) |
| Multi-client | Yes (via HTTP) | No (single process) |
Use Engine for offline batch processing and Runtime when you need:
- An HTTP server for multiple clients
- Integration with the SGLang frontend language
- Remote access to the model
See Also