Setting the Default Backend
Before executing SGLang functions, you must set a default backend:

Local Runtime
sgl.Runtime - Local Model Server
Run models locally using SGLang’s high-performance runtime:
- `model_path` (str): HuggingFace model ID or local path to the model
- `tokenizer_path` (str): Path to the tokenizer (defaults to `model_path`)
- `port` (int): Port for the HTTP server (auto-allocated if not specified)
- `host` (str): Host address (default: `"127.0.0.1"`)
- `tp_size` (int): Tensor parallelism size for multi-GPU setups
- `log_level` (str): Logging level (`"error"`, `"warning"`, `"info"`, `"debug"`)
- `launch_timeout` (float): Timeout for server startup (default: 300 s)
- Additional parameters from `ServerArgs` (see the server documentation)
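For instance, a local runtime can be launched and registered as the default backend like this (the model path is an illustrative assumption):

```python
import sglang as sgl

# Launch a local model server; the model path is an example HuggingFace ID.
runtime = sgl.Runtime(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,            # single GPU
    log_level="error",
)
sgl.set_default_backend(runtime)

# ... run SGLang programs ...

runtime.shutdown()  # stop the server and release GPU memory
```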
sgl.RuntimeEndpoint - Connect to Running Server
Connect to an already-running SGLang server:
- `base_url` (str): URL of the running SGLang server
- `api_key` (Optional[str]): API key for authentication
- `verify` (Optional[str]): SSL verification (path to a certificate bundle, or `False` to disable)
- `chat_template_name` (Optional[str]): Override the chat template
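A minimal sketch, assuming a server is already listening on port 30000 (the URL is an example):

```python
import sglang as sgl

# Attach to an already-running SGLang server instead of launching one.
backend = sgl.RuntimeEndpoint("http://127.0.0.1:30000")
sgl.set_default_backend(backend)
```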
Starting a Server Separately
You can also start the server from the command line and then connect to it with `RuntimeEndpoint`:
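A typical invocation (the model path and port are examples):

```shell
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

Once the server is up, attach to it with `sgl.RuntimeEndpoint("http://127.0.0.1:30000")`.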
OpenAI
sgl.OpenAI - OpenAI API
Use OpenAI models:
- `model_name` (str): OpenAI model name
- `is_chat_model` (Optional[bool]): Whether this is a chat model (auto-detected)
- `chat_template` (Optional[ChatTemplate]): Custom chat template
- `api_key` (str): API key (defaults to the `OPENAI_API_KEY` environment variable)
- `base_url` (str): Custom base URL for the API
- Other parameters are passed to `openai.OpenAI()`
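A minimal sketch (the model name is an example):

```python
import sglang as sgl

# OPENAI_API_KEY is read from the environment by default.
sgl.set_default_backend(sgl.OpenAI("gpt-4o-mini"))
```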
Azure OpenAI
Azure Configuration
Use Azure OpenAI Service:

Anthropic
sgl.Anthropic - Claude Models
Use Anthropic’s Claude models:
- `model_name` (str): Claude model name
- `api_key` (str): API key (defaults to the `ANTHROPIC_API_KEY` environment variable)
- Other parameters are passed to `anthropic.Anthropic()`
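A minimal sketch (the model name is an example):

```python
import sglang as sgl

# ANTHROPIC_API_KEY is read from the environment by default.
sgl.set_default_backend(sgl.Anthropic("claude-3-5-sonnet-20240620"))
```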
Other Cloud Providers
Google Vertex AI
Use Google's Gemini models via Vertex AI:

LiteLLM (Multiple Providers)

Use LiteLLM to access multiple providers with a unified interface:

Backend Utilities
Getting Server Information
Flushing Cache
Clear the KV cache on the server:

Profiling

For Runtime backends, enable profiling:

Complete Examples
Multi-Backend Function
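One way such an example might look — the same `sgl.function` executed against two different backends through the `backend` argument of `run` (the model names are assumptions):

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Run the identical program on two cloud backends.
for backend in [sgl.OpenAI("gpt-4o-mini"),
                sgl.Anthropic("claude-3-5-sonnet-20240620")]:
    state = qa.run(question="What is SGLang?", backend=backend)
    print(state["answer"])
```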
Local Runtime with Multimodal Model
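A hedged sketch of this example — the vision-language checkpoint name and image path are assumptions:

```python
import sglang as sgl

@sgl.function
def describe(s, image_path):
    # sgl.image embeds an image into the prompt for multimodal models.
    s += sgl.user(sgl.image(image_path) + "Describe this image.")
    s += sgl.assistant(sgl.gen("description", max_tokens=128))

runtime = sgl.Runtime(model_path="liuhaotian/llava-v1.6-vicuna-7b")  # example VLM
sgl.set_default_backend(runtime)

state = describe.run(image_path="example.png")
print(state["description"])
runtime.shutdown()
```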
Batch Processing with Local Runtime
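A sketch of batch processing with `run_batch`, which submits all inputs to the local server concurrently (the model path is an example):

```python
import sglang as sgl

@sgl.function
def summarize(s, text):
    s += sgl.user("Summarize in one sentence: " + text)
    s += sgl.assistant(sgl.gen("summary", max_tokens=48))

runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct")  # example model
sgl.set_default_backend(runtime)

# Each dict supplies the keyword arguments for one call.
states = summarize.run_batch([
    {"text": "SGLang is a fast serving framework for LLMs."},
    {"text": "Tensor parallelism splits a model across GPUs."},
])
for state in states:
    print(state["summary"])
runtime.shutdown()
```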
Together AI via LiteLLM
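A hedged sketch, assuming the `sgl.LiteLLM` backend and LiteLLM's `provider/model` naming convention (the exact model string and API-key variable are assumptions):

```python
import sglang as sgl

# LiteLLM routes the request to Together AI; the API key is read from
# the environment (TOGETHERAI_API_KEY in LiteLLM's convention).
backend = sgl.LiteLLM("together_ai/meta-llama/Llama-3-8b-chat-hf")
sgl.set_default_backend(backend)
```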
Backend Comparison
| Backend | Local/Cloud | Multimodal | Streaming | Batch | Best For |
|---|---|---|---|---|---|
| Runtime | Local | Yes | Yes | Yes | Production, local deployment |
| RuntimeEndpoint | Remote | Yes | Yes | Yes | Distributed systems |
| OpenAI | Cloud | Yes | Yes | Yes | Quick prototyping, GPT models |
| Anthropic | Cloud | No | Yes | Yes | Claude models |
| VertexAI | Cloud | Yes | Yes | Yes | Google Cloud integration |
| LiteLLM | Cloud | Varies | Yes | Yes | Multi-provider support |
Best Practices
- Development vs Production: Use `OpenAI` or `Anthropic` for prototyping, `Runtime` for production
- Resource Management: Always call `runtime.shutdown()` when done with local runtimes
- Error Handling: Wrap backend initialization in try-except blocks
- API Keys: Use environment variables instead of hardcoding keys
- Timeout Configuration: Set appropriate timeouts for your use case
- Model Selection: Choose models based on task requirements (speed vs. quality)
- Batch Processing: Use a local `Runtime` for high-throughput batch jobs
- Testing: Test with multiple backends to ensure compatibility
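The error-handling and resource-management points above can be combined into a single pattern (the model path is an example):

```python
import sglang as sgl

runtime = None
try:
    runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct")
    sgl.set_default_backend(runtime)
    # ... run SGLang programs ...
except Exception as e:
    print(f"Failed to initialize backend: {e}")
finally:
    # Shut down even on failure so GPU memory is released.
    if runtime is not None:
        runtime.shutdown()
```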
