Overview
SGLang’s native Python API provides direct access to the inference engine without going through HTTP. This is ideal for embedding SGLang into your application or for maximum performance.

Installation
Install SGLang (for example, `pip install "sglang[all]"`).

Engine Initialization
Basic Usage
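A minimal sketch of creating an engine and generating text (the model path is just an example; any local path or Hugging Face Hub ID works):

```python
import sglang as sgl

# Load a model from a local path or the Hugging Face Hub.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# generate() accepts a prompt or a list of prompts and returns result dicts
# whose "text" field holds the completion.
outputs = llm.generate(["The capital of France is"])
print(outputs[0]["text"])

llm.shutdown()
```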
With Configuration
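`Engine` accepts the same options as the server launcher as keyword arguments; a sketch with a few common ones (values are illustrative):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=2,                 # tensor-parallel degree (number of GPUs)
    mem_fraction_static=0.85,  # fraction of GPU memory for weights + KV cache
    dtype="bfloat16",
    random_seed=42,
)
```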
Context Manager
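If your SGLang version does not expose `Engine` as a context manager directly, a small wrapper gives the same ergonomics (a sketch; `engine_ctx` is our own helper, not SGLang API):

```python
import contextlib

import sglang as sgl


@contextlib.contextmanager
def engine_ctx(**kwargs):
    # Ensures shutdown() runs even if generation raises.
    llm = sgl.Engine(**kwargs)
    try:
        yield llm
    finally:
        llm.shutdown()


with engine_ctx(model_path="meta-llama/Llama-3.1-8B-Instruct") as llm:
    print(llm.generate(["Hello"])[0]["text"])
```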
Text Generation
Single Prompt
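A sketch of generating from one prompt, passed as a one-element batch for portability across versions:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["Write a haiku about the sea."],
    {"temperature": 0.7, "max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```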
Batch Generation
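`generate()` accepts a list of prompts and returns results in the same order; batching lets the scheduler run requests concurrently. A sketch:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "The capital of France is",
    "The tallest mountain on Earth is",
    "Water boils at",
]
outputs = llm.generate(prompts, {"temperature": 0, "max_new_tokens": 16})
for prompt, out in zip(prompts, outputs):
    print(f"{prompt!r} -> {out['text']!r}")
llm.shutdown()
```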
Token IDs Input
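A sketch using a Hugging Face tokenizer to produce the IDs; the `input_ids` keyword replaces the text prompt, one list per request:

```python
import sglang as sgl
from transformers import AutoTokenizer

model = "meta-llama/Llama-3.1-8B-Instruct"
llm = sgl.Engine(model_path=model)
tokenizer = AutoTokenizer.from_pretrained(model)

token_ids = tokenizer("The capital of France is")["input_ids"]
# Pass token IDs instead of a text prompt.
outputs = llm.generate(input_ids=[token_ids])
print(outputs[0]["text"])
llm.shutdown()
```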
Pre-tokenized prompts can be passed directly as token IDs, skipping the engine's tokenizer.

Streaming Generation
Synchronous Streaming
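With `stream=True`, `generate()` yields chunks as tokens are produced. A sketch (whether each chunk's `"text"` is a delta or the cumulative text can vary by version, so check your output):

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

for chunk in llm.generate(
    "Tell me a short story.",
    {"temperature": 0.7, "max_new_tokens": 128},
    stream=True,
):
    print(chunk["text"], end="", flush=True)
print()
llm.shutdown()
```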
Async Streaming
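`async_generate` with `stream=True` yields an async iterator, which fits naturally inside an asyncio application. A sketch:

```python
import asyncio

import sglang as sgl


async def main() -> None:
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    stream = await llm.async_generate(
        "Explain KV caching briefly.",
        {"max_new_tokens": 128},
        stream=True,
    )
    async for chunk in stream:
        print(chunk["text"], end="", flush=True)
    llm.shutdown()


asyncio.run(main())
```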
Sampling Parameters
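Sampling parameters are passed as a plain dict (the second argument to `generate()`). A sketch of common keys; names follow SGLang's sampling-parameter reference, but verify against your version:

```python
sampling_params = {
    "temperature": 0.8,        # randomness; 0 = greedy decoding
    "top_p": 0.95,             # nucleus-sampling cutoff
    "top_k": 40,               # sample only from the top-k tokens
    "max_new_tokens": 256,     # generation length cap
    "stop": ["\n\n"],          # stop strings
    "frequency_penalty": 0.2,  # discourage repeated tokens
    "presence_penalty": 0.0,
}
# Usage: llm.generate(prompts, sampling_params)
```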
Sampling parameters control generation behavior such as randomness, length, and stop conditions.

Structured Output
JSON Schema
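The schema is supplied as a JSON string under the `json_schema` sampling parameter (key name per recent SGLang releases). A sketch:

```python
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

sampling_params = {
    "temperature": 0,
    "max_new_tokens": 128,
    "json_schema": json.dumps(schema),  # the schema is passed as a string
}
# Usage: llm.generate("Describe a person as JSON.", sampling_params)
```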
Output can be constrained to match a JSON schema.

Regex Constraints
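The `regex` sampling parameter constrains decoding to strings matching the pattern. A sketch:

```python
import re

# Constrain output to an IPv4-shaped address.
ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
sampling_params = {"temperature": 0, "max_new_tokens": 32, "regex": ip_pattern}

# Sanity-check the pattern itself before handing it to the engine.
assert re.fullmatch(ip_pattern, "192.168.0.1")
# Usage: llm.generate("The server's IP address is ", sampling_params)
```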
EBNF Grammar
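Grammars use an EBNF dialect with a `root` rule (xgrammar-style in recent releases; check your backend's syntax). A minimal sketch constraining output to yes/no:

```python
yes_no_grammar = 'root ::= "yes" | "no"'
sampling_params = {"temperature": 0, "ebnf": yes_no_grammar}
# Usage: llm.generate("Is water wet? Answer: ", sampling_params)
```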
Embeddings
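A sketch with an embedding model (the model name is an example); `is_embedding=True` switches the engine to embedding mode and `encode()` replaces `generate()`:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="intfloat/e5-mistral-7b-instruct",
    is_embedding=True,
)
outputs = llm.encode(["What is the capital of France?"])
embedding = outputs[0]["embedding"]   # list[float]
print(len(embedding))
llm.shutdown()
```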
Embedding models generate vector embeddings instead of text.

Logprobs and Token Information
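`return_logprob` is a keyword of `generate()` itself, not a sampling parameter. A sketch (the exact layout of the logprob entries can vary by version):

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0, "max_new_tokens": 8},
    return_logprob=True,
    top_logprobs_num=5,   # also return the top-5 alternatives per position
)
meta = outputs[0]["meta_info"]
for entry in meta["output_token_logprobs"]:
    print(entry)  # typically (logprob, token_id, token_text)
llm.shutdown()
```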
Detailed token-level information, such as per-token log probabilities, is available on each result.

Multimodal Inputs
Images
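A sketch with a vision-language model; the model name and image-placeholder tokens are examples, and the placeholder must match the model's chat template:

```python
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen2-VL-7B-Instruct")

outputs = llm.generate(
    # Placeholder tokens here follow Qwen2-VL's template; adjust per model.
    prompt="<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
    image_data="/path/to/photo.jpg",   # local path or URL
    sampling_params={"max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```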
Multiple Images
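`image_data` also accepts a list, with one placeholder per image in the prompt (a sketch; placeholder syntax depends on the model's chat template):

```python
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen2-VL-7B-Instruct")
outputs = llm.generate(
    prompt=(
        "<|vision_start|><|image_pad|><|vision_end|>"
        "<|vision_start|><|image_pad|><|vision_end|>"
        "What differs between these two images?"
    ),
    image_data=["/path/to/a.jpg", "/path/to/b.jpg"],
    sampling_params={"max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```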
Video
LoRA Adapters
Load Adapters at Startup
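A sketch; whether `lora_paths` takes a list or a name-to-path mapping differs across releases, so check your version:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    lora_paths=["/path/to/adapter_a", "/path/to/adapter_b"],
)
# Select an adapter per request with the lora_path argument.
outputs = llm.generate(
    ["Summarize this ticket."],
    {"max_new_tokens": 64},
    lora_path="/path/to/adapter_a",
)
print(outputs[0]["text"])
llm.shutdown()
```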
Dynamic Loading
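Recent releases expose load/unload methods on the engine; the method names and signatures below are assumptions mirroring the HTTP `/load_lora_adapter` endpoint, so verify against your version:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Assumed API; mirrors the server's /load_lora_adapter route.
llm.load_lora_adapter(lora_name="adapter_c", lora_path="/path/to/adapter_c")
outputs = llm.generate(
    ["Summarize this ticket."], {"max_new_tokens": 32}, lora_path="adapter_c"
)
llm.unload_lora_adapter(lora_name="adapter_c")  # assumed API
llm.shutdown()
```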
Sessions
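A hypothetical sketch: the `open_session`/`close_session` methods and the `session_params` argument mirror the server's session endpoints, but their availability on `Engine` depends on your version, so verify before use:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Hypothetical: names mirror the /open_session and /close_session routes.
session_id = llm.open_session(capacity_of_str_len=32768)
first = llm.generate(
    "Hi! I'm planning a trip to Japan.",
    {"max_new_tokens": 64},
    session_params={"session_id": session_id},
)
followup = llm.generate(
    "Which month would you recommend?",
    {"max_new_tokens": 64},
    session_params={"session_id": session_id},  # reuses the cached context
)
llm.close_session(session_id)
llm.shutdown()
```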
Sessions allow efficient multi-turn conversations by sharing cached context across requests.

Cache Management
Flush Cache
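A sketch; flushing is useful between benchmark runs so earlier prompts don't get prefix-cache hits:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
llm.generate(["warm up the cache"], {"max_new_tokens": 8})

# Drop all cached KV blocks (call while no requests are in flight).
llm.flush_cache()
llm.shutdown()
```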
Clearing the KV cache frees all cached prefixes between workloads.

Freeze Garbage Collection
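The underlying mechanism is Python's own `gc.freeze()`: after warmup, surviving objects are moved to a permanent generation the collector never scans, avoiding GC pauses during serving. A standard-library sketch (engine setup elided; SGLang may also expose its own helper for this):

```python
import gc

# ... create the engine and run a few warmup generations here ...

gc.collect()   # collect warmup garbage first
gc.freeze()    # move all surviving objects to the permanent generation

# Frozen objects are no longer scanned by the collector.
print(gc.get_freeze_count())
```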
Freezing the garbage collector after warmup avoids GC pauses during serving.

Advanced Features
Custom Logit Processor
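SGLang lets you hook into sampling with a custom logit processor (enabled via a server flag in recent releases); the core idea is editing logits before sampling. A plain-Python sketch of that masking step — `ban_token` is our own illustration, not SGLang API:

```python
import math


def ban_token(logits: list[float], token_id: int) -> list[float]:
    """Forbid one token by driving its logit to -inf before sampling."""
    masked = list(logits)
    masked[token_id] = -math.inf
    return masked


logits = [0.1, 2.5, 0.3]
assert ban_token(logits, 1)[1] == -math.inf
assert ban_token(logits, 1)[0] == 0.1   # other positions are untouched
```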
Hidden States
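A sketch; the `enable_return_hidden_states` engine option and `return_hidden_states` keyword follow recent releases and are assumptions to verify against your version:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    enable_return_hidden_states=True,  # assumed kwarg; mirrors the CLI flag
)
outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0, "max_new_tokens": 8},
    return_hidden_states=True,
)
hidden = outputs[0]["meta_info"]["hidden_states"]
print(len(hidden))
llm.shutdown()
```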
Model hidden states can be returned alongside the generated text.

Priority Scheduling
Set request priority (requires `--enable-priority-scheduling`):
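A sketch; the `enable_priority_scheduling` kwarg and the `priority` keyword on `generate()` are assumptions mirroring the server flag and request field, so verify against your version:

```python
import sglang as sgl

# Assumed kwarg name, mirroring --enable-priority-scheduling.
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    enable_priority_scheduling=True,
)
outputs = llm.generate(
    ["urgent: summarize this alert"],
    {"max_new_tokens": 32},
    priority=10,   # assumed keyword; higher values scheduled sooner
)
llm.shutdown()
```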
Profiling and Monitoring
Start Profiling
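A sketch; `start_profile`/`stop_profile` mirror the server's profiling endpoints and write torch-profiler traces to the directory in `SGLANG_TORCH_PROFILER_DIR` — verify the method names on your version:

```python
import os

import sglang as sgl

# Set the trace directory before creating the engine.
os.environ["SGLANG_TORCH_PROFILER_DIR"] = "/tmp/sglang_traces"
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

llm.start_profile()
llm.generate(["The capital of France is"], {"max_new_tokens": 16})
llm.stop_profile()   # writes a trace covering the window above
llm.shutdown()
```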
Get Server Info
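A sketch; the `get_server_info` method name is assumed to mirror the HTTP `/get_server_info` route:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
info = llm.get_server_info()   # assumed method; mirrors GET /get_server_info
print(sorted(info.keys()))     # e.g. version, model path, scheduler stats
llm.shutdown()
```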
Engine Configuration
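CLI flags map to keyword arguments with underscores (for example, `--mem-fraction-static` becomes `mem_fraction_static`). A sketch with illustrative values:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=2,                  # --tp-size
    mem_fraction_static=0.85,   # --mem-fraction-static
    context_length=8192,        # --context-length
    dtype="bfloat16",           # --dtype
    random_seed=42,             # --random-seed
    log_level="info",           # --log-level
)
```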
All server arguments are available as keyword arguments when creating an Engine.

Error Handling
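A defensive sketch; the exact exception types SGLang raises vary by version, so catch broadly at the request level and always release the engine:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
try:
    outputs = llm.generate(["The capital of France is"], {"max_new_tokens": 16})
    print(outputs[0]["text"])
except Exception as exc:   # e.g. over-long prompts or invalid parameters
    print(f"generation failed: {exc}")
finally:
    llm.shutdown()         # always free GPU memory
```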
Cleanup
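`shutdown()` terminates the engine's background processes and frees GPU memory; a try/finally sketch guarantees it runs:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
try:
    llm.generate(["Hello"])
finally:
    llm.shutdown()   # frees GPU memory and stops child processes
```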
Always shut down the engine when done to free GPU memory.

See Also
- Offline Engine - Engine without server components
- Sampling Parameters - Parameter reference
- Server Arguments - Configuration options
