The @sgl.function decorator is the foundation of SGLang’s frontend language. It transforms a regular Python function into an SGLang program that can run on multiple backends and in several execution modes.

Basic Usage

Defining a Function

Use the @sgl.function decorator to create an SGLang function:
import sglang as sgl

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")
The first parameter s is the state object that manages the conversation context. All other parameters become inputs to your function.

Running Functions

Once defined, SGLang functions gain special methods for execution:
# Single execution
state = text_qa.run(question="What is the capital of France?")
print(state["answer"])

# Batch execution
states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
    ]
)

# Streaming execution
state = text_qa.run(question="What is the capital of France?", stream=True)
for out in state.text_iter():
    print(out, end="", flush=True)

The State Object

The state object (s) is the core of every SGLang function. It provides methods and operators to build prompts and control execution flow.

Appending Content

Use the += operator to append text to the state:
@sgl.function
def example(s, name):
    s += "Hello, "
    s += name
    s += "!"

Accessing Variables

Use dictionary-style access to retrieve generated content:
@sgl.function
def example(s):
    s += "Tell me a number: " + sgl.gen("number", max_tokens=10)
    s += f"\nYou said: {s['number']}"

Role Management

For chat models, use role methods to structure conversations:
@sgl.function
def chat_example(s, user_message):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(user_message)
    s += sgl.assistant(sgl.gen("response", max_tokens=256))
Alternatively, use context managers for complex role structures:
@sgl.function
def chat_with_context(s, user_message):
    with s.user():
        s += "Context: This is important.\n"
        s += user_message
    
    with s.assistant():
        s += sgl.gen("response", max_tokens=256)

Execution Methods

.run() - Single Execution

Execute a single request:
state = my_function.run(
    param1="value1",
    param2="value2",
    # Sampling parameters
    temperature=0.7,
    max_tokens=100,
    stream=False
)
Parameters:
  • Function arguments (positional and keyword)
  • Sampling parameters (temperature, max_tokens, top_p, etc.)
  • stream (bool): Enable streaming output
  • backend (BaseBackend): Override the default backend
Returns:
  • ProgramState: A state object containing results

.run_batch() - Batch Execution

Process multiple inputs efficiently:
states = my_function.run_batch(
    [
        {"param1": "value1", "param2": "value2"},
        {"param1": "value3", "param2": "value4"},
    ],
    # Sampling parameters apply to all
    temperature=0.7,
    num_threads="auto",
    progress_bar=True
)
Parameters:
  • batch_arguments (List[Dict]): List of argument dictionaries
  • Sampling parameters (applied to all requests)
  • num_threads (int | "auto"): Number of parallel threads
  • progress_bar (bool): Show progress bar
  • backend (BaseBackend): Override the default backend
Returns:
  • List[ProgramState]: List of state objects

Generator-Style Batch Processing

For large batches, use generator mode to process results as they complete:
for state in my_function.run_batch(
    batch_arguments,
    generator_style=True
):
    # Process each result as it becomes available
    print(state["answer"])

Advanced Features

Parallel Sampling with Fork/Join

Generate multiple responses in parallel and gather results:
@sgl.function
def parallel_sample(s, question, n):
    s += "Question: " + question + "\n"
    
    # Fork into n parallel branches
    forks = s.fork(n)
    
    # Each fork generates independently
    forks += "Reasoning:" + sgl.gen("reasoning", stop="\n") + "\n"
    forks += "Answer:" + sgl.gen("answer", stop="\n") + "\n"
    
    # Join results back (optional)
    forks.join()

state = parallel_sample.run(question="Compute 5 + 2 + 4.", n=5, temperature=1.0)

# Access results from each fork
for i in range(5):
    print(f"Fork {i}: reasoning={state['reasoning'][i]}, answer={state['answer'][i]}")
Fork Methods:
  • s.fork(n): Create n parallel branches
  • forks[i]: Access individual fork
  • forks += expr: Apply expression to all forks
  • forks.join(): Merge results back

Copy Context

Create a temporary copy of the state:
@sgl.function
def with_copy(s):
    s += "Original context\n"
    
    with s.copy() as copied:
        copied += "This is in the copy\n"
        copied += sgl.gen("temp", max_tokens=10)
    
    # Original state is unchanged
    s += "Back to original\n"

Variable Scopes

Capture specific sections of generated text:
@sgl.function
def with_scope(s):
    with s.var_scope("section"):
        s += "This entire section "
        s += "will be captured "
        s += "in the variable."
    
    print(s["section"])  # Contains the full section text

API Speculative Execution

For chat-based API backends (OpenAI, Anthropic), SGLang can speculatively execute multiple generation calls in a single API request:
@sgl.function(num_api_spec_tokens=200)
def multi_gen_chat(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        "Let me think: " + 
        sgl.gen("thought", max_tokens=50) +
        "\nAnswer: " +
        sgl.gen("answer", max_tokens=100)
    )
This sends a single API request with max_tokens=200 instead of two separate requests. Syntax:
@sgl.function(num_api_spec_tokens=int)

State Object Reference

Accessors

state.text()          # Get full generated text
state.messages()      # Get conversation messages (chat format)
state["var_name"]     # Access a generated variable
state.error()         # Get any error that occurred

Methods

state.sync()                          # Wait for async operations
state.text_iter()                     # Iterate over streaming text
state.text_iter(var_name="answer")    # Stream a specific variable
state.text_async_iter()               # Async streaming iterator
state.get_var("name")                 # Get variable value
state.set_var("name", value)          # Set variable value
state.get_meta_info("name")           # Get generation metadata
state.fork(n)                         # Create parallel branches

Setting Default Backend

Before running functions, set a default backend:
import sglang as sgl

# Local Runtime
runtime = sgl.Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
sgl.set_default_backend(runtime)

# OpenAI
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

# Anthropic
sgl.set_default_backend(sgl.Anthropic("claude-3-haiku-20240307"))

# Remote Runtime Endpoint
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
You can also override the backend per-call:
state = my_function.run(
    question="What is AI?",
    backend=sgl.OpenAI("gpt-4")
)

Complete Example

Here’s a complete example demonstrating multiple features:
import sglang as sgl

@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

if __name__ == "__main__":
    # Set backend
    sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))
    
    # Single execution
    state = multi_turn_question.run(
        question_1="What is the capital of the United States?",
        question_2="List two local attractions.",
    )
    
    for m in state.messages():
        print(m["role"], ":", m["content"])
    
    print("\n-- answer_1 --\n", state["answer_1"])
    
    # Batch execution
    states = multi_turn_question.run_batch(
        [
            {
                "question_1": "What is the capital of the United States?",
                "question_2": "List two local attractions.",
            },
            {
                "question_1": "What is the capital of France?",
                "question_2": "What is the population of this city?",
            },
        ]
    )
    
    for s in states:
        print(s.messages())