
Completions

The completions endpoint generates text based on a prompt. This endpoint is compatible with OpenAI’s /v1/completions API.

Request

cURL

curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "temperature": 0.8
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=128,
    temperature=0.8
)

print(response.choices[0].text)

Parameters

Required

model
string
required
Model name. Supports LoRA adapters via base-model:adapter-name syntax.
prompt
string | array
required
The prompt(s) to generate completions for. Can be:
  • A single string
  • An array of strings for batch processing
  • An array of token IDs
  • An array of arrays of token IDs for batch processing
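The four accepted shapes can be sketched as request payloads. The token IDs below are illustrative placeholders, not real vocabulary entries:

```python
# Four equivalent ways to supply "prompt"; token IDs are placeholders.
single_string = {"prompt": "Once upon a time"}
string_batch = {"prompt": ["Once upon a time", "In a galaxy far away"]}
token_ids = {"prompt": [9906, 5304, 264, 892]}           # one pre-tokenized prompt
token_id_batch = {"prompt": [[9906, 5304], [644, 264]]}  # batch of token-ID prompts
```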

Sampling Parameters

max_tokens
integer
default:"16"
Maximum number of tokens to generate.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random.
top_p
number
default:"1.0"
Nucleus sampling threshold. The model samples only from the smallest set of tokens whose cumulative probability reaches top_p.
top_k
integer
default:"-1"
Only sample from the top K tokens. -1 disables this.
min_p
number
default:"0.0"
Minimum probability threshold for sampling.
n
integer
default:"1"
Number of completions to generate for each prompt.
seed
integer
Random seed for deterministic generation.
stop
string | array
Stop sequences. Generation stops when these sequences are encountered.
stop_token_ids
array
Stop token IDs. Generation stops when these token IDs are encountered.
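Taken together, the sampling parameters above map onto a single request body. A sketch (all values are arbitrary, chosen only to show the field types):

```python
# A request body combining the sampling parameters above (values arbitrary).
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05,
    "n": 2,                        # two completions per prompt
    "seed": 42,                    # reproducible sampling
    "stop": ["\n\n", "THE END"],   # string stop sequences
    "stop_token_ids": [128009],    # illustrative token ID
}
```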

Penalization

frequency_penalty
number
default:"0.0"
Penalizes tokens based on their frequency in the generated text. Range: [-2.0, 2.0].
presence_penalty
number
default:"0.0"
Penalizes tokens based on whether they appear in the generated text. Range: [-2.0, 2.0].
repetition_penalty
number
default:"1.0"
Penalizes repeated tokens. 1.0 means no penalty.
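As a rough sketch of how repetition_penalty is conventionally applied (the CTRL-style rule used by common inference stacks; not necessarily SGLang's exact implementation): a logit for an already-generated token is divided by the penalty when positive and multiplied by it when negative, so a penalty above 1.0 always damps repeats.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL-style repetition penalty over already-generated token IDs.

    penalty > 1.0 discourages repeats; penalty == 1.0 is a no-op.
    """
    out = list(logits)
    for tid in set(generated_ids):
        if out[tid] > 0:
            out[tid] /= penalty   # shrink a positive logit
        else:
            out[tid] *= penalty   # push a negative logit further down
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))  # [1.0, -2.0, 0.5]
```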

Structured Output

response_format
object
Format of the response. Options:
  • {"type": "text"} - Plain text (default)
  • {"type": "json_object"} - Valid JSON object
  • {"type": "json_schema", "json_schema": {...}} - JSON matching a schema
json_schema
string
JSON schema string for constrained generation.
regex
string
Regular expression pattern for constrained generation.
ebnf
string
EBNF grammar for constrained generation.
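The constrained-output modes above translate into request bodies like the following sketch (the patterns and grammar are illustrative; with the official openai Python client, non-upstream fields such as regex and ebnf would be passed via extra_body):

```python
# Request bodies for each constrained-output mode (patterns illustrative).
as_json_object = {
    "prompt": "Generate a user profile:",
    "response_format": {"type": "json_object"},
}
as_regex = {
    "prompt": "Pick a 3-digit number:",
    "regex": r"[0-9]{3}",
}
as_ebnf = {
    "prompt": "Answer yes or no:",
    "ebnf": 'root ::= "yes" | "no"',
}
```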

Other Parameters

stream
boolean
default:"false"
Whether to stream the response.
stream_options
object
Streaming options:
  • include_usage: Include usage statistics in final chunk
  • continuous_usage_stats: Include usage stats in each chunk
echo
boolean
default:"false"
Whether to echo the prompt in the completion.
logprobs
integer
Number of top log probabilities to return for each token.
logit_bias
object
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
best_of
integer
Generate best_of completions and return the best one.
suffix
string
Text to append after the completion.
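Several of these parameters are most useful together, e.g. echoing the prompt while inspecting per-token log probabilities. A request sketch (the token IDs and bias values are arbitrary):

```python
# A request using echo, logprobs, and logit_bias together (IDs arbitrary).
request = {
    "prompt": "The capital of France is",
    "max_tokens": 5,
    "echo": True,         # include the prompt in the returned text
    "logprobs": 3,        # top-3 log probabilities per token
    "logit_bias": {
        "128009": -100,   # effectively ban this token ID
        "12366": 5,       # nudge this token ID upward
    },
}
```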

SGLang Extensions

ignore_eos
boolean
default:"false"
Continue generation even after EOS token.
skip_special_tokens
boolean
default:"true"
Whether to skip special tokens in the output.
no_stop_trim
boolean
default:"false"
Do not trim stop sequences from output.
stop_regex
string | array
Regular expression(s) to use as stop conditions.
min_tokens
integer
default:"0"
Minimum number of tokens to generate.
lora_path
string
Path to LoRA adapter weights.
return_hidden_states
boolean
default:"false"
Return hidden states from the model.
return_routed_experts
boolean
default:"false"
Return expert routing information for MoE models.
return_cached_tokens_details
boolean
default:"false"
Return detailed cache hit information.
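These extension fields are not part of the upstream OpenAI schema, so with the official openai Python client they would be forwarded via extra_body. A sketch (the regex pattern and adapter path are illustrative, not real values):

```python
# SGLang extension fields, collected as they would travel via extra_body.
extra_body = {
    "ignore_eos": False,
    "skip_special_tokens": True,
    "no_stop_trim": False,
    "stop_regex": r"\n#{1,6} ",          # illustrative pattern
    "min_tokens": 8,
    "lora_path": "adapters/my-adapter",  # illustrative path
    "return_hidden_states": False,
}
# client.completions.create(..., extra_body=extra_body) would forward these.
```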

Response

id
string
Unique identifier for the completion.
object
string
Always "text_completion".
created
integer
Unix timestamp of creation time.
model
string
Model used for generation.
choices
array
Array of completion choices.
index
integer
Choice index in the array.
text
string
Generated text.
logprobs
object | null
Log probability information if requested.
tokens
array
List of generated tokens.
token_logprobs
array
Log probabilities for each token.
top_logprobs
array
Top log probabilities for each position.
text_offset
array
Character offsets for each token.
finish_reason
string
Reason for completion end:
  • stop: Natural stop or stop sequence
  • length: Max tokens reached
  • content_filter: Content filtering
  • abort: Request aborted
matched_stop
integer | string | null
The stop sequence that was matched, if any.
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the prompt.
completion_tokens
integer
Number of tokens in the completion.
total_tokens
integer
Total tokens used (prompt + completion).
prompt_tokens_details
object
Details about prompt tokens.
cached_tokens
integer
Number of cached tokens from prefix cache.
sglext
object
SGLang-specific extensions (only present when requested).
routed_experts
string
Expert routing information for MoE models.
cached_tokens_details
object
Detailed cache hit information.
device
integer
Tokens from device (GPU) cache.
host
integer
Tokens from host (CPU) cache.
storage
integer
Tokens from L3 storage backend (if enabled).
storage_backend
string
Type of storage backend used.
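From the usage fields above, a prefix-cache hit rate can be derived. A sketch with made-up numbers shaped like the documented fields:

```python
# Made-up usage payload shaped like the fields documented above.
usage = {
    "prompt_tokens": 4096,
    "completion_tokens": 128,
    "total_tokens": 4224,
    "prompt_tokens_details": {"cached_tokens": 3072},
}

# Fraction of the prompt served from the prefix cache.
hit_rate = usage["prompt_tokens_details"]["cached_tokens"] / usage["prompt_tokens"]
print(f"prefix cache hit rate: {hit_rate:.0%}")  # prefix cache hit rate: 75%
```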

Streaming Response

When stream=true, the response is sent as Server-Sent Events (SSE):
data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":"Once","logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" upon","logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":4,"completion_tokens":20,"total_tokens":24}}

data: [DONE]
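A client consuming the raw SSE stream strips the data: prefix, stops at the [DONE] sentinel, and concatenates each chunk's text. A minimal sketch using the sample chunks above (trimmed to the relevant fields):

```python
import json

# Parse the sample stream above: accumulate text until the [DONE] sentinel.
sse_lines = [
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":"Once","logprobs":null,"finish_reason":null}]}',
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":" upon","logprobs":null,"finish_reason":null}]}',
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}]}',
    "data: [DONE]",
]

text = ""
finish_reason = None
for line in sse_lines:
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    text += chunk["choices"][0]["text"]
    finish_reason = chunk["choices"][0]["finish_reason"]

print(text)           # Once upon
print(finish_reason)  # stop
```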

Examples

Basic Completion

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write a function to compute fibonacci numbers",
    max_tokens=256,
    temperature=0.7
)

print(response.choices[0].text)

Streaming Completion

stream = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Tell me a story about",
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="")

JSON Output

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Generate a user profile:",
    max_tokens=128,
    response_format={"type": "json_object"}
)

import json
profile = json.loads(response.choices[0].text)
print(profile)

Batch Processing

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt=[
        "Translate to French: Hello",
        "Translate to Spanish: Hello",
        "Translate to German: Hello"
    ],
    max_tokens=10
)

for choice in response.choices:
    print(f"Translation {choice.index}: {choice.text}")

See Also