
Completions

The completions endpoint generates text based on a prompt. This endpoint is compatible with OpenAI’s /v1/completions API.

Request

cURL

curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "temperature": 0.8
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=128,
    temperature=0.8
)

print(response.choices[0].text)

Parameters

Required

model
string
required
Model name. Supports LoRA adapters via base-model:adapter-name syntax.
prompt
string | array
required
The prompt(s) to generate completions for. Can be:
  • A single string
  • An array of strings for batch processing
  • An array of token IDs
  • An array of arrays of token IDs for batch processing
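The four accepted shapes can be sketched as request payloads. The token IDs below are illustrative placeholders, not real vocabulary entries:

```python
# Four equivalent ways to supply "prompt"; token IDs are placeholders.
single_string = {"prompt": "Once upon a time"}
string_batch = {"prompt": ["Once upon a time", "In a galaxy far away"]}
token_ids = {"prompt": [9906, 5304, 264, 892]}           # one pre-tokenized prompt
token_id_batch = {"prompt": [[9906, 5304], [644, 264]]}  # batch of token-ID prompts
```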

Sampling Parameters

max_tokens
integer
default:"16"
Maximum number of tokens to generate.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random.
top_p
number
default:"1.0"
Nucleus sampling threshold. The model samples only from the smallest set of tokens whose cumulative probability reaches top_p.
top_k
integer
default:"-1"
Only sample from the top K tokens. -1 disables this.
min_p
number
default:"0.0"
Minimum probability threshold for sampling.
n
integer
default:"1"
Number of completions to generate for each prompt.
seed
integer
Random seed for deterministic generation.
stop
string | array
Stop sequences. Generation stops when these sequences are encountered.
stop_token_ids
array
Stop token IDs. Generation stops when these token IDs are encountered.
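Taken together, the sampling parameters above map onto a single request body. A sketch (all values are arbitrary, chosen only to show the field types):

```python
# A request body combining the sampling parameters above (values arbitrary).
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05,
    "n": 2,                        # two completions per prompt
    "seed": 42,                    # reproducible sampling
    "stop": ["\n\n", "THE END"],   # string stop sequences
    "stop_token_ids": [128009],    # illustrative token ID
}
```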

Penalization

frequency_penalty
number
default:"0.0"
Penalizes tokens based on their frequency in the generated text. Range: [-2.0, 2.0].
presence_penalty
number
default:"0.0"
Penalizes tokens based on whether they appear in the generated text. Range: [-2.0, 2.0].
repetition_penalty
number
default:"1.0"
Penalizes repeated tokens. 1.0 means no penalty.
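As a rough sketch of how repetition_penalty is conventionally applied (the CTRL-style rule used by common inference stacks; not necessarily SGLang's exact implementation): a logit for an already-generated token is divided by the penalty when positive and multiplied by it when negative, so a penalty above 1.0 always damps repeats.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL-style repetition penalty over already-generated token IDs.

    penalty > 1.0 discourages repeats; penalty == 1.0 is a no-op.
    """
    out = list(logits)
    for tid in set(generated_ids):
        if out[tid] > 0:
            out[tid] /= penalty   # shrink a positive logit
        else:
            out[tid] *= penalty   # push a negative logit further down
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))  # [1.0, -2.0, 0.5]
```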

Structured Output

response_format
object
Format of the response. Options:
  • {"type": "text"} - Plain text (default)
  • {"type": "json_object"} - Valid JSON object
  • {"type": "json_schema", "json_schema": {...}} - JSON matching a schema
json_schema
string
JSON schema string for constrained generation.
regex
string
Regular expression pattern for constrained generation.
ebnf
string
EBNF grammar for constrained generation.
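The constrained-output modes above translate into request bodies like the following sketch (the patterns and grammar are illustrative; with the official openai Python client, non-upstream fields such as regex and ebnf would be passed via extra_body):

```python
# Request bodies for each constrained-output mode (patterns illustrative).
as_json_object = {
    "prompt": "Generate a user profile:",
    "response_format": {"type": "json_object"},
}
as_regex = {
    "prompt": "Pick a 3-digit number:",
    "regex": r"[0-9]{3}",
}
as_ebnf = {
    "prompt": "Answer yes or no:",
    "ebnf": 'root ::= "yes" | "no"',
}
```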

Other Parameters

stream
boolean
default:"false"
Whether to stream the response.
stream_options
object
Streaming options:
  • include_usage: Include usage statistics in final chunk
  • continuous_usage_stats: Include usage stats in each chunk
echo
boolean
default:"false"
Whether to echo the prompt in the completion.
logprobs
integer
Number of top log probabilities to return for each token.
logit_bias
object
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
best_of
integer
Generate best_of completions and return the best one.
suffix
string
Text to append after the completion.
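Several of these parameters are most useful together, e.g. echoing the prompt while inspecting per-token log probabilities. A request sketch (the token IDs and bias values are arbitrary):

```python
# A request using echo, logprobs, and logit_bias together (IDs arbitrary).
request = {
    "prompt": "The capital of France is",
    "max_tokens": 5,
    "echo": True,         # include the prompt in the returned text
    "logprobs": 3,        # top-3 log probabilities per token
    "logit_bias": {
        "128009": -100,   # effectively ban this token ID
        "12366": 5,       # nudge this token ID upward
    },
}
```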

SGLang Extensions

ignore_eos
boolean
default:"false"
Continue generation even after EOS token.
skip_special_tokens
boolean
default:"true"
Whether to skip special tokens in the output.
no_stop_trim
boolean
default:"false"
Do not trim stop sequences from output.
stop_regex
string | array
Regular expression(s) to use as stop conditions.
min_tokens
integer
default:"0"
Minimum number of tokens to generate.
lora_path
string
Path to LoRA adapter weights.
return_hidden_states
boolean
default:"false"
Return hidden states from the model.
return_routed_experts
boolean
default:"false"
Return expert routing information for MoE models.
return_cached_tokens_details
boolean
default:"false"
Return detailed cache hit information.
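These extension fields are not part of the upstream OpenAI schema, so with the official openai Python client they would be forwarded via extra_body. A sketch (the regex pattern and adapter path are illustrative, not real values):

```python
# SGLang extension fields, collected as they would travel via extra_body.
extra_body = {
    "ignore_eos": False,
    "skip_special_tokens": True,
    "no_stop_trim": False,
    "stop_regex": r"\n#{1,6} ",          # illustrative pattern
    "min_tokens": 8,
    "lora_path": "adapters/my-adapter",  # illustrative path
    "return_hidden_states": False,
}
# client.completions.create(..., extra_body=extra_body) would forward these.
```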

Response

id
string
Unique identifier for the completion.
object
string
Always "text_completion".
created
integer
Unix timestamp of creation time.
model
string
Model used for generation.
choices
array
Array of completion choices.
index
integer
Choice index in the array.
text
string
Generated text.
logprobs
object | null
Log probability information if requested.
tokens
array
List of generated tokens.
token_logprobs
array
Log probabilities for each token.
top_logprobs
array
Top log probabilities for each position.
text_offset
array
Character offsets for each token.
finish_reason
string
Reason for completion end:
  • stop: Natural stop or stop sequence
  • length: Max tokens reached
  • content_filter: Content filtering
  • abort: Request aborted
matched_stop
integer | string | null
The stop sequence that was matched, if any.
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the prompt.
completion_tokens
integer
Number of tokens in the completion.
total_tokens
integer
Total tokens used (prompt + completion).
prompt_tokens_details
object
Details about prompt tokens.
cached_tokens
integer
Number of cached tokens from prefix cache.
sglext
object
SGLang-specific extensions (only present when requested).
routed_experts
string
Expert routing information for MoE models.
cached_tokens_details
object
Detailed cache hit information.
device
integer
Tokens from device (GPU) cache.
host
integer
Tokens from host (CPU) cache.
storage
integer
Tokens from L3 storage backend (if enabled).
storage_backend
string
Type of storage backend used.
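From the usage fields above, a prefix-cache hit rate can be derived. A sketch with made-up numbers shaped like the documented fields:

```python
# Made-up usage payload shaped like the fields documented above.
usage = {
    "prompt_tokens": 4096,
    "completion_tokens": 128,
    "total_tokens": 4224,
    "prompt_tokens_details": {"cached_tokens": 3072},
}

# Fraction of the prompt served from the prefix cache.
hit_rate = usage["prompt_tokens_details"]["cached_tokens"] / usage["prompt_tokens"]
print(f"prefix cache hit rate: {hit_rate:.0%}")  # prefix cache hit rate: 75%
```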

Streaming Response

When stream=true, the response is sent as Server-Sent Events (SSE):
data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":"Once","logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" upon","logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-123","object":"text_completion","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":4,"completion_tokens":20,"total_tokens":24}}

data: [DONE]
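A client consuming the raw SSE stream strips the data: prefix, stops at the [DONE] sentinel, and concatenates each chunk's text. A minimal sketch using the sample chunks above (trimmed to the relevant fields):

```python
import json

# Parse the sample stream above: accumulate text until the [DONE] sentinel.
sse_lines = [
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":"Once","logprobs":null,"finish_reason":null}]}',
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":" upon","logprobs":null,"finish_reason":null}]}',
    'data: {"id":"cmpl-123","object":"text_completion","choices":'
    '[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}]}',
    "data: [DONE]",
]

text = ""
finish_reason = None
for line in sse_lines:
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    text += chunk["choices"][0]["text"]
    finish_reason = chunk["choices"][0]["finish_reason"]

print(text)           # Once upon
print(finish_reason)  # stop
```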

Examples

Basic Completion

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write a function to compute fibonacci numbers",
    max_tokens=256,
    temperature=0.7
)

print(response.choices[0].text)

Streaming Completion

stream = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Tell me a story about",
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="")

JSON Output

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Generate a user profile:",
    max_tokens=128,
    response_format={"type": "json_object"}
)

import json
profile = json.loads(response.choices[0].text)
print(profile)

Batch Processing

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt=[
        "Translate to French: Hello",
        "Translate to Spanish: Hello",
        "Translate to German: Hello"
    ],
    max_tokens=10
)

for choice in response.choices:
    print(f"Translation {choice.index}: {choice.text}")

See Also