Completions
The completions endpoint generates text based on a prompt. This endpoint is compatible with OpenAI's `/v1/completions` API.
Request
Parameters
Required
Model name. Supports LoRA adapters via `base-model:adapter-name` syntax.
The prompt(s) to generate completions for. Can be:
- A single string
- An array of strings for batch processing
- An array of token IDs
- An array of arrays of token IDs for batch processing
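For illustration, the four accepted prompt shapes can be sketched as request payloads (the model name and token IDs below are placeholders, not real values):

```python
# The four accepted shapes of the "prompt" field; model name is a placeholder.
single = {"model": "my-model", "prompt": "Hello, world"}
batch_strings = {"model": "my-model", "prompt": ["Hello", "Bonjour"]}
token_ids = {"model": "my-model", "prompt": [15496, 11, 995]}  # illustrative token IDs
batch_token_ids = {"model": "my-model", "prompt": [[15496], [22177]]}  # batch of ID lists
```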
Sampling Parameters
Maximum number of tokens to generate.
Sampling temperature between 0 and 2. Higher values make output more random.
Nucleus sampling threshold. Only the smallest set of tokens whose cumulative probability exceeds top_p is considered.
Only sample from the top K tokens. -1 disables this.
Minimum probability threshold for sampling.
Number of completions to generate for each prompt.
Random seed for deterministic generation.
Stop sequences. Generation stops when these sequences are encountered.
Stop token IDs. Generation stops when these token IDs are encountered.
Penalization
Penalizes tokens based on their frequency in the generated text. Range: [-2.0, 2.0].
Penalizes tokens based on whether they appear in the generated text. Range: [-2.0, 2.0].
Penalizes repeated tokens. 1.0 means no penalty.
Structured Output
Format of the response. Options:
{"type": "text"}- Plain text (default){"type": "json_object"}- Valid JSON object{"type": "json_schema", "json_schema": {...}}- JSON matching a schema
JSON schema string for constrained generation.
Regular expression pattern for constrained generation.
EBNF grammar for constrained generation.
Other Parameters
Whether to stream the response.
Streaming options:
- `include_usage`: Include usage statistics in the final chunk
- `continuous_usage_stats`: Include usage stats in each chunk
Whether to echo the prompt in the completion.
Number of top log probabilities to return for each token.
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
Generate `best_of` completions and return the best one.
Text to append after the completion.
SGLang Extensions
Continue generation even after EOS token.
Whether to skip special tokens in the output.
Do not trim stop sequences from output.
Regular expression(s) to use as stop conditions.
Minimum number of tokens to generate.
Path to LoRA adapter weights.
Return hidden states from the model.
Return expert routing information for MoE models.
Return detailed cache hit information.
Response
Unique identifier for the completion.
Always `"text_completion"`.
Unix timestamp of creation time.
Model used for generation.
Array of completion choices.
Choice index in the array.
Generated text.
Reason for completion end:
- `stop`: Natural stop or a stop sequence was matched
- `length`: Max tokens reached
- `content_filter`: Content filtering
- `abort`: Request aborted
The stop sequence that was matched, if any.
Token usage statistics.
Number of tokens in the prompt.
Number of tokens in the completion.
Total tokens used (prompt + completion).
SGLang-specific extensions (only present when requested).
Expert routing information for MoE models.
Streaming Response
When `stream=true`, the response is sent as Server-Sent Events (SSE). Each event is a `data:` line containing a JSON chunk, and the stream terminates with `data: [DONE]`.
Examples
Basic Completion
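A minimal sketch using only the Python standard library. The base URL and model name are assumptions for a locally running server; adjust them to your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumed local server address

def build_payload(prompt, **params):
    """Assemble a /v1/completions request body; model name is a placeholder."""
    return {"model": "my-model", "prompt": prompt, **params}

def complete(prompt, **params):
    """POST to the OpenAI-compatible /v1/completions endpoint and return parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_payload(prompt, **params)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server:
# out = complete("The capital of France is", max_tokens=16, temperature=0)
# print(out["choices"][0]["text"])
```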
Streaming Completion
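A sketch of consuming the SSE stream with the standard library, assuming the same placeholder base URL and model name as above. The `data:` line parsing follows the standard SSE/OpenAI streaming format:

```python
import json
import urllib.request

def parse_sse_line(line):
    """Return the JSON chunk from one SSE data line, or None for [DONE]/non-data lines."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    return json.loads(data)

def stream_complete(prompt, base_url="http://localhost:30000"):
    """Yield completion chunks from a stream=true request; model name is a placeholder."""
    payload = {"model": "my-model", "prompt": prompt, "max_tokens": 64, "stream": True}
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            chunk = parse_sse_line(raw.decode())
            if chunk is not None:
                yield chunk

# Requires a running server:
# for chunk in stream_complete("Once upon a time"):
#     print(chunk["choices"][0]["text"], end="", flush=True)
```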
JSON Output
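A sketch of a structured-output request. The schema below is a hypothetical example, and the `json_schema` envelope follows OpenAI's response-format convention; treat the field names inside it as assumptions:

```python
import json

# Hypothetical schema constraining output to a simple object.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

payload = {
    "model": "my-model",  # placeholder model name
    "prompt": "Describe a person as JSON:",
    "max_tokens": 64,
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema},
    },
}
# POST this payload to /v1/completions as in the basic example; the returned
# choices[0].text should then parse with json.loads and conform to the schema.
```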
Batch Processing
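A sketch of a batched request, passing an array of prompts (prompts and model name are placeholders). Each prompt produces its own entry in `choices`, with `index` mapping a choice back to its prompt:

```python
payload = {
    "model": "my-model",  # placeholder model name
    "prompt": [
        "Translate 'hello' to French:",
        "Translate 'hello' to Spanish:",
    ],
    "max_tokens": 16,
}
# POST as in the basic example; the response contains one choice per prompt,
# and each choice's "index" field identifies which prompt it belongs to.
```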
See Also
- Chat Completions - Conversational format
- Sampling Parameters - Detailed parameter guide
- Server Args - Server configuration
