The `/v1/completions` endpoint provides simple text completion capabilities. Given a prompt, it returns the predicted continuation.
## Endpoint
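The endpoint accepts POST requests (OpenAI-compatible):

```
POST /v1/completions
```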
## Request Format

### Required Parameters
- `model` (string): Model identifier. Can be the model path, an alias (set via `--alias`), or any string when running a single model.
- `prompt` (string | array): The text prompt to complete. Can be:
  - A string: `"Once upon a time"`
  - An array of token IDs: `[12, 34, 56]`
  - An array of strings for batch completion: `["prompt1", "prompt2"]`
  - Mixed tokens and strings: `[12, 34, "string", 56]`
### Optional Parameters

- `max_tokens` (number): Maximum number of tokens to generate. `-1` means unlimited.
- `temperature` (number): Sampling temperature between 0 and 2. Higher values make output more random.
- `top_p` (number): Nucleus sampling: only tokens with cumulative probability up to `top_p` are considered.
- `top_k` (number): Limit token selection to the K most probable tokens. `0` = disabled.
- `min_p` (number): Minimum probability threshold relative to the most likely token.
- `stream` (boolean): Stream partial completions as Server-Sent Events.
- `stop` (array): Array of sequences where generation should stop. Stop sequences are not included in the output.
- `presence_penalty` (number): Penalize tokens based on whether they already appear in the text. Range: -2.0 to 2.0.
- `frequency_penalty` (number): Penalize tokens based on their frequency. Range: -2.0 to 2.0.
- `repeat_penalty` (number): Control repetition of token sequences.
- `n` (number): Number of completions to generate for each prompt.
- `seed` (number): Random seed for reproducible outputs. `-1` = random seed.
- `logprobs` (number): Include log probabilities for the most likely tokens. Maximum: 5.
- `echo` (boolean): Echo back the prompt in addition to the completion.
- `suffix` (string): Text that comes after the completion. Useful for code infilling.
### llama.cpp-Specific Parameters

- `mirostat` (number): Enable Mirostat sampling. `0` = disabled, `1` = Mirostat, `2` = Mirostat 2.0.
- `mirostat_tau` (number): Mirostat target entropy (τ).
- `mirostat_eta` (number): Mirostat learning rate (η).
- `grammar` (string): BNF-like grammar to constrain generation.
- `json_schema` (object): JSON schema to constrain output to valid JSON matching the schema.
- `cache_prompt` (boolean): Reuse the KV cache from previous requests for efficiency.
## Request Examples
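A minimal request sketch in Python. It assumes a llama.cpp server listening locally on port 8080; the actual host and port depend on how the server was started.

```python
import json
import urllib.request

# Assumption: a llama.cpp server is listening locally on port 8080.
URL = "http://localhost:8080/v1/completions"

payload = {
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.8,
    "stop": ["\n\n"],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
        # First choice holds the generated continuation.
        print(result["choices"][0]["text"])
except OSError as err:  # e.g. no server running
    print(f"request failed: {err}")
```

Any HTTP client works the same way; only the JSON body matters.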
## Response Format

### Standard Response

- `id` (string): Unique identifier for the completion.
- `object` (string): Always `"text_completion"` for non-streaming responses.
- `created` (number): Unix timestamp of creation time.
- `model` (string): The model used for completion.
- `choices` (array): Array of completion choices. Each choice contains:
  - `text` (string): The generated text
  - `index` (number): Choice index
  - `logprobs` (object | null): Log probabilities if requested
  - `finish_reason` (string): Why generation stopped: `stop`, `length`, or `null`
- `usage` (object): Token usage statistics:
  - `prompt_tokens` (number): Tokens in the prompt
  - `completion_tokens` (number): Tokens generated
  - `total_tokens` (number): Sum of prompt and completion tokens
### Example Response
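An illustrative response body; the ID, model name, and token counts are placeholders:

```json
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1700000000,
  "model": "llama-3",
  "choices": [
    {
      "text": " in a land far away...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 8,
    "total_tokens": 13
  }
}
```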
### Streaming Responses

When `stream: true` is set, the server sends Server-Sent Events:
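A sketch of the wire format (field values illustrative): each chunk is a `data:` line carrying a JSON payload with a partial `text`, and the stream ends with `data: [DONE]`.

```
data: {"id":"cmpl-abc123","object":"text_completion","created":1700000000,"model":"llama-3","choices":[{"text":"Once","index":0,"logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-abc123","object":"text_completion","created":1700000000,"model":"llama-3","choices":[{"text":"","index":0,"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
```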
## Log Probabilities

Request token probabilities with the `logprobs` parameter:
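A sketch of the request body; send it to `/v1/completions` with any HTTP client.

```python
# Request the top-3 log probabilities for each generated token (maximum: 5).
payload = {
    "prompt": "The quick brown",
    "max_tokens": 4,
    "logprobs": 3,
}
```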
## Batch Completions
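A sketch of a batch request body: the `prompt` field carries an array of strings.

```python
# One completion is generated per prompt in the array.
payload = {
    "prompt": ["The capital of France is", "The capital of Japan is"],
    "max_tokens": 8,
}
```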
Send an array of prompts to generate a completion for each in a single request.

## Code Infilling
Use the `suffix` parameter for code completion:
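A sketch of an infilling request body, using a hypothetical prefix/suffix pair:

```python
# The model completes the gap between `prompt` (prefix) and `suffix`.
payload = {
    "prompt": "def add(a, b):\n    ",
    "suffix": "\n\nprint(add(1, 2))",
    "max_tokens": 16,
}
```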
For more advanced code infilling, use the native `/infill` endpoint, which supports repository-level context.

## Grammar-Constrained Generation
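A sketch of a grammar-constrained request body; the grammar text is an illustrative yes/no rule in llama.cpp's GBNF syntax.

```python
# GBNF grammar: the completion must be " yes" or " no".
payload = {
    "prompt": "Is the sky blue? Answer:",
    "grammar": 'root ::= " yes" | " no"',
    "max_tokens": 4,
}
```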
Output is constrained using a BNF-like grammar in llama.cpp's GBNF format.

## JSON Schema Constraint
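A sketch of a request body with a schema constraint; the schema itself is an illustrative example.

```python
# The completion is forced to be valid JSON matching this schema.
payload = {
    "prompt": "Describe a fictional user as JSON:",
    "json_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}
```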
The `json_schema` parameter forces valid JSON output matching the schema.

## Mirostat Sampling
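A sketch of a request body enabling Mirostat; the tau and eta values shown are illustrative, not recommended defaults.

```python
payload = {
    "prompt": "Write a short story.",
    "mirostat": 2,        # 2 = Mirostat 2.0
    "mirostat_tau": 5.0,  # target entropy (tau)
    "mirostat_eta": 0.1,  # learning rate (eta)
    "max_tokens": 128,
}
```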
Mirostat adaptively tunes sampling to keep output perplexity under control.

## Performance Tips
- **Prompt caching**: keep `cache_prompt: true` to reuse the KV cache
- **Batch processing**: send multiple prompts in one request for efficiency
- **Streaming**: use `stream: true` for better perceived latency
- **Stop sequences**: define clear stop conditions to avoid over-generation
- **Token limits**: set an appropriate `max_tokens` to prevent excessive computation
## Error Responses

- `400`: Missing or invalid parameters
- `401`: Invalid API key
- `503`: Server not ready (model loading)
## Differences from Chat Completions
| Feature | Completions | Chat Completions |
|---|---|---|
| Input format | Raw text prompt | Structured messages |
| Use case | Simple continuation | Conversational AI |
| Context | Single prompt | Multi-turn dialogue |
| System prompts | Not supported | Supported |
| Function calling | Not supported | Supported |
| Best for | Text generation, code | Chat, assistants |

