The /v1/completions endpoint provides simple text completion capabilities. Given a prompt, it returns the predicted continuation.

Endpoint

POST /v1/completions

Request Format

Required Parameters

model (string, required)
Model identifier. Can be the model path, an alias (set via --alias), or any string when the server is running a single model.

prompt (string | array, required)
The text prompt to complete. Can be:
  • A string: "Once upon a time"
  • An array of token IDs: [12, 34, 56]
  • An array of strings for batch completion: ["prompt1", "prompt2"]
  • A mixed array of token IDs and strings: [12, 34, "string", 56]

Optional Parameters

max_tokens (number, default: -1)
Maximum number of tokens to generate. -1 means no limit beyond the context size.

temperature (number, default: 0.8)
Sampling temperature between 0 and 2. Higher values make output more random; lower values make it more deterministic.

top_p (number, default: 0.95)
Nucleus sampling: only tokens whose cumulative probability is within top_p are considered.

top_k (number, default: 40)
Limit token selection to the K most probable tokens. 0 = disabled.

min_p (number, default: 0.05)
Minimum probability threshold for a token, relative to the probability of the most likely token.

stream (boolean, default: false)
Stream partial completions as Server-Sent Events.

stop (array)
Array of sequences at which generation stops. Stop sequences are not included in the output.

presence_penalty (number, default: 0.0)
Penalize tokens that have already appeared in the text. Range: -2.0 to 2.0.

frequency_penalty (number, default: 0.0)
Penalize tokens in proportion to how often they have appeared. Range: -2.0 to 2.0.

repeat_penalty (number, default: 1.1)
Penalty applied to repeated token sequences; values above 1.0 discourage repetition.

n (number, default: 1)
Number of completions to generate for each prompt.

seed (number, default: -1)
Random seed for reproducible outputs. -1 = random seed.

logprobs (number)
Include log probabilities for the most likely tokens at each position. Maximum: 5.

echo (boolean, default: false)
Echo back the prompt in addition to the completion.

suffix (string)
Text that comes after the completion. Useful for code infilling.

llama.cpp-Specific Parameters

mirostat (number, default: 0)
Enable Mirostat sampling. 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0.

mirostat_tau (number, default: 5.0)
Mirostat target entropy (τ).

mirostat_eta (number, default: 0.1)
Mirostat learning rate (η).

grammar (string)
GBNF grammar (llama.cpp's BNF-like format) used to constrain generation.

json_schema (object)
JSON Schema used to constrain output to valid JSON matching the schema.

cache_prompt (boolean, default: true)
Reuse the KV cache from previous requests when prompts share a common prefix.

Request Examples

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "temperature": 0.7
  }'
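The same request can be made from Python using only the standard library. A minimal sketch, assuming the server is listening on localhost:8080 as in the curl example:

```python
import json
import urllib.request

def build_completion_request(prompt: str, **params) -> urllib.request.Request:
    """Build a POST request for /v1/completions; extra params merge into the body."""
    body = {"model": "gpt-3.5-turbo", "prompt": prompt, **params}
    return urllib.request.Request(
        "http://localhost:8080/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer no-key"},
        method="POST",
    )

req = build_completion_request("Once upon a time", max_tokens=50, temperature=0.7)
# To send it (requires a running server):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```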

Response Format

Standard Response

id (string)
Unique identifier for the completion.

object (string)
Always "text_completion" for non-streaming responses.

created (number)
Unix timestamp of when the completion was created.

model (string)
The model used for the completion.

choices (array)
Array of completion choices. Each choice contains:
  • text (string) - The generated text
  • index (number) - Choice index
  • logprobs (object | null) - Log probabilities, if requested
  • finish_reason (string) - Why generation stopped: "stop", "length", or null

usage (object)
Token usage statistics:
  • prompt_tokens (number) - Tokens in the prompt
  • completion_tokens (number) - Tokens generated
  • total_tokens (number) - Sum of prompt and completion tokens

Example Response

{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "text": " in a small village nestled in the mountains. The villagers lived peaceful lives, tending to their crops and livestock. One day, a mysterious stranger arrived with tales of distant lands and ancient treasures.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 46,
    "total_tokens": 50
  }
}

Streaming Responses

When stream: true, the server sends Server-Sent Events:
data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null}],"model":"gpt-3.5-turbo"}

data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"text":" a","index":0,"logprobs":null,"finish_reason":null}],"model":"gpt-3.5-turbo"}

data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"text":" small","index":0,"logprobs":null,"finish_reason":null}],"model":"gpt-3.5-turbo"}

data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"text":"","index":0,"logprobs":null,"finish_reason":"stop"}],"model":"gpt-3.5-turbo"}

data: [DONE]
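A client reassembles the completion by reading the data: lines and stopping at the [DONE] sentinel. A minimal sketch over a list of lines (a real client would read the HTTP response incrementally):

```python
import json

def collect_stream(lines):
    """Concatenate the "text" deltas from a sequence of SSE lines."""
    out = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        event = json.loads(payload)
        out.append(event["choices"][0]["text"])
    return "".join(out)

events = [
    'data: {"choices":[{"text":" in","index":0,"finish_reason":null}]}',
    'data: {"choices":[{"text":" a","index":0,"finish_reason":null}]}',
    'data: {"choices":[{"text":" small","index":0,"finish_reason":null}]}',
    'data: [DONE]',
]
print(collect_stream(events))  # → " in a small"
```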

Log Probabilities

Request token probabilities with the logprobs parameter:
{
  "model": "gpt-3.5-turbo",
  "prompt": "The sky is",
  "max_tokens": 5,
  "logprobs": 3
}
Response includes probability data:
{
  "choices": [
    {
      "text": " blue",
      "logprobs": {
        "tokens": [" blue"],
        "token_logprobs": [-0.0234],
        "top_logprobs": [
          {
            " blue": -0.0234,
            " clear": -3.8234,
            " dark": -5.1234
          }
        ],
        "text_offset": [0]
      },
      "finish_reason": "length"
    }
  ]
}
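The values are natural-log probabilities, so math.exp converts them back to plain probabilities. Applied to the top_logprobs entry above:

```python
import math

top_logprobs = {" blue": -0.0234, " clear": -3.8234, " dark": -5.1234}

# Convert each natural-log probability back to a probability in [0, 1].
probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.4f}")
```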

Batch Completions

Generate multiple completions from different prompts:
{
  "model": "gpt-3.5-turbo",
  "prompt": [
    "The capital of France is",
    "The capital of Germany is",
    "The capital of Italy is"
  ],
  "max_tokens": 5
}
Response contains completions for each prompt:
{
  "choices": [
    {"text": " Paris", "index": 0, "finish_reason": "stop"},
    {"text": " Berlin", "index": 1, "finish_reason": "stop"},
    {"text": " Rome", "index": 2, "finish_reason": "stop"}
  ]
}
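Each choice's index field matches the position of its prompt in the request array, so results can be paired back with their inputs. A sketch over the example response above:

```python
prompts = [
    "The capital of France is",
    "The capital of Germany is",
    "The capital of Italy is",
]
choices = [
    {"text": " Paris", "index": 0, "finish_reason": "stop"},
    {"text": " Berlin", "index": 1, "finish_reason": "stop"},
    {"text": " Rome", "index": 2, "finish_reason": "stop"},
]

# Pair each completion with its originating prompt via the index field.
results = {prompts[c["index"]]: c["text"].strip() for c in choices}
print(results["The capital of France is"])  # → Paris
```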

Code Infilling

Use the suffix parameter for code completion:
{
  "model": "codellama",
  "prompt": "def fibonacci(n):\n    ",
  "suffix": "\n    return result",
  "max_tokens": 100
}
The model will generate code that fits between the prompt and suffix.
For more advanced code infilling, use the native /infill endpoint which supports repository-level context.
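The final source is simply the prompt, the generated text, and the suffix concatenated in order. A sketch of the assembly, with a hypothetical completion standing in for the model's output:

```python
prompt = "def fibonacci(n):\n    "
suffix = "\n    return result"
# Hypothetical model output; in practice this comes from choices[0]["text"].
completion = ("result = [0, 1][:n]\n"
              "    for _ in range(n - 2):\n"
              "        result.append(result[-1] + result[-2])")

full_source = prompt + completion + suffix
print(full_source)
```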

Grammar-Constrained Generation

Constrain output using BNF grammar:
{
  "model": "gpt-3.5-turbo",
  "prompt": "Generate a simple arithmetic expression:",
  "max_tokens": 20,
  "grammar": "root ::= expr\nexpr ::= term (('+' | '-') term)*\nterm ::= factor (('*' | '/') factor)*\nfactor ::= [0-9]+ | '(' expr ')'"
}
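Grammars usually span multiple lines, and newlines must be escaped inside the JSON string value. Serializing the request body with json.dumps handles this automatically, as this sketch shows:

```python
import json

# The same grammar as above, written as a readable multi-line string.
grammar = """\
root ::= expr
expr ::= term (('+' | '-') term)*
term ::= factor (('*' | '/') factor)*
factor ::= [0-9]+ | '(' expr ')'"""

body = json.dumps({
    "model": "gpt-3.5-turbo",
    "prompt": "Generate a simple arithmetic expression:",
    "max_tokens": 20,
    "grammar": grammar,  # newlines become \n escapes in the serialized JSON
})
print("\\n" in body)  # → True
```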

JSON Schema Constraint

Force valid JSON output matching a schema:
{
  "model": "gpt-3.5-turbo",
  "prompt": "Generate a product:",
  "max_tokens": 100,
  "json_schema": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "price": {"type": "number", "minimum": 0},
      "in_stock": {"type": "boolean"},
      "tags": {
        "type": "array",
        "items": {"type": "string"},
        "minItems": 1,
        "maxItems": 5
      }
    },
    "required": ["name", "price"]
  }
}
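The server constrains generation to JSON matching the schema, but a client-side spot check is still a reasonable guard. A minimal sketch using only the standard library (a full validator such as the jsonschema package covers the complete spec):

```python
import json

def check_product(raw: str) -> dict:
    """Parse model output and spot-check the required fields from the schema."""
    obj = json.loads(raw)
    assert isinstance(obj.get("name"), str), "name must be a string"
    assert isinstance(obj.get("price"), (int, float)) and obj["price"] >= 0, \
        "price must be a non-negative number"
    return obj

product = check_product('{"name": "Widget", "price": 9.99, "in_stock": true}')
print(product["name"])  # → Widget
```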

Mirostat Sampling

Enable Mirostat for controlled perplexity:
{
  "model": "gpt-3.5-turbo",
  "prompt": "Write a creative story:",
  "max_tokens": 200,
  "mirostat": 2,
  "mirostat_tau": 5.0,
  "mirostat_eta": 0.1
}
Mirostat dynamically adjusts sampling to maintain target entropy (τ), useful for balancing coherence and creativity.

Performance Tips

  1. Prompt caching: Keep cache_prompt: true to reuse KV cache
  2. Batch processing: Send multiple prompts in one request for efficiency
  3. Streaming: Use stream: true for better perceived latency
  4. Stop sequences: Define clear stop conditions to avoid over-generation
  5. Token limits: Set appropriate max_tokens to prevent excessive computation

Error Responses

{
  "error": {
    "message": "prompt is required",
    "type": "invalid_request_error",
    "code": 400
  }
}
Common errors:
  • 400 - Missing or invalid parameters
  • 401 - Invalid API key
  • 503 - Server not ready (model loading)
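A client can branch on the HTTP status code and the error payload. A sketch that assumes the error body shape shown above:

```python
import json

def describe_error(err_body: bytes, status: int) -> str:
    """Turn an error response body into a short human-readable message."""
    try:
        info = json.loads(err_body)["error"]
        return f"{status} {info['type']}: {info['message']}"
    except (ValueError, KeyError):
        return f"{status}: unparseable error body"

body = b'{"error": {"message": "prompt is required", "type": "invalid_request_error", "code": 400}}'
print(describe_error(body, 400))  # → 400 invalid_request_error: prompt is required
```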

Differences from Chat Completions

Feature           Completions             Chat Completions
Input format      Raw text prompt         Structured messages
Use case          Simple continuation     Conversational AI
Context           Single prompt           Multi-turn dialogue
System prompts    Not supported           Supported
Function calling  Not supported           Supported
Best for          Text generation, code   Chat, assistants