
Create Chat Completion

Creates a model response for the given chat conversation.
curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret-key-123" \
  -d '{
    "model": "llama3-8b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke."}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'
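The same request can be made from Python using only the standard library. A minimal sketch, assuming the server address and API key from the curl example above (`127.0.0.1:1337` and `secret-key-123`):

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:1337/v1/chat/completions"
API_KEY = "secret-key-123"  # matches the curl example above

def build_payload(user_message: str,
                  system_prompt: str = "You are a helpful assistant.") -> dict:
    """Assemble the request body shown in the curl example."""
    return {
        "model": "llama3-8b-instruct",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "max_tokens": 150,
    }

def chat(user_message: str) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(user_message)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `chat("Tell me a joke.")` against a running server returns the JSON shape shown in the Example Response section.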

Request Body

model
string
required
The ID of the model to use. Must match a model available in Jan. Examples: llama3-8b-instruct, qwen2.5-7b-instruct
messages
array
required
A list of messages comprising the conversation so far. Each message has:
  • role (string, required): One of system, user, assistant, or tool
  • content (string or array): The message content. Can be a string or an array of content parts (for multimodal messages)
  • name (string, optional): The name of the message author
  • tool_calls (array, optional): Tool calls made by the assistant
  • tool_call_id (string, optional): The ID of the tool call this message is responding to
temperature
number
default:"0.7"
Sampling temperature between 0 and 2. Higher values make output more random, lower values more deterministic.
max_tokens
number
The maximum number of tokens to generate. Set to null or omit for unlimited generation (up to context limit).
top_p
number
default:"0.95"
Nucleus sampling: only tokens with cumulative probability up to top_p are considered.
top_k
number
default:"40"
Only the top K most likely tokens are considered for generation.
min_p
number
Minimum probability threshold for token selection.
stream
boolean
default:false
If true, returns a stream of Server-Sent Events (SSE) as the model generates tokens.
stop
string | array
Up to 4 sequences where the API will stop generating further tokens.
presence_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
frequency_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text.
repeat_penalty
number
default:"1.1"
Penalty for repeating tokens. Values > 1 discourage repetition.
repeat_last_n
number
default:"64"
Number of previous tokens to consider for repeat penalty.
seed
number
Random seed for reproducible generation.
tools
array
A list of tools the model may call. Each tool has:
  • type (string): Currently only "function" is supported
  • function (object): Function definition with name, description, and parameters
tool_choice
string | object
Controls which (if any) function is called by the model.
  • "none": Model will not call any function
  • "auto": Model can pick between generating a message or calling a function
  • "required": Model must call one or more functions
  • {"type": "function", "function": {"name": "my_function"}}: Forces a specific function call
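The top_k, top_p, and min_p parameters described above compose into one candidate-filtering pipeline before a token is sampled. A rough illustrative sketch (not Jan's actual implementation, which lives in the backend):

```python
import math

def filter_candidates(logits: dict[str, float], top_k: int = 40,
                      top_p: float = 0.95, min_p: float = 0.0) -> dict[str, float]:
    """Apply top-k, then nucleus (top-p), then min-p filtering to token logits."""
    # Softmax over the logits to get probabilities.
    m = max(logits.values())
    exps = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}

    # top_k: keep only the K most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top_p: keep tokens until cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    # min_p: drop tokens below the minimum probability threshold.
    return {t: p for t, p in kept if p >= min_p}
```

The surviving tokens are then sampled according to temperature; setting top_k low or min_p high makes output more deterministic.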

Advanced Parameters

dynatemp_range
number
Dynamic temperature range for sampling.
dynatemp_exponent
number
Dynamic temperature exponent.
typical_p
number
Typical probability mass for sampling.
mirostat
number
Enable Mirostat sampling. 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0.
mirostat_tau
number
Mirostat target entropy.
mirostat_eta
number
Mirostat learning rate.
logit_bias
object
Modify the likelihood of specified tokens appearing. Maps token IDs to bias values (-100 to 100).
cache_prompt
boolean
Enable KV cache for the prompt.
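The presence_penalty and frequency_penalty parameters described earlier follow the usual OpenAI-style logit adjustment: each candidate token's logit is reduced by its count in the generated text times the frequency penalty, plus a flat presence penalty if it has appeared at all. An illustrative sketch (how this interacts with repeat_penalty is backend-specific):

```python
from collections import Counter

def apply_penalties(logits: dict, generated_tokens: list,
                    presence_penalty: float = 0.0,
                    frequency_penalty: float = 0.0) -> dict:
    """logit[t] -= count[t] * frequency_penalty + (count[t] > 0) * presence_penalty."""
    counts = Counter(generated_tokens)
    return {
        tok: logit
        - counts[tok] * frequency_penalty
        - (1.0 if counts[tok] else 0.0) * presence_penalty
        for tok, logit in logits.items()
    }
```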

Response

id
string
A unique identifier for the chat completion.
object
string
The object type, always chat.completion.
created
number
Unix timestamp (in seconds) of when the completion was created.
model
string
The model used for the completion.
choices
array
A list of chat completion choices. Can be more than one if n is greater than 1. Each choice contains:
  • index (number): The index of this choice
  • message (object): The generated message
    • role (string): Always assistant
    • content (string): The content of the message
    • tool_calls (array, optional): Tool calls made by the model
  • finish_reason (string): Why generation stopped (stop, length, tool_calls, content_filter)
usage
object
Token usage information.
  • prompt_tokens (number): Number of tokens in the prompt
  • completion_tokens (number): Number of tokens in the completion
  • total_tokens (number): Total tokens used
system_fingerprint
string
System fingerprint for the backend.

Example Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699896916,
  "model": "llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Why did the scarecrow win an award? Because he was outstanding in his field!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 18,
    "total_tokens": 38
  },
  "system_fingerprint": "llamacpp-b1-e4912fc"
}
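Pulling the reply text and token usage out of a parsed response is straightforward; a sketch assuming the response JSON has already been decoded into a dict:

```python
def extract_reply(response: dict) -> tuple:
    """Return the first choice's text, finish reason, and total token count."""
    choice = response["choices"][0]
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        response["usage"]["total_tokens"],
    )
```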

Streaming

When stream is set to true, the API returns Server-Sent Events (SSE) as the model generates tokens.

Streaming Request

cURL
curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret-key-123" \
  -d '{
    "model": "llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

Streaming Response

Each chunk is a JSON object prefixed with the data: marker:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699896916,"model":"llama3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"1"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699896916,"model":"llama3-8b-instruct","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1699896916,"model":"llama3-8b-instruct","choices":[{"index":0,"delta":{"content":", 3, 4, 5"},"finish_reason":"stop"}]}

data: [DONE]
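A client reassembles the full message by concatenating the delta.content of each chunk until the [DONE] sentinel. A minimal parser for the event stream above (a sketch; real clients should also handle partial reads and keep-alive lines):

```python
import json

def accumulate_stream(sse_lines) -> str:
    """Concatenate delta.content across chat.completion.chunk events."""
    text = ""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text += delta.get("content", "")  # first chunk may carry only the role
    return text
```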

Streaming Response Fields

id
string
Unique identifier for the chat completion (consistent across all chunks).
object
string
Always chat.completion.chunk.
created
number
Unix timestamp.
model
string
The model used.
choices
array
Array of choices.
  • index (number): Choice index
  • delta (object): Content delta
    • role (string, optional): Set in first chunk
    • content (string, optional): Incremental content
  • finish_reason (string | null): Reason for stopping (only in final chunk)
prompt_progress
object
Jan-specific field showing prompt processing progress.
  • cache (number): Tokens already in KV cache
  • processed (number): Tokens processed so far
  • total (number): Total prompt tokens
  • time_ms (number): Time spent processing

Multimodal Messages

Jan supports vision models that can process images alongside text.

Image Input

cURL
curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret-key-123" \
  -d '{
    "model": "llava-v1.6-7b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ]
  }'
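Building the base64 data URI for image_url from raw image bytes looks like this (a sketch; the bytes in the test are a placeholder, not a real JPEG):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URI usable in image_url.url."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a multimodal user message with text and image content parts."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_uri(image_bytes)}},
        ],
    }
```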

Content Array Format

When using multimodal messages, the content field is an array of objects:
content[].type
string
required
The type of content: text, image_url, or input_audio.
content[].text
string
Text content (when type is text).
content[].image_url
object
Image content (when type is image_url).
  • url (string): URL or base64-encoded data URI

Function Calling

Jan supports function calling for compatible models.

Request with Tools

{
  "model": "llama3-8b-instruct",
  "messages": [
    {"role": "user", "content": "What's the weather in San Francisco?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Response with Tool Call

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699896916,
  "model": "llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 82,
    "completion_tokens": 18,
    "total_tokens": 100
  }
}
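To complete the round trip, the client parses the arguments string of each tool call, runs the function locally, and appends a tool-role message carrying the matching tool_call_id. A sketch using a stub get_weather (hypothetical implementation; a real client would call an actual service):

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    """Stub weather lookup standing in for a real API call."""
    return f"68 degrees {unit} in {location}"

def handle_tool_calls(assistant_message: dict) -> list:
    """Execute each tool call and build the follow-up tool messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        # arguments is a JSON-encoded string, not an object.
        args = json.loads(call["function"]["arguments"])
        output = get_weather(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output,
        })
    return results
```

The resulting tool messages are appended to messages and the request is re-sent so the model can produce its final answer from the function output.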

Error Handling

Finish Reasons

  • stop: Natural stop point or stop sequence reached
  • length: Token limit reached (max_tokens or the model's context window)
  • tool_calls: Model called a function
  • content_filter: Content was filtered

Context Overflow

When the conversation exceeds the model’s context window, the API returns finish_reason: "length". You’ll need to truncate the conversation history or use a model with a larger context window.
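A common truncation strategy is a sliding window that keeps the system prompt and drops the oldest turns first. A minimal sketch; count_tokens here is a hypothetical stand-in, since real counts depend on the model's tokenizer:

```python
def count_tokens(message: dict) -> int:
    """Crude stand-in for a real tokenizer: roughly 1 token per 4 characters."""
    return max(1, len(message.get("content") or "") // 4)

def truncate_history(messages: list, max_tokens: int) -> list:
    """Keep the system message plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for m in reversed(rest):  # walk newest-first
        cost = count_tokens(m)
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))
```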
