
Foundry Local REST API Reference

The Foundry Local REST API provides endpoints for managing AI models, performing inference, and controlling the local inference service. All endpoints are compatible with the OpenAI Chat Completions API format.
This API is under active development and may include breaking changes without notice. Monitor the changelog before building production applications.

Base URL

http://localhost:5272

Authentication

For local usage, no authentication is required. The API uses a default placeholder API key.

Chat Completions

POST /v1/chat/completions

Process chat completion requests with local AI models. Fully compatible with the OpenAI Chat Completions API.
model (string, required): The model to use for completion (e.g., qwen2.5-0.5b-instruct-generic-cpu)
messages (array, required): The conversation history as a list of message objects. Each message requires:
  • role (string): The message sender’s role, one of system, user, or assistant
  • content (string): The message text
temperature (number): Controls randomness (0 to 2). Higher values (e.g., 0.8) produce more varied output; lower values (e.g., 0.2) produce more focused output
top_p (number): Controls token selection diversity (0 to 1). A value of 0.1 considers only the top 10% of probability mass
max_tokens (integer): Maximum number of tokens to generate in the completion
stream (boolean): When true, sends partial message responses as server-sent events
presence_penalty (number): Value between -2.0 and 2.0. Positive values encourage new topics
frequency_penalty (number): Value between -2.0 and 2.0. Positive values discourage repetition
Request Example:
{
  "model": "qwen2.5-0.5b-instruct-generic-cpu",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 100
}
Response fields:
id (string): Unique identifier for the chat completion
choices (array): List of completion choices generated
  • index (integer): Position of this choice
  • message (object): Generated message with role and content
  • finish_reason (string): Why generation stopped (stop, length, function_call)
usage (object): Token usage statistics
  • prompt_tokens: Tokens in the prompt
  • completion_tokens: Tokens in the completion
  • total_tokens: Total tokens used
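Because the endpoint mirrors the OpenAI Chat Completions API, a request can be assembled and its response parsed with a few lines of Python. A minimal sketch using only the standard library (the helper names and the sample response below are illustrative, not part of the API):

```python
import json
import urllib.request

BASE_URL = "http://localhost:5272"  # default Foundry Local endpoint

def build_chat_request(model, messages, **options):
    """Assemble a Chat Completions body; temperature, max_tokens, etc. pass through options."""
    body = {"model": model, "messages": messages}
    body.update(options)
    return body

def extract_reply(response):
    """Pull the assistant text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]

def chat(model, messages, **options):
    """POST the request to the local service and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, messages, **options)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Offline demonstration against the response shape documented above:
sample = {
    "id": "chatcmpl-123",
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "I'm doing well, thanks!"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 7, "total_tokens": 19},
}
print(extract_reply(sample))  # → I'm doing well, thanks!
```

Since the format is OpenAI-compatible, any OpenAI-style client pointed at the base URL (with the placeholder API key) should work equally well.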

Model Management

GET /foundry/list

Get a list of available Foundry Local models in the catalog.
models (array): Array of model objects with:
  • name: Model identifier
  • displayName: Human-readable name
  • version: Model version
  • modelType: Format (e.g., ONNX)
  • task: Primary task (e.g., chat-completion)
  • fileSizeMb: Size in megabytes
  • supportsToolCalling: Tool calling support
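Filtering the catalog client-side is straightforward once the response is parsed. A small sketch assuming the field names above (the sample entries are made up for illustration):

```python
def chat_capable(catalog):
    """Return names of catalog models whose primary task is chat completion."""
    return [m["name"] for m in catalog["models"] if m.get("task") == "chat-completion"]

# Illustrative catalog response shaped like the fields documented above:
sample_catalog = {"models": [
    {"name": "qwen2.5-0.5b-instruct-generic-cpu", "task": "chat-completion",
     "modelType": "ONNX", "fileSizeMb": 530, "supportsToolCalling": False},
    {"name": "some-embedding-model", "task": "embedding",
     "modelType": "ONNX", "fileSizeMb": 120, "supportsToolCalling": False},
]}
print(chat_capable(sample_catalog))  # → ['qwen2.5-0.5b-instruct-generic-cpu']
```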

GET /openai/models

List cached models, including local and registered external models.
Response Example:
["Phi-4-mini-instruct-generic-cpu", "phi-3.5-mini-instruct-generic-cpu"]

POST /openai/download

Download a model from the catalog to local storage.
Large model downloads can take significant time. Set a high timeout to avoid early termination.
model (object, required): Model specification:
  • Uri (string): Model URI to download
  • Name (string): Model name
  • ProviderType (string): Provider (e.g., AzureFoundryLocal, HuggingFace)
Request Example:
{
  "model": {
    "Uri": "azureml://registries/azureml/models/Phi-4-mini-instruct-generic-cpu/versions/4",
    "ProviderType": "AzureFoundryLocal",
    "Name": "Phi-4-mini-instruct-generic-cpu:4"
  }
}
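Given the warning about long downloads, a sketch of issuing the request with a generous client-side timeout (the helper names are illustrative, and the default timeout value is an assumption to tune for your connection and model size):

```python
import json
import urllib.request

def build_download_request(uri, name, provider="AzureFoundryLocal"):
    """Assemble the body for POST /openai/download."""
    return {"model": {"Uri": uri, "Name": name, "ProviderType": provider}}

def download_model(base_url, uri, name, timeout=3600):
    """Request a download; large models take a while, so the timeout is deliberately high."""
    req = urllib.request.Request(
        f"{base_url}/openai/download",
        data=json.dumps(build_download_request(uri, name)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

body = build_download_request(
    "azureml://registries/azureml/models/Phi-4-mini-instruct-generic-cpu/versions/4",
    "Phi-4-mini-instruct-generic-cpu:4",
)
print(body["model"]["ProviderType"])  # → AzureFoundryLocal
```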

GET /openai/load/{name}

Load a model into memory for faster inference.
name (string, required): The model name to load
ttl (integer): Time to live in seconds. Overrides automatic unload settings
ep (string): Execution provider: dml, cuda, qnn, cpu, webgpu
Example:
GET /openai/load/Phi-4-mini-instruct-generic-cpu?ttl=3600&ep=dml
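The model name goes in the path and the optional parameters in the query string, which is easy to get wrong with unescaped names. A small sketch of building the load URL (the helper name is illustrative):

```python
from urllib.parse import quote, urlencode

def load_url(base_url, name, ttl=None, ep=None):
    """Build the GET /openai/load/{name} URL, appending only the options that are set."""
    params = {k: v for k, v in (("ttl", ttl), ("ep", ep)) if v is not None}
    url = f"{base_url}/openai/load/{quote(name)}"
    return f"{url}?{urlencode(params)}" if params else url

print(load_url("http://localhost:5272", "Phi-4-mini-instruct-generic-cpu",
               ttl=3600, ep="dml"))
# → http://localhost:5272/openai/load/Phi-4-mini-instruct-generic-cpu?ttl=3600&ep=dml
```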

GET /openai/unload/{name}

Unload a model from memory.
name (string, required): The model name to unload
force (boolean): If true, ignores TTL settings and unloads immediately

GET /openai/loadedmodels

Get the list of currently loaded models.
Response Example:
["Phi-4-mini-instruct-generic-cpu", "phi-3.5-mini-instruct-generic-cpu"]

Service Status

GET /openai/status

Get server status information.
Endpoints (array): HTTP server binding endpoints
ModelDirPath (string): Directory where local models are stored
PipeName (string): Current NamedPipe server name
Response Example:
{
  "Endpoints": ["http://localhost:5272"],
  "ModelDirPath": "/path/to/models",
  "PipeName": "inference_agent"
}

Token Counting

POST /v1/chat/completions/tokenizer/encode/count

Count tokens for a chat completion request without performing inference.
model (string, required): Model to use for tokenization
messages (array, required): Array of message objects with role and content
Example:
{
  "messages": [
    {
      "role": "system",
      "content": "This is a system message"
    },
    {
      "role": "user",
      "content": "Hello, what is Microsoft?"
    }
  ],
  "model": "Phi-4-mini-instruct-cuda-gpu"
}
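A chat request already contains everything this endpoint needs except the sampling options. A sketch of reusing one for counting (whether the endpoint tolerates extra keys is not documented here, so stripping them is the conservative choice; the helper name is illustrative):

```python
# Sampling options from the chat completions parameters above.
SAMPLING_KEYS = {"temperature", "top_p", "max_tokens", "stream",
                 "presence_penalty", "frequency_penalty"}

def to_count_request(chat_request):
    """Reduce a full chat request to the model/messages pair the counter expects."""
    return {k: v for k, v in chat_request.items() if k not in SAMPLING_KEYS}

req = {"model": "Phi-4-mini-instruct-cuda-gpu",
       "messages": [{"role": "user", "content": "Hello, what is Microsoft?"}],
       "temperature": 0.7, "max_tokens": 100}
print(sorted(to_count_request(req)))  # → ['messages', 'model']
```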

GPU Management

GET /openai/getgpudevice

Get the current GPU device ID.
Response: Integer representing the GPU device ID

GET /openai/setgpudevice/{deviceId}

Set the active GPU device.
deviceId (integer, required): The GPU device ID to use
Example:
GET /openai/setgpudevice/1

Error Handling

The API returns standard HTTP status codes:
  • 200 - Success
  • 400 - Bad Request (invalid parameters)
  • 404 - Not Found (model or resource doesn’t exist)
  • 500 - Internal Server Error
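With only a handful of status codes in play, client-side handling can stay simple. A sketch that translates HTTP errors into the hints above (the helper names are illustrative):

```python
import json
import urllib.error
import urllib.request

# Hints keyed by the status codes documented above.
STATUS_HINTS = {
    400: "Bad Request: check parameter names and types",
    404: "Not Found: the model or resource does not exist",
    500: "Internal Server Error: inspect the service logs",
}

def get_json(url):
    """GET a JSON endpoint, turning HTTP errors into readable messages."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        hint = STATUS_HINTS.get(err.code, f"Unexpected status {err.code}")
        raise RuntimeError(hint) from err

print(STATUS_HINTS[404])  # → Not Found: the model or resource does not exist
```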

Rate Limits

No rate limits are enforced for local usage. Performance is limited by hardware capabilities.

SDK Reference

  • Python SDK: easier integration from Python
  • JavaScript SDK: Node.js and browser integration
