
Audio endpoints

Create transcription

Transcribe audio to text using Qwen3-ASR models. Compatible with the OpenAI Whisper API format.
cURL
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F language=Chinese

Request parameters

file
file
Audio file to transcribe. Supported formats: WAV, MP3, M4A, FLAC.
Multipart upload only. Use file_path for JSON requests.
file_path
string
Path to audio file on the server filesystem.
JSON requests only. Use file for multipart uploads.
language
string
Language of the audio content. Specifying the language improves accuracy.
Supported languages: Chinese, English, Spanish, French, German, Japanese, Korean, Arabic, Russian, Hindi, Thai, Vietnamese, and 15+ more.
Example: "English", "Chinese"
model
string
default:"qwen3-asr"
Model identifier. Currently only qwen3-asr is supported.
response_format
string
default:"json"
Response format. Options: json, text, verbose_json
  • json: JSON object containing the transcribed text
  • text: Plain-text body containing only the transcription
  • verbose_json: JSON object with timestamps and segment metadata

Response format

text
string
Transcribed text from the audio file
duration
number
Audio duration in seconds (verbose_json only)
language
string
Detected or specified language (verbose_json only)

Example response

{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon."
}

Verbose response

{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon.",
  "duration": 5.2,
  "language": "English",
  "segments": [
    {
      "start": 0.0,
      "end": 2.1,
      "text": "Welcome to OminiX-MLX,"
    },
    {
      "start": 2.1,
      "end": 5.2,
      "text": "a high-performance ML inference framework for Apple Silicon."
    }
  ]
}
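The verbose fields can be consumed directly by a client; a minimal Python sketch that parses the verbose response above and derives per-segment durations (the response literal is copied from the example):

```python
import json

# Parse a verbose_json transcription response (copied from the example above).
response = json.loads("""
{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon.",
  "duration": 5.2,
  "language": "English",
  "segments": [
    {"start": 0.0, "end": 2.1, "text": "Welcome to OminiX-MLX,"},
    {"start": 2.1, "end": 5.2, "text": "a high-performance ML inference framework for Apple Silicon."}
  ]
}
""")

# Each segment's length is end - start; the last segment ends at the total duration.
segment_lengths = [round(s["end"] - s["start"], 1) for s in response["segments"]]
print(segment_lengths)  # → [2.1, 3.1]
print(response["segments"][-1]["end"] == response["duration"])  # → True
```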

Chat endpoints

Create chat completion

Available in MiniCPM-SALA server. Coming to OminiX-API unified server soon.
Generate chat completions from LLMs using the OpenAI-compatible request format.
cURL
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minicpm-sala-9b",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Request parameters

model
string
required
Model identifier to use for completion
Example: "minicpm-sala-9b"
messages
array
required
Array of message objects representing the conversation history
Each message contains:
  • role: One of system, user, or assistant
  • content: Message text content
temperature
number
default:"0.7"
Sampling temperature between 0 and 2. Higher values make output more random.
  • 0.0: Deterministic (greedy sampling)
  • 0.7: Balanced creativity and coherence
  • 1.5+: More creative but less focused
max_tokens
integer
default:"2048"
Maximum number of tokens to generate
top_p
number
default:"1.0"
Nucleus sampling threshold. Alternative to temperature.
stream
boolean
default:"false"
Whether to stream response tokens (not yet implemented)

Response format

id
string
Unique completion identifier
object
string
Object type, always "chat.completion"
model
string
Model used for completion
choices
array
Array of completion choices (currently always length 1)
Each choice contains:
  • index: Choice index (always 0)
  • message: Response message with role and content
  • finish_reason: Why generation stopped ("stop", "length", etc.)
usage
object
Token usage statistics
Contains:
  • prompt_tokens: Input token count
  • completion_tokens: Generated token count
  • total_tokens: Sum of prompt and completion tokens

Example response

{
  "id": "chatcmpl-18a3f2b4c5d6e7f8",
  "object": "chat.completion",
  "model": "minicpm-sala-9b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform computations. Unlike classical bits that are either 0 or 1, quantum bits (qubits) can exist in multiple states simultaneously..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 156,
    "total_tokens": 168
  }
}
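The same request body can be assembled programmatically, and the response's usage accounting checked; a short sketch with values copied from the examples above (the payload mirrors the curl request):

```python
import json

# Build the documented request body for POST /v1/chat/completions.
payload = {
    "model": "minicpm-sala-9b",
    "messages": [
        {"role": "user", "content": "Explain quantum computing"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}
body = json.dumps(payload)

# The usage object in the response is self-consistent:
# total_tokens is the sum of prompt and completion tokens.
usage = {"prompt_tokens": 12, "completion_tokens": 156, "total_tokens": 168}
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(json.loads(body)["model"])  # → minicpm-sala-9b
```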

Model management endpoints

List models

Retrieve list of available models with metadata including path, size, quantization, and loaded status.
cURL
curl http://localhost:8080/v1/models

Response format

object
string
Object type, always "list"
data
array
Array of model objects

Model object fields

id
string
Model identifier (directory name)
object
string
Object type, always "model"
owned_by
string
Model owner (e.g., "mlx-community", "local")
path
string
Absolute path to model directory
loaded
boolean
Whether this model is currently loaded in the server
repo_id
string
HuggingFace repository ID (if downloaded via API)
quantization
object
Quantization configuration
Contains:
  • bits: Quantization bit width (4, 8, etc.)
  • group_size: Group size for quantization
size_bytes
integer
Total size of model files in bytes
downloaded_at
string
Unix timestamp (returned as a string) of when the model was downloaded

Example response

{
  "object": "list",
  "data": [
    {
      "id": "qwen3-asr-1.7b",
      "object": "model",
      "owned_by": "mlx-community",
      "path": "/Users/you/.ominix/models/qwen3-asr-1.7b",
      "loaded": true,
      "repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 2460000000,
      "downloaded_at": "1672531200"
    },
    {
      "id": "MiniCPM-SALA-9B-8bit",
      "object": "model",
      "owned_by": "moxin-org",
      "path": "/Users/you/.ominix/models/MiniCPM-SALA-9B-8bit",
      "loaded": false,
      "repo_id": "moxin-org/MiniCPM4-SALA-9B-8bit-mlx",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 9600000000
    }
  ]
}
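A client might summarize this response, for example by listing which models are loaded and formatting size_bytes for display. A minimal sketch over the two example entries above (decimal gigabytes are an arbitrary display choice, not part of the API):

```python
# Entries copied from the example /v1/models response, trimmed to the
# fields this sketch uses.
models = [
    {"id": "qwen3-asr-1.7b", "loaded": True, "size_bytes": 2_460_000_000},
    {"id": "MiniCPM-SALA-9B-8bit", "loaded": False, "size_bytes": 9_600_000_000},
]

def human_size(n: int) -> str:
    # size_bytes is a raw byte count; render it as decimal gigabytes.
    return f"{n / 1e9:.2f} GB"

loaded = [m["id"] for m in models if m["loaded"]]
print(loaded)                               # → ['qwen3-asr-1.7b']
print(human_size(models[1]["size_bytes"]))  # → 9.60 GB
```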

Download model

Download a model from HuggingFace Hub to the local models directory.
cURL
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit"}'

Request parameters

repo_id
string
required
HuggingFace repository ID in format owner/model-name
Example: "mlx-community/Qwen3-ASR-1.7B-8bit"

Response format

status
string
Download status, always "downloading"
id
string
Model identifier (extracted from repo_id)
repo_id
string
HuggingFace repository ID

Example response

{
  "status": "downloading",
  "id": "Qwen3-ASR-1.7B-8bit",
  "repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit"
}
The download runs asynchronously. Use GET /v1/models to check when the model is available.
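The id in the response appears to be the last path component of repo_id, as in the example above; a one-line sketch of that assumption:

```python
def model_id_from_repo(repo_id: str) -> str:
    # Assumption based on the example response: the local model id is the
    # last path component of the HuggingFace repo_id.
    return repo_id.rsplit("/", 1)[-1]

print(model_id_from_repo("mlx-community/Qwen3-ASR-1.7B-8bit"))  # → Qwen3-ASR-1.7B-8bit
```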

Downloaded files

The server downloads these essential files:
  • config.json - Model configuration
  • tokenizer.json - Tokenizer vocabulary
  • tokenizer_config.json - Tokenizer settings
  • model.safetensors - Model weights (single file)
  • model.safetensors.index.json + shards - Model weights (sharded)
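A quick local sanity check for a downloaded model directory can be sketched from this file list. The helper names here are illustrative, not part of the API; weights may be a single file or an index plus shards, so they are checked separately from the always-present metadata files:

```python
from pathlib import Path

# Metadata files that should always be present after a download.
ALWAYS_PRESENT = {"config.json", "tokenizer.json", "tokenizer_config.json"}

def missing_metadata(model_dir: str) -> set[str]:
    # Return the metadata files that are absent from model_dir.
    d = Path(model_dir)
    present = {p.name for p in d.iterdir()} if d.is_dir() else set()
    return ALWAYS_PRESENT - present

def has_weights(model_dir: str) -> bool:
    # Weights are either a single model.safetensors file or a
    # model.safetensors.index.json plus shard files.
    d = Path(model_dir)
    return (d / "model.safetensors").is_file() or \
           (d / "model.safetensors.index.json").is_file()
```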

Delete model

Delete a downloaded model from the local filesystem.
cURL
curl -X DELETE http://localhost:8080/v1/models/Qwen3-ASR-1.7B-8bit

Path parameters

id
string
required
Model identifier to delete (from model list)

Response format

id
string
Model identifier that was deleted
deleted
boolean
Always true if successful

Example response

{
  "id": "Qwen3-ASR-1.7B-8bit",
  "deleted": true
}
You cannot delete the currently loaded model. Stop the server or load a different model first.

Health endpoint

Health check

Check server health and readiness.
cURL
curl http://localhost:8080/health

Response format

{
  "status": "ok"
}
Returns 200 OK when the server is healthy and ready to accept requests.

Error responses

All endpoints return OpenAI-compatible error responses:
{
  "error": {
    "message": "Detailed error message",
    "type": "error_type",
    "code": "error_code"
  }
}
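A client can branch on the type field of this envelope; a minimal Python sketch over a sample error body (the retry heuristic shown is a suggestion, not documented server behavior):

```python
import json

# Sample error body using the documented envelope shape.
raw = ('{"error": {"message": "Unsupported audio file format",'
       ' "type": "invalid_request_error", "code": "invalid_audio_format"}}')

err = json.loads(raw)["error"]
# 4xx-class types indicate a bad request; only server_error (HTTP 500)
# responses are typically worth retrying.
retryable = err["type"] == "server_error"
print(err["code"], retryable)  # → invalid_audio_format False
```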

Error types

HTTP Status | Error Type            | Description
400         | invalid_request_error | Malformed request or invalid parameters
404         | not_found             | Endpoint or resource not found
409         | conflict              | Resource conflict (e.g., model already exists)
500         | server_error          | Internal server error or inference failure

Common error codes

Code                 | Description
invalid_audio_format | Unsupported audio file format
invalid_language     | Unsupported language specified
model_not_found      | Requested model not available
inference_failed     | Model inference error
download_failed      | Model download error
