
Audio endpoints

Create transcription

Transcribe audio to text using Qwen3-ASR models. Compatible with the OpenAI Whisper API format.
cURL
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F language=Chinese

Request parameters

file
file
Audio file to transcribe. Supported formats: WAV, MP3, M4A, FLAC.
Multipart upload only. Use file_path for JSON requests.
file_path
string
Path to audio file on the server filesystem.
JSON requests only. Use file for multipart uploads.
language
string
Language of the audio content. Specifying the language improves accuracy.
Supported languages: Chinese, English, Spanish, French, German, Japanese, Korean, Arabic, Russian, Hindi, Thai, Vietnamese, and 15+ more.
Example: "English", "Chinese"
model
string
default:"qwen3-asr"
Model identifier. Currently only qwen3-asr is supported.
response_format
string
default:"json"
Response format. Options: json, text, verbose_json
  • json: JSON object containing the transcribed text
  • text: Plain-text body containing only the transcription
  • verbose_json: JSON object with timestamps and segment metadata

Response format

text
string
Transcribed text from the audio file
duration
number
Audio duration in seconds (verbose_json only)
language
string
Detected or specified language (verbose_json only)

Example response

{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon."
}

Verbose response

{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon.",
  "duration": 5.2,
  "language": "English",
  "segments": [
    {
      "start": 0.0,
      "end": 2.1,
      "text": "Welcome to OminiX-MLX,"
    },
    {
      "start": 2.1,
      "end": 5.2,
      "text": "a high-performance ML inference framework for Apple Silicon."
    }
  ]
}
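The verbose fields can be consumed directly by a client; a minimal Python sketch that parses the verbose response above and derives per-segment durations (the response literal is copied from the example):

```python
import json

# Parse a verbose_json transcription response (copied from the example above).
response = json.loads("""
{
  "text": "Welcome to OminiX-MLX, a high-performance ML inference framework for Apple Silicon.",
  "duration": 5.2,
  "language": "English",
  "segments": [
    {"start": 0.0, "end": 2.1, "text": "Welcome to OminiX-MLX,"},
    {"start": 2.1, "end": 5.2, "text": "a high-performance ML inference framework for Apple Silicon."}
  ]
}
""")

# Each segment's length is end - start; the last segment ends at the total duration.
segment_lengths = [round(s["end"] - s["start"], 1) for s in response["segments"]]
print(segment_lengths)  # → [2.1, 3.1]
print(response["segments"][-1]["end"] == response["duration"])  # → True
```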

Chat endpoints

Create chat completion

Available in MiniCPM-SALA server. Coming to OminiX-API unified server soon.
Generate chat completions from LLMs using the OpenAI-compatible request format.
cURL
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minicpm-sala-9b",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Request parameters

model
string
required
Model identifier to use for completion
Example: "minicpm-sala-9b"
messages
array
required
Array of message objects representing the conversation history
Each message contains:
  • role: One of system, user, or assistant
  • content: Message text content
temperature
number
default:"0.7"
Sampling temperature between 0 and 2. Higher values make output more random.
  • 0.0: Deterministic (greedy sampling)
  • 0.7: Balanced creativity and coherence
  • 1.5+: More creative but less focused
max_tokens
integer
default:"2048"
Maximum number of tokens to generate
top_p
number
default:"1.0"
Nucleus sampling threshold. Alternative to temperature.
stream
boolean
default:"false"
Whether to stream response tokens (not yet implemented)

Response format

id
string
Unique completion identifier
object
string
Object type, always "chat.completion"
model
string
Model used for completion
choices
array
Array of completion choices (currently always length 1)
Each choice contains:
  • index: Choice index (always 0)
  • message: Response message with role and content
  • finish_reason: Why generation stopped ("stop", "length", etc.)
usage
object
Token usage statistics
Contains:
  • prompt_tokens: Input token count
  • completion_tokens: Generated token count
  • total_tokens: Sum of prompt and completion tokens

Example response

{
  "id": "chatcmpl-18a3f2b4c5d6e7f8",
  "object": "chat.completion",
  "model": "minicpm-sala-9b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform computations. Unlike classical bits that are either 0 or 1, quantum bits (qubits) can exist in multiple states simultaneously..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 156,
    "total_tokens": 168
  }
}
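The same request body can be assembled programmatically, and the response's usage accounting checked; a short sketch with values copied from the examples above (the payload mirrors the curl request):

```python
import json

# Build the documented request body for POST /v1/chat/completions.
payload = {
    "model": "minicpm-sala-9b",
    "messages": [
        {"role": "user", "content": "Explain quantum computing"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}
body = json.dumps(payload)

# The usage object in the response is self-consistent:
# total_tokens is the sum of prompt and completion tokens.
usage = {"prompt_tokens": 12, "completion_tokens": 156, "total_tokens": 168}
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(json.loads(body)["model"])  # → minicpm-sala-9b
```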

Model management endpoints

List models

Retrieve list of available models with metadata including path, size, quantization, and loaded status.
cURL
curl http://localhost:8080/v1/models

Response format

object
string
Object type, always "list"
data
array
Array of model objects

Model object fields

id
string
Model identifier (directory name)
object
string
Object type, always "model"
owned_by
string
Model owner (e.g., "mlx-community", "local")
path
string
Absolute path to model directory
loaded
boolean
Whether this model is currently loaded in the server
repo_id
string
HuggingFace repository ID (if downloaded via API)
quantization
object
Quantization configuration
Contains:
  • bits: Quantization bit width (4, 8, etc.)
  • group_size: Group size for quantization
size_bytes
integer
Total size of model files in bytes
downloaded_at
string
Unix timestamp (returned as a string) of when the model was downloaded

Example response

{
  "object": "list",
  "data": [
    {
      "id": "qwen3-asr-1.7b",
      "object": "model",
      "owned_by": "mlx-community",
      "path": "/Users/you/.ominix/models/qwen3-asr-1.7b",
      "loaded": true,
      "repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 2460000000,
      "downloaded_at": "1672531200"
    },
    {
      "id": "MiniCPM-SALA-9B-8bit",
      "object": "model",
      "owned_by": "moxin-org",
      "path": "/Users/you/.ominix/models/MiniCPM-SALA-9B-8bit",
      "loaded": false,
      "repo_id": "moxin-org/MiniCPM4-SALA-9B-8bit-mlx",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 9600000000
    }
  ]
}
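A client might summarize this response, for example by listing which models are loaded and formatting size_bytes for display. A minimal sketch over the two example entries above (decimal gigabytes are an arbitrary display choice, not part of the API):

```python
# Entries copied from the example /v1/models response, trimmed to the
# fields this sketch uses.
models = [
    {"id": "qwen3-asr-1.7b", "loaded": True, "size_bytes": 2_460_000_000},
    {"id": "MiniCPM-SALA-9B-8bit", "loaded": False, "size_bytes": 9_600_000_000},
]

def human_size(n: int) -> str:
    # size_bytes is a raw byte count; render it as decimal gigabytes.
    return f"{n / 1e9:.2f} GB"

loaded = [m["id"] for m in models if m["loaded"]]
print(loaded)                               # → ['qwen3-asr-1.7b']
print(human_size(models[1]["size_bytes"]))  # → 9.60 GB
```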

Download model

Download a model from HuggingFace Hub to the local models directory.
cURL
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit"}'

Request parameters

repo_id
string
required
HuggingFace repository ID in format owner/model-name
Example: "mlx-community/Qwen3-ASR-1.7B-8bit"

Response format

status
string
Download status, always "downloading"
id
string
Model identifier (extracted from repo_id)
repo_id
string
HuggingFace repository ID

Example response

{
  "status": "downloading",
  "id": "Qwen3-ASR-1.7B-8bit",
  "repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit"
}
The download runs asynchronously. Use GET /v1/models to check when the model is available.
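The id in the response appears to be the last path component of repo_id, as in the example above; a one-line sketch of that assumption:

```python
def model_id_from_repo(repo_id: str) -> str:
    # Assumption based on the example response: the local model id is the
    # last path component of the HuggingFace repo_id.
    return repo_id.rsplit("/", 1)[-1]

print(model_id_from_repo("mlx-community/Qwen3-ASR-1.7B-8bit"))  # → Qwen3-ASR-1.7B-8bit
```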

Downloaded files

The server downloads these essential files:
  • config.json - Model configuration
  • tokenizer.json - Tokenizer vocabulary
  • tokenizer_config.json - Tokenizer settings
  • model.safetensors - Model weights (single file)
  • model.safetensors.index.json + shards - Model weights (sharded)
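A quick local sanity check for a downloaded model directory can be sketched from this file list. The helper names here are illustrative, not part of the API; weights may be a single file or an index plus shards, so they are checked separately from the always-present metadata files:

```python
from pathlib import Path

# Metadata files that should always be present after a download.
ALWAYS_PRESENT = {"config.json", "tokenizer.json", "tokenizer_config.json"}

def missing_metadata(model_dir: str) -> set[str]:
    # Return the metadata files that are absent from model_dir.
    d = Path(model_dir)
    present = {p.name for p in d.iterdir()} if d.is_dir() else set()
    return ALWAYS_PRESENT - present

def has_weights(model_dir: str) -> bool:
    # Weights are either a single model.safetensors file or a
    # model.safetensors.index.json plus shard files.
    d = Path(model_dir)
    return (d / "model.safetensors").is_file() or \
           (d / "model.safetensors.index.json").is_file()
```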

Delete model

Delete a downloaded model from the local filesystem.
cURL
curl -X DELETE http://localhost:8080/v1/models/Qwen3-ASR-1.7B-8bit

Path parameters

id
string
required
Model identifier to delete (from model list)

Response format

id
string
Model identifier that was deleted
deleted
boolean
Always true if successful

Example response

{
  "id": "Qwen3-ASR-1.7B-8bit",
  "deleted": true
}
You cannot delete the currently loaded model. Stop the server or load a different model first.

Health endpoint

Health check

Check server health and readiness.
cURL
curl http://localhost:8080/health

Response format

{
  "status": "ok"
}
Returns 200 OK when the server is healthy and ready to accept requests.

Error responses

All endpoints return OpenAI-compatible error responses:
{
  "error": {
    "message": "Detailed error message",
    "type": "error_type",
    "code": "error_code"
  }
}
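A client can branch on the type field of this envelope; a minimal Python sketch over a sample error body (the retry heuristic shown is a suggestion, not documented server behavior):

```python
import json

# Sample error body using the documented envelope shape.
raw = ('{"error": {"message": "Unsupported audio file format",'
       ' "type": "invalid_request_error", "code": "invalid_audio_format"}}')

err = json.loads(raw)["error"]
# 4xx-class types indicate a bad request; only server_error (HTTP 500)
# responses are typically worth retrying.
retryable = err["type"] == "server_error"
print(err["code"], retryable)  # → invalid_audio_format False
```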

Error types

HTTP Status | Error Type            | Description
400         | invalid_request_error | Malformed request or invalid parameters
404         | not_found             | Endpoint or resource not found
409         | conflict              | Resource conflict (e.g., model already exists)
500         | server_error          | Internal server error or inference failure

Common error codes

Code                 | Description
invalid_audio_format | Unsupported audio file format
invalid_language     | Unsupported language specified
model_not_found      | Requested model not available
inference_failed     | Model inference error
download_failed      | Model download error
