
Overview

OminiX-API is a unified HTTP server that provides OpenAI-compatible endpoints for all OminiX model types. It’s designed as a single, extensible server that can handle LLM, VLM, ASR, TTS, and image generation models.
Currently, OminiX-API supports ASR (Speech Recognition) with Qwen3-ASR models. Support for LLM, VLM, and TTS is coming soon.

Installation

Build from source

cd OminiX-MLX
cargo build --release -p ominix-api

Features

The server uses Cargo feature flags to include model support:
| Feature | Description | Default |
| --- | --- | --- |
| asr | Enable speech recognition endpoints | |
Additional features (LLM, VLM, TTS) are coming soon and will be available in future releases.
Build with specific features:
cargo build --release -p ominix-api --features asr

Quick start

Download an ASR model

huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
  --local-dir ~/.OminiX/models/qwen3-asr-1.7b

Start the server

cargo run --release -p ominix-api -- \
  --asr-model ~/.OminiX/models/qwen3-asr-1.7b \
  --port 8080

Test transcription

curl http://localhost:8080/v1/audio/transcriptions \
  -F [email protected] \
  -F language=English
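
On success the server returns an OpenAI-style JSON body. The transcript text below is illustrative, not actual output:

```json
{
  "text": "Hello, this is a test recording."
}
```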

Configuration

Command-line options

--asr-model
string
Path to ASR model directory. Required when using ASR features. Example: ~/.OminiX/models/qwen3-asr-1.7b
--port
integer
default: 8080
HTTP server port
--models-dir
string
default: ~/.ominix/models
Directory for storing downloaded models

Environment variables

HF_TOKEN
string
HuggingFace API token for downloading gated models. Can also be read from ~/.cache/huggingface/token.

Architecture

Technology stack

| Component | Technology | Purpose |
| --- | --- | --- |
| HTTP Server | Hyper 1.x | Async HTTP/1.1 server |
| Runtime | Tokio | Async task execution |
| Serialization | serde_json | JSON request/response |
| Model Download | hf-hub | HuggingFace model fetching |
| Audio Encoding | base64 | Binary audio handling |
| CLI | clap | Argument parsing |

Crate structure

[dependencies]
# HTTP server
hyper = { version = "1", features = ["full"] }
hyper-util = { version = "0.1", features = ["tokio"] }
http-body-util = "0.1"
tokio = { version = "1", features = ["full"] }

# Model crates (feature-gated)
qwen3-asr-mlx = { path = "../qwen3-asr-mlx", optional = true }

Request handling flow

  1. HTTP Request → Hyper server receives request
  2. Route Matching → Path and method determine handler
  3. Request Parsing → JSON/multipart body deserialization
  4. Inference → Model processes input (ASR, LLM, etc.)
  5. Response Formatting → OpenAI-compatible JSON response
  6. HTTP Response → Hyper sends response with CORS headers
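
As a sketch of step 2, route matching in a Hyper service typically boils down to a match on method and path. The handler names below are hypothetical, and only the transcription route is documented here; the rest are illustrative placeholders:

```rust
// Hypothetical route table mirroring step 2 of the flow above.
// Only POST /v1/audio/transcriptions is documented; the other
// arms are illustrative placeholders, not OminiX-API's real code.
fn route(method: &str, path: &str) -> &'static str {
    match (method, path) {
        ("POST", "/v1/audio/transcriptions") => "asr_transcribe",
        ("GET", "/v1/models") => "list_models",
        _ => "not_found",
    }
}

fn main() {
    println!("{}", route("POST", "/v1/audio/transcriptions"));
}
```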

Model management

Configuration file

OminiX-API maintains a configuration file at ~/.ominix/config.json:
{
  "models_dir": "/Users/you/.ominix/models",
  "models": [
    {
      "id": "qwen3-asr-1.7b",
      "repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit",
      "path": "/Users/you/.ominix/models/qwen3-asr-1.7b",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 2460000000,
      "downloaded_at": "1672531200"
    }
  ]
}

Automatic model scanning

The server automatically scans the models_dir on startup and when listing models:
  • Detects new model directories with config.json
  • Removes entries for deleted models
  • Extracts quantization info from model config
  • Calculates model size from .safetensors files
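
The size calculation in the last bullet can be approximated by summing the .safetensors files in a model directory. This is a sketch of the idea, not the server's actual code:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Sketch: approximate a model's size_bytes by summing its
// .safetensors shards, as the scanner is described as doing above.
fn model_size_bytes(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("safetensors") {
            total += fs::metadata(&path)?.len();
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Example: total up .safetensors files in the current directory.
    println!("{} bytes", model_size_bytes(Path::new("."))?);
    Ok(())
}
```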

Supported models

Speech recognition (ASR)

| Model | Languages | Performance | Memory | Model ID |
| --- | --- | --- | --- | --- |
| Qwen3-ASR-1.7B-8bit | 30+ languages | 30x real-time | 2.5GB | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ASR-0.6B-8bit | 30+ languages | 50x real-time | 1.0GB | mlx-community/Qwen3-ASR-0.6B-8bit |
The 1.7B model provides better accuracy, while the 0.6B model offers faster transcription with slightly lower accuracy.

Supported languages

Qwen3-ASR models support:
  • Chinese (Mandarin)
  • English
  • Spanish, French, German, Italian, Portuguese
  • Japanese, Korean, Arabic, Russian
  • Hindi, Thai, Vietnamese, Indonesian
  • Plus 15+ additional languages
Specify language in transcription requests for optimal results.

Dependencies

OminiX-API requires:
  • macOS 14.0+ (Sonoma or later)
  • Apple Silicon (M1/M2/M3/M4)
  • Rust 1.82+
  • Xcode Command Line Tools
xcode-select --install

Error handling

The API returns OpenAI-compatible error responses:
{
  "error": {
    "message": "Audio file format not supported",
    "type": "invalid_request_error",
    "code": "invalid_audio_format"
  }
}
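
The error envelope above can be assembled in a few lines. The helper below is a dependency-free sketch using format! (the server itself serializes with serde_json), and it assumes the inputs need no JSON escaping:

```rust
// Sketch: building the OpenAI-compatible error envelope shown above.
// Hypothetical helper; assumes inputs contain no characters that
// require JSON escaping.
fn error_body(message: &str, err_type: &str, code: &str) -> String {
    format!(
        r#"{{"error":{{"message":"{}","type":"{}","code":"{}"}}}}"#,
        message, err_type, code
    )
}

fn main() {
    println!("{}", error_body(
        "Audio file format not supported",
        "invalid_request_error",
        "invalid_audio_format",
    ));
}
```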

Common error types

| Type | HTTP Status | Description |
| --- | --- | --- |
| invalid_request_error | 400 | Malformed request or invalid parameters |
| not_found | 404 | Endpoint or resource not found |
| conflict | 409 | Resource already exists (e.g., model already downloaded) |
| server_error | 500 | Internal server error or inference failure |

Performance tuning

Batch size

ASR inference processes audio in batches for optimal GPU utilization:
  • Automatic batching: Server determines optimal batch size
  • Memory consideration: Larger batches use more GPU memory
  • Throughput vs latency: Larger batches increase throughput but may increase latency

Concurrent requests

The server handles concurrent requests efficiently:
  • Async HTTP layer: Handles thousands of concurrent connections
  • Serial inference: Inference runs sequentially (single model instance)
  • Request queuing: Requests are queued if inference is busy
For high-throughput production deployments, consider running multiple server instances behind a load balancer.

Next steps

API endpoints

Complete endpoint reference with examples

Speech recognition

Learn more about ASR models and capabilities
