Overview
OminiX-API is a unified HTTP server that provides OpenAI-compatible endpoints for all OminiX model types. It is designed as a single, extensible server that can handle LLM, VLM, ASR, TTS, and image generation models.

Currently, OminiX-API supports ASR (speech recognition) with Qwen3-ASR models. Support for LLM, VLM, and TTS is coming soon.
Installation
Build from source
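The project builds with Cargo. A typical release build might look like the following; the repository URL is a placeholder, and the feature flag matches the Features table below:

```shell
# Clone the repository (URL placeholder) and build in release mode
git clone https://github.com/<org>/OminiX-API.git
cd OminiX-API

# asr is enabled by default, shown here for explicitness
cargo build --release --features asr
```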
Features
The server uses Cargo feature flags to include model support:

| Feature | Description | Default |
|---|---|---|
| `asr` | Enable speech recognition endpoints | ✓ |
Additional features (LLM, VLM, TTS) are coming soon and will be available in future releases.
Quick start
Download an ASR model
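One way to fetch a model locally is with the Hugging Face CLI. The model ID comes from the Supported models table below; the target directory follows the example path used elsewhere on this page (the exact layout OminiX-API expects is an assumption):

```shell
# Download the 8-bit 1.7B ASR model into the local models directory
huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
  --local-dir ~/.OminiX/models/qwen3-asr-1.7b
```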
Start the server
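A minimal invocation, assuming the flag names suggested by the Command-line options section (binary name and flag spellings are assumptions; check `--help` for the exact forms):

```shell
# Start OminiX-API serving the downloaded ASR model (names assumed)
ominix-api \
  --asr-model ~/.OminiX/models/qwen3-asr-1.7b \
  --port 8080
```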
Test transcription
- Multipart upload
- JSON file path
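Both request styles can be sketched with curl. The endpoint path follows the OpenAI audio transcription convention, and the JSON field names are assumptions:

```shell
# Multipart upload (OpenAI-style endpoint path assumed)
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr-1.7b

# JSON body pointing at a local file path (field names assumed)
curl http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-asr-1.7b", "file": "/path/to/sample.wav"}'
```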
Configuration
Command-line options
- Path to the ASR model directory. Required when using ASR features. Example: `~/.OminiX/models/qwen3-asr-1.7b`
- HTTP server port
- Directory for storing downloaded models
Environment variables
HuggingFace API token for downloading gated models. Can also be read from `~/.cache/huggingface/token`.

Architecture
Technology stack
| Component | Technology | Purpose |
|---|---|---|
| HTTP Server | Hyper 1.x | Async HTTP/1.1 server |
| Runtime | Tokio | Async task execution |
| Serialization | serde_json | JSON request/response |
| Model Download | hf-hub | HuggingFace model fetching |
| Audio Encoding | base64 | Binary audio handling |
| CLI | clap | Argument parsing |
Crate structure
Request handling flow
- HTTP Request → Hyper server receives request
- Route Matching → Path and method determine handler
- Request Parsing → JSON/multipart body deserialization
- Inference → Model processes input (ASR, LLM, etc.)
- Response Formatting → OpenAI-compatible JSON response
- HTTP Response → Hyper sends response with CORS headers
Model management
Configuration file
OminiX-API maintains a configuration file at `~/.ominix/config.json`:
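An illustrative shape for this file, with field names assumed from the automatic scanning behavior described below (entries tracking path, quantization, and size):

```json
{
  "models_dir": "~/.OminiX/models",
  "models": [
    {
      "id": "mlx-community/Qwen3-ASR-1.7B-8bit",
      "path": "~/.OminiX/models/qwen3-asr-1.7b",
      "quantization": "8bit",
      "size_bytes": 2500000000
    }
  ]
}
```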
Automatic model scanning
The server automatically scans the `models_dir` on startup and when listing models:
- Detects new model directories containing a `config.json`
- Removes entries for deleted models
- Extracts quantization info from the model config
- Calculates model size from `.safetensors` files
Supported models
Speech recognition (ASR)
| Model | Languages | Performance | Memory | Model ID |
|---|---|---|---|---|
| Qwen3-ASR-1.7B-8bit | 30+ languages | 30x real-time | 2.5GB | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ASR-0.6B-8bit | 30+ languages | 50x real-time | 1.0GB | mlx-community/Qwen3-ASR-0.6B-8bit |
The 1.7B model provides better accuracy, while the 0.6B model offers faster transcription with slightly lower accuracy.
Supported languages
Qwen3-ASR models support:

- Chinese (Mandarin)
- English
- Spanish, French, German, Italian, Portuguese
- Japanese, Korean, Arabic, Russian
- Hindi, Thai, Vietnamese, Indonesian
- Plus 15+ additional languages
Dependencies
OminiX-API requires:

- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Rust 1.82+
- Xcode Command Line Tools
Error handling
The API returns OpenAI-compatible error responses.

Common error types
| Type | HTTP Status | Description |
|---|---|---|
| `invalid_request_error` | 400 | Malformed request or invalid parameters |
| `not_found` | 404 | Endpoint or resource not found |
| `conflict` | 409 | Resource already exists (e.g., model already downloaded) |
| `server_error` | 500 | Internal server error or inference failure |
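An OpenAI-compatible error body typically takes this shape (illustrative message and fields):

```json
{
  "error": {
    "message": "Model 'qwen3-asr-1.7b' not found",
    "type": "not_found",
    "code": 404
  }
}
```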
Performance tuning
Batch size
ASR inference processes audio in batches for optimal GPU utilization:

- Automatic batching: Server determines optimal batch size
- Memory consideration: Larger batches use more GPU memory
- Throughput vs latency: Larger batches increase throughput but may increase latency
Concurrent requests
The server handles concurrent requests efficiently:

- Async HTTP layer: Handles thousands of concurrent connections
- Serial inference: Inference runs sequentially (single model instance)
- Request queuing: Requests are queued if inference is busy
Next steps
- API endpoints: complete endpoint reference with examples
- Speech recognition: learn more about ASR models and capabilities