Overview
OminiX-API is a unified HTTP server that provides OpenAI-compatible endpoints for all OminiX model types. It is designed as a single, extensible server that can handle LLM, VLM, ASR, TTS, and image generation models.

Currently, OminiX-API supports ASR (speech recognition) with Qwen3-ASR models. Support for LLM, VLM, and TTS is coming soon.
Installation
Build from source
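The project builds with Cargo. A typical release build might look like the following; the repository URL is a placeholder, and the feature flag matches the Features table below:

```shell
# Clone the repository (URL placeholder) and build in release mode
git clone https://github.com/<org>/OminiX-API.git
cd OminiX-API

# asr is enabled by default, shown here for explicitness
cargo build --release --features asr
```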
Features
The server uses Cargo feature flags to include model support:

| Feature | Description | Default |
|---|---|---|
| `asr` | Enable speech recognition endpoints | ✓ |
Additional features (LLM, VLM, TTS) are coming soon and will be available in future releases.
Quick start
Download an ASR model
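One way to fetch a model locally is with the Hugging Face CLI. The model ID comes from the Supported models table below; the target directory follows the example path used elsewhere on this page (the exact layout OminiX-API expects is an assumption):

```shell
# Download the 8-bit 1.7B ASR model into the local models directory
huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
  --local-dir ~/.OminiX/models/qwen3-asr-1.7b
```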
Start the server
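A minimal invocation, assuming the flag names suggested by the Command-line options section (binary name and flag spellings are assumptions; check `--help` for the exact forms):

```shell
# Start OminiX-API serving the downloaded ASR model (names assumed)
ominix-api \
  --asr-model ~/.OminiX/models/qwen3-asr-1.7b \
  --port 8080
```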
Test transcription
- Multipart upload
- JSON file path
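Both request styles can be sketched with curl. The endpoint path follows the OpenAI audio transcription convention, and the JSON field names are assumptions:

```shell
# Multipart upload (OpenAI-style endpoint path assumed)
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr-1.7b

# JSON body pointing at a local file path (field names assumed)
curl http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-asr-1.7b", "file": "/path/to/sample.wav"}'
```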
Configuration
Command-line options
- Path to the ASR model directory. Required when using ASR features. Example: `~/.OminiX/models/qwen3-asr-1.7b`
- HTTP server port
- Directory for storing downloaded models
Environment variables
HuggingFace API token for downloading gated models. Can also be read from `~/.cache/huggingface/token`.

Architecture
Technology stack
| Component | Technology | Purpose |
|---|---|---|
| HTTP Server | Hyper 1.x | Async HTTP/1.1 server |
| Runtime | Tokio | Async task execution |
| Serialization | serde_json | JSON request/response |
| Model Download | hf-hub | HuggingFace model fetching |
| Audio Encoding | base64 | Binary audio handling |
| CLI | clap | Argument parsing |
Crate structure
Request handling flow
- HTTP Request → Hyper server receives request
- Route Matching → Path and method determine handler
- Request Parsing → JSON/multipart body deserialization
- Inference → Model processes input (ASR, LLM, etc.)
- Response Formatting → OpenAI-compatible JSON response
- HTTP Response → Hyper sends response with CORS headers
Model management
Configuration file
OminiX-API maintains a configuration file at `~/.ominix/config.json`:
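An illustrative shape for this file, with field names assumed from the automatic scanning behavior described below (entries tracking path, quantization, and size):

```json
{
  "models_dir": "~/.OminiX/models",
  "models": [
    {
      "id": "mlx-community/Qwen3-ASR-1.7B-8bit",
      "path": "~/.OminiX/models/qwen3-asr-1.7b",
      "quantization": "8bit",
      "size_bytes": 2500000000
    }
  ]
}
```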
Automatic model scanning
The server automatically scans the `models_dir` on startup and when listing models:
- Detects new model directories containing a `config.json`
- Removes entries for deleted models
- Extracts quantization info from the model config
- Calculates model size from `.safetensors` files
Supported models
Speech recognition (ASR)
| Model | Languages | Performance | Memory | Model ID |
|---|---|---|---|---|
| Qwen3-ASR-1.7B-8bit | 30+ languages | 30x real-time | 2.5GB | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ASR-0.6B-8bit | 30+ languages | 50x real-time | 1.0GB | mlx-community/Qwen3-ASR-0.6B-8bit |
The 1.7B model provides better accuracy, while the 0.6B model offers faster transcription with slightly lower accuracy.
Supported languages
Qwen3-ASR models support:

- Chinese (Mandarin)
- English
- Spanish, French, German, Italian, Portuguese
- Japanese, Korean, Arabic, Russian
- Hindi, Thai, Vietnamese, Indonesian
- Plus 15+ additional languages
Dependencies
OminiX-API requires:

- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Rust 1.82+
- Xcode Command Line Tools
Error handling
The API returns OpenAI-compatible error responses.

Common error types
| Type | HTTP Status | Description |
|---|---|---|
| `invalid_request_error` | 400 | Malformed request or invalid parameters |
| `not_found` | 404 | Endpoint or resource not found |
| `conflict` | 409 | Resource already exists (e.g., model already downloaded) |
| `server_error` | 500 | Internal server error or inference failure |
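An OpenAI-compatible error body typically takes this shape (illustrative message and fields):

```json
{
  "error": {
    "message": "Model 'qwen3-asr-1.7b' not found",
    "type": "not_found",
    "code": 404
  }
}
```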
Performance tuning
Batch size
ASR inference processes audio in batches for optimal GPU utilization:

- Automatic batching: Server determines optimal batch size
- Memory consideration: Larger batches use more GPU memory
- Throughput vs latency: Larger batches increase throughput but may increase latency
Concurrent requests
The server handles concurrent requests efficiently:

- Async HTTP layer: Handles thousands of concurrent connections
- Serial inference: Inference runs sequentially (single model instance)
- Request queuing: Requests are queued if inference is busy
Next steps
- API endpoints: complete endpoint reference with examples
- Speech recognition: learn more about ASR models and capabilities