Introduction
OminiX-MLX provides OpenAI-compatible API servers for running machine learning models on Apple Silicon. The API architecture enables easy integration with existing tools and frameworks while leveraging the performance benefits of MLX on M-series chips.

Server implementations
OminiX-MLX includes multiple API server implementations:

OminiX-API (unified server)
A unified HTTP server for all OminiX model types. Currently supports:
- ASR (Speech Recognition): Qwen3-ASR with 30+ languages
- Extensible: Designed for LLM, VLM, and TTS support
OminiX-API server
Learn about the unified API server architecture and capabilities
Model-specific servers
Some model crates include their own OpenAI-compatible API servers:
- MiniCPM-SALA: Full chat completion API with model management
- Future: Additional model-specific servers planned
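Because these servers follow the OpenAI request/response conventions, any standard HTTP client works. The sketch below uses only the Python standard library; the base URL, port, and model name are illustrative assumptions (not confirmed defaults), and it assumes the server exposes the standard OpenAI `/v1/chat/completions` path.

```python
import json
from urllib import request

# Illustrative values: substitute your server's actual address and a model
# it has downloaded. Neither is a confirmed default of OminiX-MLX.
BASE_URL = "http://localhost:8080"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST to the standard OpenAI chat path and return the reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode()
    req = request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style response shape: first choice's message content.
    return body["choices"][0]["message"]["content"]
```

Because the payload shape matches OpenAI's, existing OpenAI client libraries can usually be pointed at the server simply by overriding their base URL.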
Key features
OpenAI-compatible
Drop-in replacement for OpenAI API endpoints with familiar request/response formats
Metal acceleration
GPU-optimized inference using Apple’s Metal framework for maximum performance
Model management
Built-in model download, listing, and deletion via HTTP API
Async runtime
Tokio-based async server with dedicated inference workers for high throughput
Common capabilities
All API servers in the OminiX-MLX ecosystem share common patterns:

Health monitoring
All servers expose a /health endpoint for monitoring and readiness checks.
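A readiness probe needs nothing beyond the standard library. A minimal sketch, assuming the server listens on localhost:8080 (adjust to your deployment):

```python
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed address; adjust to your deployment

def is_healthy(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False

if __name__ == "__main__":
    print("server healthy:", is_healthy())
```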
Model management
Servers support listing available models and downloading them from HuggingFace.

CORS support
All servers include CORS headers for web application integration.

Performance characteristics
| Model Type | Model Size | Throughput | Memory |
|---|---|---|---|
| ASR | Qwen3-ASR-1.7B-8bit | 30x real-time | 2.5GB |
| ASR | Qwen3-ASR-0.6B-8bit | 50x real-time | 1.0GB |
| LLM | MiniCPM-SALA-9B-8bit | 28 tok/s | 9.6GB |
| LLM | Qwen3-4B | 45 tok/s | 8GB |
Performance benchmarks measured on Apple M3 Max with 128GB RAM. Actual performance varies based on hardware configuration and quantization settings.
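Real-time factors map directly to expected latency: an N× real-time model transcribes t seconds of audio in roughly t/N seconds of wall-clock time. A trivial helper:

```python
def transcription_time(audio_seconds: float, real_time_factor: float) -> float:
    """Approximate wall-clock seconds to transcribe audio at a given RTF."""
    return audio_seconds / real_time_factor

# A 10-minute recording with the 1.7B model's 30x real-time throughput:
print(transcription_time(600, 30))  # → 20.0
```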
Architecture
Request flow
Key design principles
- Separation of concerns: HTTP handling (async) is separate from inference (blocking)
- Resource efficiency: Single model instance shared across requests
- Zero-copy: Unified memory architecture minimizes data transfers
- Lazy evaluation: MLX automatically optimizes compute graphs
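The first two principles can be sketched as a toy version of the pattern: async handlers enqueue jobs for a single dedicated worker thread that owns the model, mirroring the Tokio-worker design described above. The sketch below uses Python's asyncio for brevity, and the stand-in inference function is hypothetical:

```python
import asyncio
import queue
import threading

# Toy sketch of the async-HTTP / blocking-inference split: async handlers
# enqueue jobs; one dedicated worker thread owns the (single, shared) model.
jobs: queue.Queue = queue.Queue()

def run_inference(prompt: str) -> str:
    # Stand-in for a blocking model forward pass (hypothetical).
    return f"echo: {prompt}"

def inference_worker() -> None:
    """Dedicated thread: pull jobs FIFO and resolve each request's future."""
    while True:
        prompt, fut, loop = jobs.get()
        result = run_inference(prompt)
        # Hand the result back to the event-loop thread safely.
        loop.call_soon_threadsafe(fut.set_result, result)

threading.Thread(target=inference_worker, daemon=True).start()

async def handle_request(prompt: str) -> str:
    """Async handler: never blocks the event loop while inference runs."""
    loop = asyncio.get_running_loop()
    fut: asyncio.Future = loop.create_future()
    jobs.put((prompt, fut, loop))
    return await fut

async def main() -> None:
    print(await asyncio.gather(*(handle_request(p) for p in ("hi", "there"))))

if __name__ == "__main__":
    asyncio.run(main())  # → ['echo: hi', 'echo: there']
```

Keeping inference on one thread means concurrent HTTP requests never contend for the model; they simply queue, which is what makes a single shared model instance safe.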
Next steps
OminiX-API server
Explore the unified API server implementation
API endpoints
View complete endpoint reference with examples