Introduction

OminiX-MLX provides OpenAI-compatible API servers for running machine learning models on Apple Silicon. The API architecture enables easy integration with existing tools and frameworks while leveraging the performance benefits of MLX on M-series chips.

Server implementations

OminiX-MLX includes multiple API server implementations:

OminiX-API (unified server)

A unified HTTP server for all OminiX model types. Currently supports:
  • ASR (Speech Recognition): Qwen3-ASR with support for 30+ languages
  • Extensible: Designed for LLM, VLM, and TTS support

OminiX-API server

Learn about the unified API server architecture and capabilities

Model-specific servers

Some model crates include their own OpenAI-compatible API servers:
  • MiniCPM-SALA: Full chat completion API with model management
  • Future: Additional model-specific servers planned

Key features

OpenAI-compatible

Drop-in replacement for OpenAI API endpoints with familiar request/response formats

Metal acceleration

GPU-optimized inference using Apple’s Metal framework for maximum performance

Model management

Built-in model download, listing, and deletion via HTTP API

Async runtime

Tokio-based async server with dedicated inference workers for high throughput

Common capabilities

All API servers in the OminiX-MLX ecosystem share common patterns:

Health monitoring

All servers expose a /health endpoint for monitoring and readiness checks:
curl http://localhost:8080/health

Model management

Servers support listing available models and downloading from Hugging Face:
# List models
curl http://localhost:8080/v1/models

# Download a model
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "mlx-community/Qwen3-ASR-1.7B-8bit"}'

# Delete a model
curl -X DELETE http://localhost:8080/v1/models/Qwen3-ASR-1.7B-8bit

CORS support

All servers include CORS headers for web application integration:
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, DELETE, OPTIONS
Access-Control-Allow-Headers: Content-Type, Authorization

Performance characteristics

| Model Type | Model                | Throughput    | Memory |
|------------|----------------------|---------------|--------|
| ASR        | Qwen3-ASR-1.7B-8bit  | 30x real-time | 2.5GB  |
| ASR        | Qwen3-ASR-0.6B-8bit  | 50x real-time | 1.0GB  |
| LLM        | MiniCPM-SALA-9B-8bit | 28 tok/s      | 9.6GB  |
| LLM        | Qwen3-4B             | 45 tok/s      | 8GB    |

Performance benchmarks measured on Apple M3 Max with 128GB RAM. Actual performance varies based on hardware configuration and quantization settings.

Architecture

Request flow

┌─────────────┐
│   Client    │
│  (HTTP)     │
└──────┬──────┘
       │ POST /v1/...

┌─────────────────────────────────────────┐
│         Hyper HTTP Server               │
│         (Tokio Runtime)                 │
└──────┬──────────────────────────────────┘
       │ async channel

┌─────────────────────────────────────────┐
│      Inference Worker Thread            │
│  (dedicated blocking thread)            │
├─────────────────────────────────────────┤
│  • Load model (cached)                  │
│  • Tokenize input                       │
│  • Forward pass (Metal GPU)             │
│  • Sample tokens                        │
│  • Decode output                        │
└──────┬──────────────────────────────────┘
       │ result

┌─────────────────────────────────────────┐
│        JSON Response                    │
│    (OpenAI-compatible)                  │
└─────────────────────────────────────────┘

Key design principles

  1. Separation of concerns: HTTP handling (async) is separate from inference (blocking)
  2. Resource efficiency: Single model instance shared across requests
  3. Zero-copy: Unified memory architecture minimizes data transfers
  4. Lazy evaluation: MLX automatically optimizes compute graphs

Next steps

OminiX-API server

Explore the unified API server implementation

API endpoints

View complete endpoint reference with examples
