Overview
ominix-api provides a unified, OpenAI-compatible HTTP API server that exposes the OminiX-MLX model crates (ASR, TTS, VLM, LLM) through a single server process, making the models drop-in compatible with OpenAI client libraries.
The server implements the OpenAI API surface, so you can use standard OpenAI SDKs and tools with local OminiX-MLX models.
Installation
Build from source in the OminiX-MLX workspace.
Usage
Basic server startup
Making requests
Server configuration
Command-line arguments
- Path to the ASR model directory (e.g., ~/.OminiX/models/qwen3-asr-1.7b)
- Server port to bind to
- Directory for storing downloaded models
Environment variables
- HuggingFace API token for downloading gated models. Can also be stored in ~/.cache/huggingface/token
API endpoints
Audio transcription
POST /v1/audio/transcriptions
Transcribe audio to text using the Qwen3-ASR model
Request (multipart)
- Audio file to transcribe (WAV format recommended)
- Language of the audio (e.g., “English”, “Chinese”). If not specified, the model will auto-detect it
Request (JSON)
- Path to the audio file on the server
- Language of the audio content
Response
- The transcribed text from the audio file
- The detected or specified language
- Duration of the audio in seconds
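As a sketch of the JSON variant above, the request body can be assembled and sent like this. The `file` and `language` field names are assumed from the parameter descriptions, and `http://localhost:8080` stands in for wherever the server is actually bound:

```python
import json
import urllib.request

def build_transcription_body(file_path, language=None):
    """Assemble the JSON request body; field names assumed from the parameters above."""
    body = {"file": file_path}
    if language is not None:
        body["language"] = language
    return body

def transcribe(file_path, language=None, base_url="http://localhost:8080"):
    """POST the body to /v1/audio/transcriptions and decode the JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=json.dumps(build_transcription_body(file_path, language)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # expected keys: text, language, duration
```

On success the decoded response carries the transcribed text, the detected or specified language, and the audio duration, as documented above.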
Chat completions (planned)
POST /v1/chat/completions
Generate chat completions using LLM models
Request
- Array of message objects with role and content fields
- Model identifier to use for generation
- Maximum number of tokens to generate
- Sampling temperature between 0 and 2. Higher values make output more random
Response
- Unique identifier for this completion
- Always “chat.completion”
- The model used for generation
- Array of completion choices (typically one)
- Token usage statistics
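Although this endpoint is still planned, the request body should match the OpenAI chat-completions shape described above. A small sketch that assembles such a body and enforces the documented 0–2 temperature range (the model name in the usage note is a placeholder):

```python
def build_chat_body(model, messages, max_tokens=None, temperature=None):
    """Assemble an OpenAI-style chat completion request body."""
    body = {"model": model, "messages": messages}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if temperature is not None:
        # the doc specifies a sampling temperature between 0 and 2
        if not 0.0 <= temperature <= 2.0:
            raise ValueError("temperature must be between 0 and 2")
        body["temperature"] = temperature
    return body

# Example: build_chat_body("local-llm", [{"role": "user", "content": "Hello"}], max_tokens=64)
```

Omitted optional fields stay out of the body entirely, so the server's defaults apply.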
Text-to-speech (planned)
POST /v1/audio/speech
Generate speech audio from text using TTS models from the qwen3-tts-mlx crate
Request
- The text to convert to speech
- TTS model identifier
- Voice preset to use for generation
Response
Returns binary audio data (WAV or MP3 format) with the appropriate Content-Type header.
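Since the endpoint is planned, the following is only a sketch of what a client might look like. The `input` field name follows OpenAI's /v1/audio/speech convention; `model` and `voice` mirror the parameters above, and all field names are assumptions until the endpoint lands. The binary response is written straight to disk:

```python
import json
import urllib.request

def build_speech_body(text, model, voice=None):
    """Assemble the speech request body; field names assumed from the parameters above."""
    body = {"model": model, "input": text}
    if voice is not None:
        body["voice"] = voice
    return body

def synthesize(text, model, voice=None, out_path="speech.wav",
               base_url="http://localhost:8080"):
    """POST to /v1/audio/speech and save the binary audio response to out_path."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_speech_body(text, model, voice)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())  # WAV or MP3, per the Content-Type header
```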
Model management
GET /v1/models
List all available models with metadata
Response
- Always “list”
- Array of model objects
POST /v1/models/download
Download a model from HuggingFace Hub
Request
- HuggingFace repository ID (e.g., “OminiX/qwen3-asr-1.7b-8bit”)
Response
- Download status: “downloading”, “complete”, or “failed”
- Model identifier (derived from repo name)
- HuggingFace repository ID
DELETE /v1/models/{id}
Delete a downloaded model from disk
Response
- Model identifier that was deleted
- Always true on success
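The model-management endpoints above can be driven with plain HTTP calls. In this sketch the `repo_id` request field and the id-derivation rule (the segment after the organization name) are assumptions based on the descriptions above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # adjust to your server's port

def model_id_from_repo(repo_id):
    """Derive the model identifier from a repo id (assumed rule),
    e.g. "OminiX/qwen3-asr-1.7b-8bit" -> "qwen3-asr-1.7b-8bit"."""
    return repo_id.split("/")[-1]

def download_model(repo_id):
    """POST /v1/models/download; the "repo_id" field name is an assumption."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/models/download",
        data=json.dumps({"repo_id": repo_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # status, model id, repo id

def delete_model(model_id):
    """DELETE /v1/models/{id} to remove a downloaded model from disk."""
    req = urllib.request.Request(f"{BASE_URL}/v1/models/{model_id}", method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```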
Health check
GET /health
Check server health status
Response
- Always “ok” if the server is running
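A liveness probe can be as small as the sketch below; the `status` field name is an assumption based on the response description above, and an unreachable server is reported as unhealthy rather than raising:

```python
import json
import urllib.request

def is_healthy(base_url="http://localhost:8080"):
    """GET /health and check for the "ok" status; False if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return json.load(resp).get("status") == "ok"
    except OSError:  # URLError (connection refused, timeout) is an OSError subclass
        return False
```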
Features
The server supports feature flags to enable or disable specific model capabilities:
- Enable the ASR (Automatic Speech Recognition) endpoint via the qwen3-asr-mlx crate
Dependencies
The server is built on the following core dependencies:
HTTP server
- hyper v1 - High-performance HTTP server
- tokio v1 - Async runtime
- http-body-util - HTTP body utilities
Serialization
- serde + serde_json - JSON serialization
Model integration
- qwen3-asr-mlx - ASR model implementation (feature-gated)
- hf-hub - HuggingFace model downloading
Utilities
- clap - Command-line argument parsing
- base64 - Audio encoding/decoding
- dirs - System directory paths
Configuration file
The server maintains a configuration file at ~/.ominix/config.json.
Error handling
All API endpoints follow the OpenAI error response format.
Error types
- invalid_request_error - Malformed request or invalid parameters
- server_error - Internal server error during processing
- not_found - Resource not found
- conflict - Operation conflicts with current state
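In the OpenAI format, errors arrive as a JSON object under an `error` key with `message` and `type` fields. A small helper to unpack them (the body in the test is illustrative, not a captured server response):

```python
def parse_error(body):
    """Extract (type, message) from an OpenAI-style error body:
    {"error": {"type": "...", "message": "...", ...}}"""
    err = body.get("error", {})
    return err.get("type", "server_error"), err.get("message", "")
```

The returned type maps onto the error types listed above, so clients can branch on it (e.g., retry on server_error, surface invalid_request_error to the user).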
CORS support
The server includes CORS headers to allow cross-origin requests:
- Access-Control-Allow-Origin: *
- Access-Control-Allow-Methods: GET, POST, DELETE, OPTIONS
- Access-Control-Allow-Headers: Content-Type, Authorization
Performance considerations
Model loading
Models are loaded into memory on server startup and persist for the lifetime of the server process. Loading large models may take several seconds.
Inference threading
Inference requests are processed sequentially on a dedicated worker thread to avoid GPU contention. For concurrent requests, consider running multiple server instances on different ports.
Memory usage
Each loaded model consumes GPU memory proportional to its size. Ensure your system has sufficient memory for the models you plan to use:
- 1.7B 8-bit model: ~2GB
- 9B 8-bit model: ~9GB
- 72B 4-bit model: ~36GB
Example integration
Python (OpenAI SDK)
JavaScript (OpenAI SDK)
cURL
License
MIT OR Apache-2.0
Related crates
- qwen3-asr-mlx - ASR model implementation