This crate is a work in progress: several of the endpoints described below are planned but not yet implemented.

Overview

ominix-api provides a unified, OpenAI-compatible HTTP API server that exposes the OminiX-MLX model crates (ASR, TTS, VLM, LLM) through a single server process. Because it implements the OpenAI API surface, local OminiX-MLX models are drop-in compatible with standard OpenAI client libraries and tools.

Installation

Build from source in the OminiX-MLX workspace:
cargo build --release -p ominix-api

Usage

Basic server startup

# Start server with ASR feature (enabled by default)
cargo run --release -p ominix-api -- \
    --asr-model ~/.OminiX/models/qwen3-asr-1.7b \
    --port 8080

Making requests

# OpenAI Whisper-compatible multipart upload
curl http://localhost:8080/v1/audio/transcriptions \
    -F [email protected] \
    -F language=Chinese

Server configuration

Command-line arguments

asr-model
string
Path to the ASR model directory (e.g., ~/.OminiX/models/qwen3-asr-1.7b)
port
number
default:"8080"
Server port to bind to
models-dir
string
default:"~/.ominix/models"
Directory for storing downloaded models

Environment variables

HF_TOKEN
string
HuggingFace API token for downloading gated models. Can also be stored in ~/.cache/huggingface/token

API endpoints

Audio transcription

POST /v1/audio/transcriptions

Transcribe audio to text using the Qwen3-ASR model

Request (multipart)

file
file
required
Audio file to transcribe (WAV format recommended)
language
string
Language of the audio (e.g., “English”, “Chinese”). If not specified, the model will auto-detect

Request (JSON)

file_path
string
required
Path to the audio file on the server
language
string
Language of the audio content
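
The JSON variant above can be assembled with nothing but the standard library. A minimal sketch (the file path is hypothetical and must be readable by the machine running the server; the request is built but not sent):

```python
import json
import urllib.request

# JSON-body variant of POST /v1/audio/transcriptions,
# using the two documented fields.
payload = {"file_path": "/tmp/audio.wav", "language": "Chinese"}
req = urllib.request.Request(
    "http://localhost:8080/v1/audio/transcriptions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would dispatch it against a running server.
print(req.get_full_url(), req.get_method())
```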

Response

text
string
The transcribed text from the audio file
language
string
The detected or specified language
duration
number
Duration of the audio in seconds
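
A successful response is plain JSON with the three fields above. A sketch of consuming it (the sample values are illustrative, not real model output):

```python
import json

# Illustrative transcription response following the documented fields.
raw = '{"text": "hello world", "language": "English", "duration": 3.2}'
resp = json.loads(raw)
print(f"[{resp['language']}, {resp['duration']}s] {resp['text']}")
```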

Chat completions (planned)

POST /v1/chat/completions

Generate chat completions using LLM models.
This endpoint is not yet implemented; once available, it will follow the OpenAI chat-completions schema.

Request

messages
array
required
Array of message objects with role and content fields
model
string
default:"minicpm-sala-9b"
Model identifier to use for generation
max_tokens
number
default:"2048"
Maximum number of tokens to generate
temperature
number
default:"0.7"
Sampling temperature between 0 and 2. Higher values make output more random
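
A request body for this planned endpoint can be built by merging caller overrides onto the documented defaults. A sketch (the helper function is illustrative, not part of any SDK):

```python
import json

# Documented defaults for the planned /v1/chat/completions endpoint.
DEFAULTS = {"model": "minicpm-sala-9b", "max_tokens": 2048, "temperature": 0.7}

def chat_request(messages, **overrides):
    """Merge caller overrides onto the documented defaults."""
    return json.dumps({**DEFAULTS, **overrides, "messages": messages})

body = chat_request([{"role": "user", "content": "Hello"}], temperature=0.2)
print(body)
```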

Response

id
string
Unique identifier for this completion
object
string
Always “chat.completion”
model
string
The model used for generation
choices
array
Array of completion choices (typically one)
usage
object
Token usage statistics
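
Extracting the generated text means indexing into choices. A sketch with a made-up response; the inner shape of each choice (index, message, finish_reason) is assumed to follow the standard OpenAI schema, which this page does not spell out:

```python
import json

# Illustrative chat.completion response; all values are invented.
raw = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "minicpm-sala-9b",
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Hi!"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
})
resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
print(answer)  # Hi!
```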

Text-to-speech (planned)

POST /v1/audio/speech

Generate speech audio from text using TTS models
This endpoint is planned for integration with the qwen3-tts-mlx crate.

Request

input
string
required
The text to convert to speech
model
string
default:"qwen3-tts"
TTS model identifier
voice
string
Voice preset to use for generation

Response

Returns binary audio data (WAV or MP3 format) with appropriate Content-Type header.
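
Since the body is raw bytes, a client should pick the output filename from the Content-Type header rather than assuming a format. A sketch (the MIME-type-to-extension mapping is an assumption; the page only names WAV and MP3):

```python
# Choose a file extension from the Content-Type of a /v1/audio/speech response.
def audio_extension(content_type):
    mapping = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}
    return mapping.get(content_type.split(";")[0].strip(), ".bin")

print(audio_extension("audio/wav"))   # .wav
print(audio_extension("audio/mpeg"))  # .mp3
```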

Model management

GET /v1/models

List all available models with metadata

Response

object
string
Always “list”
data
array
Array of model objects

POST /v1/models/download

Download a model from HuggingFace Hub

Request

repo_id
string
required
HuggingFace repository ID (e.g., “OminiX/qwen3-asr-1.7b-8bit”)

Response

status
string
Download status: “downloading”, “complete”, or “failed”
id
string
Model identifier (derived from repo name)
repo_id
string
HuggingFace repository ID
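
Because the status field has three documented values, a client polling this endpoint only needs to distinguish the terminal ones. A sketch (the helper is illustrative):

```python
# The documented download statuses; anything else is unexpected.
TERMINAL = {"complete", "failed"}

def download_finished(resp):
    """Return True once a /v1/models/download response reports a terminal state."""
    status = resp["status"]
    if status not in {"downloading"} | TERMINAL:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL

print(download_finished({"status": "downloading",
                         "id": "qwen3-asr-1.7b-8bit",
                         "repo_id": "OminiX/qwen3-asr-1.7b-8bit"}))  # False
```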

DELETE /v1/models/{id}

Delete a downloaded model from disk

Response

id
string
Model identifier that was deleted
deleted
boolean
Always true on success

Health check

GET /health

Check server health status

Response

status
string
Always “ok” if server is running

Features

The server supports feature flags to enable/disable specific model capabilities:
asr
feature
default:"enabled"
Enable ASR (Automatic Speech Recognition) endpoint via qwen3-asr-mlx crate
Build with specific features:
# ASR only (default)
cargo build --release -p ominix-api

# Disable ASR
cargo build --release -p ominix-api --no-default-features

Dependencies

The server is built on the following core dependencies:

HTTP server

  • hyper v1 - High-performance HTTP server
  • tokio v1 - Async runtime
  • http-body-util - HTTP body utilities

Serialization

  • serde + serde_json - JSON serialization

Model integration

  • qwen3-asr-mlx - ASR model implementation (feature-gated)
  • hf-hub - HuggingFace model downloading

Utilities

  • clap - Command-line argument parsing
  • base64 - Audio encoding/decoding
  • dirs - System directory paths

Configuration file

The server maintains a configuration file at ~/.ominix/config.json:
{
  "models_dir": "/Users/username/.ominix/models",
  "models": [
    {
      "id": "qwen3-asr-1.7b-8bit",
      "repo_id": "OminiX/qwen3-asr-1.7b-8bit",
      "path": "/Users/username/.ominix/models/qwen3-asr-1.7b-8bit",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 1789231104,
      "downloaded_at": "2024-03-15T10:30:00Z"
    }
  ]
}
The configuration is automatically updated when models are downloaded or deleted through the API.
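
The file is ordinary JSON, so tooling can read it directly. A sketch that parses the sample above (pasted verbatim rather than read from disk, so it runs anywhere):

```python
import json

# Parse the sample ~/.ominix/config.json shown above and list models.
config = json.loads('''{
  "models_dir": "/Users/username/.ominix/models",
  "models": [
    {
      "id": "qwen3-asr-1.7b-8bit",
      "repo_id": "OminiX/qwen3-asr-1.7b-8bit",
      "path": "/Users/username/.ominix/models/qwen3-asr-1.7b-8bit",
      "quantization": {"bits": 8, "group_size": 64},
      "size_bytes": 1789231104,
      "downloaded_at": "2024-03-15T10:30:00Z"
    }
  ]
}''')
for m in config["models"]:
    print(m["id"], f'{m["size_bytes"] / 1e9:.2f} GB')
```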

Error handling

All API endpoints follow OpenAI error response format:
{
  "error": {
    "message": "Description of what went wrong",
    "type": "invalid_request_error"
  }
}

Error types

  • invalid_request_error - Malformed request or invalid parameters
  • server_error - Internal server error during processing
  • not_found - Resource not found
  • conflict - Operation conflicts with current state
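
One way for a client to surface these is to map each documented type onto an exception. A sketch (the exception choices are one reasonable convention, not prescribed by the server):

```python
# Map the documented error types onto Python exceptions.
ERROR_TYPES = {
    "invalid_request_error": ValueError,
    "server_error": RuntimeError,
    "not_found": KeyError,
    "conflict": RuntimeError,
}

def raise_for_error(body):
    """Raise if a response body carries the OpenAI-style error envelope."""
    if "error" in body:
        err = body["error"]
        exc = ERROR_TYPES.get(err.get("type"), RuntimeError)
        raise exc(err.get("message", "unknown error"))

raise_for_error({"text": "ok"})  # no error envelope: returns None
```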

CORS support

The server includes CORS headers to allow cross-origin requests:
  • Access-Control-Allow-Origin: *
  • Access-Control-Allow-Methods: GET, POST, DELETE, OPTIONS
  • Access-Control-Allow-Headers: Content-Type, Authorization

Performance considerations

Model loading

Models are loaded into memory on server startup and persist for the lifetime of the server process. Loading large models may take several seconds.

Inference threading

Inference requests are processed sequentially on a dedicated worker thread to avoid GPU contention. For concurrent requests, consider running multiple server instances with different ports.

Memory usage

Each loaded model consumes GPU memory proportional to its size. Ensure your system has sufficient memory for the models you plan to use:
  • 1.7B 8-bit model: ~2GB
  • 9B 8-bit model: ~9GB
  • 72B 4-bit model: ~36GB
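
The figures above track a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, with some extra headroom for activations (which is why the 1.7B figure is quoted as ~2GB rather than 1.7GB). A sketch of the arithmetic:

```python
# Approximate weight memory in GB for a quantized model:
# parameters (billions) x bits per weight / 8 bits per byte.
def weight_gb(params_billions, bits):
    return params_billions * bits / 8

print(weight_gb(1.7, 8))  # 1.7, quoted above as ~2GB with overhead
print(weight_gb(9, 8))    # 9.0
print(weight_gb(72, 4))   # 36.0
```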

Example integration

Python (OpenAI SDK)

from openai import OpenAI

# Point to local OminiX-API server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1"
)

# Transcribe audio
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="English"
    )
print(transcription.text)

JavaScript (OpenAI SDK)

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'http://localhost:8080/v1'
});

// Transcribe audio
const transcription = await client.audio.transcriptions.create({
  model: 'qwen3-asr',
  file: fs.createReadStream('audio.wav'),
  language: 'English'
});
console.log(transcription.text);

cURL

# List available models
curl http://localhost:8080/v1/models

# Download a model
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "OminiX/qwen3-asr-1.7b-8bit"}'

# Transcribe audio
curl http://localhost:8080/v1/audio/transcriptions \
  -F [email protected] \
  -F language=English

# Health check
curl http://localhost:8080/health

License

MIT OR Apache-2.0
