This crate is a work in progress: several of the endpoints described below are planned but not yet implemented.

Overview

ominix-api provides a unified, OpenAI-compatible HTTP API server that exposes the OminiX-MLX model crates (ASR, TTS, VLM, LLM) through a single server process. Because it implements the OpenAI API surface, local OminiX-MLX models are drop-in compatible with standard OpenAI client libraries and tools.

Installation

Build from source in the OminiX-MLX workspace:
cargo build --release -p ominix-api

Usage

Basic server startup

# Start server with ASR feature (enabled by default)
cargo run --release -p ominix-api -- \
    --asr-model ~/.OminiX/models/qwen3-asr-1.7b \
    --port 8080

Making requests

# OpenAI Whisper-compatible multipart upload
curl http://localhost:8080/v1/audio/transcriptions \
    -F [email protected] \
    -F language=Chinese

Server configuration

Command-line arguments

asr-model
string
Path to the ASR model directory (e.g., ~/.OminiX/models/qwen3-asr-1.7b)
port
number
default:"8080"
Server port to bind to
models-dir
string
default:"~/.ominix/models"
Directory for storing downloaded models

Environment variables

HF_TOKEN
string
HuggingFace API token for downloading gated models. Can also be stored in ~/.cache/huggingface/token

API endpoints

Audio transcription

POST /v1/audio/transcriptions

Transcribe audio to text using the Qwen3-ASR model

Request (multipart)

file
file
required
Audio file to transcribe (WAV format recommended)
language
string
Language of the audio (e.g., “English”, “Chinese”). If not specified, the model will auto-detect

Request (JSON)

file_path
string
required
Path to the audio file on the server
language
string
Language of the audio content
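
The JSON variant above can be assembled with nothing but the standard library. A minimal sketch (the file path is hypothetical and must be readable by the machine running the server; the request is built but not sent):

```python
import json
import urllib.request

# JSON-body variant of POST /v1/audio/transcriptions,
# using the two documented fields.
payload = {"file_path": "/tmp/audio.wav", "language": "Chinese"}
req = urllib.request.Request(
    "http://localhost:8080/v1/audio/transcriptions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would dispatch it against a running server.
print(req.get_full_url(), req.get_method())
```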

Response

text
string
The transcribed text from the audio file
language
string
The detected or specified language
duration
number
Duration of the audio in seconds
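
A successful response is plain JSON with the three fields above. A sketch of consuming it (the sample values are illustrative, not real model output):

```python
import json

# Illustrative transcription response following the documented fields.
raw = '{"text": "hello world", "language": "English", "duration": 3.2}'
resp = json.loads(raw)
print(f"[{resp['language']}, {resp['duration']}s] {resp['text']}")
```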

Chat completions (planned)

POST /v1/chat/completions

Generate chat completions using LLM models.
This endpoint is not yet implemented; once available, it will follow the OpenAI chat-completions schema.

Request

messages
array
required
Array of message objects with role and content fields
model
string
default:"minicpm-sala-9b"
Model identifier to use for generation
max_tokens
number
default:"2048"
Maximum number of tokens to generate
temperature
number
default:"0.7"
Sampling temperature between 0 and 2. Higher values make output more random
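
A request body for this planned endpoint can be built by merging caller overrides onto the documented defaults. A sketch (the helper function is illustrative, not part of any SDK):

```python
import json

# Documented defaults for the planned /v1/chat/completions endpoint.
DEFAULTS = {"model": "minicpm-sala-9b", "max_tokens": 2048, "temperature": 0.7}

def chat_request(messages, **overrides):
    """Merge caller overrides onto the documented defaults."""
    return json.dumps({**DEFAULTS, **overrides, "messages": messages})

body = chat_request([{"role": "user", "content": "Hello"}], temperature=0.2)
print(body)
```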

Response

id
string
Unique identifier for this completion
object
string
Always “chat.completion”
model
string
The model used for generation
choices
array
Array of completion choices (typically one)
usage
object
Token usage statistics
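
Extracting the generated text means indexing into choices. A sketch with a made-up response; the inner shape of each choice (index, message, finish_reason) is assumed to follow the standard OpenAI schema, which this page does not spell out:

```python
import json

# Illustrative chat.completion response; all values are invented.
raw = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "minicpm-sala-9b",
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Hi!"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
})
resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
print(answer)  # Hi!
```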

Text-to-speech (planned)

POST /v1/audio/speech

Generate speech audio from text using TTS models
This endpoint is planned for integration with the qwen3-tts-mlx crate.

Request

input
string
required
The text to convert to speech
model
string
default:"qwen3-tts"
TTS model identifier
voice
string
Voice preset to use for generation

Response

Returns binary audio data (WAV or MP3 format) with appropriate Content-Type header.
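
Since the body is raw bytes, a client should pick the output filename from the Content-Type header rather than assuming a format. A sketch (the MIME-type-to-extension mapping is an assumption; the page only names WAV and MP3):

```python
# Choose a file extension from the Content-Type of a /v1/audio/speech response.
def audio_extension(content_type):
    mapping = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}
    return mapping.get(content_type.split(";")[0].strip(), ".bin")

print(audio_extension("audio/wav"))   # .wav
print(audio_extension("audio/mpeg"))  # .mp3
```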

Model management

GET /v1/models

List all available models with metadata

Response

object
string
Always “list”
data
array
Array of model objects

POST /v1/models/download

Download a model from HuggingFace Hub

Request

repo_id
string
required
HuggingFace repository ID (e.g., “OminiX/qwen3-asr-1.7b-8bit”)

Response

status
string
Download status: “downloading”, “complete”, or “failed”
id
string
Model identifier (derived from repo name)
repo_id
string
HuggingFace repository ID
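
Because the status field has three documented values, a client polling this endpoint only needs to distinguish the terminal ones. A sketch (the helper is illustrative):

```python
# The documented download statuses; anything else is unexpected.
TERMINAL = {"complete", "failed"}

def download_finished(resp):
    """Return True once a /v1/models/download response reports a terminal state."""
    status = resp["status"]
    if status not in {"downloading"} | TERMINAL:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL

print(download_finished({"status": "downloading",
                         "id": "qwen3-asr-1.7b-8bit",
                         "repo_id": "OminiX/qwen3-asr-1.7b-8bit"}))  # False
```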

DELETE /v1/models/{id}

Delete a downloaded model from disk

Response

id
string
Model identifier that was deleted
deleted
boolean
Always true on success

Health check

GET /health

Check server health status

Response

status
string
Always “ok” if server is running

Features

The server supports feature flags to enable/disable specific model capabilities:
asr
feature
default:"enabled"
Enable ASR (Automatic Speech Recognition) endpoint via qwen3-asr-mlx crate
Build with specific features:
# ASR only (default)
cargo build --release -p ominix-api

# Disable ASR
cargo build --release -p ominix-api --no-default-features

Dependencies

The server is built on the following core dependencies:

HTTP server

  • hyper v1 - High-performance HTTP server
  • tokio v1 - Async runtime
  • http-body-util - HTTP body utilities

Serialization

  • serde + serde_json - JSON serialization

Model integration

  • qwen3-asr-mlx - ASR model implementation (feature-gated)
  • hf-hub - HuggingFace model downloading

Utilities

  • clap - Command-line argument parsing
  • base64 - Audio encoding/decoding
  • dirs - System directory paths

Configuration file

The server maintains a configuration file at ~/.ominix/config.json:
{
  "models_dir": "/Users/username/.ominix/models",
  "models": [
    {
      "id": "qwen3-asr-1.7b-8bit",
      "repo_id": "OminiX/qwen3-asr-1.7b-8bit",
      "path": "/Users/username/.ominix/models/qwen3-asr-1.7b-8bit",
      "quantization": {
        "bits": 8,
        "group_size": 64
      },
      "size_bytes": 1789231104,
      "downloaded_at": "2024-03-15T10:30:00Z"
    }
  ]
}
The configuration is automatically updated when models are downloaded or deleted through the API.
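
The file is ordinary JSON, so tooling can read it directly. A sketch that parses the sample above (pasted verbatim rather than read from disk, so it runs anywhere):

```python
import json

# Parse the sample ~/.ominix/config.json shown above and list models.
config = json.loads('''{
  "models_dir": "/Users/username/.ominix/models",
  "models": [
    {
      "id": "qwen3-asr-1.7b-8bit",
      "repo_id": "OminiX/qwen3-asr-1.7b-8bit",
      "path": "/Users/username/.ominix/models/qwen3-asr-1.7b-8bit",
      "quantization": {"bits": 8, "group_size": 64},
      "size_bytes": 1789231104,
      "downloaded_at": "2024-03-15T10:30:00Z"
    }
  ]
}''')
for m in config["models"]:
    print(m["id"], f'{m["size_bytes"] / 1e9:.2f} GB')
```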

Error handling

All API endpoints follow OpenAI error response format:
{
  "error": {
    "message": "Description of what went wrong",
    "type": "invalid_request_error"
  }
}

Error types

  • invalid_request_error - Malformed request or invalid parameters
  • server_error - Internal server error during processing
  • not_found - Resource not found
  • conflict - Operation conflicts with current state
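
One way for a client to surface these is to map each documented type onto an exception. A sketch (the exception choices are one reasonable convention, not prescribed by the server):

```python
# Map the documented error types onto Python exceptions.
ERROR_TYPES = {
    "invalid_request_error": ValueError,
    "server_error": RuntimeError,
    "not_found": KeyError,
    "conflict": RuntimeError,
}

def raise_for_error(body):
    """Raise if a response body carries the OpenAI-style error envelope."""
    if "error" in body:
        err = body["error"]
        exc = ERROR_TYPES.get(err.get("type"), RuntimeError)
        raise exc(err.get("message", "unknown error"))

raise_for_error({"text": "ok"})  # no error envelope: returns None
```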

CORS support

The server includes CORS headers to allow cross-origin requests:
  • Access-Control-Allow-Origin: *
  • Access-Control-Allow-Methods: GET, POST, DELETE, OPTIONS
  • Access-Control-Allow-Headers: Content-Type, Authorization

Performance considerations

Model loading

Models are loaded into memory on server startup and persist for the lifetime of the server process. Loading large models may take several seconds.

Inference threading

Inference requests are processed sequentially on a dedicated worker thread to avoid GPU contention. For concurrent requests, consider running multiple server instances with different ports.

Memory usage

Each loaded model consumes GPU memory proportional to its size. Ensure your system has sufficient memory for the models you plan to use:
  • 1.7B 8-bit model: ~2GB
  • 9B 8-bit model: ~9GB
  • 72B 4-bit model: ~36GB
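
The figures above track a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, with some extra headroom for activations (which is why the 1.7B figure is quoted as ~2GB rather than 1.7GB). A sketch of the arithmetic:

```python
# Approximate weight memory in GB for a quantized model:
# parameters (billions) x bits per weight / 8 bits per byte.
def weight_gb(params_billions, bits):
    return params_billions * bits / 8

print(weight_gb(1.7, 8))  # 1.7, quoted above as ~2GB with overhead
print(weight_gb(9, 8))    # 9.0
print(weight_gb(72, 4))   # 36.0
```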

Example integration

Python (OpenAI SDK)

from openai import OpenAI

# Point to local OminiX-API server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1"
)

# Transcribe audio
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="qwen3-asr",
        file=f,
        language="English"
    )
print(transcription.text)

JavaScript (OpenAI SDK)

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'http://localhost:8080/v1'
});

// Transcribe audio
const transcription = await client.audio.transcriptions.create({
  model: 'qwen3-asr',
  file: fs.createReadStream('audio.wav'),
  language: 'English'
});
console.log(transcription.text);

cURL

# List available models
curl http://localhost:8080/v1/models

# Download a model
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "OminiX/qwen3-asr-1.7b-8bit"}'

# Transcribe audio
curl http://localhost:8080/v1/audio/transcriptions \
  -F [email protected] \
  -F language=English

# Health check
curl http://localhost:8080/health

License

MIT OR Apache-2.0
