Overview

The switchAILocal embedding SDK provides ONNX-based local embedding generation using the MiniLM model. This enables advanced features like semantic caching, intelligent routing, and skill matching without requiring external API calls.

Key Features

  • Local Processing: All embeddings computed on-device using ONNX Runtime
  • 384-Dimensional Vectors: Standard MiniLM-L6-v2 model output
  • Fast Inference: Optimized for real-time semantic matching (<20ms)
  • No External Dependencies: Fully offline after model download
  • Thread-Safe: Concurrent embedding generation support

Architecture

When to Use Embeddings

Semantic Tier (Phase 2)

Match user queries to intents using embedding similarity instead of LLM classification:
intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85
Benefits:
  • 10-20x faster than LLM classification
  • Deterministic results
  • No API costs
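
The matching step behind the semantic tier can be sketched as follows. Embeddings are represented as plain vectors, and `cosine`/`classify` are hypothetical helpers, not the SDK's API; the 0.85 threshold mirrors `confidence-threshold` in the config above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(query_vec, intent_vecs, threshold=0.85):
    """Return the best-matching intent, or None to fall back to LLM classification."""
    best_intent, best_score = None, -1.0
    for intent, vec in intent_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None
```

Below the threshold the tier returns nothing and the router can fall back to LLM classification; a clear match never reaches the LLM, which is where the speed and cost savings come from.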

Semantic Caching

Cache responses based on semantic similarity:
intelligence:
  semantic-cache:
    enabled: true
    similarity-threshold: 0.95
    max-size: 10000
Use Cases:
  • Deduplicating similar queries
  • Reducing API costs
  • Faster response times for common questions
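
A cache keyed on embedding similarity can be sketched like this. Class and method names are illustrative, not the SDK's API; all-MiniLM-L6-v2 embeddings are typically unit-length, so the sketch normalizes stored vectors and uses a dot product as cosine similarity.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class SemanticCache:
    """Illustrative semantic cache: returns a stored response when an
    incoming query embedding is similar enough to a cached one."""

    def __init__(self, similarity_threshold=0.95, max_size=10000):
        self.threshold = similarity_threshold
        self.max_size = max_size
        self.entries = []  # (normalized embedding, cached response), oldest first

    def get(self, embedding):
        q = normalize(embedding)
        for vec, response in self.entries:
            if sum(a * b for a, b in zip(q, vec)) >= self.threshold:
                return response  # semantically close enough: cache hit
        return None

    def put(self, embedding, response):
        if len(self.entries) >= self.max_size:
            self.entries.pop(0)  # evict the oldest entry
        self.entries.append((normalize(embedding), response))
```

The high 0.95 threshold matters here: a cache hit replays a stored answer verbatim, so only near-duplicate queries should match.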

Skill Matching

Match queries to domain-specific skills:
intelligence:
  skill-matching:
    enabled: true
    confidence-threshold: 0.80
Example Skills:
  • Language experts (Go, Python, TypeScript)
  • Infrastructure (Docker, Kubernetes)
  • Security, Testing, Debugging
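
One common way to implement this, sketched below with hypothetical helper names: describe each skill by a few example phrases, average their embeddings into a prototype, and route to the best skill only when similarity clears the 0.80 threshold.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def prototype(example_vecs):
    """Mean of a skill's example embeddings, renormalized to unit length."""
    dims = len(example_vecs[0])
    mean = [sum(v[i] for v in example_vecs) / len(example_vecs) for i in range(dims)]
    return normalize(mean)

def match_skill(query_vec, skills, threshold=0.80):
    """Return (skill, score) for the best match, or (None, score) below threshold."""
    q = normalize(query_vec)
    best, best_score = None, -1.0
    for name, examples in skills.items():
        p = prototype(examples)
        score = sum(a * b for a, b in zip(q, p))
        if score > best_score:
            best, best_score = name, score
    return (best, best_score) if best_score >= threshold else (None, best_score)
```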

Model Details

all-MiniLM-L6-v2

Property              Value
Dimensions            384
Max Sequence Length   256 tokens
Model Size            ~23 MB (ONNX)
Vocabulary Size       ~30,000 tokens
Performance           ~5-10ms per embedding

Download the Model

./scripts/download-embedding-model.sh
This downloads:
  • model.onnx - The ONNX model file
  • vocab.txt - The tokenizer vocabulary
Files are stored in ~/.switchailocal/models/.

Quick Start

1. Download Model

./scripts/download-embedding-model.sh

2. Enable in Config

intelligence:
  enabled: true
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"

3. Start Server

./ail.sh start

4. Verify

Check logs for:
INFO Embedding engine initialized with model: model.onnx

Configuration Options

intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"  # Model name
    model-path: "~/.switchailocal/models/model.onnx"  # Override path
    vocab-path: "~/.switchailocal/models/vocab.txt"   # Override vocab
    shared-library: ""  # ONNX Runtime library (auto-detected)

Performance Characteristics

Latency

Operation           Typical Latency
Single embedding    5-10ms
Batch (10 texts)    30-50ms
Cosine similarity   <1ms
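
The sub-millisecond cosine figure is plausible from first principles: for 384-dimensional vectors it is roughly three loops of 384 multiply-adds plus two square roots. A minimal pure-Python version (the SDK's actual implementation may differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, e.g. 384-dim MiniLM embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```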

Memory Usage

  • Model Loading: ~50 MB
  • Per Request: ~1-2 MB (temporary)
  • Cached Embeddings: 384 floats × 4 bytes ≈ 1.5 KB per vector
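
The per-vector figure follows directly from the model's output shape; a quick sanity check on the raw vector payload (ignoring keys and allocator overhead):

```python
DIMENSIONS = 384        # all-MiniLM-L6-v2 output size
BYTES_PER_FLOAT32 = 4

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT32  # 1536 bytes ≈ 1.5 KB
cache_bytes = 10_000 * vector_bytes            # a full 10,000-entry semantic cache
print(vector_bytes, cache_bytes / (1024 * 1024))  # 1536 bytes, ~14.6 MB
```

So even a fully populated semantic cache keeps its vectors in roughly 15 MB, well under the ~50 MB the loaded model itself occupies.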

Accuracy

  • Semantic Similarity: 0.0 (unrelated) to 1.0 (identical)
  • Typical Intent Match: >0.85 for correct matches
  • Typical Skill Match: >0.80 for relevant skills

Comparison with Alternatives

Feature        switchAILocal Embedding   OpenAI Embedding API   SentenceTransformers
Cost           Free (local)              $0.0001/1K tokens      Free (local)
Latency        5-10ms                    50-200ms (network)     10-20ms
Privacy        100% local                Data sent to API       100% local
Dimensions     384                       1536 (ada-002)         Varies
Dependencies   ONNX Runtime only         Internet required      Python + PyTorch

Next Steps

  • Usage Guide: Learn how to use the embedding SDK
  • Custom Providers: Integrate custom embedding models
  • Semantic Tier: Configure semantic intent matching
  • Semantic Cache: Enable semantic caching
