Embeddings
The embeddings endpoint generates vector representations of input text. This endpoint is compatible with OpenAI’s /v1/embeddings API.
Request
```bash
curl http://localhost:30000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-large-en-v1.5",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="The quick brown fox jumps over the lazy dog"
)

print(response.data[0].embedding)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
Parameters
Required
`input` — Input text to generate embeddings for. Can be:
- A single string
- An array of strings for batch processing
- An array of token IDs (integers)
- An array of arrays of token IDs
- An array of multimodal embedding inputs (for multimodal models)
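Each accepted shape is just a different JSON value for the `input` field. A minimal sketch of the first four forms (the token IDs below are placeholders for illustration, not real tokenizer output):

```python
# Every accepted shape of "input", as it would appear in the request body.
single_string = "The quick brown fox"
string_batch = ["first document", "second document"]
token_ids = [101, 7592, 102]                            # one pre-tokenized input
token_id_batch = [[101, 7592, 102], [101, 2088, 102]]   # batch of pre-tokenized inputs

# A batch request body would then look like:
payload = {"model": "BAAI/bge-large-en-v1.5", "input": string_batch}
print(type(payload["input"]).__name__, len(payload["input"]))  # list 2
```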
`model` — Model name to use for embeddings.
Optional
`encoding_format` — Format of the returned embeddings. Currently only `"float"` is supported.

`dimensions` — Number of dimensions for the output embeddings. If specified, the model reduces the embedding to this dimensionality.

`user` — A unique identifier for the end user.
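Put together, a request body that exercises the optional fields might look like this (the values are illustrative, and `dimensions` only has an effect on models that support output truncation):

```python
# Illustrative request body combining the optional parameters above.
payload = {
    "model": "BAAI/bge-large-en-v1.5",
    "input": "Machine learning is fascinating",
    "encoding_format": "float",   # the only supported value today
    "dimensions": 256,            # ask the server to reduce the output size
    "user": "user-1234",          # opaque end-user identifier
}
print(sorted(payload.keys()))
```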
SGLang Extensions
`lora_path` — Path to LoRA adapter weights to apply to the model.

`priority` — Priority level for the request.
Multimodal Embeddings
For multimodal embedding models, you can provide text, images, and videos:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="multimodal-embedding-model",
    input=[
        {"text": "A beautiful sunset"},
        {"image": "https://example.com/image.jpg"},
        {"text": "Mountain landscape", "image": "data:image/jpeg;base64,..."}
    ]
)

for i, embedding_obj in enumerate(response.data):
    print(f"Embedding {i} dimension: {len(embedding_obj.embedding)}")
```
`text` — Text content for the embedding.

`image` — Image URL, file path, or base64-encoded image.

`video` — Video URL, file path, or base64-encoded video.
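For local files, a common pattern is to base64-encode the bytes yourself and send a data URI in the `image` field. A sketch, with placeholder bytes standing in for a real JPEG:

```python
import base64

# Placeholder bytes, not a real image; in practice read them from a file.
fake_jpeg_bytes = b"\xff\xd8\xff\xe0fake-image-data"
b64 = base64.b64encode(fake_jpeg_bytes).decode("ascii")

# One multimodal input entry combining text and an inline base64 image.
image_input = {"text": "Mountain landscape", "image": f"data:image/jpeg;base64,{b64}"}
print(image_input["image"][:30])
```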
Response
`data` — Array of embedding objects.

`embedding` — Array of floating-point numbers representing the embedding vector.

`index` — Index of the embedding in the input array.

`model` — Model used to generate embeddings.

`usage` — Token usage information.

`prompt_tokens` — Number of tokens in the input.
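These fields map one-to-one onto the JSON body the server returns. A sketch of pulling them out of a raw response (the payload below is a hand-written example, not real server output):

```python
import json

# Hand-written example response body for illustration.
raw = """{
  "object": "list",
  "data": [{"object": "embedding", "embedding": [0.0234, -0.0187], "index": 0}],
  "model": "BAAI/bge-large-en-v1.5",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}"""

resp = json.loads(raw)
vector = resp["data"][0]["embedding"]
print(len(vector), resp["usage"]["prompt_tokens"])  # 2 8
```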
Examples
Single Text Embedding
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Machine learning is fascinating"
)

embedding = response.data[0].embedding
print(f"Embedding length: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
```
Batch Embeddings
```python
texts = [
    "Artificial intelligence",
    "Machine learning",
    "Deep learning",
    "Neural networks"
]

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=texts
)

for i, data in enumerate(response.data):
    print(f"Text {i}: {texts[i]}")
    print(f"Embedding dim: {len(data.embedding)}")
    print()
```
Semantic Similarity
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get embeddings for multiple texts
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=[
        "The cat sat on the mat",
        "A feline rested on the rug",
        "The weather is nice today"
    ]
)

emb1 = np.array(response.data[0].embedding)
emb2 = np.array(response.data[1].embedding)
emb3 = np.array(response.data[2].embedding)

print(f"Similarity (cat/feline): {cosine_similarity(emb1, emb2):.4f}")
print(f"Similarity (cat/weather): {cosine_similarity(emb1, emb3):.4f}")
```
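With many vectors, computing similarities one pair at a time gets slow. Normalizing the rows once and taking a matrix product yields the full pairwise similarity matrix in one step; a sketch with toy 2-d vectors standing in for real embeddings:

```python
import numpy as np

# Toy vectors standing in for real embeddings.
embs = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

# Normalize each row, then sim_matrix[i, j] is the cosine similarity of i and j.
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sim_matrix = normed @ normed.T
print(sim_matrix.round(4))
```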
With LoRA Adapter
```python
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5:my-lora-adapter",
    input="Specialized domain text",
    # The OpenAI SDK rejects unknown keyword arguments, so SGLang
    # extension parameters are passed through extra_body.
    extra_body={"lora_path": "/path/to/lora/adapter"}
)
embedding = response.data[0].embedding
```
Supported Models
SGLang supports a variety of embedding models, including:

Text Embeddings:
- BAAI/bge-large-en-v1.5
- BAAI/bge-base-en-v1.5
- intfloat/e5-mistral-7b-instruct
- sentence-transformers/all-MiniLM-L6-v2

Multimodal Embeddings:
- Models supporting text + image embeddings
- Models supporting text + video embeddings

For reference, a full response body looks like:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0234, -0.0187, 0.0456, ...],
      "index": 0
    }
  ],
  "model": "BAAI/bge-large-en-v1.5",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
```
Use Cases
Retrieval-Augmented Generation (RAG)
Embeddings are commonly used in RAG systems to find relevant documents:
```python
# Index your documents
documents = [
    "SGLang is a fast serving framework for LLMs.",
    "It provides high throughput and low latency.",
    "SGLang supports various models and features."
]

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=documents
)
doc_embeddings = [data.embedding for data in response.data]

# Query
query = "What is SGLang?"
query_response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=query
)
query_embedding = query_response.data[0].embedding

# Find the most similar document (cosine_similarity defined above)
similarities = [cosine_similarity(query_embedding, doc_emb)
                for doc_emb in doc_embeddings]
best_match_idx = np.argmax(similarities)
print(f"Most relevant: {documents[best_match_idx]}")
```
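Real RAG pipelines usually return the top-k documents rather than a single best match. With the document embeddings stacked into a matrix, ranking is one `argsort`; toy 2-d vectors stand in for model output below:

```python
import numpy as np

# Toy embeddings standing in for real document/query vectors.
doc_embeddings = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
query_embedding = np.array([1.0, 0.0])

# Cosine scores for every document at once.
normed_docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
normed_query = query_embedding / np.linalg.norm(query_embedding)
scores = normed_docs @ normed_query

k = 2
top_k = np.argsort(scores)[::-1][:k]  # indices of the k most similar documents
print(top_k.tolist())  # [0, 1]
```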
Clustering
Group similar texts together:
```python
from sklearn.cluster import KMeans

texts = ["text1", "text2", "text3", ...]  # Your texts

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=texts
)
embeddings = np.array([data.embedding for data in response.data])

# Cluster into 3 groups
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(embeddings)

for i, cluster in enumerate(clusters):
    print(f"Text {i} -> Cluster {cluster}")
```
See Also