The /v1/embeddings endpoint generates vector embeddings from text input. Embeddings are numerical representations that can be used for semantic search, clustering, recommendations, and other ML tasks.

Endpoint

POST /v1/embeddings
This endpoint requires a model with pooling enabled. Start the server with --pooling to specify the pooling type, or let the model use its default.

Request Format

Required Parameters

model
string
required
Model identifier. Use an embedding-specific model for best results (e.g., models based on BERT, Sentence Transformers, or specialized embedding models).
input
string | array
required
Text to generate embeddings for. Can be:
  • A single string: "Hello world"
  • An array of strings: ["Hello", "world"]
  • An array of token IDs: [12, 34, 56]
  • An array of token arrays: [[12, 34], [56, 78]]

Optional Parameters

encoding_format
string
default:"float"
Format for the embeddings:
  • float - Array of floating point numbers
  • base64 - Base64-encoded float array (more efficient for large batches)
dimensions
number
Number of dimensions for the output embeddings. If specified, embeddings will be truncated or padded.
Not all models support dimension adjustment. Check model capabilities.
user
string
Unique identifier for end-user tracking (optional, for monitoring).
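
When encoding_format is base64, the embedding field arrives as a base64 string of packed float32 values. A minimal decoding sketch, using a local round trip in place of a live response (the little-endian float32 layout mirrors OpenAI's base64 encoding; treat that as an assumption and verify against your server):

```python
import base64

import numpy as np

def decode_embedding(b64_string: str) -> np.ndarray:
    """Decode a base64-encoded embedding into a float32 vector."""
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.float32)

# Round trip locally: encode a known vector, then decode it back.
vec = np.array([0.25, -0.5, 0.125], dtype=np.float32)
encoded = base64.b64encode(vec.tobytes()).decode("ascii")
decoded = decode_embedding(encoded)
assert np.array_equal(decoded, vec)
```

For large batches this avoids the size overhead of JSON float arrays, which is why base64 is the more efficient choice at scale.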

Request Examples

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The food was delicious and the waiter was very friendly.",
    "encoding_format": "float"
  }'

Response Format

object
string
Always "list" for embeddings responses.
data
array
Array of embedding objects. Each object contains:
  • object (string) - Always "embedding"
  • embedding (array | string) - The embedding vector (float array or base64 string)
  • index (number) - Position in the input array
model
string
The model used to generate embeddings.
usage
object
Token usage information:
  • prompt_tokens (number) - Number of tokens in the input
  • total_tokens (number) - Total tokens processed

Example Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        -0.0028842222,
        0.015589447,
        -0.008376982,
        // ... (1536 dimensions total for ada-002)
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Multiple Inputs Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023064255, -0.009327292, ...],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [0.0043521156, -0.012456789, ...],
      "index": 1
    },
    {
      "object": "embedding",
      "embedding": [0.0031234567, -0.007654321, ...],
      "index": 2
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 24
  }
}

Setting Up Embedding Models

Download and Serve an Embedding Model

# Nomic Embed Text (768 dimensions)
llama-server -hf nomic-ai/nomic-embed-text-v1.5-GGUF:Q8_0 --pooling mean

# BGE Base (768 dimensions)
llama-server -hf BAAI/bge-base-en-v1.5-GGUF:Q8_0 --pooling cls

# All-MiniLM (384 dimensions)
llama-server -m models/all-minilm-l6-v2.gguf --pooling mean

Pooling Types

--pooling
string
Pooling method for generating embeddings:
  • mean - Average of all token embeddings (most common)
  • cls - Use [CLS] token embedding (BERT-style)
  • last - Use last token embedding
  • none - No pooling, returns per-token embeddings
  • rank - For reranking models

Use Cases

Semantic Search

Find similar documents by computing cosine similarity:
import openai
import numpy as np
from numpy.linalg import norm

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Create embeddings
documents = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Dogs are loyal pets"
]

query = "What is Python?"

# Get embeddings
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=documents + [query]
)

doc_embeddings = [response.data[i].embedding for i in range(len(documents))]
query_embedding = response.data[-1].embedding

# Find most similar
for i, doc in enumerate(documents):
    similarity = cosine_similarity(query_embedding, doc_embeddings[i])
    print(f"{similarity:.4f}: {doc}")
Output:
0.8234: Python is a programming language
0.6541: Machine learning uses algorithms
0.4123: Dogs are loyal pets

Text Clustering

Group similar texts together:
from sklearn.cluster import KMeans
import numpy as np

# Generate embeddings for texts
texts = [
    "cat", "dog", "car", "truck", "kitten", "puppy", "vehicle", "automobile"
]

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=texts
)

embeddings = np.array([item.embedding for item in response.data])

# Cluster into 2 groups
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(embeddings)

for label in set(labels):
    cluster_texts = [texts[i] for i, l in enumerate(labels) if l == label]
    print(f"Cluster {label}: {cluster_texts}")
Output:
Cluster 0: ['cat', 'dog', 'kitten', 'puppy']
Cluster 1: ['car', 'truck', 'vehicle', 'automobile']

Recommendations

Find items similar to user preferences:
# User liked these items
liked_items = [
    "Science fiction novel",
    "Space exploration documentary"
]

# Candidate items
candidates = [
    "Historical drama series",
    "Mars colonization movie",
    "Cooking tutorial",
    "Astronomy textbook"
]

# Get embeddings
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=liked_items + candidates
)

liked_embeddings = [response.data[i].embedding for i in range(len(liked_items))]
candidate_embeddings = [response.data[i+len(liked_items)].embedding 
                        for i in range(len(candidates))]

# Compute average liked embedding
avg_liked = np.mean(liked_embeddings, axis=0)

# Rank candidates
scores = [(i, cosine_similarity(avg_liked, emb)) 
          for i, emb in enumerate(candidate_embeddings)]
scores.sort(key=lambda x: x[1], reverse=True)

print("Recommendations:")
for idx, score in scores:
    print(f"{score:.4f}: {candidates[idx]}")

Multimodal Embeddings

For models with multimodal support, you can embed images along with text:
{
  "model": "clip-vit-large",
  "input": [
    "A photo of a cat",
    {"type": "image", "data": "base64_encoded_image_data"}
  ]
}
Multimodal embedding support is experimental. Check model documentation for capabilities.

Normalization

Embeddings from /v1/embeddings are automatically normalized using Euclidean (L2) norm. This means:
  • All embedding vectors have length 1.0
  • Cosine similarity equals dot product
  • Ready for vector databases
To verify normalization:
import numpy as np

embedding = response.data[0].embedding
norm = np.linalg.norm(embedding)
print(f"Embedding norm: {norm:.6f}")  # Should be ~1.0

Performance Optimization

Batch Processing

Process multiple texts in a single request:
# Efficient: Single request
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["text1", "text2", "text3", ...]
)

# Inefficient: Multiple requests
for text in texts:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )

Model Selection

Model Type       Dimensions   Use Case
all-MiniLM-L6    384          Fast, general purpose
BGE-base         768          Balanced quality/speed
Nomic Embed      768          Long context support
BGE-large        1024         High quality

Context Window

Start server with appropriate context size:
# For long documents
llama-server -m embedding-model.gguf -c 8192 --pooling mean
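
Even with a larger context window, documents can still exceed it. A common workaround is to embed overlapping chunks and index each one separately; a minimal word-based sketch (the chunk size and overlap here are illustrative defaults, not llama.cpp settings):

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-based chunks that fit the context."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap  # advance by chunk size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final chunk already covers the tail
    return chunks

# Each chunk can then be sent to /v1/embeddings as one element of `input`.
chunks = chunk_text("word " * 500, max_words=200, overlap=20)
```

Token-aware chunking (counting tokens rather than words) is more precise, since the -c limit is measured in tokens, but word counts are a serviceable approximation for English text.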

Error Responses

{
  "error": {
    "message": "This model requires pooling to be enabled",
    "type": "invalid_request_error",
    "code": 400
  }
}
Common errors:
  • No pooling enabled: Start server with --pooling flag
  • Input too long: Reduce text length or increase context size with -c
  • Invalid encoding format: Use float or base64
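
The hints above can be wired into a small client-side handler; a sketch that pattern-matches on the error body shown earlier (the substring checks are assumptions about message wording, not a stable API contract):

```python
import json

def embedding_error_hint(body: str) -> str:
    """Return a remediation hint for a llama-server embeddings error body."""
    message = json.loads(body).get("error", {}).get("message", "")
    if "pooling" in message.lower():
        return "restart llama-server with a --pooling flag"
    if "too long" in message.lower() or "context" in message.lower():
        return "shorten the input or raise the context size with -c"
    if "encoding" in message.lower():
        return "use encoding_format 'float' or 'base64'"
    return f"unrecognized error: {message}"

body = '{"error": {"message": "This model requires pooling to be enabled", "type": "invalid_request_error", "code": 400}}'
hint = embedding_error_hint(body)
```

Matching on message text is brittle across versions; prefer checking the HTTP status code first and using the hint only for logging.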

Comparing to Native Endpoint

llama.cpp also provides /embedding (non-OAI compatible):
Feature          /v1/embeddings              /embedding
Format           OpenAI-compatible           llama.cpp native
Normalization    Always L2 normalized        Configurable
Output           Single pooled vector        Can return per-token
Compatibility    Works with OpenAI clients   Custom clients only
For most use cases, prefer /v1/embeddings for compatibility.

Best Practices

  1. Use dedicated embedding models: Don’t use chat/completion models for embeddings
  2. Batch requests: Send multiple texts together for efficiency
  3. Normalize queries: Keep input text clean and consistent
  4. Cache embeddings: Reuse embeddings for unchanged content
  5. Choose appropriate dimensions: Smaller models (384d) for speed, larger (1024d) for quality
  6. Monitor context limits: Split very long texts if needed
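
Practice 4 (caching) can be as simple as an in-memory map keyed by a hash of the text; a sketch with a stand-in embed function (EmbeddingCache and fake_embed are illustrative, not part of any client library):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings so unchanged texts are never re-embedded."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: list[str] -> list of vectors
        self.store = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, texts: list[str]) -> list:
        # Embed only the texts we have not seen before, in one batch.
        to_fetch = [t for t in texts if self._key(t) not in self.store]
        if to_fetch:
            self.misses += len(to_fetch)
            for t, vec in zip(to_fetch, self.embed_fn(to_fetch)):
                self.store[self._key(t)] = vec
        return [self.store[self._key(t)] for t in texts]

# Demo with a fake embedder (vector = [text length]); swap in a real
# client.embeddings.create call in practice.
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

cache = EmbeddingCache(fake_embed)
cache.get(["hello", "world"])
cache.get(["hello", "again"])  # "hello" is served from cache
```

Batching the misses into a single embed_fn call also follows practice 2; a persistent store (SQLite, Redis) would extend the same idea across restarts.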

Next Steps

  • Chat Completions - Conversational AI
  • Completions - Text generation
  • Vector databases: Integrate with Pinecone, Weaviate, or Milvus for semantic search