The `/v1/embeddings` endpoint generates vector embeddings from text input. Embeddings are numerical representations of text that can be used for semantic search, clustering, recommendations, and other ML tasks.
Endpoint

POST `/v1/embeddings`
This endpoint requires a model with pooling enabled. Start the server with `--pooling` to specify the pooling type, or let the model use its default.

Request Format
Required Parameters
`model` (string) - Model identifier. Use an embedding-specific model for best results (e.g., models based on BERT, Sentence Transformers, or specialized embedding models).
`input` (string | array) - Text to generate embeddings for. Can be:
- A single string: `"Hello world"`
- An array of strings: `["Hello", "world"]`
- An array of token IDs: `[12, 34, 56]`
- An array of token arrays: `[[12, 34], [56, 78]]`
Optional Parameters
`encoding_format` (string) - Format for the returned embeddings:
- `float` - Array of floating-point numbers
- `base64` - Base64-encoded float array (more efficient for large batches)
`dimensions` (number) - Number of dimensions for the output embeddings. If specified, embeddings are truncated or padded to this length.

Not all models support dimension adjustment; check the model's capabilities.
`user` (string) - Unique identifier for end-user tracking (optional, useful for monitoring).
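When `base64` is requested, the response packs the raw floats into a base64 string that the client must decode. A minimal sketch, assuming little-endian float32 packing (the layout used by OpenAI-compatible servers):

```python
import base64
import struct

def decode_embedding(b64: str) -> list:
    """Decode a base64-encoded embedding into a list of floats.
    Assumes little-endian float32 values, 4 bytes each."""
    raw = base64.b64decode(b64)
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))
```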
Request Examples
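The original request examples are missing here. A minimal client sketch using only the standard library; the server URL and model name are assumptions (llama-server defaults to port 8080, and the model field is illustrative):

```python
import json
import urllib.request

# Assumed defaults: llama-server listening on localhost:8080; model name is illustrative.
URL = "http://localhost:8080/v1/embeddings"

def build_payload(texts, model="my-embedding-model"):
    # "input" may be a single string, an array of strings, or token-ID arrays
    return {"model": model, "input": texts}

def embed(texts):
    """POST the texts to the embeddings endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```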
Response Format
`object` (string) - Always `"list"` for embeddings responses.

`data` (array) - Array of embedding objects. Each object contains:
- `object` (string) - Always `"embedding"`
- `embedding` (array | string) - The embedding vector (float array or base64 string)
- `index` (number) - Position in the input array
`model` (string) - The model used to generate the embeddings.
`usage` (object) - Token usage information:
- `prompt_tokens` (number) - Number of tokens in the input
- `total_tokens` (number) - Total tokens processed
Example Response
Multiple Inputs Response
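The sample payloads are missing here. An illustrative (made-up values, dimensions truncated) multi-input response, and how a client might read it back in input order:

```python
# Illustrative response shape; vector values and model name are made up.
response = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [0.1, 0.2], "index": 0},
        {"object": "embedding", "embedding": [0.3, 0.4], "index": 1},
    ],
    "model": "my-embedding-model",
    "usage": {"prompt_tokens": 4, "total_tokens": 4},
}

# Each result carries an index; sort by it to align vectors with the inputs.
vectors = [d["embedding"] for d in sorted(response["data"], key=lambda d: d["index"])]
```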
Setting Up Embedding Models
Download Embedding Model
Pooling Types
Pooling method for generating embeddings:
- `mean` - Average of all token embeddings (most common)
- `cls` - Use the [CLS] token embedding (BERT-style)
- `last` - Use the last token embedding
- `none` - No pooling; returns per-token embeddings
- `rank` - For reranking models
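To make the pooling options concrete, a toy sketch (made-up 4-dimensional per-token vectors) of how `mean`, `cls`, and `last` reduce a token matrix to one embedding:

```python
# Toy per-token embeddings: 3 tokens, 4 dimensions each (values are made up).
tokens = [
    [1.0, 0.0, 2.0, 0.0],  # first token ([CLS] in BERT-style models)
    [3.0, 0.0, 0.0, 2.0],
    [2.0, 3.0, 1.0, 1.0],  # last token
]

def mean_pool(tok):
    """Average each dimension across all tokens."""
    n = len(tok)
    return [sum(col) / n for col in zip(*tok)]

def cls_pool(tok):
    """Take the first token's embedding."""
    return tok[0]

def last_pool(tok):
    """Take the last token's embedding."""
    return tok[-1]
```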
Use Cases
Semantic Search
Find similar documents by computing cosine similarity between their embeddings.

Text Clustering

Group similar texts together by clustering their embedding vectors.

Recommendations

Find items similar to user preferences by comparing embeddings.

Multimodal Embeddings

For models with multimodal support, you can embed images along with text. Multimodal embedding support is experimental; check the model documentation for capabilities.
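The code samples for these use cases are missing above. A minimal cosine-similarity sketch that covers both search and recommendations; the document names and vectors are made up:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Rank toy "document" vectors against a query vector (2D for illustration;
# real embeddings have hundreds of dimensions).
query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

The same similarity function drives clustering (compare pairs of items) and recommendations (compare a user-preference vector against item vectors).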
Normalization
Embeddings from `/v1/embeddings` are automatically normalized using the Euclidean (L2) norm. This means:
- All embedding vectors have length 1.0
- Cosine similarity equals dot product
- Ready for vector databases
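The properties above are easy to verify on toy vectors; a short sketch of L2 normalization and the cosine-equals-dot-product identity for unit vectors:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = l2_normalize([3.0, 4.0])  # unit-length vector
b = l2_normalize([1.0, 0.0])

# For unit vectors, the dot product IS the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
```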
Performance Optimization
Batch Processing
Process multiple texts in a single request by passing an array as `input`; this is more efficient than issuing one request per text.

Model Selection
| Model Type | Dimensions | Use Case |
|---|---|---|
| all-MiniLM-L6 | 384 | Fast, general purpose |
| BGE-base | 768 | Balanced quality/speed |
| Nomic Embed | 768 | Long context support |
| BGE-large | 1024 | High quality |
Context Window
Start the server with a context size (`-c`) large enough for your longest inputs.

Error Responses
- No pooling enabled: Start the server with the `--pooling` flag
- Input too long: Reduce text length or increase the context size with `-c`
- Invalid encoding format: Use `float` or `base64`
Comparing to Native Endpoint
llama.cpp also provides `/embedding` (non-OpenAI-compatible):
| Feature | /v1/embeddings | /embedding |
|---|---|---|
| Format | OpenAI-compatible | llama.cpp native |
| Normalization | Always L2 normalized | Configurable |
| Output | Single pooled vector | Can return per-token |
| Compatibility | Works with OpenAI clients | Custom clients only |
Use `/v1/embeddings` for compatibility with OpenAI clients.
Best Practices
- Use dedicated embedding models: Don’t use chat/completion models for embeddings
- Batch requests: Send multiple texts together for efficiency
- Normalize queries: Keep input text clean and consistent
- Cache embeddings: Reuse embeddings for unchanged content
- Choose appropriate dimensions: Smaller models (384d) for speed, larger (1024d) for quality
- Monitor context limits: Split very long texts if needed
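For the last point, one naive way to split long texts is word-window chunking with overlap. Word counts only approximate token counts, so the limits below are assumptions to tune against your model's tokenizer:

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word-window chunks.
    Word counts only approximate token counts; tune for your tokenizer."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Embed each chunk separately, then search over chunk vectors and map hits back to the source document.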
Next Steps
- Chat Completions - Conversational AI
- Completions - Text generation
- Vector databases: Integrate with Pinecone, Weaviate, or Milvus for semantic search

