llama.cpp provides multiple ways to generate embeddings: high-dimensional vector representations of text that capture semantic meaning. These embeddings are essential for semantic search, similarity comparison, and retrieval-augmented generation (RAG).
Overview
Embeddings convert text into dense numerical vectors that preserve semantic relationships. Similar texts produce similar embeddings, making them useful for:
Semantic search : Find documents by meaning rather than keyword matching
Similarity measurement : Quantify how semantically close two texts are
Clustering : Group similar documents together
Classification : Train classifiers on embedding features
Retrieval-Augmented Generation (RAG) : Retrieve relevant context for LLM prompts
Quick Start
Start the server
Launch llama-server with an embedding model:
./llama-server -m embedding-model.gguf --embeddings --pooling mean
Generate embeddings
Make a request to the embeddings endpoint:
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, world!",
"model": "text-embedding"
}'
Process the results
The response contains normalized embedding vectors:
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "embedding": [0.023, -0.015, 0.042, ...],
    "index": 0
  }]
}
Using llama-server
The server provides both OpenAI-compatible and custom embedding endpoints.
Starting an Embedding Server
# Basic embedding server
./llama-server -m embedding-model.gguf --embeddings --pooling mean
# With larger batch size for throughput
./llama-server -m embedding-model.gguf --embeddings --pooling mean -ub 8192
# With GPU acceleration
./llama-server -m embedding-model.gguf --embeddings --pooling mean -ngl 99
The --embeddings flag restricts the server to only support embedding use cases. Use this flag with dedicated embedding models for optimal performance.
Pooling Types
Pooling determines how token embeddings are combined into a single vector:
--pooling
string
default: "model default"
Pooling method for embeddings:
none: Return embeddings for all tokens (no pooling)
mean: Average of all token embeddings
cls: Use the CLS token embedding
last: Use the last token embedding
rank: For reranking models
# Mean pooling (most common)
./llama-server -m model.gguf --embeddings --pooling mean
# CLS token (for BERT-style models)
./llama-server -m model.gguf --embeddings --pooling cls
# Last token (for some decoder models)
./llama-server -m model.gguf --embeddings --pooling last
OpenAI-Compatible API
The /v1/embeddings endpoint follows the OpenAI API specification.
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "The quick brown fox jumps over the lazy dog",
"model": "text-embedding",
"encoding_format": "float"
}'
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": [
"First document to embed",
"Second document to embed",
"Third document to embed"
],
"model": "text-embedding"
}'
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0234, -0.0145, 0.0421, ..., 0.0089],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [0.0198, -0.0167, 0.0389, ..., 0.0102],
      "index": 1
    }
  ],
  "model": "text-embedding",
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 24
  }
}
The /v1/embeddings endpoint requires a pooling type other than none and returns normalized embeddings using the Euclidean norm.
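As a quick sanity check, an L2-normalized vector has Euclidean length 1. This sketch uses a hypothetical 3-dimensional embedding in place of a real server response:

```python
import numpy as np

# Stand-in for response["data"][0]["embedding"] (values are made up)
emb = np.array([0.6, 0.8, 0.0])

# L2-normalized embeddings have unit Euclidean norm
norm = np.linalg.norm(emb)
print(norm)  # → 1.0
```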
Custom Embedding API
The /embedding endpoint provides more flexibility than the OpenAI-compatible endpoint.
Basic Request
curl http://localhost:8080/embedding \
-H "Content-Type: application/json" \
-d '{
"content": "Text to embed"
}'
Normalization Options
curl http://localhost:8080/embedding \
-H "Content-Type: application/json" \
-d '{
"content": "Text to embed",
"embd_normalize": 2
}'
Normalization method for embeddings:
-1: No normalization
0: Max absolute (scale to int16 range)
1: Taxicab / L1 norm
2: Euclidean / L2 norm (default)
>2: P-norm with specified p value
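The modes above can be sketched in numpy. This is an illustrative reimplementation for intuition, not the server's actual code:

```python
import numpy as np

def normalize(v, embd_normalize=2):
    """Mimic the embd_normalize options (illustrative sketch)."""
    v = np.asarray(v, dtype=np.float64)
    if embd_normalize == -1:        # no normalization
        return v
    if embd_normalize == 0:         # scale so the max |value| maps to the int16 range
        return v * (32767.0 / np.max(np.abs(v)))
    # p-norm: 1 = taxicab, 2 = Euclidean (default), >2 = general p-norm
    return v / np.linalg.norm(v, ord=embd_normalize)

v = [3.0, 4.0]
l2 = normalize(v)        # L2: [0.6, 0.8]
l1 = normalize(v, 1)     # L1: [3/7, 4/7]
print(l2, l1)
```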
Non-OpenAI /embeddings Endpoint
The /embeddings endpoint (without /v1) supports all pooling types including none:
# Start server with no pooling
./llama-server -m model.gguf --embeddings --pooling none
# Get embeddings for all tokens
curl http://localhost:8080/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Generate per-token embeddings"
}'
Response format (pooling = none):
[
  {
    "index": 0,
    "embedding": [
      [0.023, -0.015, ...],   // token 0
      [0.019, -0.021, ...],   // token 1
      [0.031, -0.018, ...],   // token 2
      ...
    ]
  }
]
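With pooling none, pooling is left to the client. This sketch applies mean pooling plus L2 normalization to a hypothetical per-token response (hardcoded values stand in for a real server reply):

```python
import numpy as np

# Hypothetical /embeddings response with pooling = none (values are made up)
response = [{"index": 0, "embedding": [[0.02, -0.01], [0.04, 0.03], [0.00, 0.01]]}]

tokens = np.array(response[0]["embedding"])   # shape: (n_tokens, n_dims)
pooled = tokens.mean(axis=0)                  # mean pooling across tokens
pooled /= np.linalg.norm(pooled)              # L2 normalize the pooled vector
print(pooled)
```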
Using llama-embedding CLI
The llama-embedding command-line tool generates embeddings directly without running a server.
Basic Usage
# Generate embedding for text
./llama-embedding -m model.gguf --pooling mean -p "Hello World!" --log-disable
Raw Output (Default)
JSON Format
Array Format
./llama-embedding -m model.gguf --pooling mean \
-p "Text to embed" \
--log-disable
# Output: space-separated floats
0.0234 -0.0145 0.0421 ... 0.0089
Generate embeddings for multiple texts using a separator:
./llama-embedding -m model.gguf --pooling mean \
-p "First text<#sep#>Second text<#sep#>Third text" \
--embd-separator "<#sep#>" \
--embd-normalize 2 \
--embd-output-format array \
--log-disable
Advanced Options
# --embd-normalize 2          L2 normalization
# --embd-output-format json   JSON output
# --n-gpu-layers 99           GPU acceleration
# --log-disable               Suppress logs
./llama-embedding -m model.gguf \
    --pooling mean \
    -p "Text to embed" \
    --embd-normalize 2 \
    --embd-output-format json \
    --n-gpu-layers 99 \
    --log-disable
Similarity Calculation
Once you have embeddings, calculate similarity using cosine similarity:
similarity = (A · B) / (||A|| × ||B||)
For normalized embeddings (L2 norm), this simplifies to the dot product:
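Both the general formula and the dot-product shortcut can be sketched with numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    """General cosine similarity; works for unnormalized vectors too."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
sim = cosine_similarity(a, b)   # ≈ 1.0: same direction

# For L2-normalized vectors, the plain dot product gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
sim_dot = float(np.dot(a_n, b_n))
print(sim, sim_dot)
```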
Python Example
import numpy as np
import requests

def get_embedding(text):
    response = requests.post(
        "http://localhost:8080/v1/embeddings",
        json={"input": text, "model": "text-embedding"}
    )
    return np.array(response.json()["data"][0]["embedding"])

# Get embeddings
emb1 = get_embedding("The cat sits on the mat")
emb2 = get_embedding("A feline rests on the rug")
emb3 = get_embedding("Python programming language")

# Calculate cosine similarity (embeddings are already normalized)
print(f"Cat vs Feline: {np.dot(emb1, emb2):.3f}")  # High similarity
print(f"Cat vs Python: {np.dot(emb1, emb3):.3f}")  # Low similarity
JavaScript Example
async function getEmbedding(text) {
  const response = await fetch('http://localhost:8080/v1/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ input: text, model: 'text-embedding' })
  });
  const data = await response.json();
  return data.data[0].embedding;
}

// Dot product; equals cosine similarity because /v1/embeddings
// returns L2-normalized vectors
function cosineSimilarity(a, b) {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

// Calculate similarity
const emb1 = await getEmbedding('The cat sits on the mat');
const emb2 = await getEmbedding('A feline rests on the rug');
console.log('Similarity:', cosineSimilarity(emb1, emb2));
Multimodal Embeddings
Some models support generating embeddings from images or audio in addition to text.
Image Embeddings
# Start server with multimodal model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --embeddings --pooling mean
# Generate image embedding
curl http://localhost:8080/embedding \
-H "Content-Type: application/json" \
-d '{
"content": {"prompt_string": "Image description", "multimodal_data": ["base64_image_data"]}
}'
See the Multimodal documentation for details on image and audio input formats.
Embedding Models
Recommended Models
Popular embedding models available in GGUF format:
sentence-transformers/all-MiniLM-L6-v2 : Lightweight, fast, 384 dimensions
BAAI/bge-small-en-v1.5 : Strong performance, 384 dimensions
BAAI/bge-base-en-v1.5 : Balanced quality/speed, 768 dimensions
BAAI/bge-large-en-v1.5 : High quality, 1024 dimensions
Alibaba-NLP/gte-large-en-v1.5 : Excellent for retrieval, 1024 dimensions
intfloat/e5-large-v2 : Strong general-purpose, 1024 dimensions
Finding GGUF Embedding Models
Search Hugging Face for GGUF embedding models:
https://huggingface.co/models?pipeline_tag=feature-extraction&search=gguf
Using with llama-server
# Download from Hugging Face
./llama-server -hf sentence-transformers/all-MiniLM-L6-v2-GGUF --embeddings --pooling mean
# Or use local file
./llama-server -m all-MiniLM-L6-v2.gguf --embeddings --pooling mean
Reranking
Reranking models score document relevance for a given query, useful for improving search results.
Starting a Reranking Server
./llama-server -m bge-reranker-v2-m3.gguf --embeddings --pooling rank --rerank
Reranking API
curl http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "reranker",
"query": "What is a panda?",
"documents": [
"A panda is a type of fish",
"The giant panda is a bear species endemic to China",
"Pandas are black and white animals",
"Programming pandas is a data analysis library"
],
"top_n": 2
}'
Response:
{
  "results": [
    { "index": 1, "relevance_score": 0.95, "document": "The giant panda is..." },
    { "index": 2, "relevance_score": 0.78, "document": "Pandas are black..." }
  ]
}
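Client-side, the index field maps each result back to your original documents list. This sketch works on a hardcoded sample reply (scores are made up):

```python
documents = [
    "A panda is a type of fish",
    "The giant panda is a bear species endemic to China",
    "Pandas are black and white animals",
    "Programming pandas is a data analysis library",
]

# Hypothetical /v1/rerank results (scores are made up)
results = [
    {"index": 1, "relevance_score": 0.95},
    {"index": 2, "relevance_score": 0.78},
]

# Map result indices back to the original documents, best first
ranked = [documents[r["index"]]
          for r in sorted(results, key=lambda r: r["relevance_score"], reverse=True)]
print(ranked[0])  # → The giant panda is a bear species endemic to China
```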
Recommended Reranking Models
BAAI/bge-reranker-v2-m3 : Multilingual reranking
BAAI/bge-reranker-large : English reranking
Use Cases
Semantic Search
Index documents
Generate embeddings for all documents in your corpus:
documents = ["doc1 text", "doc2 text", "doc3 text"]
embeddings = [get_embedding(doc) for doc in documents]
Embed query
Generate embedding for the search query:
query_embedding = get_embedding("search query")
Find similar documents
Calculate similarity and rank:
similarities = [np.dot(query_embedding, emb) for emb in embeddings]
top_docs = sorted(zip(documents, similarities), key=lambda x: -x[1])
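End to end, the search loop above can be sketched with toy vectors standing in for real model output:

```python
import numpy as np

# Toy 2-dimensional "embeddings" standing in for real model output
documents = ["doc about cats", "doc about code", "doc about dogs"]
embeddings = [np.array([0.9, 0.1]), np.array([0.1, 0.9]), np.array([0.8, 0.2])]
embeddings = [e / np.linalg.norm(e) for e in embeddings]  # L2 normalize

query_embedding = np.array([1.0, 0.0])  # a query "near" the cat/dog docs

# Dot product == cosine similarity for normalized vectors; rank highest first
similarities = [float(np.dot(query_embedding, emb)) for emb in embeddings]
top_docs = sorted(zip(documents, similarities), key=lambda x: -x[1])
print(top_docs[0][0])  # → doc about cats
```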
RAG (Retrieval-Augmented Generation)
Build vector database
Store document embeddings in a vector database (FAISS, Pinecone, Weaviate, etc.)
Retrieve context
For a user query, find the most similar documents
Generate response
Pass retrieved documents as context to the LLM:
context = "\n\n".join(retrieved_docs)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
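Putting the steps together, a minimal prompt-assembly helper (the retrieval and chat calls are left out) might look like:

```python
def build_rag_prompt(retrieved_docs, query):
    """Assemble a RAG prompt from retrieved context (illustrative sketch)."""
    context = "\n\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_rag_prompt(
    ["The giant panda is a bear species endemic to China."],
    "What is a panda?",
)
print(prompt.splitlines()[0])  # → Context:
```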
Document Clustering
from sklearn.cluster import KMeans
import numpy as np

# Get embeddings for documents
documents = ["doc1", "doc2", "doc3", ...]
embeddings = np.array([get_embedding(doc) for doc in documents])

# Cluster documents
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)

# Group documents by cluster
for cluster_id in range(5):
    cluster_docs = [doc for doc, c in zip(documents, clusters) if c == cluster_id]
    print(f"Cluster {cluster_id}: {len(cluster_docs)} documents")
Batch Processing
# Increase batch size for higher throughput
./llama-server -m model.gguf --embeddings --pooling mean -ub 8192 -b 4096
GPU Acceleration
# Offload to GPU for faster embedding generation
./llama-server -m model.gguf --embeddings --pooling mean -ngl 99
Caching
For repeated queries, cache embeddings to avoid recomputation:
import functools

@functools.lru_cache(maxsize=1000)
def get_embedding_cached(text):
    return get_embedding(text)
Troubleshooting
Error: “Pooling type required”
Ensure you specify a pooling method:
./llama-server -m model.gguf --embeddings --pooling mean
Poor Embedding Quality
Ensure you’re using a proper embedding model (not a chat/completion model)
Check that the pooling method matches the model’s training
Verify normalization is enabled for similarity comparisons
Low Throughput
Increase batch size: -ub 8192 -b 4096
Enable GPU offload: -ngl 99
Use smaller embedding models
Process documents in batches via the API
See Also
Server Full server API documentation
Multimodal Image and audio embeddings
CLI Tool Command-line inference
Speculative Decoding Speed up text generation