The embeddings endpoint generates vector representations of text inputs. It follows the OpenAI Embeddings API format.
Endpoint
POST /v1/embeddings
Request body
model
The embedding model to use. Use a model name from /v1/models that supports embeddings.
input
The input text(s) to embed. Can be:
- A single string:
"Hello world"
- An array of strings:
["Hello", "World"]
- Token IDs:
[123, 456, 789]
- Array of token ID arrays:
[[123, 456], [789, 12]]
encoding_format
The format to return embeddings in:
"float": Array of floating-point numbers
"base64": Base64-encoded string
dimensions
Number of dimensions for the embedding output. Only supported by models trained with Matryoshka Representation Learning.
vLLM-specific parameters
Truncate input to this many tokens if it exceeds the limit.
Additional data to include in the response, passed through unchanged.
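The accepted shapes for the input field can be sketched as request payloads. This is illustrative only; the model name is simply the one used in the examples below:

```python
import json

# The four accepted shapes for the "input" field, shown as request payloads.
model = "sentence-transformers/all-MiniLM-L6-v2"

payloads = [
    {"model": model, "input": "Hello world"},            # single string
    {"model": model, "input": ["Hello", "World"]},       # array of strings
    {"model": model, "input": [123, 456, 789]},          # token IDs
    {"model": model, "input": [[123, 456], [789, 12]]},  # array of token ID arrays
]

for p in payloads:
    print(json.dumps(p))
```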
Response body
data: Array of embedding objects.
embedding: The embedding vector, as an array of floats or a base64-encoded string.
index: Index of the embedding in the input array.
model: The model used for embeddings.
usage: Token usage statistics.
prompt_tokens: Number of tokens in the input.
Example: Single text embedding
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": "The quick brown fox jumps over the lazy dog"
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [
0.0123,
-0.0234,
0.0345,
...
],
"index": 0
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
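A response shaped like the one above can be unpacked client-side. A minimal sketch using the illustrative values from the example (embeddings come back in input order; sorting by "index" makes that explicit):

```python
# Parsing a response shaped like the example above (values are illustrative).
response_json = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [0.0123, -0.0234, 0.0345], "index": 0}
    ],
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "usage": {"prompt_tokens": 12, "total_tokens": 12},
}

# Sort by "index" so vectors line up with the original inputs.
vectors = [
    item["embedding"]
    for item in sorted(response_json["data"], key=lambda d: d["index"])
]
print(len(vectors), len(vectors[0]))  # 1 3
```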
Example: Batch embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": [
"First document to embed",
"Second document to embed",
"Third document to embed"
]
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [0.0123, -0.0234, ...],
"index": 0
},
{
"object": "embedding",
"embedding": [0.0456, -0.0567, ...],
"index": 1
},
{
"object": "embedding",
"embedding": [0.0789, -0.0890, ...],
"index": 2
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 18,
"total_tokens": 18
}
}
Example: Matryoshka embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-ai/nomic-embed-text-v1.5",
"input": "Sample text for embedding",
"dimensions": 256
}'
The response will contain embeddings with 256 dimensions instead of the default (e.g., 768).
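The intuition behind this, sketched with random stand-in data (an illustration of the Matryoshka idea, not vLLM's exact implementation): such models pack the most important information into the leading dimensions, so a full-size vector can be cut to its first k dimensions and re-normalized.

```python
import numpy as np

# Illustration only: truncate a unit vector to its first k dimensions and
# re-normalize. The "dimensions" parameter asks the server to do this for you.
rng = np.random.default_rng(0)
full = rng.normal(size=768)       # random stand-in for a 768-dim embedding
full /= np.linalg.norm(full)

k = 256
truncated = full[:k]
truncated /= np.linalg.norm(truncated)

print(truncated.shape)  # (256,)
```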
Example: Base64 encoding
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": "Hello world",
"encoding_format": "base64"
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": "AAAAAAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIj...",
"index": 0
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 3,
"total_tokens": 3
}
}
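Assuming the base64 payload packs little-endian float32 values (as in the OpenAI API's base64 encoding), it can be decoded with the standard library. A round-trip sketch with synthetic values in place of a live server response:

```python
import base64
import struct

# Assumption: the base64 string encodes a packed array of little-endian
# float32 values, as in the OpenAI API's base64 embedding encoding.
def decode_embedding(b64: str) -> list[float]:
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip check with known values instead of a live response.
values = [0.0123, -0.0234, 0.0345]
encoded = base64.b64encode(struct.pack(f"<{len(values)}f", *values)).decode()
decoded = decode_embedding(encoded)
print([round(v, 4) for v in decoded])  # [0.0123, -0.0234, 0.0345]
```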
Use cases
Semantic search
Embed documents and queries to find semantically similar content:
import numpy as np
import requests

# Embed documents
docs = ["Paris is the capital of France", "London is in England", "Berlin is in Germany"]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": docs},
)
doc_embeddings = np.array([d["embedding"] for d in response.json()["data"]])

# Embed query
query = "What is the capital of France?"
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": query},
)
query_embedding = np.array(response.json()["data"][0]["embedding"])

# Rank documents by cosine similarity
similarities = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
most_similar = docs[int(np.argmax(similarities))]
print(most_similar)  # "Paris is the capital of France"
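Note that a raw dot product equals cosine similarity only when vectors are unit-normalized; if you are unsure whether the server normalizes its outputs, an explicit cosine helper is safer. A minimal sketch:

```python
import numpy as np

# Cosine similarity: dot product divided by the product of the norms.
# Equals a plain dot product only for unit-length vectors.
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([3.0, 4.0], [4.0, 3.0]))  # 0.96
```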
Clustering
Group similar texts together:
import requests
from sklearn.cluster import KMeans

texts = ["text1", "text2", "text3", ...]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": texts},
)
embeddings = [d["embedding"] for d in response.json()["data"]]

# Group the embeddings into 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(embeddings)