The embeddings endpoint generates vector representations of text inputs. It follows the OpenAI Embeddings API format.

Endpoint

POST /v1/embeddings

Request body

model
string
required
The embedding model to use. Use a model name from /v1/models that supports embeddings.
input
string | array
required
The input text(s) to embed. Can be:
  • A single string: "Hello world"
  • An array of strings: ["Hello", "World"]
  • Token IDs: [123, 456, 789]
  • Array of token ID arrays: [[123, 456], [789, 012]]
encoding_format
string
default:"float"
The format to return embeddings in:
  • "float": Array of floating-point numbers
  • "base64": Base64-encoded string
dimensions
integer
default:"null"
Number of dimensions for the embedding output. Only supported for models trained with Matryoshka Representation Learning.

vLLM-specific parameters

truncate_prompt_tokens
integer
default:"null"
If set, inputs longer than this many tokens are truncated to this length before embedding.
additional_data
any
default:"null"
Additional data to include in the response, passed through unchanged.

Response format

object
string
Always "list".
data
array
Array of embedding objects.
object
string
Always "embedding".
embedding
array | string
The embedding vector, as an array of floats or base64 string.
index
integer
Index of the embedding in the input array.
model
string
The model used for embeddings.
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the input.
total_tokens
integer
Total tokens processed.
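
Pulling vectors out of a response with this shape takes only a few lines. A minimal sketch (the helper name is ours; `payload` stands in for the parsed JSON body):

```python
def extract_embeddings(payload: dict) -> list:
    """Return embedding vectors from a response, ordered by their `index` field."""
    items = sorted(payload["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]
```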

Example: Single text embedding

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0123,
        -0.0234,
        0.0345,
        ...
      ],
      "index": 0
    }
  ],
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}

Example: Batch embeddings

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": [
      "First document to embed",
      "Second document to embed",
      "Third document to embed"
    ]
  }'
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0123, -0.0234, ...],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [0.0456, -0.0567, ...],
      "index": 1
    },
    {
      "object": "embedding",
      "embedding": [0.0789, -0.0890, ...],
      "index": 2
    }
  ],
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 18
  }
}

Example: Matryoshka embeddings

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-ai/nomic-embed-text-v1.5",
    "input": "Sample text for embedding",
    "dimensions": 256
  }'
The response will contain embeddings with 256 dimensions instead of the default (e.g., 768).
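For models trained with Matryoshka Representation Learning, a similar effect can be approximated client-side by keeping the leading components of the full vector and re-normalizing. A sketch of that idea (the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(vec, dims: int) -> np.ndarray:
    # keep the first `dims` components, then re-normalize to unit length
    v = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```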

Example: Base64 encoding

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "Hello world",
    "encoding_format": "base64"
  }'
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": "AAAAAAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIj...",
      "index": 0
    }
  ],
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}
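
The base64 string is the raw bytes of the vector; assuming vLLM packs them as little-endian float32 (as the OpenAI API does), it can be decoded like this:

```python
import base64
import numpy as np

def decode_embedding(b64: str) -> np.ndarray:
    # decode base64, then reinterpret the bytes as little-endian float32
    return np.frombuffer(base64.b64decode(b64), dtype="<f4")
```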

Use cases

Semantic search

Embed documents and queries to find semantically similar content:
import requests
import numpy as np

# Embed documents
docs = ["Paris is the capital of France", "London is in England", "Berlin is in Germany"]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": docs}
)
doc_embeddings = [d["embedding"] for d in response.json()["data"]]

# Embed query
query = "What is the capital of France?"
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": query}
)
query_embedding = response.json()["data"][0]["embedding"]

# Find most similar document
similarities = [np.dot(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar = docs[np.argmax(similarities)]
print(most_similar)  # "Paris is the capital of France"
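
Note that the dot product above equals cosine similarity only when the vectors are unit-normalized (many sentence-transformers models normalize by default, but not all). A normalization-safe variant:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    # cosine similarity = dot product of the two vectors, divided by their norms
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```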

Clustering

Group similar texts together:
import requests
from sklearn.cluster import KMeans

texts = ["text1", "text2", "text3", ...]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": texts}
)
embeddings = [d["embedding"] for d in response.json()["data"]]

kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(embeddings)
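
The resulting labels can be mapped back to the original texts to inspect each group, for example:

```python
from collections import defaultdict

def group_by_cluster(texts, labels) -> dict:
    # map each cluster label to the texts assigned to it
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)
```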
