The embeddings endpoint generates vector representations of text inputs. It follows the OpenAI Embeddings API format.
Endpoint
POST /v1/embeddings
Request body
model
The embedding model to use. Use a model name from /v1/models that supports embeddings.
input
The input text(s) to embed. Can be:
- A single string:
"Hello world"
- An array of strings:
["Hello", "World"]
- Token IDs:
[123, 456, 789]
- Array of token ID arrays:
[[123, 456], [789, 12]]
encoding_format
The format to return embeddings in:
"float": Array of floating-point numbers
"base64": Base64-encoded string
dimensions
Number of dimensions for the embedding output. Only supported by models trained with Matryoshka Representation Learning.
vLLM-specific parameters
Truncate input to this many tokens if it exceeds the limit.
Additional data to include in the response, passed through unchanged.
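The accepted shapes for the input field can be sketched as request payloads. This is illustrative only; the model name is simply the one used in the examples below:

```python
import json

# The four accepted shapes for the "input" field, shown as request payloads.
model = "sentence-transformers/all-MiniLM-L6-v2"

payloads = [
    {"model": model, "input": "Hello world"},            # single string
    {"model": model, "input": ["Hello", "World"]},       # array of strings
    {"model": model, "input": [123, 456, 789]},          # token IDs
    {"model": model, "input": [[123, 456], [789, 12]]},  # array of token ID arrays
]

for p in payloads:
    print(json.dumps(p))
```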
Response body
data: Array of embedding objects.
embedding: The embedding vector, as an array of floats or a base64-encoded string.
index: Index of the embedding in the input array.
model: The model used for embeddings.
usage: Token usage statistics.
prompt_tokens: Number of tokens in the input.
Example: Single text embedding
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": "The quick brown fox jumps over the lazy dog"
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [
0.0123,
-0.0234,
0.0345,
...
],
"index": 0
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
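A response shaped like the one above can be unpacked client-side. A minimal sketch using the illustrative values from the example (embeddings come back in input order; sorting by "index" makes that explicit):

```python
# Parsing a response shaped like the example above (values are illustrative).
response_json = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [0.0123, -0.0234, 0.0345], "index": 0}
    ],
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "usage": {"prompt_tokens": 12, "total_tokens": 12},
}

# Sort by "index" so vectors line up with the original inputs.
vectors = [
    item["embedding"]
    for item in sorted(response_json["data"], key=lambda d: d["index"])
]
print(len(vectors), len(vectors[0]))  # 1 3
```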
Example: Batch embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": [
"First document to embed",
"Second document to embed",
"Third document to embed"
]
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [0.0123, -0.0234, ...],
"index": 0
},
{
"object": "embedding",
"embedding": [0.0456, -0.0567, ...],
"index": 1
},
{
"object": "embedding",
"embedding": [0.0789, -0.0890, ...],
"index": 2
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 18,
"total_tokens": 18
}
}
Example: Matryoshka embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-ai/nomic-embed-text-v1.5",
"input": "Sample text for embedding",
"dimensions": 256
}'
The response will contain embeddings with 256 dimensions instead of the default (e.g., 768).
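The intuition behind this, sketched with random stand-in data (an illustration of the Matryoshka idea, not vLLM's exact implementation): such models pack the most important information into the leading dimensions, so a full-size vector can be cut to its first k dimensions and re-normalized.

```python
import numpy as np

# Illustration only: truncate a unit vector to its first k dimensions and
# re-normalize. The "dimensions" parameter asks the server to do this for you.
rng = np.random.default_rng(0)
full = rng.normal(size=768)       # random stand-in for a 768-dim embedding
full /= np.linalg.norm(full)

k = 256
truncated = full[:k]
truncated /= np.linalg.norm(truncated)

print(truncated.shape)  # (256,)
```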
Example: Base64 encoding
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": "Hello world",
"encoding_format": "base64"
}'
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": "AAAAAAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIj...",
"index": 0
}
],
"model": "sentence-transformers/all-MiniLM-L6-v2",
"usage": {
"prompt_tokens": 3,
"total_tokens": 3
}
}
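Assuming the base64 payload packs little-endian float32 values (as in the OpenAI API's base64 encoding), it can be decoded with the standard library. A round-trip sketch with synthetic values in place of a live server response:

```python
import base64
import struct

# Assumption: the base64 string encodes a packed array of little-endian
# float32 values, as in the OpenAI API's base64 embedding encoding.
def decode_embedding(b64: str) -> list[float]:
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip check with known values instead of a live response.
values = [0.0123, -0.0234, 0.0345]
encoded = base64.b64encode(struct.pack(f"<{len(values)}f", *values)).decode()
decoded = decode_embedding(encoded)
print([round(v, 4) for v in decoded])  # [0.0123, -0.0234, 0.0345]
```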
Use cases
Semantic search
Embed documents and queries to find semantically similar content:
import numpy as np
import requests

# Embed documents
docs = ["Paris is the capital of France", "London is in England", "Berlin is in Germany"]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": docs},
)
doc_embeddings = np.array([d["embedding"] for d in response.json()["data"]])

# Embed query
query = "What is the capital of France?"
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": query},
)
query_embedding = np.array(response.json()["data"][0]["embedding"])

# Rank documents by cosine similarity
similarities = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
most_similar = docs[int(np.argmax(similarities))]
print(most_similar)  # "Paris is the capital of France"
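Note that a raw dot product equals cosine similarity only when vectors are unit-normalized; if you are unsure whether the server normalizes its outputs, an explicit cosine helper is safer. A minimal sketch:

```python
import numpy as np

# Cosine similarity: dot product divided by the product of the norms.
# Equals a plain dot product only for unit-length vectors.
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([3.0, 4.0], [4.0, 3.0]))  # 0.96
```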
Clustering
Group similar texts together:
import requests
from sklearn.cluster import KMeans

texts = ["text1", "text2", "text3", ...]
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": texts},
)
embeddings = [d["embedding"] for d in response.json()["data"]]

# Group the embeddings into 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(embeddings)