The Models API provides information about available LLM models from all configured providers, including capabilities, pricing, and context limits.

List models

Retrieve all available models from configured providers.

Endpoint

GET /api/models

Parameters

provider (string)
Filter models by provider: openai, anthropic, gemini, bedrock
capability (string)
Filter by model capability: chat, embedding, image_generation

Response

models (array)
Array of model objects

Example

cURL
curl http://localhost:9090/api/models
Python
import requests

response = requests.get('http://localhost:9090/api/models')
response.raise_for_status()
models = response.json()

# List all chat models
chat_models = [m for m in models['models'] if 'chat' in m['capabilities']]
for model in chat_models:
    print(f"{model['id']} - {model['provider']}")
    print(f"  Context: {model['context_window']} tokens")
    print(f"  Price: ${model['pricing']['input']}/1M input, ${model['pricing']['output']}/1M output")

Filter by provider

Get models from a specific provider:
curl "http://localhost:9090/api/models?provider=anthropic"
Returns only Anthropic Claude models.

Filter by capability

Get models with specific capabilities:
# Chat models only
curl "http://localhost:9090/api/models?capability=chat"

# Embedding models
curl "http://localhost:9090/api/models?capability=embedding"

# Image generation models
curl "http://localhost:9090/api/models?capability=image_generation"
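The same filters can be combined from Python by passing them as query parameters. A minimal sketch (the helper name is ours; the parameters are the ones documented above):

```python
def build_model_query(provider=None, capability=None):
    """Assemble the query-parameter dict for GET /api/models.
    Both filters are optional and can be combined."""
    params = {}
    if provider:
        params['provider'] = provider
    if capability:
        params['capability'] = capability
    return params

# Usage against a running vLLora instance:
#   import requests
#   r = requests.get('http://localhost:9090/api/models',
#                    params=build_model_query(provider='anthropic', capability='chat'))
#   models = r.json()['models']
print(build_model_query(provider='anthropic', capability='chat'))
# → {'provider': 'anthropic', 'capability': 'chat'}
```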

Using models in requests

Use the model ID from the API in your requests:
curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
vLLora automatically routes to the correct provider based on the model ID.
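In code, that means any model ID returned by /api/models can be dropped straight into a completion request. A sketch of selecting a routable chat model from the listing (the helper and sample data are ours; the field names follow the Python example above):

```python
def pick_chat_model(models, preferred_provider=None):
    """Return the id of the first chat-capable model,
    optionally restricted to a single provider."""
    for m in models:
        if 'chat' not in m.get('capabilities', []):
            continue
        if preferred_provider and m.get('provider') != preferred_provider:
            continue
        return m['id']
    return None

# The chosen id goes into a standard OpenAI-style body:
#   {"model": model_id, "messages": [{"role": "user", "content": "Hello"}]}
models = [
    {'id': 'text-embedding-3-small', 'provider': 'openai',
     'capabilities': ['embedding']},
    {'id': 'claude-3-5-haiku-20241022', 'provider': 'anthropic',
     'capabilities': ['chat']},
]
print(pick_chat_model(models, preferred_provider='anthropic'))
# → claude-3-5-haiku-20241022
```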

Model capabilities

Chat models

Models with chat capabilities support:
  • Conversational interfaces
  • Multi-turn dialogues
  • System prompts
  • Tool/function calling (if supported)
Example chat models:
  • gpt-4o, gpt-4o-mini (OpenAI)
  • claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022 (Anthropic)
  • gemini-2.0-flash-exp, gemini-1.5-pro (Google)
  • anthropic.claude-3-5-sonnet-20241022-v2:0 (AWS Bedrock)

Embedding models

Models that generate vector embeddings:
  • text-embedding-3-small, text-embedding-3-large (OpenAI)
  • text-embedding-004 (Google)

Image generation models

Models that generate images from text:
  • dall-e-3, dall-e-2 (OpenAI)
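The three capability groups above can be reconstructed client-side from a single listing. A small sketch, assuming the `capabilities` field shown in the earlier Python example:

```python
from collections import defaultdict

def group_by_capability(models):
    """Index model ids under each capability they advertise."""
    groups = defaultdict(list)
    for m in models:
        for cap in m.get('capabilities', []):
            groups[cap].append(m['id'])
    return dict(groups)

models = [
    {'id': 'gpt-4o', 'capabilities': ['chat']},
    {'id': 'dall-e-3', 'capabilities': ['image_generation']},
    {'id': 'text-embedding-3-small', 'capabilities': ['embedding']},
]
print(group_by_capability(models))
# → {'chat': ['gpt-4o'], 'image_generation': ['dall-e-3'], 'embedding': ['text-embedding-3-small']}
```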

Pricing information

The pricing field shows costs per 1 million tokens:
{
  "id": "gpt-4o",
  "pricing": {
    "input": 2.50,   // $2.50 per 1M input tokens
    "output": 10.00  // $10.00 per 1M output tokens
  }
}
Use this to estimate costs before making requests.
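For example, a cost estimate is a straight per-million-token calculation over the `pricing` fields (the helper name is ours):

```python
def estimate_cost(model, input_tokens, output_tokens):
    """Estimate request cost in USD from per-1M-token pricing."""
    p = model['pricing']
    return (input_tokens * p['input'] + output_tokens * p['output']) / 1_000_000

gpt4o = {'id': 'gpt-4o', 'pricing': {'input': 2.50, 'output': 10.00}}
print(estimate_cost(gpt4o, input_tokens=1000, output_tokens=500))
# → 0.0075
```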

Context windows

The context_window field indicates the maximum total tokens (input + output):
  • gpt-4o: 128,000 tokens
  • claude-3-5-sonnet-20241022: 200,000 tokens
  • gemini-2.0-flash-exp: 1,000,000 tokens
Ensure your requests fit within the model’s context window.
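A rough pre-flight check might look like the sketch below. The ~4-characters-per-token ratio is a crude heuristic of ours, not something the API reports; use a real tokenizer when accuracy matters:

```python
def fits_context(prompt, max_output_tokens, context_window, chars_per_token=4):
    """Rough check that estimated prompt tokens plus reserved output
    tokens fit inside the model's context window."""
    est_prompt_tokens = len(prompt) // chars_per_token + 1
    return est_prompt_tokens + max_output_tokens <= context_window

print(fits_context('Hello' * 100, max_output_tokens=1000, context_window=128_000))
# → True
```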

CLI command

List models from the command line:
vllora list
This displays all available models in a formatted table.

Sync models

Update the model database from provider APIs:
# Sync all models
vllora sync --models

# Sync a specific provider, e.g. OpenAI
vllora sync --models --providers openai
Model information is embedded in vLLora at build time for fast startup. Use sync to update with the latest models from provider APIs.

Best practices

  • Use smaller/faster models (gpt-4o-mini, claude-3-5-haiku) for simple tasks
  • Use larger models (gpt-4o, claude-3-5-sonnet) for complex reasoning
  • Check context window if you have long conversations
Check model pricing before deploying. Cost differences can be significant:
  • gpt-4o-mini: $0.15 input / $0.60 output per 1M tokens
  • gpt-4o: $2.50 input / $10.00 output per 1M tokens
Not all models support all features. Check the features array:
  • Tool calling support varies by model
  • Vision capabilities are model-specific
  • JSON mode isn’t universal
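A defensive feature check is one line; the sketch below assumes a `features` array of string flags on each model object (the flag names here are illustrative, so match them to your deployment's actual schema):

```python
def supports(model, feature):
    """Check a model's advertised features before relying on one."""
    return feature in model.get('features', [])

model = {'id': 'gpt-4o', 'features': ['tool_calling', 'vision', 'json_mode']}
print(supports(model, 'tool_calling'))
# → True
```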
Run vllora sync --models periodically to get new models and updated pricing.

Next steps

Chat Completions

Use models in chat completion requests

Embeddings

Generate embeddings with embedding models

Image Generation

Create images with DALL-E models

Providers

Learn about provider support
