
Overview

Anyscale Endpoints provides serverless access to popular open-source models, built on Ray, with fast inference, competitive pricing, and easy scaling. It is well suited to production deployments of Llama, Mixtral, and other open models.

Base URL: https://api.endpoints.anyscale.com/v1

Supported Features

  • ✅ Chat Completions
  • ✅ Completions
  • ✅ Streaming
  • ✅ Embeddings
  • ✅ Function Calling (select models)
  • ❌ Vision
  • ❌ Image Generation
  • ❌ Fine-tuning

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="anyscale",
    Authorization="***"  # Your Anyscale API key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain Anyscale Endpoints"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Available Models

Meta Llama

| Model | Context | Description |
|-------|---------|-------------|
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128K | Largest Llama 3.1 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128K | Efficient, capable |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128K | Fast, compact |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 128K | Vision-enabled |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 128K | Smaller vision model |

Mistral AI

| Model | Context | Description |
|-------|---------|-------------|
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64K | Large MoE |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 32K | Efficient MoE |
| mistralai/Mistral-7B-Instruct-v0.1 | 32K | Compact |

Google Gemma

| Model | Context | Description |
|-------|---------|-------------|
| google/gemma-2-27b-it | 8K | Latest Gemma |
| google/gemma-2-9b-it | 8K | Efficient |

Qwen

| Model | Context | Description |
|-------|---------|-------------|
| Qwen/Qwen2.5-72B-Instruct | 32K | Latest Qwen |
| Qwen/Qwen2.5-7B-Instruct | 32K | Compact |

Embeddings

| Model | Dimensions | Description |
|-------|------------|-------------|
| thenlper/gte-large | 1024 | High-quality embeddings |
| BAAI/bge-large-en-v1.5 | 1024 | Popular choice |

Anyscale excels at:
  • Production-ready - Built for scale on Ray
  • Fast inference - Optimized serving
  • Cost-effective - Competitive pricing
  • Open models - Popular OSS models
  • Easy scaling - Serverless architecture

Configuration Options

client = Portkey(
    provider="anyscale",
    Authorization="***"  # Bearer token
)

| Header | Description | Required |
|--------|-------------|----------|
| Authorization | Anyscale API key | Yes |

Advanced Features

System Messages

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant specialized in Python."
        },
        {
            "role": "user",
            "content": "How do I use async/await?"
        }
    ]
)

Temperature Control

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write creative content"}],
    temperature=0.9,  # Higher for creativity
    max_tokens=500
)

Embeddings

response = client.embeddings.create(
    model="thenlper/gte-large",
    input="Anyscale provides serverless inference for open models"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")

Batch embeddings:

response = client.embeddings.create(
    model="thenlper/gte-large",
    input=[
        "First document",
        "Second document",
        "Third document"
    ]
)

for i, item in enumerate(response.data):
    print(f"Document {i}: {len(item.embedding)} dimensions")
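
Embedding vectors like these are typically compared with cosine similarity. A minimal pure-Python sketch, independent of the API call above (the sample vectors are illustrative, not real model output):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In practice you would apply this to the 1024-dimensional vectors returned by `thenlper/gte-large` to rank documents by relevance.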

Completions API

response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100
)

print(response.choices[0].text)

Fallback Configuration

Fallback to Together AI:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "anyscale",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"}
        },
        {
            "provider": "together-ai",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"}
        }
    ]
}

client = Portkey().with_options(config=config)

Load Balancing

Balance across different models:
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "anyscale",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"},
            "weight": 0.7
        },
        {
            "provider": "anyscale",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
            "weight": 0.3
        }
    ]
}

client = Portkey().with_options(config=config)
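
With these weights, roughly 70% of requests route to the 70B model and 30% to the 8B model. The behavior is that of a weighted random choice; a sketch of the idea (`pick_target` is illustrative, not part of the Portkey SDK):

```python
import random

def pick_target(targets):
    # Weighted random selection, as a loadbalance strategy applies it per request.
    weights = [t["weight"] for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]

targets = [
    {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "weight": 0.7},
    {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "weight": 0.3},
]

random.seed(0)
picks = [pick_target(targets)["model"] for _ in range(10_000)]
share_70b = picks.count("meta-llama/Meta-Llama-3.1-70B-Instruct") / len(picks)
print(f"70B share: {share_70b:.2f}")  # ~0.70
```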

Error Handling

from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")

Best Practices

  1. Start with 70B - Best balance of speed and quality
  2. Use 8B for volume - Cost-effective for simple tasks
  3. Enable streaming - Better user experience
  4. Set appropriate max_tokens - Control costs and latency
  5. Use system prompts - Guide model behavior
  6. Implement retry logic - Handle transient failures
  7. Monitor usage - Track costs and performance
  8. Cache responses - Reduce redundant calls
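
Retry logic (practice 6) is commonly implemented with exponential backoff. A minimal sketch; `flaky_call` is a stand-in for a real API request that fails transiently:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    # Retry fn with exponential backoff: base_delay, 2x, 4x, ...
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the error.
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for an API call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # "ok" on the 3rd attempt
```

In production you would wrap the `client.chat.completions.create(...)` call and retry only on transient errors such as `RateLimitError`, not on authentication failures.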

Ray Integration

Anyscale Endpoints is built on Ray, providing:
  • Automatic scaling based on demand
  • Efficient resource utilization across clusters
  • Fast cold starts with model caching
  • High availability with redundancy

Pricing

Anyscale offers competitive pricing for open models. See the Anyscale pricing page for detailed pricing of all models.

Getting Started

  1. Sign up at Anyscale Endpoints
  2. Get your API key
  3. Start making requests

Related

  • Together AI - Alternative open models platform
  • Groq - Ultra-fast inference
  • Load Balancing - Balance across providers
  • Fallbacks - Fallback configurations
