Overview
Anyscale Endpoints, built on Ray, provides serverless access to popular open-source models, offering fast inference, competitive pricing, and easy scaling. It is well suited to production deployments of Llama, Mixtral, and other open models. Base URL: https://api.endpoints.anyscale.com/v1
Supported Features
- ✅ Chat Completions
- ✅ Completions
- ✅ Streaming
- ✅ Embeddings
- ✅ Function Calling (select models)
- ❌ Vision
- ❌ Image Generation
- ❌ Fine-tuning
Quick Start
Chat Completions
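Anyscale Endpoints exposes an OpenAI-compatible API, so a chat completion is a POST to /chat/completions on the base URL above. A minimal stdlib-only sketch (the helper functions and chosen model are illustrative, not an official client):

```python
import json
import os
import urllib.request

# Base URL from the docs above; Anyscale's API is OpenAI-compatible.
BASE_URL = "https://api.endpoints.anyscale.com/v1"

def build_chat_request(messages, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       **params):
    """Build an OpenAI-style chat completion payload."""
    return {"model": model, "messages": messages, **params}

def send(path, payload, api_key):
    """POST a JSON payload with the required Authorization header."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    [{"role": "user", "content": "Say hello in one word."}],
    max_tokens=16,
)

# Only send when a key is configured, so the example also runs offline.
api_key = os.environ.get("ANYSCALE_API_KEY")
if api_key:
    reply = send("/chat/completions", payload, api_key)
    print(reply["choices"][0]["message"]["content"])
```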
Streaming
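Setting "stream": true in the request makes the API return server-sent events, where each data: line carries a JSON chunk with a delta. A sketch of the client-side parsing (the sample chunk shape is illustrative):

```python
import json

# Add "stream": True to any chat payload to receive incremental chunks.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "stream": True,
}

def parse_sse_line(line: bytes):
    """Decode one 'data: ...' line from a streaming response.
    Returns the JSON chunk, or None for keep-alives and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith(b"data:"):
        return None
    data = line[len(b"data:"):].strip()
    if data == b"[DONE]":
        return None
    return json.loads(data)

# Example of a chunk as it arrives over the wire (illustrative):
sample = b'data: {"choices":[{"delta":{"content":"Why"}}]}'
chunk = parse_sse_line(sample)
print(chunk["choices"][0]["delta"]["content"])
```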
Available Models
Meta Llama
| Model | Context | Description |
|---|---|---|
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128K | Largest Llama 3.1 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128K | Efficient, capable |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128K | Fast, compact |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 128K | Vision-enabled |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 128K | Smaller vision model |
Mistral AI
| Model | Context | Description |
|---|---|---|
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64K | Large MoE |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 32K | Efficient MoE |
| mistralai/Mistral-7B-Instruct-v0.1 | 32K | Compact |
Google Gemma
| Model | Context | Description |
|---|---|---|
| google/gemma-2-27b-it | 8K | Latest Gemma |
| google/gemma-2-9b-it | 8K | Efficient |
Qwen
| Model | Context | Description |
|---|---|---|
| Qwen/Qwen2.5-72B-Instruct | 32K | Latest Qwen |
| Qwen/Qwen2.5-7B-Instruct | 32K | Compact |
Embeddings
| Model | Dimensions | Description |
|---|---|---|
| thenlper/gte-large | 1024 | High-quality embeddings |
| BAAI/bge-large-en-v1.5 | 1024 | Popular choice |
Anyscale excels at:
- Production-ready - Built for scale on Ray
- Fast inference - Optimized serving
- Cost-effective - Competitive pricing
- Open models - Popular OSS models
- Easy scaling - Serverless architecture
Configuration Options
| Header | Description | Required |
|---|---|---|
| Authorization | Bearer token containing your Anyscale API key | Yes |
Advanced Features
System Messages
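A system message sets persistent behavior for the whole conversation; it is passed as the first entry in the messages array. A sketch of the payload (the model choice and prompt text are illustrative):

```python
messages = [
    # The system message steers tone and behavior for every turn that follows.
    {"role": "system",
     "content": "You are a concise assistant. Answer in one sentence."},
    {"role": "user", "content": "What is Ray?"},
]
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": messages,
    "max_tokens": 64,
}
print(payload["messages"][0]["role"])
```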
Temperature Control
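Temperature controls sampling randomness: values near 0 give repeatable, focused answers; higher values give more varied output. A sketch of two payloads built from the same request (values chosen for illustration):

```python
base = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Name a color."}],
}

# Near-deterministic sampling: good for factual or repeatable answers.
deterministic = {**base, "temperature": 0.0}

# Higher temperature: more varied output, useful for brainstorming.
creative = {**base, "temperature": 0.9}

print(deterministic["temperature"], creative["temperature"])
```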
Embeddings
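Embeddings use the same OpenAI-compatible shape: POST a model plus a list of input strings to /embeddings and get one vector back per string. A stdlib-only sketch (the helper functions are illustrative), with cosine similarity, the usual way to compare the returned vectors:

```python
import json
import math
import os
import urllib.request

# Model name taken from the embeddings table above.
payload = {
    "model": "thenlper/gte-large",
    "input": ["Ray scales Python workloads", "Serverless model inference"],
}

def embed(payload, api_key):
    """POST to /embeddings; returns one 1024-dim vector per input string."""
    req = urllib.request.Request(
        "https://api.endpoints.anyscale.com/v1/embeddings",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Only hit the network when a key is configured.
api_key = os.environ.get("ANYSCALE_API_KEY")
if api_key:
    vec_a, vec_b = embed(payload, api_key)
    print(cosine(vec_a, vec_b))
```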
Completions API
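The legacy completions endpoint takes a raw prompt instead of a messages array. A sketch of the payload (model and prompt are illustrative); it is sent to /completions with the same Authorization header, and the generated text comes back in choices[0].text:

```python
# Legacy /completions payload: a raw prompt rather than chat messages.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Ray is a framework for",
    "max_tokens": 32,
    "temperature": 0.2,
}

# POST to https://api.endpoints.anyscale.com/v1/completions;
# the response text is in response["choices"][0]["text"].
print(payload["prompt"])
```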
Fallback Configuration
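One way to sketch a fallback is to try providers in priority order and return the first success. The provider list and base URLs below are assumptions for illustration, not official configuration; the request here is simulated so the example runs offline:

```python
# Hypothetical provider list: names and URLs are assumptions, not official config.
PROVIDERS = [
    {"name": "anyscale", "base_url": "https://api.endpoints.anyscale.com/v1"},
    {"name": "together", "base_url": "https://api.together.xyz/v1"},  # assumed backup
]

def call_with_fallback(make_request, providers=PROVIDERS):
    """Try each provider in order; return (name, result) from the first success."""
    last_err = None
    for provider in providers:
        try:
            return provider["name"], make_request(provider["base_url"])
        except Exception as err:  # demo only; narrow this in real code
            last_err = err
    raise RuntimeError("all providers failed") from last_err

# Simulate the primary failing so the request falls through to the backup.
def fake_request(base_url):
    if "anyscale" in base_url:
        raise TimeoutError("primary unavailable")
    return {"ok": True}

provider, result = call_with_fallback(fake_request)
print(provider)
```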
A common setup falls back to Together AI, another OpenAI-compatible provider, when Anyscale requests fail.
Load Balancing
Requests can also be balanced across different models or deployments.
Error Handling
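Transient failures such as rate limits (429) and 5xx responses are usually handled with retries and exponential backoff. A stdlib-only sketch (the flaky function simulates a rate-limited endpoint so the example runs offline):

```python
import time
import urllib.error

# Status codes worth retrying: rate limits and transient server errors.
RETRYABLE = {429, 500, 502, 503}

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry transient HTTP errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint: rate-limited twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("url", 429, "rate limited", {}, None)
    return "ok"

print(with_retries(flaky))
```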
Best Practices
- Start with 70B - Best balance of speed and quality
- Use 8B for volume - Cost-effective for simple tasks
- Enable streaming - Better user experience
- Set appropriate max_tokens - Control costs and latency
- Use system prompts - Guide model behavior
- Implement retry logic - Handle transient failures
- Monitor usage - Track costs and performance
- Cache responses - Reduce redundant calls
Ray Integration
Anyscale Endpoints is built on Ray, providing:
- Automatic scaling based on demand
- Efficient resource utilization across clusters
- Fast cold starts with model caching
- High availability with redundancy
Pricing
Anyscale offers competitive pricing for open models.
Anyscale Pricing
View detailed pricing for all Anyscale models
Getting Started
- Sign up at Anyscale Endpoints
- Get your API key
- Start making requests
Related Resources
Together AI
Alternative open models platform
Groq
Ultra-fast inference
Load Balancing
Balance across providers
Fallbacks
Fallback configurations