Overview
Anyscale Endpoints, built on Ray, provides serverless access to popular open-source models, offering fast inference, competitive pricing, and easy scaling. It is well suited to production deployments of Llama, Mixtral, and other open models. Base URL: https://api.endpoints.anyscale.com/v1
Supported Features
- ✅ Chat Completions
- ✅ Completions
- ✅ Streaming
- ✅ Embeddings
- ✅ Function Calling (select models)
- ❌ Vision
- ❌ Image Generation
- ❌ Fine-tuning
Quick Start
Chat Completions
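Anyscale Endpoints exposes an OpenAI-compatible API, so a chat completion is a POST to /chat/completions on the base URL above. A minimal stdlib-only sketch (the helper functions and chosen model are illustrative, not an official client):

```python
import json
import os
import urllib.request

# Base URL from the docs above; Anyscale's API is OpenAI-compatible.
BASE_URL = "https://api.endpoints.anyscale.com/v1"

def build_chat_request(messages, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       **params):
    """Build an OpenAI-style chat completion payload."""
    return {"model": model, "messages": messages, **params}

def send(path, payload, api_key):
    """POST a JSON payload with the required Authorization header."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    [{"role": "user", "content": "Say hello in one word."}],
    max_tokens=16,
)

# Only send when a key is configured, so the example also runs offline.
api_key = os.environ.get("ANYSCALE_API_KEY")
if api_key:
    reply = send("/chat/completions", payload, api_key)
    print(reply["choices"][0]["message"]["content"])
```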
Streaming
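Setting "stream": true in the request makes the API return server-sent events, where each data: line carries a JSON chunk with a delta. A sketch of the client-side parsing (the sample chunk shape is illustrative):

```python
import json

# Add "stream": True to any chat payload to receive incremental chunks.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "stream": True,
}

def parse_sse_line(line: bytes):
    """Decode one 'data: ...' line from a streaming response.
    Returns the JSON chunk, or None for keep-alives and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith(b"data:"):
        return None
    data = line[len(b"data:"):].strip()
    if data == b"[DONE]":
        return None
    return json.loads(data)

# Example of a chunk as it arrives over the wire (illustrative):
sample = b'data: {"choices":[{"delta":{"content":"Why"}}]}'
chunk = parse_sse_line(sample)
print(chunk["choices"][0]["delta"]["content"])
```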
Available Models
Meta Llama
| Model | Context | Description |
|---|---|---|
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128K | Largest Llama 3.1 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128K | Efficient, capable |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128K | Fast, compact |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 128K | Vision-enabled |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 128K | Smaller vision model |
Mistral AI
| Model | Context | Description |
|---|---|---|
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64K | Large MoE |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 32K | Efficient MoE |
| mistralai/Mistral-7B-Instruct-v0.1 | 32K | Compact |
Google Gemma
| Model | Context | Description |
|---|---|---|
| google/gemma-2-27b-it | 8K | Latest Gemma |
| google/gemma-2-9b-it | 8K | Efficient |
Qwen
| Model | Context | Description |
|---|---|---|
| Qwen/Qwen2.5-72B-Instruct | 32K | Latest Qwen |
| Qwen/Qwen2.5-7B-Instruct | 32K | Compact |
Embeddings
| Model | Dimensions | Description |
|---|---|---|
| thenlper/gte-large | 1024 | High-quality embeddings |
| BAAI/bge-large-en-v1.5 | 1024 | Popular choice |
Anyscale excels at:
- Production-ready - Built for scale on Ray
- Fast inference - Optimized serving
- Cost-effective - Competitive pricing
- Open models - Popular OSS models
- Easy scaling - Serverless architecture
Configuration Options
| Header | Description | Required |
|---|---|---|
| Authorization | Bearer token containing your Anyscale API key | Yes |
Advanced Features
System Messages
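A system message sets persistent behavior for the whole conversation; it is passed as the first entry in the messages array. A sketch of the payload (the model choice and prompt text are illustrative):

```python
messages = [
    # The system message steers tone and behavior for every turn that follows.
    {"role": "system",
     "content": "You are a concise assistant. Answer in one sentence."},
    {"role": "user", "content": "What is Ray?"},
]
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": messages,
    "max_tokens": 64,
}
print(payload["messages"][0]["role"])
```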
Temperature Control
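Temperature controls sampling randomness: values near 0 give repeatable, focused answers; higher values give more varied output. A sketch of two payloads built from the same request (values chosen for illustration):

```python
base = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Name a color."}],
}

# Near-deterministic sampling: good for factual or repeatable answers.
deterministic = {**base, "temperature": 0.0}

# Higher temperature: more varied output, useful for brainstorming.
creative = {**base, "temperature": 0.9}

print(deterministic["temperature"], creative["temperature"])
```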
Embeddings
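Embeddings use the same OpenAI-compatible shape: POST a model plus a list of input strings to /embeddings and get one vector back per string. A stdlib-only sketch (the helper functions are illustrative), with cosine similarity, the usual way to compare the returned vectors:

```python
import json
import math
import os
import urllib.request

# Model name taken from the embeddings table above.
payload = {
    "model": "thenlper/gte-large",
    "input": ["Ray scales Python workloads", "Serverless model inference"],
}

def embed(payload, api_key):
    """POST to /embeddings; returns one 1024-dim vector per input string."""
    req = urllib.request.Request(
        "https://api.endpoints.anyscale.com/v1/embeddings",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Only hit the network when a key is configured.
api_key = os.environ.get("ANYSCALE_API_KEY")
if api_key:
    vec_a, vec_b = embed(payload, api_key)
    print(cosine(vec_a, vec_b))
```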
Completions API
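The legacy completions endpoint takes a raw prompt instead of a messages array. A sketch of the payload (model and prompt are illustrative); it is sent to /completions with the same Authorization header, and the generated text comes back in choices[0].text:

```python
# Legacy /completions payload: a raw prompt rather than chat messages.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Ray is a framework for",
    "max_tokens": 32,
    "temperature": 0.2,
}

# POST to https://api.endpoints.anyscale.com/v1/completions;
# the response text is in response["choices"][0]["text"].
print(payload["prompt"])
```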
Fallback Configuration
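One way to sketch a fallback is to try providers in priority order and return the first success. The provider list and base URLs below are assumptions for illustration, not official configuration; the request here is simulated so the example runs offline:

```python
# Hypothetical provider list: names and URLs are assumptions, not official config.
PROVIDERS = [
    {"name": "anyscale", "base_url": "https://api.endpoints.anyscale.com/v1"},
    {"name": "together", "base_url": "https://api.together.xyz/v1"},  # assumed backup
]

def call_with_fallback(make_request, providers=PROVIDERS):
    """Try each provider in order; return (name, result) from the first success."""
    last_err = None
    for provider in providers:
        try:
            return provider["name"], make_request(provider["base_url"])
        except Exception as err:  # demo only; narrow this in real code
            last_err = err
    raise RuntimeError("all providers failed") from last_err

# Simulate the primary failing so the request falls through to the backup.
def fake_request(base_url):
    if "anyscale" in base_url:
        raise TimeoutError("primary unavailable")
    return {"ok": True}

provider, result = call_with_fallback(fake_request)
print(provider)
```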
A common setup falls back to Together AI, another OpenAI-compatible provider, when Anyscale requests fail.
Load Balancing
Requests can also be balanced across different models or deployments.
Error Handling
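Transient failures such as rate limits (429) and 5xx responses are usually handled with retries and exponential backoff. A stdlib-only sketch (the flaky function simulates a rate-limited endpoint so the example runs offline):

```python
import time
import urllib.error

# Status codes worth retrying: rate limits and transient server errors.
RETRYABLE = {429, 500, 502, 503}

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry transient HTTP errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint: rate-limited twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("url", 429, "rate limited", {}, None)
    return "ok"

print(with_retries(flaky))
```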
Best Practices
- Start with 70B - Best balance of speed and quality
- Use 8B for volume - Cost-effective for simple tasks
- Enable streaming - Better user experience
- Set appropriate max_tokens - Control costs and latency
- Use system prompts - Guide model behavior
- Implement retry logic - Handle transient failures
- Monitor usage - Track costs and performance
- Cache responses - Reduce redundant calls
Ray Integration
Anyscale Endpoints is built on Ray, providing:
- Automatic scaling based on demand
- Efficient resource utilization across clusters
- Fast cold starts with model caching
- High availability with redundancy
Pricing
Anyscale offers competitive pricing for open models.
Anyscale Pricing
View detailed pricing for all Anyscale models
Getting Started
- Sign up at Anyscale Endpoints
- Get your API key
- Start making requests
Related Resources
Together AI
Alternative open models platform
Groq
Ultra-fast inference
Load Balancing
Balance across providers
Fallbacks
Fallback configurations