
Overview

DeepInfra provides access to 100+ open-source and proprietary AI models with cost-effective inference, serverless deployment, and pay-as-you-go pricing, making it a good fit for developers seeking affordable AI at scale.

Base URL: https://api.deepinfra.com/v1/openai

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming
  • ✅ Vision (select models)
  • ✅ Function Calling (select models)
  • ❌ Embeddings (via separate API)
  • ❌ Image Generation (via separate API)
  • ❌ Fine-tuning

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Your DeepInfra API key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain DeepInfra's advantages"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Meta Llama

Model                                     Context  Price Tier  Description
meta-llama/Meta-Llama-3.1-405B-Instruct   128K     Premium     Largest Llama
meta-llama/Meta-Llama-3.1-70B-Instruct    128K     Mid         Balanced
meta-llama/Meta-Llama-3.1-8B-Instruct     128K     Budget      Fast, cheap
meta-llama/Llama-3.2-90B-Vision-Instruct  128K     Premium     Vision

Mistral & Mixtral

Model                                  Context  Price Tier
mistralai/Mixtral-8x22B-Instruct-v0.1  64K      Mid
mistralai/Mixtral-8x7B-Instruct-v0.1   32K      Budget
mistralai/Mistral-7B-Instruct-v0.3     32K      Budget

Qwen

Model                      Context  Description
Qwen/Qwen2.5-72B-Instruct  32K      Latest Qwen
Qwen/Qwen2.5-7B-Instruct   32K      Efficient
Qwen/QwQ-32B-Preview       32K      Reasoning

Specialized Models

Model                                           Type       Use Case
microsoft/WizardLM-2-8x22B                      Code/Chat  Coding tasks
cognitivecomputations/dolphin-2.6-mixtral-8x7b  Chat       Uncensored
lizpreciatior/lzlv_70b_fp16_hf                  Roleplay   Creative

DeepInfra excels at:
  • Cost-effectiveness - Up to 10x cheaper than alternatives
  • Model variety - 100+ models available
  • Serverless - No infrastructure management
  • Pay-as-you-go - No minimum commitment
  • Fast deployment - Instant access to models

Configuration Options

client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Bearer token
)

Header         Description        Required
Authorization  DeepInfra API key  Yes

Advanced Features

System Messages

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant."
        },
        {
            "role": "user",
            "content": "Write a Python function to sort a list"
        }
    ]
)

Temperature and Sampling

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Generate creative ideas"}],
    temperature=0.9,      # Higher for creativity
    top_p=0.95,          # Nucleus sampling
    max_tokens=500,      # Limit response length
    frequency_penalty=0.5 # Reduce repetition
)

Vision Models

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)

Multi-turn Conversations

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=conversation
)

Cost Optimization

Choose the Right Model

# For simple tasks - use 8B (cheapest)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Simple question"}]
)

# For complex tasks - use 70B (balanced)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Complex reasoning task"}]
)

# For most complex - use 405B (premium)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Very complex task"}]
)

Set Token Limits

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Brief answer please"}],
    max_tokens=100  # Control costs by limiting output
)

Fallback Configuration

Fall back to OpenAI when DeepInfra fails:

config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)

Load Balancing

Balance cost vs quality:

config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
            "weight": 0.7  # 70% to cheap model
        },
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"},
            "weight": 0.3  # 30% to better model
        }
    ]
}

client = Portkey().with_options(config=config)
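Under the hood, weighted load balancing amounts to a weighted random choice over targets: each request is routed to one target with probability proportional to its weight. Portkey handles this routing for you; the sketch below is purely illustrative, reusing the two models and 0.7/0.3 weights from the config above.

```python
import random

# Mirrors the targets in the config above: (model, weight)
targets = [
    ("meta-llama/Meta-Llama-3.1-8B-Instruct", 0.7),   # cheap model
    ("meta-llama/Meta-Llama-3.1-70B-Instruct", 0.3),  # better model
]

def pick_target(rng):
    """Pick a model for one request according to its weight."""
    models, weights = zip(*targets)
    return rng.choices(models, weights=weights, k=1)[0]

# Over many requests, roughly 70% of traffic lands on the 8B model
rng = random.Random(0)
counts = {model: 0 for model, _ in targets}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
```

Because routing is probabilistic, the split is only approximate per batch but converges to the configured weights over volume.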

Error Handling

from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")

Best Practices

  1. Start with smaller models - Test with 8B before using 70B
  2. Set max_tokens - Control costs
  3. Use streaming - Better UX
  4. Cache responses - Reduce API calls
  5. Monitor costs - Use DeepInfra's usage dashboard
  6. Choose right model - Balance cost vs quality
  7. Batch similar requests - More efficient
  8. Handle rate limits - Implement backoff
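For item 8, a generic exponential-backoff helper is often enough. This is a minimal sketch, not part of the Portkey SDK; the exception classes you pass in `retry_on` depend on your SDK version (e.g. the `RateLimitError` shown in Error Handling below).

```python
import time

def with_backoff(fn, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff on the given exceptions."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage with the client from the Quick Start:
# response = with_backoff(
#     lambda: client.chat.completions.create(
#         model="meta-llama/Meta-Llama-3.1-8B-Instruct",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```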

Use Cases

Budget-Conscious Development

# Use cheap 8B model for development
dev_client = Portkey(
    provider="deepinfra",
    Authorization="***"
)

response = dev_client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Test query"}]
)

High-Volume Applications

# Cost-effective for large scale
for user_query in user_queries:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": user_query}],
        max_tokens=200  # Limit costs
    )
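The loop above issues requests one at a time; at high volume, most wall-clock time is spent waiting on the network, so a thread pool can overlap requests. A sketch of the pattern, where `send_query` is a hypothetical stand-in for the completion call in the loop above:

```python
from concurrent.futures import ThreadPoolExecutor

def send_query(user_query):
    # Stand-in for client.chat.completions.create(...) from the loop above
    return f"answer to: {user_query}"

def answer_all(queries, max_workers=8):
    """Run queries concurrently; results come back in the same order as queries."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_query, queries))
```

Keep `max_workers` modest so concurrent requests stay within your rate limits.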

A/B Testing Models

# Test different models cost-effectively
models_to_test = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "mistralai/Mixtral-8x7B-Instruct-v0.1"
]

for model in models_to_test:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": test_prompt}]
    )
    # Compare results

Rate Limits

  • Generous free tier for testing
  • Pay-as-you-go with no minimums
  • Rate limits based on tier
  • Contact DeepInfra for enterprise needs

Pricing Advantages

DeepInfra typically offers:
  • 50-90% cheaper than major providers
  • No minimum spend requirement
  • Free credits for new users
  • Transparent pricing per token
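With per-token pricing, estimating a request's cost is simple arithmetic on the token counts in `response.usage`. The rates below are placeholders, not DeepInfra's actual prices; check the pricing page for current rates.

```python
# Hypothetical per-million-token prices in USD (placeholders, not real rates)
PRICE_PER_MTOK = {"input": 0.30, "output": 0.40}

def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate one request's cost in USD from response.usage token counts."""
    return (prompt_tokens * PRICE_PER_MTOK["input"]
            + completion_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# e.g. 1,000 prompt tokens + 500 completion tokens
cost = estimate_cost(1_000, 500)
```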

DeepInfra Pricing

View detailed pricing for all DeepInfra models

Getting Started

  1. Sign up at DeepInfra
  2. Get your API key
  3. Start with free credits
  4. Scale as needed

Related

  • Together AI - Alternative open models platform
  • Cost Optimization - Reduce AI costs
  • Load Balancing - Balance cost vs quality
  • Caching - Cache for cost savings
