
Overview

DeepInfra provides access to 100+ open-source and proprietary AI models with cost-effective inference, serverless deployment, and pay-as-you-go pricing, making it a good fit for developers seeking affordable AI at scale.

Base URL: https://api.deepinfra.com/v1/openai

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming
  • ✅ Vision (select models)
  • ✅ Function Calling (select models)
  • ❌ Embeddings (via separate API)
  • ❌ Image Generation (via separate API)
  • ❌ Fine-tuning

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Your DeepInfra API key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain DeepInfra's advantages"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Meta Llama

Model                                     Context  Price Tier  Description
meta-llama/Meta-Llama-3.1-405B-Instruct   128K     Premium     Largest Llama
meta-llama/Meta-Llama-3.1-70B-Instruct    128K     Mid         Balanced
meta-llama/Meta-Llama-3.1-8B-Instruct     128K     Budget      Fast, cheap
meta-llama/Llama-3.2-90B-Vision-Instruct  128K     Premium     Vision

Mistral & Mixtral

Model                                  Context  Price Tier
mistralai/Mixtral-8x22B-Instruct-v0.1  64K      Mid
mistralai/Mixtral-8x7B-Instruct-v0.1   32K      Budget
mistralai/Mistral-7B-Instruct-v0.3     32K      Budget

Qwen

Model                      Context  Description
Qwen/Qwen2.5-72B-Instruct  32K      Latest Qwen
Qwen/Qwen2.5-7B-Instruct   32K      Efficient
Qwen/QwQ-32B-Preview       32K      Reasoning

Specialized Models

Model                                           Type       Use Case
microsoft/WizardLM-2-8x22B                      Code/Chat  Coding tasks
cognitivecomputations/dolphin-2.6-mixtral-8x7b  Chat       Uncensored
lizpreciatior/lzlv_70b_fp16_hf                  Roleplay   Creative

DeepInfra excels at:
  • Cost-effectiveness - Up to 10x cheaper than alternatives
  • Model variety - 100+ models available
  • Serverless - No infrastructure management
  • Pay-as-you-go - No minimum commitment
  • Fast deployment - Instant access to models

Configuration Options

client = Portkey(
    provider="deepinfra",
    Authorization="***"  # Bearer token
)

Header         Description        Required
Authorization  DeepInfra API key  Yes

Advanced Features

System Messages

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant."
        },
        {
            "role": "user",
            "content": "Write a Python function to sort a list"
        }
    ]
)

Temperature and Sampling

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Generate creative ideas"}],
    temperature=0.9,      # Higher for creativity
    top_p=0.95,          # Nucleus sampling
    max_tokens=500,      # Limit response length
    frequency_penalty=0.5 # Reduce repetition
)

Vision Models

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)

Multi-turn Conversations

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=conversation
)

Cost Optimization

Choose the Right Model

# For simple tasks - use 8B (cheapest)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Simple question"}]
)

# For complex tasks - use 70B (balanced)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Complex reasoning task"}]
)

# For most complex - use 405B (premium)
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    messages=[{"role": "user", "content": "Very complex task"}]
)

Set Token Limits

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Brief answer please"}],
    max_tokens=100  # Control costs by limiting output
)

Fallback Configuration

Fall back to OpenAI when DeepInfra fails:

config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)

Load Balancing

Balance cost vs quality:

config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
            "weight": 0.7  # 70% to cheap model
        },
        {
            "provider": "deepinfra",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct"},
            "weight": 0.3  # 30% to better model
        }
    ]
}

client = Portkey().with_options(config=config)
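Under the hood, weighted load balancing amounts to a weighted random choice over targets: each request is routed to one target with probability proportional to its weight. Portkey handles this routing for you; the sketch below is purely illustrative, reusing the two models and 0.7/0.3 weights from the config above.

```python
import random

# Mirrors the targets in the config above: (model, weight)
targets = [
    ("meta-llama/Meta-Llama-3.1-8B-Instruct", 0.7),   # cheap model
    ("meta-llama/Meta-Llama-3.1-70B-Instruct", 0.3),  # better model
]

def pick_target(rng):
    """Pick a model for one request according to its weight."""
    models, weights = zip(*targets)
    return rng.choices(models, weights=weights, k=1)[0]

# Over many requests, roughly 70% of traffic lands on the 8B model
rng = random.Random(0)
counts = {model: 0 for model, _ in targets}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
```

Because routing is probabilistic, the split is only approximate per batch but converges to the configured weights over volume.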

Error Handling

from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")

Best Practices

  1. Start with smaller models - Test with 8B before using 70B
  2. Set max_tokens - Control costs
  3. Use streaming - Better UX
  4. Cache responses - Reduce API calls
  5. Monitor costs - Use DeepInfra's usage dashboard
  6. Choose right model - Balance cost vs quality
  7. Batch similar requests - More efficient
  8. Handle rate limits - Implement backoff
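For item 8, a generic exponential-backoff helper is often enough. This is a minimal sketch, not part of the Portkey SDK; the exception classes you pass in `retry_on` depend on your SDK version (e.g. the `RateLimitError` shown in Error Handling below).

```python
import time

def with_backoff(fn, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff on the given exceptions."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage with the client from the Quick Start:
# response = with_backoff(
#     lambda: client.chat.completions.create(
#         model="meta-llama/Meta-Llama-3.1-8B-Instruct",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```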

Use Cases

Budget-Conscious Development

# Use cheap 8B model for development
dev_client = Portkey(
    provider="deepinfra",
    Authorization="***"
)

response = dev_client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Test query"}]
)

High-Volume Applications

# Cost-effective for large scale
for user_query in user_queries:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": user_query}],
        max_tokens=200  # Limit costs
    )
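The loop above issues requests one at a time; at high volume, most wall-clock time is spent waiting on the network, so a thread pool can overlap requests. A sketch of the pattern, where `send_query` is a hypothetical stand-in for the completion call in the loop above:

```python
from concurrent.futures import ThreadPoolExecutor

def send_query(user_query):
    # Stand-in for client.chat.completions.create(...) from the loop above
    return f"answer to: {user_query}"

def answer_all(queries, max_workers=8):
    """Run queries concurrently; results come back in the same order as queries."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_query, queries))
```

Keep `max_workers` modest so concurrent requests stay within your rate limits.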

A/B Testing Models

# Test different models cost-effectively
models_to_test = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "mistralai/Mixtral-8x7B-Instruct-v0.1"
]

for model in models_to_test:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": test_prompt}]
    )
    # Compare results

Rate Limits

  • Generous free tier for testing
  • Pay-as-you-go with no minimums
  • Rate limits based on tier
  • Contact DeepInfra for enterprise needs

Pricing Advantages

DeepInfra typically offers:
  • 50-90% cheaper than major providers
  • No minimum spend requirement
  • Free credits for new users
  • Transparent pricing per token
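With per-token pricing, estimating a request's cost is simple arithmetic on the token counts in `response.usage`. The rates below are placeholders, not DeepInfra's actual prices; check the pricing page for current rates.

```python
# Hypothetical per-million-token prices in USD (placeholders, not real rates)
PRICE_PER_MTOK = {"input": 0.30, "output": 0.40}

def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate one request's cost in USD from response.usage token counts."""
    return (prompt_tokens * PRICE_PER_MTOK["input"]
            + completion_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# e.g. 1,000 prompt tokens + 500 completion tokens
cost = estimate_cost(1_000, 500)
```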

DeepInfra Pricing

View detailed pricing for all DeepInfra models

Getting Started

  1. Sign up at DeepInfra
  2. Get your API key
  3. Start with free credits
  4. Scale as needed

Related

  • Together AI - Alternative open models platform
  • Cost Optimization - Reduce AI costs
  • Load Balancing - Balance cost vs quality
  • Caching - Cache for cost savings
