Overview
Free LLM API resources exist because providers want developers to experiment and build amazing things. Following these best practices ensures these services remain available for everyone.
Don’t abuse these services. Excessive usage, automation attacks, or violating terms of service can result in these free tiers being shut down for everyone.
General Principles
- **Respect Rate Limits**: Never try to circumvent or work around rate limits with multiple accounts or IPs.
- **Use Appropriate Models**: Choose the smallest model that can accomplish your task effectively.
- **Cache Aggressively**: Store responses for identical requests to minimize redundant API calls.
- **Monitor Usage**: Track your consumption to stay well within limits and plan accordingly.
Security Best Practices
1. API Key Management
Never hardcode API keys in your source code:

```python
# ❌ BAD - API key in code
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-v1-abc123..."  # NEVER DO THIS
)

# ✅ GOOD - Use environment variables
import os

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
```
Use a `.env` file:

```bash
OPENROUTER_API_KEY=sk-or-v1-your-key-here
GROQ_API_KEY=gsk_your-key-here
CEREBRAS_API_KEY=your-key-here
```

And add it to `.gitignore`:

```
.env
.env.local
*.key
secrets/
```
Use tools like `python-dotenv` (Python) or `dotenv` (Node.js) to load environment variables from `.env` files.
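If you want to see what these loaders do under the hood, the core behavior of `load_dotenv()` can be sketched in a few lines of standard-library Python (a simplified stand-in, not the library's full parser, which also handles quoting, `export` prefixes, and variable expansion):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): KEY=VALUE lines only."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and comments
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    # setdefault: real environment variables win over .env values
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # No .env file; rely on the real environment

load_env_file()
api_key = os.getenv("OPENROUTER_API_KEY")
```

In real projects, prefer the maintained library over this sketch.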
Set a schedule to rotate API keys:
- Monthly rotation for development projects
- Weekly rotation for production applications
- Immediate rotation if you suspect a key has been compromised
Most providers allow multiple keys:

```python
import os
from datetime import datetime

# Use date-based key selection for graceful rotation
if datetime.now().day < 15:
    api_key = os.getenv("OPENROUTER_KEY_A")
else:
    api_key = os.getenv("OPENROUTER_KEY_B")
```
Use different keys for different environments:
- Development: one set of keys
- Staging: different keys
- Production: separate keys

Benefits:
- Easier to track usage by environment
- A compromised dev key doesn't affect production
- Keys can be revoked without disrupting all environments
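One way to wire this up is a small helper that resolves the key from an environment-specific variable name. The `APP_ENV` variable and the `*_API_KEY_DEV`/`_STAGING`/`_PROD` naming below are illustrative conventions, not anything providers require:

```python
import os

def get_api_key(provider: str) -> str:
    """Resolve the API key for the current deployment environment."""
    env = os.getenv("APP_ENV", "dev").upper()  # "dev", "staging", or "prod"
    key = os.getenv(f"{provider.upper()}_API_KEY_{env}")
    if not key:
        raise RuntimeError(f"No API key configured for {provider} in {env.lower()}")
    return key
```

Each environment then only ever has its own keys set, so a leaked development machine exposes nothing from production.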
2. Input Validation
Always validate and sanitize user inputs:
```python
def validate_prompt(user_input: str, max_length: int = 1000) -> str:
    # Limit length
    if len(user_input) > max_length:
        raise ValueError(f"Input too long: {len(user_input)} > {max_length}")

    # Normalize whitespace
    user_input = user_input.strip()

    # Check for prompt injection attempts
    suspicious_patterns = [
        "ignore previous instructions",
        "disregard all previous",
        "you are now",
        "new instructions:"
    ]
    for pattern in suspicious_patterns:
        if pattern in user_input.lower():
            raise ValueError("Potential prompt injection detected")

    return user_input

# Usage
try:
    safe_input = validate_prompt(user_input)
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": safe_input}]
    )
except ValueError as e:
    print(f"Invalid input: {e}")
```
Prompt injection is a real security concern. Always validate inputs, especially in user-facing applications.
3. Rate Limiting on Your End
Implement application-level rate limiting:
```python
from flask import Flask, request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    default_limits=["100 per day", "10 per minute"]
)

@app.route("/api/chat", methods=["POST"])
@limiter.limit("5 per minute")
def chat():
    user_message = request.json.get("message")
    # Process with LLM API
    return {"response": "..."}
```
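Server-side limits like the Flask example protect your endpoint from users; it can also help to throttle your own outgoing calls so you stay under a provider's requests-per-minute cap. A minimal client-side sketch (the class and its parameters are illustrative, not a library API):

```python
import time
from collections import deque

class RequestThrottle:
    """Sliding-window throttle: allow at most max_requests per window."""

    def __init__(self, max_requests: int, per_seconds: float = 60.0):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()

    def wait(self) -> None:
        """Block until another request is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window
            time.sleep(max(0.0, self.per_seconds - (now - self.timestamps[0])))
        self.timestamps.append(time.monotonic())

# Usage: call throttle.wait() before each API request
throttle = RequestThrottle(max_requests=30, per_seconds=60.0)
```

This is approximate (it does not coordinate across processes), but it is usually enough to avoid burning retries on 429 responses.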
Performance Optimization
1. Response Caching
Python with Redis:
```python
import os
import redis
import json
import hashlib
from openai import OpenAI

redis_client = redis.Redis(host="localhost", port=6379, db=0)
llm_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_KEY")
)

def get_cache_key(model: str, messages: list) -> str:
    """Generate a cache key from the model and messages."""
    content = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def cached_completion(model: str, messages: list, ttl: int = 3600):
    # Check cache
    cache_key = get_cache_key(model, messages)
    cached = redis_client.get(cache_key)
    if cached:
        print("Cache hit!")
        return json.loads(cached)

    # Cache miss - call API
    print("Cache miss - calling API")
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages
    )

    # Store in cache with expiry
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(response.model_dump())
    )
    return response

# Usage
response = cached_completion(
    "meta-llama/llama-3.3-70b-instruct:free",
    [{"role": "user", "content": "What is Python?"}]
)
```
Cache responses for at least 1 hour for frequently asked questions. For dynamic content, use shorter TTLs (5-15 minutes).
2. Model Selection Strategy
Choose the right model for the task:
| Task Type | Recommended Model | Reasoning |
| --- | --- | --- |
| Simple Q&A | Llama 3.2 3B, Gemma 3 4B | Fast, efficient, sufficient for basic queries |
| Code Generation | Qwen 3 Coder, Codestral | Specialized for code |
| Complex Reasoning | Llama 3.3 70B, Qwen 3 235B | Larger models for complex logic |
| Long Documents | Mistral models, Gemini | Better context handling |
| Multilingual | Cohere Aya, Qwen 3 | Optimized for multiple languages |
```python
def select_model(task_type: str, complexity: str) -> dict:
    """Select an optimal model and provider based on the task."""
    if task_type == "code" and complexity == "simple":
        return {
            "provider": "openrouter",
            "model": "qwen/qwen3-coder:free",
            "reason": "Fast, free code model"
        }
    if task_type == "reasoning" and complexity == "complex":
        return {
            "provider": "cerebras",
            "model": "llama3.3-70b",
            "reason": "Large model with high daily limit"
        }
    # Default to a fast, small model
    return {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "reason": "Ultra-fast for general tasks"
    }
```
3. Streaming for Better UX
Use streaming to show responses as they generate:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.getenv("GROQ_API_KEY")
)

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

stream_response("Explain machine learning")
```
Streaming uses the same API quota but improves perceived performance: users see results immediately.
Cost Optimization
1. Token Efficiency
Minimize token usage without sacrificing quality:
```python
# ❌ Inefficient prompt
prompt = """
You are a highly knowledgeable and experienced assistant with extensive
expertise in software development, computer science, and technology.
Please provide a detailed, comprehensive, and well-structured explanation
of the following topic, including examples, best practices, and any relevant
considerations:

What is a REST API?
"""

# ✅ Efficient prompt
prompt = "Explain REST APIs concisely with an example."

# Both can produce similar-quality responses, but the second uses ~90% fewer tokens
```
Use shorter system messages:

```python
# ❌ Verbose system message
system = "You are an expert programmer with 20 years of experience..."

# ✅ Concise system message
system = "Expert programmer. Be concise."
```
2. Conversation Management
Trim conversation history intelligently:
```python
def trim_conversation(messages: list, max_tokens: int = 2000) -> list:
    """Keep the conversation under a token budget while preserving context."""
    # Always keep system messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    # Estimate tokens (rough heuristic: 1 token ≈ 4 chars)
    total_chars = sum(len(m["content"]) for m in other_msgs)
    if total_chars > max_tokens * 4:
        # Keep the most recent messages that fit the budget
        avg_chars = total_chars / len(other_msgs)
        keep_count = max(1, int(max_tokens * 4 / avg_chars))
        other_msgs = other_msgs[-keep_count:]

    return system_msgs + other_msgs

# Usage
full_conversation = [
    {"role": "system", "content": "Helpful assistant"},
    # ... 50 messages of conversation history ...
    {"role": "user", "content": "Current question"}
]
trimmed = trim_conversation(full_conversation, max_tokens=2000)
```
3. Multi-Provider Strategy
Use cheaper/faster providers for simple tasks:
```python
import os
from openai import OpenAI

class SmartRouter:
    def __init__(self):
        self.groq_client = OpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.getenv("GROQ_KEY")
        )
        self.cerebras_client = OpenAI(
            base_url="https://api.cerebras.ai/v1",
            api_key=os.getenv("CEREBRAS_KEY")
        )

    def route_request(self, messages: list):
        # Estimate complexity from the latest message
        prompt = messages[-1]["content"]

        # Simple/short queries -> Groq (fastest)
        if len(prompt) < 100 and "?" in prompt:
            return self.groq_client.chat.completions.create(
                model="llama-3.1-8b-instant",
                messages=messages
            )

        # Complex queries -> Cerebras (larger model, higher limits)
        return self.cerebras_client.chat.completions.create(
            model="llama3.3-70b",
            messages=messages
        )

router = SmartRouter()
response = router.route_request([{"role": "user", "content": "Hi!"}])
```
Privacy and Compliance
Providers using your data for training (on free tiers):
- Google AI Studio (outside EU/UK/EEA/CH)
- Mistral La Plateforme (Experiment plan)

Providers NOT using your data:
- Google AI Studio (EU/UK/EEA/CH regions)
- Most other providers (check their privacy policy)
If handling sensitive data, always check the provider’s privacy policy and consider paying for a tier with stronger privacy guarantees.
Never send sensitive information through free APIs:
- Personally identifiable information (PII)
- Health records (PHI)
- Financial data (credit cards, SSNs)
- Passwords or credentials
- Trade secrets or confidential business data
Sanitize inputs:

```python
import re

def remove_pii(text: str) -> str:
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Remove phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Remove SSN-like patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text
```
Consider data residency requirements:
- EU data: Use Google AI Studio (EU regions) or Scaleway (France)
- US data: Most providers are US-based
- China data: Use Alibaba Cloud (International)

Check provider terms for:
- Where data is processed
- Where data is stored
- How long data is retained
Error Handling
Implement comprehensive error handling:
```python
import time
import logging
from openai import OpenAI, APIError, RateLimitError, APIConnectionError

logger = logging.getLogger(__name__)

def safe_api_call(client, model, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError as e:
            logger.warning(f"Rate limit hit on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except APIConnectionError as e:
            logger.error(f"Connection error on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise
        except APIError as e:
            logger.error(f"API error: {e}")
            raise  # Don't retry on other API errors
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            raise

# Usage
try:
    response = safe_api_call(
        client,
        "llama-3.3-70b-versatile",
        [{"role": "user", "content": "Hello"}]
    )
except Exception as e:
    print(f"Failed after retries: {e}")
```
Monitoring and Logging
Track important metrics:
```python
import time
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class APIMetrics:
    timestamp: float
    provider: str
    model: str
    tokens_used: int
    latency_ms: float
    success: bool
    error: Optional[str] = None

class MetricsLogger:
    def __init__(self, log_file="api_metrics.jsonl"):
        self.log_file = log_file

    def log_call(self, metric: APIMetrics):
        with open(self.log_file, "a") as f:
            f.write(json.dumps(asdict(metric)) + "\n")

    def get_daily_stats(self) -> dict:
        # Parse logs and compute stats for the last 24 hours
        with open(self.log_file) as f:
            metrics = [json.loads(line) for line in f]
        today = [m for m in metrics if time.time() - m["timestamp"] < 86400]
        if not today:
            return {"total_requests": 0}
        return {
            "total_requests": len(today),
            "total_tokens": sum(m["tokens_used"] for m in today),
            "avg_latency": sum(m["latency_ms"] for m in today) / len(today),
            "success_rate": sum(m["success"] for m in today) / len(today)
        }

logger = MetricsLogger()

# Log each API call
start = time.time()
try:
    response = client.chat.completions.create(...)
    logger.log_call(APIMetrics(
        timestamp=time.time(),
        provider="groq",
        model="llama-3.3-70b-versatile",
        tokens_used=response.usage.total_tokens,
        latency_ms=(time.time() - start) * 1000,
        success=True
    ))
except Exception as e:
    logger.log_call(APIMetrics(
        timestamp=time.time(),
        provider="groq",
        model="llama-3.3-70b-versatile",
        tokens_used=0,
        latency_ms=(time.time() - start) * 1000,
        success=False,
        error=str(e)
    ))
```
Testing and Development
- **Use Mocks in Tests**: Don't hit real APIs in unit tests; use mocked responses.
- **Set Development Limits**: Implement stricter limits in development to avoid accidentally exhausting production quotas.
- **Separate API Keys**: Use different API keys for dev, staging, and production.
- **Test Failover**: Regularly test your multi-provider fallback logic.
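For the first point, the standard library's `unittest.mock` is enough to exercise LLM-calling code without any network access. In the sketch below, `chat_once` is a hypothetical wrapper standing in for your own code under test:

```python
from unittest.mock import MagicMock

def chat_once(client, prompt: str) -> str:
    """Hypothetical wrapper around an OpenAI-compatible client."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def test_chat_once():
    mock_client = MagicMock()
    # Shape the mock to mimic a chat.completions response object
    mock_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="mocked reply"))
    ]
    assert chat_once(mock_client, "hi") == "mocked reply"
    mock_client.chat.completions.create.assert_called_once()

test_chat_once()
```

Because the mock records every call, you can also assert on the model name and messages your code actually sent.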
Ethical Considerations
These free services exist to help developers learn and build. Please use them responsibly:
- Don't create multiple accounts to bypass rate limits
- Don't use free tiers for commercial production at scale
- Don't generate spam, harmful, or illegal content
- Consider upgrading to paid tiers when your usage grows
- Report bugs and issues to help improve the services
Checklist
Before deploying your application:
- [ ] API keys loaded from environment variables, with `.env` in `.gitignore`
- [ ] Application-level rate limiting in place
- [ ] Response caching for repeated requests
- [ ] Error handling with retries and exponential backoff
- [ ] Usage monitoring and logging enabled
- [ ] User inputs validated and PII sanitized
- [ ] Separate keys for dev, staging, and production
Next Steps
- **Rate Limits Guide**: Deep dive into rate limit optimization strategies
- **Choosing a Provider**: Find the best provider for your specific use case