Overview
Free LLM API resources exist because providers want developers to experiment and build amazing things. Following these best practices ensures these services remain available for everyone.
Don’t abuse these services. Excessive usage, automation attacks, or violating terms of service can result in these free tiers being shut down for everyone.
General Principles
- **Respect Rate Limits**: Never try to circumvent or work around rate limits with multiple accounts or IPs.
- **Use Appropriate Models**: Choose the smallest model that can accomplish your task effectively.
- **Cache Aggressively**: Store responses for identical requests to minimize redundant API calls.
- **Monitor Usage**: Track your consumption to stay well within limits and plan accordingly.
Security Best Practices
1. API Key Management
Never hardcode API keys in your source code:

```python
# ❌ BAD - API key in code
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-v1-abc123..."  # NEVER DO THIS
)

# ✅ GOOD - Use environment variables
import os

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
```
Use a `.env` file:

```bash
OPENROUTER_API_KEY=sk-or-v1-your-key-here
GROQ_API_KEY=gsk_your-key-here
CEREBRAS_API_KEY=your-key-here
```

And add it to `.gitignore`:

```
.env
.env.local
*.key
secrets/
```
Use tools like `python-dotenv` (Python) or `dotenv` (Node.js) to load environment variables from `.env` files.
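If you want to see what these loaders do under the hood, the core behavior of `load_dotenv()` can be sketched in a few lines of standard-library Python (a simplified stand-in, not the library's full parser, which also handles quoting, `export` prefixes, and variable expansion):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): KEY=VALUE lines only."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and comments
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    # setdefault: real environment variables win over .env values
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # No .env file; rely on the real environment

load_env_file()
api_key = os.getenv("OPENROUTER_API_KEY")
```

In real projects, prefer the maintained library over this sketch.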
Set a schedule to rotate API keys:
- Monthly rotation for development projects
- Weekly rotation for production applications
- Immediate rotation if you suspect a key has been compromised
Most providers allow multiple keys:

```python
import os
from datetime import datetime

# Use date-based key selection for graceful rotation
if datetime.now().day < 15:
    api_key = os.getenv("OPENROUTER_KEY_A")
else:
    api_key = os.getenv("OPENROUTER_KEY_B")
```
Use different keys for different environments:
- Development: one set of keys
- Staging: different keys
- Production: separate keys

Benefits:
- Easier to track usage by environment
- A compromised dev key doesn't affect production
- Keys can be revoked without disrupting all environments
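One way to wire this up is a small helper that resolves the key from an environment-specific variable name. The `APP_ENV` variable and the `*_API_KEY_DEV`/`_STAGING`/`_PROD` naming below are illustrative conventions, not anything providers require:

```python
import os

def get_api_key(provider: str) -> str:
    """Resolve the API key for the current deployment environment."""
    env = os.getenv("APP_ENV", "dev").upper()  # "dev", "staging", or "prod"
    key = os.getenv(f"{provider.upper()}_API_KEY_{env}")
    if not key:
        raise RuntimeError(f"No API key configured for {provider} in {env.lower()}")
    return key
```

Each environment then only ever has its own keys set, so a leaked development machine exposes nothing from production.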
2. Input Validation
Always validate and sanitize user inputs:
```python
def validate_prompt(user_input: str, max_length: int = 1000) -> str:
    # Limit length
    if len(user_input) > max_length:
        raise ValueError(f"Input too long: {len(user_input)} > {max_length}")

    # Normalize whitespace
    user_input = user_input.strip()

    # Check for prompt injection attempts
    suspicious_patterns = [
        "ignore previous instructions",
        "disregard all previous",
        "you are now",
        "new instructions:"
    ]
    for pattern in suspicious_patterns:
        if pattern in user_input.lower():
            raise ValueError("Potential prompt injection detected")

    return user_input

# Usage
try:
    safe_input = validate_prompt(user_input)
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": safe_input}]
    )
except ValueError as e:
    print(f"Invalid input: {e}")
```
Prompt injection is a real security concern. Always validate inputs, especially in user-facing applications.
3. Rate Limiting on Your End
Implement application-level rate limiting:
```python
from flask import Flask, request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    default_limits=["100 per day", "10 per minute"]
)

@app.route("/api/chat", methods=["POST"])
@limiter.limit("5 per minute")
def chat():
    user_message = request.json.get("message")
    # Process with LLM API
    return {"response": "..."}
```
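Server-side limits like the Flask example protect your endpoint from users; it can also help to throttle your own outgoing calls so you stay under a provider's requests-per-minute cap. A minimal client-side sketch (the class and its parameters are illustrative, not a library API):

```python
import time
from collections import deque

class RequestThrottle:
    """Sliding-window throttle: allow at most max_requests per window."""

    def __init__(self, max_requests: int, per_seconds: float = 60.0):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()

    def wait(self) -> None:
        """Block until another request is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window
            time.sleep(max(0.0, self.per_seconds - (now - self.timestamps[0])))
        self.timestamps.append(time.monotonic())

# Usage: call throttle.wait() before each API request
throttle = RequestThrottle(max_requests=30, per_seconds=60.0)
```

This is approximate (it does not coordinate across processes), but it is usually enough to avoid burning retries on 429 responses.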
Performance Optimization
1. Response Caching
Python with Redis:
```python
import os
import redis
import json
import hashlib
from openai import OpenAI

redis_client = redis.Redis(host="localhost", port=6379, db=0)
llm_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_KEY")
)

def get_cache_key(model: str, messages: list) -> str:
    """Generate a cache key from the model and messages."""
    content = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def cached_completion(model: str, messages: list, ttl: int = 3600):
    # Check cache
    cache_key = get_cache_key(model, messages)
    cached = redis_client.get(cache_key)
    if cached:
        print("Cache hit!")
        return json.loads(cached)

    # Cache miss - call API
    print("Cache miss - calling API")
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages
    )

    # Store in cache with expiry
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(response.model_dump())
    )
    return response

# Usage
response = cached_completion(
    "meta-llama/llama-3.3-70b-instruct:free",
    [{"role": "user", "content": "What is Python?"}]
)
```
Cache responses for at least 1 hour for frequently asked questions. For dynamic content, use shorter TTLs (5-15 minutes).
2. Model Selection Strategy
Choose the right model for the task:
| Task Type | Recommended Model | Reasoning |
| --- | --- | --- |
| Simple Q&A | Llama 3.2 3B, Gemma 3 4B | Fast, efficient, sufficient for basic queries |
| Code Generation | Qwen 3 Coder, Codestral | Specialized for code |
| Complex Reasoning | Llama 3.3 70B, Qwen 3 235B | Larger models for complex logic |
| Long Documents | Mistral models, Gemini | Better context handling |
| Multilingual | Cohere Aya, Qwen 3 | Optimized for multiple languages |
```python
def select_model(task_type: str, complexity: str) -> dict:
    """Select an optimal model and provider based on the task."""
    if task_type == "code" and complexity == "simple":
        return {
            "provider": "openrouter",
            "model": "qwen/qwen3-coder:free",
            "reason": "Fast, free code model"
        }
    if task_type == "reasoning" and complexity == "complex":
        return {
            "provider": "cerebras",
            "model": "llama3.3-70b",
            "reason": "Large model with high daily limit"
        }
    # Default to a fast, small model
    return {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "reason": "Ultra-fast for general tasks"
    }
```
3. Streaming for Better UX
Use streaming to show responses as they generate:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.getenv("GROQ_API_KEY")
)

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

stream_response("Explain machine learning")
```
Streaming uses the same API quota but improves perceived performance: users see results immediately.
Cost Optimization
1. Token Efficiency
Minimize token usage without sacrificing quality:
```python
# ❌ Inefficient prompt
prompt = """
You are a highly knowledgeable and experienced assistant with extensive
expertise in software development, computer science, and technology.
Please provide a detailed, comprehensive, and well-structured explanation
of the following topic, including examples, best practices, and any relevant
considerations:

What is a REST API?
"""

# ✅ Efficient prompt
prompt = "Explain REST APIs concisely with an example."

# Both can produce similar-quality responses, but the second uses ~90% fewer tokens
```
Use shorter system messages:

```python
# ❌ Verbose system message
system = "You are an expert programmer with 20 years of experience..."

# ✅ Concise system message
system = "Expert programmer. Be concise."
```
2. Conversation Management
Trim conversation history intelligently:
```python
def trim_conversation(messages: list, max_tokens: int = 2000) -> list:
    """Keep the conversation under a token budget while preserving context."""
    # Always keep system messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    # Estimate tokens (rough heuristic: 1 token ≈ 4 chars)
    total_chars = sum(len(m["content"]) for m in other_msgs)
    if total_chars > max_tokens * 4:
        # Keep the most recent messages that fit the budget
        avg_chars = total_chars / len(other_msgs)
        keep_count = max(1, int(max_tokens * 4 / avg_chars))
        other_msgs = other_msgs[-keep_count:]

    return system_msgs + other_msgs

# Usage
full_conversation = [
    {"role": "system", "content": "Helpful assistant"},
    # ... 50 messages of conversation history ...
    {"role": "user", "content": "Current question"}
]
trimmed = trim_conversation(full_conversation, max_tokens=2000)
```
3. Multi-Provider Strategy
Use cheaper/faster providers for simple tasks:
```python
import os
from openai import OpenAI

class SmartRouter:
    def __init__(self):
        self.groq_client = OpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.getenv("GROQ_KEY")
        )
        self.cerebras_client = OpenAI(
            base_url="https://api.cerebras.ai/v1",
            api_key=os.getenv("CEREBRAS_KEY")
        )

    def route_request(self, messages: list):
        # Estimate complexity from the latest message
        prompt = messages[-1]["content"]

        # Simple/short queries -> Groq (fastest)
        if len(prompt) < 100 and "?" in prompt:
            return self.groq_client.chat.completions.create(
                model="llama-3.1-8b-instant",
                messages=messages
            )

        # Complex queries -> Cerebras (larger model, higher limits)
        return self.cerebras_client.chat.completions.create(
            model="llama3.3-70b",
            messages=messages
        )

router = SmartRouter()
response = router.route_request([{"role": "user", "content": "Hi!"}])
```
Privacy and Compliance
Providers using your data for training (on free tiers):
- Google AI Studio (outside EU/UK/EEA/CH)
- Mistral La Plateforme (Experiment plan)

Providers NOT using your data:
- Google AI Studio (EU/UK/EEA/CH regions)
- Most other providers (check their privacy policy)
If handling sensitive data, always check the provider’s privacy policy and consider paying for a tier with stronger privacy guarantees.
Never send sensitive information through free APIs:
- Personally identifiable information (PII)
- Health records (PHI)
- Financial data (credit cards, SSNs)
- Passwords or credentials
- Trade secrets or confidential business data
Sanitize inputs:

```python
import re

def remove_pii(text: str) -> str:
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Remove phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Remove SSN-like patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text
```
Consider data residency requirements:
- EU data: Use Google AI Studio (EU regions) or Scaleway (France)
- US data: Most providers are US-based
- China data: Use Alibaba Cloud (International)

Check provider terms for:
- Where data is processed
- Where data is stored
- How long data is retained
Error Handling
Implement comprehensive error handling:
```python
import time
import logging
from openai import OpenAI, APIError, RateLimitError, APIConnectionError

logger = logging.getLogger(__name__)

def safe_api_call(client, model, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError as e:
            logger.warning(f"Rate limit hit on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except APIConnectionError as e:
            logger.error(f"Connection error on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise
        except APIError as e:
            logger.error(f"API error: {e}")
            raise  # Don't retry on other API errors
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            raise

# Usage
try:
    response = safe_api_call(
        client,
        "llama-3.3-70b-versatile",
        [{"role": "user", "content": "Hello"}]
    )
except Exception as e:
    print(f"Failed after retries: {e}")
```
Monitoring and Logging
Track important metrics:
```python
import time
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class APIMetrics:
    timestamp: float
    provider: str
    model: str
    tokens_used: int
    latency_ms: float
    success: bool
    error: Optional[str] = None

class MetricsLogger:
    def __init__(self, log_file="api_metrics.jsonl"):
        self.log_file = log_file

    def log_call(self, metric: APIMetrics):
        with open(self.log_file, "a") as f:
            f.write(json.dumps(asdict(metric)) + "\n")

    def get_daily_stats(self) -> dict:
        # Parse logs and compute stats for the last 24 hours
        with open(self.log_file) as f:
            metrics = [json.loads(line) for line in f]
        today = [m for m in metrics if time.time() - m["timestamp"] < 86400]
        if not today:
            return {"total_requests": 0}
        return {
            "total_requests": len(today),
            "total_tokens": sum(m["tokens_used"] for m in today),
            "avg_latency": sum(m["latency_ms"] for m in today) / len(today),
            "success_rate": sum(m["success"] for m in today) / len(today)
        }

logger = MetricsLogger()

# Log each API call
start = time.time()
try:
    response = client.chat.completions.create(...)
    logger.log_call(APIMetrics(
        timestamp=time.time(),
        provider="groq",
        model="llama-3.3-70b-versatile",
        tokens_used=response.usage.total_tokens,
        latency_ms=(time.time() - start) * 1000,
        success=True
    ))
except Exception as e:
    logger.log_call(APIMetrics(
        timestamp=time.time(),
        provider="groq",
        model="llama-3.3-70b-versatile",
        tokens_used=0,
        latency_ms=(time.time() - start) * 1000,
        success=False,
        error=str(e)
    ))
```
Testing and Development
- **Use Mocks in Tests**: Don't hit real APIs in unit tests; use mocked responses.
- **Set Development Limits**: Implement stricter limits in development to avoid accidentally exhausting production quotas.
- **Separate API Keys**: Use different API keys for dev, staging, and production.
- **Test Failover**: Regularly test your multi-provider fallback logic.
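For the first point, the standard library's `unittest.mock` is enough to exercise LLM-calling code without any network access. In the sketch below, `chat_once` is a hypothetical wrapper standing in for your own code under test:

```python
from unittest.mock import MagicMock

def chat_once(client, prompt: str) -> str:
    """Hypothetical wrapper around an OpenAI-compatible client."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def test_chat_once():
    mock_client = MagicMock()
    # Shape the mock to mimic a chat.completions response object
    mock_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="mocked reply"))
    ]
    assert chat_once(mock_client, "hi") == "mocked reply"
    mock_client.chat.completions.create.assert_called_once()

test_chat_once()
```

Because the mock records every call, you can also assert on the model name and messages your code actually sent.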
Ethical Considerations
These free services exist to help developers learn and build. Please use them responsibly:
- Don't create multiple accounts to bypass rate limits
- Don't use free tiers for commercial production at scale
- Don't generate spam, harmful, or illegal content
- Consider upgrading to paid tiers when your usage grows
- Report bugs and issues to help improve the services
Checklist
Before deploying your application:
- [ ] API keys loaded from environment variables, with `.env` in `.gitignore`
- [ ] Application-level rate limiting in place
- [ ] Response caching for repeated requests
- [ ] Error handling with retries and exponential backoff
- [ ] Usage monitoring and logging enabled
- [ ] User inputs validated and PII sanitized
- [ ] Separate keys for dev, staging, and production
Next Steps
- **Rate Limits Guide**: Deep dive into rate limit optimization strategies
- **Choosing a Provider**: Find the best provider for your specific use case