
Overview

This guide will walk you through making your first API call to a free LLM provider. We’ll use OpenRouter as our example since it offers 20+ free models with a simple OpenAI-compatible API.
Step 1: Choose a Provider

For this quickstart, we’ll use OpenRouter which offers:
  • 20+ free models including Llama 3.3 70B and Gemma 3
  • OpenAI-compatible API (easy migration)
  • 20 requests/minute, 50 requests/day (1000/day with $10 lifetime top-up)
Other great options for getting started: Groq (fastest inference), Cerebras (high token limits), or GitHub Models (40+ models).
Step 2: Get Your API Key

  1. Visit OpenRouter.ai
  2. Sign up for a free account
  3. Navigate to API Keys
  4. Create a new API key and copy it
Keep your API key secure! Never commit it to version control or share it publicly.
Step 3: Install Dependencies

Install the OpenAI SDK (works with OpenRouter’s compatible endpoint):
pip install openai
Step 4: Make Your First API Call

from openai import OpenAI

# Initialize client with OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

# Make a chat completion request
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
)

print(response.choices[0].message.content)
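If you plan to make several calls, it can help to package the request arguments once and reuse them. The `build_chat_request` helper below is a hypothetical convenience, not part of the SDK - it just builds the kwargs for `client.chat.completions.create()`:

```python
def build_chat_request(prompt, model="meta-llama/llama-3.3-70b-instruct:free"):
    """Package kwargs for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Usage:
# response = client.chat.completions.create(**build_chat_request("Hello!"))
```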

Available Free Models on OpenRouter

OpenRouter offers 20+ free models. Here are some popular options:

Llama 3.3 70B

General-purpose powerhouse - great for reasoning and complex tasks
meta-llama/llama-3.3-70b-instruct:free

Gemma 3 27B

Google’s efficient model - balanced performance and speed
google/gemma-3-27b-it:free

Mistral Small 24B

Fast and capable for most tasks
mistralai/mistral-small-3.1-24b-instruct:free

Qwen 3 Coder

Specialized for code generation
qwen/qwen3-coder:free
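Because every model shares the same API, switching models is just a string change. One way to organize that is a small task-to-model mapping; the categories below are an illustrative assumption based on the descriptions above, not an official classification:

```python
# Hypothetical mapping from task type to a free OpenRouter model ID.
FREE_MODELS = {
    "general": "meta-llama/llama-3.3-70b-instruct:free",
    "balanced": "google/gemma-3-27b-it:free",
    "fast": "mistralai/mistral-small-3.1-24b-instruct:free",
    "code": "qwen/qwen3-coder:free",
}

def pick_model(task):
    """Return a model ID for the task, falling back to the general model."""
    return FREE_MODELS.get(task, FREE_MODELS["general"])
```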

Try Other Providers

All providers with OpenAI-compatible APIs work similarly - just change the base_url and model name:

Groq (Ultra-Fast Inference)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a Python function to calculate fibonacci"}],
)

print(response.choices[0].message.content)
Groq Limits: 14,400 requests/day for Llama 3.1 8B, 1,000 requests/day for Llama 3.3 70B

Cerebras (High Token Limits)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="YOUR_CEREBRAS_API_KEY",
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "What are the benefits of edge computing?"}],
)

print(response.choices[0].message.content)
Cerebras Limits: 14,400 requests/day, 1,000,000 tokens/day - generous for most projects!

Advanced: Streaming Responses

For real-time applications like chatbots, use streaming to display responses as they’re generated:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

stream = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",
    messages=[{"role": "user", "content": "Write a short story about a robot"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
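If you also need the complete text once streaming finishes, collect the deltas as they arrive. A minimal sketch - the `accumulate_stream` helper is hypothetical; in practice you would feed it the `chunk.choices[0].delta.content` values from the loop above:

```python
def accumulate_stream(deltas):
    """Print streamed text deltas as they arrive and return the full text.

    None entries (chunks with no content) are skipped.
    """
    parts = []
    for delta in deltas:
        if delta is not None:
            print(delta, end="")
            parts.append(delta)
    return "".join(parts)
```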

Environment Variables Setup

Never hardcode API keys! Use environment variables:
OPENROUTER_API_KEY=sk-or-v1-...
GROQ_API_KEY=gsk_...
CEREBRAS_API_KEY=...
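In Python, you can then read the keys at startup and fail fast if one is missing. A minimal sketch (the `get_api_key` helper is hypothetical):

```python
import os

def get_api_key(name="OPENROUTER_API_KEY"):
    """Fetch an API key from the environment, failing loudly if it's missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} before running.")
    return key

# client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=get_api_key())
```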

Rate Limit Management

Most free providers have rate limits. Handle them gracefully:
from openai import OpenAI, RateLimitError
import os
import time

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="meta-llama/llama-3.3-70b-instruct:free",
                messages=messages,
            )
            return response
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise

response = make_request_with_retry(
    [{"role": "user", "content": "Hello!"}]
)

Next Steps

Free Providers

Browse all 13 always-free providers

Trial Credits

Explore providers offering trial credits

Best Practices

Learn tips for optimal API usage

Choosing a Provider

Find the best provider for your needs
Pro Tip: Start with multiple providers and implement fallback logic. If one hits rate limits, automatically switch to another!
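One way to sketch that fallback logic: keep an ordered list of (base_url, model) pairs and move down the chain when a provider keeps rate-limiting you. The chain below reuses the endpoints from this guide; the `next_provider` helper is a hypothetical illustration, not a library API:

```python
# Ordered fallback chain of (base_url, model) pairs from this guide.
FALLBACKS = [
    ("https://openrouter.ai/api/v1", "meta-llama/llama-3.3-70b-instruct:free"),
    ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    ("https://api.cerebras.ai/v1", "llama3.3-70b"),
]

def next_provider(failed_count):
    """Return the (base_url, model) to try after failed_count exhausted providers,
    or None when every provider in the chain is rate limited."""
    if failed_count < len(FALLBACKS):
        return FALLBACKS[failed_count]
    return None
```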
