
Overview

LiteLLM provides full support for OpenAI's models, including GPT-4o, o1, o3-mini, and more. You can use all OpenAI features, including streaming, function calling, vision, audio, and batch processing.

Quick Start

1. Install LiteLLM

pip install litellm
2. Set API Key

export OPENAI_API_KEY="sk-..."
3. Make Your First Call

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)

Supported Models

The examples below cover the most commonly used GPT-4o variants: the flagship model, the fast low-cost variant, and vision input.
# GPT-4o - Best overall model
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# GPT-4o-mini - Fast and cost-effective
response = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this text"}]
)

# GPT-4o with vision
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

Authentication

Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="sk-..."
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Streaming

Get real-time responses as they’re generated:
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a long story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Async Streaming

from litellm import acompletion
import asyncio

async def stream_response():
    response = await acompletion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a story"}],
        stream=True
    )
    
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(stream_response())

Function Calling

OpenAI models support sophisticated function/tool calling:
from litellm import completion

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
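After executing the tool yourself, you send its result back in a second `completion` call so the model can answer in natural language. A minimal sketch of building that second-turn message list (the tool-call id and weather result here are illustrative stand-ins for what the API and your tool actually return):

```python
import json

def append_tool_result(messages, tool_call, result):
    """Append the assistant's tool call and the tool's result
    in the format OpenAI-style chat APIs expect."""
    messages.append({
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": tool_call["id"],
            "type": "function",
            "function": tool_call["function"],
        }],
    })
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    })
    return messages

# Illustrative tool call, shaped like response.choices[0].message.tool_calls[0]
tool_call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "get_current_weather",
                 "arguments": '{"location": "Boston, MA"}'},
}
args = json.loads(tool_call["function"]["arguments"])

messages = [{"role": "user", "content": "What's the weather in Boston?"}]
messages = append_tool_result(messages, tool_call,
                              {"location": args["location"], "temp_f": 58})

# Second turn: the model now answers using the tool result
# response = completion(model="openai/gpt-4o", messages=messages)
```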

Vision (Multimodal)

GPT-4o and GPT-4 Turbo support image inputs:
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)

JSON Mode

Force the model to return valid JSON. Note that OpenAI requires the word "JSON" to appear somewhere in the messages when using json_object mode:
response = completion(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract info as JSON: John is 30 years old and lives in NYC"
    }],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)

Advanced Features

Seed for Reproducibility

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    seed=123,  # Same seed + inputs = similar outputs
    temperature=0.7
)

Logprobs

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Say 'hello'"}],
    logprobs=True,
    top_logprobs=3  # Return top 3 token probabilities
)

for token in response.choices[0].logprobs.content:
    print(f"Token: {token.token}, Logprob: {token.logprob}")
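Logprobs are natural-log probabilities, so `math.exp` recovers the probability itself (the logprob value here is illustrative):

```python
import math

logprob = -0.105  # an illustrative value, as in token.logprob above
prob = math.exp(logprob)
print(f"{prob:.1%}")  # → 90.0%
```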

Max Tokens and Stop Sequences

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a story"}],
    max_tokens=500,  # Limit output length
    stop=["\n\n", "The End"]  # Stop at these sequences
)

Temperature and Top P

# More creative (temperature)
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.5  # Range 0-2; lower = more focused, higher = more random
)

# Nucleus sampling (top_p)
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Generate text"}],
    top_p=0.9  # Consider tokens in top 90% probability mass
)

Embeddings

Generate text embeddings for semantic search and clustering:
from litellm import embedding

# Single text
response = embedding(
    model="openai/text-embedding-3-large",
    input="Hello world"
)
print(response.data[0].embedding)  # List of floats

# Multiple texts
response = embedding(
    model="openai/text-embedding-3-small",
    input=["Text 1", "Text 2", "Text 3"]
)

for item in response.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")

# Specify dimensions (3-large and 3-small support this)
response = embedding(
    model="openai/text-embedding-3-large",
    input="Hello world",
    dimensions=256  # Reduce from default 3072
)
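With embeddings in hand, semantic search reduces to comparing vectors. A minimal sketch using cosine similarity — the short example vectors stand in for the full-length `response.data[i].embedding` lists:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embeddings returned by litellm.embedding()
query_vec = [0.1, 0.3, 0.5]
doc_vecs = {
    "doc_a": [0.1, 0.29, 0.52],  # points the same way as the query
    "doc_b": [-0.4, 0.1, -0.2],  # points away from the query
}

# Rank documents by similarity to the query
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked[0])  # → doc_a
```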

Available Embedding Models

| Model | Dimensions | Use Case |
|---|---|---|
| text-embedding-3-large | 3072 (default) | Best performance |
| text-embedding-3-small | 1536 (default) | Good balance |
| text-embedding-ada-002 | 1536 | Legacy model |

Batch Processing

Process large volumes of requests asynchronously:
from litellm import create_batch, retrieve_batch

# Create a batch job
batch = create_batch(
    custom_llm_provider="openai",
    input_file_id="file-abc123",  # Upload file first
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

# Check batch status
batch_status = retrieve_batch(
    custom_llm_provider="openai",
    batch_id=batch.id
)

print(f"Completed: {batch_status.request_counts.completed}")
print(f"Failed: {batch_status.request_counts.failed}")
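The `input_file_id` above refers to a JSONL file uploaded to OpenAI beforehand, where each line is one request in the Batch API format. A minimal sketch of building that file (the `custom_id`s, prompts, and filename are illustrative):

```python
import json

requests = [
    {"custom_id": "req-1", "prompt": "Summarize document A"},
    {"custom_id": "req-2", "prompt": "Summarize document B"},
]

# Each JSONL line pairs a custom_id with a full chat-completions request body
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        line = {
            "custom_id": r["custom_id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": r["prompt"]}],
            },
        }
        f.write(json.dumps(line) + "\n")
```

The `custom_id` on each line is how you match results back to requests, since batch output is not guaranteed to preserve order.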

Error Handling

from litellm import completion
from litellm.exceptions import (
    AuthenticationError,
    RateLimitError,
    ContextWindowExceededError,
    APIError
)

try:
    response = completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded - retry later")
except ContextWindowExceededError:
    print("Message too long - reduce input size")
except APIError as e:
    print(f"API error: {e}")

Cost Tracking

from litellm import completion, completion_cost

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Calculate cost
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

# Response includes token usage
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

Best Practices

Use GPT-4o-mini First

Start with gpt-4o-mini for testing - it’s fast and cost-effective. Upgrade to gpt-4o when you need maximum quality.

Set Max Tokens

Always set max_tokens to prevent unexpectedly long (and expensive) responses.

Use Streaming

Enable streaming for better user experience in interactive applications.

Handle Rate Limits

Implement exponential backoff when handling RateLimitError exceptions.
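A minimal retry sketch with exponential backoff and jitter; the retry counts and delays are illustrative, and the helper is generic so it works with any retryable exception:

```python
import random
import time

def retry_with_backoff(call, retry_on, max_retries=5, base_delay=1.0):
    """Call `call()`, retrying on `retry_on` exceptions with an
    exponentially growing, jittered delay between attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the error
            # Double the delay each attempt, with up to 2x random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

# Usage with LiteLLM (requires an API key):
# from litellm import completion
# from litellm.exceptions import RateLimitError
# response = retry_with_backoff(
#     lambda: completion(model="openai/gpt-4o",
#                        messages=[{"role": "user", "content": "Hello!"}]),
#     retry_on=RateLimitError,
# )
```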

Next Steps

Streaming: Learn more about streaming responses
Function Calling: Deep dive into function calling
Vision: Working with images and vision models
Embeddings: Guide to embeddings and semantic search
