Model clients provide the interface between AutoGen agents and large language models (LLMs). AutoGen supports multiple LLM providers through the `autogen-ext` package.
## Installation

Install the extension for your chosen provider:

```bash
pip install "autogen-ext[openai]"      # OpenAI and Azure OpenAI
pip install "autogen-ext[anthropic]"   # Anthropic
pip install "autogen-ext[azure]"       # Azure AI
pip install "autogen-ext[ollama]"      # Ollama
pip install "autogen-ext[llama-cpp]"   # Llama.cpp
```
## OpenAI

The `OpenAIChatCompletionClient` supports GPT-4, GPT-3.5, o1, and o3 models.
### Basic Usage

```python
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Create OpenAI client
model_client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",  # Or set the OPENAI_API_KEY environment variable
)

# Use with an agent
agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    system_message="You are a helpful assistant.",
)
```
### Configuration Options

- `model`: The model name (e.g., `gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`)
- `api_key`: OpenAI API key. If not provided, reads from the `OPENAI_API_KEY` environment variable
- `temperature`: Sampling temperature between 0 and 2
- `top_p`: Nucleus sampling parameter
- `max_tokens`: Maximum tokens to generate
- `timeout`: Request timeout in seconds
- `base_url`: Override the default OpenAI API endpoint
### Advanced Example

```python
from autogen_ext.models.openai import OpenAIChatCompletionClient

client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",
    temperature=0.7,
    top_p=0.9,
    max_tokens=4096,
    timeout=120.0,
    # For OpenAI-compatible endpoints; for Azure OpenAI Service,
    # use AzureOpenAIChatCompletionClient instead (see below)
    base_url="https://custom-endpoint.example.com/v1",
)
```
## Azure OpenAI

The `AzureOpenAIChatCompletionClient` connects to Azure OpenAI Service.
### Basic Usage

```python
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    api_key="...",  # Or use Azure AD authentication
    azure_deployment="gpt-4o-deployment",  # Your deployment name
)
```
### Configuration Options

- `azure_endpoint`: The Azure OpenAI endpoint URL
- `api_version`: Azure OpenAI API version (e.g., `2024-02-01`)
- `azure_deployment`: Your deployment name in Azure
- `azure_ad_token`: Azure Active Directory token for authentication
### Azure AD Authentication

```python
from azure.identity import DefaultAzureCredential
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

# Using Azure AD authentication
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    azure_ad_token=token.token,
    azure_deployment="gpt-4o-deployment",
)
```
## Anthropic

The `AnthropicChatCompletionClient` supports Claude models.
### Basic Usage

```python
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-...",  # Or set ANTHROPIC_API_KEY
    max_tokens=4096,
)
```
### Configuration Options

- `model`: Claude model name:
  - `claude-3-5-sonnet-20241022`: Most capable
  - `claude-3-opus-20240229`: Previous flagship
  - `claude-3-sonnet-20240229`: Balanced
  - `claude-3-haiku-20240307`: Fast and compact
- `api_key`: Anthropic API key. Falls back to the `ANTHROPIC_API_KEY` environment variable
- `max_tokens`: Maximum tokens to generate. Required for Anthropic models
- `temperature`: Sampling temperature between 0 and 1
- `top_p`: Nucleus sampling parameter
- `top_k`: Only sample from the top K options
### Extended Thinking (Claude 3.7 Sonnet)

Claude 3.7 Sonnet and later models support extended thinking mode:

```python
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-7-sonnet-20250219",
    api_key="sk-ant-...",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Tokens reserved for thinking
    },
)
```
### AWS Bedrock

Use Claude models through AWS Bedrock:

```python
from autogen_ext.models.anthropic import AnthropicBedrockChatCompletionClient

client = AnthropicBedrockChatCompletionClient(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    max_tokens=4096,
    # AWS credentials from environment or ~/.aws/credentials
    aws_region="us-west-2",
)
```

- `aws_region`: AWS region (e.g., `us-west-2`, `us-east-1`)
- `aws_session_token`: AWS session token for temporary credentials
## Ollama

The `OllamaChatCompletionClient` connects to local Ollama instances.
### Basic Usage

```python
from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
)
```
### Configuration Options

- `model`: Ollama model name (e.g., `llama3.2`, `mistral`, `qwen2.5`)
- `host`: Ollama server URL (default: `http://localhost:11434`)
- `top_p`: Nucleus sampling parameter
- `max_tokens`: Maximum tokens to generate
### Advanced Configuration

```python
from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    num_ctx=8192,       # Context window
    num_predict=2048,   # Max generation length
    repeat_penalty=1.1,
    seed=42,            # For reproducibility
)
```
## Llama.cpp

Run GGUF models locally with llama.cpp:

### Installation

```bash
pip install "autogen-ext[llama-cpp]"
```

### Basic Usage

```python
from autogen_ext.models.llama_cpp import LlamaCppChatCompletionClient

client = LlamaCppChatCompletionClient(
    model_path="./models/llama-3.2-3b-instruct-q8_0.gguf",
    n_ctx=8192,       # Context window
    n_gpu_layers=35,  # Offload layers to GPU
)
```
### Configuration Options

- `model_path`: Path to the GGUF model file
- `n_gpu_layers`: Number of layers to offload to GPU
- `max_tokens`: Maximum tokens to generate
## Azure AI

Connect to Azure AI model deployments:

```python
from azure.core.credentials import AzureKeyCredential
from autogen_ext.models.azure import AzureAIChatCompletionClient

client = AzureAIChatCompletionClient(
    endpoint="https://YOUR-ENDPOINT.inference.ai.azure.com",
    credential=AzureKeyCredential("YOUR-API-KEY"),
    model="gpt-4o",
)
```
## Streaming Responses

All model clients support streaming:

```python
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

async def stream_example(client):
    messages = [UserMessage(content="Tell me a story", source="user")]
    # create_stream yields string chunks during generation,
    # followed by a final CreateResult
    async for chunk in client.create_stream(
        messages, cancellation_token=CancellationToken()
    ):
        if isinstance(chunk, str):
            print(chunk, end="", flush=True)
```
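To reassemble streamed chunks into a complete response, accumulate the string pieces as they arrive. The helper below is an illustrative sketch, not part of the AutoGen API; the stub async generator stands in for a real client's `create_stream` call:

```python
import asyncio

async def collect_stream(stream):
    """Accumulate text chunks from a create_stream-style iterator.

    Only string chunks are kept; the final result object a real
    client yields is skipped by the isinstance check.
    """
    parts = []
    async for chunk in stream:
        if isinstance(chunk, str):
            parts.append(chunk)
    return "".join(parts)

# Stub stream standing in for client.create_stream(...)
async def fake_stream():
    for piece in ["Once ", "upon ", "a ", "time."]:
        yield piece

print(asyncio.run(collect_stream(fake_stream())))  # Once upon a time.
```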
## Model Capabilities

Query model capabilities:

```python
capabilities = client.capabilities
print(f"Vision: {capabilities.vision}")
print(f"Function calling: {capabilities.function_calling}")
print(f"JSON output: {capabilities.json_output}")
```
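Capability flags are useful for deciding up front which features an agent may rely on. The sketch below uses a hypothetical `Capabilities` dataclass as a stand-in for the client's capabilities object:

```python
from dataclasses import dataclass

@dataclass
class Capabilities:
    """Stand-in for a model client's capabilities object."""
    vision: bool
    function_calling: bool
    json_output: bool

def supported_features(caps):
    """Return the names of supported features, e.g. to gate agent setup."""
    flags = {
        "vision": caps.vision,
        "function_calling": caps.function_calling,
        "json_output": caps.json_output,
    }
    return [name for name, enabled in flags.items() if enabled]

caps = Capabilities(vision=False, function_calling=True, json_output=True)
print(supported_features(caps))  # ['function_calling', 'json_output']
```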
## Token Counting

Count tokens before sending requests:

```python
from autogen_core.models import UserMessage

messages = [UserMessage(content="Hello, world!", source="user")]
token_count = client.count_tokens(messages)
print(f"Message uses {token_count} tokens")
```
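A common use of the count is verifying that the prompt plus the requested completion fits within the model's context window before sending. `fits_context` below is an illustrative helper, not an AutoGen API:

```python
def fits_context(prompt_tokens, max_completion_tokens, context_window):
    """Check whether a request fits a model's context window.

    Both the prompt and the requested completion consume context,
    so their sum must stay within the window.
    """
    return prompt_tokens + max_completion_tokens <= context_window

# e.g. a 120k-token prompt plus 4096 completion tokens in a 128k window
print(fits_context(120_000, 4_096, 128_000))  # True
print(fits_context(125_000, 4_096, 128_000))  # False
```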
## Usage Tracking

Track token usage from responses:

```python
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

messages = [UserMessage(content="Explain quantum computing", source="user")]
result = await client.create(messages, cancellation_token=CancellationToken())
print(f"Prompt tokens: {result.usage.prompt_tokens}")
print(f"Completion tokens: {result.usage.completion_tokens}")
```
## Error Handling

Handle common errors:

```python
import asyncio

from openai import APIError, RateLimitError

from autogen_core import CancellationToken

async def create_with_retry(client, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await client.create(
                messages, cancellation_token=CancellationToken()
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
```
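The same backoff pattern generalizes to any async call. The sketch below is a provider-agnostic retry helper with jitter; `with_retries` and the stub `flaky` coroutine are hypothetical names for illustration:

```python
import asyncio
import random

async def with_retries(func, max_retries=3, base_delay=0.01):
    """Retry an async callable with exponential backoff plus jitter.

    Any exception triggers a retry until attempts run out; the last
    failure is re-raised to the caller.
    """
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Back off base_delay * 2**attempt, plus up to 50% jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay * (1 + random.random() / 2))

# Stub call that fails twice, then succeeds
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(asyncio.run(with_retries(flaky)))  # ok
```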
## Environment Variables

Model clients respect standard environment variables:

```bash
# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_ORG_ID="org-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Azure OpenAI
export AZURE_OPENAI_ENDPOINT="https://..."
export AZURE_OPENAI_API_KEY="..."

# AWS (for Bedrock)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-west-2"
```
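The resolution order most clients follow (explicit argument first, then the environment) can be sketched as a small helper; `resolve_api_key` below is illustrative, not part of `autogen-ext`:

```python
import os

def resolve_api_key(explicit=None, env_var="OPENAI_API_KEY"):
    """Resolve an API key: explicit argument wins, then the
    environment variable, else fail loudly."""
    key = explicit or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key: pass api_key or set {env_var}")
    return key

os.environ["OPENAI_API_KEY"] = "sk-test"
print(resolve_api_key())               # sk-test
print(resolve_api_key("sk-override"))  # sk-override
```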
## Best Practices

### Use Environment Variables

Store API keys in environment variables instead of hardcoding them:

```python
import os

from autogen_ext.models.openai import OpenAIChatCompletionClient

# Good: reads from the environment explicitly
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Better: picked up from the environment automatically
client = OpenAIChatCompletionClient(model="gpt-4o")
```
### Set Timeouts

Always configure appropriate timeouts:

```python
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    timeout=120.0,  # 2-minute timeout
)
```
### Monitor Usage

Track token usage to manage costs:

```python
total_prompt_tokens = 0
total_completion_tokens = 0

result = await client.create(messages, cancellation_token=CancellationToken())
total_prompt_tokens += result.usage.prompt_tokens
total_completion_tokens += result.usage.completion_tokens

print(f"Total usage: {total_prompt_tokens + total_completion_tokens} tokens")
```
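Token totals translate directly into spend. The sketch below estimates request cost from token counts; the per-million-token prices are placeholder assumptions, so substitute your provider's current rates:

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_price_per_m=2.50, completion_price_per_m=10.00):
    """Estimate request cost in dollars from token counts.

    Prices are per million tokens and are placeholders only;
    check your provider's current pricing page.
    """
    return (prompt_tokens * prompt_price_per_m
            + completion_tokens * completion_price_per_m) / 1_000_000

print(f"${estimate_cost(1_000, 500):.4f}")  # $0.0075
```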
## Next Steps

- **Code Executors**: Set up code execution environments
- **Tools**: Add tools and capabilities to agents