Model clients provide the interface between AutoGen agents and large language models. AutoGen supports multiple LLM providers through the autogen-ext package.

Installation

Install the extension for your chosen provider:
pip install "autogen-ext[openai]"

OpenAI

The OpenAIChatCompletionClient supports GPT-4, GPT-3.5, o1, and o3 models.

Basic Usage

from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.agents import AssistantAgent

# Create OpenAI client
model_client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",  # Or set OPENAI_API_KEY environment variable
)

# Use with an agent
agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    system_message="You are a helpful assistant."
)

Configuration Options

model
string
required
The model name (e.g., gpt-4o, gpt-4-turbo, gpt-3.5-turbo)
api_key
string
OpenAI API key. If not provided, reads from OPENAI_API_KEY environment variable
temperature
float
default:"1.0"
Sampling temperature between 0 and 2
top_p
float
default:"1.0"
Nucleus sampling parameter
max_tokens
int
Maximum tokens to generate
timeout
float
default:"60.0"
Request timeout in seconds
base_url
string
Override the default OpenAI API endpoint

Advanced Example

from autogen_ext.models.openai import OpenAIChatCompletionClient

client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",
    temperature=0.7,
    top_p=0.9,
    max_tokens=4096,
    timeout=120.0,
    # For OpenAI-compatible endpoints (for Azure OpenAI, use AzureOpenAIChatCompletionClient)
    base_url="https://custom-endpoint.example.com/v1",
)

Azure OpenAI

The AzureOpenAIChatCompletionClient connects to Azure OpenAI Service.

Basic Usage

from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    api_key="...",  # Or use Azure AD authentication
    azure_deployment="gpt-4o-deployment",  # Your deployment name
)

Configuration Options

azure_endpoint
string
required
The Azure OpenAI endpoint URL
api_version
string
required
Azure OpenAI API version (e.g., 2024-02-01)
azure_deployment
string
required
Your deployment name in Azure
api_key
string
Azure OpenAI API key
azure_ad_token
string
Azure Active Directory token for authentication

Azure AD Authentication

from azure.identity import DefaultAzureCredential
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

# Using Azure AD authentication
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    azure_ad_token=token.token,
    azure_deployment="gpt-4o-deployment",
)

Anthropic

The AnthropicChatCompletionClient supports Claude models.

Basic Usage

from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-...",  # Or set ANTHROPIC_API_KEY
    max_tokens=4096,
)

Configuration Options

model
string
required
Claude model name:
  • claude-3-5-sonnet-20241022 - Most capable
  • claude-3-opus-20240229 - Previous flagship
  • claude-3-sonnet-20240229 - Balanced
  • claude-3-haiku-20240307 - Fast and compact
api_key
string
Anthropic API key. Falls back to ANTHROPIC_API_KEY environment variable
max_tokens
int
required
Maximum tokens to generate. Required for Anthropic models
temperature
float
default:"1.0"
Sampling temperature between 0 and 1
top_p
float
Nucleus sampling parameter
top_k
int
Only sample from top K options

Extended Thinking (Claude 3.7 Sonnet)

Claude 3.7 Sonnet and later models support extended thinking mode:
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-7-sonnet-20250219",
    api_key="sk-ant-...",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Tokens for thinking
    },
)

AWS Bedrock

Use Claude models through AWS Bedrock:
from autogen_ext.models.anthropic import AnthropicBedrockChatCompletionClient

client = AnthropicBedrockChatCompletionClient(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    max_tokens=4096,
    # AWS credentials from environment or ~/.aws/credentials
    aws_region="us-west-2",
)

Configuration Options

aws_region
string
AWS region (e.g., us-west-2, us-east-1)
aws_access_key
string
AWS access key ID
aws_secret_key
string
AWS secret access key
aws_session_token
string
AWS session token for temporary credentials

Ollama

The OllamaChatCompletionClient connects to local Ollama instances.

Basic Usage

from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
)

Configuration Options

model
string
required
Ollama model name (e.g., llama3.2, mistral, qwen2.5)
host
string
default:"http://localhost:11434"
Ollama server URL
temperature
float
Sampling temperature
top_p
float
Nucleus sampling parameter
top_k
int
Top-K sampling parameter
num_ctx
int
Context window size
num_predict
int
Maximum tokens to generate

Advanced Configuration

from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    num_ctx=8192,  # Context window
    num_predict=2048,  # Max generation
    repeat_penalty=1.1,
    seed=42,  # For reproducibility
)

Llama.cpp

Run GGUF models locally with llama.cpp:

Installation

pip install "autogen-ext[llama-cpp]"

Basic Usage

from autogen_ext.models.llama_cpp import LlamaCppChatCompletionClient

client = LlamaCppChatCompletionClient(
    model_path="./models/llama-3.2-3b-instruct-q8_0.gguf",
    n_ctx=8192,  # Context window
    n_gpu_layers=35,  # Offload layers to GPU
)

Configuration Options

model_path
string
required
Path to the GGUF model file
n_ctx
int
default:"2048"
Context window size
n_gpu_layers
int
default:"0"
Number of layers to offload to GPU
temperature
float
default:"0.8"
Sampling temperature
top_p
float
default:"0.95"
Nucleus sampling
top_k
int
default:"40"
Top-K sampling
max_tokens
int
default:"512"
Maximum tokens to generate

Azure AI

Connect to Azure AI model deployments:
from azure.core.credentials import AzureKeyCredential
from autogen_ext.models.azure import AzureAIChatCompletionClient

client = AzureAIChatCompletionClient(
    endpoint="https://YOUR-ENDPOINT.inference.ai.azure.com",
    credential=AzureKeyCredential("YOUR-API-KEY"),
    model="gpt-4o",
)

Streaming Responses

All model clients support streaming:
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

async def stream_example(client):
    messages = [UserMessage(content="Tell me a story", source="user")]
    
    # create_stream yields str chunks, then a final CreateResult
    async for chunk in client.create_stream(messages, cancellation_token=CancellationToken()):
        if isinstance(chunk, str):
            print(chunk, end="", flush=True)
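
The same consumption pattern works against any async iterator of chunks. A standalone sketch, with fake_stream as a stand-in for a real client stream (real streams also yield a final result object, which the isinstance check skips):

```python
import asyncio

# Stand-in async stream: yields text chunks, mimicking the shape of
# streamed model output.
async def fake_stream():
    for piece in ["Once ", "upon ", "a time."]:
        yield piece

async def collect(stream) -> str:
    # Accumulate string chunks into the full response text.
    parts = []
    async for chunk in stream:
        if isinstance(chunk, str):
            parts.append(chunk)
    return "".join(parts)

full_text = asyncio.run(collect(fake_stream()))
print(full_text)  # Once upon a time.
```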

Model Capabilities

Query model capabilities via the client's model_info mapping:
info = client.model_info

print(f"Vision: {info['vision']}")
print(f"Function calling: {info['function_calling']}")
print(f"JSON output: {info['json_output']}")
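
Capability flags are typically used to gate features at runtime. A minimal sketch using a plain dict shaped like the capability info above; the helper is illustrative, not an AutoGen API:

```python
# Illustrative helper: decide whether a message may include images,
# given a capability-info style mapping.
def supports_images(model_info: dict) -> bool:
    return bool(model_info.get("vision", False))

vision_model = {"vision": True, "function_calling": True, "json_output": True}
text_only_model = {"vision": False, "function_calling": True, "json_output": False}

print(supports_images(vision_model))     # True
print(supports_images(text_only_model))  # False
```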

Token Counting

Count tokens before sending requests:
from autogen_core.models import UserMessage

messages = [UserMessage(content="Hello, world!", source="user")]
token_count = client.count_tokens(messages)
print(f"Message uses {token_count} tokens")
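
Token counts are commonly used to keep a conversation inside a context budget. A standalone sketch with a pluggable counter standing in for client.count_tokens; the helper and the character-based counter are illustrative, not part of AutoGen:

```python
# Illustrative helper: drop the oldest messages until the conversation
# fits a token budget. count_fn stands in for a client's count_tokens.
def trim_to_budget(messages, count_fn, budget):
    trimmed = list(messages)
    while trimmed and count_fn(trimmed) > budget:
        trimmed.pop(0)  # drop the oldest message first
    return trimmed

# Stand-in counter: 1 "token" per character, summed over all messages.
fake_count = lambda msgs: sum(len(m) for m in msgs)

history = ["hello there", "how are you", "fine"]
kept = trim_to_budget(history, fake_count, budget=16)
print(kept)  # ['how are you', 'fine']
```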

Usage Tracking

Track token usage from responses:
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

messages = [UserMessage(content="Explain quantum computing", source="user")]
result = await client.create(messages, cancellation_token=CancellationToken())

print(f"Prompt tokens: {result.usage.prompt_tokens}")
print(f"Completion tokens: {result.usage.completion_tokens}")

Error Handling

Handle common errors:
import asyncio

from autogen_core import CancellationToken
from openai import APIError, RateLimitError

async def create_with_retry(client, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await client.create(messages, cancellation_token=CancellationToken())
        except RateLimitError:
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
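
The retry loop above sleeps 2 ** attempt seconds between attempts. That schedule can be factored out into a small helper; the cap parameter is an addition for illustration, not part of the example above:

```python
# Exponential backoff schedule: 1s, 2s, 4s, 8s, ... capped at `cap` seconds.
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    return min(base ** attempt, cap)

print([backoff_delay(a) for a in range(5)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
```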

Environment Variables

Model clients respect standard environment variables:
# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_ORG_ID="org-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Azure OpenAI
export AZURE_OPENAI_ENDPOINT="https://..."
export AZURE_OPENAI_API_KEY="..."

# AWS (for Bedrock)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-west-2"
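
Failing fast when a required variable is missing avoids confusing authentication errors deeper in a run. An illustrative helper, not part of AutoGen (the demo key name and value are placeholders):

```python
import os

# Illustrative helper: raise a clear error when a required variable is unset.
def require_env(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

os.environ.setdefault("AUTOGEN_DOCS_DEMO_KEY", "demo-value")  # demo only
print(require_env("AUTOGEN_DOCS_DEMO_KEY"))
```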

Best Practices

Use Environment Variables

Store API keys in environment variables instead of hardcoding:
import os
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Good: reads from environment
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Better: automatic from environment
client = OpenAIChatCompletionClient(model="gpt-4o")

Set Timeouts

Always configure appropriate timeouts:
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    timeout=120.0,  # 2 minute timeout
)
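
The client-level timeout can be paired with a per-call deadline on the awaiting side. A standalone sketch using asyncio.wait_for, with slow_call standing in for a model request; returning None on timeout is one possible policy, not the only one:

```python
import asyncio

# Stand-in for a model request that takes a while to complete.
async def slow_call() -> str:
    await asyncio.sleep(0.2)
    return "done"

async def call_with_deadline(coro, seconds: float):
    # Enforce a per-call deadline regardless of client configuration.
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None  # caller decides how to degrade

result = asyncio.run(call_with_deadline(slow_call(), seconds=1.0))
print(result)  # done
timed_out = asyncio.run(call_with_deadline(slow_call(), seconds=0.01))
print(timed_out)  # None
```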

Monitor Usage

Track token usage to manage costs:
total_prompt_tokens = 0
total_completion_tokens = 0

result = await client.create(messages, cancellation_token=CancellationToken())
total_prompt_tokens += result.usage.prompt_tokens
total_completion_tokens += result.usage.completion_tokens

print(f"Total usage: {total_prompt_tokens + total_completion_tokens} tokens")
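
Accumulating usage across many calls is easier with a small helper. An illustrative sketch, not part of AutoGen; feed it the usage fields shown above after each call:

```python
# Illustrative accumulator for summing token usage across multiple calls.
class UsageTracker:
    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def add(self, prompt: int, completion: int) -> None:
        # e.g. result.usage.prompt_tokens, result.usage.completion_tokens
        self.prompt_tokens += prompt
        self.completion_tokens += completion

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens

tracker = UsageTracker()
tracker.add(120, 80)
tracker.add(200, 150)
print(tracker.total)  # 550
```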

Next Steps

Code Executors

Set up code execution environments

Tools

Add tools and capabilities to agents
