Loom provides a unified LlmClient trait with implementations for multiple LLM providers. All providers support streaming, tool calling, and automatic retry with exponential backoff.

Supported Providers

Anthropic (Claude)

Claude 3.5 Sonnet, Opus, and Haiku via Messages API

OpenAI

GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo

Google Vertex AI

Gemini 1.5 Pro and Flash via Vertex AI

ZAI (智谱AI)

Chinese language models from ZhipuAI

Anthropic (Claude)

Loom’s Anthropic integration supports both API key and OAuth authentication, with account pooling for high-volume deployments.

Authentication

use loom_server_llm_anthropic::{AnthropicClient, AnthropicConfig};

let config = AnthropicConfig::new("sk-ant-api03-...");
let client = AnthropicClient::new(config)?;
Environment variable:
ANTHROPIC_API_KEY=sk-ant-api03-...

Account Pooling

For high-volume deployments, use AnthropicPool to manage multiple accounts with automatic failover:
use loom_server_llm_anthropic::{
    AnthropicPool, AnthropicPoolConfig, AccountSelectionStrategy
};

let config = AnthropicPoolConfig {
    accounts: vec![
        AnthropicConfig::new("sk-ant-api03-account1..."),
        AnthropicConfig::new("sk-ant-api03-account2..."),
        AnthropicConfig::new("sk-ant-api03-account3..."),
    ],
    strategy: AccountSelectionStrategy::RoundRobin,
    health_check_interval: Duration::from_secs(60),
};

let pool = AnthropicPool::new(config).await?;

// Use pool like a regular client
let response = pool.complete(request).await?;
Selection strategies:
  • Round-robin - Distributes requests evenly across all healthy accounts. Best for balanced load distribution.
  • Least-used - Routes to the account with the lowest recent usage. Best for quota management.
  • Failover - Uses a primary account until its quota is exhausted, then fails over to backup accounts.
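As a rough illustration of round-robin selection, the sketch below picks the next healthy account with a plain atomic counter. This is a hypothetical standalone example; the `RoundRobin` struct and `pick` method here are not part of the Loom API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector over a fixed set of accounts.
struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Pick the next healthy account index, skipping unhealthy ones.
    /// Returns None when no account is healthy.
    fn pick(&self, healthy: &[bool]) -> Option<usize> {
        let n = healthy.len();
        for _ in 0..n {
            let idx = self.next.fetch_add(1, Ordering::Relaxed) % n;
            if healthy[idx] {
                return Some(idx);
            }
        }
        None
    }
}

fn main() {
    let rr = RoundRobin::new();
    // Account 1 is unhealthy, so requests alternate between accounts 0 and 2.
    let healthy = [true, false, true];
    let picks: Vec<_> = (0..4).filter_map(|_| rr.pick(&healthy)).collect();
    println!("{:?}", picks); // [0, 2, 0, 2]
}
```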
Automatic failover behavior:
// loom-server-llm-anthropic/src/client.rs:72
fn classify_error(status: u16, message: &str) -> ClientErrorKind {
    if status == 401 || status == 403 {
        return ClientErrorKind::Permanent;  // Disable account
    }
    
    if status == 429 && is_quota_message(message) {
        return ClientErrorKind::QuotaExceeded;  // Failover to next account
    }
    
    if matches!(status, 408 | 429 | 500 | 502 | 503 | 504) {
        return ClientErrorKind::Transient;  // Retry on same account
    }
    
    ClientErrorKind::Permanent
}

// loom-server-llm-anthropic/src/client.rs:47
pub fn is_quota_message(msg: &str) -> bool {
    let lower = msg.to_ascii_lowercase();
    lower.contains("5-hour")
        || lower.contains("rolling window")
        || lower.contains("usage limit for your plan")
        || lower.contains("subscription usage limit")
}
Anthropic enforces a 5-hour rolling window for API usage. The pool automatically detects quota exhaustion errors and fails over to the next healthy account.

Health Monitoring

Monitor pool health via the status API:
let status = pool.get_status().await;

for (idx, account) in status.accounts.iter().enumerate() {
    println!("Account {}: {:?}", idx, account.health);
    println!("  Requests: {}", account.request_count);
    println!("  Errors: {}", account.error_count);
    println!("  Last error: {:?}", account.last_error);
}
Health statuses:
  • Healthy - Account is operational
  • QuotaExceeded - 5-hour quota exhausted, retrying after cooldown
  • Unhealthy - Permanent authentication failure, account disabled
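The statuses above imply a simple eligibility rule: healthy accounts are always selectable, quota-exhausted accounts become selectable again after their cooldown, and unhealthy accounts never are. The sketch below illustrates that rule; the `Health` enum fields and `eligible` method are hypothetical, not the pool's actual internals.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-account health state, mirroring the documented statuses.
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    QuotaExceeded { retry_at: Instant },
    Unhealthy,
}

impl Health {
    /// An account is eligible for selection when healthy, or when its
    /// quota cooldown has elapsed. Permanently failed accounts stay disabled.
    fn eligible(&self, now: Instant) -> bool {
        match self {
            Health::Healthy => true,
            Health::QuotaExceeded { retry_at } => now >= *retry_at,
            Health::Unhealthy => false,
        }
    }
}

fn main() {
    let now = Instant::now();
    let cooling = Health::QuotaExceeded { retry_at: now + Duration::from_secs(300) };
    println!("healthy eligible: {}", Health::Healthy.eligible(now));   // true
    println!("cooling eligible: {}", cooling.eligible(now));           // false
    println!("disabled eligible: {}", Health::Unhealthy.eligible(now)); // false
}
```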

Configuration

use loom_server_llm_anthropic::AnthropicConfig;

let config = AnthropicConfig::new("sk-ant-api03-...")
    .with_model("claude-3-5-sonnet-20241022")  // Default model
    .with_base_url("https://api.anthropic.com")  // Custom endpoint
    .with_max_tokens(4096);  // Default max_tokens

let client = AnthropicClient::new(config)?;
Environment variables:
ANTHROPIC_API_KEY=sk-ant-api03-...
ANTHROPIC_BASE_URL=https://api.anthropic.com  # Optional
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022     # Optional

OpenAI

OpenAI integration provides access to GPT models via the Chat Completions API.

Configuration

use loom_server_llm_openai::{OpenAIClient, OpenAIConfig};

let config = OpenAIConfig::new("sk-...")
    .with_model("gpt-4-turbo")  // or gpt-4, gpt-3.5-turbo
    .with_organization("org-...");  // Optional

let client = OpenAIClient::new(config)?;
Environment variables:
OPENAI_API_KEY=sk-...
OPENAI_ORGANIZATION=org-...  # Optional
OPENAI_MODEL=gpt-4-turbo     # Optional

Retry Configuration

All LLM clients support configurable retry with exponential backoff:
use loom_common_http::RetryConfig;
use std::time::Duration;

let retry_config = RetryConfig {
    max_attempts: 3,
    base_delay: Duration::from_millis(500),
    max_delay: Duration::from_secs(30),
    backoff_factor: 2.0,  // Exponential backoff: 500ms, 1s, 2s, ...
    jitter: true,  // Add randomness to prevent thundering herd
    retryable_statuses: vec![
        reqwest::StatusCode::TOO_MANY_REQUESTS,  // 429
        reqwest::StatusCode::REQUEST_TIMEOUT,     // 408
        reqwest::StatusCode::INTERNAL_SERVER_ERROR,  // 500
        reqwest::StatusCode::BAD_GATEWAY,         // 502
        reqwest::StatusCode::SERVICE_UNAVAILABLE, // 503
        reqwest::StatusCode::GATEWAY_TIMEOUT,     // 504
    ],
};

let client = OpenAIClient::new(config)?
    .with_retry_config(retry_config);
Implementation:
// loom-server-llm-openai/src/client.rs:160
let result = retry(&self.retry_config, || async {
    let req = self.build_request(&request, false);
    
    let response = req.send().await.map_err(|e| {
        if e.is_timeout() {
            OpenAIRequestError(LlmError::Timeout)
        } else {
            OpenAIRequestError(LlmError::Http(e.to_string()))
        }
    })?;
    
    if !response.status().is_success() {
        let error = self.handle_error_response(response).await;
        return Err(OpenAIRequestError(error));
    }
    
    let openai_response: OpenAIResponse = response.json().await
        .map_err(|e| OpenAIRequestError(LlmError::InvalidResponse(e.to_string())))?;
    
    Ok(LlmResponse::from(openai_response))
}).await;
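The delay schedule implied by the configuration above (500 ms base, factor 2.0, 30 s cap) can be sketched as a standalone computation. This is an illustration of exponential backoff with a cap, not the actual `loom_common_http` retry code, and it omits the jitter step.

```rust
use std::time::Duration;

/// Nominal delay before retry attempt `attempt` (0-based), before jitter:
/// base * factor^attempt, capped at `max`.
fn backoff_delay(base: Duration, factor: f64, max: Duration, attempt: u32) -> Duration {
    let nominal = base.mul_f64(factor.powi(attempt as i32));
    nominal.min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..8 {
        // 500ms, 1s, 2s, 4s, 8s, 16s, then capped at 30s
        println!("attempt {}: {:?}", attempt, backoff_delay(base, 2.0, max, attempt));
    }
}
```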

Rate Limiting

OpenAI returns rate limit information in headers:
// loom-server-llm-openai/src/client.rs:117
if status_code == 429 {
    let retry_after = response
        .headers()
        .get("retry-after")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse().ok());
    
    return LlmError::RateLimited {
        retry_after_secs: retry_after,
    };
}

Google Vertex AI (Gemini)

Vertex AI provides access to Google’s Gemini models via GCP.

Authentication

Vertex AI uses Application Default Credentials (ADC):
1. Set up credentials

Choose one of the following methods:

Service Account (Production):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Default Service Account (GCE/GKE):
# Automatically uses the compute engine default service account
# No environment variables needed

User Credentials (Development):
gcloud auth application-default login

2. Configure client

use loom_server_llm_vertex::{VertexClient, VertexConfig};

let config = VertexConfig::new("my-gcp-project", "us-central1")
    .with_model("gemini-1.5-pro");

let client = VertexClient::new(config)?;

Token Caching

Vertex AI automatically caches access tokens to reduce auth overhead:
// loom-server-llm-vertex/src/client.rs:134
async fn get_access_token(&self) -> Result<String, ClientError> {
    // Check cache first (60s buffer before expiry)
    {
        let cached = self.cached_token.read().await;
        if let Some(ref token) = *cached {
            if token.expires_at > std::time::Instant::now() + Duration::from_secs(60) {
                return Ok(token.token.clone());
            }
        }
    }
    
    // Initialize auth provider lazily
    let mut provider_guard = self.auth_provider.write().await;
    if provider_guard.is_none() {
        let provider = gcp_auth::provider().await?;
        *provider_guard = Some(provider);
    }
    
    // Get fresh token
    let provider = provider_guard.as_ref().unwrap();
    let scopes = &["https://www.googleapis.com/auth/cloud-platform"];
    let token = provider.token(scopes).await?;
    let token_str = token.as_str().to_string();
    
    // Cache for ~1 hour
    let mut cached = self.cached_token.write().await;
    *cached = Some(CachedToken {
        token: token_str.clone(),
        expires_at: std::time::Instant::now() + Duration::from_secs(3500),
    });
    
    Ok(token_str)
}

Available Models

gemini-1.5-pro

Best for: Complex reasoning, long context (1M tokens)
Flagship model with advanced reasoning capabilities

gemini-1.5-flash

Best for: Fast responses, high throughput
Optimized for speed and efficiency

gemini-1.0-pro

Best for: Production workloads, stable API
Previous generation, highly reliable
Regional endpoints:
let config = VertexConfig::new("my-project", "us-central1");  // US
let config = VertexConfig::new("my-project", "europe-west1");  // Europe
let config = VertexConfig::new("my-project", "asia-southeast1");  // Asia

ZAI (智谱AI)

ZAI provides Chinese language models from ZhipuAI, compatible with OpenAI’s API format.

Configuration

use loom_server_llm_zai::{ZaiClient, ZaiConfig};

let config = ZaiConfig::new("...")  // API key from ZhipuAI
    .with_model("glm-4");  // or glm-4-plus, glm-3-turbo

let client = ZaiClient::new(config)?;
Environment variables:
ZAI_API_KEY=...
ZAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4  # Default
ZAI_MODEL=glm-4  # Optional

Available Models

  • glm-4-plus - Most capable model, best for complex tasks
  • glm-4 - Balanced performance and cost
  • glm-3-turbo - Fast responses, cost-effective
ZAI uses an OpenAI-compatible API, so the client implementation is nearly identical to OpenAIClient with ZAI-specific endpoints.

Unified LlmClient Interface

All providers implement the same LlmClient trait for consistency:
use loom_common_core::{LlmClient, LlmRequest, Message};

#[async_trait]
pub trait LlmClient: Send + Sync {
    /// Perform a non-streaming completion.
    async fn complete(&self, request: LlmRequest) -> Result<LlmResponse, LlmError>;
    
    /// Perform a streaming completion.
    async fn stream(&self, request: LlmRequest) -> Result<LlmStream, LlmError>;
}

Making Requests

use loom_common_core::{LlmRequest, Message, Tool};

// Build request
let request = LlmRequest::new("claude-3-5-sonnet-20241022")
    .with_messages(vec![
        Message::system("You are a helpful coding assistant."),
        Message::user("Write a Rust function to reverse a string."),
    ])
    .with_max_tokens(4096)
    .with_temperature(0.7)
    .with_tools(vec![
        Tool::new(
            "search_code",
            "Search for code examples",
            json!({
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            })
        )
    ]);

// Execute
let response = client.complete(request).await?;

println!("Response: {}", response.message.content);
for tool_call in response.tool_calls {
    println!("Tool: {} with args: {}", tool_call.tool_name, tool_call.arguments_json);
}

Streaming Responses

use futures::StreamExt;

let mut stream = client.stream(request).await?;

while let Some(event) = stream.next().await {
    match event? {
        LlmEvent::ContentDelta(text) => {
            print!("{}", text);
        }
        LlmEvent::ToolCall(tool_call) => {
            println!("\nCalling tool: {}", tool_call.tool_name);
        }
        LlmEvent::Done { usage } => {
            println!("\nTokens: {} in, {} out", usage.input_tokens, usage.output_tokens);
            break;
        }
        LlmEvent::Error(error) => {
            eprintln!("Stream error: {}", error);
            break;
        }
    }
}

Error Handling

use loom_common_core::LlmError;

match client.complete(request).await {
    Ok(response) => { /* ... */ }
    Err(LlmError::RateLimited { retry_after_secs }) => {
        println!("Rate limited, retry after {:?} seconds", retry_after_secs);
    }
    Err(LlmError::Timeout) => {
        println!("Request timed out");
    }
    Err(LlmError::Api(message)) => {
        println!("API error: {}", message);
    }
    Err(LlmError::Http(error)) => {
        println!("HTTP error: {}", error);
    }
    Err(LlmError::InvalidResponse(error)) => {
        println!("Invalid response: {}", error);
    }
}

Usage Tracking

All providers return token usage information:
let response = client.complete(request).await?;

if let Some(usage) = response.usage {
    println!("Input tokens: {}", usage.input_tokens);
    println!("Output tokens: {}", usage.output_tokens);
    println!("Total tokens: {}", usage.input_tokens + usage.output_tokens);
}

Best Practices

Use Streaming

Stream responses for better UX. Users see output immediately instead of waiting for the entire response.

Set Timeouts

Configure appropriate timeouts (default: 5 minutes). Long-running requests should use streaming to avoid timeouts.

Handle Rate Limits

Respect retry-after headers and implement exponential backoff. Use account pooling for high-volume workloads.

Monitor Usage

Track token usage to optimize costs. Consider caching responses for repeated queries.
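As a rough illustration of the caching suggestion, a minimal in-memory cache keyed by (model, prompt) might look like the sketch below. This is hypothetical and not part of Loom; production code would also bound the cache size and expire entries.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory response cache keyed by (model, prompt).
struct ResponseCache {
    entries: HashMap<(String, String), String>,
}

impl ResponseCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached response, or compute it once (e.g. by calling the
    /// LLM) and cache it for repeated queries.
    fn get_or_insert_with<F: FnOnce() -> String>(&mut self, model: &str, prompt: &str, f: F) -> String {
        self.entries
            .entry((model.to_string(), prompt.to_string()))
            .or_insert_with(f)
            .clone()
    }
}

fn main() {
    let mut cache = ResponseCache::new();
    let mut llm_calls = 0;
    for _ in 0..3 {
        let _ = cache.get_or_insert_with("glm-4", "hello", || {
            llm_calls += 1; // only the first lookup misses
            "hi".to_string()
        });
    }
    println!("LLM called {} time(s)", llm_calls); // 1
}
```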
For implementation details, see the source in crates/loom-server-llm-{anthropic,openai,vertex,zai}/.
