Loom provides a unified LlmClient trait with implementations for multiple LLM providers. All providers support streaming, tool calling, and automatic retry with exponential backoff.

Supported Providers

Anthropic (Claude)

Claude 3.5 Sonnet, Opus, and Haiku via Messages API

OpenAI

GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo

Google Vertex AI

Gemini 1.5 Pro and Flash via Vertex AI

ZAI (智谱AI)

Chinese language models from ZhipuAI

Anthropic (Claude)

Loom’s Anthropic integration supports both API key and OAuth authentication, with account pooling for high-volume deployments.

Authentication

use loom_server_llm_anthropic::{AnthropicClient, AnthropicConfig};

let config = AnthropicConfig::new("sk-ant-api03-...");
let client = AnthropicClient::new(config)?;
Environment variable:
ANTHROPIC_API_KEY=sk-ant-api03-...

Account Pooling

For high-volume deployments, use AnthropicPool to manage multiple accounts with automatic failover:
use loom_server_llm_anthropic::{
    AnthropicPool, AnthropicPoolConfig, AccountSelectionStrategy
};

let config = AnthropicPoolConfig {
    accounts: vec![
        AnthropicConfig::new("sk-ant-api03-account1..."),
        AnthropicConfig::new("sk-ant-api03-account2..."),
        AnthropicConfig::new("sk-ant-api03-account3..."),
    ],
    strategy: AccountSelectionStrategy::RoundRobin,
    health_check_interval: Duration::from_secs(60),
};

let pool = AnthropicPool::new(config).await?;

// Use pool like a regular client
let response = pool.complete(request).await?;
Selection strategies:
  • Round-robin - Distributes requests evenly across all healthy accounts. Best for balanced load distribution.
  • Least-used - Routes to the account with the lowest recent usage. Best for quota management.
  • Failover - Uses a primary account until its quota is exhausted, then fails over to backup accounts.
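As a rough illustration of round-robin selection, the sketch below picks the next healthy account with a plain atomic counter. This is a hypothetical standalone example; the `RoundRobin` struct and `pick` method here are not part of the Loom API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector over a fixed set of accounts.
struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Pick the next healthy account index, skipping unhealthy ones.
    /// Returns None when no account is healthy.
    fn pick(&self, healthy: &[bool]) -> Option<usize> {
        let n = healthy.len();
        for _ in 0..n {
            let idx = self.next.fetch_add(1, Ordering::Relaxed) % n;
            if healthy[idx] {
                return Some(idx);
            }
        }
        None
    }
}

fn main() {
    let rr = RoundRobin::new();
    // Account 1 is unhealthy, so requests alternate between accounts 0 and 2.
    let healthy = [true, false, true];
    let picks: Vec<_> = (0..4).filter_map(|_| rr.pick(&healthy)).collect();
    println!("{:?}", picks); // [0, 2, 0, 2]
}
```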
Automatic failover behavior:
// loom-server-llm-anthropic/src/client.rs:72
fn classify_error(status: u16, message: &str) -> ClientErrorKind {
    if status == 401 || status == 403 {
        return ClientErrorKind::Permanent;  // Disable account
    }
    
    if status == 429 && is_quota_message(message) {
        return ClientErrorKind::QuotaExceeded;  // Failover to next account
    }
    
    if matches!(status, 408 | 429 | 500 | 502 | 503 | 504) {
        return ClientErrorKind::Transient;  // Retry on same account
    }
    
    ClientErrorKind::Permanent
}

// loom-server-llm-anthropic/src/client.rs:47
pub fn is_quota_message(msg: &str) -> bool {
    let lower = msg.to_ascii_lowercase();
    lower.contains("5-hour")
        || lower.contains("rolling window")
        || lower.contains("usage limit for your plan")
        || lower.contains("subscription usage limit")
}
Anthropic enforces a 5-hour rolling window for API usage. The pool automatically detects quota exhaustion errors and fails over to the next healthy account.

Health Monitoring

Monitor pool health via the status API:
let status = pool.get_status().await;

for (idx, account) in status.accounts.iter().enumerate() {
    println!("Account {}: {:?}", idx, account.health);
    println!("  Requests: {}", account.request_count);
    println!("  Errors: {}", account.error_count);
    println!("  Last error: {:?}", account.last_error);
}
Health statuses:
  • Healthy - Account is operational
  • QuotaExceeded - 5-hour quota exhausted, retrying after cooldown
  • Unhealthy - Permanent authentication failure, account disabled
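The statuses above imply a simple eligibility rule: healthy accounts are always selectable, quota-exhausted accounts become selectable again after their cooldown, and unhealthy accounts never are. The sketch below illustrates that rule; the `Health` enum fields and `eligible` method are hypothetical, not the pool's actual internals.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-account health state, mirroring the documented statuses.
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    QuotaExceeded { retry_at: Instant },
    Unhealthy,
}

impl Health {
    /// An account is eligible for selection when healthy, or when its
    /// quota cooldown has elapsed. Permanently failed accounts stay disabled.
    fn eligible(&self, now: Instant) -> bool {
        match self {
            Health::Healthy => true,
            Health::QuotaExceeded { retry_at } => now >= *retry_at,
            Health::Unhealthy => false,
        }
    }
}

fn main() {
    let now = Instant::now();
    let cooling = Health::QuotaExceeded { retry_at: now + Duration::from_secs(300) };
    println!("healthy eligible: {}", Health::Healthy.eligible(now));   // true
    println!("cooling eligible: {}", cooling.eligible(now));           // false
    println!("disabled eligible: {}", Health::Unhealthy.eligible(now)); // false
}
```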

Configuration

use loom_server_llm_anthropic::AnthropicConfig;

let config = AnthropicConfig::new("sk-ant-api03-...")
    .with_model("claude-3-5-sonnet-20241022")  // Default model
    .with_base_url("https://api.anthropic.com")  // Custom endpoint
    .with_max_tokens(4096);  // Default max_tokens

let client = AnthropicClient::new(config)?;
Environment variables:
ANTHROPIC_API_KEY=sk-ant-api03-...
ANTHROPIC_BASE_URL=https://api.anthropic.com  # Optional
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022     # Optional

OpenAI

OpenAI integration provides access to GPT models via the Chat Completions API.

Configuration

use loom_server_llm_openai::{OpenAIClient, OpenAIConfig};

let config = OpenAIConfig::new("sk-...")
    .with_model("gpt-4-turbo")  // or gpt-4, gpt-3.5-turbo
    .with_organization("org-...");  // Optional

let client = OpenAIClient::new(config)?;
Environment variables:
OPENAI_API_KEY=sk-...
OPENAI_ORGANIZATION=org-...  # Optional
OPENAI_MODEL=gpt-4-turbo     # Optional

Retry Configuration

All LLM clients support configurable retry with exponential backoff:
use loom_common_http::RetryConfig;
use std::time::Duration;

let retry_config = RetryConfig {
    max_attempts: 3,
    base_delay: Duration::from_millis(500),
    max_delay: Duration::from_secs(30),
    backoff_factor: 2.0,  // Exponential backoff: 500ms, 1s, 2s, ...
    jitter: true,  // Add randomness to prevent thundering herd
    retryable_statuses: vec![
        reqwest::StatusCode::TOO_MANY_REQUESTS,  // 429
        reqwest::StatusCode::REQUEST_TIMEOUT,     // 408
        reqwest::StatusCode::INTERNAL_SERVER_ERROR,  // 500
        reqwest::StatusCode::BAD_GATEWAY,         // 502
        reqwest::StatusCode::SERVICE_UNAVAILABLE, // 503
        reqwest::StatusCode::GATEWAY_TIMEOUT,     // 504
    ],
};

let client = OpenAIClient::new(config)?
    .with_retry_config(retry_config);
Implementation:
// loom-server-llm-openai/src/client.rs:160
let result = retry(&self.retry_config, || async {
    let req = self.build_request(&request, false);
    
    let response = req.send().await.map_err(|e| {
        if e.is_timeout() {
            OpenAIRequestError(LlmError::Timeout)
        } else {
            OpenAIRequestError(LlmError::Http(e.to_string()))
        }
    })?;
    
    if !response.status().is_success() {
        let error = self.handle_error_response(response).await;
        return Err(OpenAIRequestError(error));
    }
    
    let openai_response: OpenAIResponse = response.json().await
        .map_err(|e| OpenAIRequestError(LlmError::InvalidResponse(e.to_string())))?;
    
    Ok(LlmResponse::from(openai_response))
}).await;
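The delay schedule implied by the configuration above (500 ms base, factor 2.0, 30 s cap) can be sketched as a standalone computation. This is an illustration of exponential backoff with a cap, not the actual `loom_common_http` retry code, and it omits the jitter step.

```rust
use std::time::Duration;

/// Nominal delay before retry attempt `attempt` (0-based), before jitter:
/// base * factor^attempt, capped at `max`.
fn backoff_delay(base: Duration, factor: f64, max: Duration, attempt: u32) -> Duration {
    let nominal = base.mul_f64(factor.powi(attempt as i32));
    nominal.min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..8 {
        // 500ms, 1s, 2s, 4s, 8s, 16s, then capped at 30s
        println!("attempt {}: {:?}", attempt, backoff_delay(base, 2.0, max, attempt));
    }
}
```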

Rate Limiting

OpenAI returns rate limit information in headers:
// loom-server-llm-openai/src/client.rs:117
if status_code == 429 {
    let retry_after = response
        .headers()
        .get("retry-after")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse().ok());
    
    return LlmError::RateLimited {
        retry_after_secs: retry_after,
    };
}

Google Vertex AI (Gemini)

Vertex AI provides access to Google’s Gemini models via GCP.

Authentication

Vertex AI uses Application Default Credentials (ADC):
1. Set up credentials

Choose one of the following methods:

Service Account (Production):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Default Service Account (GCE/GKE):
# Automatically uses the compute engine default service account
# No environment variables needed

User Credentials (Development):
gcloud auth application-default login

2. Configure client

use loom_server_llm_vertex::{VertexClient, VertexConfig};

let config = VertexConfig::new("my-gcp-project", "us-central1")
    .with_model("gemini-1.5-pro");

let client = VertexClient::new(config)?;

Token Caching

Vertex AI automatically caches access tokens to reduce auth overhead:
// loom-server-llm-vertex/src/client.rs:134
async fn get_access_token(&self) -> Result<String, ClientError> {
    // Check cache first (60s buffer before expiry)
    {
        let cached = self.cached_token.read().await;
        if let Some(ref token) = *cached {
            if token.expires_at > std::time::Instant::now() + Duration::from_secs(60) {
                return Ok(token.token.clone());
            }
        }
    }
    
    // Initialize auth provider lazily
    let mut provider_guard = self.auth_provider.write().await;
    if provider_guard.is_none() {
        let provider = gcp_auth::provider().await?;
        *provider_guard = Some(provider);
    }
    
    // Get fresh token
    let provider = provider_guard.as_ref().unwrap();
    let scopes = &["https://www.googleapis.com/auth/cloud-platform"];
    let token = provider.token(scopes).await?;
    let token_str = token.as_str().to_string();
    
    // Cache for ~1 hour
    let mut cached = self.cached_token.write().await;
    *cached = Some(CachedToken {
        token: token_str.clone(),
        expires_at: std::time::Instant::now() + Duration::from_secs(3500),
    });
    
    Ok(token_str)
}

Available Models

gemini-1.5-pro

Best for: Complex reasoning, long context (1M tokens)
Flagship model with advanced reasoning capabilities

gemini-1.5-flash

Best for: Fast responses, high throughput
Optimized for speed and efficiency

gemini-1.0-pro

Best for: Production workloads, stable API
Previous generation, highly reliable
Regional endpoints:
let config = VertexConfig::new("my-project", "us-central1");  // US
let config = VertexConfig::new("my-project", "europe-west1");  // Europe
let config = VertexConfig::new("my-project", "asia-southeast1");  // Asia

ZAI (智谱AI)

ZAI provides Chinese language models from ZhipuAI, compatible with OpenAI’s API format.

Configuration

use loom_server_llm_zai::{ZaiClient, ZaiConfig};

let config = ZaiConfig::new("...")  // API key from ZhipuAI
    .with_model("glm-4");  // or glm-4-plus, glm-3-turbo

let client = ZaiClient::new(config)?;
Environment variables:
ZAI_API_KEY=...
ZAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4  # Default
ZAI_MODEL=glm-4  # Optional

Available Models

  • glm-4-plus - Most capable model, best for complex tasks
  • glm-4 - Balanced performance and cost
  • glm-3-turbo - Fast responses, cost-effective
ZAI uses an OpenAI-compatible API, so the client implementation is nearly identical to OpenAIClient with ZAI-specific endpoints.

Unified LlmClient Interface

All providers implement the same LlmClient trait for consistency:
use loom_common_core::{LlmClient, LlmRequest, Message};

#[async_trait]
pub trait LlmClient: Send + Sync {
    /// Perform a non-streaming completion.
    async fn complete(&self, request: LlmRequest) -> Result<LlmResponse, LlmError>;
    
    /// Perform a streaming completion.
    async fn stream(&self, request: LlmRequest) -> Result<LlmStream, LlmError>;
}

Making Requests

use loom_common_core::{LlmRequest, Message, Tool};

// Build request
let request = LlmRequest::new("claude-3-5-sonnet-20241022")
    .with_messages(vec![
        Message::system("You are a helpful coding assistant."),
        Message::user("Write a Rust function to reverse a string."),
    ])
    .with_max_tokens(4096)
    .with_temperature(0.7)
    .with_tools(vec![
        Tool::new(
            "search_code",
            "Search for code examples",
            json!({
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            })
        )
    ]);

// Execute
let response = client.complete(request).await?;

println!("Response: {}", response.message.content);
for tool_call in response.tool_calls {
    println!("Tool: {} with args: {}", tool_call.tool_name, tool_call.arguments_json);
}

Streaming Responses

use futures::StreamExt;

let mut stream = client.stream(request).await?;

while let Some(event) = stream.next().await {
    match event? {
        LlmEvent::ContentDelta(text) => {
            print!("{}", text);
        }
        LlmEvent::ToolCall(tool_call) => {
            println!("\nCalling tool: {}", tool_call.tool_name);
        }
        LlmEvent::Done { usage } => {
            println!("\nTokens: {} in, {} out", usage.input_tokens, usage.output_tokens);
            break;
        }
        LlmEvent::Error(error) => {
            eprintln!("Stream error: {}", error);
            break;
        }
    }
}

Error Handling

use loom_common_core::LlmError;

match client.complete(request).await {
    Ok(response) => { /* ... */ }
    Err(LlmError::RateLimited { retry_after_secs }) => {
        println!("Rate limited, retry after {:?} seconds", retry_after_secs);
    }
    Err(LlmError::Timeout) => {
        println!("Request timed out");
    }
    Err(LlmError::Api(message)) => {
        println!("API error: {}", message);
    }
    Err(LlmError::Http(error)) => {
        println!("HTTP error: {}", error);
    }
    Err(LlmError::InvalidResponse(error)) => {
        println!("Invalid response: {}", error);
    }
}

Usage Tracking

All providers return token usage information:
let response = client.complete(request).await?;

if let Some(usage) = response.usage {
    println!("Input tokens: {}", usage.input_tokens);
    println!("Output tokens: {}", usage.output_tokens);
    println!("Total tokens: {}", usage.input_tokens + usage.output_tokens);
}

Best Practices

Use Streaming

Stream responses for better UX. Users see output immediately instead of waiting for the entire response.

Set Timeouts

Configure appropriate timeouts (default: 5 minutes). Long-running requests should use streaming to avoid timeouts.

Handle Rate Limits

Respect retry-after headers and implement exponential backoff. Use account pooling for high-volume workloads.

Monitor Usage

Track token usage to optimize costs. Consider caching responses for repeated queries.
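As a rough illustration of the caching suggestion, a minimal in-memory cache keyed by (model, prompt) might look like the sketch below. This is hypothetical and not part of Loom; production code would also bound the cache size and expire entries.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory response cache keyed by (model, prompt).
struct ResponseCache {
    entries: HashMap<(String, String), String>,
}

impl ResponseCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached response, or compute it once (e.g. by calling the
    /// LLM) and cache it for repeated queries.
    fn get_or_insert_with<F: FnOnce() -> String>(&mut self, model: &str, prompt: &str, f: F) -> String {
        self.entries
            .entry((model.to_string(), prompt.to_string()))
            .or_insert_with(f)
            .clone()
    }
}

fn main() {
    let mut cache = ResponseCache::new();
    let mut llm_calls = 0;
    for _ in 0..3 {
        let _ = cache.get_or_insert_with("glm-4", "hello", || {
            llm_calls += 1; // only the first lookup misses
            "hi".to_string()
        });
    }
    println!("LLM called {} time(s)", llm_calls); // 1
}
```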
For implementation details, see the source in crates/loom-server-llm-{anthropic,openai,vertex,zai}/.
