Loom provides a unified LlmClient trait with implementations for multiple LLM providers. All providers support streaming, tool calling, and automatic retry with exponential backoff.
Supported Providers
Anthropic (Claude) - Claude 3.5 Sonnet, Opus, and Haiku via the Messages API
OpenAI - GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo
Google Vertex AI - Gemini 1.5 Pro and Flash on GCP
ZAI (智谱AI) - Chinese language models from ZhipuAI
Anthropic (Claude)
Loom’s Anthropic integration supports both API key and OAuth authentication, with account pooling for high-volume deployments.
Authentication
API Key
use loom_server_llm_anthropic::{AnthropicClient, AnthropicConfig};

let config = AnthropicConfig::new("sk-ant-api03-...");
let client = AnthropicClient::new(config)?;
Environment variable: ANTHROPIC_API_KEY=sk-ant-api03-...
OAuth (Recommended)
OAuth provides better rate limits and usage tracking:
use chrono::{Duration, Utc};
use loom_server_llm_anthropic::{AnthropicAuth, OAuthCredentials};

let credentials = OAuthCredentials {
    access_token: "...".to_string(),
    refresh_token: Some("...".to_string()),
    expires_at: Some(Utc::now() + Duration::hours(1)),
};
let auth = AnthropicAuth::oauth(credentials, credential_store);
let config = AnthropicConfig::new_with_auth(auth);
OAuth flow:
use loom_server_llm_anthropic::auth::{authorize, exchange_code, Pkce};

// 1. Generate PKCE challenge
let pkce = Pkce::generate();

// 2. Get authorization URL
let auth_request = authorize(&pkce, None)?;
// Redirect user to auth_request.url

// 3. Exchange code for tokens
let result = exchange_code(auth_code, &pkce).await?;
// Store result.credentials
Account Pooling
For high-volume deployments, use AnthropicPool to manage multiple accounts with automatic failover:
use std::time::Duration;
use loom_server_llm_anthropic::{
    AnthropicPool, AnthropicPoolConfig, AccountSelectionStrategy,
};

let config = AnthropicPoolConfig {
    accounts: vec![
        AnthropicConfig::new("sk-ant-api03-account1..."),
        AnthropicConfig::new("sk-ant-api03-account2..."),
        AnthropicConfig::new("sk-ant-api03-account3..."),
    ],
    strategy: AccountSelectionStrategy::RoundRobin,
    health_check_interval: Duration::from_secs(60),
};
let pool = AnthropicPool::new(config).await?;

// Use the pool like a regular client
let response = pool.complete(request).await?;
Selection strategies:
RoundRobin - Distributes requests evenly across all healthy accounts. Best for balanced load distribution.
Least-used - Routes to the account with the lowest recent usage. Best for quota management.
Failover - Uses a primary account until its quota is exhausted, then fails over to backup accounts.
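The round-robin strategy can be sketched as an atomic counter stepping over the set of currently healthy accounts. This is a minimal, self-contained illustration, not the pool's actual implementation; the `Account` struct here is hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for a pooled account.
struct Account {
    name: &'static str,
    healthy: bool,
}

// Pick the next healthy account in round-robin order.
fn next_account<'a>(accounts: &'a [Account], counter: &AtomicUsize) -> Option<&'a Account> {
    let healthy: Vec<&Account> = accounts.iter().filter(|a| a.healthy).collect();
    if healthy.is_empty() {
        return None;
    }
    let idx = counter.fetch_add(1, Ordering::Relaxed) % healthy.len();
    Some(healthy[idx])
}

fn main() {
    let accounts = [
        Account { name: "account1", healthy: true },
        Account { name: "account2", healthy: false }, // skipped: unhealthy
        Account { name: "account3", healthy: true },
    ];
    let counter = AtomicUsize::new(0);
    // Requests alternate between the two healthy accounts.
    for _ in 0..4 {
        println!("{}", next_account(&accounts, &counter).unwrap().name);
    }
}
```

Unhealthy accounts drop out of the rotation automatically because selection runs over the filtered healthy set on every request.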
Automatic failover behavior:
// loom-server-llm-anthropic/src/client.rs:72
fn classify_error(status: u16, message: &str) -> ClientErrorKind {
    if status == 401 || status == 403 {
        return ClientErrorKind::Permanent; // Disable account
    }
    if status == 429 && is_quota_message(message) {
        return ClientErrorKind::QuotaExceeded; // Fail over to next account
    }
    if matches!(status, 408 | 429 | 500 | 502 | 503 | 504) {
        return ClientErrorKind::Transient; // Retry on same account
    }
    ClientErrorKind::Permanent
}
// loom-server-llm-anthropic/src/client.rs:47
pub fn is_quota_message(msg: &str) -> bool {
    let lower = msg.to_ascii_lowercase();
    lower.contains("5-hour")
        || lower.contains("rolling window")
        || lower.contains("usage limit for your plan")
        || lower.contains("subscription usage limit")
}
Anthropic enforces a 5-hour rolling window for API usage. The pool automatically detects quota exhaustion errors and fails over to the next healthy account.
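To make the decision table concrete, here is a condensed, self-contained mirror of the classification logic above (illustrative only; the real implementation lives in client.rs). Note that a 429 with a quota message triggers failover, while a plain 429 is retried on the same account:

```rust
#[derive(Debug, PartialEq)]
enum ClientErrorKind {
    Permanent,     // disable the account
    QuotaExceeded, // fail over to the next account
    Transient,     // retry on the same account
}

// Condensed mirror of classify_error + is_quota_message.
fn classify(status: u16, message: &str) -> ClientErrorKind {
    let lower = message.to_ascii_lowercase();
    let is_quota = lower.contains("5-hour")
        || lower.contains("rolling window")
        || lower.contains("usage limit for your plan")
        || lower.contains("subscription usage limit");
    if status == 401 || status == 403 {
        ClientErrorKind::Permanent
    } else if status == 429 && is_quota {
        ClientErrorKind::QuotaExceeded
    } else if matches!(status, 408 | 429 | 500 | 502 | 503 | 504) {
        ClientErrorKind::Transient
    } else {
        ClientErrorKind::Permanent
    }
}

fn main() {
    assert_eq!(classify(429, "5-hour rolling window exceeded"), ClientErrorKind::QuotaExceeded);
    assert_eq!(classify(429, "too many requests"), ClientErrorKind::Transient);
    assert_eq!(classify(401, "invalid api key"), ClientErrorKind::Permanent);
}
```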
Health Monitoring
Monitor pool health via the status API:
let status = pool.get_status().await;
for (idx, account) in status.accounts.iter().enumerate() {
    println!("Account {}: {:?}", idx, account.health);
    println!("  Requests: {}", account.request_count);
    println!("  Errors: {}", account.error_count);
    println!("  Last error: {:?}", account.last_error);
}
Health statuses:
Healthy - Account is operational
QuotaExceeded - 5-hour quota exhausted, retrying after cooldown
Unhealthy - Permanent authentication failure, account disabled
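The three states above reduce to a small usability check. The following is a hypothetical model of how a pool might decide whether an account can serve a request (the `Health` enum and `is_usable` function are illustrative, not Loom's actual types):

```rust
use std::time::{Duration, Instant};

// Hypothetical model of per-account health.
enum Health {
    Healthy,
    QuotaExceeded { retry_at: Instant }, // cooldown until the quota window resets
    Unhealthy,                           // auth failure: disabled for good
}

fn is_usable(health: &Health, now: Instant) -> bool {
    match health {
        Health::Healthy => true,
        Health::QuotaExceeded { retry_at } => now >= *retry_at,
        Health::Unhealthy => false,
    }
}

fn main() {
    let now = Instant::now();
    let cooling = Health::QuotaExceeded { retry_at: now + Duration::from_secs(300) };
    assert!(is_usable(&Health::Healthy, now));
    assert!(!is_usable(&cooling, now));                           // still cooling down
    assert!(is_usable(&cooling, now + Duration::from_secs(301))); // cooldown elapsed
    assert!(!is_usable(&Health::Unhealthy, now));
}
```

The key asymmetry: quota exhaustion is recoverable after a cooldown, while authentication failures are terminal until an operator intervenes.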
Configuration
use loom_server_llm_anthropic::AnthropicConfig;

let config = AnthropicConfig::new("sk-ant-api03-...")
    .with_model("claude-3-5-sonnet-20241022")   // Default model
    .with_base_url("https://api.anthropic.com") // Custom endpoint
    .with_max_tokens(4096);                     // Default max_tokens
let client = AnthropicClient::new(config)?;
Environment variables:
ANTHROPIC_API_KEY=sk-ant-api03-...
ANTHROPIC_BASE_URL=https://api.anthropic.com  # Optional
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022    # Optional
OpenAI
OpenAI integration provides access to GPT models via the Chat Completions API.
Configuration
use loom_server_llm_openai::{OpenAIClient, OpenAIConfig};

let config = OpenAIConfig::new("sk-...")
    .with_model("gpt-4-turbo")     // or gpt-4, gpt-3.5-turbo
    .with_organization("org-..."); // Optional
let client = OpenAIClient::new(config)?;
Environment variables:
OPENAI_API_KEY=sk-...
OPENAI_ORGANIZATION=org-...  # Optional
OPENAI_MODEL=gpt-4-turbo     # Optional
Retry Configuration
All LLM clients support configurable retry with exponential backoff:
use loom_common_http::RetryConfig;
use std::time::Duration;

let retry_config = RetryConfig {
    max_attempts: 3,
    base_delay: Duration::from_millis(500),
    max_delay: Duration::from_secs(30),
    backoff_factor: 2.0, // Exponential backoff: 500ms, 1s, 2s, ...
    jitter: true,        // Add randomness to prevent thundering herd
    retryable_statuses: vec![
        reqwest::StatusCode::TOO_MANY_REQUESTS,     // 429
        reqwest::StatusCode::REQUEST_TIMEOUT,       // 408
        reqwest::StatusCode::INTERNAL_SERVER_ERROR, // 500
        reqwest::StatusCode::BAD_GATEWAY,           // 502
        reqwest::StatusCode::SERVICE_UNAVAILABLE,   // 503
        reqwest::StatusCode::GATEWAY_TIMEOUT,       // 504
    ],
};
let client = OpenAIClient::new(config)?
    .with_retry_config(retry_config);
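The delay schedule these fields produce can be sketched as a standalone function: base_delay multiplied by backoff_factor raised to the attempt number, capped at max_delay. Jitter is omitted here so the values are deterministic (a sketch of the schedule, not Loom's actual retry code):

```rust
use std::time::Duration;

// Delay before retry attempt `attempt` (0-based): base * factor^attempt, capped at `max`.
fn backoff_delay(base: Duration, factor: f64, max: Duration, attempt: u32) -> Duration {
    let millis = base.as_millis() as f64 * factor.powi(attempt as i32);
    Duration::from_millis(millis as u64).min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    // 500ms, 1s, 2s, ... until the 30s cap kicks in.
    assert_eq!(backoff_delay(base, 2.0, max, 0), Duration::from_millis(500));
    assert_eq!(backoff_delay(base, 2.0, max, 1), Duration::from_secs(1));
    assert_eq!(backoff_delay(base, 2.0, max, 2), Duration::from_secs(2));
    assert_eq!(backoff_delay(base, 2.0, max, 10), max); // 512s uncapped, clamped to 30s
}
```

In production, jitter matters: adding a random offset to each delay spreads out retries from many clients that failed at the same instant.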
Implementation:
// loom-server-llm-openai/src/client.rs:160
let result = retry(&self.retry_config, || async {
    let req = self.build_request(&request, false);
    let response = req.send().await.map_err(|e| {
        if e.is_timeout() {
            OpenAIRequestError(LlmError::Timeout)
        } else {
            OpenAIRequestError(LlmError::Http(e.to_string()))
        }
    })?;
    if !response.status().is_success() {
        let error = self.handle_error_response(response).await;
        return Err(OpenAIRequestError(error));
    }
    let openai_response: OpenAIResponse = response.json().await
        .map_err(|e| OpenAIRequestError(LlmError::InvalidResponse(e.to_string())))?;
    Ok(LlmResponse::from(openai_response))
}).await;
Rate Limiting
OpenAI returns rate limit information in headers:
// loom-server-llm-openai/src/client.rs:117
if status_code == 429 {
    let retry_after = response
        .headers()
        .get("retry-after")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse().ok());
    return LlmError::RateLimited {
        retry_after_secs: retry_after,
    };
}
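The header parsing above reduces to a small function. This self-contained sketch mirrors that logic for the delta-seconds form of Retry-After (the common case for 429s); HTTP-date values simply parse to None here, matching the chained `and_then` behavior above:

```rust
// Parse a Retry-After value given in whole seconds.
fn parse_retry_after(header: Option<&str>) -> Option<u64> {
    header.and_then(|v| v.trim().parse::<u64>().ok())
}

fn main() {
    assert_eq!(parse_retry_after(Some("30")), Some(30));
    assert_eq!(parse_retry_after(Some(" 5 ")), Some(5));
    // HTTP-date form is not handled by a plain integer parse.
    assert_eq!(parse_retry_after(Some("Wed, 21 Oct 2025 07:28:00 GMT")), None);
    assert_eq!(parse_retry_after(None), None);
}
```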
Google Vertex AI (Gemini)
Vertex AI provides access to Google’s Gemini models via GCP.
Authentication
Vertex AI uses Application Default Credentials (ADC):
Set up credentials
Choose one of the following methods:
Service Account (Production):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
Default Service Account (GCE/GKE):
# Automatically uses the compute engine default service account
# No environment variables needed
User Credentials (Development):
gcloud auth application-default login
Configure client
use loom_server_llm_vertex::{VertexClient, VertexConfig};

let config = VertexConfig::new("my-gcp-project", "us-central1")
    .with_model("gemini-1.5-pro");
let client = VertexClient::new(config)?;
Token Caching
Vertex AI automatically caches access tokens to reduce auth overhead:
// loom-server-llm-vertex/src/client.rs:134
async fn get_access_token(&self) -> Result<String, ClientError> {
    // Check cache first (60s buffer before expiry)
    {
        let cached = self.cached_token.read().await;
        if let Some(ref token) = *cached {
            if token.expires_at > std::time::Instant::now() + Duration::from_secs(60) {
                return Ok(token.token.clone());
            }
        }
    }
    // Initialize auth provider lazily
    let mut provider_guard = self.auth_provider.write().await;
    if provider_guard.is_none() {
        let provider = gcp_auth::provider().await?;
        *provider_guard = Some(provider);
    }
    // Get fresh token
    let provider = provider_guard.as_ref().unwrap();
    let scopes = &["https://www.googleapis.com/auth/cloud-platform"];
    let token = provider.token(scopes).await?;
    let token_str = token.as_str().to_string();
    // Cache for ~1 hour
    let mut cached = self.cached_token.write().await;
    *cached = Some(CachedToken {
        token: token_str.clone(),
        expires_at: std::time::Instant::now() + Duration::from_secs(3500),
    });
    Ok(token_str)
}
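The cache-freshness rule above is worth isolating: a token counts as fresh only if it has more than 60 seconds of validity left, so a request started now cannot outlive its token. A minimal sketch of that check:

```rust
use std::time::{Duration, Instant};

// Fresh means: more than 60 seconds of validity remaining.
fn is_fresh(expires_at: Instant, now: Instant) -> bool {
    expires_at > now + Duration::from_secs(60)
}

fn main() {
    let now = Instant::now();
    assert!(is_fresh(now + Duration::from_secs(3500), now)); // nearly an hour left
    assert!(!is_fresh(now + Duration::from_secs(30), now));  // inside the 60s buffer
}
```

Caching for 3500 seconds rather than the nominal 3600 applies the same idea on the write side: the cached lifetime is deliberately shorter than the token's real lifetime.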
Available Models
gemini-1.5-pro - Best for complex reasoning and long context (1M tokens). Flagship model with advanced reasoning capabilities.
gemini-1.5-flash - Best for fast responses and high throughput. Optimized for speed and efficiency.
gemini-1.0-pro - Best for production workloads and a stable API. Previous generation, highly reliable.
Regional endpoints:
let config = VertexConfig::new("my-project", "us-central1");     // US
let config = VertexConfig::new("my-project", "europe-west1");    // Europe
let config = VertexConfig::new("my-project", "asia-southeast1"); // Asia
ZAI (智谱AI)
ZAI provides Chinese language models from ZhipuAI, compatible with OpenAI’s API format.
Configuration
use loom_server_llm_zai::{ZaiClient, ZaiConfig};

let config = ZaiConfig::new("...") // API key from ZhipuAI
    .with_model("glm-4");          // or glm-4-plus, glm-3-turbo
let client = ZaiClient::new(config)?;
Environment variables:
ZAI_API_KEY=...
ZAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4  # Default
ZAI_MODEL=glm-4                                    # Optional
Available Models
glm-4-plus - Most capable model, best for complex tasks
glm-4 - Balanced performance and cost
glm-3-turbo - Fast responses, cost-effective
ZAI uses an OpenAI-compatible API, so the client implementation is nearly identical to OpenAIClient with ZAI-specific endpoints.
Unified LlmClient Interface
All providers implement the same LlmClient trait for consistency:
use loom_common_core::{LlmClient, LlmRequest, Message};

#[async_trait]
pub trait LlmClient: Send + Sync {
    /// Perform a non-streaming completion.
    async fn complete(&self, request: LlmRequest) -> Result<LlmResponse, LlmError>;

    /// Perform a streaming completion.
    async fn stream(&self, request: LlmRequest) -> Result<LlmStream, LlmError>;
}
Making Requests
use loom_common_core::{LlmRequest, Message, Tool};

// Build request
let request = LlmRequest::new("claude-3-5-sonnet-20241022")
    .with_messages(vec![
        Message::system("You are a helpful coding assistant."),
        Message::user("Write a Rust function to reverse a string."),
    ])
    .with_max_tokens(4096)
    .with_temperature(0.7)
    .with_tools(vec![
        Tool::new(
            "search_code",
            "Search for code examples",
            json!({
                "type": "object",
                "properties": {
                    "query": { "type": "string" }
                },
                "required": ["query"]
            }),
        ),
    ]);

// Execute
let response = client.complete(request).await?;
println!("Response: {}", response.message.content);
for tool_call in response.tool_calls {
    println!("Tool: {} with args: {}", tool_call.tool_name, tool_call.arguments_json);
}
Streaming Responses
use futures::StreamExt;

let mut stream = client.stream(request).await?;
while let Some(event) = stream.next().await {
    match event? {
        LlmEvent::ContentDelta(text) => {
            print!("{}", text);
        }
        LlmEvent::ToolCall(tool_call) => {
            println!("\nCalling tool: {}", tool_call.tool_name);
        }
        LlmEvent::Done { usage } => {
            println!("\nTokens: {} in, {} out", usage.input_tokens, usage.output_tokens);
            break;
        }
        LlmEvent::Error(error) => {
            eprintln!("Stream error: {}", error);
            break;
        }
    }
}
Error Handling
use loom_common_core::LlmError;

match client.complete(request).await {
    Ok(response) => { /* ... */ }
    Err(LlmError::RateLimited { retry_after_secs }) => {
        println!("Rate limited, retry after {:?} seconds", retry_after_secs);
    }
    Err(LlmError::Timeout) => {
        println!("Request timed out");
    }
    Err(LlmError::Api(message)) => {
        println!("API error: {}", message);
    }
    Err(LlmError::Http(error)) => {
        println!("HTTP error: {}", error);
    }
    Err(LlmError::InvalidResponse(error)) => {
        println!("Invalid response: {}", error);
    }
}
Usage Tracking
All providers return token usage information:
let response = client.complete(request).await?;
if let Some(usage) = response.usage {
    println!("Input tokens: {}", usage.input_tokens);
    println!("Output tokens: {}", usage.output_tokens);
    println!("Total tokens: {}", usage.input_tokens + usage.output_tokens);
}
Best Practices
Use Streaming - Stream responses for better UX. Users see output immediately instead of waiting for the entire response.
Set Timeouts - Configure appropriate timeouts (default: 5 minutes). Long-running requests should use streaming to avoid timeouts.
Handle Rate Limits - Respect retry-after headers and implement exponential backoff. Use account pooling for high-volume workloads.
Monitor Usage - Track token usage to optimize costs. Consider caching responses for repeated queries.
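Caching repeated queries can be as simple as a map keyed by model and prompt. The following is a hypothetical in-memory sketch (real deployments would key on a hash of the full request and add eviction; none of these types exist in Loom):

```rust
use std::collections::HashMap;

// Hypothetical in-memory response cache keyed by (model, prompt).
struct ResponseCache {
    entries: HashMap<(String, String), String>,
}

impl ResponseCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    fn get(&self, model: &str, prompt: &str) -> Option<&str> {
        self.entries
            .get(&(model.to_string(), prompt.to_string()))
            .map(String::as_str)
    }

    fn put(&mut self, model: &str, prompt: &str, response: String) {
        self.entries.insert((model.to_string(), prompt.to_string()), response);
    }
}

fn main() {
    let mut cache = ResponseCache::new();
    assert!(cache.get("glm-4", "hello").is_none()); // miss: would call the API
    cache.put("glm-4", "hello", "hi there".to_string());
    assert_eq!(cache.get("glm-4", "hello"), Some("hi there")); // hit: no API call
}
```

Caching only pays off for deterministic or low-temperature queries; sampled responses at temperature 0.7 will differ between calls anyway.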
For implementation details, see the source in crates/loom-server-llm-{anthropic,openai,vertex,zai}/.