
Overview

The LLMClient provides a robust interface for interacting with multiple LLM endpoints. It implements weighted routing, automatic failover, latency-based load balancing, and health tracking. Location: packages/orchestrator/src/llm-client.ts

Class: LLMClient

Constructor

new LLMClient(config: LLMClientConfig | LLMClientSingleConfig)
Multi-endpoint Configuration:
interface LLMClientConfig {
  endpoints: LLMEndpoint[]  // Array of endpoints
  model: string
  maxTokens: number
  temperature: number
  timeoutMs?: number        // Default: 120_000 (2 min)
}
Single-endpoint Configuration:
interface LLMClientSingleConfig {
  endpoint: string
  model: string
  maxTokens: number
  temperature: number
  apiKey?: string
  timeoutMs?: number
}
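Internally, a single-endpoint config can be treated as a one-element multi-endpoint config. A minimal sketch of that normalization; the `toMultiConfig` helper, the `'default'` name, and the weight of 1 are assumptions for illustration, not the actual constructor logic:

```typescript
// Shapes copied from the interfaces documented above.
interface LLMEndpoint {
  name: string
  endpoint: string
  apiKey?: string
  weight: number
}

interface LLMClientSingleConfig {
  endpoint: string
  model: string
  maxTokens: number
  temperature: number
  apiKey?: string
  timeoutMs?: number
}

// Hypothetical normalization: wrap the single endpoint in a one-element array.
function toMultiConfig(single: LLMClientSingleConfig) {
  const endpoints: LLMEndpoint[] = [{
    name: 'default',          // assumed default name
    endpoint: single.endpoint,
    apiKey: single.apiKey,
    weight: 1                 // assumed default weight
  }]
  return {
    endpoints,
    model: single.model,
    maxTokens: single.maxTokens,
    temperature: single.temperature,
    timeoutMs: single.timeoutMs
  }
}
```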
endpoints (LLMEndpoint[], required): Array of LLM endpoints with name, URL, API key, and weight
model (string, required): Model identifier (e.g., 'glm-5', 'gpt-4')
maxTokens (number, required): Maximum tokens per completion (e.g., 65536)
temperature (number, required): Sampling temperature (e.g., 0.7)

LLMEndpoint

interface LLMEndpoint {
  name: string           // Endpoint identifier
  endpoint: string       // Base URL (will normalize to /v1)
  apiKey?: string        // Optional API key
  weight: number         // Routing weight (higher = more traffic)
}
Example:
const client = new LLMClient({
  endpoints: [
    { name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
    { name: 'backup', endpoint: 'http://backup:8000', weight: 30 }
  ],
  model: 'glm-5',
  maxTokens: 65536,
  temperature: 0.7
})

Core Method

complete()

Sends a chat completion request with automatic failover.
async complete(
  messages: LLMMessage[],
  overrides?: Partial<Pick<LLMClientConfig, 'model' | 'temperature' | 'maxTokens'>>,
  parentSpan?: Span
): Promise<LLMResponse>
messages (LLMMessage[], required): Array of chat messages with role and content
overrides (object, optional): Override model, temperature, or maxTokens for this request
parentSpan (Span, optional): Parent tracing span for distributed tracing
Returns: LLMResponse
interface LLMResponse {
  content: string
  usage: {
    promptTokens: number
    completionTokens: number
    totalTokens: number
  }
  finishReason: string
  endpoint: string        // Which endpoint served this request
  latencyMs: number
}
Message Format:
interface LLMMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}
Example:
const response = await client.complete([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Explain task decomposition.' }
])

console.log(response.content)
console.log(`Tokens used: ${response.usage.totalTokens}`)
console.log(`Served by: ${response.endpoint}`)
console.log(`Latency: ${response.latencyMs}ms`)
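Overrides apply to a single call and do not mutate the client. A standalone sketch of the (assumed shallow) merge the client performs per request; `mergeOverrides` is illustrative, not part of the API:

```typescript
interface RequestDefaults {
  model: string
  temperature: number
  maxTokens: number
}

// Later spread properties win, so any field present in overrides
// replaces the corresponding client default for this call only.
function mergeOverrides(
  defaults: RequestDefaults,
  overrides?: Partial<RequestDefaults>
): RequestDefaults {
  return { ...defaults, ...overrides }
}

const merged = mergeOverrides(
  { model: 'glm-5', temperature: 0.7, maxTokens: 65536 },
  { temperature: 0 }
)
// merged: { model: 'glm-5', temperature: 0, maxTokens: 65536 }
```

In the real client this corresponds to `client.complete(messages, { temperature: 0 })`.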

Weighted Routing

Endpoint Selection

Endpoints are selected using weighted random sampling:
private selectEndpoints(): EndpointState[]
Selection Order:
  1. Healthy endpoints - Weighted random (respects effective weights)
  2. Unhealthy endpoints - Fallback in original order
Weight Calculation:
const totalWeight = endpoints.reduce((sum, e) => sum + e.effectiveWeight, 0)
let pick = Math.random() * totalWeight

for (const endpoint of endpoints) {
  pick -= endpoint.effectiveWeight
  if (pick <= 0) {
    return endpoint  // Selected
  }
}
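The fragment above runs inside selectEndpoints(). Factored into a standalone function with the random draw injected as a parameter, the same logic becomes deterministic and easy to verify (`pickWeighted` is an illustrative name, not part of the API):

```typescript
interface Weighted { name: string; effectiveWeight: number }

// Pick an endpoint given a uniform draw in [0, 1).
function pickWeighted(endpoints: Weighted[], draw: number): Weighted {
  const totalWeight = endpoints.reduce((sum, e) => sum + e.effectiveWeight, 0)
  let pick = draw * totalWeight
  for (const endpoint of endpoints) {
    pick -= endpoint.effectiveWeight
    if (pick <= 0) return endpoint
  }
  return endpoints[endpoints.length - 1]  // guard against float rounding
}

const pool = [
  { name: 'primary', effectiveWeight: 70 },
  { name: 'backup', effectiveWeight: 15 }
]
// draw=0.5 → pick=42.5, inside primary's 70-unit slice → 'primary'
// draw=0.9 → pick=76.5, past primary's slice → 'backup'
```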

Latency-Adaptive Weighting

private rebalanceWeights(): void
Adjusts effective weights based on observed latency:
const minLatency = Math.min(...endpoints.map(e => e.avgLatencyMs))

for (const endpoint of endpoints) {
  const latencyRatio = endpoint.avgLatencyMs / minLatency
  const latencyScale = Math.max(0.5, 1.0 / latencyRatio)
  endpoint.effectiveWeight = endpoint.baseWeight * latencyScale
}
Example:
Endpoint A: base=70, latency=100ms → effective=70 * 1.0 = 70
Endpoint B: base=30, latency=200ms → effective=30 * 0.5 = 15

Endpoint A now receives ~82% of traffic (70/(70+15))
Smoothing: Uses exponential moving average (EMA) with α=0.3:
const LATENCY_ALPHA = 0.3
avgLatency = LATENCY_ALPHA * newLatency + (1 - LATENCY_ALPHA) * avgLatency
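The worked example can be checked with a standalone version of the same formula (a sketch; the real private method mutates the client's internal endpoint states rather than taking them as a parameter):

```typescript
interface LatencyState {
  baseWeight: number
  avgLatencyMs: number
  effectiveWeight: number
}

// Scale each base weight down toward 0.5x as its latency grows
// relative to the fastest endpoint.
function rebalance(endpoints: LatencyState[]): void {
  const minLatency = Math.min(...endpoints.map(e => e.avgLatencyMs))
  for (const endpoint of endpoints) {
    const latencyRatio = endpoint.avgLatencyMs / minLatency
    const latencyScale = Math.max(0.5, 1.0 / latencyRatio)
    endpoint.effectiveWeight = endpoint.baseWeight * latencyScale
  }
}

const states = [
  { baseWeight: 70, avgLatencyMs: 100, effectiveWeight: 70 },
  { baseWeight: 30, avgLatencyMs: 200, effectiveWeight: 30 }
]
rebalance(states)
// states[0].effectiveWeight === 70 (ratio 1.0, scale 1.0)
// states[1].effectiveWeight === 15 (ratio 2.0, scale clamped to 0.5)
```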

Health Tracking

Endpoint States

interface EndpointState {
  config: LLMEndpoint
  effectiveWeight: number
  avgLatencyMs: number
  totalRequests: number
  totalFailures: number
  consecutiveFailures: number
  lastFailureAt: number
  healthy: boolean
}

Health Transitions

Mark Unhealthy:
const UNHEALTHY_THRESHOLD = 3

if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
  state.healthy = false
  logger.warn(`Endpoint ${state.config.name} marked unhealthy`)
}
Recovery Probe:
const RECOVERY_PROBE_MS = 30_000  // 30 seconds

if (!state.healthy && now - state.lastFailureAt > RECOVERY_PROBE_MS) {
  state.healthy = true
  state.consecutiveFailures = 0
  logger.info(`Endpoint ${state.config.name} marked healthy for recovery probe`)
}
Unhealthy endpoints are periodically retried to detect recovery.
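Both transitions can be exercised together with a small simulation that injects the clock (the function names and the injected `now` parameter are illustrative; the real code reads `Date.now()` directly):

```typescript
const UNHEALTHY_THRESHOLD = 3
const RECOVERY_PROBE_MS = 30_000

interface HealthState {
  consecutiveFailures: number
  lastFailureAt: number
  healthy: boolean
}

// Record a failure at the given (injected) timestamp.
function recordFailureAt(state: HealthState, now: number): void {
  state.consecutiveFailures++
  state.lastFailureAt = now
  if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) state.healthy = false
}

// Re-admit an unhealthy endpoint once the cooldown has elapsed.
function maybeProbe(state: HealthState, now: number): void {
  if (!state.healthy && now - state.lastFailureAt > RECOVERY_PROBE_MS) {
    state.healthy = true
    state.consecutiveFailures = 0
  }
}

const state = { consecutiveFailures: 0, lastFailureAt: 0, healthy: true }
recordFailureAt(state, 1_000)
recordFailureAt(state, 2_000)
recordFailureAt(state, 3_000)
const unhealthyAfterThree = !state.healthy  // third consecutive failure
maybeProbe(state, 10_000)
const stillUnhealthy = !state.healthy       // only 7s since last failure
maybeProbe(state, 40_000)                   // 37s elapsed: probe allowed
```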

Success/Failure Recording

private recordSuccess(state: EndpointState, latencyMs: number): void {
  state.consecutiveFailures = 0
  state.healthy = true
  
  // Update EMA
  if (state.avgLatencyMs === 0) {
    state.avgLatencyMs = latencyMs
  } else {
    state.avgLatencyMs = LATENCY_ALPHA * latencyMs + 
                         (1 - LATENCY_ALPHA) * state.avgLatencyMs
  }
  
  this.rebalanceWeights()
}

private recordFailure(state: EndpointState, error: Error): void {
  state.totalFailures++
  state.consecutiveFailures++
  state.lastFailureAt = Date.now()
  
  if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
    state.healthy = false
  }
}
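The latency EMA from recordSuccess() can be traced by hand. A standalone sketch of the same update rule (the `updateAvgLatency` helper is illustrative):

```typescript
const LATENCY_ALPHA = 0.3

// First observation seeds the average; later ones blend in with weight 0.3.
function updateAvgLatency(avg: number, latencyMs: number): number {
  if (avg === 0) return latencyMs
  return LATENCY_ALPHA * latencyMs + (1 - LATENCY_ALPHA) * avg
}

let avg = 0
avg = updateAvgLatency(avg, 100)  // seeded: 100
avg = updateAvgLatency(avg, 200)  // 0.3 * 200 + 0.7 * 100 = 130
```

The low α means a single slow request nudges the average rather than dominating it.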

Failover Behavior

for (const [attemptIndex, endpoint] of orderedEndpoints.entries()) {
  try {
    const result = await this.sendRequest(endpoint, messages)
    return result  // Success
  } catch (error) {
    this.recordFailure(endpoint, error)
    
    if (attemptIndex < orderedEndpoints.length - 1) {
      logger.warn(`Endpoint ${endpoint.config.name} failed, trying next`)
    }
  }
}

throw new Error(`All ${orderedEndpoints.length} LLM endpoints failed`)
Behavior:
  1. Try primary endpoint (highest weight, lowest latency)
  2. On failure, immediately try next endpoint
  3. Continue until success or all endpoints exhausted
  4. Record failures for health tracking
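A synchronous toy version of this loop shows the ordering guarantee: the first success wins, and a failure falls through to the next endpoint immediately (the mock `senders` and names are purely illustrative):

```typescript
type Sender = () => string

// Try each sender in order; return on the first success, throw if all fail.
function failover(
  senders: Array<{ name: string; send: Sender }>
): { content: string; servedBy: string } {
  for (const { name, send } of senders) {
    try {
      return { content: send(), servedBy: name }
    } catch {
      // Failure recorded here in the real client; fall through to the next.
    }
  }
  throw new Error(`All ${senders.length} LLM endpoints failed`)
}

const result = failover([
  { name: 'primary', send: () => { throw new Error('503') } },
  { name: 'backup', send: () => 'ok' }
])
// result.servedBy is 'backup': the primary failure triggered immediate failover
```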

Request Execution

sendRequest()

private async sendRequest(
  state: EndpointState,
  messages: LLMMessage[],
  overrides?: Partial<LLMClientConfig>,
  parentSpan?: Span
): Promise<LLMResponse>
HTTP Request:
const url = `${state.config.endpoint}/v1/chat/completions`

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    ...(state.config.apiKey ? { 'Authorization': `Bearer ${state.config.apiKey}` } : {})
  },
  body: JSON.stringify({
    model,
    messages,
    temperature,
    max_tokens: maxTokens
  }),
  signal: AbortSignal.timeout(timeoutMs ?? 120_000)
})
Response Validation:
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse
Validates OpenAI-compatible response shape.
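A minimal sketch of such a guard, checking only the fields the client reads from `choices[0]` (the real validator may be stricter, e.g. also checking `usage` and `finish_reason`):

```typescript
interface ChatCompletionResponse {
  choices: Array<{ message: { content: string }; finish_reason: string }>
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
}

// Narrow an unknown HTTP response body to the OpenAI-compatible shape.
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse {
  if (typeof value !== 'object' || value === null) return false
  const choices = (value as { choices?: unknown }).choices
  if (!Array.isArray(choices) || choices.length === 0) return false
  const message = choices[0]?.message
  return typeof message?.content === 'string'
}

isChatCompletionResponse({ choices: [{ message: { content: 'hi' } }] })  // true
isChatCompletionResponse({ error: 'overloaded' })                        // false
```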

Readiness Probe

waitForReady()

Waits for at least one endpoint to become available.
async waitForReady(options?: {
  maxWaitMs?: number      // Default: 120_000 (2 min)
  pollIntervalMs?: number // Default: 5_000 (5 sec)
}): Promise<void>
Behavior:
while (Date.now() < deadline) {
  for (const endpoint of endpoints) {
    try {
      const res = await fetch(`${endpoint.endpoint}/v1/models`, {
        signal: AbortSignal.timeout(5_000)
      })
      if (res.ok) {
        return  // Ready!
      }
    } catch {
      // Not ready, continue
    }
  }
  
  await sleep(pollIntervalMs)
}

throw new Error('LLM readiness probe timed out')
Usage:
await client.waitForReady({ maxWaitMs: 60_000 })
logger.info('LLM client ready')

Statistics

getEndpointStats()

getEndpointStats(): Array<{
  name: string
  endpoint: string
  healthy: boolean
  effectiveWeight: number
  avgLatencyMs: number
  totalRequests: number
  totalFailures: number
}>
Example Output:
[
  {
    name: 'primary',
    endpoint: 'http://localhost:8000',
    healthy: true,
    effectiveWeight: 70.0,
    avgLatencyMs: 250,
    totalRequests: 42,
    totalFailures: 0
  },
  {
    name: 'backup',
    endpoint: 'http://backup:8000',
    healthy: false,
    effectiveWeight: 15.0,
    avgLatencyMs: 1200,
    totalRequests: 8,
    totalFailures: 5
  }
]

totalRequests

get totalRequests(): number
Returns total number of completion requests made.

Usage Example

import { LLMClient } from '@longshot/orchestrator'

// Multi-endpoint client
const client = new LLMClient({
  endpoints: [
    { name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
    { name: 'secondary', endpoint: 'http://localhost:8001', weight: 30 }
  ],
  model: 'glm-5',
  maxTokens: 65536,
  temperature: 0.7,
  timeoutMs: 120_000
})

// Wait for readiness
await client.waitForReady()

// Send completion request
const response = await client.complete([
  { role: 'system', content: 'You are a task planner.' },
  { role: 'user', content: 'Break down: Implement authentication' }
])

console.log(response.content)
console.log(`Served by ${response.endpoint} in ${response.latencyMs}ms`)

// Check endpoint health
const stats = client.getEndpointStats()
for (const stat of stats) {
  console.log(`${stat.name}: healthy=${stat.healthy}, latency=${stat.avgLatencyMs}ms`)
}

Constants

const LATENCY_ALPHA = 0.3           // EMA smoothing factor
const UNHEALTHY_THRESHOLD = 3       // Consecutive failures
const RECOVERY_PROBE_MS = 30_000    // 30 seconds
