
Overview

The LLMClient provides a robust interface for interacting with multiple LLM endpoints. It implements weighted routing, automatic failover, latency-based load balancing, and health tracking. Location: packages/orchestrator/src/llm-client.ts

Class: LLMClient

Constructor

new LLMClient(config: LLMClientConfig | LLMClientSingleConfig)
Multi-endpoint Configuration:
interface LLMClientConfig {
  endpoints: LLMEndpoint[]  // Array of endpoints
  model: string
  maxTokens: number
  temperature: number
  timeoutMs?: number        // Default: 120_000 (2 min)
}
Single-endpoint Configuration:
interface LLMClientSingleConfig {
  endpoint: string
  model: string
  maxTokens: number
  temperature: number
  apiKey?: string
  timeoutMs?: number
}
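Internally, a single-endpoint config can be treated as a one-element multi-endpoint config. A minimal sketch of that normalization; the `toMultiConfig` helper, the `'default'` name, and the weight of 1 are assumptions for illustration, not the actual constructor logic:

```typescript
// Shapes copied from the interfaces documented above.
interface LLMEndpoint {
  name: string
  endpoint: string
  apiKey?: string
  weight: number
}

interface LLMClientSingleConfig {
  endpoint: string
  model: string
  maxTokens: number
  temperature: number
  apiKey?: string
  timeoutMs?: number
}

// Hypothetical normalization: wrap the single endpoint in a one-element array.
function toMultiConfig(single: LLMClientSingleConfig) {
  const endpoints: LLMEndpoint[] = [{
    name: 'default',          // assumed default name
    endpoint: single.endpoint,
    apiKey: single.apiKey,
    weight: 1                 // assumed default weight
  }]
  return {
    endpoints,
    model: single.model,
    maxTokens: single.maxTokens,
    temperature: single.temperature,
    timeoutMs: single.timeoutMs
  }
}
```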
endpoints (LLMEndpoint[], required): Array of LLM endpoints with name, URL, API key, and weight
model (string, required): Model identifier (e.g., 'glm-5', 'gpt-4')
maxTokens (number, required): Maximum tokens per completion (e.g., 65536)
temperature (number, required): Sampling temperature (e.g., 0.7)

LLMEndpoint

interface LLMEndpoint {
  name: string           // Endpoint identifier
  endpoint: string       // Base URL (will normalize to /v1)
  apiKey?: string        // Optional API key
  weight: number         // Routing weight (higher = more traffic)
}
Example:
const client = new LLMClient({
  endpoints: [
    { name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
    { name: 'backup', endpoint: 'http://backup:8000', weight: 30 }
  ],
  model: 'glm-5',
  maxTokens: 65536,
  temperature: 0.7
})

Core Method

complete()

Sends a chat completion request with automatic failover.
async complete(
  messages: LLMMessage[],
  overrides?: Partial<Pick<LLMClientConfig, 'model' | 'temperature' | 'maxTokens'>>,
  parentSpan?: Span
): Promise<LLMResponse>
messages (LLMMessage[], required): Array of chat messages with role and content
overrides (object, optional): Override model, temperature, or maxTokens for this request
parentSpan (Span, optional): Parent tracing span for distributed tracing
Returns: LLMResponse
interface LLMResponse {
  content: string
  usage: {
    promptTokens: number
    completionTokens: number
    totalTokens: number
  }
  finishReason: string
  endpoint: string        // Which endpoint served this request
  latencyMs: number
}
Message Format:
interface LLMMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}
Example:
const response = await client.complete([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Explain task decomposition.' }
])

console.log(response.content)
console.log(`Tokens used: ${response.usage.totalTokens}`)
console.log(`Served by: ${response.endpoint}`)
console.log(`Latency: ${response.latencyMs}ms`)
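Overrides apply to a single call and do not mutate the client. A standalone sketch of the (assumed shallow) merge the client performs per request; `mergeOverrides` is illustrative, not part of the API:

```typescript
interface RequestDefaults {
  model: string
  temperature: number
  maxTokens: number
}

// Later spread properties win, so any field present in overrides
// replaces the corresponding client default for this call only.
function mergeOverrides(
  defaults: RequestDefaults,
  overrides?: Partial<RequestDefaults>
): RequestDefaults {
  return { ...defaults, ...overrides }
}

const merged = mergeOverrides(
  { model: 'glm-5', temperature: 0.7, maxTokens: 65536 },
  { temperature: 0 }
)
// merged: { model: 'glm-5', temperature: 0, maxTokens: 65536 }
```

In the real client this corresponds to `client.complete(messages, { temperature: 0 })`.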

Weighted Routing

Endpoint Selection

Endpoints are selected using weighted random sampling:
private selectEndpoints(): EndpointState[]
Selection Order:
  1. Healthy endpoints - Weighted random (respects effective weights)
  2. Unhealthy endpoints - Fallback in original order
Weight Calculation:
const totalWeight = endpoints.reduce((sum, e) => sum + e.effectiveWeight, 0)
let pick = Math.random() * totalWeight

for (const endpoint of endpoints) {
  pick -= endpoint.effectiveWeight
  if (pick <= 0) {
    return endpoint  // Selected
  }
}
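The fragment above runs inside selectEndpoints(). Factored into a standalone function with the random draw injected as a parameter, the same logic becomes deterministic and easy to verify (`pickWeighted` is an illustrative name, not part of the API):

```typescript
interface Weighted { name: string; effectiveWeight: number }

// Pick an endpoint given a uniform draw in [0, 1).
function pickWeighted(endpoints: Weighted[], draw: number): Weighted {
  const totalWeight = endpoints.reduce((sum, e) => sum + e.effectiveWeight, 0)
  let pick = draw * totalWeight
  for (const endpoint of endpoints) {
    pick -= endpoint.effectiveWeight
    if (pick <= 0) return endpoint
  }
  return endpoints[endpoints.length - 1]  // guard against float rounding
}

const pool = [
  { name: 'primary', effectiveWeight: 70 },
  { name: 'backup', effectiveWeight: 15 }
]
// draw=0.5 → pick=42.5, inside primary's 70-unit slice → 'primary'
// draw=0.9 → pick=76.5, past primary's slice → 'backup'
```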

Latency-Adaptive Weighting

private rebalanceWeights(): void
Adjusts effective weights based on observed latency:
const minLatency = Math.min(...endpoints.map(e => e.avgLatencyMs))

for (const endpoint of endpoints) {
  const latencyRatio = endpoint.avgLatencyMs / minLatency
  const latencyScale = Math.max(0.5, 1.0 / latencyRatio)
  endpoint.effectiveWeight = endpoint.baseWeight * latencyScale
}
Example:
Endpoint A: base=70, latency=100ms → effective=70 * 1.0 = 70
Endpoint B: base=30, latency=200ms → effective=30 * 0.5 = 15

Endpoint A now receives ~82% of traffic (70/(70+15))
Smoothing: Uses exponential moving average (EMA) with α=0.3:
const LATENCY_ALPHA = 0.3
avgLatency = LATENCY_ALPHA * newLatency + (1 - LATENCY_ALPHA) * avgLatency
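The worked example can be checked with a standalone version of the same formula (a sketch; the real private method mutates the client's internal endpoint states rather than taking them as a parameter):

```typescript
interface LatencyState {
  baseWeight: number
  avgLatencyMs: number
  effectiveWeight: number
}

// Scale each base weight down toward 0.5x as its latency grows
// relative to the fastest endpoint.
function rebalance(endpoints: LatencyState[]): void {
  const minLatency = Math.min(...endpoints.map(e => e.avgLatencyMs))
  for (const endpoint of endpoints) {
    const latencyRatio = endpoint.avgLatencyMs / minLatency
    const latencyScale = Math.max(0.5, 1.0 / latencyRatio)
    endpoint.effectiveWeight = endpoint.baseWeight * latencyScale
  }
}

const states = [
  { baseWeight: 70, avgLatencyMs: 100, effectiveWeight: 70 },
  { baseWeight: 30, avgLatencyMs: 200, effectiveWeight: 30 }
]
rebalance(states)
// states[0].effectiveWeight === 70 (ratio 1.0, scale 1.0)
// states[1].effectiveWeight === 15 (ratio 2.0, scale clamped to 0.5)
```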

Health Tracking

Endpoint States

interface EndpointState {
  config: LLMEndpoint
  effectiveWeight: number
  avgLatencyMs: number
  totalRequests: number
  totalFailures: number
  consecutiveFailures: number
  lastFailureAt: number
  healthy: boolean
}

Health Transitions

Mark Unhealthy:
const UNHEALTHY_THRESHOLD = 3

if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
  state.healthy = false
  logger.warn(`Endpoint ${state.config.name} marked unhealthy`)
}
Recovery Probe:
const RECOVERY_PROBE_MS = 30_000  // 30 seconds

if (!state.healthy && now - state.lastFailureAt > RECOVERY_PROBE_MS) {
  state.healthy = true
  state.consecutiveFailures = 0
  logger.info(`Endpoint ${state.config.name} marked healthy for recovery probe`)
}
Unhealthy endpoints are periodically retried to detect recovery.
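Both transitions can be exercised together with a small simulation that injects the clock (the function names and the injected `now` parameter are illustrative; the real code reads `Date.now()` directly):

```typescript
const UNHEALTHY_THRESHOLD = 3
const RECOVERY_PROBE_MS = 30_000

interface HealthState {
  consecutiveFailures: number
  lastFailureAt: number
  healthy: boolean
}

// Record a failure at the given (injected) timestamp.
function recordFailureAt(state: HealthState, now: number): void {
  state.consecutiveFailures++
  state.lastFailureAt = now
  if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) state.healthy = false
}

// Re-admit an unhealthy endpoint once the cooldown has elapsed.
function maybeProbe(state: HealthState, now: number): void {
  if (!state.healthy && now - state.lastFailureAt > RECOVERY_PROBE_MS) {
    state.healthy = true
    state.consecutiveFailures = 0
  }
}

const state = { consecutiveFailures: 0, lastFailureAt: 0, healthy: true }
recordFailureAt(state, 1_000)
recordFailureAt(state, 2_000)
recordFailureAt(state, 3_000)
const unhealthyAfterThree = !state.healthy  // third consecutive failure
maybeProbe(state, 10_000)
const stillUnhealthy = !state.healthy       // only 7s since last failure
maybeProbe(state, 40_000)                   // 37s elapsed: probe allowed
```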

Success/Failure Recording

private recordSuccess(state: EndpointState, latencyMs: number): void {
  state.consecutiveFailures = 0
  state.healthy = true
  
  // Update EMA
  if (state.avgLatencyMs === 0) {
    state.avgLatencyMs = latencyMs
  } else {
    state.avgLatencyMs = LATENCY_ALPHA * latencyMs + 
                         (1 - LATENCY_ALPHA) * state.avgLatencyMs
  }
  
  this.rebalanceWeights()
}

private recordFailure(state: EndpointState, error: Error): void {
  state.totalFailures++
  state.consecutiveFailures++
  state.lastFailureAt = Date.now()
  
  if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
    state.healthy = false
  }
}
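The latency EMA from recordSuccess() can be traced by hand. A standalone sketch of the same update rule (the `updateAvgLatency` helper is illustrative):

```typescript
const LATENCY_ALPHA = 0.3

// First observation seeds the average; later ones blend in with weight 0.3.
function updateAvgLatency(avg: number, latencyMs: number): number {
  if (avg === 0) return latencyMs
  return LATENCY_ALPHA * latencyMs + (1 - LATENCY_ALPHA) * avg
}

let avg = 0
avg = updateAvgLatency(avg, 100)  // seeded: 100
avg = updateAvgLatency(avg, 200)  // 0.3 * 200 + 0.7 * 100 = 130
```

The low α means a single slow request nudges the average rather than dominating it.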

Failover Behavior

for (const [attemptIndex, endpoint] of orderedEndpoints.entries()) {
  try {
    const result = await this.sendRequest(endpoint, messages)
    return result  // Success
  } catch (error) {
    this.recordFailure(endpoint, error)
    
    if (attemptIndex < orderedEndpoints.length - 1) {
      logger.warn(`Endpoint ${endpoint.config.name} failed, trying next`)
    }
  }
}

throw new Error(`All ${orderedEndpoints.length} LLM endpoints failed`)
Behavior:
  1. Try primary endpoint (highest weight, lowest latency)
  2. On failure, immediately try next endpoint
  3. Continue until success or all endpoints exhausted
  4. Record failures for health tracking
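A synchronous toy version of this loop shows the ordering guarantee: the first success wins, and a failure falls through to the next endpoint immediately (the mock `senders` and names are purely illustrative):

```typescript
type Sender = () => string

// Try each sender in order; return on the first success, throw if all fail.
function failover(
  senders: Array<{ name: string; send: Sender }>
): { content: string; servedBy: string } {
  for (const { name, send } of senders) {
    try {
      return { content: send(), servedBy: name }
    } catch {
      // Failure recorded here in the real client; fall through to the next.
    }
  }
  throw new Error(`All ${senders.length} LLM endpoints failed`)
}

const result = failover([
  { name: 'primary', send: () => { throw new Error('503') } },
  { name: 'backup', send: () => 'ok' }
])
// result.servedBy is 'backup': the primary failure triggered immediate failover
```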

Request Execution

sendRequest()

private async sendRequest(
  state: EndpointState,
  messages: LLMMessage[],
  overrides?: Partial<LLMClientConfig>,
  parentSpan?: Span
): Promise<LLMResponse>
HTTP Request:
const url = `${state.config.endpoint}/v1/chat/completions`

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    ...(state.config.apiKey ? { 'Authorization': `Bearer ${state.config.apiKey}` } : {})
  },
  body: JSON.stringify({
    model,
    messages,
    temperature,
    max_tokens: maxTokens
  }),
  signal: AbortSignal.timeout(timeoutMs ?? 120_000)
})
Response Validation:
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse
Validates OpenAI-compatible response shape.
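A minimal sketch of such a guard, checking only the fields the client reads from `choices[0]` (the real validator may be stricter, e.g. also checking `usage` and `finish_reason`):

```typescript
interface ChatCompletionResponse {
  choices: Array<{ message: { content: string }; finish_reason: string }>
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
}

// Narrow an unknown HTTP response body to the OpenAI-compatible shape.
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse {
  if (typeof value !== 'object' || value === null) return false
  const choices = (value as { choices?: unknown }).choices
  if (!Array.isArray(choices) || choices.length === 0) return false
  const message = choices[0]?.message
  return typeof message?.content === 'string'
}

isChatCompletionResponse({ choices: [{ message: { content: 'hi' } }] })  // true
isChatCompletionResponse({ error: 'overloaded' })                        // false
```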

Readiness Probe

waitForReady()

Waits for at least one endpoint to become available.
async waitForReady(options?: {
  maxWaitMs?: number      // Default: 120_000 (2 min)
  pollIntervalMs?: number // Default: 5_000 (5 sec)
}): Promise<void>
Behavior:
while (Date.now() < deadline) {
  for (const endpoint of endpoints) {
    try {
      const res = await fetch(`${endpoint.endpoint}/v1/models`, {
        signal: AbortSignal.timeout(5_000)
      })
      if (res.ok) {
        return  // Ready!
      }
    } catch {
      // Not ready, continue
    }
  }
  
  await sleep(pollIntervalMs)
}

throw new Error('LLM readiness probe timed out')
Usage:
await client.waitForReady({ maxWaitMs: 60_000 })
logger.info('LLM client ready')

Statistics

getEndpointStats()

getEndpointStats(): Array<{
  name: string
  endpoint: string
  healthy: boolean
  effectiveWeight: number
  avgLatencyMs: number
  totalRequests: number
  totalFailures: number
}>
Example Output:
[
  {
    name: 'primary',
    endpoint: 'http://localhost:8000',
    healthy: true,
    effectiveWeight: 70.0,
    avgLatencyMs: 250,
    totalRequests: 42,
    totalFailures: 0
  },
  {
    name: 'backup',
    endpoint: 'http://backup:8000',
    healthy: false,
    effectiveWeight: 15.0,
    avgLatencyMs: 1200,
    totalRequests: 8,
    totalFailures: 5
  }
]

totalRequests

get totalRequests(): number
Returns total number of completion requests made.

Usage Example

import { LLMClient } from '@longshot/orchestrator'

// Multi-endpoint client
const client = new LLMClient({
  endpoints: [
    { name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
    { name: 'secondary', endpoint: 'http://localhost:8001', weight: 30 }
  ],
  model: 'glm-5',
  maxTokens: 65536,
  temperature: 0.7,
  timeoutMs: 120_000
})

// Wait for readiness
await client.waitForReady()

// Send completion request
const response = await client.complete([
  { role: 'system', content: 'You are a task planner.' },
  { role: 'user', content: 'Break down: Implement authentication' }
])

console.log(response.content)
console.log(`Served by ${response.endpoint} in ${response.latencyMs}ms`)

// Check endpoint health
const stats = client.getEndpointStats()
for (const stat of stats) {
  console.log(`${stat.name}: healthy=${stat.healthy}, latency=${stat.avgLatencyMs}ms`)
}

Constants

const LATENCY_ALPHA = 0.3           // EMA smoothing factor
const UNHEALTHY_THRESHOLD = 3       // Consecutive failures
const RECOVERY_PROBE_MS = 30_000    // 30 seconds
