Overview
The LLMClient provides a robust interface for interacting with multiple LLM endpoints. It implements weighted routing, automatic failover, latency-based load balancing, and health tracking.
Location: packages/orchestrator/src/llm-client.ts
Class: LLMClient
Constructor
new LLMClient(config: LLMClientConfig | LLMClientSingleConfig)
Multi-endpoint Configuration:
interface LLMClientConfig {
endpoints: LLMEndpoint[] // Array of endpoints
model: string
maxTokens: number
temperature: number
timeoutMs?: number // Default: 120_000 (2 min)
}
Single-endpoint Configuration:
interface LLMClientSingleConfig {
endpoint: string
model: string
maxTokens: number
temperature: number
apiKey?: string
timeoutMs?: number
}
- endpoints: Array of LLM endpoints with name, URL, API key, and weight
- model: Model identifier (e.g., 'glm-5', 'gpt-4')
- maxTokens: Maximum tokens per completion (default: 65536)
- temperature: Sampling temperature (default: 0.7)
LLMEndpoint
interface LLMEndpoint {
name: string // Endpoint identifier
endpoint: string // Base URL (will normalize to /v1)
apiKey?: string // Optional API key
weight: number // Routing weight (higher = more traffic)
}
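The comment on `endpoint` says the base URL "will normalize to /v1". The normalization logic itself is not shown in this document; the sketch below is one plausible reading (strip trailing slashes and a trailing `/v1` so that paths like `/v1/chat/completions` can be appended uniformly), with `normalizeEndpoint` as a hypothetical helper name:

```typescript
// Hypothetical sketch of the base-URL normalization implied above.
// The real implementation may differ.
function normalizeEndpoint(raw: string): string {
  let url = raw.replace(/\/+$/, '')     // drop trailing slashes
  if (url.endsWith('/v1')) {
    url = url.slice(0, -'/v1'.length)   // drop a trailing /v1, if present
  }
  return url
}
```

With this, 'http://localhost:8000/v1/' and 'http://localhost:8000' both normalize to the same base URL.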
Example:
const client = new LLMClient({
endpoints: [
{ name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
{ name: 'backup', endpoint: 'http://backup:8000', weight: 30 }
],
model: 'glm-5',
maxTokens: 65536,
temperature: 0.7
})
Core Method
complete()
Sends a chat completion request with automatic failover.
async complete(
messages: LLMMessage[],
overrides?: Partial<Pick<LLMClientConfig, 'model' | 'temperature' | 'maxTokens'>>,
parentSpan?: Span
): Promise<LLMResponse>
- messages: Array of chat messages with role and content
- overrides: Override model, temperature, or maxTokens for this request
- parentSpan: Parent tracing span for distributed tracing
Returns: LLMResponse
interface LLMResponse {
content: string
usage: {
promptTokens: number
completionTokens: number
totalTokens: number
}
finishReason: string
endpoint: string // Which endpoint served this request
latencyMs: number
}
Message Format:
interface LLMMessage {
role: 'system' | 'user' | 'assistant'
content: string
}
Example:
const response = await client.complete([
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Explain task decomposition.' }
])
console.log(response.content)
console.log(`Tokens used: ${response.usage.totalTokens}`)
console.log(`Served by: ${response.endpoint}`)
console.log(`Latency: ${response.latencyMs}ms`)
Weighted Routing
Endpoint Selection
Endpoints are selected using weighted random sampling:
private selectEndpoints(): EndpointState[]
Selection Order:
- Healthy endpoints: selected by weighted random sampling (respects effective weights)
- Unhealthy endpoints: appended afterward in their original order, as a fallback
Weight Calculation:
const totalWeight = endpoints.reduce((sum, e) => sum + e.effectiveWeight, 0)
let pick = Math.random() * totalWeight
for (const endpoint of endpoints) {
pick -= endpoint.effectiveWeight
if (pick <= 0) {
return endpoint // Selected
}
}
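The fragment above can be exercised as a standalone sketch. `pickWeighted` and the injectable `rng` parameter are illustrative names for testability, not part of the actual API:

```typescript
interface Weighted { name: string; effectiveWeight: number }

// Weighted random sampling, as in selectEndpoints(): each item is picked
// with probability effectiveWeight / totalWeight. An injectable rng
// (defaulting to Math.random) keeps the sketch deterministic under test.
function pickWeighted(items: Weighted[], rng: () => number = Math.random): Weighted {
  const totalWeight = items.reduce((sum, e) => sum + e.effectiveWeight, 0)
  let pick = rng() * totalWeight
  for (const item of items) {
    pick -= item.effectiveWeight
    if (pick <= 0) return item
  }
  return items[items.length - 1] // guard against floating-point drift
}
```

With weights 70 and 30, rng values below 0.7 land on the first item and values above it on the second, matching the intended 70/30 split.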
Latency-Adaptive Weighting
private rebalanceWeights(): void
Adjusts effective weights based on observed latency:
const minLatency = Math.min(...endpoints.map(e => e.avgLatencyMs))
for (const endpoint of endpoints) {
const latencyRatio = endpoint.avgLatencyMs / minLatency
const latencyScale = Math.max(0.5, 1.0 / latencyRatio)
endpoint.effectiveWeight = endpoint.baseWeight * latencyScale
}
Example:
Endpoint A: base=70, latency=100ms → effective=70 * 1.0 = 70
Endpoint B: base=30, latency=200ms → effective=30 * 0.5 = 15
Endpoint A now receives ~82% of traffic (70/(70+15))
Smoothing: Uses exponential moving average (EMA) with α=0.3:
const LATENCY_ALPHA = 0.3
avgLatency = LATENCY_ALPHA * newLatency + (1 - LATENCY_ALPHA) * avgLatency
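Worked through a sequence of samples, seeding the average with the first observation (as recordSuccess() does below), the EMA behaves like this; `updateAvgLatency` is an illustrative helper, not part of the actual API:

```typescript
const LATENCY_ALPHA = 0.3

// EMA update as used for avgLatencyMs: the first sample seeds the
// average, later samples are blended in with weight LATENCY_ALPHA.
function updateAvgLatency(avg: number, sample: number): number {
  if (avg === 0) return sample
  return LATENCY_ALPHA * sample + (1 - LATENCY_ALPHA) * avg
}

let avg = 0
for (const sample of [100, 200, 100]) {
  avg = updateAvgLatency(avg, sample)
}
// sequence: 100 → 0.3*200 + 0.7*100 = 130 → 0.3*100 + 0.7*130 = 121
```

Note how a single 200 ms spike moves the average only to 130 ms, and it decays back toward the baseline as normal samples return.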
Health Tracking
Endpoint States
interface EndpointState {
config: LLMEndpoint
effectiveWeight: number
avgLatencyMs: number
totalRequests: number
totalFailures: number
consecutiveFailures: number
lastFailureAt: number
healthy: boolean
}
Health Transitions
Mark Unhealthy:
const UNHEALTHY_THRESHOLD = 3
if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
state.healthy = false
logger.warn(`Endpoint ${state.config.name} marked unhealthy`)
}
Recovery Probe:
const RECOVERY_PROBE_MS = 30_000 // 30 seconds
if (!state.healthy && now - state.lastFailureAt > RECOVERY_PROBE_MS) {
state.healthy = true
state.consecutiveFailures = 0
logger.info(`Endpoint ${state.config.name} marked healthy for recovery probe`)
}
Unhealthy endpoints are periodically retried to detect recovery.
Success/Failure Recording
private recordSuccess(state: EndpointState, latencyMs: number): void {
state.consecutiveFailures = 0
state.healthy = true
// Update EMA
if (state.avgLatencyMs === 0) {
state.avgLatencyMs = latencyMs
} else {
state.avgLatencyMs = LATENCY_ALPHA * latencyMs +
(1 - LATENCY_ALPHA) * state.avgLatencyMs
}
this.rebalanceWeights()
}
private recordFailure(state: EndpointState, error: Error): void {
state.totalFailures++
state.consecutiveFailures++
state.lastFailureAt = Date.now()
if (state.consecutiveFailures >= UNHEALTHY_THRESHOLD) {
state.healthy = false
}
}
Failover Behavior
for (const [index, state] of orderedEndpoints.entries()) {
try {
const result = await this.sendRequest(state, messages)
return result // Success
} catch (error) {
this.recordFailure(state, error as Error)
if (index < orderedEndpoints.length - 1) {
logger.warn(`Endpoint ${state.config.name} failed, trying next`)
}
}
}
throw new Error(`All ${orderedEndpoints.length} LLM endpoints failed`)
Behavior:
- Try endpoints in selection order (healthy endpoints first, chosen by weighted random sampling)
- On failure, immediately try the next endpoint
- Continue until success or all endpoints are exhausted
- Record every failure for health tracking
Request Execution
sendRequest()
private async sendRequest(
state: EndpointState,
messages: LLMMessage[],
overrides?: Partial<LLMClientConfig>,
parentSpan?: Span
): Promise<LLMResponse>
HTTP Request:
const url = `${state.config.endpoint}/v1/chat/completions`
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...(state.config.apiKey ? { 'Authorization': `Bearer ${state.config.apiKey}` } : {})
},
body: JSON.stringify({
model,
messages,
temperature,
max_tokens: maxTokens
}),
signal: AbortSignal.timeout(timeoutMs ?? 120_000)
})
Response Validation:
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse
Validates OpenAI-compatible response shape.
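A minimal sketch of what such a guard might check; the actual validator may be stricter, and the `ChatCompletionResponse` fields shown are the usual OpenAI-compatible ones rather than the exact internal type:

```typescript
interface ChatCompletionResponse {
  choices: Array<{
    message: { role: string; content: string }
    finish_reason: string
  }>
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
}

// Structural check for the OpenAI-compatible shape: a non-null object
// with a non-empty choices array whose first entry carries a string
// message content. Illustrative only; the real guard may check more.
function isChatCompletionResponse(value: unknown): value is ChatCompletionResponse {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  if (!Array.isArray(v.choices) || v.choices.length === 0) return false
  const first = v.choices[0] as Record<string, unknown>
  if (typeof first !== 'object' || first === null) return false
  const message = first.message as Record<string, unknown> | undefined
  return typeof message === 'object' && message !== null &&
    typeof message.content === 'string'
}
```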
Readiness Probe
waitForReady()
Waits for at least one endpoint to become available.
async waitForReady(options?: {
maxWaitMs?: number // Default: 120_000 (2 min)
pollIntervalMs?: number // Default: 5_000 (5 sec)
}): Promise<void>
Behavior:
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

while (Date.now() < deadline) {
for (const endpoint of endpoints) {
try {
const res = await fetch(`${endpoint.endpoint}/v1/models`, {
signal: AbortSignal.timeout(5_000)
})
if (res.ok) {
return // Ready!
}
} catch {
// Not ready yet; try the next endpoint
}
}
await sleep(pollIntervalMs)
}
throw new Error('LLM readiness probe timed out')
Usage:
await client.waitForReady({ maxWaitMs: 60_000 })
logger.info('LLM client ready')
Statistics
getEndpointStats()
getEndpointStats(): Array<{
name: string
endpoint: string
healthy: boolean
effectiveWeight: number
avgLatencyMs: number
totalRequests: number
totalFailures: number
}>
Example Output:
[
{
name: 'primary',
endpoint: 'http://localhost:8000',
healthy: true,
effectiveWeight: 70.0,
avgLatencyMs: 250,
totalRequests: 42,
totalFailures: 0
},
{
name: 'backup',
endpoint: 'http://backup:8000',
healthy: false,
effectiveWeight: 15.0,
avgLatencyMs: 1200,
totalRequests: 8,
totalFailures: 5
}
]
totalRequests
get totalRequests(): number
Returns the total number of completion requests made.
Usage Example
import { LLMClient } from '@longshot/orchestrator'
// Multi-endpoint client
const client = new LLMClient({
endpoints: [
{ name: 'primary', endpoint: 'http://localhost:8000', weight: 70 },
{ name: 'secondary', endpoint: 'http://localhost:8001', weight: 30 }
],
model: 'glm-5',
maxTokens: 65536,
temperature: 0.7,
timeoutMs: 120_000
})
// Wait for readiness
await client.waitForReady()
// Send completion request
const response = await client.complete([
{ role: 'system', content: 'You are a task planner.' },
{ role: 'user', content: 'Break down: Implement authentication' }
])
console.log(response.content)
console.log(`Served by ${response.endpoint} in ${response.latencyMs}ms`)
// Check endpoint health
const stats = client.getEndpointStats()
for (const stat of stats) {
console.log(`${stat.name}: healthy=${stat.healthy}, latency=${stat.avgLatencyMs}ms`)
}
Constants
const LATENCY_ALPHA = 0.3 // EMA smoothing factor
const UNHEALTHY_THRESHOLD = 3 // Consecutive failures
const RECOVERY_PROBE_MS = 30_000 // 30 seconds