
Retry Policies and Fallbacks

BAML provides robust mechanisms for handling failures: retry policies for transient errors, fallback clients for resilience, and timeouts for preventing hangs.

Retry Policies

Retry policies automatically retry requests that fail due to network errors or transient issues.

Basic Retry Policy

retry_policy MyRetryPolicy {
  max_retries 3
}

client<llm> ResilientClient {
  provider openai
  retry_policy MyRetryPolicy
  options {
    model "gpt-4"
    api_key env.OPENAI_API_KEY
  }
}
This will retry up to 3 additional times after the initial request fails (4 total attempts).
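The attempt accounting can be sketched in a few lines of Python (the retry loop runs inside BAML itself; `always_fails` is a hypothetical stand-in for a failing request):

```python
def run_with_retries(fn, max_retries):
    """Call fn, retrying up to max_retries additional times on failure."""
    last_error = None
    for _ in range(1 + max_retries):  # 1 initial attempt + max_retries retries
        try:
            return fn()
        except Exception as e:
            last_error = e
    raise last_error

calls = []

def always_fails():
    calls.append(1)
    raise RuntimeError("transient network error")

try:
    run_with_retries(always_fails, max_retries=3)
except RuntimeError:
    pass

print(len(calls))  # 4 total attempts
```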

Retry Strategies

Constant Delay

Wait a fixed amount of time between retries:
retry_policy ConstantRetry {
  max_retries 3
  strategy {
    type constant_delay
    delay_ms 200  // Wait 200ms between retries
  }
}

Exponential Backoff

Increase delay between retries exponentially:
retry_policy ExponentialRetry {
  max_retries 5
  strategy {
    type exponential_backoff
    delay_ms 200         // Start with 200ms
    multiplier 1.5       // Multiply delay by 1.5 each time
    max_delay_ms 10000   // Cap at 10 seconds
  }
}
Delay sequence: 200ms → 300ms → 450ms → 675ms → 1012.5ms
Exponential backoff is recommended for production as it gives services time to recover while avoiding thundering herd problems.
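The delay sequence above follows directly from the formula delay_ms × multiplier^n, capped at max_delay_ms. A quick sketch to reproduce it:

```python
def backoff_delays(delay_ms, multiplier, max_delay_ms, max_retries):
    """Delay before retry n is delay_ms * multiplier**n, capped at max_delay_ms."""
    return [min(delay_ms * multiplier**n, max_delay_ms) for n in range(max_retries)]

print(backoff_delays(200, 1.5, 10_000, 5))  # [200.0, 300.0, 450.0, 675.0, 1012.5]
```

With a lower cap the later delays flatten out, which is what prevents unbounded waits on long retry chains.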

Fallback Clients

Fallback clients try multiple LLM providers in sequence until one succeeds:
client<llm> PrimaryClient {
  provider openai
  options {
    model "gpt-5-mini"
    api_key env.OPENAI_API_KEY
  }
}

client<llm> SecondaryClient {
  provider anthropic
  options {
    model "claude-sonnet-4-20250514"
    api_key env.ANTHROPIC_API_KEY
  }
}

client<llm> TertiaryClient {
  provider google-ai
  options {
    model "gemini-2.0-flash-exp"
    api_key env.GOOGLE_API_KEY
  }
}

client<llm> FallbackClient {
  provider fallback
  options {
    strategy [
      PrimaryClient,
      SecondaryClient,
      TertiaryClient
    ]
  }
}

function ExtractData(input: string) -> DataSchema {
  client FallbackClient
  prompt #"
    Extract information from: {{ input }}
    {{ ctx.output_format }}
  "#
}

How Fallbacks Work

  1. Try Primary Client: BAML attempts to call the first client in the strategy list.
  2. Handle Failure: If that client fails (network error, timeout, or validation error), BAML moves to the next client.
  3. Continue Chain: BAML continues down the list until a client succeeds or all clients fail.
  4. Report Error: If all clients fail, BAML raises an error with the last client’s error type; detailed_message contains the complete failure history.
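The chain reduces to a simple loop. A sketch with hypothetical client stubs (BAML implements this natively; the stubs only illustrate the control flow):

```python
def call_with_fallback(clients, request):
    """Try each client in order; return the first success, else raise with history."""
    errors = []
    for client in clients:
        try:
            return client(request)
        except Exception as e:
            errors.append(f"{getattr(client, '__name__', client)}: {e}")
    # All clients failed: surface the last error plus the full history
    raise RuntimeError(f"all clients failed; history: {errors}")

def primary(req):
    raise ConnectionError("network error")

def secondary(req):
    return "ok from secondary"

print(call_with_fallback([primary, secondary], "extract this"))  # ok from secondary
```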

Combining Retries and Fallbacks

You can add retry policies to fallback clients:
retry_policy AggressiveRetry {
  max_retries 2
  strategy {
    type exponential_backoff
  }
}

client<llm> FallbackWithRetries {
  provider fallback
  retry_policy AggressiveRetry  // Retry the entire fallback chain
  options {
    strategy [
      PrimaryClient,
      SecondaryClient
    ]
  }
}
This will:
  1. Try PrimaryClient
  2. If it fails, try SecondaryClient
  3. If both fail, retry the entire sequence up to 2 more times
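Because the retry policy wraps the whole chain, the worst-case attempt order repeats the full sequence. A sketch (client names here are just labels):

```python
def attempt_order(chain, max_retries):
    """Worst-case attempt order when a retry policy wraps a fallback chain:
    the full chain runs once, then is retried up to max_retries more times."""
    return chain * (1 + max_retries)

order = attempt_order(["PrimaryClient", "SecondaryClient"], max_retries=2)
print(len(order))  # 6 attempts: the two-client sequence, three passes
```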

Nested Fallbacks

Create complex fallback chains:
client<llm> OpenAIFallback {
  provider fallback
  options {
    strategy [
      "openai/gpt-5-mini",
      "openai/gpt-4o"
    ]
  }
}

client<llm> AnthropicFallback {
  provider fallback
  options {
    strategy [
      "anthropic/claude-sonnet-4-20250514",
      "anthropic/claude-opus-4-1-20250805"
    ]
  }
}

client<llm> UltraResilientClient {
  provider fallback
  options {
    strategy [
      OpenAIFallback,
      AnthropicFallback,
      "google-ai/gemini-2.0-flash-exp"
    ]
  }
}
This tries: gpt-5-mini → gpt-4o → claude-sonnet → claude-opus → gemini
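The flattened try order can be sketched as below. This is a simplification that shows ordering only, not how errors from an inner fallback client are reported to the outer one:

```python
def flatten_strategy(strategy, registry):
    """Resolve a fallback strategy into a flat try order.

    registry maps named fallback clients to their own strategy lists;
    plain "provider/model" strings are leaves.
    """
    order = []
    for entry in strategy:
        if entry in registry:  # nested fallback client
            order.extend(flatten_strategy(registry[entry], registry))
        else:                  # concrete model
            order.append(entry)
    return order

registry = {
    "OpenAIFallback": ["openai/gpt-5-mini", "openai/gpt-4o"],
    "AnthropicFallback": [
        "anthropic/claude-sonnet-4-20250514",
        "anthropic/claude-opus-4-1-20250805",
    ],
}

# Five models, in the order listed above
print(flatten_strategy(
    ["OpenAIFallback", "AnthropicFallback", "google-ai/gemini-2.0-flash-exp"],
    registry,
))
```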

Timeouts

Timeouts prevent requests from hanging indefinitely.

Timeout Types

BAML supports four types of timeouts:
client<llm> TimedClient {
  provider openai
  options {
    model "gpt-4"
    api_key env.OPENAI_API_KEY
    
    http {
      connect_timeout_ms 5000              // Time to establish connection
      time_to_first_token_timeout_ms 10000 // Time until first token
      idle_timeout_ms 15000                // Time between chunks
      request_timeout_ms 60000             // Total request time
    }
  }
}

connect_timeout_ms

Maximum time to establish a connection to the LLM provider. Use case: Detect unreachable endpoints quickly.
http {
  connect_timeout_ms 3000  // Fail if can't connect within 3s
}

time_to_first_token_timeout_ms

Maximum time to receive the first token after sending the request. Use case: Detect when the provider accepts your request but takes too long to start generating.
http {
  time_to_first_token_timeout_ms 10000  // First token within 10s
}
Especially useful for streaming responses where you want the LLM to start responding quickly.

idle_timeout_ms

Maximum time between receiving data chunks during streaming. Use case: Detect stalled connections where the provider stops sending data mid-response.
http {
  idle_timeout_ms 15000  // No more than 15s between chunks
}

request_timeout_ms

Maximum total time for the entire request-response cycle. Use case: Ensure requests complete within your application’s latency requirements.
http {
  request_timeout_ms 60000  // Complete within 60s total
}
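In application code you can also impose an outer deadline of your own on top of BAML's configured timeouts, for example with asyncio. This is a generic asyncio pattern, not a BAML API:

```python
import asyncio

async def slow_call():
    # Stand-in for a long-running LLM request
    await asyncio.sleep(10)
    return "done"

async def main():
    try:
        # Equivalent in spirit to request_timeout_ms: a total deadline
        return await asyncio.wait_for(slow_call(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # timed out
```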

Timeouts with Retries

Each retry attempt gets the full timeout duration:
retry_policy Aggressive {
  max_retries 3
  strategy {
    type exponential_backoff
  }
}

client<llm> MyClient {
  provider openai
  retry_policy Aggressive
  options {
    model "gpt-4"
    api_key env.OPENAI_API_KEY
    http {
      request_timeout_ms 30000  // 30s per attempt
    }
  }
}
Total potential time: 4 attempts × 30s + retry delays ≈ 2+ minutes
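That worst-case figure is easy to verify with a little arithmetic. Assuming the backoff values from the earlier example (delay_ms 200, multiplier 1.5, max_delay_ms 10000; the policy above omits them, so these are assumptions):

```python
def worst_case_seconds(attempts, per_attempt_timeout_s,
                       delay_ms, multiplier, max_delay_ms):
    """Worst case: every attempt runs to its full timeout, plus backoff delays."""
    delays_ms = [min(delay_ms * multiplier**n, max_delay_ms)
                 for n in range(attempts - 1)]
    return attempts * per_attempt_timeout_s + sum(delays_ms) / 1000

print(worst_case_seconds(4, 30, 200, 1.5, 10_000))  # 120.95, roughly 2 minutes
```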

Handling Timeout Errors

from baml_client import b
from baml_py.errors import BamlTimeoutError, BamlClientError

try:
    result = await b.ExtractData(input)
except BamlTimeoutError as e:
    print(f"Request timed out: {e.message}")
    print(f"Timeout type: {e.timeout_type}")
    print(f"Configured: {e.configured_value_ms}ms")
    print(f"Elapsed: {e.elapsed_ms}ms")
except BamlClientError as e:
    print(f"Client error: {e.message}")
Recommended Timeouts

For most applications:
client<llm> ProductionClient {
  provider openai
  options {
    model "gpt-4"
    api_key env.OPENAI_API_KEY
    http {
      connect_timeout_ms 10000                // 10s to connect
      time_to_first_token_timeout_ms 30000    // 30s to first token
      idle_timeout_ms 2000                    // 2s between chunks
      request_timeout_ms 300000               // 5 minutes total
    }
  }
}
For faster models:
client<llm> FastModel {
  provider openai
  options {
    model "gpt-5-mini"
    api_key env.OPENAI_API_KEY
    http {
      connect_timeout_ms 5000
      time_to_first_token_timeout_ms 10000
      idle_timeout_ms 2000
      request_timeout_ms 30000  // Mini is fast
    }
  }
}

Production Patterns

Pattern 1: Fast with Fallback

Try fast/cheap model first, fall back to capable/expensive:
client<llm> ProductionClient {
  provider fallback
  options {
    strategy [
      "openai/gpt-5-mini",     // Fast and cheap
      "openai/gpt-4o",          // More capable
      "anthropic/claude-opus-4-1-20250805"  // Most capable
    ]
    http {
      request_timeout_ms 30000  // Aggressive timeout for fast failover
    }
  }
}

Pattern 2: Provider Diversity

Distribute across providers for maximum reliability:
retry_policy QuickRetry {
  max_retries 1
  strategy {
    type constant_delay
    delay_ms 100
  }
}

client<llm> DiverseClient {
  provider fallback
  retry_policy QuickRetry
  options {
    strategy [
      "openai/gpt-4o",
      "anthropic/claude-sonnet-4-20250514",
      "google-ai/gemini-2.0-flash-exp"
    ]
  }
}

Pattern 3: Graceful Degradation

Handle failures gracefully in application code:
from baml_client import b
from baml_py.errors import BamlError
import logging

logger = logging.getLogger(__name__)

async def extract_with_fallback(input: str):
    try:
        # Try primary extraction
        return await b.ExtractData(input)
    except BamlError as e:
        logger.warning(f"Primary extraction failed: {e}")
        
        try:
            # Try simpler extraction
            return await b.ExtractDataSimple(input)
        except BamlError as e2:
            logger.error(f"All extraction methods failed: {e2}")
            # Return safe defaults
            return {
                "status": "error",
                "data": None,
                "error": str(e)
            }

Pattern 4: Monitoring

Track fallback usage to optimize your strategy:
from baml_client import b
from baml_py.errors import BamlError
import logging

logger = logging.getLogger(__name__)

async def monitored_extract(input: str):
    try:
        result = await b.ExtractData(input)
        logger.info("Primary client succeeded")
        return result
    except BamlError as e:
        # Check detailed_message to see which clients were tried
        if "FallbackClient" in str(type(e)):
            logger.warning(
                "Fallback was used",
                extra={"error_chain": e.detailed_message}
            )
        raise

Best Practices

  1. Start Conservative: Begin with generous timeouts and few retries, then tighten based on real-world data.
  2. Monitor Fallback Rates: If fallbacks trigger frequently, investigate why the primary clients fail.
  3. Use Different Models: Fall back to different model architectures (OpenAI → Anthropic → Google) for true resilience.
  4. Balance Cost and Reliability: Order your fallback strategy by cost, trying cheap models first and expensive ones as fallbacks.
  5. Test Failure Scenarios: Simulate failures to verify that your retry/fallback logic works correctly.

Timeouts vs Abort Controllers

  • Timeouts: Automatic, configuration-based time limits
  • Abort Controllers: Manual, user-initiated cancellation
Use both together:
const controller = new AbortController()

// User clicks cancel
button.onclick = () => controller.abort()

try {
  const result = await b.ExtractData(input, {
    abortController: controller
    // Client still has configured timeouts
  })
} catch (e) {
  if (e instanceof BamlAbortError) {
    console.log('User cancelled')
  } else if (e instanceof BamlTimeoutError) {
    console.log('Request timed out')
  }
}
