BAML provides multiple optimization techniques to improve your LLM applications: automated prompt optimization, manual optimization strategies, and performance tuning.
Automated Prompt Optimization
Requires BAML version 0.215.0 or higher. Prompt optimization is currently in beta.
BAML includes an automatic prompt optimizer using the GEPA (Genetic Pareto) algorithm from DSPy.
Quick Start
Optimize prompts based on your test suite:
class Person {
  name string
  age int?
}

function ExtractSubject(sentence: string) -> Person? {
  client "anthropic/claude-sonnet-4-5"
  prompt #"
    {{ ctx.output_format }}

    Extract the subject from:
    {{ sentence }}
  "#
}

test EasyTest {
  functions [ExtractSubject]
  args {
    sentence "Ellie, who is 4, ran to Kalina's house to play."
  }
  @@assert({{ this.name == "Ellie" }})
  @@assert({{ this.age == 4 }})
}

test HardTest {
  functions [ExtractSubject]
  args {
    sentence "Meg gave Pam a dog for her 30th birthday. She was 21."
  }
  @@assert({{ this.name == "Meg" }})
  @@assert({{ this.age == 21 }})
}
Run optimization:
pnpm exec baml-cli optimize --beta
Optimization uses Anthropic’s Claude models by default. Ensure ANTHROPIC_API_KEY is set in your environment.
Optimization Interface
The optimizer displays a TUI showing:
Left panel: Candidate prompts with accuracy scores
Right panel: Selected candidate details, including:
Accuracy score
Rationale for changes
Improved prompt text
Navigate with arrow keys, press Enter to select a candidate and update your BAML file.
Multi-Objective Optimization
Optimize for multiple metrics simultaneously:
# Balance accuracy vs. token usage
baml-cli optimize --beta --weights accuracy=0.2,prompt_tokens=0.8
Built-in metrics:
accuracy: Test pass rate
prompt_tokens: Input token count
completion_tokens: Output token count
tokens: Total tokens (input + output)
latency: Response time
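Conceptually, a weighted optimization combines the per-metric scores into a single number. The sketch below illustrates that idea only; it is not BAML's internal GEPA scoring, and it assumes each metric has already been normalized to [0, 1] with higher meaning better (so fewer tokens maps to a higher score):

```python
# Illustrative only: how weights can trade accuracy against token usage.
# Assumes metrics are pre-normalized to [0, 1], higher = better.

def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized metric scores."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

# Candidate A: very accurate but verbose. Candidate B: cheaper, slightly less accurate.
weights = {"accuracy": 0.2, "prompt_tokens": 0.8}
a = weighted_score({"accuracy": 0.95, "prompt_tokens": 0.40}, weights)
b = weighted_score({"accuracy": 0.85, "prompt_tokens": 0.90}, weights)
# With token-heavy weights, the cheaper candidate B scores higher.
```

This is why the weight split matters: with `accuracy=0.2,prompt_tokens=0.8`, a small accuracy loss is acceptable if it buys a large reduction in prompt size.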
Custom Optimization Metrics
Define custom metrics using named checks:
test NoHallucination {
  functions [ExtractSubject]
  args {
    sentence "Bill spoke to the wall-facer"
  }
  @@check(no_hallucination, {{ this.age == null }})
}
baml-cli optimize --beta --weights accuracy=0.1,no_hallucination=0.9
Optimization Controls
# Limit number of candidate prompts
baml-cli optimize --beta --trials 10
# Limit total test evaluations
baml-cli optimize --beta --max-evals 50
# Optimize specific function
baml-cli optimize --beta --function ExtractSubject
# Filter to specific tests
baml-cli optimize --beta --test "*::HardTest"
# Resume previous optimization run
baml-cli optimize --beta --resume .baml_optimize/run_20251208_150606
Customizing the Algorithm
Optimization prompts are defined in .baml_optimize/gepa/baml_src/gepa.baml. Initialize them:
# Create gepa.baml for customization
baml-cli optimize --beta --reset-gepa-prompts
# Edit .baml_optimize/gepa/baml_src/gepa.baml
# Run with customizations
baml-cli optimize --beta
Safe modifications:
Change client field to use different models (e.g., anthropic/claude-opus-4-5)
Add text to prompts in ProposeImprovements, MergeVariants, or AnalyzeFailurePatterns
Do not modify class definitions or remove existing prompt text; the internal implementation depends on the current structure.
Manual Optimization Strategies
Reduce Prompt Tokens
Minimize prompt size while maintaining clarity:
// Before: Verbose schema descriptions
class User {
  name string @description("The full legal name of the user including first, middle, and last names")
  email string @description("The user's primary email address used for account communications")
  age int @description("The user's age in years as of their last birthday")
}

// After: Concise descriptions
class User {
  name string @description("Full name")
  email string @description("Primary email")
  age int @description("Age in years")
}

// Or use aliases to shorten field names in prompts
class User {
  name string @alias("n")
  email string @alias("e")
  age int @alias("a")
}
Optimize Type Definitions
Use the most specific types possible:
// Before: Generic string
class Product {
  category string // Could be anything
}

// After: Constrained enum
enum ProductCategory {
  Electronics
  Clothing
  Books
  HomeGoods
}

class Product {
  category ProductCategory // LLM has clear options
}
Leverage Template Strings
Reuse common prompt patterns:
template_string ExtractionInstructions() #"
  Extract information carefully:
  - Only extract explicitly stated information
  - Use null for missing data
  - Maintain exact spelling and formatting
"#

function ExtractUser(text: string) -> User {
  client GPT4
  prompt #"
    {{ ExtractionInstructions() }}

    Text: {{ text }}

    {{ ctx.output_format }}
  "#
}

function ExtractProduct(text: string) -> Product {
  client GPT4
  prompt #"
    {{ ExtractionInstructions() }}

    Text: {{ text }}

    {{ ctx.output_format }}
  "#
}
Use Streaming for Long Outputs
Stream responses for better perceived performance:
from baml_client import b

async for chunk in b.stream.GenerateLongResponse(prompt):
    # Process chunks as they arrive
    print(chunk, end="", flush=True)
Batch Similar Requests
Process multiple items in a single request:
class Email {
  subject string
  body string
}

class EmailClassification {
  category string
  priority string
  needs_response bool
}

// Instead of calling once per email:
function ClassifyEmail(email: Email) -> EmailClassification {
  client GPT4
  prompt #"..."#
}

// Batch multiple emails:
function ClassifyEmails(emails: Email[]) -> EmailClassification[] {
  client GPT4
  prompt #"
    Classify each of these emails:
    {% for email in emails %}
    Email {{ loop.index }}:
    Subject: {{ email.subject }}
    Body: {{ email.body }}
    {% endfor %}

    {{ ctx.output_format }}
  "#
}
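The savings come from amortizing the fixed per-request content (instructions and output schema) across many items. A back-of-the-envelope sketch, with purely illustrative numbers (120 instruction tokens per request, batches of 20):

```python
import math

def request_count(n_emails: int, batch_size: int) -> int:
    """How many LLM calls are needed to classify n_emails."""
    return math.ceil(n_emails / batch_size)

def instruction_tokens(n_requests: int, tokens_per_request: int = 120) -> int:
    """Instructions and schema are re-sent on every request."""
    return n_requests * tokens_per_request

n = 100
one_by_one = instruction_tokens(request_count(n, 1))   # 100 requests
batched = instruction_tokens(request_count(n, 20))     # 5 requests
# Batching 20 at a time cuts the repeated-instruction overhead 20x here.
```

Batch size is a trade-off: larger batches save more overhead but increase the chance the model mishandles individual items, so measure accuracy per batch size with your tests.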
Cost Optimization
Model Selection Strategy
Use different models based on task complexity:
from baml_client import b
from baml_py import ClientRegistry

async def smart_extraction(text: str, complexity: str):
    cr = ClientRegistry()
    if complexity == "simple":
        # Use cheaper model for simple tasks
        cr.add_llm_client('FastModel', 'openai', {'model': 'gpt-4o-mini'})
    else:
        # Use powerful model for complex tasks
        cr.add_llm_client('PowerfulModel', 'openai', {'model': 'gpt-4o'})
    cr.set_primary('FastModel' if complexity == 'simple' else 'PowerfulModel')
    return await b.ExtractData(text, {'client_registry': cr})
Implement Prompt Caching
Cache large, repeated content:
function AnalyzeWithCache(docs: string, query: string) -> string {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Reference documents:
    {{ docs }}

    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Query: {{ query }}
  "#
}
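To see why caching large shared context pays off, here is a rough cost sketch. The multipliers assumed below (cache write at ~1.25x base input price, cache read at ~0.1x) follow Anthropic's published ephemeral-cache pricing at the time of writing; verify current pricing before relying on them:

```python
def cost_without_cache(doc_tokens: int, n_queries: int, price_per_token: float) -> float:
    # The full document is billed as fresh input on every query.
    return doc_tokens * n_queries * price_per_token

def cost_with_cache(doc_tokens: int, n_queries: int, price_per_token: float,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    # One cache write, then cheap cache reads for the remaining queries.
    write = doc_tokens * price_per_token * write_mult
    reads = doc_tokens * price_per_token * read_mult * (n_queries - 1)
    return write + reads

# 50k-token document queried 20 times at $3 per 1M input tokens (illustrative).
p = 3.00 / 1_000_000
no_cache = cost_without_cache(50_000, 20, p)
cached = cost_with_cache(50_000, 20, p)
# Repeated queries over the same document become several times cheaper.
```

The break-even point is low: because a write costs only ~25% more than a normal request, caching pays for itself by roughly the second query over the same document.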
See Prompt Caching for details.
Monitor and Optimize Usage
Track costs across your application:
from baml_client import b
from baml_py import Collector

class CostTracker:
    def __init__(self):
        self.collector = Collector(name="cost-tracking")
        # Model pricing per 1M tokens
        self.pricing = {
            'gpt-4o': {'input': 2.50, 'output': 10.00},
            'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
            'claude-opus-4-5': {'input': 3.00, 'output': 15.00},
        }

    async def call_with_tracking(self, func, *args, **kwargs):
        result = await func(*args, **kwargs, baml_options={'collector': self.collector})
        self.log_costs()
        return result

    def log_costs(self):
        total_cost = 0
        for log in self.collector.logs:
            for call in log.calls:
                model = call.http_request.body.json().get('model', '')
                if model in self.pricing and call.usage:
                    input_cost = (call.usage.input_tokens / 1_000_000) * self.pricing[model]['input']
                    output_cost = (call.usage.output_tokens / 1_000_000) * self.pricing[model]['output']
                    total_cost += input_cost + output_cost
        print(f"Total cost: ${total_cost:.4f}")
Parallel Execution
Run independent function calls in parallel:
import asyncio
from baml_client import b

async def process_documents(docs: list[str]):
    # Process all documents in parallel
    tasks = [b.ExtractData(doc) for doc in docs]
    results = await asyncio.gather(*tasks)
    return results
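The same pattern can be demonstrated without an API key by substituting stub coroutines for the BAML calls (`fake_extract` below is a stand-in, not a real BAML function). It shows the two properties that matter: `asyncio.gather` preserves input order, and the waits overlap rather than add up:

```python
import asyncio
import time

async def fake_extract(doc: str) -> str:
    # Stand-in for an LLM call; each "request" takes ~0.1s.
    await asyncio.sleep(0.1)
    return doc.upper()

async def main() -> tuple[list[str], float]:
    docs = ["alpha", "beta", "gamma"]
    start = time.perf_counter()
    # All three "requests" run concurrently.
    results = await asyncio.gather(*(fake_extract(d) for d in docs))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# Three 0.1s waits complete in roughly 0.1s total, in input order.
```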
Implement Timeouts
Prevent slow requests from blocking:
import asyncio
from baml_client import b

async def extract_with_timeout(text: str, timeout_seconds: int = 30):
    try:
        result = await asyncio.wait_for(
            b.ExtractData(text),
            timeout=timeout_seconds
        )
        return result
    except asyncio.TimeoutError:
        print(f"Request timed out after {timeout_seconds}s")
        return None
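The same timeout pattern, exercised against a deliberately slow stub so it runs standalone (`slow_extract` is a stand-in for a hung LLM request):

```python
import asyncio

async def slow_extract(text: str) -> str:
    await asyncio.sleep(5)  # simulates a request that never comes back in time
    return text

async def main():
    try:
        return await asyncio.wait_for(slow_extract("hello"), timeout=0.05)
    except asyncio.TimeoutError:
        return None  # caller gets a clean None instead of blocking for 5s

result = asyncio.run(main())
# result is None: asyncio.wait_for cancelled the task after 0.05s.
```

Note that `asyncio.wait_for` cancels the underlying task on timeout, so the slow request does not keep running in the background.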
Testing and Validation
Comprehensive Test Coverage
Write tests for edge cases to drive optimization:
test EmptyInput {
  functions [ExtractUser]
  args { text "" }
  @@assert({{ this == null }})
}

test PartialData {
  functions [ExtractUser]
  args { text "Name: John" }
  @@assert({{ this.name == "John" }})
  @@assert({{ this.age == null }})
}

test MalformedInput {
  functions [ExtractUser]
  args { text "asdf;lkj32984u23" }
  @@assert({{ this == null }})
}

test UnicodeCharacters {
  functions [ExtractUser]
  args { text "Name: José García, Age: 25" }
  @@assert({{ this.name == "José García" }})
}
Best Practices
Start with Tests: Write comprehensive tests before optimizing
Measure First: Use Collectors to establish baseline performance
Optimize Iteratively: Make one change at a time and measure impact
Balance Trade-offs: Weigh accuracy against cost and latency
Use Appropriate Models: Match model capability to task complexity
Cache Aggressively: Cache large, repeated content to reduce costs
Monitor Production: Track real-world usage patterns and costs
Version Prompts: Keep prompt history to enable rollbacks
Limitations
Automated Optimization
Cannot modify type structures (only descriptions and aliases)
Doesn’t discover template_strings in your codebase
Treats all errors equally (network issues vs. prompt issues)
Limited to single-function optimization (no workflow optimization yet)