BAML provides multiple optimization techniques to improve your LLM applications: automated prompt optimization, manual optimization strategies, and performance tuning.
Automated Prompt Optimization
Requires BAML version 0.215.0 or higher. Prompt optimization is currently in beta.
BAML includes an automatic prompt optimizer using the GEPA (Genetic Pareto) algorithm from DSPy.
Quick Start
Optimize prompts based on your test suite:
class Person {
  name string
  age int?
}

function ExtractSubject(sentence: string) -> Person? {
  client "anthropic/claude-sonnet-4-5"
  prompt #"
    {{ ctx.output_format }}

    Extract the subject from:
    {{ sentence }}
  "#
}

test EasyTest {
  functions [ExtractSubject]
  args {
    sentence "Ellie, who is 4, ran to Kalina's house to play."
  }
  @@assert({{ this.name == "Ellie" }})
  @@assert({{ this.age == 4 }})
}

test HardTest {
  functions [ExtractSubject]
  args {
    sentence "Meg gave Pam a dog for her 30th birthday. She was 21."
  }
  @@assert({{ this.name == "Meg" }})
  @@assert({{ this.age == 21 }})
}
Run optimization:
pnpm exec baml-cli optimize --beta
Optimization uses Anthropic’s Claude models by default. Ensure ANTHROPIC_API_KEY is set in your environment.
Optimization Interface
The optimizer displays a TUI showing:
Left panel: Candidate prompts with accuracy scores
Right panel: Selected candidate details, including:
Accuracy score
Rationale for changes
Improved prompt text
Navigate with arrow keys, press Enter to select a candidate and update your BAML file.
Multi-Objective Optimization
Optimize for multiple metrics simultaneously:
# Balance accuracy vs. token usage
baml-cli optimize --beta --weights accuracy=0.2,prompt_tokens=0.8
Built-in metrics:
accuracy: Test pass rate
prompt_tokens: Input token count
completion_tokens: Output token count
tokens: Total tokens (input + output)
latency: Response time
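Conceptually, a weighted optimization combines the per-metric scores into a single number. The sketch below illustrates that idea only; it is not BAML's internal GEPA scoring, and it assumes each metric has already been normalized to [0, 1] with higher meaning better (so fewer tokens maps to a higher score):

```python
# Illustrative only: how weights can trade accuracy against token usage.
# Assumes metrics are pre-normalized to [0, 1], higher = better.

def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized metric scores."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

# Candidate A: very accurate but verbose. Candidate B: cheaper, slightly less accurate.
weights = {"accuracy": 0.2, "prompt_tokens": 0.8}
a = weighted_score({"accuracy": 0.95, "prompt_tokens": 0.40}, weights)
b = weighted_score({"accuracy": 0.85, "prompt_tokens": 0.90}, weights)
# With token-heavy weights, the cheaper candidate B scores higher.
```

This is why the weight split matters: with `accuracy=0.2,prompt_tokens=0.8`, a small accuracy loss is acceptable if it buys a large reduction in prompt size.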
Custom Optimization Metrics
Define custom metrics using named checks:
test NoHallucination {
  functions [ExtractSubject]
  args {
    sentence "Bill spoke to the wall-facer"
  }
  @@check(no_hallucination, {{ this.age == null }})
}
baml-cli optimize --beta --weights accuracy=0.1,no_hallucination=0.9
Optimization Controls
# Limit number of candidate prompts
baml-cli optimize --beta --trials 10
# Limit total test evaluations
baml-cli optimize --beta --max-evals 50
# Optimize specific function
baml-cli optimize --beta --function ExtractSubject
# Filter to specific tests
baml-cli optimize --beta --test "*::HardTest"
# Resume previous optimization run
baml-cli optimize --beta --resume .baml_optimize/run_20251208_150606
Customizing the Algorithm
Optimization prompts are defined in .baml_optimize/gepa/baml_src/gepa.baml. Initialize them:
# Create gepa.baml for customization
baml-cli optimize --beta --reset-gepa-prompts
# Edit .baml_optimize/gepa/baml_src/gepa.baml
# Run with customizations
baml-cli optimize --beta
Safe modifications:
Change client field to use different models (e.g., anthropic/claude-opus-4-5)
Add text to prompts in ProposeImprovements, MergeVariants, or AnalyzeFailurePatterns
Do not modify class definitions or remove existing prompt text; the internal implementation depends on the current structure.
Manual Optimization Strategies
Reduce Prompt Tokens
Minimize prompt size while maintaining clarity:
// Before: Verbose schema descriptions
class User {
  name string @description("The full legal name of the user including first, middle, and last names")
  email string @description("The user's primary email address used for account communications")
  age int @description("The user's age in years as of their last birthday")
}

// After: Concise descriptions
class User {
  name string @description("Full name")
  email string @description("Primary email")
  age int @description("Age in years")
}

// Or use aliases to shorten field names in prompts
class User {
  name string @alias("n")
  email string @alias("e")
  age int @alias("a")
}
Optimize Type Definitions
Use the most specific types possible:
// Before: Generic string
class Product {
  category string // Could be anything
}

// After: Constrained enum
enum ProductCategory {
  Electronics
  Clothing
  Books
  HomeGoods
}

class Product {
  category ProductCategory // LLM has clear options
}
Leverage Template Strings
Reuse common prompt patterns:
template_string ExtractionInstructions() #"
  Extract information carefully:
  - Only extract explicitly stated information
  - Use null for missing data
  - Maintain exact spelling and formatting
"#

function ExtractUser(text: string) -> User {
  client GPT4
  prompt #"
    {{ ExtractionInstructions() }}

    Text: {{ text }}

    {{ ctx.output_format }}
  "#
}

function ExtractProduct(text: string) -> Product {
  client GPT4
  prompt #"
    {{ ExtractionInstructions() }}

    Text: {{ text }}

    {{ ctx.output_format }}
  "#
}
Use Streaming for Long Outputs
Stream responses for better perceived performance:
from baml_client import b

async for chunk in b.stream.GenerateLongResponse(prompt):
    # Process chunks as they arrive
    print(chunk, end="", flush=True)
Batch Similar Requests
Process multiple items in a single request:
class Email {
  subject string
  body string
}

class EmailClassification {
  category string
  priority string
  needs_response bool
}

// Instead of calling once per email:
function ClassifyEmail(email: Email) -> EmailClassification {
  client GPT4
  prompt #"..."#
}

// Batch multiple emails:
function ClassifyEmails(emails: Email[]) -> EmailClassification[] {
  client GPT4
  prompt #"
    Classify each of these emails:
    {% for email in emails %}
    Email {{ loop.index }}:
    Subject: {{ email.subject }}
    Body: {{ email.body }}
    {% endfor %}

    {{ ctx.output_format }}
  "#
}
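The savings come from amortizing the fixed per-request content (instructions and output schema) across many items. A back-of-the-envelope sketch, with purely illustrative numbers (120 instruction tokens per request, batches of 20):

```python
import math

def request_count(n_emails: int, batch_size: int) -> int:
    """How many LLM calls are needed to classify n_emails."""
    return math.ceil(n_emails / batch_size)

def instruction_tokens(n_requests: int, tokens_per_request: int = 120) -> int:
    """Instructions and schema are re-sent on every request."""
    return n_requests * tokens_per_request

n = 100
one_by_one = instruction_tokens(request_count(n, 1))   # 100 requests
batched = instruction_tokens(request_count(n, 20))     # 5 requests
# Batching 20 at a time cuts the repeated-instruction overhead 20x here.
```

Batch size is a trade-off: larger batches save more overhead but increase the chance the model mishandles individual items, so measure accuracy per batch size with your tests.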
Cost Optimization
Model Selection Strategy
Use different models based on task complexity:
from baml_client import b
from baml_py import ClientRegistry

async def smart_extraction(text: str, complexity: str):
    cr = ClientRegistry()
    if complexity == "simple":
        # Use cheaper model for simple tasks
        cr.add_llm_client('FastModel', 'openai', {'model': 'gpt-4o-mini'})
    else:
        # Use powerful model for complex tasks
        cr.add_llm_client('PowerfulModel', 'openai', {'model': 'gpt-4o'})
    cr.set_primary('FastModel' if complexity == 'simple' else 'PowerfulModel')
    return await b.ExtractData(text, {'client_registry': cr})
Implement Prompt Caching
Cache large, repeated content:
function AnalyzeWithCache(docs: string, query: string) -> string {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Reference documents:
    {{ docs }}

    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Query: {{ query }}
  "#
}
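To see why caching large shared context pays off, here is a rough cost sketch. The multipliers assumed below (cache write at ~1.25x base input price, cache read at ~0.1x) follow Anthropic's published ephemeral-cache pricing at the time of writing; verify current pricing before relying on them:

```python
def cost_without_cache(doc_tokens: int, n_queries: int, price_per_token: float) -> float:
    # The full document is billed as fresh input on every query.
    return doc_tokens * n_queries * price_per_token

def cost_with_cache(doc_tokens: int, n_queries: int, price_per_token: float,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    # One cache write, then cheap cache reads for the remaining queries.
    write = doc_tokens * price_per_token * write_mult
    reads = doc_tokens * price_per_token * read_mult * (n_queries - 1)
    return write + reads

# 50k-token document queried 20 times at $3 per 1M input tokens (illustrative).
p = 3.00 / 1_000_000
no_cache = cost_without_cache(50_000, 20, p)
cached = cost_with_cache(50_000, 20, p)
# Repeated queries over the same document become several times cheaper.
```

The break-even point is low: because a write costs only ~25% more than a normal request, caching pays for itself by roughly the second query over the same document.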
See Prompt Caching for details.
Monitor and Optimize Usage
Track costs across your application:
from baml_client import b
from baml_py import Collector

class CostTracker:
    def __init__(self):
        self.collector = Collector(name="cost-tracking")
        # Model pricing per 1M tokens
        self.pricing = {
            'gpt-4o': {'input': 2.50, 'output': 10.00},
            'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
            'claude-opus-4-5': {'input': 3.00, 'output': 15.00},
        }

    async def call_with_tracking(self, func, *args, **kwargs):
        result = await func(*args, **kwargs, baml_options={'collector': self.collector})
        self.log_costs()
        return result

    def log_costs(self):
        total_cost = 0
        for log in self.collector.logs:
            for call in log.calls:
                model = call.http_request.body.json().get('model', '')
                if model in self.pricing and call.usage:
                    input_cost = (call.usage.input_tokens / 1_000_000) * self.pricing[model]['input']
                    output_cost = (call.usage.output_tokens / 1_000_000) * self.pricing[model]['output']
                    total_cost += input_cost + output_cost
        print(f"Total cost: ${total_cost:.4f}")
Parallel Execution
Run independent function calls in parallel:
import asyncio
from baml_client import b

async def process_documents(docs: list[str]):
    # Process all documents in parallel
    tasks = [b.ExtractData(doc) for doc in docs]
    results = await asyncio.gather(*tasks)
    return results
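The same pattern can be demonstrated without an API key by substituting stub coroutines for the BAML calls (`fake_extract` below is a stand-in, not a real BAML function). It shows the two properties that matter: `asyncio.gather` preserves input order, and the waits overlap rather than add up:

```python
import asyncio
import time

async def fake_extract(doc: str) -> str:
    # Stand-in for an LLM call; each "request" takes ~0.1s.
    await asyncio.sleep(0.1)
    return doc.upper()

async def main() -> tuple[list[str], float]:
    docs = ["alpha", "beta", "gamma"]
    start = time.perf_counter()
    # All three "requests" run concurrently.
    results = await asyncio.gather(*(fake_extract(d) for d in docs))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# Three 0.1s waits complete in roughly 0.1s total, in input order.
```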
Implement Timeouts
Prevent slow requests from blocking:
import asyncio
from baml_client import b

async def extract_with_timeout(text: str, timeout_seconds: int = 30):
    try:
        result = await asyncio.wait_for(
            b.ExtractData(text),
            timeout=timeout_seconds
        )
        return result
    except asyncio.TimeoutError:
        print(f"Request timed out after {timeout_seconds}s")
        return None
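The same timeout pattern, exercised against a deliberately slow stub so it runs standalone (`slow_extract` is a stand-in for a hung LLM request):

```python
import asyncio

async def slow_extract(text: str) -> str:
    await asyncio.sleep(5)  # simulates a request that never comes back in time
    return text

async def main():
    try:
        return await asyncio.wait_for(slow_extract("hello"), timeout=0.05)
    except asyncio.TimeoutError:
        return None  # caller gets a clean None instead of blocking for 5s

result = asyncio.run(main())
# result is None: asyncio.wait_for cancelled the task after 0.05s.
```

Note that `asyncio.wait_for` cancels the underlying task on timeout, so the slow request does not keep running in the background.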
Testing and Validation
Comprehensive Test Coverage
Write tests for edge cases to drive optimization:
test EmptyInput {
  functions [ExtractUser]
  args { text "" }
  @@assert({{ this == null }})
}

test PartialData {
  functions [ExtractUser]
  args { text "Name: John" }
  @@assert({{ this.name == "John" }})
  @@assert({{ this.age == null }})
}

test MalformedInput {
  functions [ExtractUser]
  args { text "asdf;lkj32984u23" }
  @@assert({{ this == null }})
}

test UnicodeCharacters {
  functions [ExtractUser]
  args { text "Name: José García, Age: 25" }
  @@assert({{ this.name == "José García" }})
}
Best Practices
Start with Tests: Write comprehensive tests before optimizing
Measure First: Use Collectors to establish baseline performance
Optimize Iteratively: Make one change at a time and measure impact
Balance Trade-offs: Weigh accuracy against cost and latency
Use Appropriate Models: Match model capability to task complexity
Cache Aggressively: Cache large, repeated content to reduce costs
Monitor Production: Track real-world usage patterns and costs
Version Prompts: Keep prompt history to enable rollbacks
Limitations
Automated Optimization
Cannot modify type structures (only descriptions and aliases)
Doesn’t discover template_strings in your codebase
Treats all errors equally (network issues vs. prompt issues)
Limited to single-function optimization (no workflow optimization yet)