Overview
RAPTOR provides comprehensive cost tracking and budget enforcement for LLM API usage across multiple providers. All costs are tracked in real-time and enforced before budget limits are exceeded.
Real-Time Tracking: Track costs as requests are made, not after the fact
Budget Enforcement: Hard limits prevent runaway costs from expensive operations
Multi-Provider Support: Unified cost tracking for Anthropic, OpenAI, Gemini, Mistral, and Ollama
Cost Callbacks: LiteLLM integration provides automatic token counting and cost calculation
Cost Configuration
Setting Budget Limits
from packages.llm_analysis.llm.config import LLMConfig

# Configure with custom budget
config = LLMConfig(
    enable_cost_tracking=True,
    max_cost_per_scan=10.0,  # USD
)
Default budget: $10.00 per scan
For production security research on large codebases, consider increasing to $25-50. For quick tests on small targets, $1-5 is usually sufficient.
Per-Model Pricing
RAPTOR automatically configures per-model costs based on current provider pricing:
# Anthropic Claude
claude_opus = ModelConfig(
    provider="anthropic",
    model_name="claude-opus-4.5",
    cost_per_1k_tokens=0.015,  # $15 per million tokens
)

claude_sonnet = ModelConfig(
    provider="anthropic",
    model_name="claude-sonnet-4.5",
    cost_per_1k_tokens=0.003,  # $3 per million tokens
)

# OpenAI GPT
gpt_5 = ModelConfig(
    provider="openai",
    model_name="gpt-5.2",
    cost_per_1k_tokens=0.005,  # $5 per million tokens
)

# Google Gemini
gemini = ModelConfig(
    provider="gemini",
    model_name="gemini-3-pro",
    cost_per_1k_tokens=0.0001,  # $0.10 per million tokens
)

# Ollama (local)
ollama = ModelConfig(
    provider="ollama",
    model_name="llama3:70b",
    cost_per_1k_tokens=0.0,  # FREE - runs locally
)
Pricing is configured at initialization time. Update config.py if provider pricing changes.
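As a quick sanity check, the per-1K rates above convert to the per-million prices in the comments by a factor of 1,000:

```python
# Convert the per-1K-token rates above to per-million-token prices.
rates_per_1k = {
    "claude-opus-4.5": 0.015,
    "claude-sonnet-4.5": 0.003,
    "gpt-5.2": 0.005,
    "gemini-3-pro": 0.0001,
}
for model, rate in rates_per_1k.items():
    print(f"{model}: ${rate * 1000:.2f} per million tokens")
```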
Budget Enforcement
Pre-Request Checks
Before making any LLM request, RAPTOR checks if the estimated cost would exceed the budget:
def _check_budget(self, estimated_cost: float = 0.1) -> bool:
    """Check if we're within budget."""
    if not self.config.enable_cost_tracking:
        return True
    if self.total_cost + estimated_cost > self.config.max_cost_per_scan:
        logger.error(
            f"Budget exceeded: ${self.total_cost:.2f} + ${estimated_cost:.2f} "
            f"> ${self.config.max_cost_per_scan:.2f}"
        )
        return False
    return True
Behavior:
Within budget: Request proceeds
Would exceed budget: Request blocked with clear error message
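Both outcomes can be exercised in isolation; this is a standalone sketch of the same check with hypothetical numbers (no RAPTOR internals required):

```python
# Standalone sketch of the pre-request budget check (hypothetical values).
MAX_COST_PER_SCAN = 10.0  # budget limit in USD


def check_budget(total_cost: float, estimated_cost: float = 0.1) -> bool:
    """Return True when the estimated request still fits the budget."""
    return total_cost + estimated_cost <= MAX_COST_PER_SCAN


print(check_budget(9.50))  # request proceeds -> True
print(check_budget(9.95))  # would exceed the $10 limit -> False
```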
Hard Budget Limit
When budget is exceeded, RAPTOR raises RuntimeError with guidance:
if not self._check_budget():
    raise RuntimeError(
        f"LLM budget exceeded: ${self.total_cost:.4f} spent > "
        f"${self.config.max_cost_per_scan:.4f} limit. "
        f"Increase budget with: LLMConfig(max_cost_per_scan={self.config.max_cost_per_scan * 2:.1f})"
    )
Example error message:
RuntimeError: LLM budget exceeded: $10.2345 spent > $10.0000 limit.
Increase budget with: LLMConfig(max_cost_per_scan=20.0)
Budget enforcement is a hard limit. Once exceeded, all LLM requests will fail until the budget is increased or costs are reset.
Real-Time Cost Tracking
Token-Based Calculation
Costs are calculated based on actual token usage reported by LiteLLM:
# After a successful LLM call
response = provider.generate(prompt, system_prompt)

# Track cost
self.total_cost += response.cost
self.request_count += 1

logger.info(
    f"Generation successful: {model.provider}/{model.model_name} "
    f"(tokens: {response.tokens_used}, cost: ${response.cost:.4f})"
)
Cost calculation:
tokens_used = response.usage.total_tokens  # Input + output tokens
cost = (tokens_used / 1000) * model_config.cost_per_1k_tokens
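As a worked example of the formula, a hypothetical 1,234-token call at the Sonnet rate comes out like this:

```python
# Worked example of the token-based cost formula (hypothetical request).
tokens_used = 1234          # input + output tokens from the response
cost_per_1k_tokens = 0.003  # claude-sonnet rate from the config above
cost = (tokens_used / 1000) * cost_per_1k_tokens
print(f"${cost:.4f}")  # -> $0.0037
```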
Per-Provider Tracking
Each provider maintains its own cost counter:
class LLMProvider:
    def __init__(self, model_config: ModelConfig):
        self.total_cost = 0.0
        self.total_tokens = 0

    def generate(self, prompt: str, system_prompt: Optional[str] = None):
        response = litellm.completion(...)

        # Calculate cost for this request
        tokens = response.usage.total_tokens
        cost = (tokens / 1000) * self.model_config.cost_per_1k_tokens

        # Update provider totals
        self.total_cost += cost
        self.total_tokens += tokens

        return LLMResponse(cost=cost, tokens_used=tokens, ...)
Global Cost Aggregation
The client aggregates costs across all providers:
class LLMClient:
    def __init__(self):
        self.total_cost = 0.0
        self.providers = {}  # provider_key -> LLMProvider

    def get_stats(self) -> Dict[str, Any]:
        return {
            "total_cost": self.total_cost,
            "budget_remaining": self.config.max_cost_per_scan - self.total_cost,
            "request_count": self.request_count,
            "providers": {
                key: {
                    "total_tokens": provider.total_tokens,
                    "total_cost": provider.total_cost,
                }
                for key, provider in self.providers.items()
            },
        }
LiteLLM Integration
Automatic Token Counting
LiteLLM provides automatic token counting for all providers:
import litellm

response = litellm.completion(
    model="claude-sonnet-4.5",
    messages=[{"role": "user", "content": prompt}],
)

# Token usage automatically populated
print(response.usage.prompt_tokens)      # Input tokens
print(response.usage.completion_tokens)  # Output tokens
print(response.usage.total_tokens)       # Sum
Supported providers:
Anthropic (Claude)
OpenAI (GPT)
Google (Gemini, PaLM)
Mistral
Ollama (reports tokens even though free)
Cost Callbacks
RAPTOR registers a LiteLLM callback for detailed cost visibility:
class RaptorLLMLogger:
    """LiteLLM callback logger for RAPTOR visibility."""

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        tokens_used = response_obj.usage.total_tokens
        duration = end_time - start_time
        logger.debug(
            f"[LiteLLM] Success: model={model}, "
            f"tokens={tokens_used}, duration={duration:.2f}s"
        )

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        error_msg = str(response_obj)
        logger.debug(f"[LiteLLM] Failure: model={model}, error={error_msg}")

# Register callback (singleton pattern)
callback = RaptorLLMLogger()
litellm.callbacks.append(callback)
Callback benefits:
Atomic logging of every LLM call
Token usage from LiteLLM’s perspective (not just our calculation)
Duration tracking for performance analysis
Automatic error capture
Callbacks complement manual logging. Manual logs provide RAPTOR-level context (retries, fallbacks), while callbacks provide LiteLLM-level metrics (tokens, duration).
Cost Optimization Strategies
1. Response Caching
Avoid re-computing identical requests:
config = LLMConfig(
    enable_caching=True,
    cache_dir=Path("out/llm_cache"),
)
client = LLMClient(config)

# First call: Costs $0.15
response1 = client.generate("Analyze this vulnerability...")

# Second identical call: FREE (cached)
response2 = client.generate("Analyze this vulnerability...")
print(response2.finish_reason)  # "cached"
Cache key: sha256(model + system_prompt + user_prompt)
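A minimal sketch of that key derivation (the exact field separator and encoding RAPTOR uses are assumptions here):

```python
import hashlib


def cache_key(model: str, system_prompt: str, user_prompt: str) -> str:
    """Derive a deterministic cache key; any prompt change yields a new key."""
    # Separator and encoding are illustrative, not RAPTOR's actual scheme.
    payload = f"{model}\n{system_prompt}\n{user_prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


key = cache_key("claude-sonnet-4.5", "You are a security analyst.",
                "Analyze this vulnerability...")
print(len(key))  # 64 hex characters
```

Because the raw prompt text feeds the hash, two prompts that differ only in whitespace produce different keys, which is why identical-looking calls can still miss the cache.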
Cache format:
{
    "content": "Analysis: This is a buffer overflow...",
    "model": "claude-sonnet-4.5",
    "provider": "anthropic",
    "tokens_used": 1234,
    "timestamp": 1625097600.0
}
Cache is not invalidated automatically. Clear out/llm_cache/ if you update prompts or want fresh analysis.
2. Model Selection
Use cheaper models for simpler tasks:
config = LLMConfig(
    # Primary: Expensive, high-capability
    primary_model=ModelConfig(
        provider="anthropic",
        model_name="claude-opus-4.5",
        cost_per_1k_tokens=0.015,
    ),
    # Specialized: Cheaper for specific tasks
    specialized_models={
        "code_analysis": ModelConfig(
            provider="anthropic",
            model_name="claude-sonnet-4.5",
            cost_per_1k_tokens=0.003,  # 5x cheaper
        ),
        "simple_classification": ModelConfig(
            provider="gemini",
            model_name="gemini-3-pro",
            cost_per_1k_tokens=0.0001,  # 150x cheaper!
        ),
    },
)

# Use the specialized model for the task
response = client.generate(
    prompt="Is this a buffer overflow? Yes/No",
    task_type="simple_classification",  # Uses Gemini
)
Task-specific models:
| Task Type | Recommended Model | Cost/1M Tokens | Reasoning |
|---|---|---|---|
| Exploit generation | Claude Opus | $15 | Needs deep reasoning |
| Vulnerability analysis | Claude Sonnet | $3 | Good balance |
| Code classification | Gemini Pro | $0.10 | Fast, cheap |
| Simple extraction | Gemini Flash | $0.02 | Ultra cheap |
3. Prompt Optimization
Reduce token usage with shorter prompts:
# ❌ BAD: Verbose prompt (500 tokens)
prompt = """
I would like you to carefully analyze the following code snippet
and provide a detailed explanation of any security vulnerabilities
that might be present. Please consider buffer overflows, integer
overflows, use-after-free, and any other memory safety issues...
[500 more words]
"""
# ✅ GOOD: Concise prompt (50 tokens)
prompt = """
Analyze for security vulnerabilities:
- Memory safety issues
- Logic errors
- Input validation
Code:
{code_snippet}
"""
Savings: 90% reduction in prompt tokens = 45% reduction in total cost (assuming 50/50 input/output split)
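The arithmetic behind that savings figure, under the stated assumption that input and output tokens each contribute half of total cost:

```python
# Prompt (input) tokens are assumed to be 50% of total cost; cutting the
# prompt by 90% therefore removes 0.5 * 0.9 = 45% of the total.
input_cost_share = 0.5
prompt_token_reduction = 0.9
total_cost_reduction = input_cost_share * prompt_token_reduction
print(f"{total_cost_reduction:.0%}")  # -> 45%
```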
4. Quota Detection and Fallback
Automatic fallback when quota exceeded:
def _is_quota_error(error: Exception) -> bool:
    """Detect quota/rate limit errors."""
    if isinstance(error, litellm.RateLimitError):
        return True
    error_str = str(error).lower()
    return any([
        "429" in error_str,
        "quota exceeded" in error_str,
        "rate limit" in error_str,
    ])

def generate(self, prompt: str, **kwargs):
    try:
        # Try primary model
        return self._generate_with_model(self.config.primary_model, prompt)
    except Exception as e:
        if _is_quota_error(e):
            logger.warning(f"Quota exceeded for {self.config.primary_model.provider}")
            # Fall back to a different provider
            for fallback in self.config.fallback_models:
                try:
                    return self._generate_with_model(fallback, prompt)
                except Exception:
                    continue
        raise  # Non-quota error, or all fallbacks exhausted
Quota guidance:
def _get_quota_guidance(model_name: str, provider: str) -> str:
    if provider == "gemini":
        return "→ Google Gemini quota/rate limit exceeded"
    elif provider == "openai":
        return "→ OpenAI rate limit exceeded"
    elif provider == "anthropic":
        return "→ Anthropic rate limit exceeded"
    else:
        return f"→ {provider.title()} rate limit exceeded"
5. Local Model Usage
Use Ollama for free inference:
config = LLMConfig(
    primary_model=ModelConfig(
        provider="ollama",
        model_name="llama3:70b",
        api_base="http://localhost:11434",
        cost_per_1k_tokens=0.0,  # FREE!
    ),
)
Trade-offs:
Zero cost: No API fees
Privacy: Data never leaves your machine
No rate limits: Run as many requests as your hardware allows
Offline capable: Works without internet
Lower quality: Local models are less capable than GPT-5/Claude Opus
Slower: Inference speed depends on hardware
Unreliable exploits: May generate non-functional PoCs
Higher false positives: Less accurate vulnerability assessment
RAPTOR warns when using Ollama for exploit generation: ⚠️ Local model - exploit PoCs may be unreliable
For production security research, consider cloud models.
Cost Reporting
Real-Time Statistics
Get current cost statistics during scan:
client = LLMClient()

# ... make some requests ...

stats = client.get_stats()
print(f"Total cost: ${stats['total_cost']:.4f}")
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")
print(f"Requests: {stats['request_count']}")

for provider, metrics in stats['providers'].items():
    print(f"  {provider}: {metrics['total_tokens']} tokens, ${metrics['total_cost']:.4f}")
Example output:
Total cost: $0.8086
Budget remaining: $9.1914
Requests: 15
  anthropic:claude-sonnet-4.5: 123456 tokens, $0.3704
  openai:gpt-5.2: 87654 tokens, $0.4383
Post-Scan Summary
After scan completion, RAPTOR reports total costs:
✅ AUTONOMOUS ORCHESTRATION COMPLETE
====================================================================
Total findings: 12
Processed: 5
Analyzed: 5
Exploitable: 2
Autonomous Actions:
✓ Exploits generated: 2
✓ Patches generated: 5
Execution time: 245.67s
LLM Cost: $3.45 / $10.00 budget (34.5% used)
Results saved to: out/agentic_myapp_20250713_143022/
====================================================================
Budget Exhaustion Warning
When budget is nearly exhausted:
⚠️ Warning: 90% of LLM budget used ($9.00 / $10.00)
Consider increasing budget for remaining findings.
When budget exceeded:
❌ Error: LLM budget exceeded: $10.2345 spent > $10.0000 limit.
Increase budget with: LLMConfig(max_cost_per_scan=20.0)
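One way such messages can be produced is a simple threshold check. This sketch mirrors the outputs above; the 90% warning threshold and the helper name are assumptions, not RAPTOR's actual implementation:

```python
from typing import Optional


def budget_status(total_cost: float, limit: float,
                  warn_at: float = 0.9) -> Optional[str]:
    """Return an error or warning message once spend crosses a threshold."""
    if total_cost > limit:
        return (f"Error: LLM budget exceeded: ${total_cost:.4f} spent > "
                f"${limit:.4f} limit.")
    if total_cost >= warn_at * limit:
        return (f"Warning: {total_cost / limit:.0%} of LLM budget used "
                f"(${total_cost:.2f} / ${limit:.2f})")
    return None  # comfortably within budget


print(budget_status(9.00, 10.0))
print(budget_status(10.2345, 10.0))
```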
Advanced Configuration
Per-Scan Budget Override
# Default: $10
default_client = LLMClient()

# High-value target: $50
high_value_config = LLMConfig(max_cost_per_scan=50.0)
high_value_client = LLMClient(high_value_config)

# Quick test: $1
quick_test_config = LLMConfig(max_cost_per_scan=1.0)
quick_test_client = LLMClient(quick_test_config)
Cost Reset
Reset cost tracking between scans:
client = LLMClient()

# Scan 1
result1 = run_scan(client)
print(f"Scan 1 cost: ${client.total_cost:.2f}")

# Reset for Scan 2
client.reset_costs()
result2 = run_scan(client)
print(f"Scan 2 cost: ${client.total_cost:.2f}")
Disable Cost Tracking
# For testing or when cost is not a concern
config = LLMConfig(
    enable_cost_tracking=False,
)
client = LLMClient(config)

# Budget checks are skipped, all requests allowed
Only disable cost tracking in development/testing environments. Production scans should always enforce budgets.
Cost Breakdown by Task
class CostTracker:
    def __init__(self):
        self.costs_by_task = {}

    def track_task(self, task_name: str, cost: float):
        if task_name not in self.costs_by_task:
            self.costs_by_task[task_name] = []
        self.costs_by_task[task_name].append(cost)

    def report(self):
        for task, costs in self.costs_by_task.items():
            total = sum(costs)
            avg = total / len(costs)
            print(f"{task}: ${total:.4f} total, ${avg:.4f} avg ({len(costs)} calls)")

tracker = CostTracker()

# Track each analysis task
for finding in findings:
    cost_before = client.total_cost
    analyze_finding(finding)
    cost_after = client.total_cost
    tracker.track_task("vulnerability_analysis", cost_after - cost_before)

tracker.report()
Example output:
vulnerability_analysis: $1.2345 total, $0.2469 avg (5 calls)
exploit_generation: $0.8765 total, $0.4383 avg (2 calls)
patch_creation: $0.6543 total, $0.1309 avg (5 calls)
Cost Prediction
def estimate_scan_cost(num_findings: int, avg_code_size: int) -> float:
    """
    Estimate total cost for a scan based on historical data.

    Args:
        num_findings: Expected number of vulnerabilities
        avg_code_size: Average lines of code per finding context

    Returns:
        Estimated cost in USD
    """
    # Historical averages from production scans
    cost_per_finding_analysis = 0.25  # $0.25 per vulnerability
    cost_per_exploit = 0.45           # $0.45 per exploit (50% exploitable)
    cost_per_patch = 0.15             # $0.15 per patch

    # Adjust for code size
    size_multiplier = min(avg_code_size / 100, 3.0)  # Cap at 3x

    analysis_cost = num_findings * cost_per_finding_analysis * size_multiplier
    exploit_cost = (num_findings * 0.5) * cost_per_exploit * size_multiplier
    patch_cost = num_findings * cost_per_patch * size_multiplier

    return analysis_cost + exploit_cost + patch_cost

# Example
estimated = estimate_scan_cost(num_findings=10, avg_code_size=150)
print(f"Estimated cost: ${estimated:.2f}")
# Output: Estimated cost: $9.38
Best Practices
Set Realistic Budgets: Start with $10 for small scans, $25-$50 for production. Monitor the first scan to calibrate.
Enable Caching: Cache responses to avoid re-computing identical requests (can save 30-50%).
Use Task-Specific Models: Route simple tasks to cheaper models (Gemini for classification, Sonnet for analysis).
Monitor in Real-Time: Check client.get_stats() during the scan to detect runaway costs early.
Don't Disable Cost Tracking in Production
Always enforce budgets in production to prevent accidentally expensive scans.
Quota Errors are Not Budget Errors
Quota exceeded = provider rate limit (wait or switch providers)
Budget exceeded = RAPTOR cost limit (increase max_cost_per_scan)
Local Models Have Zero Cost but Lower Quality
Use Ollama for development/testing, cloud models for production security research.
Troubleshooting
Budget exceeded but scan not complete
Cause: Scan hit the cost limit before processing all findings.
Fix:
config = LLMConfig(max_cost_per_scan=25.0)  # Increase budget
Costs higher than expected
Possible causes:
Using expensive model (Opus) for all tasks
Long prompts with unnecessary context
Cache disabled or not working
Debug:
stats = client.get_stats()
print(stats['providers'])  # See which provider costs the most
Check:
Is enable_caching=True?
Are prompts identical (including whitespace)?
Does out/llm_cache/ directory exist?
Verify:
response = client.generate(prompt)
print(response.finish_reason)  # Should be "cached" on repeat
Further Reading
LiteLLM Cost Tracking Official LiteLLM documentation on cost tracking features
Provider Pricing Up-to-date pricing for all major LLM providers
Client Configuration Full LLM client configuration reference
Optimization Guide Comprehensive guide to optimizing LLM costs