
Overview

RAPTOR provides comprehensive cost tracking and budget enforcement for LLM API usage across multiple providers. Costs are tracked in real time, and budgets are enforced before limits are exceeded.

Real-Time Tracking

Track costs as requests are made, not after the fact

Budget Enforcement

Hard limits prevent runaway costs from expensive operations

Multi-Provider Support

Unified cost tracking for Anthropic, OpenAI, Gemini, Mistral, and Ollama

Cost Callbacks

LiteLLM integration provides automatic token counting and cost calculation

Cost Configuration

Setting Budget Limits

from packages.llm_analysis.llm.config import LLMConfig

# Configure with custom budget
config = LLMConfig(
    enable_cost_tracking=True,
    max_cost_per_scan=10.0,  # USD
)
Default budget: $10.00 per scan
For production security research on large codebases, consider increasing to $25-50. For quick tests on small targets, $1-5 is usually sufficient.

Per-Model Pricing

RAPTOR automatically configures per-model costs based on current provider pricing:
# Anthropic Claude
claude_opus = ModelConfig(
    provider="anthropic",
    model_name="claude-opus-4.5",
    cost_per_1k_tokens=0.015,  # $15 per million tokens
)

claude_sonnet = ModelConfig(
    provider="anthropic",
    model_name="claude-sonnet-4.5",
    cost_per_1k_tokens=0.003,  # $3 per million tokens
)

# OpenAI GPT
gpt_5 = ModelConfig(
    provider="openai",
    model_name="gpt-5.2",
    cost_per_1k_tokens=0.005,  # $5 per million tokens
)

# Google Gemini
gemini = ModelConfig(
    provider="gemini",
    model_name="gemini-3-pro",
    cost_per_1k_tokens=0.0001,  # $0.10 per million tokens
)

# Ollama (local)
ollama = ModelConfig(
    provider="ollama",
    model_name="llama3:70b",
    cost_per_1k_tokens=0.0,  # FREE - runs locally
)
Pricing is configured at initialization time. Update config.py if provider pricing changes.

Budget Enforcement

Pre-Request Checks

Before making any LLM request, RAPTOR checks if the estimated cost would exceed the budget:
def _check_budget(self, estimated_cost: float = 0.1) -> bool:
    """Check if we're within budget."""
    if not self.config.enable_cost_tracking:
        return True
    
    if self.total_cost + estimated_cost > self.config.max_cost_per_scan:
        logger.error(
            f"Budget exceeded: ${self.total_cost:.2f} + ${estimated_cost:.2f} "
            f"> ${self.config.max_cost_per_scan:.2f}"
        )
        return False
    
    return True
Behavior:
  • Within budget: Request proceeds
  • Would exceed budget: Request blocked with clear error message
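The pre-check reduces to a simple comparison; a standalone sketch (hypothetical check_budget helper, not the actual client method):

```python
# Hypothetical standalone sketch of the pre-request budget check.
def check_budget(total_cost: float, estimated_cost: float, max_cost: float) -> bool:
    """Return True when the request would stay within budget."""
    return total_cost + estimated_cost <= max_cost

print(check_budget(9.80, 0.10, 10.00))  # True  (9.90 <= 10.00, request proceeds)
print(check_budget(9.95, 0.10, 10.00))  # False (10.05 > 10.00, request blocked)
```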

Hard Budget Limit

When budget is exceeded, RAPTOR raises RuntimeError with guidance:
if not self._check_budget():
    raise RuntimeError(
        f"LLM budget exceeded: ${self.total_cost:.4f} spent > "
        f"${self.config.max_cost_per_scan:.4f} limit. "
        f"Increase budget with: LLMConfig(max_cost_per_scan={self.config.max_cost_per_scan * 2:.1f})"
    )
Example error message:
RuntimeError: LLM budget exceeded: $10.2345 spent > $10.0000 limit. 
Increase budget with: LLMConfig(max_cost_per_scan=20.0)
Budget enforcement is a hard limit. Once exceeded, all LLM requests will fail until budget is increased or costs are reset.

Real-Time Cost Tracking

Token-Based Calculation

Costs are calculated based on actual token usage reported by LiteLLM:
# After successful LLM call
response = provider.generate(prompt, system_prompt)

# Track cost
self.total_cost += response.cost
self.request_count += 1

logger.info(
    f"Generation successful: {model.provider}/{model.model_name} "
    f"(tokens: {response.tokens_used}, cost: ${response.cost:.4f})"
)
Cost calculation:
tokens_used = response.usage.total_tokens  # Input + output tokens
cost = (tokens_used / 1000) * model_config.cost_per_1k_tokens
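Plugging in numbers: a request that consumes 1,234 total tokens at the $0.003-per-1K Sonnet pricing above costs:

```python
# Worked example of the cost formula above.
tokens_used = 1234
cost_per_1k_tokens = 0.003  # Claude Sonnet pricing from the config section

cost = (tokens_used / 1000) * cost_per_1k_tokens
print(f"${cost:.4f}")  # $0.0037
```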

Per-Provider Tracking

Each provider maintains its own cost counter:
class LLMProvider:
    def __init__(self, model_config: ModelConfig):
        self.model_config = model_config  # needed below for per-token pricing
        self.total_cost = 0.0
        self.total_tokens = 0
    
    def generate(self, prompt: str, system_prompt: Optional[str] = None):
        response = litellm.completion(...)
        
        # Calculate cost for this request
        tokens = response.usage.total_tokens
        cost = (tokens / 1000) * self.model_config.cost_per_1k_tokens
        
        # Update provider totals
        self.total_cost += cost
        self.total_tokens += tokens
        
        return LLMResponse(cost=cost, tokens_used=tokens, ...)

Global Cost Aggregation

The client aggregates costs across all providers:
class LLMClient:
    def __init__(self, config: Optional[LLMConfig] = None):
        self.config = config or LLMConfig()
        self.total_cost = 0.0
        self.request_count = 0
        self.providers = {}  # provider_key -> LLMProvider
    
    def get_stats(self) -> Dict[str, Any]:
        return {
            "total_cost": self.total_cost,
            "budget_remaining": self.config.max_cost_per_scan - self.total_cost,
            "request_count": self.request_count,
            "providers": {
                key: {
                    "total_tokens": provider.total_tokens,
                    "total_cost": provider.total_cost,
                }
                for key, provider in self.providers.items()
            }
        }

LiteLLM Integration

Automatic Token Counting

LiteLLM provides automatic token counting for all providers:
import litellm

response = litellm.completion(
    model="claude-sonnet-4.5",
    messages=[{"role": "user", "content": prompt}],
)

# Token usage automatically populated
print(response.usage.prompt_tokens)      # Input tokens
print(response.usage.completion_tokens)  # Output tokens
print(response.usage.total_tokens)       # Sum
Supported providers:
  • Anthropic (Claude)
  • OpenAI (GPT)
  • Google (Gemini, PaLM)
  • Mistral
  • Ollama (reports tokens even though free)

Cost Callbacks

RAPTOR registers a LiteLLM callback for detailed cost visibility:
class RaptorLLMLogger:
    """LiteLLM callback logger for RAPTOR visibility."""
    
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        tokens_used = response_obj.usage.total_tokens
        duration = end_time - start_time
        
        logger.debug(
            f"[LiteLLM] Success: model={model}, "
            f"tokens={tokens_used}, duration={duration:.2f}s"
        )
    
    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        error_msg = str(response_obj)
        
        logger.debug(f"[LiteLLM] Failure: model={model}, error={error_msg}")

# Register callback (singleton pattern)
callback = RaptorLLMLogger()
litellm.callbacks.append(callback)
Callback benefits:
  • Atomic logging of every LLM call
  • Token usage from LiteLLM’s perspective (not just our calculation)
  • Duration tracking for performance analysis
  • Automatic error capture
Callbacks complement manual logging. Manual logs provide RAPTOR-level context (retries, fallbacks), while callbacks provide LiteLLM-level metrics (tokens, duration).

Cost Optimization Strategies

1. Response Caching

Avoid re-computing identical requests:
config = LLMConfig(
    enable_caching=True,
    cache_dir=Path("out/llm_cache"),
)

client = LLMClient(config)

# First call: Costs $0.15
response1 = client.generate("Analyze this vulnerability...")

# Second identical call: FREE (cached)
response2 = client.generate("Analyze this vulnerability...")
print(response2.finish_reason)  # "cached"
Cache key: sha256(model + system_prompt + user_prompt)
Cache format:
{
  "content": "Analysis: This is a buffer overflow...",
  "model": "claude-sonnet-4.5",
  "provider": "anthropic",
  "tokens_used": 1234,
  "timestamp": 1625097600.0
}
Cache is not invalidated automatically. Clear out/llm_cache/ if you update prompts or want fresh analysis.
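Assuming the key formula above (the exact concatenation order and encoding are assumptions; check the client source), the lookup key can be reproduced like this:

```python
import hashlib

def cache_key(model: str, system_prompt: str, user_prompt: str) -> str:
    """sha256 over model + system_prompt + user_prompt, per the formula above.
    Exact concatenation and encoding are assumptions, not confirmed behavior."""
    payload = (model + system_prompt + user_prompt).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

key = cache_key("claude-sonnet-4.5",
                "You are a security analyst.",
                "Analyze this vulnerability...")
print(len(key))  # 64 hex characters
```

Note that any difference in the prompt, including whitespace, produces a different key and therefore a cache miss.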

2. Model Selection

Use cheaper models for simpler tasks:
config = LLMConfig(
    # Primary: Expensive, high-capability
    primary_model=ModelConfig(
        provider="anthropic",
        model_name="claude-opus-4.5",
        cost_per_1k_tokens=0.015,
    ),
    
    # Specialized: Cheaper for specific tasks
    specialized_models={
        "code_analysis": ModelConfig(
            provider="anthropic",
            model_name="claude-sonnet-4.5",
            cost_per_1k_tokens=0.003,  # 5x cheaper
        ),
        "simple_classification": ModelConfig(
            provider="gemini",
            model_name="gemini-3-pro",
            cost_per_1k_tokens=0.0001,  # 150x cheaper!
        ),
    },
)

# Use specialized model for task
response = client.generate(
    prompt="Is this a buffer overflow? Yes/No",
    task_type="simple_classification",  # Uses Gemini
)
Task-specific models:
| Task Type | Recommended Model | Cost/1M Tokens | Reasoning |
| --- | --- | --- | --- |
| Exploit generation | Claude Opus | $15 | Needs deep reasoning |
| Vulnerability analysis | Claude Sonnet | $3 | Good balance |
| Code classification | Gemini Pro | $0.10 | Fast, cheap |
| Simple extraction | Gemini Flash | $0.02 | Ultra cheap |
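The task_type routing can be sketched as a dictionary lookup with a primary-model fallback (a simplification of the client's real selection logic):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    provider: str
    model_name: str
    cost_per_1k_tokens: float

PRIMARY = ModelConfig("anthropic", "claude-opus-4.5", 0.015)
SPECIALIZED = {
    "code_analysis": ModelConfig("anthropic", "claude-sonnet-4.5", 0.003),
    "simple_classification": ModelConfig("gemini", "gemini-3-pro", 0.0001),
}

def select_model(task_type: Optional[str] = None) -> ModelConfig:
    """Route known task types to cheaper models; default to the primary."""
    return SPECIALIZED.get(task_type, PRIMARY) if task_type else PRIMARY

print(select_model("simple_classification").model_name)  # gemini-3-pro
print(select_model().model_name)                         # claude-opus-4.5
```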

3. Prompt Optimization

Reduce token usage with shorter prompts:
# ❌ BAD: Verbose prompt (500 tokens)
prompt = """
I would like you to carefully analyze the following code snippet 
and provide a detailed explanation of any security vulnerabilities 
that might be present. Please consider buffer overflows, integer 
overflows, use-after-free, and any other memory safety issues...
[500 more words]
"""

# ✅ GOOD: Concise prompt (50 tokens)
prompt = """
Analyze for security vulnerabilities:
- Memory safety issues
- Logic errors
- Input validation

Code:
{code_snippet}
"""
Savings: 90% reduction in prompt tokens = 45% reduction in total cost (assuming 50/50 input/output split)
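The arithmetic behind that claim, with illustrative token counts:

```python
# Illustrative numbers: 500 prompt tokens + 500 output tokens (50/50 split).
prompt_before, output = 500, 500
prompt_after = prompt_before * 0.10  # 90% reduction -> 50 tokens

total_before = prompt_before + output  # 1000 tokens
total_after = prompt_after + output    # 550 tokens

savings = 1 - total_after / total_before
print(f"{savings:.0%}")  # 45%
```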

4. Quota Detection and Fallback

Automatic fallback when quota exceeded:
def _is_quota_error(error: Exception) -> bool:
    """Detect quota/rate limit errors."""
    if isinstance(error, litellm.RateLimitError):
        return True
    
    error_str = str(error).lower()
    return any([
        "429" in error_str,
        "quota exceeded" in error_str,
        "rate limit" in error_str,
    ])

def generate(self, prompt: str, **kwargs):
    try:
        # Try primary model
        return self._generate_with_model(self.config.primary_model, prompt)
    except Exception as e:
        if not _is_quota_error(e):
            raise
        logger.warning(f"Quota exceeded for {self.config.primary_model.provider}")
        # Fall back to a different provider
        for fallback in self.config.fallback_models:
            try:
                return self._generate_with_model(fallback, prompt)
            except Exception:
                continue
        raise  # all fallbacks exhausted
Quota guidance:
def _get_quota_guidance(model_name: str, provider: str) -> str:
    if provider == "gemini":
        return "→ Google Gemini quota/rate limit exceeded"
    elif provider == "openai":
        return "→ OpenAI rate limit exceeded"
    elif provider == "anthropic":
        return "→ Anthropic rate limit exceeded"
    else:
        return f"→ {provider.title()} rate limit exceeded"

5. Local Model Usage

Use Ollama for free inference:
config = LLMConfig(
    primary_model=ModelConfig(
        provider="ollama",
        model_name="llama3:70b",
        api_base="http://localhost:11434",
        cost_per_1k_tokens=0.0,  # FREE!
    ),
)
Trade-offs:
  • Zero cost: No API fees
  • Privacy: Data never leaves your machine
  • No rate limits: Run as many requests as hardware allows
  • Offline capable: Works without internet
RAPTOR warns when using Ollama for exploit generation:
⚠️  Local model - exploit PoCs may be unreliable
For production security research, consider cloud models.

Cost Reporting

Real-Time Statistics

Get current cost statistics during scan:
client = LLMClient()

# ... make some requests ...

stats = client.get_stats()
print(f"Total cost: ${stats['total_cost']:.4f}")
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")
print(f"Requests: {stats['request_count']}")

for provider, metrics in stats['providers'].items():
    print(f"  {provider}: {metrics['total_tokens']} tokens, ${metrics['total_cost']:.4f}")
Example output:
Total cost: $2.4567
Budget remaining: $7.5433
Requests: 15
  anthropic:claude-sonnet-4.5: 617300 tokens, $1.8519
  openai:gpt-5.2: 120960 tokens, $0.6048

Post-Scan Summary

After scan completion, RAPTOR reports total costs:
 AUTONOMOUS ORCHESTRATION COMPLETE
====================================================================
Total findings: 12
Processed: 5
Analyzed: 5
Exploitable: 2

Autonomous Actions:
 Exploits generated: 2
 Patches generated: 5

Execution time: 245.67s
LLM Cost: $3.45 / $10.00 budget (34.5% used)

Results saved to: out/agentic_myapp_20250713_143022/
====================================================================

Budget Exhaustion Warning

When budget is nearly exhausted:
⚠️  Warning: 90% of LLM budget used ($9.00 / $10.00)
Consider increasing budget for remaining findings.
When budget exceeded:
❌ Error: LLM budget exceeded: $10.2345 spent > $10.0000 limit.
Increase budget with: LLMConfig(max_cost_per_scan=20.0)
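The 90% warning threshold can be sketched as follows (hypothetical helper, not the actual RAPTOR code):

```python
from typing import Optional

def budget_warning(total_cost: float, max_cost: float,
                   threshold: float = 0.9) -> Optional[str]:
    """Return a warning string once usage crosses the threshold, else None."""
    used = total_cost / max_cost
    if used >= threshold:
        return (f"Warning: {used:.0%} of LLM budget used "
                f"(${total_cost:.2f} / ${max_cost:.2f})")
    return None

print(budget_warning(9.00, 10.00))  # Warning: 90% of LLM budget used ($9.00 / $10.00)
print(budget_warning(5.00, 10.00))  # None
```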

Advanced Configuration

Per-Scan Budget Override

# Default: $10
default_client = LLMClient()  

# High-value target: $50
high_value_config = LLMConfig(max_cost_per_scan=50.0)
high_value_client = LLMClient(high_value_config)

# Quick test: $1
quick_test_config = LLMConfig(max_cost_per_scan=1.0)
quick_test_client = LLMClient(quick_test_config)

Cost Reset

Reset cost tracking between scans:
client = LLMClient()

# Scan 1
result1 = run_scan(client)
print(f"Scan 1 cost: ${client.total_cost:.2f}")

# Reset for Scan 2
client.reset_costs()
result2 = run_scan(client)
print(f"Scan 2 cost: ${client.total_cost:.2f}")
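reset_costs presumably just zeroes the aggregate counters; a minimal sketch of that behavior (CostState is a hypothetical stand-in for the client):

```python
class CostState:
    """Hypothetical stand-in for the client's cost-tracking state."""
    def __init__(self) -> None:
        self.total_cost = 0.0
        self.request_count = 0

    def reset_costs(self) -> None:
        """Zero the counters so each scan is budgeted independently."""
        self.total_cost = 0.0
        self.request_count = 0

state = CostState()
state.total_cost, state.request_count = 3.45, 15
state.reset_costs()
print(state.total_cost, state.request_count)  # 0.0 0
```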

Disable Cost Tracking

# For testing or when cost is not a concern
config = LLMConfig(
    enable_cost_tracking=False,
)

client = LLMClient(config)
# Budget checks are skipped, all requests allowed
Only disable cost tracking in development/testing environments. Production scans should always enforce budgets.

Cost Analysis Tools

Cost Breakdown by Task

class CostTracker:
    def __init__(self):
        self.costs_by_task = {}
    
    def track_task(self, task_name: str, cost: float):
        if task_name not in self.costs_by_task:
            self.costs_by_task[task_name] = []
        self.costs_by_task[task_name].append(cost)
    
    def report(self):
        for task, costs in self.costs_by_task.items():
            total = sum(costs)
            avg = total / len(costs)
            print(f"{task}: ${total:.4f} total, ${avg:.4f} avg ({len(costs)} calls)")

tracker = CostTracker()

# Track each analysis task
for finding in findings:
    cost_before = client.total_cost
    analyze_finding(finding)
    cost_after = client.total_cost
    tracker.track_task("vulnerability_analysis", cost_after - cost_before)

tracker.report()
Example output:
vulnerability_analysis: $1.2345 total, $0.2469 avg (5 calls)
exploit_generation: $0.8765 total, $0.4383 avg (2 calls)
patch_creation: $0.6543 total, $0.1309 avg (5 calls)

Cost Prediction

def estimate_scan_cost(num_findings: int, avg_code_size: int) -> float:
    """
    Estimate total cost for a scan based on historical data.
    
    Args:
        num_findings: Expected number of vulnerabilities
        avg_code_size: Average lines of code per finding context
    
    Returns:
        Estimated cost in USD
    """
    # Historical averages from production scans
    cost_per_finding_analysis = 0.25  # $0.25 per vulnerability
    cost_per_exploit = 0.45  # $0.45 per exploit (50% exploitable)
    cost_per_patch = 0.15  # $0.15 per patch
    
    # Adjust for code size
    size_multiplier = min(avg_code_size / 100, 3.0)  # Cap at 3x
    
    analysis_cost = num_findings * cost_per_finding_analysis * size_multiplier
    exploit_cost = (num_findings * 0.5) * cost_per_exploit * size_multiplier
    patch_cost = num_findings * cost_per_patch * size_multiplier
    
    return analysis_cost + exploit_cost + patch_cost

# Example
estimated = estimate_scan_cost(num_findings=10, avg_code_size=150)
print(f"Estimated cost: ${estimated:.2f}")
# Output: Estimated cost: $9.38

Best Practices

Set Realistic Budgets

Start with $10 for small scans, $25-50 for production. Monitor the first scan to calibrate.

Enable Caching

Cache responses to avoid re-computing identical requests (can save 30-50%).

Use Task-Specific Models

Route simple tasks to cheaper models (Gemini for classification, Sonnet for analysis).

Monitor in Real-Time

Check client.get_stats() during scan to detect runaway costs early.
Always enforce budgets in production to prevent accidentally expensive scans.
Quota exceeded = provider rate limit (need to wait or switch provider)
Budget exceeded = RAPTOR cost limit (need to increase max_cost_per_scan)
Use Ollama for development/testing, cloud models for production security research.

Troubleshooting

Budget exceeded but scan not complete
Cause: Scan hit cost limit before processing all findings.
Fix:
config = LLMConfig(max_cost_per_scan=25.0)  # Increase budget
Costs higher than expected
Possible causes:
  • Using expensive model (Opus) for all tasks
  • Long prompts with unnecessary context
  • Cache disabled or not working
Debug:
stats = client.get_stats()
print(stats['providers'])  # See which provider costs most
Cache not reducing costs
Check:
  • Is enable_caching=True?
  • Are prompts identical (including whitespace)?
  • Does out/llm_cache/ directory exist?
Verify:
response = client.generate(prompt)
print(response.finish_reason)  # Should be "cached" on repeat

Further Reading

LiteLLM Cost Tracking

Official LiteLLM documentation on cost tracking features

Provider Pricing

Up-to-date pricing for all major LLM providers

Client Configuration

Full LLM client configuration reference

Optimization Guide

Comprehensive guide to optimizing LLM costs
