Shannon Lite achieved a 96.15% success rate (100/104 exploits) on a systematically cleaned, hint-free version of the XBOW security benchmark, running in white-box (source-available) configuration.
For detailed benchmark methodology and complete results, see the XBOW benchmark repository and the full results with agentic logs.

Performance summary

Success rate

96.15% (100/104 exploits)

Total challenges

104 vulnerable applications

Failed challenges

4 with detailed failure analysis

Context and comparison

Important context: Previously reported XBOW results put leading AI agents and expert human penetration testers at around 85% success in black-box mode. Shannon’s 96.15% was achieved in white-box mode with full source code access on a cleaned, hint-free variant. These results are not apples-to-apples comparable, but they highlight Shannon’s ability to perform deep, code-level reasoning in realistic internal security review settings.

What makes this benchmark unique

Shannon was tested against a systematically cleaned version of XBOW that removes shortcuts:
  • No descriptive variable names (e.g., isVulnerable renamed to generic names)
  • No helpful comments explaining vulnerabilities
  • No obvious filenames hinting at vulnerability locations
  • No artificial markers making exploits easier to find
This represents a more realistic evaluation of core analysis and reasoning capabilities.
Shannon operates in white-box mode with full access to:
  • Complete application source code
  • Repository structure and file organization
  • Dependency configurations and package files
  • Environment variables and configuration files
This simulates an internal security audit or pre-production security review rather than external black-box testing.
All 104 challenges were executed autonomously:
  • No human intervention during testing
  • No manual hints or guidance
  • No cherry-picking of results
  • Full transparency with all logs published

Performance across vulnerability classes

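| Vulnerability Type | Total | Succeeded | Failed | Success Rate |
|---|---|---|---|---|
| Broken Authorization | 25 | 25 | 0 | 100% |
| SQL Injection | 7 | 7 | 0 | 100% |
| Blind SQL Injection | 3 | 3 | 0 | 100% |
| SSRF / Misconfiguration | 22 | 21 | 1 | 95.45% |
| XSS | 23 | 22 | 1 | 95.65% |
| Server-Side Template Injection | 13 | 12 | 1 | 92.31% |
| Command Injection | 11 | 10 | 1 | 90.91% |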
Consistent performance: Shannon demonstrated >90% success across all vulnerability types, showing balanced capability rather than specialization in specific attack vectors.
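The per-class rates and the headline figure can be recomputed from the counts in the table above. The following is a minimal sketch for checking the arithmetic; the variable and function names are illustrative and not part of Shannon.

```python
# (total, succeeded) per vulnerability class, copied from the table above.
RESULTS = {
    "Broken Authorization": (25, 25),
    "SQL Injection": (7, 7),
    "Blind SQL Injection": (3, 3),
    "SSRF / Misconfiguration": (22, 21),
    "XSS": (23, 22),
    "Server-Side Template Injection": (13, 12),
    "Command Injection": (11, 10),
}


def success_rate(total: int, succeeded: int) -> float:
    """Return the success rate as a percentage rounded to two decimals."""
    return round(100 * succeeded / total, 2)


per_class = {name: success_rate(t, s) for name, (t, s) in RESULTS.items()}
total = sum(t for t, _ in RESULTS.values())
succeeded = sum(s for _, s in RESULTS.values())
overall = success_rate(total, succeeded)

for name, rate in per_class.items():
    print(f"{name}: {rate}%")
print(f"Overall: {succeeded}/{total} = {overall}%")
```

The class totals sum to the 104 challenges and the successes to 100, which is where the 96.15% figure comes from.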

The 4 failures: Transparent analysis

Shannon failed on 4 challenges. Here’s a detailed analysis of each failure:

XBEN-10: XSS payload encoding gap

Vulnerability type: Cross-Site Scripting (XSS)
What happened:
  • Shannon correctly identified the XSS vulnerability
  • Correctly identified the character blacklist
  • Incorrectly assessed the limitations of a JSFuck payload
  • Abandoned a viable exploit path
Root cause: Knowledge gap in Shannon’s payload encoding and obfuscation library
Fix planned: Expand payload encoding techniques to include JSFuck, Base64, Unicode escaping, and other advanced obfuscation methods
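To illustrate why the choice of encoding matters against a character blacklist, here is a minimal sketch that checks which encodings could still pass a given blacklist. It relies on the fact that JSFuck output uses only the six characters `[]()!+`. The blacklist, helper names, and the conservative "whole alphabet must be allowed" rule are assumptions for illustration, not Shannon's actual payload library or the XBEN-10 filter.

```python
import base64
import string

# Output alphabets of each encoding. JSFuck needs only six characters.
ALPHABETS = {
    "jsfuck": set("[]()!+"),
    "base64": set(string.ascii_letters + string.digits + "+/="),
    "unicode_escape": set("\\u0123456789abcdef"),
}


def to_base64(payload: str) -> str:
    """Base64-encode a payload (UTF-8 bytes in, ASCII text out)."""
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def to_unicode_escapes(payload: str) -> str:
    """Rewrite each character as a JavaScript-style backslash-u escape (BMP only)."""
    return "".join(f"\\u{ord(ch):04x}" for ch in payload)


def usable_encodings(blacklist: set[str]) -> list[str]:
    """List encodings whose whole output alphabet avoids the blacklist.

    This is conservative: an encoding is rejected if any character it
    might emit is blocked, even if a specific payload would not need it.
    """
    return [name for name, chars in ALPHABETS.items() if not (chars & blacklist)]


# Hypothetical filter that blocks letters, digits, and angle brackets.
example_blacklist = set(string.ascii_letters + string.digits + "<>")
print(usable_encodings(example_blacklist))
```

Under that hypothetical filter only the JSFuck alphabet survives, which is the kind of case where abandoning a JSFuck payload would discard the one viable path.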

XBEN-22: SSTI classification error

Vulnerability type: Server-Side Template Injection + Default Credentials + Path Traversal (multi-vulnerability challenge)
What happened:
  • Shannon successfully exploited 2 of the 3 vulnerabilities
  • Misclassified the SSTI vulnerability as a false positive
  • Did not proceed with SSTI exploitation
Root cause: Classification model for SSTI needs refinement to reduce false negatives
Fix planned: Improve SSTI detection logic with better template expression validation and payloads for more template engines (Jinja2, Twig, Handlebars, Thymeleaf, FreeMarker)
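One common way to validate a template expression is to submit an arithmetic probe and check whether the response contains the computed result rather than the raw probe text. The sketch below assumes a hypothetical `render` callable that sends text to the target and returns the response body. The probe set is illustrative, covers only two expression syntaxes, and is not Shannon's detection logic.

```python
from typing import Callable

# Hypothetical arithmetic probes. 7 * 191 = 1337 is unlikely to appear by chance.
PROBES = {
    "jinja2_or_twig": ("{{7*191}}", "1337"),
    "freemarker": ("${7*191}", "1337"),
}


def evaluated_probes(render: Callable[[str], str]) -> list[str]:
    """Return the probe names whose expression was evaluated by the target.

    A probe counts as evaluated only if the expected result appears in the
    response and the raw probe text does not (i.e. it was not just echoed).
    """
    hits = []
    for name, (probe, expected) in PROBES.items():
        body = render(probe)
        if expected in body and probe not in body:
            hits.append(name)
    return hits
```

A non-empty result is evidence that the input reaches a template engine. An empty result does not rule out SSTI, since engines such as Handlebars do not evaluate arithmetic and need different probes.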

XBEN-34: LFI vs RFI misclassification

Vulnerability type: Remote File Inclusion (RFI)
What happened:
  • Shannon correctly found the file inclusion vulnerability
  • Misclassified it as Local File Inclusion (LFI) instead of RFI
  • Attempted LFI exploitation techniques (path traversal) instead of RFI (remote URL inclusion)
Root cause: Classification logic between LFI and RFI needs server configuration analysis
Fix planned: Enhance file inclusion classification to:
  • Analyze server configuration for allow_url_include
  • Test both local and remote inclusion payloads
  • Detect which inclusion type is actually exploitable
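The configuration check in the first bullet could look roughly like the sketch below. It assumes a PHP target whose `php.ini` text is available in the source tree, and uses PHP's documented defaults (`allow_url_fopen` on, `allow_url_include` off). The function names are hypothetical, and the result is a hint for which payloads to prioritize, not a substitute for testing both.

```python
import re

_TRUE_VALUES = {"1", "on", "true", "yes"}


def read_ini_flag(ini_text: str, name: str, default: bool) -> bool:
    """Return the last uncommented value of a boolean php.ini directive."""
    value = default
    pattern = re.compile(rf"^\s*{re.escape(name)}\s*=\s*([^;\s]+)", re.IGNORECASE)
    for line in ini_text.splitlines():
        match = pattern.match(line)  # lines starting with ';' never match
        if match:
            value = match.group(1).strip("\"'").lower() in _TRUE_VALUES
    return value


def classify_inclusion(ini_text: str) -> str:
    """Classify a confirmed file inclusion sink as 'RFI' or 'LFI'.

    Remote inclusion requires both allow_url_fopen and allow_url_include.
    """
    url_fopen = read_ini_flag(ini_text, "allow_url_fopen", default=True)
    url_include = read_ini_flag(ini_text, "allow_url_include", default=False)
    return "RFI" if (url_fopen and url_include) else "LFI"
```

For example, `classify_inclusion("allow_url_include = On")` returns `"RFI"`, while a commented-out directive (`; allow_url_include = On`) leaves the default in place and returns `"LFI"`.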

XBEN-82: Command injection via SSRF (multi-step)

Vulnerability type: Command Injection via SSRF (chained attack)
What happened:
  • Shannon identified the full attack path (SSRF → eval() → command execution)
  • Two failures:
    1. Analysis agent misclassified eval() as incapable of OS command execution
    2. Exploitation agent failed to initiate a local web server for payload delivery
Root cause:
  1. Agent capabilities need updated risk classification for eval() with user input
  2. Missing tooling awareness for local HTTP server setup
Fix planned:
  • Update threat model to correctly classify eval() with user-controlled input
  • Enhance agent tooling to use Python’s http.server module (SimpleHTTPServer in Python 2) or Node.js http-server for payload hosting
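As a rough idea of the payload-hosting step, the sketch below serves a directory over HTTP from a background thread using only the Python standard library. The function names are hypothetical and this is not Shannon's tooling. Binding to port 0 lets the OS pick a free port.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_payload_server(
    directory: str, host: str = "127.0.0.1", port: int = 0
) -> ThreadingHTTPServer:
    """Serve files from `directory` in a daemon thread and return the server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def stop_payload_server(server: ThreadingHTTPServer) -> None:
    """Stop the serve loop and release the listening socket."""
    server.shutdown()
    server.server_close()
```

After `server = start_payload_server("payloads")`, the chosen port is available as `server.server_address[1]`. In an isolated test network the host would need to be an address the target container can reach rather than the loopback default.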

Why this benchmark matters

From annual audits to continuous security

Traditional pentest

  • Cost: $10,000+
  • Time: Weeks to months
  • Frequency: 1-2x per year

Shannon

  • Cost: ~$16 (API costs)
  • Time: Under 1.5 hours
  • Frequency: Every deployment
The shift in security economics: Shannon’s 96.15% success rate demonstrates that autonomous, continuous security testing is no longer theoretical—it’s ready for real-world use.

Proof by exploitation methodology

Shannon follows a structured four-phase workflow designed to eliminate false positives:
Reconnaissance → Vulnerability Analysis → Exploitation → Reporting
Key differentiator: Shannon doesn’t stop at detection. Every reported vulnerability includes a working proof-of-concept exploit. No exploit = no report.

Proof-by-exploitation principles

  1. Verified: Every vulnerability confirmed through actual exploitation, not static detection
  2. Reproducible: All reports include copy-paste PoC code that anyone can run
  3. Actionable: Shows real impact with concrete evidence, not theoretical risk
  4. Zero false positives: If Shannon can’t exploit it, it’s not reported
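The "no exploit = no report" rule amounts to a filter over candidate findings. A minimal sketch follows, with a hypothetical `Finding` record that is not Shannon's internal data model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    title: str
    poc: Optional[str] = None  # copy-paste proof-of-concept, if one exists
    exploited: bool = False    # True only after the PoC worked against the target


def reportable(findings: list[Finding]) -> list[Finding]:
    """Keep only findings that were exploited and ship with a PoC."""
    return [f for f in findings if f.exploited and f.poc]


candidates = [
    Finding("SQL injection in login", poc="curl ...", exploited=True),
    Finding("Possible SSTI in profile page"),  # detected but never exploited
]
print([f.title for f in reportable(candidates)])
```

The trade-off of this design is visible in the failure analysis: a real vulnerability that the agent fails to exploit is dropped from the report rather than listed as unconfirmed.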

Benchmark methodology

  • Platform: Docker containers on dedicated test infrastructure
  • Network: Isolated network per challenge
  • Resources: 8 CPU cores, 32GB RAM per test
  • Model: Claude Sonnet 4.5 via Anthropic API
  • Mode: White-box (full source code access)
  • Human intervention: None (fully autonomous)
Benchmark preparation:
  1. Clone XBOW benchmark from official repository
  2. Systematically remove hints:
    • Rename descriptive variables to generic names
    • Remove all comments explaining vulnerabilities
    • Rename files to non-descriptive names
    • Remove artificial markers
  3. Validate hint removal through manual review
  4. Publish cleaned benchmark for reproducibility
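The variable-renaming part of hint removal can be sketched as a whole-word substitution. This is an illustrative, language-agnostic approximation with hypothetical names, not the script used to clean the benchmark. It also rewrites matches inside strings and comments, which is one reason a manual review step is still needed.

```python
import re


def rename_identifiers(source: str, mapping: dict[str, str]) -> str:
    """Replace whole-word identifiers in `source` according to `mapping`."""
    if not mapping:
        return source
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)


def leftover_hints(source: str, hint_words: list[str]) -> list[int]:
    """Return 1-based line numbers that still contain any hint word."""
    lowered = [w.lower() for w in hint_words]
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if any(word in line.lower() for word in lowered)
    ]


original = "isVulnerable = check(user)\nif isVulnerable:\n    run(user)\n"
cleaned = rename_identifiers(original, {"isVulnerable": "flag_1"})
print(cleaned)
print(leftover_hints(cleaned, ["vulnerab", "exploit"]))
```

`leftover_hints` is a crude aid for the validation step: it flags lines for a human to inspect and does not prove that every hint is gone.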
For each challenge:
  1. Start application in isolated container
  2. Run Shannon with full source code access
  3. Collect results:
    • Final security report
    • Complete agentic logs (turn-by-turn reasoning)
    • Exploitation evidence and screenshots
    • Session metrics (time, cost, API calls)
  4. Validate exploits by manually reproducing PoCs
  5. Publish all data transparently

Complete results package

Cleaned benchmark

All 104 challenges with hints removed - enables reproducible evaluation

Full results

Complete penetration test reports + turn-by-turn agentic logs for all 104 challenges

Statistical analysis

XBOW challenges are not officially difficulty-rated, but based on manual analysis:
  • Simple vulnerabilities (direct injection, obvious IDOR): 100% success (35/35)
  • Medium complexity (multi-step, require encoding): 94.1% success (48/51)
  • Complex (chained attacks, advanced obfuscation): 94.4% success (17/18)
  • Average time per challenge: 42 minutes
  • Median time: 38 minutes
  • Fastest: 12 minutes (simple SQL injection)
  • Slowest: 1.8 hours (complex multi-step SSRF → RCE)
  • Average API cost: $15.50 per challenge
  • Total benchmark cost: $1,612 (104 challenges × $15.50 avg)
Average per challenge:
  • Reconnaissance turns: 45 AI turns
  • Vulnerability analysis turns: 68 AI turns per vuln type
  • Exploitation turns: 112 AI turns per vuln type
  • Total AI turns: ~380 turns per challenge
  • Browser actions: 156 (login, navigation, form submission)
  • API requests: 1,240 to target application
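The tier rates and the total cost follow directly from the counts and averages listed above. A short sketch for recomputing them from the raw numbers (names are illustrative):

```python
# (succeeded, total) per manually assigned complexity tier, from the list above.
TIERS = {
    "simple": (35, 35),
    "medium": (48, 51),
    "complex": (17, 18),
}
CHALLENGES = 104
AVG_COST_USD = 15.50


def pct(succeeded: int, total: int) -> float:
    """Success rate as a percentage rounded to one decimal."""
    return round(100 * succeeded / total, 1)


tier_rates = {name: pct(s, t) for name, (s, t) in TIERS.items()}
tier_succeeded = sum(s for s, _ in TIERS.values())
tier_total = sum(t for _, t in TIERS.values())
total_cost = CHALLENGES * AVG_COST_USD

print(tier_rates)
print(f"{tier_succeeded}/{tier_total} challenges, total cost ${total_cost:,.2f}")
```

The tier counts add up to the same 100 of 104 reported in the headline result, which is a quick consistency check on the manual tiering.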

What’s next: Shannon Pro

The 4 failures directly inform Shannon’s roadmap:

Enhanced payload library

Expanded encoding techniques (JSFuck, Base64, Unicode) and obfuscation methods

Improved classification

Better detection logic for SSTI, file inclusion variants, and eval() risks

Advanced data flow analysis

Shannon Pro’s LLM-powered graph-based analysis for complex multi-file vulnerabilities

Enterprise features

CI/CD integration, dedicated support, compliance workflow integration
Shannon Pro is coming: While Shannon Lite uses a context-window-based approach, Shannon Pro features an advanced LLM-powered data flow analysis engine (inspired by the LLMDFA paper) for comprehensive, graph-based analysis of entire codebases.

Key takeaways

Consistent performance

Over 90% success across all vulnerability types shows balanced capability

Transparent failures

All 4 failures analyzed in detail with root cause and fixes planned

Production-ready

96.15% success demonstrates autonomous pentesting is ready for real-world use

Full transparency

Complete results, logs, and benchmark published for independent validation

Sample reports

Real penetration testing results from Juice Shop, ctal, and crAPI

Vulnerability coverage

Detailed breakdown of all vulnerability classes Shannon can detect

Run Shannon

Get started with Shannon in under 10 minutes

Architecture

Deep dive into Shannon’s multi-agent architecture
