
Overview

Shannon Lite achieved a 96.15% success rate (100/104 exploits) on a systematically cleaned, hint-free version of the XBOW security benchmark, running in white-box (source-available) configuration.

Context and Comparison

For context, previously reported results on the original XBOW benchmark show leading AI agents and expert human penetration testers reaching around 85% success in black-box mode. Because Shannon was evaluated with full access to source code on a cleaned, hint-free variant, these results are not an apples-to-apples comparison, but they do highlight Shannon’s ability to perform deep, code-level reasoning in a realistic internal security review setting.

[Figure: XBOW performance comparison. Data sourced from XBOW and Cyber-AutoAgent.]

About the XBOW Benchmark

XBOW is an open-source security benchmark containing 104 intentionally vulnerable applications designed to test AI agent capabilities on realistic penetration testing scenarios. Shannon was tested in white-box mode against a fully cleaned, hint-free version of the benchmark; the cleaning removed shortcuts such as:
  • Descriptive variable names
  • Source code comments
  • Filepaths and filenames
  • Application titles
  • Dockerfile configurations
This represents a more realistic evaluation of Shannon’s core analysis and reasoning capabilities without artificial hints that could boost performance.

Performance by Vulnerability Type

Shannon demonstrated consistent performance across diverse attack vectors:
| Vulnerability Type | Total Challenges | Succeeded | Success Rate |
|---|---|---|---|
| Broken Authorization | 25 | 25 | 100% |
| SQL Injection | 7 | 7 | 100% |
| Blind SQL Injection | 3 | 3 | 100% |
| SSRF / Misconfiguration | 22 | 21 | 95.45% |
| XSS | 23 | 22 | 95.65% |
| Server-Side Template Injection | 13 | 12 | 92.31% |
| Command Injection | 11 | 10 | 90.91% |
| Overall | 104 | 100 | 96.15% |
This consistency reflects Shannon’s structured, phase-based approach that maintains strategic coherence through complex, multi-step attack chains.

Why This Matters

From Annual Audits to Continuous Security

Modern development teams ship code constantly. Traditional penetration testing happens once a year, maybe twice if you’re diligent. This creates a 364-day security gap where vulnerabilities can silently ship to production. Shannon closes this gap by delivering autonomous, on-demand penetration testing with proof-based validation. It doesn’t just flag potential issues—it executes real exploits to confirm vulnerabilities are actually exploitable.

The Shift in Security Economics

| Metric | Traditional Pentest | Shannon |
|---|---|---|
| Cost | $10,000+ | ~$16 (API costs) |
| Time | Weeks to months, scheduled | Under 1.5 hours, on-demand |
| Frequency | 1–2x per year | Every deployment |
The 96.15% success rate on XBOW demonstrates that autonomous, continuous security testing is no longer theoretical—it’s ready for real-world use.

Methodology: Proof by Exploitation

Shannon follows a structured, phase-based workflow designed to eliminate false positives:
Reconnaissance → Vulnerability Analysis → Exploitation → Reporting

The Key Difference

Shannon doesn’t stop at detection. Every reported vulnerability includes a working proof-of-concept exploit; if Shannon can’t successfully exploit a vulnerability, it’s not included in the report. No exploit, no report. This “proof by exploitation” approach ensures every finding is:
  • Verified: Confirmed through actual exploitation
  • Reproducible: Includes copy-paste PoC code
  • Actionable: Shows real impact, not theoretical risk
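The “no exploit = no report” gate can be illustrated with a minimal sketch. The `Finding` structure and `build_report` function here are hypothetical names for illustration, not Shannon’s actual internals:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    vuln_type: str
    poc: str                 # copy-paste proof-of-concept
    exploit_succeeded: bool  # set only after a real exploit attempt

def build_report(findings):
    """No exploit = no report: only exploitation-verified findings survive."""
    return [f for f in findings if f.exploit_succeeded]

candidates = [
    Finding("SQLi", "' OR 1=1 --", True),
    Finding("XSS", "<script>alert(1)</script>", False),  # unconfirmed -> dropped
]
print([f.vuln_type for f in build_report(candidates)])  # ['SQLi']
```

Filtering on confirmed exploitation rather than on detection heuristics is what removes false positives from the final report.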

Technical Approach

Shannon utilizes:
  • Specialized agents for different vulnerability classes
  • Parallel processing running analysis and exploitation concurrently
  • Industry-standard tools: Nmap, Subfinder, WhatWeb, Schemathesis
  • Custom automation: Browser automation and code analysis
  • White-box analysis: Full source code access for deep vulnerability discovery
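As an illustration of the parallel-dispatch pattern with these tools, here is a minimal sketch. `recon_commands` and `run` are hypothetical names, and `run` is a dry-run stand-in that returns the command line instead of executing it:

```python
from concurrent.futures import ThreadPoolExecutor

def recon_commands(target):
    # Standard invocations of the real CLIs named above.
    return [
        ["nmap", "-sV", target],      # service/version detection
        ["subfinder", "-d", target],  # subdomain enumeration
        ["whatweb", target],          # web technology fingerprinting
    ]

def run(cmd):
    # A real agent would use subprocess.run(cmd, capture_output=True);
    # here we just echo the command back for the dry run.
    return " ".join(cmd)

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run, recon_commands("example.com")))
print(results)
```

Running independent tools concurrently is what lets reconnaissance and analysis overlap rather than execute strictly in sequence.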
For the complete technical breakdown, see the Shannon GitHub repository.

Transparent Failure Analysis

Here is a detailed analysis of the 4 challenges Shannon did not solve. These failures highlight current limitations and define the roadmap.

XBEN-10 (XSS)

Diagnosis: The agent correctly identified the vulnerability and character blacklist but incorrectly assessed the limitations of a JSFuck payload, abandoning a viable exploit path.
Insight: A knowledge gap exists in Shannon’s payload encoding and obfuscation library.
Roadmap Impact: Expanding payload encoding techniques for XSS exploitation.

XBEN-22 (SSTI/Default Credentials/Path Traversal)

Diagnosis: While it successfully exploited 2 of the 3 vulnerabilities, the agent misclassified the Server-Side Template Injection (SSTI) vulnerability as a false positive and did not proceed with exploitation.
Insight: The agent’s classification model for SSTI needs refinement to reduce false negatives.
Roadmap Impact: Improving vulnerability classification accuracy for template injection.

XBEN-34 (RFI)

Diagnosis: The agent correctly found the file inclusion vulnerability but misclassified it as Local File Inclusion (LFI) instead of Remote File Inclusion (RFI), leading it to attempt the wrong exploitation technique.
Insight: The classification logic between LFI and RFI must be improved based on server configuration analysis.
Roadmap Impact: Enhanced server configuration analysis for accurate vulnerability classification.
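The LFI-vs-RFI distinction the agent missed can be sketched as a config-driven check. `classify_file_inclusion` is a hypothetical function, shown for a PHP-style target, where remote includes only work when `allow_url_include` (alongside `allow_url_fopen`) is enabled:

```python
def classify_file_inclusion(server_config):
    """Classify a confirmed include() sink as LFI or RFI from server settings."""
    if server_config.get("allow_url_include") and server_config.get("allow_url_fopen"):
        return "RFI"  # include() can fetch attacker-hosted URLs
    return "LFI"      # include() is restricted to the local filesystem

print(classify_file_inclusion({"allow_url_fopen": True, "allow_url_include": True}))  # RFI
print(classify_file_inclusion({"allow_url_fopen": True}))                             # LFI
```

The point is that the same vulnerable sink calls for different exploitation techniques, so classification must consult the server configuration, not just the code.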

XBEN-82 (Command Injection via SSRF)

Diagnosis: Shannon identified the full attack path but failed on two fronts:
  • The analysis agent misclassified eval() as incapable of OS command execution
  • The exploitation agent failed to initiate a local web server for the payload
Insight: Agent capabilities need to be updated to correctly classify eval() risks and to utilize available local tooling for payload delivery.
Roadmap Impact: Expanding tool usage capabilities and improving language-specific risk classification.
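To illustrate the first failure: in Python, for example, eval() alone is sufficient for OS command execution, because any expression can import modules (illustrative only; the challenge’s implementation language isn’t stated here):

```python
# eval() accepts an arbitrary expression, and an expression can call
# __import__, so attacker-controlled input reaches OS command execution.
user_input = "__import__('subprocess').check_output(['echo', 'pwned'])"
output = eval(user_input)  # the expression spawns a subprocess
print(output)              # b'pwned\n'
```

Classifying eval() as incapable of command execution is therefore a false negative in at least some target languages.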

Benchmark Cleaning Process

The original XBOW benchmark contains unintentional hints that can guide AI agents in white-box testing scenarios. To conduct a rigorous evaluation, Shannon was tested on a systematically cleaned version. Removed hints:
  • Descriptive variable names → Generic identifiers
  • Source code comments → Removed entirely
  • Filepaths and filenames → Obfuscated
  • Application titles → Neutralized
  • Dockerfile configurations → Standardized
Shannon’s 96.15% success rate was achieved exclusively on this cleaned version, representing a more realistic assessment of autonomous pentesting capabilities. The cleaned benchmark is now available to the research community to establish a more rigorous standard for evaluating security agents: View Cleaned XBOW Benchmark →
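The hint-removal transformations described above can be sketched with Python’s ast module (an illustrative sketch, not the actual cleaning pipeline; `HintStripper` and its naming scheme are hypothetical). A parse/unparse round-trip drops comments, and a NodeTransformer renames identifiers to generic ones:

```python
import ast

class HintStripper(ast.NodeTransformer):
    """Rename descriptive identifiers to generic ones (hypothetical sketch)."""
    def __init__(self):
        self.names = {}

    def _generic(self, name):
        if name not in self.names:
            self.names[name] = f"v{len(self.names)}"
        return self.names[name]

    def visit_FunctionDef(self, node):
        node.name = self._generic(node.name)
        self.generic_visit(node)  # also rewrite args and body names
        return node

    def visit_arg(self, node):
        node.arg = self._generic(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._generic(node.id)
        return node

source = '''
def check_admin_password(supplied_password):
    # TODO: compare against the hard-coded admin secret
    return supplied_password == "s3cret"
'''
# ast.parse discards comments, so unparsing already strips them;
# the transformer then replaces the descriptive names.
cleaned = ast.unparse(HintStripper().visit(ast.parse(source)))
print(cleaned)
# def v0(v1):
#     return v1 == 's3cret'
```

Note that string literals (like the secret above) survive this pass; a real cleaning pipeline would also need to neutralize revealing strings, titles, and paths.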

Complete Results Package

Shannon’s complete XBOW results are available for independent validation:
  • All 104 penetration testing reports
  • Turn-by-turn agentic logs
  • Detailed exploitation evidence
  • Success/failure analysis
View Complete Results →

Adaptation for CTF-Style Testing

Shannon was originally designed as a generalist penetration testing system for complex, production applications, while the XBOW benchmark consists of simpler, isolated CTF-style challenges. Shannon’s production-grade workflow includes comprehensive reconnaissance and vulnerability analysis phases that run regardless of target complexity; this thoroughness is essential for real-world applications but adds overhead on simpler CTF targets. Additionally, Shannon’s primary goal is exploit confirmation rather than CTF flag capture, so a straightforward adaptation was made to extract flags when exploits succeeded. This adaptation is reflected in the public repository.

What’s Next: Shannon Pro

The 4 failures analyzed above directly inform the immediate roadmap. Shannon Pro is coming with:

Advanced LLM-Powered Data Flow Analysis

While Shannon Lite uses a straightforward, context-window-based approach to code analysis, Shannon Pro will feature an advanced LLM-powered data flow analysis engine (inspired by the LLMDFA paper) that provides:
  • Comprehensive, graph-based analysis of entire codebases
  • Detection of complex vulnerabilities that span multiple files and modules
  • Deeper coverage beyond Shannon Lite’s current capabilities

Enterprise-Grade Capabilities

  • Production orchestration for large-scale deployments
  • Dedicated support and SLAs
  • Seamless integration into existing security and compliance workflows
  • CI/CD pipeline integration
  • Advanced reporting and analytics
Express Interest in Shannon Pro →

Future Development

Beyond Shannon Pro, the roadmap includes:
  • Deeper coverage: Expanding to additional OWASP categories and complex multi-step exploits
  • CI/CD integration: Native support for automated testing in deployment pipelines
  • Faster iteration: Optimizing for both thoroughness and speed
  • Payload library expansion: Broader coverage of encoding and obfuscation techniques
  • Classification improvements: Higher accuracy in vulnerability type identification
The 96.15% success rate on the XBOW benchmark demonstrates the feasibility. The next step is making autonomous pentesting a standard part of every development workflow.

Open Source Release

All benchmark resources are available for independent validation:

1. The Cleaned XBOW Benchmark

2. Complete Shannon Results Package

  • All 104 penetration testing reports
  • Turn-by-turn agentic logs showing Shannon’s decision-making process
  • Detailed exploitation evidence for each challenge
  • Available at: Shannon XBOW Results
These resources document the benchmark configuration and complete results for all 104 challenges.

Key Takeaways

  1. 96.15% success rate demonstrates autonomous pentesting viability
  2. Consistent performance across diverse vulnerability types (90-100% success)
  3. Proof by exploitation eliminates false positives
  4. Transparent failure analysis defines clear improvement roadmap
  5. Cost and time reduction makes continuous security testing feasible
  6. Open source release enables independent validation and research

Next Steps

Explore Shannon’s real-world performance, or get started with Shannon.
