Benchmark Overview
We benchmarked bdg (CLI) against the Chrome DevTools MCP Server on five real developer debugging workflows to answer one question: which interface paradigm serves AI agents better?
- CLI score: 77/100 points
- MCP score: 60/100 points
- Token efficiency: +33% for CLI (202.1 vs 152.3 TES)
- Time: MCP 27% faster (323s vs 441s)
Test Suite
| Test | Difficulty | Task | Time Limit |
|---|---|---|---|
| Basic Error | ⭐ Easy | Find and diagnose one JS error | 2 min |
| Multiple Errors | ⭐⭐ Moderate | Capture and categorize 5+ errors | 3 min |
| SPA Debugging | ⭐⭐⭐ Advanced | Debug React app, correlate console/network | 5 min |
| Form Validation | ⭐⭐⭐⭐ Expert | Test validation logic, find bugs | 5 min |
| Memory Leak | ⭐⭐⭐⭐⭐ Master | Detect and quantify DOM memory leak | 8 min |
Methodology
- Same URLs and time limits for both tools
- Alternating test order to prevent learning bias
- Metrics: task score, completion time, tokens consumed
- Token Efficiency Score (TES) = (Score × 100) / (Tokens / 1000)
Overall Results
| Metric | bdg (CLI) | MCP | Difference |
|---|---|---|---|
| Total Score | 77/100 | 60/100 | +28% |
| Total Time | 441s | 323s | +37% slower |
| Total Tokens | ~38.1K | ~39.4K | -3% fewer |
| Token Efficiency | 202.1 | 152.3 | +33% |
Winner: CLI. A higher score at similar token usage means better efficiency.
Test-by-Test Analysis
Test 1: Basic Error Detection (⭐ Easy)
| Metric | bdg | MCP |
|---|---|---|
| Score | 18/20 | 14/20 |
| Time | 69s | 46s |
| Tokens | ~3.6K | ~4.8K |
| Penalty | -2 pts | 0 |
Takeaway: bdg’s structured JSON with full stack traces saves developer investigation time.
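Structured error output like this can be consumed directly by an agent. A minimal sketch, assuming a hypothetical bdg-style JSON error record (the field names here are illustrative, not bdg's documented schema):

```python
import json

# Hypothetical bdg-style structured error record; the field names
# are illustrative, not bdg's documented schema.
raw = '''{
  "type": "error",
  "message": "TypeError: Cannot read properties of undefined (reading 'id')",
  "stack": [
    {"function": "renderUser", "url": "app.js", "line": 42, "column": 13},
    {"function": "main", "url": "app.js", "line": 7, "column": 1}
  ]
}'''

record = json.loads(raw)
# Pull out the top frame so an agent can jump straight to the failing line.
top = record["stack"][0]
print(f"{record['message']} at {top['url']}:{top['line']} in {top['function']}()")
```

With line numbers and function names already parsed, no log scraping or regex guessing is needed.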
Test 2: Multiple Error Collection (⭐⭐ Moderate)
| Metric | bdg | MCP |
|---|---|---|
| Score | 18/20 | 12/20 |
| Time | 75s | 48s |
| Tokens | ~18.7K | ~9.3K |
| Errors Found | 18 (14 unique) | 3 |
Test 3: SPA Debugging (⭐⭐⭐ Advanced)
| Metric | bdg | MCP |
|---|---|---|
| Score | 14/20 | 13/20 |
| Time | 100s | 57s |
| Tokens | ~4.7K | ~6.6K |
Takeaway: When an application has no bugs to find, tools perform similarly. bdg’s advantage comes from deeper analysis capabilities.
Test 4: Form Validation Testing (⭐⭐⭐⭐ Expert)
| Metric | bdg | MCP |
|---|---|---|
| Score | 15/20 | 13/20 |
| Time | 93s | 102s |
| Tokens | ~3.5K | ~15.2K |
| Penalty | 0 | -2 pts (timeout) |
Test 5: Memory Leak Detection (⭐⭐⭐⭐⭐ Master)
| Metric | bdg | MCP |
|---|---|---|
| Score | 12/20 | 8/20 |
| Time | 104s | 70s |
| Tokens | ~7.6K | ~3.5K |
Critical advantage: Memory debugging requires CDP access - bdg’s core strength, MCP’s blind spot
Capability Comparison
Coverage Matrix
| CDP Domain | bdg Access | MCP Access |
|---|---|---|
| Runtime | Full (21 methods) | evaluate_script only |
| DOM | Full (39 methods) | Via accessibility tree |
| Network | Full + HAR export | list_network_requests |
| Console | Full + streaming | list_console_messages |
| HeapProfiler | Full (17 methods) | None |
| Debugger | Full (35 methods) | None |
| Performance | Full (22 methods) | None |
| Accessibility | Selective queries | Full tree dumps |
Feature Comparison
bdg Strengths
- ✓ Comprehensive error data - full stack traces, line numbers, function names
- ✓ Efficient batch operations - JavaScript eval for clicking multiple elements
- ✓ Advanced CDP access - direct access to HeapProfiler, Runtime, all 53 domains
- ✓ Structured output - JSON format with detailed error categorization
- ✓ Network analysis - HAR export capability for detailed request inspection
- ✓ Memory profiling - native heap measurement via CDP
- ✓ Unix composability - pipes naturally with jq, grep, etc.
- ✓ Selective queries - fetch only what’s needed, control token usage
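HAR (HTTP Archive) is a standard JSON format, so an exported capture can be filtered in a few lines. A sketch over a minimal inline HAR fragment (the requests below are invented for illustration):

```python
import json

# Minimal HAR fragment; HAR is a standard JSON format with
# log.entries holding one record per request. Requests invented.
har = json.loads('''{
  "log": {"entries": [
    {"request": {"url": "https://example.com/api/users"},
     "response": {"status": 500}, "time": 812.4},
    {"request": {"url": "https://example.com/app.js"},
     "response": {"status": 200}, "time": 95.1}
  ]}
}''')

# Surface failing or slow requests, the kind of filter an agent
# might otherwise express with jq against a HAR export.
for entry in har["log"]["entries"]:
    status = entry["response"]["status"]
    if status >= 400 or entry["time"] > 500:
        print(f"{status} {entry['request']['url']} ({entry['time']:.0f} ms)")
```

Because the format is standard, the same filter works on HAR files from any exporter, not just bdg.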
Token Efficiency Analysis
Test-by-Test Token Usage
| Test | bdg Tokens | MCP Tokens | Ratio | bdg Score | MCP Score |
|---|---|---|---|---|---|
| Test 1: Basic Error | 3,600 | 4,800 | 0.75× | 18/20 | 14/20 |
| Test 2: Multiple Errors | 18,700 | 9,300 | 2.0× | 18/20 | 12/20 |
| Test 3: SPA Debugging | 4,700 | 6,600 | 0.71× | 14/20 | 13/20 |
| Test 4: Form Validation | 3,500 | 15,200 | 0.23× | 15/20 | 13/20 |
| Test 5: Memory Leak | 7,600 | 3,500 | 2.17× | 12/20 | 8/20 |
| Total | 38,100 | 39,400 | 0.97× | 77/100 | 60/100 |
Similar total tokens, but bdg achieved a 28% higher score - tokens spent more effectively.
Token Efficiency Score (TES)
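The TES values quoted throughout this page follow from the formula in Methodology. A sketch recomputing the headline figures from the overall totals above:

```python
# Token Efficiency Score: TES = (score * 100) / (tokens / 1000),
# i.e. points per thousand tokens, scaled by 100.
# Totals taken from the Overall Results table above.
def tes(score: int, tokens: int) -> float:
    return (score * 100) / (tokens / 1000)

print(f"bdg: {tes(77, 38_100):.1f}")   # 202.1
print(f"MCP: {tes(60, 39_400):.1f}")   # 152.3
```

The same function applied per test reproduces the test-by-test ratios in the token usage table.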
Real-World Token Examples
Amazon product page:
- MCP: 52,000 tokens (single snapshot, truncated at system limit)
- bdg: 1,200 tokens (two targeted queries)
- 43× more efficient
195-option dropdown:
- MCP: ~5,000 tokens per interaction (full accessibility tree)
- bdg: ~50 tokens per interaction (targeted selector)
- 100× more efficient
When to Use Each Tool
Use bdg for:
- Complex debugging - detailed error analysis with full stack traces
- Memory profiling - heap analysis and memory leak detection
- Automated testing - structured JSON output for test automation
- Network analysis - HAR export for detailed request inspection
- Batch operations - JavaScript evaluation for efficient testing
- Token efficiency - context-constrained agents, cost optimization
Use MCP when ecosystem integration and sandboxing matter more than debugging depth (see Conclusion).
Key Findings
1. Token efficiency through selective queries
CLI’s selective query approach (fetch only what you need) vs MCP’s bulk dumps (dump everything) resulted in 33% better token efficiency while achieving higher scores. Example: a 195-option dropdown consumed 50 tokens with bdg vs 5,000 tokens with MCP on every interaction.
2. Capability coverage matters
Memory profiling, HAR export, and batch JavaScript execution are core debugging workflows, not edge features. bdg’s 300+ CDP method access vs MCP’s limited tool set was decisive.
3. Unix composability extends capability
bdg’s output pipes naturally with jq, grep, sort - enabling transformations and filtering without additional implementation. MCP responses require internal parsing.
4. Predictable output size
Agents can estimate bdg token cost before calling (proportional to results, bounded by limits). MCP’s snapshots are unpredictable (5K-52K tokens).
5. Speed vs depth trade-off
MCP was 27% faster but provided less actionable information. bdg’s extra time bought comprehensive data that saves developer investigation time.
Conclusion
For developer debugging workflows, CLI provides more capability with better efficiency:
- +28% higher score (77 vs 60)
- +33% better token efficiency (202.1 vs 152.3 TES)
- Capabilities MCP lacks: Memory profiling, HAR export, batch operations
- Token efficiency wins: Form testing (4.3× better), selective queries (43× better)
This doesn’t mean MCP is “bad” - it optimizes for different constraints (ecosystem integration, sandboxing). For power-user debugging, CLI is superior.
Full Benchmark Documentation
- Benchmark Article - complete MCP vs CLI analysis with detailed explanations
- Benchmark Results - raw benchmark data and test-by-test breakdowns
Next Steps
- Overview - why bdg is built for AI agents
- Discovery Pattern - how agents discover capabilities
- Error Handling - semantic exit codes and recovery

