Benchmark Overview
We benchmarked bdg (CLI) against the Chrome DevTools MCP Server on five real developer debugging workflows to answer one question: which interface paradigm serves AI agents better?
- CLI score: 77/100 points
- MCP score: 60/100 points
- Token efficiency: +33% for CLI (202.1 vs 152.3 TES)
- Time: MCP 27% faster (323s vs 441s)
Test Suite
| Test | Difficulty | Task | Time Limit |
|---|---|---|---|
| Basic Error | ⭐ Easy | Find and diagnose one JS error | 2 min |
| Multiple Errors | ⭐⭐ Moderate | Capture and categorize 5+ errors | 3 min |
| SPA Debugging | ⭐⭐⭐ Advanced | Debug React app, correlate console/network | 5 min |
| Form Validation | ⭐⭐⭐⭐ Expert | Test validation logic, find bugs | 5 min |
| Memory Leak | ⭐⭐⭐⭐⭐ Master | Detect and quantify DOM memory leak | 8 min |
Methodology
- Same URLs and time limits for both tools
- Alternating test order to prevent learning bias
- Metrics: task score, completion time, tokens consumed
- Token Efficiency Score (TES) = (Score × 100) / (Tokens / 1000)
Overall Results
| Metric | bdg (CLI) | MCP | Difference |
|---|---|---|---|
| Total Score | 77/100 | 60/100 | +28% |
| Total Time | 441s | 323s | +37% slower |
| Total Tokens | ~38.1K | ~39.4K | -3% fewer |
| Token Efficiency | 202.1 | 152.3 | +33% |
Winner: CLI. A higher score at similar token usage means better efficiency.
Test-by-Test Analysis
Test 1: Basic Error Detection (⭐ Easy)
| Metric | bdg | MCP |
|---|---|---|
| Score | 18/20 | 14/20 |
| Time | 69s | 46s |
| Tokens | ~3.6K | ~4.8K |
| Penalty | -2 pts | 0 |
Takeaway: bdg’s structured JSON with full stack traces saves developer investigation time.
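Structured error output like this can be consumed directly by an agent. A minimal sketch, assuming a hypothetical bdg-style JSON error record (the field names here are illustrative, not bdg's documented schema):

```python
import json

# Hypothetical bdg-style structured error record; the field names
# are illustrative, not bdg's documented schema.
raw = '''{
  "type": "error",
  "message": "TypeError: Cannot read properties of undefined (reading 'id')",
  "stack": [
    {"function": "renderUser", "url": "app.js", "line": 42, "column": 13},
    {"function": "main", "url": "app.js", "line": 7, "column": 1}
  ]
}'''

record = json.loads(raw)
# Pull out the top frame so an agent can jump straight to the failing line.
top = record["stack"][0]
print(f"{record['message']} at {top['url']}:{top['line']} in {top['function']}()")
```

With line numbers and function names already parsed, no log scraping or regex guessing is needed.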
Test 2: Multiple Error Collection (⭐⭐ Moderate)
| Metric | bdg | MCP |
|---|---|---|
| Score | 18/20 | 12/20 |
| Time | 75s | 48s |
| Tokens | ~18.7K | ~9.3K |
| Errors Found | 18 (14 unique) | 3 |
Test 3: SPA Debugging (⭐⭐⭐ Advanced)
| Metric | bdg | MCP |
|---|---|---|
| Score | 14/20 | 13/20 |
| Time | 100s | 57s |
| Tokens | ~4.7K | ~6.6K |
Takeaway: When an application has no bugs to find, tools perform similarly. bdg’s advantage comes from deeper analysis capabilities.
Test 4: Form Validation Testing (⭐⭐⭐⭐ Expert)
| Metric | bdg | MCP |
|---|---|---|
| Score | 15/20 | 13/20 |
| Time | 93s | 102s |
| Tokens | ~3.5K | ~15.2K |
| Penalty | 0 | -2 pts (timeout) |
Test 5: Memory Leak Detection (⭐⭐⭐⭐⭐ Master)
| Metric | bdg | MCP |
|---|---|---|
| Score | 12/20 | 8/20 |
| Time | 104s | 70s |
| Tokens | ~7.6K | ~3.5K |
Critical advantage: Memory debugging requires CDP access - bdg’s core strength, MCP’s blind spot
Capability Comparison
Coverage Matrix
| CDP Domain | bdg Access | MCP Access |
|---|---|---|
| Runtime | Full (21 methods) | evaluate_script only |
| DOM | Full (39 methods) | Via accessibility tree |
| Network | Full + HAR export | list_network_requests |
| Console | Full + streaming | list_console_messages |
| HeapProfiler | Full (17 methods) | None |
| Debugger | Full (35 methods) | None |
| Performance | Full (22 methods) | None |
| Accessibility | Selective queries | Full tree dumps |
Feature Comparison
bdg Strengths
- ✓ Comprehensive error data - full stack traces, line numbers, function names
- ✓ Efficient batch operations - JavaScript eval for clicking multiple elements
- ✓ Advanced CDP access - direct access to HeapProfiler, Runtime, all 53 domains
- ✓ Structured output - JSON format with detailed error categorization
- ✓ Network analysis - HAR export capability for detailed request inspection
- ✓ Memory profiling - native heap measurement via CDP
- ✓ Unix composability - pipes naturally with jq, grep, etc.
- ✓ Selective queries - fetch only what’s needed, control token usage
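HAR (HTTP Archive) is a standard JSON format, so an exported capture can be filtered in a few lines. A sketch over a minimal inline HAR fragment (the requests below are invented for illustration):

```python
import json

# Minimal HAR fragment; HAR is a standard JSON format with
# log.entries holding one record per request. Requests invented.
har = json.loads('''{
  "log": {"entries": [
    {"request": {"url": "https://example.com/api/users"},
     "response": {"status": 500}, "time": 812.4},
    {"request": {"url": "https://example.com/app.js"},
     "response": {"status": 200}, "time": 95.1}
  ]}
}''')

# Surface failing or slow requests, the kind of filter an agent
# might otherwise express with jq against a HAR export.
for entry in har["log"]["entries"]:
    status = entry["response"]["status"]
    if status >= 400 or entry["time"] > 500:
        print(f"{status} {entry['request']['url']} ({entry['time']:.0f} ms)")
```

Because the format is standard, the same filter works on HAR files from any exporter, not just bdg.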
Token Efficiency Analysis
Test-by-Test Token Usage
| Test | bdg Tokens | MCP Tokens | Ratio | bdg Score | MCP Score |
|---|---|---|---|---|---|
| Test 1: Basic Error | 3,600 | 4,800 | 0.75× | 18/20 | 14/20 |
| Test 2: Multiple Errors | 18,700 | 9,300 | 2.0× | 18/20 | 12/20 |
| Test 3: SPA Debugging | 4,700 | 6,600 | 0.71× | 14/20 | 13/20 |
| Test 4: Form Validation | 3,500 | 15,200 | 0.23× | 15/20 | 13/20 |
| Test 5: Memory Leak | 7,600 | 3,500 | 2.17× | 12/20 | 8/20 |
| Total | 38,100 | 39,400 | 0.97× | 77/100 | 60/100 |
Similar total tokens, but bdg achieved a 28% higher score - tokens spent more effectively.
Token Efficiency Score (TES)
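The TES values quoted throughout this page follow from the formula in Methodology. A sketch recomputing the headline figures from the overall totals above:

```python
# Token Efficiency Score: TES = (score * 100) / (tokens / 1000),
# i.e. points per thousand tokens, scaled by 100.
# Totals taken from the Overall Results table above.
def tes(score: int, tokens: int) -> float:
    return (score * 100) / (tokens / 1000)

print(f"bdg: {tes(77, 38_100):.1f}")   # 202.1
print(f"MCP: {tes(60, 39_400):.1f}")   # 152.3
```

The same function applied per test reproduces the test-by-test ratios in the token usage table.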
Real-World Token Examples
Amazon product page:
- MCP: 52,000 tokens (single snapshot, truncated at system limit)
- bdg: 1,200 tokens (two targeted queries)
- 43× more efficient
195-option dropdown:
- MCP: ~5,000 tokens per interaction (full accessibility tree)
- bdg: ~50 tokens per interaction (targeted selector)
- 100× more efficient
When to Use Each Tool
Use bdg for:
- Complex debugging - detailed error analysis with full stack traces
- Memory profiling - heap analysis and memory leak detection
- Automated testing - structured JSON output for test automation
- Network analysis - HAR export for detailed request inspection
- Batch operations - JavaScript evaluation for efficient testing
- Token efficiency - context-constrained agents, cost optimization
Use MCP when ecosystem integration and sandboxing matter more than debugging depth (see Conclusion).
Key Findings
1. Token efficiency through selective queries
CLI’s selective query approach (fetch only what you need) vs MCP’s bulk dumps (dump everything) resulted in 33% better token efficiency while achieving higher scores. Example: a 195-option dropdown consumed 50 tokens with bdg vs 5,000 tokens with MCP on every interaction.
2. Capability coverage matters
Memory profiling, HAR export, and batch JavaScript execution are core debugging workflows, not edge features. bdg’s 300+ CDP method access vs MCP’s limited tool set was decisive.
3. Unix composability extends capability
bdg’s output pipes naturally with jq, grep, sort - enabling transformations and filtering without additional implementation. MCP responses require internal parsing.
4. Predictable output size
Agents can estimate bdg token cost before calling (proportional to results, bounded by limits). MCP’s snapshots are unpredictable (5K-52K tokens).
5. Speed vs depth trade-off
MCP was 27% faster but provided less actionable information. bdg’s extra time bought comprehensive data that saves developer investigation time.
Conclusion
For developer debugging workflows, CLI provides more capability with better efficiency:
- +28% higher score (77 vs 60)
- +33% better token efficiency (202.1 vs 152.3 TES)
- Capabilities MCP lacks: Memory profiling, HAR export, batch operations
- Token efficiency wins: Form testing (4.3× better), selective queries (43× better)
This doesn’t mean MCP is “bad” - it optimizes for different constraints (ecosystem integration, sandboxing). For power-user debugging, CLI is superior.
Full Benchmark Documentation
- Benchmark Article - complete MCP vs CLI analysis with detailed explanations
- Benchmark Results - raw benchmark data and test-by-test breakdowns
Next Steps
- Overview - why bdg is built for AI agents
- Discovery Pattern - how agents discover capabilities
- Error Handling - semantic exit codes and recovery

