Overview
Coraza Proxy includes simple but effective bot detection based on User-Agent string matching. This helps protect your applications from unwanted crawlers, scrapers, and automated tools.

Environment Variables
Enable or disable bot blocking. Set to true to activate User-Agent-based bot filtering.

PROXY_BOTS: a comma-separated list of bot identifiers to block. These strings are matched case-insensitively against the User-Agent header.
How It Works
Initialization
Bot blocking is configured at startup (from main.go:375-376):
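The referenced snippet is not reproduced here; the following is a minimal sketch of what that startup configuration could look like. The function name and the PROXY_BLOCK_BOTS enable flag are assumptions for illustration; only PROXY_BOTS appears in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// loadBotConfig reads bot-blocking settings from the environment.
// getenv is injected so the function is easy to exercise; in the proxy
// it would be os.Getenv. The enable-flag name is an assumption.
func loadBotConfig(getenv func(string) string) (enabled bool, bots []string) {
	enabled = strings.EqualFold(getenv("PROXY_BLOCK_BOTS"), "true")
	for _, b := range strings.Split(getenv("PROXY_BOTS"), ",") {
		if b = strings.TrimSpace(strings.ToLower(b)); b != "" {
			bots = append(bots, b)
		}
	}
	return enabled, bots
}

func main() {
	env := map[string]string{
		"PROXY_BLOCK_BOTS": "true",
		"PROXY_BOTS":       "python,googlebot,bingbot",
	}
	enabled, bots := loadBotConfig(func(k string) string { return env[k] })
	fmt.Println(enabled, bots) // true [python googlebot bingbot]
}
```

Normalizing the list to lowercase once at startup keeps the per-request check cheap.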
Detection Logic
The bot detection happens early in the request pipeline (from main.go:447-458):
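The code at main.go:447-458 is not shown here; a hedged sketch of how such an early check could look as middleware (the handler shape, names, and 403 response are assumptions):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

var botPatterns = []string{"python", "googlebot", "bingbot", "yandex", "baiduspider"}

// blockBots rejects matching User-Agents before any further
// processing (e.g. WAF evaluation, proxying) takes place.
func blockBots(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := strings.ToLower(r.UserAgent())
		for _, p := range botPatterns {
			if strings.Contains(ua, p) {
				log.Printf("blocked bot %q from %s", r.UserAgent(), r.RemoteAddr)
				http.Error(w, "Forbidden", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor exercises the middleware with a given User-Agent.
func statusFor(ua string) int {
	h := blockBots(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest("GET", "http://example.com/", nil)
	req.Header.Set("User-Agent", ua)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("python-requests/2.31.0"))      // 403
	fmt.Println(statusFor("Mozilla/5.0 (X11; Linux x86_64)")) // 200
}
```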
Matching Algorithm
- Convert the User-Agent header to lowercase
- Split PROXY_BOTS by comma
- Check if any bot identifier appears in the User-Agent string
- Block if a match is found (case-insensitive substring match)
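The steps above can be sketched as a standalone function (a simplified reconstruction, not the proxy's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// isBot lowercases the User-Agent, splits the PROXY_BOTS value on
// commas, and reports a case-insensitive substring match.
func isBot(userAgent, proxyBots string) bool {
	ua := strings.ToLower(userAgent)
	for _, pattern := range strings.Split(proxyBots, ",") {
		pattern = strings.TrimSpace(strings.ToLower(pattern))
		if pattern != "" && strings.Contains(ua, pattern) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBot("python-requests/2.31.0", "python,googlebot"))        // true
	fmt.Println(isBot("Mozilla/5.0 (X11; Linux x86_64)", "python,googlebot")) // false
}
```

Because this is substring matching, a pattern like "bot" also matches "robot" — a point revisited under Limitations below.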
Default Bot List
The default configuration blocks these common bots:

| Identifier | Description |
|---|---|
| python | Python HTTP libraries (requests, urllib, etc.) |
| googlebot | Google’s web crawler |
| bingbot | Microsoft Bing crawler |
| yandex | Yandex search engine crawler |
| baiduspider | Baidu search engine crawler |
Configuration Examples
Block Common Scrapers
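A plausible value (the identifiers are illustrative, not from the source):

```
PROXY_BOTS=python,scrapy,curl,wget,go-http-client
```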
Block Search Engine Crawlers
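For example (illustrative):

```
PROXY_BOTS=googlebot,bingbot,yandex,baiduspider,duckduckbot
```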
Block Only Automated Tools
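For example, a list targeting HTTP clients and libraries rather than crawlers (illustrative):

```
PROXY_BOTS=curl,wget,python,java,go-http-client
```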
Allow Search Engines, Block Others
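Since matching is a blocklist, allowing search engines simply means leaving their identifiers out (illustrative):

```
# googlebot, bingbot, etc. deliberately omitted
PROXY_BOTS=python,curl,wget,scrapy
```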
Extended Bot List
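An extended list might look like this (illustrative):

```
PROXY_BOTS=python,scrapy,curl,wget,go-http-client,java,googlebot,bingbot,yandex,baiduspider,phantomjs,headlesschrome,spider,crawler
```

Generic patterns like "spider" and "crawler" widen coverage but also raise the false-positive risk noted under Limitations.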
From .env.example
The project’s example configuration:

User-Agent Examples
Blocked User-Agents
With default settings, these would be blocked:

Allowed User-Agents
These would pass through:

Response Format
When a bot is detected and blocked:

Logging
Bot blocks are logged with the client IP:

Docker Compose Example
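A sketch of a Compose service (the image name, port, and service name are assumptions; only PROXY_BOTS is from this document):

```yaml
services:
  coraza-proxy:
    image: coraza-proxy:latest
    ports:
      - "8080:8080"
    environment:
      - PROXY_BOTS=python,googlebot,bingbot,yandex,baiduspider
```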
Common Bot Identifiers
Automation Tools
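Identifiers commonly found in these User-Agents (illustrative):

```
curl
wget
python-requests
go-http-client
java
```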
Web Scraping Frameworks
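For example (illustrative):

```
scrapy
httrack
```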
Headless Browsers
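For example (illustrative):

```
headlesschrome
phantomjs
```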
Search Engine Crawlers
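For example (illustrative):

```
googlebot
bingbot
yandex
baiduspider
duckduckbot
```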
Social Media Bots
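For example (illustrative):

```
facebookexternalhit
twitterbot
linkedinbot
telegrambot
```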
Generic Patterns
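For example (illustrative):

```
bot
spider
crawler
scan
```

Remember that these are substring matches, so "bot" also matches "robot" and "UptimeRobot".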
Advanced Patterns
Allow Monitoring Services
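One way to express this (illustrative): keep monitoring-service tokens out of the list and avoid over-broad patterns that would match them:

```
# uptimerobot and pingdom are NOT listed, and the generic "bot"
# pattern is avoided because it would match "UptimeRobot"
PROXY_BOTS=python,scrapy,curl,wget,googlebot,bingbot
```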
Block Aggressive Scrapers Only
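For example, a list limited to scraping tools so search engines and monitoring services still pass (illustrative):

```
PROXY_BOTS=scrapy,python-requests,go-http-client,curl,wget,httrack
```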
Testing
Bot detection can be tested with curl by sending requests with different User-Agent headers.

Bypassing Detection
Bots can bypass this simple detection by:
- Spoofing User-Agent: Setting a browser-like User-Agent
- Using real browsers: Automation through actual browser instances
- Rotating User-Agents: Changing User-Agent per request
For stronger protection, complement User-Agent matching with:
- WAF rules (OWASP CRS has bot detection rules)
- Rate limiting
- JavaScript challenges
- CAPTCHA systems
Combining with WAF
The bot detection runs before WAF processing. For comprehensive protection, pair it with WAF rules such as the OWASP CRS bot-detection rule set.

Performance Impact
- Minimal CPU overhead (simple string matching)
- O(n) where n = number of bot patterns
- Executed early in request pipeline
- No database lookups or external calls
Best Practices
- Start with common patterns: Use the default list and expand based on logs
- Monitor legitimate bots: Don’t block monitoring services you use (UptimeRobot, Pingdom, etc.)
- Consider SEO: Blocking search engine crawlers affects search visibility
- Use lowercase: All matching is case-insensitive, but use lowercase for clarity
- Combine protections: Use with rate limiting and WAF for defense in depth
- Log and analyze: Review blocked requests to refine your bot list
- Whitelist important bots: If you need Googlebot, don’t include “googlebot” in PROXY_BOTS
Limitations
- Easily bypassed: Sophisticated bots can spoof User-Agents
- False positives: Some legitimate tools may be blocked
- Substring matching: Pattern “bot” blocks “robot”, “robotics”, etc.
- No pattern complexity: Only simple substring matching (no regex)
- User-Agent only: Doesn’t analyze behavior, IP reputation, or other signals
SEO Considerations
If you want search engines to index your site, keep crawler identifiers such as googlebot and bingbot out of PROXY_BOTS.

When to Use Bot Blocking
Good use cases:
- Private APIs that should only serve real users
- Admin panels and internal tools
- Preventing content scraping
- Reducing load from automated requests
- Blocking known malicious tools
Avoid for:
- Public websites that need SEO
- APIs with legitimate automation clients
- Services that integrate with third-party tools
- When you need detailed bot analytics
