
Overview

Crawlith implements multiple layers of security to prevent Server-Side Request Forgery (SSRF) attacks, protect internal networks, and ensure safe operation in production environments.

SSRF Protection

IP Guard System

The IPGuard class (from ipGuard.ts:9) prevents requests to internal/private IP addresses.
Blocked IPv4 Ranges:
  • 127.0.0.0/8 - Loopback
  • 10.0.0.0/8 - Private network
  • 192.168.0.0/16 - Private network
  • 172.16.0.0/12 - Private network
  • 169.254.0.0/16 - Link-local
  • 0.0.0.0/8 - Unspecified
Blocked IPv6 Ranges:
  • ::1 - Loopback
  • fc00::/7 - Unique Local Address (ULA)
  • fe80::/10 - Link-local
  • IPv4-mapped IPv6 (e.g., ::ffff:10.0.0.1)
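A CIDR range check of this kind can be sketched in plain TypeScript. The names `BLOCKED_V4`, `ipv4ToInt`, and `isInternalV4` below are hypothetical, covering only the IPv4 ranges listed above, and are not the actual IPGuard internals:

```typescript
// Hypothetical IPv4-only sketch of an internal-range check (the real
// IPGuard also handles IPv6 and IPv4-mapped IPv6 addresses).
const BLOCKED_V4: Array<[string, number]> = [
  ['127.0.0.0', 8],    // loopback
  ['10.0.0.0', 8],     // private network
  ['192.168.0.0', 16], // private network
  ['172.16.0.0', 12],  // private network
  ['169.254.0.0', 16], // link-local
  ['0.0.0.0', 8],      // unspecified
]

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  return ip.split('.').reduce((acc, octet) => (acc << 8) | parseInt(octet, 10), 0) >>> 0
}

// True if the address falls inside any blocked CIDR range.
function isInternalV4(ip: string): boolean {
  const addr = ipv4ToInt(ip)
  return BLOCKED_V4.some(([base, prefix]) => {
    const mask = ((~0 << (32 - prefix)) >>> 0)
    return ((addr & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0)
  })
}
```

The mask-and-compare approach is the standard way to test CIDR membership without a dependency.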

Two-Layer IP Validation

From fetcher.ts:88:
// 1. Fast fail for IP literals
if (net.isIP(urlObj.hostname)) {
    if (IPGuard.isInternal(urlObj.hostname)) {
        return this.errorResult('blocked_internal_ip', currentUrl, redirectChain)
    }
}

// 2. DNS lookup validation (prevents TOCTOU attacks)
const secureDispatcher = IPGuard.getSecureDispatcher()
Layer 1: Immediate blocking of IP literals such as http://127.0.0.1
Layer 2: A custom DNS lookup function validates resolved IPs before the connection is made
From ipGuard.ts:93:
static secureLookup(
    hostname: string,
    options: dns.LookupOneOptions,
    callback: (err: NodeJS.ErrnoException | null, address: string, family: number) => void
): void {
    dns.lookup(hostname, options, (err, address, family) => {
        // Propagate DNS resolution errors unchanged
        if (err) return callback(err, address, family)
        // Reject hostnames that resolve to internal/private IPs
        if (IPGuard.isInternal(address)) {
            const blockedError: NodeJS.ErrnoException = new Error(`Blocked internal IP: ${address}`)
            blockedError.code = 'EBLOCKED'
            return callback(blockedError, address, family)
        }
        callback(null, address, family)
    })
}

SSRF Attack Prevention

Blocked:
# Direct IP literals
crawlith crawl http://127.0.0.1:8080
crawlith crawl http://192.168.1.1
crawlith crawl http://[::1]:3000

# DNS rebinding attacks
crawlith crawl http://evil.com  # Resolves to 10.0.0.1
Result:
status: 'blocked_internal_ip'
Allowed:
crawlith crawl https://example.com  # Public IP

Domain Filtering

Whitelist (Allow List)

Restrict crawling to specific domains:
crawlith crawl https://example.com --allow example.com,cdn.example.com
From domainFilter.ts:29:
isAllowed(hostname: string): boolean {
    const normalized = this.normalize(hostname)

    // 1. Deny list match -> Reject
    if (this.denied.has(normalized)) {
        return false
    }

    // 2. Allow list not empty AND no match -> Reject
    if (this.allowed.size > 0 && !this.allowed.has(normalized)) {
        return false
    }

    return true
}
Behavior:
  • If --allow is specified, only listed domains are crawled
  • All other domains return status: blocked_by_domain_filter
  • Useful for restricting crawls to trusted domains
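The allow/deny interaction above can be exercised with a simplified stand-in built on plain Sets. `makeFilter` is a hypothetical helper for illustration, not the real DomainFilter class:

```typescript
// Simplified stand-in for DomainFilter: the deny list always wins, and a
// non-empty allow list acts as a whitelist gate. Not the actual Crawlith class.
function makeFilter(allowed: string[], denied: string[]) {
  const allow = new Set(allowed)
  const deny = new Set(denied)
  return (hostname: string): boolean => {
    const h = hostname.toLowerCase()
    if (deny.has(h)) return false                      // 1. deny list match -> reject
    if (allow.size > 0 && !allow.has(h)) return false  // 2. non-empty allow list, no match -> reject
    return true
  }
}

// With an empty allow list, everything not denied passes through.
const isAllowed = makeFilter(['example.com', 'cdn.example.com'], ['api.example.com'])
```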

Blacklist (Deny List)

Exclude specific domains:
crawlith crawl https://example.com --deny api.example.com,admin.example.com
Use cases:
  • Skip API endpoints
  • Avoid admin panels
  • Exclude third-party tracking domains

Domain Normalization

Domains are normalized before filtering:
private normalize(hostname: string): string {
    let h = hostname.toLowerCase().trim()
    if (h.endsWith('.')) h = h.slice(0, -1)
    return new URL(`http://${h}`).hostname
}
Examples:
  • Example.Com → example.com
  • example.com. → example.com
  • 例え.jp → xn--r8jz45g.jp (Punycode)
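The normalization shown above can be run directly in Node: lowercase, strip the trailing dot, then let the WHATWG URL parser canonicalize the host (including the IDN-to-Punycode conversion):

```typescript
// Same steps as the normalize() method above: lowercase, drop a trailing
// dot, then round-trip through the URL parser for canonicalization.
function normalize(hostname: string): string {
  let h = hostname.toLowerCase().trim()
  if (h.endsWith('.')) h = h.slice(0, -1)
  return new URL(`http://${h}`).hostname
}
```

Node's WHATWG URL implementation applies IDNA, so internationalized hostnames come back in their `xn--` Punycode form.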

Subdomain Policy

Control whether subdomains are included in the crawl scope.

Include Subdomains

crawlith crawl https://example.com --include-subdomains
From subdomainPolicy.ts:17:
isAllowed(hostname: string): boolean {
    const target = hostname.toLowerCase()

    // Exact match always allowed
    if (target === this.rootHost) return true

    if (!this.includeSubdomains) return false

    // Must end with .rootHost
    if (!target.endsWith(`.${this.rootHost}`)) return false
    return true
}
Allowed:
  • example.com
  • www.example.com
  • blog.example.com
  • api.staging.example.com
Blocked:
  • notexample.com
  • exampleXcom
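The dot-boundary logic above is what defeats suffix tricks like notexample.com. A self-contained sketch (`makeSubdomainCheck` is a hypothetical helper, not the real SubdomainPolicy class):

```typescript
// Hypothetical sketch of the subdomain policy: exact host always passes,
// subdomains pass only when enabled, and the ".rootHost" suffix check
// enforces a dot boundary so "notexample.com" never matches "example.com".
function makeSubdomainCheck(rootHost: string, includeSubdomains: boolean) {
  return (hostname: string): boolean => {
    const target = hostname.toLowerCase()
    if (target === rootHost) return true
    if (!includeSubdomains) return false
    return target.endsWith(`.${rootHost}`)
  }
}
```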

Exclude Subdomains (Default)

crawlith crawl https://example.com
Allowed:
  • example.com
Blocked:
  • www.example.com → blocked_subdomain
  • blog.example.com → blocked_subdomain

Subdomain + Whitelist

Combine subdomain policy with explicit whitelist:
crawlith crawl https://example.com \
  --include-subdomains \
  --allow example.com,cdn.example.net
Allowed:
  • example.com
  • www.example.com ✓ (subdomain)
  • cdn.example.net ✓ (explicit whitelist)
Blocked:
  • api.example.net ✗ (not in whitelist)

Scope Manager

The ScopeManager (from scopeManager.ts:13) combines all security policies:
export type EligibilityResult = 
    | 'allowed' 
    | 'blocked_by_domain_filter' 
    | 'blocked_subdomain'

isUrlEligible(url: string): EligibilityResult {
    const hostname = new URL(url).hostname

    // 1. Domain filter check
    if (!this.domainFilter.isAllowed(hostname)) {
        return 'blocked_by_domain_filter'
    }

    // 2. Explicit whitelist bypass
    if (this.explicitAllowed.has(hostname)) {
        return 'allowed'
    }

    // 3. Subdomain policy check
    if (!this.subdomainPolicy.isAllowed(hostname)) {
        return 'blocked_subdomain'
    }

    return 'allowed'
}
Evaluation order:
  1. Denied domains (blacklist) - highest priority
  2. Explicit allowed domains (whitelist)
  3. Subdomain policy
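The three-step evaluation order can be sketched as a single function. This is a simplified stand-in for ScopeManager (the domain-filter step is reduced to a deny set here), and `checkEligibility` is a hypothetical name:

```typescript
// Simplified stand-in for ScopeManager's evaluation order:
// 1) deny list, 2) explicit whitelist bypass, 3) subdomain policy.
type EligibilityResult = 'allowed' | 'blocked_by_domain_filter' | 'blocked_subdomain'

interface ScopeOpts {
  denied: Set<string>
  explicitAllowed: Set<string>
  rootHost: string
  includeSubdomains: boolean
}

function checkEligibility(hostname: string, opts: ScopeOpts): EligibilityResult {
  // 1. Denied domains have highest priority
  if (opts.denied.has(hostname)) return 'blocked_by_domain_filter'
  // 2. Explicitly allowed domains bypass the subdomain policy
  if (opts.explicitAllowed.has(hostname)) return 'allowed'
  // 3. Subdomain policy applies to everything else
  if (hostname !== opts.rootHost) {
    if (!opts.includeSubdomains || !hostname.endsWith(`.${opts.rootHost}`)) {
      return 'blocked_subdomain'
    }
  }
  return 'allowed'
}
```

The explicit-whitelist bypass is what makes the earlier "subdomain + whitelist" combination work: cdn.example.net is off-root but still crawled.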

Redirect Safety

Redirects are validated at each hop. From fetcher.ts:158:
if (status >= 300 && status < 400 && status !== 304) {
    const location = getHeader('location')
    if (location) {
        let targetUrl = new URL(location, currentUrl).toString()

        // Validate redirect target through scope manager
        const redirectError = redirectController.nextHop(targetUrl)
        if (redirectError) {
            return this.errorResult(redirectError, currentUrl, redirectChain)
        }

        // Continue with redirect...
    }
}
Protection against:
  • Redirect loops (detected by RedirectController)
  • Redirect to internal IPs (validated by IPGuard)
  • Redirect to blocked domains (validated by ScopeManager)
  • Redirect limit exceeded (default: 2 hops, max: 11)

Redirect Configuration

crawlith crawl https://example.com --max-redirects 5
Errors:
  • redirect_loop - Circular redirect detected
  • redirect_limit_exceeded - Too many redirect hops
From redirectController.ts:17:
nextHop(url: string): 'redirect_limit_exceeded' | 'redirect_loop' | null {
    const normalized = url // note: the full source normalizes the URL before tracking
    if (this.history.has(normalized)) return 'redirect_loop'
    if (this.currentHops >= this.maxHops) return 'redirect_limit_exceeded'
    this.history.add(normalized)
    this.currentHops++
    return null
}
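The hop/loop tracking above can be demonstrated with a minimal self-contained class. `RedirectTracker` is a sketch for illustration (URL normalization is omitted), not the real RedirectController:

```typescript
// Minimal sketch of redirect tracking: a Set detects revisited URLs
// (loops) and a hop counter enforces the --max-redirects limit.
class RedirectTracker {
  private history = new Set<string>()
  private hops = 0

  constructor(private maxHops: number) {}

  nextHop(url: string): 'redirect_limit_exceeded' | 'redirect_loop' | null {
    if (this.history.has(url)) return 'redirect_loop'
    if (this.hops >= this.maxHops) return 'redirect_limit_exceeded'
    this.history.add(url)
    this.hops++
    return null
  }
}
```

Loop detection is checked before the hop limit, so a circular redirect is reported as redirect_loop even when the limit would also be exceeded.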

Response Size Limiting

Protect against memory exhaustion from large responses:
crawlith crawl https://example.com --max-bytes 2000000
From responseLimiter.ts:4:
static async streamToString(
    stream: Readable,
    maxBytes: number,
    onOversized?: (bytes: number) => void
): Promise<string> {
    const chunks: Buffer[] = []
    let accumulated = 0
    for await (const chunk of stream) {
        accumulated += chunk.length
        if (accumulated > maxBytes) {
            stream.destroy()
            onOversized?.(accumulated)
            throw new Error('Oversized response')
        }
        chunks.push(chunk)
    }
    return Buffer.concat(chunks).toString('utf8')
}
Default: 2000000 bytes (2 MB)
Result for oversized responses:
status: 'oversized'
bytesReceived: 2048576
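The size cap can be demonstrated end to end with an in-memory stream. This is a standalone sketch of the same accumulate-and-abort pattern, not the ResponseLimiter source:

```typescript
import { Readable } from 'node:stream'

// Standalone sketch of the size cap: accumulate chunks and abort the
// stream as soon as the running byte count exceeds maxBytes.
async function streamToString(stream: Readable, maxBytes: number): Promise<string> {
  const chunks: Buffer[] = []
  let accumulated = 0
  for await (const chunk of stream) {
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)
    accumulated += buf.length
    if (accumulated > maxBytes) {
      stream.destroy() // stop reading; don't buffer the rest
      throw new Error(`Oversized response: ${accumulated} bytes`)
    }
    chunks.push(buf)
  }
  return Buffer.concat(chunks).toString('utf8')
}
```

Destroying the stream on overflow matters: it releases the underlying socket instead of continuing to download data that will be discarded.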

Proxy Support

Route requests through a proxy server:
crawlith crawl https://example.com --proxy http://proxy.example.com:8080

# With authentication
crawlith crawl https://example.com --proxy http://user:[email protected]:8080
From proxyAdapter.ts:6:
constructor(proxyUrl?: string) {
    if (proxyUrl) {
        new URL(proxyUrl) // Validate URL
        this.agent = new ProxyAgent(proxyUrl)
    }
}
Error statuses:
  • proxy_connection_failed - Cannot connect to proxy
Note: When using a proxy, SSRF protection still applies to final destinations.

Security Best Practices

Use --allow (whitelist) when:
  • Crawling untrusted user-provided URLs
  • Running in production environments
  • Security is critical
Use --deny (blacklist) when:
  • You control the start URL
  • Need to exclude specific subdomains
  • Flexibility is more important than strict security
Can SSRF protection be disabled?

No. SSRF protection is always active and cannot be disabled. This is a core security feature to prevent attacks on internal networks. If you need to crawl local development servers, use public-facing URLs or deploy Crawlith on the same network segment.

How should untrusted start URLs be handled?

Use explicit whitelisting:
crawlith crawl https://example.com \
  --allow example.com,cdn.example.com,assets.example.net
This ensures only trusted domains are crawled, even if malicious links are discovered.

What happens when a redirect points to a blocked domain?

The redirect is blocked and recorded:
status: 'blocked_by_domain_filter'
redirectChain: [{ url: 'https://example.com/page', status: 301, target: 'https://blocked.com' }]
The source page is recorded with a 301 status, but the target is not fetched.

Security Error Statuses

All security errors are recorded in crawl results:
  • blocked_internal_ip - SSRF protection triggered. Fix: don't crawl internal IPs
  • blocked_by_domain_filter - Failed domain whitelist/blacklist. Fix: update --allow or --deny
  • blocked_subdomain - Subdomain not allowed. Fix: add --include-subdomains
  • proxy_connection_failed - Cannot connect to proxy. Fix: verify proxy URL and credentials
  • redirect_loop - Circular redirect. Fix: check site configuration
  • redirect_limit_exceeded - Too many redirects. Fix: increase --max-redirects or fix the site
  • oversized - Response exceeds --max-bytes. Fix: increase the limit or skip large resources

Monitoring Security Events

Enable debug logging to see security decisions:
crawlith crawl https://example.com \
  --allow example.com \
  --log-level debug
Look for:
  • Blocked URLs with reasons
  • Redirect chain validation
  • DNS resolution and IP checks

Technical Details

Source Files

  • plugins/core/src/core/security/ipGuard.ts - SSRF protection
  • plugins/core/src/core/scope/domainFilter.ts - Whitelist/blacklist
  • plugins/core/src/core/scope/subdomainPolicy.ts - Subdomain control
  • plugins/core/src/core/scope/scopeManager.ts - Unified scope validation
  • plugins/core/src/core/network/redirectController.ts - Redirect safety
  • plugins/core/src/core/network/responseLimiter.ts - Response size limiting
  • plugins/core/src/core/network/proxyAdapter.ts - Proxy support
