Network Architecture
The system uses two separate Docker networks to isolate concerns:
app_net
Purpose: Internal application communication
Connected services: postgres, tor, scraper
vpn_net
Purpose: VPN-routed traffic
Connected services: vpn, scraper
Why Two Networks?
The scraper connects to both networks, allowing it to:
- Access database and TOR via app_net
- Route traffic through VPN via vpn_net
- Choose routing strategy dynamically based on availability
Service Network Configuration
PostgreSQL (Database)
- Internal hostname: postgres (DNS provided by Docker)
- Internal port: 5432
- External access: localhost:${POSTGRES_PORT} from host machine
- Accessible by: scraper service via app_net
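The difference between the internal and external addresses can be illustrated with a small helper. This is only a sketch: the function name `postgres_endpoint` and the `inside_docker` flag are hypothetical, while the hostname, port, and `POSTGRES_PORT` variable come from the list above.

```python
import os


def postgres_endpoint(inside_docker: bool, env=None) -> tuple[str, int]:
    """Return the (host, port) pair to use when connecting to PostgreSQL."""
    env = os.environ if env is None else env
    if inside_docker:
        # Containers on app_net resolve the service name through Docker DNS.
        return ("postgres", 5432)
    # From the host machine, use the published port from the environment.
    return ("localhost", int(env["POSTGRES_PORT"]))
```

Inside the scraper container this would yield `("postgres", 5432)`; on the host it depends on whatever `POSTGRES_PORT` is set to.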
TOR Proxy
Port 9050 - SOCKS Proxy
The SOCKS5 proxy port used for routing HTTP/HTTPS traffic through the TOR network.
Usage in scraper:
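A minimal sketch of how the SOCKS port might be handed to an HTTP client. The helper name `tor_proxies` is hypothetical and not taken from the project; the hostname `tor` and port 9050 are the values described in this section.

```python
def tor_proxies(host: str = "tor", port: int = 9050) -> dict[str, str]:
    """Build a proxies mapping that sends HTTP and HTTPS through TOR's SOCKS port."""
    # socks5h (rather than socks5) asks the proxy to resolve hostnames too,
    # so DNS lookups do not happen locally.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```

With the `requests` library and its SOCKS extra installed, the mapping could be passed as `requests.get(url, proxies=tor_proxies(), timeout=30)`.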
Port 9051 - Control Port
Control port for sending commands to TOR, such as requesting a new circuit (IP rotation).
Usage: see infrastructure/network/tor_rotator.py:1 for the implementation.
TOR Configuration Flags
- --SocksPort 0.0.0.0:9050: Listen on all interfaces
- --ControlPort 0.0.0.0:9051: Enable control port
- --HashedControlPassword '': No password required (container isolated)
- --CookieAuthentication 0: Disable cookie auth
VPN Container (Gluetun)
Required Linux capability to modify network routes and create VPN tunnels.
VPN provider integration. Gluetun supports 50+ providers including NordVPN, ExpressVPN, Surfshark.
Preferred server location for geolocation masking.
Proxy Strategy Layers
The scraper uses a cascading proxy strategy with automatic fallback:
Layer 1: Premium Proxy (DataImpulse)
When all proxy environment variables are set, the scraper uses premium residential/datacenter proxies:
- Fast, low latency
- Rotating residential IPs
- High success rate
See infrastructure/network/proxy_provider.py:1
Layer 2: TOR Network (Fallback)
If the premium proxy is unavailable or fails, the scraper falls back to TOR.
Advantages:
- Free and unlimited
- High anonymity
- Automatic IP rotation via SIGNAL NEWNYM
Drawbacks:
- Higher latency
- Slower speeds
- May be blocked by some sites
See infrastructure/network/tor_rotator.py:1
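The choice between Layer 1 and Layer 2 can be sketched as a single function. This is an illustration under assumptions, not the project's code: the environment variable names in `PROXY_ENV_VARS` and the function `choose_proxy_url` are hypothetical, and the sketch only covers the "variables not set" case, not a premium proxy that fails at request time.

```python
import os
from urllib.parse import quote

# Hypothetical variable names; the real ones are defined by the project.
PROXY_ENV_VARS = ("PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS")
TOR_SOCKS_URL = "socks5h://tor:9050"


def choose_proxy_url(env=None) -> str:
    """Return the premium proxy URL if fully configured, else the TOR URL."""
    env = os.environ if env is None else env
    values = [env.get(name) for name in PROXY_ENV_VARS]
    if all(values):
        host, port, user, password = values
        # Percent-encode credentials so characters like '@' stay unambiguous.
        user_q = quote(user, safe="")
        password_q = quote(password, safe="")
        return f"http://{user_q}:{password_q}@{host}:{port}"
    # Any missing or empty variable triggers the TOR fallback.
    return TOR_SOCKS_URL
```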
Layer 3: VPN (Always Active)
All outbound traffic from the scraper passes through the VPN network, regardless of proxy choice:
- Geolocation masking
- ISP-level anonymity
- Additional encryption layer
IP Rotation Mechanism
TOR IP Rotation
See infrastructure/network/tor_rotator.py:1 for the full implementation.
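For orientation, a new circuit can be requested by speaking the TOR control protocol directly over a socket. The sketch below is not the project's `tor_rotator.py`; the function name `request_new_circuit` is hypothetical, and it assumes the empty-password setup described in the configuration flags.

```python
import socket


def request_new_circuit(host: str = "tor", port: int = 9051,
                        password: str = "", timeout: float = 10.0) -> None:
    """Authenticate on the control port, then send SIGNAL NEWNYM."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("r", encoding="ascii") as reader:
            for command in (f'AUTHENTICATE "{password}"', "SIGNAL NEWNYM"):
                # Control protocol lines are terminated with CRLF.
                sock.sendall((command + "\r\n").encode("ascii"))
                reply = reader.readline().strip()
                # A reply starting with 250 means the command was accepted.
                if not reply.startswith("250"):
                    raise RuntimeError(f"{command.split()[0]} failed: {reply!r}")
```

TOR rate-limits NEWNYM signals, so calling this in a tight loop may not yield a new exit IP on every call.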
Premium Proxy Rotation
DataImpulse and similar providers rotate IPs automatically on each request. No manual rotation needed.
IP Validation
The scraper validates its external IP before and after rotation.
Retry Logic with IP Rotation
When a request fails, the scraper:
- Logs the failure with current IP
- Rotates to a new IP (if using TOR)
- Waits with exponential backoff
- Retries the request
See infrastructure/scraper/imdb_scraper.py:1 for the complete implementation.
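The steps above, together with the before/after IP check, can be sketched as one loop. This is an illustration rather than the code in `imdb_scraper.py`: the function `fetch_with_retry` and its parameters are hypothetical, and the callables `fetch`, `rotate_ip`, and `current_ip` are assumed to be supplied by the caller.

```python
import time


def fetch_with_retry(fetch, rotate_ip, current_ip, max_attempts=4,
                     base_delay=1.0, sleep=time.sleep, log=print):
    """Call fetch(); on failure log the IP, rotate, back off, and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            old_ip = current_ip()
            log(f"attempt {attempt + 1} failed on IP {old_ip}: {exc}")
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            rotate_ip()
            # Validate the rotation by comparing the IP before and after.
            if current_ip() == old_ip:
                log("IP unchanged after rotation")
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` and `log` as parameters keeps the loop easy to exercise without real delays or network access.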
Network Troubleshooting
Check Service Connectivity
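One way to check connectivity from inside the scraper container is a plain TCP probe against each service. This is a minimal sketch: the names `SERVICES`, `is_reachable`, and `check_services` are hypothetical, while the hostnames and ports are the ones listed in this document.

```python
import socket

# Hostnames and ports as seen from a container attached to app_net.
SERVICES = {
    "postgres": ("postgres", 5432),
    "tor-socks": ("tor", 9050),
    "tor-control": ("tor", 9051),
}


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_services(services=None) -> dict[str, bool]:
    """Probe every service and report which ones accept connections."""
    services = SERVICES if services is None else services
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

A successful probe only shows that the port is open; it does not confirm that TOR has finished bootstrapping or that PostgreSQL accepts the configured credentials.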
Common Network Issues
TOR SOCKS Port Not Responding
Symptoms: Connection refused on tor:9050
Solution: Wait 10-15 seconds for TOR to bootstrap.
VPN Not Connecting
Symptoms: VPN container exits or fails to connect
Solution: Verify VPN_USERNAME and VPN_PASSWORD in .env.
Scraper Can't Reach Database
Symptoms: psycopg2.OperationalError: could not connect to server
Solution:
IP Not Rotating
Symptoms: Same IP appears in logs after rotation
Solution:
Port Reference Table
| Service | Port | Type | Purpose | External Access |
|---|---|---|---|---|
| postgres | 5432 | TCP | PostgreSQL database | localhost:${POSTGRES_PORT} |
| tor | 9050 | TCP | SOCKS5 proxy | localhost:9050 |
| tor | 9051 | TCP | Control port (rotation) | localhost:9051 |
| vpn | 8888 | TCP | HTTP proxy (via VPN) | localhost:8888 |
Security Considerations
DNS Leaks
Use socks5h:// instead of socks5:// to ensure DNS resolution happens through TOR, not locally.
Kill Switch
Gluetun includes a built-in kill switch. If the VPN drops, the container blocks all traffic.
No Logs
TOR and VPN providers (like ProtonVPN) have no-logs policies. Verify your provider’s policy.
Multi-Hop
For maximum anonymity, traffic flows:
Scraper → VPN → Proxy/TOR → Internet
Performance Optimization
Connection Pooling
Concurrent Requests
The scraper uses ThreadPoolExecutor to scrape multiple movie detail pages in parallel.
See presentation/cli/run_scraper.py:1 for concurrency implementation.
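A minimal sketch of the parallel pattern described above, not the code in `run_scraper.py`. The function `scrape_all` and the `fetch_detail` callable are hypothetical names; `fetch_detail` stands in for whatever function downloads and parses one detail page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch_detail, max_workers=8):
    """Run fetch_detail(url) for every URL in a thread pool.

    Returns (results, errors), both keyed by URL, so that one failing
    page does not abort the whole batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_detail, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Threads suit this workload because each task spends most of its time waiting on the network; `max_workers` would need tuning against proxy limits and target-site rate limits.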
Request Timeout
Next Steps
Docker Setup
Return to container orchestration guide
Environment Variables
Configure credentials and settings