The IMDb Scraper is fully containerized using Docker Compose, enabling one-command deployment with all dependencies included.

Architecture Overview

The Docker environment consists of four main services:
  • postgres: PostgreSQL 15 database for structured data storage
  • tor: Tor proxy for IP rotation and anonymity
  • vpn: Gluetun VPN container for geolocation changes
  • scraper: The main application container

Docker Compose Configuration

The complete orchestration is defined in docker-compose.yml:
version: "3.9"

services:
  postgres:
    image: postgres:15
    container_name: imdb_postgres
    restart: always
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${POSTGRES_PORT}:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./sql:/docker-entrypoint-initdb.d
    networks:
      - app_net

  tor:
    image: dperson/torproxy
    container_name: tor_proxy
    restart: always
    ports:
      - "9050:9050" # SOCKS port for traffic
      - "9051:9051" # Control port for commands (IP rotation)
    command: >
      sh -c "tor --SocksPort 0.0.0.0:9050 --ControlPort 0.0.0.0:9051 --HashedControlPassword '' --CookieAuthentication 0"
    networks:
      - app_net

  vpn:
    image: qmcgaw/gluetun
    container_name: vpn
    cap_add:
      - NET_ADMIN
    environment:
      - VPN_SERVICE_PROVIDER=protonvpn
      - OPENVPN_USER=${VPN_USERNAME}
      - OPENVPN_PASSWORD=${VPN_PASSWORD}
      - SERVER_COUNTRIES=Argentina
    ports:
      - "8888:8888"
    networks:
      - vpn_net

  scraper:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: imdb_scraper
    depends_on:
      - postgres
      - tor
      - vpn
    networks:
      - app_net
      - vpn_net
    env_file:
      - .env
    volumes:
      - .:/app
    command: >
      sh -c "
        echo 'Waiting for Postgres to be ready...' &&
        while ! nc -z postgres 5432; do sleep 1; done &&
        echo 'Postgres ready. Starting scraper...' &&

        echo 'Waiting for Tor to be ready...' &&
        while ! nc -z tor 9050; do sleep 1; done &&
        echo 'SOCKS port ready. Tor fully initialized.' &&

        python presentation/cli/run_scraper.py &&
        echo 'Scraper finished. Running queries.sql...' &&
        PGPASSWORD=$POSTGRES_PASSWORD psql -h postgres -U $POSTGRES_USER -d $POSTGRES_DB -f sql/queries.sql &&
        echo 'SQL queries executed. Keeping container active...' &&
        tail -f /dev/null
      "

volumes:
  pgdata:

networks:
  app_net:
    driver: bridge
  vpn_net:
    driver: bridge
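
The tor service publishes a control port (9051) for IP rotation. As a rough sketch of what a rotation request could look like from Python, the snippet below speaks the Tor control protocol directly over a socket: it authenticates, then sends SIGNAL NEWNYM. The function name request_new_identity is hypothetical and not taken from the project, and the empty password is an assumption based on the --HashedControlPassword '' flag in the compose file; the scraper's actual rotation code may differ.

```python
import socket


def request_new_identity(host: str = "tor", port: int = 9051, password: str = "") -> bool:
    """Ask Tor for fresh circuits; True if Tor answered 250 to both commands.

    Hypothetical helper. Assumes the control port accepts an empty password
    and that the password contains no double quotes.
    """
    commands = (f'AUTHENTICATE "{password}"', "SIGNAL NEWNYM")
    with socket.create_connection((host, port), timeout=10) as conn, \
            conn.makefile("rb") as reader:
        for command in commands:
            # Control-protocol lines are terminated by CRLF.
            conn.sendall(command.encode("ascii") + b"\r\n")
            reply = reader.readline()
            # Success replies start with status code 250 (e.g. "250 OK").
            if not reply.startswith(b"250"):
                return False
        conn.sendall(b"QUIT\r\n")
    return True
```

Inside the Compose network the host would be tor; from the host machine it would be localhost. Tor rate-limits NEWNYM requests, so a successful reply does not guarantee a different exit IP immediately.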

Dockerfile Breakdown

The scraper container is built from this Dockerfile:
FROM python:3.11-slim

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

# Install system dependencies including postgresql-client
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    tor \
    curl \
    netcat-openbsd \
    gnupg \
    ca-certificates \
    postgresql-client \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project
COPY . .

# Default command
CMD ["python", "presentation/cli/run_scraper.py"]

Key Components

The base image is python:3.11-slim, a lightweight Python environment with a minimal footprint.
The system packages installed on top of it include:
  • gcc: Required for compiling Python packages with C extensions
  • libpq-dev: PostgreSQL development libraries for psycopg2
  • tor: Tor network client
  • netcat-openbsd: Network utility used for the port readiness checks at startup
  • postgresql-client: CLI tools (such as psql) for database operations
Python dependencies are installed from requirements.txt with --no-cache-dir, which skips pip's download cache to reduce image size.

Service Dependencies

The scraper container has explicit dependencies:
depends_on:
  - postgres
  - tor
  - vpn
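In this form, depends_on only controls start order; it does not wait for a service to accept connections. The startup command therefore adds its own readiness checks:
  • Waits for PostgreSQL on port 5432
  • Waits for the Tor SOCKS proxy on port 9050
  • Only starts scraping when both ports accept connections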
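
The nc -z loops in the startup command can also be expressed in Python, which adds a timeout so the container does not wait forever. This is a minimal sketch, not the project's code: wait_for_port is a hypothetical helper, and the 60-second default is an arbitrary assumption.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Same idea as `nc -z host port`: open a connection, then close it.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage with the service names and ports from docker-compose.yml:
#   for name, port in (("postgres", 5432), ("tor", 9050)):
#       if not wait_for_port(name, port):
#           raise SystemExit(f"{name}:{port} did not become reachable")
```

Like the nc loop, this only proves that the port accepts connections; it does not confirm that PostgreSQL has finished initializing or that Tor has built a circuit.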

Volume Mounts

Database Persistence

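pgdata:/var/lib/postgresql/data stores PostgreSQL data in a named volume, so it survives container restarts and removal. The data is lost only if the volume itself is deleted, for example with docker-compose down -v.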

SQL Initialization

./sql:/docker-entrypoint-initdb.d runs the SQL scripts in ./sql when the database is initialized for the first time, that is, when the data directory is still empty.

Application Code

.:/app mounts the entire project into the container for development, so code changes on the host are visible without rebuilding (can be removed in production).

Build and Run

Initial Build

docker-compose build --no-cache
This builds the scraper image from scratch, ensuring all dependencies are fresh.

Start All Services

docker-compose up
Or run in detached mode:
docker-compose up -d

View Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f scraper

Stop Services

docker-compose down

Rebuild After Changes

docker-compose down
docker-compose build --no-cache
docker-compose up

Port Mappings

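| Service  | Internal Port | External Port    | Purpose                    |
|----------|---------------|------------------|----------------------------|
| postgres | 5432          | ${POSTGRES_PORT} | Database connections       |
| tor      | 9050          | 9050             | SOCKS proxy traffic        |
| tor      | 9051          | 9051             | Control port (IP rotation) |
| vpn      | 8888          | 8888             | VPN HTTP proxy             |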
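
To illustrate how these ports might be used from inside the scraper container, the sketch below builds a requests-style proxies mapping for each route. The function name proxy_settings and the route labels are hypothetical, and the vpn entry assumes Gluetun's HTTP proxy has been enabled, which the compose file shown here does not configure explicitly.

```python
def proxy_settings(route: str) -> dict[str, str]:
    """Return a requests-style proxies mapping for "tor", "vpn", or "direct"."""
    urls = {
        # socks5h lets the proxy resolve hostnames, so DNS lookups also go through Tor.
        "tor": "socks5h://tor:9050",
        # Assumption: Gluetun's HTTP proxy is enabled and listening on 8888.
        "vpn": "http://vpn:8888",
    }
    if route == "direct":
        return {}
    url = urls[route]  # raises KeyError for an unknown route
    return {"http": url, "https": url}
```

With the requests library and its SOCKS extra installed, a call could look like requests.get(url, proxies=proxy_settings("tor")). From the host machine, replace the tor and vpn hostnames with localhost.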

Accessing Services

1. PostgreSQL Database

Connect from the host machine, substituting the port, user, and database name from your .env file:
psql -h localhost -p 5432 -U aruiz -d imdb_scraper

2. Scraper Logs

Real-time logs are available in logs/scraper.log or via Docker:
docker logs -f imdb_scraper

3. Generated Data

CSV files appear in the data/ directory:
  • movies.csv
  • actors.csv
  • movie_actor.csv

Troubleshooting

If the scraper exits immediately, check that:
  • The .env file exists with all required variables
  • The PostgreSQL container is running: docker ps
  • The Tor proxy started without errors: docker logs tor_proxy
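
The first check can be automated. The following sketch reports which variables referenced in docker-compose.yml are missing or empty in .env. The list covers only the names that appear in the compose file; the application may require others, and missing_env_vars is a hypothetical helper rather than part of the project.

```python
from pathlib import Path

# Variables interpolated in docker-compose.yml; the scraper itself may need more.
REQUIRED_VARS = (
    "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_PORT",
    "VPN_USERNAME", "VPN_PASSWORD",
)


def missing_env_vars(env_path: str = ".env") -> list[str]:
    """Return the required variable names that are absent or empty in the .env file."""
    path = Path(env_path)
    if not path.is_file():
        return list(REQUIRED_VARS)
    defined = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        defined[key.strip()] = value.strip().strip("'\"")
    return [name for name in REQUIRED_VARS if not defined.get(name)]
```

Running missing_env_vars() from the project root before docker-compose up returns an empty list when every variable is set.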

Common Issues

Database Connection Failed

# Check if postgres is running
docker ps | grep postgres

# View postgres logs
docker logs imdb_postgres

Tor Not Ready

# Verify Tor is listening
docker exec tor_proxy netstat -tuln | grep 9050

VPN Connection Issues

# Check VPN status
docker logs vpn

Production Considerations

Remove Volume Mount

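Replace the - .:/app bind mount with mounts for only the files the container needs at runtime, or remove it and rely on the code that the Dockerfile already copies into the image.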

Use Secrets

Replace the .env file with Docker secrets or an external secret management system.

Health Checks

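Add explicit healthcheck directives to docker-compose.yml so that depends_on can wait on condition: service_healthy instead of relying on the nc loops in the startup command.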

Resource Limits

Set memory and CPU limits for each service

Next Steps

Environment Variables

Configure database, proxy, and VPN credentials

Network Configuration

Understand Docker networks and proxy setup
