Installation Overview

The IMDb Scraper can be deployed in two ways:
  1. Docker Compose (Recommended) - Fully automated, production-ready deployment
  2. Manual Python Setup - For local development and debugging
Docker Compose is the recommended approach as it handles all dependencies, networking, and service orchestration automatically.

System Requirements

Minimum Requirements

  • OS: Linux (Ubuntu 20.04+), macOS (10.15+), or Windows 10/11 with WSL2
  • RAM: 2GB minimum, 4GB recommended
  • Disk Space: 5GB free (for Docker images and data)
  • CPU: 2 cores minimum, 4+ recommended for optimal concurrency
  • Network: Stable internet connection (~100 MB transferred for a full Top 250 run)

Software Dependencies

  • Docker Engine 20.10+
  • Docker Compose 1.29+ (or Docker Compose V2)
  • Git
Step 1: Install Docker

# Update package index
sudo apt update

# Install prerequisites
sudo apt install -y ca-certificates curl gnupg lsb-release

# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Set up repository
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io \
  docker-compose-plugin
Step 2: Verify Docker Installation

# Check Docker version
docker --version

# Check Docker Compose version
docker compose version

# Test Docker with hello-world
docker run hello-world
Expected output:
Docker version 24.0.x, build xxxxx
Docker Compose version v2.x.x
Hello from Docker!
Step 3: Clone the Repository

# Clone the repository
git clone https://github.com/frankdevg/imdb-scraper.git

# Navigate to project directory
cd imdb-scraper

# Verify project structure
ls -la
You should see:
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── .env.example
├── application/
├── domain/
├── infrastructure/
├── presentation/
├── shared/
├── sql/
└── data/
Step 4: Configure Environment

Create your .env file from the example template:
cp .env.example .env
nano .env  # or use your preferred editor
Configure the following variables:
.env
# Database Configuration
POSTGRES_DB=imdb_scraper
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_secure_password
POSTGRES_PORT=5432
POSTGRES_HOST=postgres

# Proxy Configuration (Optional - DataImpulse)
PROXY_HOST=gw.dataimpulse.com
PROXY_PORT=823
PROXY_USER=your_proxy_username
PROXY_PASS=your_proxy_password

# VPN Configuration (Optional - ProtonVPN)
VPN_PROVIDER=protonvpn
VPN_USERNAME=your_vpn_username
VPN_PASSWORD=your_vpn_password
VPN_COUNTRY=Argentina
If you don’t have proxy or VPN credentials, the scraper will automatically fall back to the TOR network for IP rotation.
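The fallback order can be sketched in Python: use the PROXY_* variables when they are set, otherwise route traffic through TOR's local SOCKS port. This helper is illustrative, not the project's actual code; only the variable names mirror the .env file above.

```python
import os

def build_proxies(env=os.environ):
    """Return a requests-style proxies dict.

    Uses PROXY_* variables when present; otherwise falls back to the
    local TOR SOCKS proxy, mirroring the fallback described above.
    (The helper name and exact URL format are illustrative.)
    """
    host, port = env.get("PROXY_HOST"), env.get("PROXY_PORT")
    if host and port:
        user, password = env.get("PROXY_USER", ""), env.get("PROXY_PASS", "")
        auth = f"{user}:{password}@" if user else ""
        url = f"http://{auth}{host}:{port}"
        return {"http": url, "https": url}
    # No proxy credentials: route through TOR's default SOCKS port.
    # socks5h:// resolves DNS through the proxy as well.
    tor = "socks5h://localhost:9050"
    return {"http": tor, "https": tor}
```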
Step 5: Build and Deploy

# Build the Docker images
docker compose build --no-cache

# Start all services
docker compose up -d

# View logs
docker compose logs -f scraper
The scraper will automatically:
  • Initialize PostgreSQL with schema
  • Connect to TOR/VPN/proxies
  • Scrape IMDb Top 250
  • Save data to CSV and database
Step 6: Verify Installation

Check running containers:
docker compose ps
Expected output:
NAME                COMMAND                  SERVICE   STATUS
imdb_postgres       "docker-entrypoint.s…"   postgres  Up
imdb_scraper        "/bin/sh -c 'echo 'E…"   scraper   Up
tor_proxy           "tor --SocksPort 0.0…"   tor       Up
vpn                 "/gluetun-entrypoint…"   vpn       Up
Verify data extraction:
# Check CSV files
ls -lh data/

# Check database records
docker exec -it imdb_postgres psql -U your_username -d imdb_scraper \
  -c "SELECT title, year, rating FROM movies LIMIT 5;"

Manual Python Installation

For local development or environments without Docker:
Step 1: Install Python and PostgreSQL

# Install Python 3 and pip
sudo apt update
sudo apt install -y python3 python3-pip python3-venv

# Install PostgreSQL
sudo apt install -y postgresql postgresql-contrib

# Start PostgreSQL service
sudo systemctl start postgresql
sudo systemctl enable postgresql
Step 2: Set Up Database

# Connect to PostgreSQL
sudo -u postgres psql

# Create database and user
CREATE DATABASE imdb_scraper;
CREATE USER your_username WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE imdb_scraper TO your_username;
\q
Initialize schema:
psql -U your_username -d imdb_scraper -f sql/01_schema.sql
psql -U your_username -d imdb_scraper -f sql/02_procedures.sql
psql -U your_username -d imdb_scraper -f sql/03_views.sql
Step 3: Create Virtual Environment

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate  # Windows

# Upgrade pip
pip install --upgrade pip
Step 4: Install Python Dependencies

# Install from requirements.txt
pip install -r requirements.txt
Dependencies installed:
requests          # HTTP client
bs4               # HTML parsing
psycopg2-binary   # PostgreSQL adapter
python-dotenv     # Environment variable management
stem              # TOR control library
requests[socks]   # SOCKS proxy support
Step 5: Configure Environment Variables

Create .env file in project root:
.env
POSTGRES_DB=imdb_scraper
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=localhost
POSTGRES_PORT=5432

# Optional: Configure proxies/VPN if available
PROXY_HOST=
PROXY_PORT=
PROXY_USER=
PROXY_PASS=
Step 6: Run the Scraper

# Execute the scraper
python presentation/cli/run_scraper.py
Monitor the output (the scraper logs in Spanish):
INFO - Inicializando contenedor de dependencias...
INFO - Construyendo scraper...
INFO - Iniciando proceso de scraping...
INFO - [HTML] IDs obtenidos: 250
INFO - Tráfico total usado: 15.42 MB
INFO - Proceso de scraping finalizado exitosamente.
In English: initializing the dependency container, building the scraper, starting the scraping process, 250 IDs obtained, 15.42 MB of total traffic used, process finished successfully.

Optional: TOR Installation (Manual Setup)

For IP rotation without Docker:
sudo apt install -y tor
sudo systemctl start tor
sudo systemctl enable tor
Configure TOR in shared/config/config.py:
TOR_HOST = "localhost"
TOR_PROXY_PORT = 9050
TOR_CONTROL_PORT = 9051
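The control port is what allows the scraper to request a fresh circuit (and thus a new exit IP). The project uses the stem library for this; the raw-protocol sketch below only shows what happens underneath, and the `rotate_ip` helper name is hypothetical:

```python
import socket

TOR_HOST = "localhost"
TOR_CONTROL_PORT = 9051

def newnym_commands(password=""):
    """Control-port commands that request a fresh TOR circuit."""
    return [f'AUTHENTICATE "{password}"', "SIGNAL NEWNYM", "QUIT"]

def rotate_ip(password="", host=TOR_HOST, port=TOR_CONTROL_PORT):
    """Send the commands above over the control port (requires TOR running
    with ControlPort enabled; illustrative, not the project's stem-based code)."""
    with socket.create_connection((host, port), timeout=5) as s:
        for cmd in newnym_commands(password):
            s.sendall((cmd + "\r\n").encode())
            reply = s.recv(1024).decode()
            if not reply.startswith("250"):  # 250 = success in the control protocol
                raise RuntimeError(f"TOR control error: {reply.strip()}")
```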

Verification Steps

After installation, verify all components:
Step 1: Test Database Connection

# Docker installation
docker exec -it imdb_postgres psql -U your_username -d imdb_scraper -c "SELECT version();"

# Manual installation
psql -U your_username -d imdb_scraper -c "SELECT version();"
Step 2: Test TOR Connection (Optional)

# Check TOR SOCKS proxy
curl --socks5 localhost:9050 https://check.torproject.org/api/ip
Expected response:
{"IsTor": true, "IP": "xxx.xxx.xxx.xxx"}
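To check the response programmatically rather than by eye, the body can be parsed with the stdlib json module (the helper name is illustrative):

```python
import json

def is_tor_exit(body: str) -> bool:
    """Return True when a check.torproject.org response reports a TOR exit."""
    return bool(json.loads(body).get("IsTor"))
```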
Step 3: Validate Data Output

# Check CSV files exist
ls -lh data/*.csv

# Verify database records
psql -U your_username -d imdb_scraper -c \
  "SELECT COUNT(*) FROM movies;"

Troubleshooting

Error: Cannot connect to the Docker daemon
Solution:
# Linux
sudo systemctl start docker

# macOS
open /Applications/Docker.app

# Windows
# Start Docker Desktop from Start Menu
Error: Got permission denied while trying to connect to the Docker daemon socket
Solution:
# Add user to docker group
sudo usermod -aG docker $USER

# Log out and back in, or run:
newgrp docker
Error: psql: error: connection to server at "localhost" port 5432 failed
Solution:
# Check if PostgreSQL is running
sudo systemctl status postgresql

# Start if not running
sudo systemctl start postgresql

# Check port binding
sudo netstat -plnt | grep 5432
Error: ModuleNotFoundError: No module named 'bs4'
Solution:
# Ensure virtual environment is activated
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt
Error: SOCKSHTTPConnectionPool: Max retries exceeded
Solution:
# Restart TOR service
sudo systemctl restart tor

# Verify TOR is listening
sudo netstat -plnt | grep 9050

# Test TOR connectivity
curl --socks5 localhost:9050 https://check.torproject.org/
Error: CSV files exist but have no data
Solution:
  • Check scraper logs for validation errors
  • IMDb may have changed HTML structure (update selectors in config.py)
  • Network issues prevented data retrieval (check proxy/VPN settings)
# View detailed logs
tail -f logs/scraper.log
Error: MemoryError or container crashes
Solution:
  • Reduce MAX_THREADS in shared/config/config.py (default: 50)
  • Increase Docker memory limit in Docker Desktop settings
  • Process movies in smaller batches by reducing NUM_MOVIES

Performance Tuning

Optimize scraper performance based on your system:
# shared/config/config.py

# Adjust concurrent threads (default: 50)
MAX_THREADS = 30  # Lower for limited resources

# Modify retry behavior
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]  # seconds

# Request timeout
REQUEST_TIMEOUT = 10  # seconds

# Scrape subset for testing
NUM_MOVIES = 50  # Default: 250
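How these knobs interact can be sketched with a bounded thread pool plus per-request retries. This is an illustration of the tuning parameters, not the scraper's actual implementation; `fetch_with_retries` and `scrape_all` are hypothetical names:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 30       # upper bound on concurrent requests
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]  # seconds to wait before each retry

def fetch_with_retries(fetch, movie_id):
    """Call fetch(movie_id), sleeping through RETRY_DELAYS between failures."""
    for attempt in range(MAX_RETRIES):
        try:
            return fetch(movie_id)
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(RETRY_DELAYS[attempt])

def scrape_all(fetch, movie_ids):
    """Fan the fetches out over a pool capped at MAX_THREADS workers."""
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return list(pool.map(lambda m: fetch_with_retries(fetch, m), movie_ids))
```

Lowering MAX_THREADS reduces memory pressure and the chance of rate limiting; lowering NUM_MOVIES shortens test runs.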

Next Steps

Environment Variables

Customize scraping behavior, selectors, and network settings

Architecture Overview

Understand Clean Architecture and DDD implementation

API Reference

Explore domain models, interfaces, and use cases

SQL Analytics

Run advanced analytical queries on scraped data

For issues not covered here, check the GitHub Issues or contact [email protected]
